Crawler4j authentication not working
I am trying to use [...]
Here is my controller, including the login URL:
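```java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.authentication.AuthInfo;
import edu.uci.ics.crawler4j.crawler.authentication.FormAuthInfo;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/data/";
        int numberOfCrawlers = 1;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setMaxDepthOfCrawling(0); // fetch only the seed page

        // Names of the login form fields, and the credentials to post in them.
        String formUsername = "session_key";
        String formPassword = "session_password";
        String session_user = "[email protected]";
        String session_password = "myPasswordHere";
        String urlLogin = "https://www.linkedin.com/uas/login";

        // FormAuthInfo expects (username, password, loginUrl, usernameField, passwordField);
        // the original call passed the password first.
        AuthInfo formAuthInfo = new FormAuthInfo(session_user, session_password,
                urlLogin, formUsername, formPassword);
        // Registered before the PageFetcher is built, since the fetcher performs
        // the login when it is constructed.
        config.addAuthInfo(formAuthInfo);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("https://www.linkedin.com/vsearch/f?keywords=java");
        controller.start(Crawler.class, numberOfCrawlers); // blocks until the crawl ends
        controller.shutdown();
    }
}
```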
Here is my Crawler class:
```java
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class Crawler extends WebCrawler {

    // Skip static resources such as stylesheets, scripts, images and archives.
    private static final Pattern FILTERS =
            Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|zip|gz))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Stay on linkedin.com and ignore the filtered file types.
        return !FILTERS.matcher(href).matches()
                && href.startsWith("https://www.linkedin.com");
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            System.out.println(html);
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}
```
When I run this application with authentication enabled, I get the following errors:
```
ADVERTÊNCIA: Cookie rejected [JSESSIONID="ajax:3637761943332982524", version:1, domain:.www.linkedin.com, path:/, expiry:null] Illegal domain attribute ".www.linkedin.com". Domain of origin: "www.linkedin.com"
jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
ADVERTÊNCIA: Cookie rejected [lang="v=2&lang=en-us", version:1, domain:linkedin.com, path:/, expiry:null] Domain attribute "linkedin.com" violates RFC 2109: domain must start with a dot
jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
ADVERTÊNCIA: Invalid cookie header: "Set-Cookie: lidc="b=TGST09:g=87:u=1:i=1466603959:t=1466690359:s=AQEc3R_6kIhooZN1RsDNkO2DaYEqzUWp"; Expires=Thu, 23 Jun 2016 13:59:19 GMT; domain=.linkedin.com; Path=/". Invalid 'expires' attribute: Thu, 23 Jun 2016 13:59:19 GMT
jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
ADVERTÊNCIA: Cookie rejected [JSESSIONID="ajax:4912042947175739413", version:1, domain:.www.linkedin.com, path:/, expiry:null] Illegal domain attribute ".www.linkedin.com". Domain of origin: "www.linkedin.com"
jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
ADVERTÊNCIA: Cookie rejected [lang="v=2&lang=en-us", version:1, domain:linkedin.com, path:/, expiry:null] Domain attribute "linkedin.com" violates RFC 2109: domain must start with a dot
jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
ADVERTÊNCIA: Invalid cookie header: "Set-Cookie: lidc="b=TGST09:g=87:u=1:i=1466603960:t=1466690360:s=AQE100NLG_uPIcJSJ7GLtRVkH7j_Ylu9"; Expires=Thu, 23 Jun 2016 13:59:20 GMT; domain=.linkedin.com; Path=/". Invalid 'expires' attribute: Thu, 23 Jun 2016 13:59:20 GMT
jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies
ADVERTÊNCIA: Invalid cookie header: "Set-Cookie: lidc="b=TGST09:g=87:u=1:i=1466603960:t=1466690360:s=AQE100NLG_uPIcJSJ7GLtRVkH7j_Ylu9"; Expires=Thu, 23 Jun 2016 13:59:20 GMT; domain=.linkedin.com; Path=/". Invalid 'expires' attribute: Thu, 23 Jun 2016 13:59:20 GMT
```
Is this related to the way my HTTP client handles the cookies returned by LinkedIn?
Any suggestions?
Thanks!
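On the cookie question: the log entries are warnings (ADVERTÊNCIA is Portuguese for "warning") from Apache HttpClient's `ResponseProcessCookies`, and the first two quote RFC 2109 domain rules. Below is a minimal sketch of those two rules, assuming the client applies strict RFC 2109 matching. `Rfc2109DomainCheck` and `rejectReason` are hypothetical names used only for illustration, not part of HttpClient or crawler4j, and the "Invalid 'expires' attribute" warning (a date-format issue) is not modelled.

```java
import java.util.Locale;

/** Illustrative only: a simplified version of the RFC 2109 Domain attribute checks. */
final class Rfc2109DomainCheck {

    /** Returns null if the cookie domain would be accepted, otherwise a short reason. */
    static String rejectReason(String host, String domainAttr) {
        String h = host.toLowerCase(Locale.ROOT);
        String d = domainAttr.toLowerCase(Locale.ROOT);
        if (h.equals(d)) {
            return null; // exact match with the origin host is fine
        }
        if (!d.startsWith(".")) {
            return "domain must start with a dot";
        }
        if (!h.endsWith(d)) {
            return "illegal domain attribute for origin " + host;
        }
        // The part of the host in front of the domain must not contain a dot.
        String prefix = h.substring(0, h.length() - d.length());
        if (prefix.contains(".")) {
            return "host prefix contains a dot";
        }
        return null;
    }

    public static void main(String[] args) {
        String host = "www.linkedin.com";
        // Rejected: the host does not end with ".www.linkedin.com".
        System.out.println(rejectReason(host, ".www.linkedin.com"));
        // Rejected: no leading dot.
        System.out.println(rejectReason(host, "linkedin.com"));
        // Accepted: prints "null".
        System.out.println(rejectReason(host, ".linkedin.com"));
    }
}
```

Under these assumptions, both rejected cookies in the log fail because of the format of their Domain attribute rather than anything specific to the crawler code.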
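First of all: this is not [...]
However, your approach will not work because [...]
If you look at the robots.txt, you will see that this crawler is not allowed to crawl anything.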
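To see how a robots.txt file leads to that conclusion, here is a minimal sketch of a robots.txt check in plain Java. It is a simplification, not crawler4j's `RobotstxtServer`: it only understands `User-agent` and `Disallow` lines, treats each `User-agent` line as its own group, and does simple prefix matching. The sample file contents and the agent name `crawler4j` are assumptions for illustration, not a copy of LinkedIn's actual robots.txt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Illustrative only: a tiny robots.txt check (User-agent and Disallow lines). */
final class RobotsCheck {

    /** Returns true if the agent may fetch the path according to the given robots.txt text. */
    static boolean isAllowed(String robotsTxt, String agent, String path) {
        List<String> forAgent = new ArrayList<>(); // rules from the agent's own group
        List<String> forAll = new ArrayList<>();   // rules from the "*" group
        boolean agentGroupFound = false;
        List<String> current = null;               // group the next Disallow line belongs to

        for (String raw : robotsTxt.split("\\R")) {
            int hash = raw.indexOf('#');           // strip trailing comments
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (key.equals("user-agent")) {
                if (value.equalsIgnoreCase(agent)) {
                    current = forAgent;
                    agentGroupFound = true;
                } else if (value.equals("*")) {
                    current = forAll;
                } else {
                    current = null;                // group for some other crawler
                }
            } else if (key.equals("disallow") && current != null && !value.isEmpty()) {
                current.add(value);                // an empty Disallow means "allow everything"
            }
        }

        // A group naming the agent takes precedence over the "*" group.
        List<String> rules = agentGroupFound ? forAgent : forAll;
        for (String prefix : rules) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical robots.txt: one named crawler is allowed, everyone else is blocked.
        String robots = String.join("\n",
                "User-agent: Googlebot",
                "Disallow: /private/",
                "",
                "User-agent: *",
                "Disallow: /");
        System.out.println(isAllowed(robots, "crawler4j", "/vsearch/f")); // false
        System.out.println(isAllowed(robots, "Googlebot", "/vsearch/f")); // true
    }
}
```

With a file shaped like this sample, any crawler that is not named explicitly falls into the `*` group and is blocked from every path, so a crawler that honours robots.txt would skip the seed URL whether or not the login succeeds.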