Crawler4j authentication not working

I am trying to use crawler4j's FormAuthInfo authentication to crawl a specific LinkedIn page. This page only renders after I have logged in successfully.

Here is my controller, which accesses the URL:

public class Controller {

    public static void main(String[] args) throws Exception {

        String crawlStorageFolder = "/data/";
        int numberOfCrawlers = 1;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        String formUsername = "session_key";
        String formPassword = "session_password";
        String session_user = "email@email.com";
        String session_password = "myPasswordHere";
        String urlLogin = "https://www.linkedin.com/uas/login";
        AuthInfo formAuthInfo = new FormAuthInfo(session_password, session_user, urlLogin, formUsername, formPassword);

        config.addAuthInfo(formAuthInfo);
        config.setMaxDepthOfCrawling(0);

        controller.addSeed("https://www.linkedin.com/vsearch/f?keywords=java");

        controller.start(Crawler.class, numberOfCrawlers);
        controller.shutdown();
    }
}

Here is my Crawler class:

public class Crawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg" + "|png|mp3|mp4|zip|gz))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches() && href.startsWith("https://www.linkedin.com");
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            System.out.println(html);
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
    }
}

When I run this application with authentication enabled, I get the following warnings:

ADVERTÊNCIA: Cookie rejected [JSESSIONID="ajax:3637761943332982524", version:1, domain:.www.linkedin.com, path:/, expiry:null] Illegal domain attribute ".www.linkedin.com". Domain of origin: "www.linkedin.com"
jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies

ADVERTÊNCIA: Cookie rejected [lang="v=2&lang=en-us", version:1, domain:linkedin.com, path:/, expiry:null] Domain attribute "linkedin.com" violates RFC 2109: domain must start with a dot
jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies

ADVERTÊNCIA: Invalid cookie header: "Set-Cookie: lidc="b=TGST09:g=87:u=1:i=1466603959:t=1466690359:s=AQEc3R_6kIhooZN1RsDNkO2DaYEqzUWp"; Expires=Thu, 23 Jun 2016 13:59:19 GMT; domain=.linkedin.com; Path=/". Invalid 'expires' attribute: Thu, 23 Jun 2016 13:59:19 GMT
jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies

ADVERTÊNCIA: Cookie rejected [JSESSIONID="ajax:4912042947175739413", version:1, domain:.www.linkedin.com, path:/, expiry:null] Illegal domain attribute ".www.linkedin.com". Domain of origin: "www.linkedin.com"
jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies

ADVERTÊNCIA: Cookie rejected [lang="v=2&lang=en-us", version:1, domain:linkedin.com, path:/, expiry:null] Domain attribute "linkedin.com" violates RFC 2109: domain must start with a dot
jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies

ADVERTÊNCIA: Invalid cookie header: "Set-Cookie: lidc="b=TGST09:g=87:u=1:i=1466603960:t=1466690360:s=AQE100NLG_uPIcJSJ7GLtRVkH7j_Ylu9"; Expires=Thu, 23 Jun 2016 13:59:20 GMT; domain=.linkedin.com; Path=/". Invalid 'expires' attribute: Thu, 23 Jun 2016 13:59:20 GMT
jun 22, 2016 10:59:14 AM org.apache.http.client.protocol.ResponseProcessCookies processCookies

ADVERTÊNCIA: Invalid cookie header: "Set-Cookie: lidc="b=TGST09:g=87:u=1:i=1466603960:t=1466690360:s=AQE100NLG_uPIcJSJ7GLtRVkH7j_Ylu9"; Expires=Thu, 23 Jun 2016 13:59:20 GMT; domain=.linkedin.com; Path=/". Invalid 'expires' attribute: Thu, 23 Jun 2016 13:59:20 GMT

Is this related to the way my HTTP client handles the cookies returned by LinkedIn?

Any suggestions?
Thanks!


First of all: this is not a crawler4j problem. It is a LinkedIn problem which, according to the most recent Google results, they have not fixed for a long time.
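To see why the first warning fires, here is a simplified sketch of the strict RFC 2109-style domain match that HttpClient applies by default (this is an illustration, not HttpClient's actual implementation): the `domain` attribute must start with a dot and be a suffix of the origin host, so `.www.linkedin.com` from origin `www.linkedin.com` fails, and the bare `linkedin.com` attribute fails for lacking the leading dot. The warnings are noise from LinkedIn's non-conforming cookies, not the cause of the failed crawl.

```java
public class CookieDomainCheck {

    // Simplified RFC 2109-style domain match: the attribute must start
    // with a dot, and the origin host must end with it, so "www.linkedin.com"
    // matches ".linkedin.com" but not ".www.linkedin.com" (which is longer
    // than the origin itself) and not "linkedin.com" (no leading dot).
    static boolean accepts(String originHost, String domainAttribute) {
        if (!domainAttribute.startsWith(".")) {
            return false; // "linkedin.com" is rejected outright
        }
        return originHost.endsWith(domainAttribute);
    }

    public static void main(String[] args) {
        // The two attributes seen in the warnings above are both rejected:
        System.out.println(accepts("www.linkedin.com", ".www.linkedin.com")); // false
        System.out.println(accepts("www.linkedin.com", "linkedin.com"));      // false
        // A conforming attribute would have been accepted:
        System.out.println(accepts("www.linkedin.com", ".linkedin.com"));     // true
    }
}
```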

However, your approach will not work anyway, because crawler4j respects crawler ethics.
If you look at LinkedIn's robots.txt, you will see that your crawler is not allowed to crawl anything.
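To illustrate why the crawl yields nothing, here is a deliberately simplified robots.txt check in plain Java (the sample rules below are an assumption in the spirit of what LinkedIn serves to unknown crawlers; check the live file for the real rules). A blanket `Disallow: /` means an ethical crawler like crawler4j, whose RobotstxtServer performs an equivalent check, refuses every URL before fetching it:

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsCheck {

    // Minimal matcher: a path is disallowed if it starts with any
    // Disallow prefix listed under the generic "User-agent: *" section.
    // Real parsers also handle Allow lines, wildcards, and agent groups.
    static boolean isAllowed(String robotsTxt, String path) {
        List<String> disallowed = new ArrayList<>();
        boolean inGenericSection = false;
        for (String line : robotsTxt.split("\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                inGenericSection = line.substring(11).trim().equals("*");
            } else if (inGenericSection && line.toLowerCase().startsWith("disallow:")) {
                String prefix = line.substring(9).trim();
                if (!prefix.isEmpty()) {
                    disallowed.add(prefix);
                }
            }
        }
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical blanket ban, as served to crawlers that are not
        // explicitly whitelisted:
        String robotsTxt = "User-agent: *\nDisallow: /";

        System.out.println(isAllowed(robotsTxt, "/vsearch/f")); // false: seed URL refused
        System.out.println(isAllowed(robotsTxt, "/"));          // false: everything refused
    }
}
```

Disabling robots.txt handling via `robotstxtConfig.setUserAgentName(...)`/`RobotstxtConfig` settings would sidestep this check, but it would violate LinkedIn's terms, which is exactly what the crawler's ethics are there to prevent.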