Web crawler: how to keep running when an error occurs in a requests/Beautiful Soup response

I built a web crawler that reads thousands of URLs from a text file and then scrapes data from each page. The list contains many URLs, and some of them are broken, so the crawler stops with an error:

Traceback (most recent call last):
  File "C:/Users/khize_000/PycharmProjects/untitled3/new.py", line 57, in <module>
    crawl_data("http://www.foasdasdasdasdodily.com/r/126e7649cc-sweetssssie-pies-mac-and-cheese-recipe-by-the-dr-oz-show")
  File "C:/Users/khize_000/PycharmProjects/untitled3/new.py", line 18, in crawl_data
    data = requests.get(url)
  File "C:\Python27\lib\site-packages\requests\api.py", line 67, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python27\lib\site-packages\requests\api.py", line 53, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python27\lib\site-packages\requests\sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "C:\Python27\lib\site-packages\requests\adapters.py", line 437, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.foasdasdasdasdodily.com', port=80): Max retries exceeded with url: /r/126e7649cc-sweetssssie-pies-mac-and-cheese-recipe-by-the-dr-oz-show (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x0310FCB0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',))

Here is my code:

import requests
from bs4 import BeautifulSoup

def crawl_data(url):
    global connectString
    data = requests.get(url)
    # Attempt to skip bad pages by comparing the Response repr --
    # this never runs for broken URLs, because requests.get raises first
    response = str(data)
    if response != "<Response [200]>":
        return
    soup = BeautifulSoup(data.text, "lxml")
    titledb = soup.h1.string

But it still gives me the same exception/error.

I simply want it to ignore the URLs that give no response and move on to the next URL.


You need to learn about exception handling. The simplest way to ignore these errors is to surround the code that processes a single URL with a try-except construct, so that your code reads something like this:

try:
    <process a single URL>
except requests.exceptions.ConnectionError:
    pass

This means that if the specified exception occurs, your program will just execute the pass statement (i.e. do nothing) and carry on with the next statement.
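Applied to the whole crawl loop, it might look like the sketch below. This is only an illustration: the urls.txt file name and the process_url helper are assumptions on my part; only the try-except around requests.get comes from the answer.

import requests
from bs4 import BeautifulSoup

def process_url(url):
    # Fetch the page and pull out the <h1> title
    data = requests.get(url)
    soup = BeautifulSoup(data.text, "lxml")
    return soup.h1.string

with open("urls.txt") as f:          # one URL per line (assumed file name)
    for line in f:
        url = line.strip()
        if not url:
            continue
        try:
            title = process_url(url)
        except requests.exceptions.ConnectionError:
            # Broken URL (DNS failure, refused connection, ...): skip it
            continue
        print(title)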


Use try-except

import requests
from bs4 import BeautifulSoup

def crawl_data(url):
    global connectString
    try:
        data = requests.get(url)
    except requests.exceptions.ConnectionError:
        # The URL could not be reached -- give up on it so the caller can move on
        return

    soup = BeautifulSoup(data.text, "lxml")
    titledb = soup.h1.string
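If you also want to skip pages that time out or come back with a non-200 status, you can widen the handling slightly. The sketch below is an extension of the idea and not part of the original answer: it catches requests' base RequestException (which covers ConnectionError, Timeout and friends) and checks the status code before parsing.

import requests
from bs4 import BeautifulSoup

def crawl_data(url):
    try:
        data = requests.get(url, timeout=10)
    except requests.exceptions.RequestException:
        # Any connection, timeout or redirect problem: skip this URL
        return None

    if data.status_code != 200:
        # The server answered, but not with a usable page
        return None

    soup = BeautifulSoup(data.text, "lxml")
    return soup.h1.string if soup.h1 else None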