Python: Problems with encoding while parsing an HTML document with lxml

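I am trying to get clean text from some web pages.
I have read a lot of tutorials and ended up with the Python lxml + beautifulsoup + requests modules.
The reason for using lxml for this task is that it cleans HTML files better than Beautiful Soup does.

I ended up with a test script like this: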

from bs4 import UnicodeDammit
import re
import requests
import lxml
import lxml.html
from time import sleep

urls = [
    "http://mathprofi.ru/zadachi_po_kombinatorike_primery_reshenij.html",
    "http://ru.onlinemschool.com/math/assistance/statistician/",
    "http://mathprofi.ru/zadachi_po_kombinatorike_primery_reshenij.html",
    "http://universarium.org/courses/info/332",
    "http://compsciclub.ru/course/wordscombinatorics",
    "http://ru.onlinemschool.com/math/assistance/statistician/",
    "http://lectoriy.mipt.ru/course/Maths-Combinatorics-AMR-Lects/",
    "http://www.youtube.com/watch?v=SLPrGWQBX0I"
]

def check(url):
    print "That is url {}".format(url)
    r = requests.get(url)
    ud = UnicodeDammit(r.content, is_html=True)
    content = ud.unicode_markup.encode(ud.original_encoding, "ignore")
    root = lxml.html.fromstring(content)
    lxml.html.etree.strip_elements(root, lxml.etree.Comment,
                                   "script", "style")
    text = lxml.html.tostring(root, method="text", encoding=unicode)
    text = re.sub('\s+', ' ', text)
    print "Text type is {}!".format(type(text))
    print text[:200]
    sleep(1)

if __name__ == '__main__':
    for url in urls:
        check(url)

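The intermediate step of decoding and re-encoding to the original encoding is needed because an HTML page may contain a few characters that are encoded differently from most of the others. Such stray characters break the later lxml tostring call.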
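To illustrate that decode-and-re-encode step in isolation, here is a minimal sketch. It assumes Python 3 (the script above is Python 2), and the sample bytes are made up for the example: two bytes that can never occur in valid UTF-8 are placed between two Russian words.

```python
# Python 3 sketch: drop bytes that do not fit the page's main encoding.
# The sample data below is hypothetical, not taken from any of the test URLs.
enc = "utf-8"
raw = "Комбинаторика".encode(enc) + b" \xff\xfe " + "онлайн".encode(enc)

# Decoding with errors="ignore" silently skips the invalid bytes;
# re-encoding then yields byte content that is clean in `enc`.
cleaned = raw.decode(enc, "ignore").encode(enc)

print(cleaned.decode(enc))
```

The cleaned bytes can then be handed to a parser without the stray bytes getting in the way, at the cost of losing whatever those bytes were meant to represent.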

But my code does not work properly on all of the test URLs. Sometimes (especially with the last two URLs) it outputs a mess:

...
That is url http://ru.onlinemschool.com/math/assistance/statistician/
Text type is <type 'unicode'>!
Онлайн решение задач по математике. Комбинаторика. Теория вероятности. Close Авторизация на сайте Введите логин: Введитепароль: Запомнить меня Регистрация Изучение математики онлайн.Изучайтематемат
That is url http://lectoriy.mipt.ru/course/Maths-Combinatorics-AMR-Lects/
Text type is <type 'unicode'>!
DD°?DμD?D°?D?DoD°. D?D?D?D2? DoD?D?D±D?D?D°?D??D?DoD? D? ?DμD??D?D? ?D??DμD? / DD?D′DμD?D?DμDo?D?D? D¤D?D·?Dμ?D°: DDμDo?D??D?D1 DD¤D￠D - D2D?D′DμD?D?DμDo?D?D? D?D? ?D?D·D?DoDμ,
That is url http://www.youtube.com/watch?v=SLPrGWQBX0I
Text type is <type 'unicode'>!
D?D?D?D2D??Dμ ?D??D??D?? DoD?D?D±D?D?D°?D??D?DoD? - bezbotvy - YouTube D?D?D????D??? RU DD?D±D°D2D??? D2D?D′DμD?DD?D1?D?DD?D??Do DD°D3??D·DoD°... D?D±Dμ?D??Dμ ?D·?Do.

This mess is somehow connected with the ISO-8859-1 encoding, but I cannot figure out how.
For each of the last two URLs I get:

In [319]: r = requests.get(urls[-1])

In [320]: chardet.detect(r.content)
Out[320]: {'confidence': 0.99, 'encoding': 'utf-8'}

In [321]: UnicodeDammit(r.content, is_html=True).original_encoding
Out[321]: 'utf-8'

In [322]: r = requests.get(urls[-2])

In [323]: chardet.detect(r.content)
Out[323]: {'confidence': 0.99, 'encoding': 'utf-8'}

In [324]: UnicodeDammit(r.content, is_html=True).original_encoding
Out[324]: u'utf-8'

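So I guess lxml does its internal decoding based on wrong assumptions about the input string. I think it does not even try to guess the encoding of the input string. It seems that something like this happens in the core of lxml: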

In [339]: print unicode_string.encode('utf-8').decode("ISO-8859-1","ignore")
???D?DoD°
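The suspected effect can be reproduced without lxml at all. The sketch below assumes Python 3, and the sample word is my own choice rather than text from the pages above. It shows that UTF-8 bytes read as ISO-8859-1 turn into this kind of garbage, and that the damage is reversible as long as no bytes were dropped.

```python
# Python 3 sketch: UTF-8 bytes misread as ISO-8859-1 become mojibake.
original = "Комбинаторика"  # hypothetical sample text
utf8_bytes = original.encode("utf-8")

# ISO-8859-1 maps every byte to a character, so this never raises;
# each two-byte Cyrillic letter simply becomes two Latin-1 characters.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)

# Reversing the wrong step restores the text.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)
```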

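How can I resolve this and get text stripped of HTML tags for all of the URLs?
Maybe I should use other Python modules, or use these in a different way?
Please give me your suggestions.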

Related discussion

I finally figured it out.
The solution is not to use

root = lxml.html.fromstring(content)

but to configure an explicit Parser object, which can be told to use a particular encoding enc:

htmlparser = etree.HTMLParser(encoding=enc)
root = etree.HTML(content, parser=htmlparser)
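As a self-contained illustration of the explicit-parser approach, here is a minimal sketch. It assumes Python 3 and an installed lxml; the helper name html_bytes_to_text and the sample markup are hypothetical, and with_tail=False is my own choice so that text following a removed element is kept.

```python
from lxml import etree


def html_bytes_to_text(content, enc):
    """Parse HTML bytes with an explicitly chosen encoding and return plain text."""
    # Tell the parser which encoding to use instead of letting it guess.
    parser = etree.HTMLParser(encoding=enc)
    root = etree.HTML(content, parser=parser)
    # Remove comments, scripts and styles, but keep the text that follows them.
    etree.strip_elements(root, etree.Comment, "script", "style", with_tail=False)
    text = etree.tostring(root, method="text", encoding="unicode")
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())


# Hypothetical sample page without any charset declaration.
sample = (
    "<html><body><p>Комбинаторика</p>"
    "<script>var x = 1;</script></body></html>"
).encode("utf-8")

print(html_bytes_to_text(sample, "utf-8"))
```

Because the sample carries no charset declaration, the parser has nothing to go on except the encoding argument, which is the situation the fix targets.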

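In addition, I found that even UnicodeDammit makes obvious mistakes when deciding on a page's encoding. So I added another if block:

if (declared_enc and enc != declared_enc):

Here is the resulting snippet: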

from lxml import html
from lxml.html import etree
import requests
from bs4 import UnicodeDammit
import chardet

try:
    self.log.debug("Try to get content from page {}".format(url))
    r = requests.get(url)
except requests.exceptions.RequestException as e:
    self.log.warn("Unable to get page content of the url: {url}. "
                  "The reason: {exc!r}".format(url=url, exc=e))
    raise ParsingError(e.message)

ud = UnicodeDammit(r.content, is_html=True)

enc = ud.original_encoding.lower()
declared_enc = ud.declared_html_encoding
if declared_enc:
    declared_enc = declared_enc.lower()
# possible misrecognition of an encoding
if (declared_enc and enc != declared_enc):
    detect_dict = chardet.detect(r.content)
    det_conf = detect_dict["confidence"]
    det_enc = detect_dict["encoding"].lower()
    if enc == det_enc and det_conf < THRESHOLD_OF_CHARDETECT:
        enc = declared_enc
# if the page contains any characters that differ from the main
# encoding we will ignore them
content = r.content.decode(enc, "ignore").encode(enc)
htmlparser = etree.HTMLParser(encoding=enc)
root = etree.HTML(content, parser=htmlparser)
etree.strip_elements(root, html.etree.Comment, "script", "style")
text = html.tostring(root, method="text", encoding=unicode)
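The encoding reconciliation in the snippet above can be pulled out into a small pure function, which makes it easier to test without network access. This is only a sketch under assumptions: the function name choose_encoding is hypothetical, the default threshold of 0.9 is a placeholder (the value of THRESHOLD_OF_CHARDETECT is not given above), and the guard against a missing chardet result is my addition.

```python
def choose_encoding(detected, declared, chardet_enc, chardet_conf, threshold=0.9):
    """Pick an encoding name, preferring the declared one when detection looks shaky.

    detected     -- encoding guessed by UnicodeDammit (original_encoding)
    declared     -- encoding declared in the HTML itself, or None
    chardet_enc  -- encoding reported by chardet, or None
    chardet_conf -- chardet's confidence for chardet_enc
    threshold    -- assumed cut-off below which the guess is distrusted
    """
    enc = detected.lower()
    declared = declared.lower() if declared else None
    # Only second-guess the detection when the page declares something else.
    if declared and enc != declared:
        # chardet agreeing with low confidence is treated as weak evidence,
        # so the declared encoding wins in that case.
        if chardet_enc and enc == chardet_enc.lower() and chardet_conf < threshold:
            enc = declared
    return enc


# Detection and declaration disagree, and chardet is unsure: trust the page.
print(choose_encoding("windows-1252", "utf-8", "windows-1252", 0.5))
# Same disagreement, but chardet is confident: keep the detected encoding.
print(choose_encoding("windows-1252", "utf-8", "windows-1252", 0.99))
```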