Problems with encoding while parsing html document with lxml
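I am trying to extract clean text from some web pages.
I have read a lot of tutorials and ended up with Python, using the lxml, BeautifulSoup (UnicodeDammit) and requests modules.
This is the test script I came up with: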
```python
from bs4 import UnicodeDammit
import re
import requests
import lxml
import lxml.html
from time import sleep

urls = [
    "http://mathprofi.ru/zadachi_po_kombinatorike_primery_reshenij.html",
    "http://ru.onlinemschool.com/math/assistance/statistician/",
    "http://mathprofi.ru/zadachi_po_kombinatorike_primery_reshenij.html",
    "http://universarium.org/courses/info/332",
    "http://compsciclub.ru/course/wordscombinatorics",
    "http://ru.onlinemschool.com/math/assistance/statistician/",
    "http://lectoriy.mipt.ru/course/Maths-Combinatorics-AMR-Lects/",
    "http://www.youtube.com/watch?v=SLPrGWQBX0I"
]


def check(url):
    print "That is url {}".format(url)
    r = requests.get(url)
    ud = UnicodeDammit(r.content, is_html=True)
    content = ud.unicode_markup.encode(ud.original_encoding, "ignore")
    root = lxml.html.fromstring(content)

    lxml.html.etree.strip_elements(root, lxml.etree.Comment,
                                   "script", "style")
    text = lxml.html.tostring(root, method="text", encoding=unicode)
    text = re.sub(r'\s+', ' ', text)
    print "Text type is {}!".format(type(text))
    print text[:200]
    sleep(1)


if __name__ == '__main__':
    for url in urls:
        check(url)
```
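The intermediate decode and re-encode to the original encoding is needed because an HTML page may contain a few characters that are encoded differently from the rest of the page. Such characters break the subsequent lxml processing.
However, my code does not work properly for all of the test URLs. Sometimes (especially for the last two URLs) it prints garbage: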
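The decode-and-re-encode step described above can be tried in isolation. This is a minimal Python 3 sketch (the script in the question is Python 2), using only the standard library; the helper name `drop_undecodable` and the sample bytes are made up for illustration.

```python
def drop_undecodable(raw_bytes, encoding):
    """Decode with errors='ignore', then re-encode, so that bytes which are
    invalid in `encoding` are silently dropped from the result."""
    return raw_bytes.decode(encoding, "ignore").encode(encoding)


# Two valid UTF-8 Cyrillic letters with a stray 0xFF byte between them.
# 0xFF can never appear in well-formed UTF-8.
valid_part = "Ко".encode("utf-8")
dirty = valid_part[:2] + b"\xff" + valid_part[2:]

cleaned = drop_undecodable(dirty, "utf-8")
print(cleaned.decode("utf-8"))
```

Note that this only removes bytes that are invalid in the chosen encoding. It does nothing if the encoding itself was guessed wrongly.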
```
...
That is url http://ru.onlinemschool.com/math/assistance/statistician/
Text type is <type 'unicode'>!
Онлайн решение задач по математике. Комбинаторика. Теория вероятности. Close Авторизация на сайте Введите логин: Введитепароль: Запомнить меня Регистрация Изучение математики онлайн.Изучайтематемат
That is url http://lectoriy.mipt.ru/course/Maths-Combinatorics-AMR-Lects/
Text type is <type 'unicode'>!
DD°?DμD?D°?D?DoD°. D?D?D?D2? DoD?D?D±D?D?D°?D??D?DoD? D? ?DμD??D?D? ?D??DμD? / DD?D′DμD?D?DμDo?D?D? D¤D?D·?Dμ?D°: DDμDo?D??D?D1 DD¤D¢D - D2D?D′DμD?D?DμDo?D?D? D?D? ?D?D·D?DoDμ,
That is url http://www.youtube.com/watch?v=SLPrGWQBX0I
Text type is <type 'unicode'>!
D?D?D?D2D??Dμ ?D??D??D?? DoD?D?D±D?D?D°?D??D?DoD? - bezbotvy - YouTube D?D?D????D??? RU DD?D±D°D2D??? D2D?D′DμD?DD?D1?D?DD?D??Do DD°D3??D·DoD°... D?D±Dμ?D??Dμ ?D·?Do.
```
This mess is somehow connected with the encoding.
For the last two URLs, both detectors report UTF-8:
```
In [319]: r = requests.get(urls[-1])

In [320]: chardet.detect(r.content)
Out[320]: {'confidence': 0.99, 'encoding': 'utf-8'}

In [321]: UnicodeDammit(r.content, is_html=True).original_encoding
Out[321]: 'utf-8'

In [322]: r = requests.get(urls[-2])

In [323]: chardet.detect(r.content)
Out[323]: {'confidence': 0.99, 'encoding': 'utf-8'}

In [324]: UnicodeDammit(r.content, is_html=True).original_encoding
Out[324]: u'utf-8'
```
So I guess the UTF-8 bytes are being decoded as ISO-8859-1 somewhere inside lxml, something like this:
```
In [339]: print unicode_string.encode('utf-8').decode("ISO-8859-1","ignore")
???D?DoD°
```
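The suspected mechanism can be reproduced without lxml. The following is a minimal Python 3 sketch, assuming the garbage really is UTF-8 bytes read as ISO-8859-1; the sample word is only an illustration. Because ISO-8859-1 maps every byte to exactly one character, the damage is reversible as long as nothing was dropped along the way.

```python
original = "Комбинаторика"

# Every Cyrillic letter here takes two bytes in UTF-8.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as ISO-8859-1 yields one character per byte,
# which produces the kind of garbage shown in the output above.
garbled = utf8_bytes.decode("iso-8859-1")

# Reversing the mistake: back to the raw bytes, then decode as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(len(original), len(garbled))
print(repaired == original)
```

In the question's output some characters were already replaced by `?`, so that particular text can no longer be repaired this way. The encoding has to be set correctly before parsing.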
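How can I fix this and strip the HTML tags from the pages at all of these URLs?
Maybe I should use other Python modules, or use these ones differently?
Any suggestions are welcome.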
I finally figured it out.
The solution was not to use
```python
root = lxml.html.fromstring(content)
```
but to configure an explicit parser object, which can be told to use a specific encoding `enc`:
```python
htmlparser = etree.HTMLParser(encoding=enc)
root = etree.HTML(content, parser=htmlparser)
```
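As a self-contained illustration of the explicit-parser approach, here is a small Python 3 sketch that requires lxml to be installed. The helper name `html_to_text` and the sample markup are made up for this example, and `with_tail=False` is my own choice (the original snippet uses the default), so that text following a removed element is kept.

```python
from lxml import etree


def html_to_text(raw_bytes, encoding):
    """Parse raw HTML bytes with an explicitly configured encoding and
    return the visible text with whitespace collapsed."""
    parser = etree.HTMLParser(encoding=encoding)
    root = etree.HTML(raw_bytes, parser=parser)
    # Drop comments, scripts and styles together with their content,
    # but keep any text that follows them (with_tail=False).
    etree.strip_elements(root, etree.Comment, "script", "style",
                         with_tail=False)
    text = etree.tostring(root, method="text", encoding="unicode")
    return " ".join(text.split())


sample = (
    "<html><head><title>Комбинаторика</title>"
    "<script>var x = 1;</script></head>"
    "<body><!-- note --><p>Теория вероятности</p></body></html>"
).encode("utf-8")

result = html_to_text(sample, "utf-8")
print(result)
```

Passing the encoding to the parser means lxml no longer has to guess it from the bytes or from a possibly misleading `<meta>` tag.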
In addition, I found that even UnicodeDammit can misdetect a page's encoding, so I added one more check:
```python
if (declared_enc and enc != declared_enc):
```
Here is the resulting snippet:
```python
from lxml import html
from lxml.html import etree
import requests
from bs4 import UnicodeDammit
import chardet

# ParsingError, THRESHOLD_OF_CHARDETECT, self.log and url are defined
# elsewhere in the surrounding code.
try:
    self.log.debug("Try to get content from page {}".format(url))
    r = requests.get(url)
except requests.exceptions.RequestException as e:
    self.log.warn("Unable to get page content of the url: {url}. "
                  "The reason: {exc!r}".format(url=url, exc=e))
    raise ParsingError(e.message)

ud = UnicodeDammit(r.content, is_html=True)

enc = ud.original_encoding.lower()
declared_enc = ud.declared_html_encoding
if declared_enc:
    declared_enc = declared_enc.lower()
# possible misrecognition of an encoding
if (declared_enc and enc != declared_enc):
    detect_dict = chardet.detect(r.content)
    det_conf = detect_dict["confidence"]
    det_enc = detect_dict["encoding"].lower()
    if enc == det_enc and det_conf < THRESHOLD_OF_CHARDETECT:
        enc = declared_enc
# if the page contains any characters that differ from the main
# encoding, we will ignore them
content = r.content.decode(enc, "ignore").encode(enc)
htmlparser = etree.HTMLParser(encoding=enc)
root = etree.HTML(content, parser=htmlparser)
etree.strip_elements(root, html.etree.Comment, "script", "style")
text = html.tostring(root, method="text", encoding=unicode)
```
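The encoding-selection rule in that snippet can be pulled out into a pure function, which makes it easy to test without any network access. This is a Python 3 sketch under a few assumptions: the function name `choose_encoding` is hypothetical, the default threshold of 0.9 is a placeholder (the value of `THRESHOLD_OF_CHARDETECT` is not given above), and the `None` handling is an addition of mine, since chardet can report `None` as the encoding.

```python
def choose_encoding(detected, declared, chardet_encoding,
                    chardet_confidence, threshold=0.9):
    """Pick the encoding to hand to the HTML parser.

    detected:  encoding guessed by UnicodeDammit (original_encoding)
    declared:  encoding declared in the page itself, or None
    chardet_*: the 'encoding' and 'confidence' values from chardet.detect()
    threshold: assumed cut-off below which the chardet guess is distrusted
    """
    detected = detected.lower() if detected else None
    declared = declared.lower() if declared else None
    guess = chardet_encoding.lower() if chardet_encoding else None

    # Only second-guess the detection when the page declares something else.
    if declared and detected != declared:
        # Both detectors agree, but with low confidence: trust the page.
        if detected == guess and chardet_confidence < threshold:
            return declared
    # Fall back to the declaration if nothing was detected at all.
    return detected or declared


print(choose_encoding("windows-1252", "UTF-8", "windows-1252", 0.5))
print(choose_encoding("utf-8", "windows-1251", "utf-8", 0.99))
```

The first call prefers the declared encoding because the detectors are unsure; the second keeps the detected one because chardet is confident.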