Extracting text from HTML file using Python

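I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into Notepad.

I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. It also did not interpret HTML entities. For example, I would expect an entity such as &#39; in the HTML source to be converted to an apostrophe in the text, just as if I'd pasted the browser content into Notepad.

Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown, which would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.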
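To make the entity requirement concrete, here is a minimal sketch using html.unescape from the Python 3 standard library. The sample string is made up for illustration, and this only decodes entities; it does nothing about tags or script content.

```python
import html

# Hypothetical sample: an apostrophe written as a numeric character
# reference, plus a named entity.
sample = "It&#39;s a test &amp; nothing more"

# html.unescape converts named and numeric references to characters.
decoded = html.unescape(sample)
print(decoded)
```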

The best piece of code I found for extracting text without getting JavaScript or other unwanted content:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each (split on double spaces)
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

You have to install BeautifulSoup first:

pip install beautifulsoup4
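The whitespace clean-up at the end of that script can be tried on its own, without fetching a page. This is only a sketch on a made-up string (raw is a stand-in for whatever soup.get_text() would return); it shows how splitting on double spaces breaks run-together headlines apart and how blank lines are dropped.

```python
# Hypothetical stand-in for the output of soup.get_text()
raw = "  Title  \n\n  First headline    Second headline \n   \n Body text \n"

# strip leading/trailing whitespace from every line
lines = (line.strip() for line in raw.splitlines())
# split each line on double spaces so fused headlines become separate chunks
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# keep only non-empty chunks, one per line
cleaned = "\n".join(chunk for chunk in chunks if chunk)

print(cleaned)
```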

html2text is a Python program that does a pretty good job at this.

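Note: NLTK no longer supports the clean_html function.

The original answer is below, with alternatives in the comments.

Use NLTK

I wasted 4-5 hours fixing issues with html2text. Luckily, I came across NLTK.
It works magically.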

import nltk
from urllib.request import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
# clean_html() is no longer implemented in NLTK 3; this call now raises NotImplementedError
raw = nltk.clean_html(html)
print(raw)

I found myself facing the same problem today. I wrote a very simple HTML parser to strip incoming content of all markup, returning the remaining text with only a minimum of formatting.

from html.parser import HTMLParser  # Python 3; in Python 2 this was "from HTMLParser import HTMLParser"
from re import sub
from sys import stderr
from traceback import print_exc


class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        # collapse runs of whitespace inside a text node into single spaces
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        # paragraphs get a blank line, line breaks a single newline
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        # on any parser failure, log the traceback and return the input unchanged
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                Project: DeHTML
                Description:
                This small script is intended to allow conversion from HTML markup to
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()

Here is a version of xperroni's answer which is a bit more complete. It skips script and style sections and translates charrefs (e.g., &#39;) and HTML entities (e.g., &amp;).

It also includes a trivial plain-text-to-HTML inverse converter.

"""
HTML <-> text conversions.
"""
from html.parser import HTMLParser          # Python 2: from HTMLParser import HTMLParser
from html.entities import name2codepoint    # Python 2: from htmlentitydefs import name2codepoint
import re


class _HTMLToText(HTMLParser):
    def __init__(self):
        # convert_charrefs=False so that handle_entityref/handle_charref below are called
        HTMLParser.__init__(self, convert_charrefs=False)
        self._buf = []
        self.hide_output = False

    def handle_starttag(self, tag, attrs):
        if tag in ('p', 'br') and not self.hide_output:
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = True

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self._buf.append('\n')

    def handle_endtag(self, tag):
        if tag == 'p':
            self._buf.append('\n')
        elif tag in ('script', 'style'):
            self.hide_output = False

    def handle_data(self, text):
        if text and not self.hide_output:
            self._buf.append(re.sub(r'\s+', ' ', text))

    def handle_entityref(self, name):
        if name in name2codepoint and not self.hide_output:
            c = chr(name2codepoint[name])
            self._buf.append(c)

    def handle_charref(self, name):
        if not self.hide_output:
            n = int(name[1:], 16) if name.startswith('x') else int(name)
            self._buf.append(chr(n))

    def get_text(self):
        return re.sub(r' +', ' ', ''.join(self._buf))


def html_to_text(html):
    """
    Given a piece of HTML, return the plain text it contains.
    This handles entities and char refs, but not javascript and stylesheets.
    """
    parser = _HTMLToText()
    # HTMLParseError no longer exists in Python 3.5+, so there is nothing to catch here
    parser.feed(html)
    parser.close()
    return parser.get_text()


def text_to_html(text):
    """
    Convert the given text to html, wrapping what looks like URLs with <a> tags,
    converting newlines to <br> tags and converting confusing chars into html
    entities.
    """
    def f(mo):
        t = mo.group()
        if len(t) == 1:
            return {'&': '&amp;', "'": '&#39;', '"': '&quot;', '<': '&lt;', '>': '&gt;'}.get(t)
        return '<a href="%s">%s</a>' % (t, t)
    return re.sub(r'https?://[^] ()"\';]+|[&\'"<>]', f, text)
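On Python 3.5 and later, html.parser decodes character references itself when convert_charrefs is left at its default of True, so the same idea can be written more compactly. The following is only a sketch under that assumption; TextOnlyParser and strip_html are names invented here, not part of any answer above.

```python
from html.parser import HTMLParser
import re


class TextOnlyParser(HTMLParser):
    """Collect text, skipping <script> and <style> (hypothetical helper)."""

    def __init__(self):
        super().__init__()  # convert_charrefs defaults to True
        self._buf = []
        self._skip = 0      # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("p", "br"):
            self._buf.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # outside script/style, entities arrive here already decoded
        if not self._skip:
            self._buf.append(re.sub(r"\s+", " ", data))

    def get_text(self):
        return "".join(self._buf).strip()


def strip_html(markup):
    parser = TextOnlyParser()
    parser.feed(markup)
    parser.close()
    return parser.get_text()


# Made-up sample with an entity, a charref and a script block
sample = "<p>Tom &amp; Jerry&#39;s</p><script>var x = 1;</script><p>Done</p>"
print(strip_html(sample))
```

The trade-off is that the entity handlers disappear entirely, which is less code but also less control over unknown entities.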

I know there are a lot of answers already, but the most elegant and Pythonic solution I have found is described, in part, here.

from bs4 import BeautifulSoup

text = ''.join(BeautifulSoup(some_html_string, "html.parser").findAll(text=True))

Update

Based on Fraser's comment, here is a more elegant solution:

from bs4 import BeautifulSoup

clean_text = ''.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)
You can also use the html2text method in the stripogram library.

from stripogram import html2text
text = html2text(your_html_string)

To install stripogram, run sudo easy_install stripogram

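There is the Pattern library for data mining.

http://www.clips.ua.ac.be/pages/pattern-web

You can even decide which tags to keep:

from pattern.web import URL, plaintext

s = URL('http://www.clips.ua.ac.be').download()
s = plaintext(s, keep={'h1':[], 'h2':[], 'strong':[], 'a':['href']})
print(s)

PyParsing does a great job. The PyParsing wiki was killed, so here is another location with examples of PyParsing in use (example link). One reason for investing a little time in pyparsing is that its author has also written a very brief, very well organized O'Reilly Short Cut manual that is also inexpensive.

Having said that, I use BeautifulSoup a lot, and it is not that hard to deal with the entities issue: you can convert them before you run BeautifulSoup.

Good luck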

If you need more speed and less accuracy, you can use raw lxml.

import lxml.html as lh
# In lxml 5.2 and later this module is provided by the separate lxml_html_clean package
from lxml.html.clean import clean_html

def lxml_to_text(html):
    doc = lh.fromstring(html)
    doc = clean_html(doc)
    return doc.text_content()

Install html2text using

pip install html2text

then,

>>> import html2text
>>>
>>> h = html2text.HTML2Text()
>>> # Ignore converting links from HTML
>>> h.ignore_links = True
>>> print(h.handle("<p>Hello, world!"))
Hello, world!

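This isn't exactly a Python solution, but it will convert text that JavaScript would generate into text, which I think is important (e.g., google.com). The browser Links (not Lynx) has a JavaScript engine, and will convert source to text with the -dump option.

So you could do something like:

import subprocess
import tempfile

# write the HTML to a temporary file that links can read
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as f:
    f.write(html_source)
    fname = f.name

proc = subprocess.Popen(['links', '-dump', fname],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.DEVNULL)
text = proc.stdout.read()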

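Instead of the HTMLParser module, check out htmllib. It has a similar interface, but does more of the work for you. (It is pretty ancient, so it's not much help in getting rid of JavaScript and CSS. You could make a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it's hard to do this reliably for malformed HTML.) Anyway, here's something simple that prints the plain text to the console:

# Python 2 only: htmllib is not available in Python 3
from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter

p = HTMLParser(AbstractFormatter(DumbWriter()))
try:
    p.feed('hellothere')
    p.close()  # calling close is not usually needed, but let's play it safe
except HTMLParseError:
    print(':(')  # the html is badly malformed (or you found a bug)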

Beautiful Soup does convert HTML entities. It's probably your best bet, considering that HTML is often buggy and filled with Unicode and HTML encoding issues. This is the code I use to convert HTML to raw text:

# Uses the old BeautifulSoup 3 API (the "BeautifulSoup" package), not bs4
import copy
import re

import BeautifulSoup

def getsoup(data, to_unicode=False):
    data = data.replace("", "")
    # Fixes for bad markup I've seen in the wild. Remove if not applicable.
    masssage_bad_comments = [
        (re.compile('<!-([^-])'), lambda match: ''),
    ]
    myNewMassage = copy.copy(BeautifulSoup.BeautifulSoup.MARKUP_MASSAGE)
    myNewMassage.extend(masssage_bad_comments)
    return BeautifulSoup.BeautifulSoup(data, markupMassage=myNewMassage,
        convertEntities=BeautifulSoup.BeautifulSoup.ALL_ENTITIES
                        if to_unicode else None)

remove_html = lambda c: getsoup(c, to_unicode=True).getText(separator=u' ') if c else ""

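I recommend a Python package called goose-extractor.
Goose will try to extract the following information:

Main text of an article
Main image of the article
Any YouTube / Vimeo movies embedded in the article
Meta description
Meta tags

More: https://pypi.python.org/pypi/goose-extractor/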

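Another option is to run the HTML through a text-based web browser and dump it. For example (using Lynx):

lynx -dump html_to_convert.html > converted_html.txt

This can be done within a Python script as follows:

import subprocess

with open('converted_html.txt', 'w') as outputFile:
    subprocess.call(['lynx', '-dump', 'html_to_convert.html'], stdout=outputFile)

It won't give you exactly just the text from the HTML file, but depending on your use case it may be preferable to the output of html2text.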

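Another non-Python solution: LibreOffice:

soffice --headless --invisible --convert-to txt input1.html

The reason I prefer this one over the other alternatives is that every HTML paragraph gets converted into a single text line (no line breaks), which is what I was looking for. Other methods require post-processing. Lynx does produce nice output, but not exactly what I was looking for. Besides, LibreOffice can be used to convert from all sorts of formats...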

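I know there are plenty of answers here already, but I think newspaper also deserves a mention. I recently needed to complete a similar task of extracting the text from articles on the web, and this library has done an excellent job of it so far in my tests. It ignores the text found in menu items and sidebars, as well as any JavaScript that appears on the page, as the OP requests.

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

If you already have the HTML file downloaded, you can do something like this:

article = Article('')
article.set_html(html)
article.parse()
article.text

It even has a few NLP features for summarizing the topics of articles:

article.nlp()
article.summary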

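I've had good results with Apache Tika. Its purpose is the extraction of metadata and text from content, so the underlying parser is tuned accordingly.

Tika can be run as a server, is trivial to run or deploy in a Docker container, and from there can be accessed via Python bindings.

Has anyone tried bleach.clean(html, tags=[], strip=True) with bleach? It's working for me.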

In Python 3.x you can do it in a very easy way by importing the 'imaplib' and 'email' packages. Although this is an older post, maybe my answer can help newcomers to it.

import email

# self.imap is an already logged-in imaplib connection; num is the message number
status, data = self.imap.fetch(num, '(RFC822)')
email_msg = email.message_from_bytes(data[0][1])
# email.message_from_string(data[0][1])

# If the message is multipart we only want the text version of the body;
# this walks the message and gets the body.
if email_msg.is_multipart():
    for part in email_msg.walk():
        if part.get_content_type() == "text/plain":
            # decode=True undoes the MIME transfer encoding (e.g., Base64, uuencode, quoted-printable)
            body = part.get_payload(decode=True)
            body = body.decode()
        elif part.get_content_type() == "text/html":
            continue

Now you can print the body variable and it will be in plain-text format :) If it is good enough for you, it would be nice to select it as the accepted answer.
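The walk over the MIME parts can be tried without an IMAP server. This is a sketch that builds a made-up two-part message with the standard library's EmailMessage (the subject, bodies and the plain_text_body helper are all invented for illustration). Note that it picks the text/plain alternative that the sender included; it does not convert the HTML part to text.

```python
from email.message import EmailMessage

# Build a hypothetical multipart/alternative message: plain text plus HTML.
msg = EmailMessage()
msg["Subject"] = "Demo"
msg.set_content("Hello, plain world!")
msg.add_alternative("<p>Hello, <b>HTML</b> world!</p>", subtype="html")


def plain_text_body(message):
    """Return the first text/plain part of a message as str, or None."""
    for part in message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)          # bytes
            charset = part.get_content_charset() or "utf-8"  # fall back to UTF-8
            return payload.decode(charset)
    return None


body = plain_text_body(msg)
print(body)
```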

In a simple way:

import re

html_text = open('html_file.html').read()
text_filtered = re.sub(r'<(.*?)>', '', html_text)

This code finds every part of html_text that starts with "<" and ends with ">" and replaces each match with an empty string.
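Running the same substitution on a small made-up string shows what this approach leaves behind: the contents of script blocks survive, and entities are not decoded. The sample below is invented for illustration.

```python
import re

# Hypothetical sample containing an entity and a script block
html_text = "<p>Fish &amp; chips</p><script>alert('hi');</script>"

# Same pattern as above: remove anything between "<" and the next ">"
text_filtered = re.sub(r"<(.*?)>", "", html_text)
print(text_filtered)
```

Only the tags themselves are removed, so the script source and the raw &amp; are still in the result.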

@PeYoTIL's answer, which uses BeautifulSoup and eliminates style and script content, didn't work for me. I tried it with decompose instead of extract, but it still didn't work. So I created my own, which also formats the text using the <p> tags and replaces <a> tags with the href link. It also copes with links inside text. It is available at this gist with a test document embedded.

from bs4 import BeautifulSoup, NavigableString

def html_to_text(html):
    "Creates a formatted text email message as a string from a rendered html template (page)"
    soup = BeautifulSoup(html, 'html.parser')
    # Ignore anything in head
    body, text = soup.body, []
    for element in body.descendants:
        # We use type and not isinstance since comments, cdata, etc are subclasses that we don't want
        if type(element) == NavigableString:
            # We use the assumption that other tags can't be inside a script or style
            if element.parent.name in ('script', 'style'):
                continue

            # remove any multiple and leading/trailing whitespace
            string = ' '.join(element.string.split())
            if string:
                if element.parent.name == 'a':
                    a_tag = element.parent
                    # replace link text with the link
                    string = a_tag['href']
                    # concatenate with any non-empty immediately previous string
                    if (type(a_tag.previous_sibling) == NavigableString and
                            a_tag.previous_sibling.string.strip()):
                        text[-1] = text[-1] + ' ' + string
                        continue
                elif element.previous_sibling and element.previous_sibling.name == 'a':
                    text[-1] = text[-1] + ' ' + string
                    continue
                elif element.parent.name == 'p':
                    # Add extra paragraph formatting newline
                    string = '\n' + string
                text += [string]
    doc = '\n'.join(text)
    return doc

What worked best for me is inscriptis.

https://github.com/weblyzard/inscriptis

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)
print(text)

The results are really good.

You can extract only the text from HTML with BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con, 'html.parser')
texts = soup.get_text()
print(texts)

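While a lot of people have mentioned using regex to strip HTML tags, there are a lot of downsides.

For example:

<p>hello&nbsp;world</p>I love you

should be parsed to:

Hello world
I love you

Here's a snippet I came up with. You can customize it to your specific needs, and it works like a charm:

import re
import html

def html2text(htm):
    # decode entities first, then map a few typographic characters to ASCII
    ret = html.unescape(htm)
    ret = ret.translate({
        8209: ord('-'),
        8220: ord('"'),
        8221: ord('"'),
        160: ord(' '),
    })
    # turn every whitespace character into a plain space
    ret = re.sub(r"\s", " ", ret, flags=re.MULTILINE)
    # block-level closing tags and line breaks become newlines
    ret = re.sub(r"<br>|<br />|</p>|</div>|</h\d>", "\n", ret, flags=re.IGNORECASE)
    # every remaining tag becomes a space
    ret = re.sub('<.*?>', ' ', ret, flags=re.DOTALL)
    # collapse runs of spaces
    ret = re.sub(r"  +", " ", ret)
    return ret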

Another example using BeautifulSoup4 in Python 2.7.9+

Includes:

import urllib2
from bs4 import BeautifulSoup

Code:

def read_website_to_text(url):
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return str(text.encode('utf-8'))

Explained:

Read the URL data in as HTML (using BeautifulSoup), remove all script and style elements, and get just the text using .get_text(). Break it into lines and remove leading and trailing space on each, then break multi-headlines into a line each with chunks = (phrase.strip() for line in lines for phrase in line.split("  ")). Then, using text = '\n'.join, drop blank lines, and finally return the result encoded as UTF-8.

Notes:

Some systems this is run on will fail with https:// connections because of an SSL issue; you can turn off verification to work around it. Example fix: http://blog.pengyifan.com/how-to-fix-python-ssl-certificate_verify_failed/
Python < 2.7.9 may have some issues running this.
text.encode('utf-8') can leave weird encoding; you may want to just return str(text) instead.
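The last note becomes more visible if this function is ever ported to Python 3 (an assumption; the answer itself targets Python 2.7.9+). There, str() applied to a bytes object returns its repr rather than the text. A small sketch with a made-up string:

```python
text = "café"

encoded = text.encode("utf-8")   # a bytes object
as_str = str(encoded)            # Python 3: the repr of the bytes, not the text
print(as_str)

# Decoding, or simply returning the original str, gives readable text back.
round_trip = encoded.decode("utf-8")
print(round_trip)
```

So on Python 3 it is usually simpler to return text directly and encode only when writing to a file or socket.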

Here's the code I use on a regular basis.

from bs4 import BeautifulSoup
import urllib.error
import urllib.request

def processText(webpage):
    # webpage is a regex match object whose matched text is the URL

    # EMPTY LIST TO STORE PROCESSED TEXT
    proc_text = []

    try:
        news_open = urllib.request.urlopen(webpage.group())
        news_soup = BeautifulSoup(news_open, "lxml")
        news_para = news_soup.find_all("p", text=True)

        for item in news_para:
            # SPLIT WORDS, JOIN WORDS TO REMOVE EXTRA SPACES
            para_text = ' '.join(item.text.split())

            # COMBINE LINES/PARAGRAPHS INTO A LIST
            proc_text.append(para_text)

    except urllib.error.HTTPError:
        pass

    return proc_text

I hope that helps.

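The LibreOffice Writer comment has merit, since the application can employ Python macros. It seems to offer multiple benefits, both for answering this question and for furthering the macro base of LibreOffice. If this solution is a one-off implementation, rather than part of a larger production program, opening the HTML in Writer and saving the page as text would seem to resolve the issues discussed here.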

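The Perl way (sorry mom, I'll never do it in production).

import re

def html2text(html):
    # replace every tag with a space
    res = re.sub('<.*?>', ' ', html, flags=re.DOTALL | re.MULTILINE)
    # collapse repeated newlines and drop carriage returns
    res = re.sub('\n+', '\n', res)
    res = re.sub('\r+', '', res)
    # collapse runs of tabs and spaces
    res = re.sub('[\t ]+', ' ', res)
    res = re.sub('\t+', '\t', res)
    # collapse lines that contain only a space
    res = re.sub('(\n )+', '\n ', res)
    return res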

I implemented something like this.

>>> import requests
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> res = requests.get(url)
>>> text = res.text