Decode HTML entities in Python string?

I'm parsing some HTML with Beautiful Soup 3, but it contains HTML entities which Beautiful Soup 3 doesn't automatically decode for me:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<p>&pound;682m</p>")
>>> text = soup.find("p").string
>>> print text
&pound;682m

How can I decode the HTML entities in text to get "£682m" instead of "&pound;682m"?

Python 3.4+

Use html.unescape():

import html
print(html.unescape('&pound;682m'))

FYI html.parser.HTMLParser.unescape is deprecated and was supposed to be removed in 3.5, although it was left in by mistake. It was finally removed in Python 3.9.
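As a quick illustration of what html.unescape() covers, here is a minimal sketch for Python 3.4+ that decodes a named entity, a decimal character reference, and a hexadecimal one. The sample strings are made up for illustration.

```python
import html

# Named, decimal, and hexadecimal references all decode to the same character.
samples = ['&pound;682m', '&#163;682m', '&#xa3;682m']
decoded = [html.unescape(s) for s in samples]
print(decoded)

# Text without entities passes through unchanged.
plain = html.unescape('no entities here')
print(plain)
```

All three samples should come out as the same string, so the spelling of the entity in the source HTML does not matter.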

Python 2.6-3.3

You can use HTMLParser.unescape() from the standard library:

For Python 2.6-2.7 it's in HTMLParser
For Python 3 it's in html.parser

>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m
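
If the same code has to run on both old and new interpreters, the two approaches can be combined. The following is only a sketch under that assumption: it prefers the module-level html.unescape() and falls back to the HTMLParser method where that function does not exist. The name unescape is just a local alias chosen here.

```python
try:
    # Python 3.4+: module-level function
    from html import unescape
except ImportError:
    try:
        # Python 3.0-3.3
        from html.parser import HTMLParser
    except ImportError:
        # Python 2.6-2.7
        from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

result = unescape('&pound;682m')
print(result)
```

Trying html.unescape first matters because the HTMLParser method no longer exists on recent Python 3 releases.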

You can also use the six compatibility library to simplify the import:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('&pound;682m'))
£682m

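Beautiful Soup handles entity conversion. In Beautiful Soup 3, you need to pass the convertEntities argument to the BeautifulSoup constructor (see the 'Entity Conversion' section of the archived docs). In Beautiful Soup 4, entities are decoded automatically.

Beautiful Soup 3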

>>> from BeautifulSoup import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>",
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
<p>£682m</p>

Beautiful Soup 4

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>

You can use replace_entities from the w3lib.html library:

In [202]: from w3lib.html import replace_entities

In [203]: replace_entities("&pound;682m")
Out[203]: u'\xa3682m'

In [204]: print replace_entities("&pound;682m")
£682m

Beautiful Soup 4 allows you to set a formatter for your output:

If you pass in formatter=None, Beautiful Soup will not modify strings
at all on output. This is the fastest option, but it may lead to
Beautiful Soup generating invalid HTML/XML, as in these examples:

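print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>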

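I had a similar encoding issue and used the normalize() method. I was getting a Unicode error from the pandas .to_html() method when exporting my data frame to an .html file in another directory. I ended up doing this, and it worked...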

import unicodedata
import pandas as pd

The data frame object can be whatever you like; let's call it table...

table = pd.DataFrame(data, columns=['Name', 'Team', 'OVR / POT'])
table.index += 1

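Encode the table data so that we can export it to an .html file in the templates folder (this can be whatever location you wish :))

# this is where the magic happens
html_data = unicodedata.normalize('NFKD', table.to_html()).encode('ascii', 'ignore')

Export the normalized string to the html file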

# html_data is an ASCII byte string, so open the file in binary mode
with open("templates/home.html", "wb") as f:
    f.write(html_data)

Reference: unicodedata documentation
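
One thing to keep in mind with this approach: encoding to ASCII with 'ignore' silently drops any character that has no ASCII decomposition. The sketch below uses a made-up string in place of a real data frame to show the effect, then writes the untouched text as UTF-8 as a possible alternative. The temporary file path is hypothetical and only there so the example is self-contained.

```python
import io
import os
import tempfile
import unicodedata

text = u'Sacr\xe9 bleu \xa3682m'  # stand-in for table.to_html()

# NFKD splits e-acute into 'e' plus a combining accent; 'ignore' then drops
# the accent and the pound sign, which has no ASCII equivalent.
ascii_bytes = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(ascii_bytes)

# Alternative: keep every character by writing the file as UTF-8.
path = os.path.join(tempfile.mkdtemp(), 'home.html')  # hypothetical location
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

with io.open(path, encoding='utf-8') as f:
    round_trip = f.read()
print(round_trip == text)
```

Whether losing those characters is acceptable depends on the data; if the page declares a UTF-8 charset, writing UTF-8 avoids the loss.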

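This probably isn't relevant here. But to eliminate these HTML entities from an entire document, you can do something like this (assume document = page, and please forgive the sloppy code; if you have ideas on how to make it better, I'm all ears, I'm new to this):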

import re
import HTMLParser

regexp = "&.+?;"
list_of_html = re.findall(regexp, page)  # finds all html entities in page
for e in list_of_html:
    h = HTMLParser.HTMLParser()
    unescaped = h.unescape(e)  # finds the unescaped value of the html entity
    page = page.replace(e, unescaped)  # replaces html entity with unescaped value
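
On Python 3.4+, a possibly simpler route is to pass the whole document to html.unescape(), which already scans for every entity. If a regex pass is still wanted, note that the pattern &.+?; can match across ordinary text (from a bare & to a later semicolon), so the sketch below assumes a tighter pattern and uses re.sub with a callback. The sample page string and the pattern are illustrative and not taken from the answer above.

```python
import html
import re

page = '<p>AT&T paid &pound;682m; not &euro;682m &#163;1</p>'  # sample input

# Simplest: let html.unescape() handle the whole document at once.
whole = html.unescape(page)

# Regex variant: only match well-formed named or numeric references.
entity = re.compile(r'&(?:#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);')
via_regex = entity.sub(lambda m: html.unescape(m.group(0)), page)

print(whole)
print(via_regex)
```

Both variants leave the bare ampersand in "AT&T" alone, whereas the looser pattern would treat "&T paid &pound;" as a single entity.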