Handling strings that contain tags in BeautifulSoup (Part 1)

Beautiful Soup

BeautifulSoup is a library for extracting data from structured markup such as HTML and XML.
It is commonly used to scrape information from websites.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

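I would rather avoid scraping where possible, and I do not think it should be the first choice, but unfortunately there are times when there is no alternative.
When you do scrape, do it carefully and follow the rules of the site.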

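Contents of this article

Of the two topics below, this part examines the first.
The second is covered in Part 2.

The difference between Tag.text and Tag.string. In particular, Tag.string is None when the string contains a <br> tag.

The difference between Tag.find(string='hoge') and Tag.find(text='hoge'). The two turn out to behave the same.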

Environment

$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.14.2
BuildVersion: 18C54
$ python --version
Python 3.7.2
$ pip show bs4
Name: bs4
Version: 0.0.1

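The story of Tag.text and Tag.string

Use case 1

Suppose you want to parse an HTML string fetched with requests or Selenium and extract data from it. Specifically, assume you want to get the text hogefuga out of a fragment of HTML like the following:

Target HTML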

html = '''
<p>hogefuga</p>
'''

The text can be retrieved with a Python script like the one below. (Install bs4 with pip before running it.)

sample_code1.py

from bs4 import BeautifulSoup

html = '''
<p>hogefuga</p>
'''

soup = BeautifulSoup(html, 'html.parser')
# .string returns the single string directly inside the <p> tag
text = soup.find('p').string
print(text)

Let's run it.

Execution result

$ python sample_code1.py
hogefuga

It works.
This is the basic usage of Beautiful Soup.

Use case 2

So what happens if the string to extract contains a <br> tag?
This is the target HTML this time.

Target HTML

html = '''
<p>hoge<br>fuga</p>
'''

Written the same way as before, the extraction fails.

sample_code2.py

from bs4 import BeautifulSoup

html = '''
<p>hoge<br>fuga</p>
'''

soup = BeautifulSoup(html, 'html.parser')
# <p> now has more than one child, so .string cannot pick a single string
text = soup.find('p').string
print(text)

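Execution result

$ python sample_code2.py
None

If the extracted string contains <br>, use Tag.text instead of Tag.string

Reportedly, the text can be retrieved by using the Tag.text attribute instead.
In any case, let's try it.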

sample_code3.py

from bs4 import BeautifulSoup

html = '''
<p>hoge<br>fuga</p>
'''

soup = BeautifulSoup(html, 'html.parser')
# .text joins every string found under the <p> tag
text = soup.find('p').text
print(text)

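Execution result

$ python sample_code3.py
hogefuga

This returns the text with the line-break tag left out.

What is the difference between Tag.string and Tag.text?

Reading the source code shows where the behavior differs.

Tag.string

Tag.string is defined as follows.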

@property
def string(self):
    """Convenience property to get the single string within this tag.

    :Return: If this tag has a single string child, return value
     is that string. If this tag has no children, or more than one
     child, return value is None. If this tag has one child tag,
     return value is the 'string' attribute of the child tag,
     recursively.
    """
    if len(self.contents) != 1:
        return None
    child = self.contents[0]
    if isinstance(child, NavigableString):
        return child
    return child.string

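As the docstring says, the value is returned according to the following rules. (The markup in the examples was lost in extraction; the inner <span> is an assumed stand-in for any tag.)

If the tag contains a single string (for example <p>hogefuga</p>), that string is returned.
If the tag is empty (for example <p></p>) or contains more than one element (for example <p>hoge<span>fuga</span></p>), None is returned.
If the tag's content is a single tag (for example <p><span>hogefuga</span></p>), the child tag's string is returned as this tag's string. This is done recursively by the return child.string line, so if the child in turn has a single tag as its only content, the grandchild's string is returned.

Code is easier to follow than a prose description here, so here is an example.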

sample_code4.py

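from bs4 import BeautifulSoup

# The tags were stripped from the original listing. This markup is
# reconstructed from the printed results; <span> is an assumed tag name.
html = '''
<p>hogefuga</p>
<p></p>
<p>hoge<span>fuga</span></p>
<p><span>hogefuga</span></p>
<p><span><span>hogefuga</span></span></p>
'''

soup = BeautifulSoup(html, 'html.parser')
paragraphs = soup.find_all('p')
print('single string:', paragraphs[0].string)
print('empty:', paragraphs[1].string)
print('multiple elements:', paragraphs[2].string)
print('single tag:', paragraphs[3].string)
print('grandchild:', paragraphs[4].string)

Execution result

$ python sample_code4.py
single string: hogefuga
empty: None
multiple elements: None
single tag: hogefuga
grandchild: hogefuga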

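So why is None returned for the string hoge<br>fuga, which contains a line-break tag?

Let's inspect the problematic HTML with the Tag.children attribute, which yields the child elements of a Tag. Because children returns an iterator, we convert it to a list before printing.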

sample_code5.py

from bs4 import BeautifulSoup

html = '''
<p>hoge<br>fuga</p>
'''

soup = BeautifulSoup(html, 'html.parser')
# .children is an iterator, so wrap it in list() to print the children
print(list(soup.find_all('p')[0].children))

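Execution result

$ python sample_code5.py
['hoge', <br/>, 'fuga']

From this we can see that <p>hoge<br>fuga</p> is a tag with three children, so len(self.contents) != 1 and Tag.string returns None, exactly as the source code says.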
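The same check can be written out directly. The following is a minimal sketch, assuming the markup <p>hoge<br>fuga</p> and the html.parser backend used in this article; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

html = "<p>hoge<br>fuga</p>"
p = BeautifulSoup(html, "html.parser").find("p")

# Tag.string only looks further when there is exactly one child.
child_count = len(p.contents)
print(child_count)

# Label each direct child as a string node or a tag.
kinds = ["string" if isinstance(c, NavigableString) else "tag" for c in p.contents]
print(kinds)

# With three children, the first branch of Tag.string returns None.
print(p.string)
```

Counting p.contents before reading p.string is a quick way to tell whether None means "no text at all" or "more than one child".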

Tag.text

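In the previous section we confirmed that the Tag.string attribute does not cope well with a string that contains a <br> tag.
Tag.text, on the other hand, can extract the string even when it contains tags. Let's check this behavior.

The source code reads as follows.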

def get_text(self, separator=u"", strip=False,
             types=(NavigableString, CData)):
    """
    Get all child strings, concatenated using the given separator.
    """
    return separator.join([s for s in self._all_strings(
                strip, types=types)])
getText = get_text
text = property(get_text)

def _all_strings(self, strip=False, types=(NavigableString, CData)):
    """Yield all strings of certain classes, possibly stripping them.

    By default, yields only NavigableString and CData objects. So
    no comments, processing instructions, etc.
    """
    for descendant in self.descendants:
        if (
            (types is None and not isinstance(descendant, NavigableString))
            or
            (types is not None and type(descendant) not in types)):
            continue
        if strip:
            descendant = descendant.strip()
            if len(descendant) == 0:
                continue
        yield descendant

@property
def descendants(self):
    if not len(self.contents):
        return
    stopNode = self._last_descendant().next_element
    current = self.contents[0]
    while current is not stopNode:
        yield current
        current = current.next_element

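When Tag.text is accessed, get_text is executed, and you can see that the strings of every element under the Tag are concatenated in document order.

For example, Tag.text on <p>hoge<br>fuga</p> returns hogefuga: of the three elements ['hoge', <br/>, 'fuga'], the <br/> element is a tag rather than a string and contains no string of its own, so it contributes nothing to the result.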
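To make the traversal concrete, here is a minimal sketch that rebuilds the result by hand and then uses the separator parameter from the get_text signature shown above. It assumes the markup <p>hoge<br>fuga</p>; the isinstance filter is a simplification of the type check in _all_strings and only matches the default behavior for input without comments or similar nodes.

```python
from bs4 import BeautifulSoup, NavigableString

html = "<p>hoge<br>fuga</p>"
p = BeautifulSoup(html, "html.parser").find("p")

# Hand-rolled equivalent of Tag.text for this input:
# keep only the string nodes among the descendants and join them.
manual = "".join(d for d in p.descendants if isinstance(d, NavigableString))
print(manual)
print(manual == p.text)

# A separator makes the boundary left by <br> visible.
separated = p.get_text(separator="|")
print(separated)
```

Passing a separator such as "\n" is one way to keep the line break that <br> represented, rather than silently gluing the two strings together.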

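Conclusion

It turns out that Tag.string and Tag.text do quite different things.
Because Tag.text walks every descendant of the tag, Tag.string looks preferable where it is enough, but if the string may contain other HTML tags, Tag.text is the safer choice.

This article singled out the <br> tag, but be careful: other inline tags can appear inside a string as well.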
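As an illustration of the same pitfall with a different tag, here is a minimal sketch. The markup is hypothetical: <b> simply stands in for any inline tag and does not come from the original article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <b> stands in for any inline tag.
html = "<p>hoge<b>fuga</b>piyo</p>"
p = BeautifulSoup(html, "html.parser").find("p")

# Three children (string, <b>, string), so .string gives up.
via_string = p.string
print(via_string)

# .text also collects the string nested inside <b>.
via_text = p.text
print(via_text)
```

Unlike <br>, the inline tag here carries text of its own, and Tag.text includes it in the result.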

References

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Source code: https://code.launchpad.net/~leonardr/beautifulsoup/bs4
https://qiita.com/booleanoid/items/211820516eb7a2191b32
