Python: Test if an attribute is present in a tag in BeautifulSoup

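I want to get all the tags in a document and then process each one depending on whether certain attributes are present (or absent).

For example, for each tag, if the attribute for is present, do something; otherwise, if the attribute bar is present, do something else.

Here is what I am currently doing:

outputDoc = BeautifulSoup(''.join(output))
scriptTags = outputDoc.findAll('script', attrs={'for': True})

But this way I keep only the tags that have the for attribute, and I lose all the others (the tags without a for attribute).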

Related discussion

If I understand correctly, you just want all the script tags, and then want to check each of them for certain attributes?

scriptTags = outputDoc.findAll('script')
for script in scriptTags:
    if script.has_attr('some_attribute'):
        do_something()
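To make the branching from the question concrete, here is a minimal sketch of the same loop run against some made-up markup. The sample HTML and the classify helper are hypothetical and only there for illustration; for and bar are the attribute names used in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the original question.
html = """
<script for="window" event="onload">a()</script>
<script bar="1">b()</script>
<script src="c.js"></script>
"""

soup = BeautifulSoup(html, "html.parser")


def classify(tag):
    """Return a label for the branch a tag falls into (illustrative helper)."""
    if tag.has_attr("for"):
        return "has for"
    elif tag.has_attr("bar"):
        return "has bar"
    return "neither"


# Fetch every script tag first, then decide per tag, so nothing is filtered out.
labels = [classify(script) for script in soup.find_all("script")]
print(labels)
```

Because the filtering happens inside the loop rather than in findAll, the tags without a for attribute are still available for the other branches.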

Related discussion

For future reference: has_key is deprecated in BeautifulSoup 4. You now need to use has_attr instead.

scriptTags = outputDoc.findAll('script')
for script in scriptTags:
    if script.has_attr('some_attribute'):
        do_something()

Related discussion

You don't need a lambda to filter by attribute; just pass some_attribute=True to find or find_all.

script_tags = soup.find_all('script', some_attribute=True)

# or, for attribute names that are not valid Python identifiers

script_tags = soup.find_all('script', {"some-data-attribute": True})
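One wrinkle worth spelling out for the question's case: for is a reserved word in Python, so it cannot be written as a keyword argument. A small sketch, assuming made-up sample markup, of how the dictionary form (the one the question already uses) and the class_ keyword handle reserved names:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = (
    '<script for="window"></script>'
    '<script src="a.js"></script>'
    '<script class="x"></script>'
)
soup = BeautifulSoup(html, "html.parser")

# soup.find_all("script", for=True) would be a SyntaxError,
# so the attribute name goes into the attrs dictionary instead.
with_for = soup.find_all("script", attrs={"for": True})

# "class" is reserved too; Beautiful Soup accepts class_ as the keyword form.
with_class = soup.find_all("script", class_=True)

print(len(with_for), len(with_class))
```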

Here are some more examples using other approaches:

import bs4

# html is your markup string
soup = bs4.BeautifulSoup(html, "html.parser")

# Find all tags with a specific attribute.

tags = soup.find_all(src=True)
tags = soup.select("[src]")

# Find all meta tags with either a name or an http-equiv attribute.

soup.select("meta[name],meta[http-equiv]")

# Find any tags with either a name or a src attribute.

soup.select("[name], [src]")

# Find the first script with a src attribute.

tag = soup.find('script', src=True)
tag = soup.select_one("script[src]")

# Find all tags with a name attribute beginning with foo
# or any src beginning with /path.
# (Values that are not plain identifiers, such as /path, should be quoted.)
soup.select('[name^=foo], [src^="/path"]')

# Find all tags with a name attribute that contains foo
# or any src containing whatever.
soup.select("[name*=foo], [src*=whatever]")

# Find all tags with a name attribute that ends with foo
# or any src that ends with whatever.
soup.select("[name$=foo], [src$=whatever]")
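The selectors above all test for presence. For the "attribute is absent" half of the question, the CSS :not() pseudo-class can be combined with an attribute selector. This is a sketch that assumes a Beautiful Soup version whose select is backed by Soup Sieve (4.7 or later, as far as I know); the sample markup is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = """
<script for="window"></script>
<script bar="1"></script>
<script src="/path/app.js"></script>
"""
soup = BeautifulSoup(html, "html.parser")

# Script tags that do NOT carry a for attribute.
without_for = soup.select("script:not([for])")

# Script tags that have bar but not for (the question's second branch).
bar_without_for = soup.select("script[bar]:not([for])")

# Script tags with neither attribute.
neither = soup.select("script:not([for]):not([bar])")

print(len(without_for), len(bar_without_for), len(neither))
```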

You can also use regular expressions with find or find_all:

import re

# starts with
soup.find_all("script", src=re.compile("^whatever"))
# contains
soup.find_all("script", src=re.compile("whatever"))
# ends with
soup.find_all("script", src=re.compile("whatever$"))
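A runnable version of the three regex patterns, assuming some made-up script tags. The patterns are matched with search semantics, which is why the unanchored one behaves as "contains"; a tag with no src attribute at all is simply not matched.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = """
<script src="/static/app.js"></script>
<script src="https://cdn.example.com/lib.min.js"></script>
<script>inline()</script>
"""
soup = BeautifulSoup(html, "html.parser")

starts = soup.find_all("script", src=re.compile(r"^/static/"))
contains = soup.find_all("script", src=re.compile(r"cdn"))
ends = soup.find_all("script", src=re.compile(r"\.min\.js$"))

# Collect the matched src values per pattern for easy inspection.
srcs = [[tag["src"] for tag in group] for group in (starts, contains, ends)]
print(srcs)
```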

Related discussion

If you only need the tags that have a given attribute, you can use a lambda:

soup = bs4.BeautifulSoup(YOUR_CONTENT)

Tags with the attribute:

tags = soup.find_all(lambda tag: 'src' in tag.attrs)

or

tags = soup.find_all(lambda tag: tag.has_attr('src'))

A specific tag with the attribute:

tag = soup.find(lambda tag: tag.name == 'script' and 'src' in tag.attrs)

And so on...

Thought this might be useful.
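The same function-based filter can be negated to pick up the tags the question was losing. A sketch with made-up markup; script_without_for is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = (
    '<script for="window"></script>'
    '<script bar="1"></script>'
    '<img src="a.png">'
    '<script></script>'
)
soup = BeautifulSoup(html, "html.parser")


def script_without_for(tag):
    """Match script tags that lack a for attribute (illustrative helper)."""
    return tag.name == "script" and not tag.has_attr("for")


no_for = soup.find_all(script_without_for)

# Script tags with no attributes at all: tag.attrs is an empty dict.
bare = soup.find_all(lambda tag: tag.name == "script" and not tag.attrs)

print(len(no_for), len(bare))
```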

Related discussion

You can check whether a given attribute is present:

scriptTags = outputDoc.findAll('script', some_attribute=True)
for script in scriptTags:
    do_something()

Using the pprint module, you can inspect the contents of an element.

from pprint import pprint

pprint(vars(element))

Using this on a bs4 element will print something like the following:

{'attrs': {u'class': [u'pie-productname', u'size-3', u'name', u'global-name']},
 'can_be_empty_element': False,
 'contents': [u'\n\t\t\t\tNESNA\n\t'],
 'hidden': False,
 'name': u'span',
 'namespace': None,
 'next_element': u'\n\t\t\t\tNESNA\n\t',
 'next_sibling': u'\n',
 'parent': <h1 class="pie-compoundheader" itemprop="name">
<span class="pie-description">Bedside table</span>
<span class="pie-productname size-3 name global-name">
\t\t\t\tNESNA
\t</span>
</h1>,
 'parser_class': <class 'bs4.BeautifulSoup'>,
 'prefix': None,
 'previous_element': u'\n',
 'previous_sibling': u'\n'}

To access an attribute (say, the list of classes), use the following:

class_list = element.attrs.get('class', [])
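A short sketch of what the lookup returns in practice, using a made-up span. class is a multi-valued attribute, so it comes back as a list, while most other attributes come back as plain strings; indexing a missing attribute with square brackets raises KeyError, which is why get with a default is the safer form.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration only.
html = '<span class="pie-productname size-3" id="title">NESNA</span>'
span = BeautifulSoup(html, "html.parser").span

classes = span.get("class", [])     # list of class names
span_id = span.get("id")            # plain string
missing = span.get("data-missing")  # None, because the attribute is absent

try:
    span["data-missing"]
    raised = False
except KeyError:
    # Square-bracket access fails loudly for absent attributes.
    raised = True

print(classes, span_id, missing, raised)
```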

You can filter elements like this:

for script in soup.find_all('script'):
    if script.attrs.get('for'):
        # ... has a non-empty 'for' attribute
        pass
    elif "myClass" in script.attrs.get('class', []):
        # ... has the class "myClass"
        pass
    else:
        # ... do something else
        pass
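One caveat with the truthiness check above: attrs.get('for') is falsy when the attribute exists but is empty, so it tests "present and non-empty" rather than "present". A minimal sketch of the difference, assuming made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: empty for, non-empty for, and no for at all.
html = '<script for=""></script><script for="window"></script><script></script>'
soup = BeautifulSoup(html, "html.parser")
scripts = soup.find_all("script")

# Truthiness of the value: the empty attribute is treated like a missing one.
truthy = [bool(s.attrs.get("for")) for s in scripts]

# Pure presence test: the empty attribute still counts as present.
present = [s.has_attr("for") for s in scripts]

print(truthy)
print(present)
```

If an empty attribute should count as present, has_attr('for') or 'for' in script.attrs is probably the check you want.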