Test if an attribute is present in a tag in BeautifulSoup
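I want to get all the `<script>` tags in a document.
For example, for each `<script>` tag, I want to do one thing if a given attribute is present and something else if it is not.
Here is what I am currently doing: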
    outputDoc = BeautifulSoup(''.join(output))
    scriptTags = outputDoc.findAll('script', attrs = {'for' : True})
But this way I filter on the `for` attribute, so I only get back the tags that have it.
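If I understand correctly, you just want all the script tags, and then to check each one for certain attributes?

    scriptTags = outputDoc.findAll('script')
    for script in scriptTags:
        if script.has_attr('some_attribute'):
            do_something()  # placeholder for your own handling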
For future reference: `has_key` is deprecated in BeautifulSoup 4. Use `has_attr` instead:

    scriptTags = outputDoc.findAll('script')
    for script in scriptTags:
        if script.has_attr('some_attribute'):
            do_something()  # placeholder for your own handling
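As a rough sketch of that approach, the loop below fetches every script tag and then branches on which attribute is present. The sample markup and the `handled` list are made up for illustration; they are not from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, not taken from the question.
html = """
<script for="window" event="onload">a()</script>
<script src="/static/app.js"></script>
<script>b()</script>
"""

soup = BeautifulSoup(html, "html.parser")

handled = []
for script in soup.find_all("script"):
    if script.has_attr("for"):
        # The attribute exists, so indexing the tag is safe here.
        handled.append(("for", script["for"]))
    elif script.has_attr("src"):
        handled.append(("src", script["src"]))
    else:
        # No attribute of interest: fall back to the inline content.
        handled.append(("inline", script.string))

print(handled)
```

Because nothing is filtered out up front, tags without the attribute still reach the `else` branch.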
You don't need a lambda to filter by attribute; just pass `some_attribute=True` to `find` or `find_all`:

    script_tags = soup.find_all('script', some_attribute=True)

    # or
    script_tags = soup.find_all('script', {"some-data-attribute": True})
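The dictionary form matters when the attribute name cannot be written as a Python keyword argument. `for` is a reserved word and `data-role` contains a hyphen, so a sketch with both (on made-up sample markup) might look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = """
<script for="window" event="onload">a()</script>
<script data-role="init">b()</script>
<script>c()</script>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all('script', for=True) would be a SyntaxError, so use the attrs dict.
with_for = soup.find_all("script", attrs={"for": True})

# A hyphenated name is not a valid keyword argument either.
with_data_role = soup.find_all("script", attrs={"data-role": True})

print([tag.string for tag in with_for])
print([tag.string for tag in with_data_role])
```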
Here are more examples using other approaches:

    import bs4

    soup = bs4.BeautifulSoup(html, "html.parser")

    # Find all tags with a specific attribute.
    tags = soup.find_all(src=True)
    tags = soup.select("[src]")

    # Find all meta tags with either a name or an http-equiv attribute.
    soup.select("meta[name],meta[http-equiv]")

    # Find any tags with a name or src attribute.
    soup.select("[name], [src]")

    # Find the first script with a src attribute.
    tag = soup.find('script', src=True)
    tag = soup.select_one("script[src]")

    # Find all tags with a name attribute beginning with foo
    # or a src beginning with /path (quoted: it is not a plain identifier).
    soup.select('[name^=foo], [src^="/path"]')

    # Find all tags with a name attribute that contains foo
    # or a src containing whatever.
    soup.select("[name*=foo], [src*=whatever]")

    # Find all tags with a name attribute that ends with foo
    # or a src that ends with whatever.
    soup.select("[name$=foo], [src$=whatever]")
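To see what a few of these selectors return, here is a small self-contained sketch. The markup is invented for the example, and `select` assumes a BeautifulSoup 4 install that ships with the soupsieve selector backend.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = """
<meta name="viewport" content="width=device-width">
<meta http-equiv="refresh" content="30">
<meta charset="utf-8">
<script src="/path/app.js"></script>
<script>inline()</script>
"""

soup = BeautifulSoup(html, "html.parser")

# meta tags that carry either a name or an http-equiv attribute
metas = soup.select("meta[name], meta[http-equiv]")

# first script tag that has a src attribute
first_script = soup.select_one("script[src]")

# any tag whose src starts with /path (value quoted because of the slash)
prefixed = soup.select('[src^="/path"]')

print([m["content"] for m in metas])
print(first_script["src"])
print(len(prefixed))
```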
You can also use regular expressions with `find` or `find_all`:

    import re

    # starts with
    soup.find_all("script", src=re.compile("^whatever"))

    # contains
    soup.find_all("script", src=re.compile("whatever"))

    # ends with
    soup.find_all("script", src=re.compile("whatever$"))
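A minimal runnable version of the regular-expression filters, using invented URLs, could look like the following. Tags that lack the attribute entirely are simply not matched.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = """
<script src="https://cdn.example.com/lib.min.js"></script>
<script src="/static/app.js"></script>
<script>inline()</script>
"""

soup = BeautifulSoup(html, "html.parser")

# src starts with https://
absolute = soup.find_all("script", src=re.compile(r"^https://"))

# src ends with .min.js
minified = soup.find_all("script", src=re.compile(r"\.min\.js$"))

# src contains .js anywhere
any_js = soup.find_all("script", src=re.compile(r"\.js"))

print(len(absolute), len(minified), len(any_js))
```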
If you only need the tags that have a given attribute, you can use a lambda:

    soup = bs4.BeautifulSoup(YOUR_CONTENT, "html.parser")

- Tags with an attribute

    tags = soup.find_all(lambda tag: 'src' in tag.attrs)

  or

    tags = soup.find_all(lambda tag: tag.has_attr('src'))

- A specific tag with an attribute

    tag = soup.find(lambda tag: tag.name == 'script' and 'src' in tag.attrs)

- And so on...

Thought it might be useful.
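A function filter is handy when the condition combines several attributes. As a sketch, the named function below (a hypothetical helper, applied to invented markup) keeps script tags that have `src` but no `type`:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = """
<script src="/static/app.js"></script>
<script src="/static/mod.js" type="module"></script>
<script type="text/template">tpl</script>
<img src="/logo.png">
"""

soup = BeautifulSoup(html, "html.parser")


def external_script_without_type(tag):
    """Return True for <script> tags that have src but no type attribute."""
    return tag.name == "script" and tag.has_attr("src") and not tag.has_attr("type")


matches = soup.find_all(external_script_without_type)
print([tag["src"] for tag in matches])
```

A named function behaves the same as a lambda here; it is just easier to read and reuse once the condition grows.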
You can check whether certain attributes are present:

    scriptTags = outputDoc.findAll('script', some_attribute=True)
    for script in scriptTags:
        do_something()  # placeholder for your own handling
Using the pprint module, you can inspect the contents of an element.

    from pprint import pprint

    pprint(vars(element))
Running this on a bs4 element prints something like the following:

    {'attrs': {u'class': [u'pie-productname', u'size-3', u'name', u'global-name']},
     'can_be_empty_element': False,
     'contents': [u' \t\t\t\tNESNA \t'],
     'hidden': False,
     'name': u'span',
     'namespace': None,
     'next_element': u' \t\t\t\tNESNA \t',
     'next_sibling': u' ',
     'parent': <h1 class="pie-compoundheader" itemprop="name">
     <span class="pie-description">Bedside table</span>
     <span class="pie-productname size-3 name global-name"> \t\t\t\tNESNA \t</span>
     ,
     'parser_class': <class 'bs4.BeautifulSoup'>,
     'prefix': None,
     'previous_element': u' ',
     'previous_sibling': u' '}
To access an attribute (say, the class list), use the following:

    class_list = element.attrs.get('class', [])
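The default value is what makes this safe for elements that have no such attribute. A short sketch on made-up markup shows that `class` comes back as a list, while a missing attribute falls back to the default instead of raising `KeyError`:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration.
html = '<span class="pie-productname size-3">NESNA</span><span id="plain">x</span>'

soup = BeautifulSoup(html, "html.parser")
with_class, without_class = soup.find_all("span")

# class is a multi-valued attribute, so bs4 stores it as a list of names.
print(with_class.attrs.get("class", []))

# Missing attribute: the default is returned.
print(without_class.attrs.get("class", []))

# tag.get(...) is a shortcut for tag.attrs.get(...).
print(without_class.get("id"))
```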
You can filter elements like this:

    for script in soup.find_all('script'):
        if script.attrs.get('for'):
            pass  # has a non-empty 'for' attribute
        elif "myClass" in script.attrs.get('class', []):
            pass  # has class "myClass"
        else:
            pass  # do something else