how to remove an element in lxml
我需要使用python的lxml根据属性的内容完全删除元素。 例:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | import lxml.etree as et xml=""" <groceries> <fruit state="rotten">apple</fruit> <fruit state="fresh">pear</fruit> <fruit state="fresh">starfruit</fruit> <fruit state="rotten">mango</fruit> <fruit state="fresh">peach</fruit> </groceries> """ tree=et.fromstring(xml) for bad in tree.xpath("//fruit[@state=\'rotten\']"): #remove this element from the tree print et.tostring(tree, pretty_print=True) |
我想打印:
1 2 3 4 5 | <groceries> <fruit state="fresh">pear</fruit> <fruit state="fresh">starfruit</fruit> <fruit state="fresh">peach</fruit> </groceries> |
有没有一种方法可以执行此操作而无需存储临时变量并手动将其打印出来,如下所示:
1 2 3 4 5 6 | newxml="<groceries>\ " for elt in tree.xpath('//fruit[@state=\'fresh\']'): newxml+=et.tostring(elt) newxml+="</groceries>" |
使用xmlElement的
1 2 3 4 5 6 | tree=et.fromstring(xml) for bad in tree.xpath("//fruit[@state=\'rotten\']"): bad.getparent().remove(bad) # here I grab the parent of the element to call the remove directly on it print et.tostring(tree, pretty_print=True, xml_declaration=True) |
如果我必须与@Acorn版本进行比较,即使要删除的元素不是直接位于xml的根节点下,我的也可以工作。
您正在寻找
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | import lxml.etree as et xml=""" <groceries> <fruit state="rotten">apple</fruit> <fruit state="fresh">pear</fruit> <punnet> <fruit state="rotten">strawberry</fruit> <fruit state="fresh">blueberry</fruit> </punnet> <fruit state="fresh">starfruit</fruit> <fruit state="rotten">mango</fruit> <fruit state="fresh">peach</fruit> </groceries> """ tree=et.fromstring(xml) for bad in tree.xpath("//fruit[@state='rotten']"): bad.getparent().remove(bad) print et.tostring(tree, pretty_print=True) |
结果:
1 2 3 4 5 | <groceries> <fruit state="fresh">pear</fruit> <fruit state="fresh">starfruit</fruit> <fruit state="fresh">peach</fruit> </groceries> |
我遇到一种情况:
1 2 3 | some code text here |
按照这里的答案,我发现
但是我仍然不知道这是否可以使用xpath过滤器进行标记。只是为了告知。
这是文档:
strip_elements(tree_or_element, *tag_names, with_tail=True)
Delete all elements with the provided tag names from a tree or
subtree. This will remove the elements and their entire subtree,
including all their attributes, text content and descendants. It
will also remove the tail text of the element unless you
explicitly set thewith_tail keyword argument option to False.Tag names can contain wildcards as in
_Element.iter .Note that this will not delete the element (or ElementTree root
element) that you passed even if it matches. It will only treat
its descendants. If you want to include the root element, check
its tag name directly before even calling this function.Example usage::
1
2
3
4
5
6 strip_elements(some_element,
'simpletagname', # non-namespaced tag
'{http://some/ns}tagname', # namespaced tag
'{http://some/other/ns}*' # any tag from a namespace
lxml.etree.Comment # comments
)
如前所述,可以使用
1 2 | for bad in tree.xpath("//fruit[@state=\'rotten\']"): bad.getparent().remove(bad) |
但是它会删除包含
1 | <fruit state="rotten">avocado</fruit> Hello! |
成为
1 |
我想这是你不总是想要的:)
我创建了辅助函数,以仅删除元素并保留其尾部:
1 2 3 4 5 6 7 8 9 10 11 12 | def remove_element(el): parent = el.getparent() if el.tail.strip(): prev = el.getprevious() if prev: prev.tail = (prev.tail or '') + el.tail else: parent.text = (parent.text or '') + el.tail parent.remove(el) for bad in tree.xpath("//fruit[@state=\'rotten\']"): remove_element(bad) |
这样,它将保留尾部文本:
1 | Hello! |
您也可以使用lxml中的html来解决该问题:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | from lxml import html xml=""" <groceries> <fruit state="rotten">apple</fruit> <fruit state="fresh">pear</fruit> <fruit state="fresh">starfruit</fruit> <fruit state="rotten">mango</fruit> <fruit state="fresh">peach</fruit> </groceries> """ tree = html.fromstring(xml) print("//BEFORE") print(html.tostring(tree, pretty_print=True).decode("utf-8")) for i in tree.xpath("//fruit[@state='rotten']"): i.drop_tree() print("//AFTER") print(html.tostring(tree, pretty_print=True).decode("utf-8")) |
它应该输出以下内容:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | //BEFORE <groceries> <fruit state="rotten">apple</fruit> <fruit state="fresh">pear</fruit> <fruit state="fresh">starfruit</fruit> <fruit state="rotten">mango</fruit> <fruit state="fresh">peach</fruit> </groceries> //AFTER <groceries> <fruit state="fresh">pear</fruit> <fruit state="fresh">starfruit</fruit> <fruit state="fresh">peach</fruit> </groceries> |