Python: how to remove an element in lxml

I need to completely remove elements based on the contents of an attribute, using Python's lxml. Example:

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""


tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    pass  # remove this element from the tree

# tostring() returns bytes, so decode before printing
print(et.tostring(tree, pretty_print=True).decode())

I would like this to print:

<groceries>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Is there a way to do this without storing a temporary variable and building the output by hand, like this:

newxml = "<groceries>\n"

for elt in tree.xpath("//fruit[@state='fresh']"):
    # encoding="unicode" returns str instead of bytes
    newxml += et.tostring(elt, encoding="unicode")

newxml += "</groceries>"

Use the remove method of an xmlElement:

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
  bad.getparent().remove(bad)     # here I grab the parent of the element to call the remove directly on it

print(et.tostring(tree, pretty_print=True, xml_declaration=True).decode())

Compared with @Acorn's version, mine works even if the elements to remove are not directly under the root node of the XML.
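To illustrate the point about nesting, here is a minimal sketch. The `<punnet>` wrapper and the variable names are made up for this example. Calling remove() on the root fails for an element that is not one of its direct children, while going through getparent() works at any depth.

```python
import lxml.etree as et

root = et.fromstring(
    "<groceries><punnet>"
    "<fruit state='rotten'>strawberry</fruit>"
    "</punnet></groceries>"
)
bad = root.xpath("//fruit[@state='rotten']")[0]

# remove() only accepts direct children of the element it is called on
try:
    root.remove(bad)
    removed_via_root = True
except ValueError:
    removed_via_root = False

# asking the element for its own parent works at any depth
bad.getparent().remove(bad)

result = et.tostring(root, encoding="unicode")
print(removed_via_root)
print(result)
```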


You're looking for the remove function. Call remove on the parent of the element you want to delete and pass it that child element.

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="rotten">strawberry</fruit>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""


tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

print(et.tostring(tree, pretty_print=True).decode())

Result:

<groceries>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>


I ran into this situation:

<div>
    <script>
        some code
    </script>
    text here
</div>

div.remove(script) also removes the "text here" part, which I did not intend.

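Following the answer here, I found that etree.strip_elements is a better solution for me: the with_tail=(bool) parameter controls whether the text that follows the element is removed too.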

I still don't know whether it can select elements with an XPath filter rather than by tag name. Just posting this for information.

Here is the documentation:

strip_elements(tree_or_element, *tag_names, with_tail=True)

Delete all elements with the provided tag names from a tree or
subtree. This will remove the elements and their entire subtree,
including all their attributes, text content and descendants. It
will also remove the tail text of the element unless you
explicitly set the with_tail keyword argument option to False.

Tag names can contain wildcards as in _Element.iter.

Note that this will not delete the element (or ElementTree root
element) that you passed even if it matches. It will only treat
its descendants. If you want to include the root element, check
its tag name directly before even calling this function.

Example usage::

   strip_elements(some_element,
       'simpletagname',             # non-namespaced tag
       '{http://some/ns}tagname',   # namespaced tag
       '{http://some/other/ns}*',   # any tag from a namespace
       lxml.etree.Comment           # comments
       )
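As a rough illustration of the with_tail option described above, the following sketch applies strip_elements to the `<div>`/`<script>` situation. The input string is a condensed, made-up version of that markup. Note that strip_elements selects by tag name only.

```python
import lxml.etree as et

div = et.fromstring("<div><script>some code</script>text here</div>")

# with_tail=False deletes the <script> element but keeps the text after it
et.strip_elements(div, "script", with_tail=False)

result = et.tostring(div, encoding="unicode")
print(result)
```

With the default with_tail=True, the trailing "text here" would be deleted along with the element.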

As already mentioned, you can use the remove() method to delete (sub)elements from the tree:

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
  bad.getparent().remove(bad)

However, it removes the element together with its tail, which is a problem if you are processing mixed-content documents such as HTML:

<fruit state="rotten">avocado</fruit> Hello!

becomes

 

I suppose that is not always what you want :)
I wrote a helper function that removes only the element and keeps its tail:

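def remove_element(el):
    parent = el.getparent()
    # el.tail is None when no text follows the element
    if el.tail and el.tail.strip():
        prev = el.getprevious()
        # compare with None: an element without children is falsy
        if prev is not None:
            prev.tail = (prev.tail or '') + el.tail
        else:
            parent.text = (parent.text or '') + el.tail
    parent.remove(el)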

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
    remove_element(bad)

This way, the tail text is kept:

 Hello!
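The reason remove() takes the trailing text with it is that lxml stores that text on the element itself, in its tail attribute. A small sketch of this, assuming a wrapper `<p>` element that is not part of the original example:

```python
import lxml.etree as et

root = et.fromstring('<p><fruit state="rotten">avocado</fruit> Hello!</p>')
fruit = root[0]

# the text after the closing tag belongs to the element, not to the parent
tail = fruit.tail
print(repr(tail))

# so a plain remove() discards it together with the element
root.remove(fruit)
after_remove = et.tostring(root, encoding="unicode")
print(after_remove)
```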


You can also use the html module from lxml to solve this:

from lxml import html

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""


tree = html.fromstring(xml)

print("//BEFORE")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

for i in tree.xpath("//fruit[@state='rotten']"):
    i.drop_tree()

print("//AFTER")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

It should output the following:

//BEFORE
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>


//AFTER
<groceries>

  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>

  <fruit state="fresh">peach</fruit>
</groceries>
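The blank lines in the output above are the whitespace tails of the dropped elements: drop_tree() keeps an element's tail by joining it to the previous sibling or the parent. A minimal sketch with visible tail text, using a made-up `<div>` wrapper:

```python
from lxml import html

div = html.fromstring('<div><fruit state="rotten">avocado</fruit> Hello!</div>')

for bad in div.xpath("//fruit[@state='rotten']"):
    # drop_tree() removes the element and its content but keeps the tail
    bad.drop_tree()

result = html.tostring(div, encoding="unicode")
print(result)
```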