Python: NLTK linguistic tree traversal and noun phrase (NP) extraction

I've created a chunker based on a custom classifier, DigDug_classifier, which chunked the following sentence:

sentence = "There is high signal intensity evident within the disc at T1."

It produced these chunks:

(S
  (NP There/EX)
  (VP is/VBZ)
  (NP high/JJ signal/JJ intensity/NN evident/NN)
  (PP within/IN)
  (NP the/DT disc/NN)
  (PP at/IN)
  (NP T1/NNP)
  ./.)

From the tree above, I need to build a list containing only the NPs, like this:

NP = ['There', 'high signal intensity evident', 'the disc', 'T1']

I wrote the following code:

output = []
for subtree in DigDug_classifier.parse(pos_tags):
    try:
        if subtree.label() == 'NP': output.append(subtree)
    except AttributeError:
        output.append(subtree)
print(output)

But it gives me this output:

[Tree('NP', [('There', 'EX')]), Tree('NP', [('high', 'JJ'), ('signal', 'JJ'), ('intensity', 'NN'), ('evident', 'NN')]), Tree('NP', [('the', 'DT'), ('disc', 'NN')]), Tree('NP', [('T1', 'NNP')]), ('.', '.')]

What should I do to get the desired output?


First, see "How to traverse an NLTK Tree object?"
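As a rough sketch of what such a traversal looks like (this is not taken from the linked question): iterating over a `Tree` yields its direct children, which are either nested `Tree` objects or leaves, while `Tree.subtrees()` walks every subtree at any depth. The toy tree and the helper name `labels_in` are illustrative assumptions only.

```python
from nltk import Tree

# Illustrative toy tree (not the asker's data); leaves are plain strings here.
toy = Tree.fromstring("(S (NP the/DT disc/NN) (VP is/VBZ (NP T1/NNP)))")


def labels_in(tree):
    """Return the label of every subtree, root included, in pre-order."""
    return [sub.label() for sub in tree.subtrees()]


# Direct children only: skip leaves, keep the labels of nested Trees.
top_level = [child.label() for child in toy if isinstance(child, Tree)]

# Every NP at any depth, including the one nested inside the VP.
all_nps = [
    " ".join(sub.leaves())
    for sub in toy.subtrees(filter=lambda t: t.label() == "NP")
]

print(labels_in(toy))
print(top_level)
print(all_nps)
```

For a flat chunk tree like the one in the question, iterating over the direct children is enough; `subtrees()` matters once chunks can nest.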

Specific to your NP extraction problem:

>>> from nltk import Tree
>>> parse_tree = Tree.fromstring("""(S
...   (NP There/EX)
...   (VP is/VBZ)
...   (NP high/JJ signal/JJ intensity/NN evident/NN)
...   (PP within/IN)
...   (NP the/DT disc/NN)
...   (PP at/IN)
...   (NP T1/NNP)
...   ./.)""")

# Iterating through the parse tree and
# 1. check that the subtree is a Tree type and
# 2. make sure the subtree label is NP
>>> [subtree for subtree in parse_tree if isinstance(subtree, Tree) and subtree.label() == "NP"]
[Tree('NP', ['There/EX']), Tree('NP', ['high/JJ', 'signal/JJ', 'intensity/NN', 'evident/NN']), Tree('NP', ['the/DT', 'disc/NN']), Tree('NP', ['T1/NNP'])]

# To access the item inside the Tree object,
# use the .leaves() function
>>> [subtree.leaves() for subtree in parse_tree if isinstance(subtree, Tree) and subtree.label() == "NP"]
[['There/EX'], ['high/JJ', 'signal/JJ', 'intensity/NN', 'evident/NN'], ['the/DT', 'disc/NN'], ['T1/NNP']]

# To get the string representation of the leaves,
# use " ".join()
>>> [' '.join(subtree.leaves()) for subtree in parse_tree if isinstance(subtree, Tree) and subtree.label() == "NP"]
['There/EX', 'high/JJ signal/JJ intensity/NN evident/NN', 'the/DT disc/NN', 'T1/NNP']


# To get just the words,
# iterate through the leaves, split each string on "/" and
# keep the part before the "/"
>>> [" ".join([leaf.split('/')[0] for leaf in subtree.leaves()]) for subtree in parse_tree if isinstance(subtree, Tree) and subtree.label() == "NP"]
['There', 'high signal intensity evident', 'the disc', 'T1']
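Note that the tree built with `Tree.fromstring` above has string leaves such as `'There/EX'`, whereas the output printed in the question shows leaves that are `(word, tag)` tuples. In the question's code, the `except AttributeError` branch is what appends the `('.', '.')` leaf, since a tuple has no `.label()`. A minimal sketch for the tuple case follows, assuming the chunker returns a tree shaped like that printed output. The tree is built by hand as a stand-in for `DigDug_classifier.parse(pos_tags)`, and `extract_noun_phrases` is a hypothetical helper name.

```python
from nltk import Tree

# Hand-built stand-in for DigDug_classifier.parse(pos_tags) (assumption):
# leaves are (word, tag) tuples, as in the output shown in the question.
chunked = Tree("S", [
    Tree("NP", [("There", "EX")]),
    Tree("VP", [("is", "VBZ")]),
    Tree("NP", [("high", "JJ"), ("signal", "JJ"),
                ("intensity", "NN"), ("evident", "NN")]),
    Tree("PP", [("within", "IN")]),
    Tree("NP", [("the", "DT"), ("disc", "NN")]),
    Tree("PP", [("at", "IN")]),
    Tree("NP", [("T1", "NNP")]),
    (".", "."),  # bare leaf outside any chunk; it has no .label()
])


def extract_noun_phrases(tree):
    """Join the words of each top-level NP chunk into one string."""
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree
        # Skip bare leaves such as ('.', '.') and non-NP chunks.
        if isinstance(subtree, Tree) and subtree.label() == "NP"
    ]


noun_phrases = extract_noun_phrases(chunked)
print(noun_phrases)
```

With the real chunker, the call would presumably be `extract_noun_phrases(DigDug_classifier.parse(pos_tags))`.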