How to get all noun phrases in spaCy

Question

I am new to spaCy and would like to extract *all* the noun phrases from a sentence. How can I do that? I have the following code:

    import spacy

    nlp = spacy.load("en")

    file = open("E:/test.txt", "r")
    doc = nlp(file.read())
    for np in doc.noun_chunks:
        print(np.text)

However, this returns only base noun phrases, that is, phrases that do not contain any other NP. For the following phrase, I get this result:

Phrase: We try to explicitly describe the geometry of the edges of the images.

Result: We, the geometry, the edges, the images.

Expected result: We, the geometry, the edges, the images, the geometry of the edges of the images, the edges of the images.

How can I get all noun phrases, including nested phrases?

Adnan S · Accepted Answer

Please see the commented code below to recursively combine the nouns. The code is inspired by the spaCy docs.

    # Written for spaCy 2.x: the "en" shortcut and Span.merge() were removed in spaCy 3
    import spacy

    nlp = spacy.load("en")

    doc = nlp("We try to explicitly describe the geometry of the edges of the images.")

    for np in doc.noun_chunks:  # use np instead of np.text
        print(np)

    print()

    # code to recursively combine nouns
    # 'We' is actually a pronoun but included in your question
    # hence the token.pos_ == "PRON" part in the last if statement
    # suggest you extract PRON separately like the noun-chunks above

    index = 0
    nounIndices = []
    for token in doc:
        # print(token.text, token.pos_, token.dep_, token.head.text)
        if token.pos_ == 'NOUN':
            nounIndices.append(index)
        index = index + 1

    print(nounIndices)
    for idxValue in nounIndices:
        # re-parse so each merge starts from an unmodified doc
        doc = nlp("We try to explicitly describe the geometry of the edges of the images.")
        span = doc[doc[idxValue].left_edge.i : doc[idxValue].right_edge.i + 1]
        span.merge()

        for token in doc:
            if token.dep_ == 'dobj' or token.dep_ == 'pobj' or token.pos_ == "PRON":
                print(token.text)

Daniel Mahler · Answer

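For every noun chunk you can also get the subtree beneath it. spaCy provides two ways to access it: the left_edge and right_edge attributes, and the subtree attribute, which returns a Token iterator rather than a span. Combining noun_chunks with their subtrees leads to some duplication, which can be removed later.

Here is an example using the left_edge and right_edge attributes: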

    {np.text
     for nc in doc.noun_chunks
     for np in [
         nc,
         doc[nc.root.left_edge.i : nc.root.right_edge.i + 1]]}

    ==>

    {'We',
     'the edges',
     'the edges of the images',
     'the geometry',
     'the geometry of the edges of the images',
     'the images'}
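
To make it clearer what the left and right edges contribute, here is a minimal pure-Python sketch that does not call spaCy at all. It assumes a hand-annotated dependency parse of the example sentence (the `heads` and `pos` lists are my own annotation for illustration, not actual spaCy output), and the helper names `subtree_edges` and `noun_phrases` are hypothetical. For each noun or pronoun it takes the span from the leftmost token of its subtree up to the noun itself as a rough stand-in for the base chunk, and the span up to the rightmost token of its subtree as the nested phrase.

```python
# Hand-annotated parse of the example sentence.
# ASSUMPTION: these annotations are illustrative, not produced by spaCy.
words = ["We", "try", "to", "explicitly", "describe", "the", "geometry",
         "of", "the", "edges", "of", "the", "images", "."]
# heads[i] is the index of token i's syntactic head; the root points to itself.
heads = [1, 1, 4, 4, 1, 6, 4, 6, 9, 7, 9, 12, 10, 1]
pos = ["PRON", "VERB", "PART", "ADV", "VERB", "DET", "NOUN",
       "ADP", "DET", "NOUN", "ADP", "DET", "NOUN", "PUNCT"]


def subtree_edges(heads):
    """Return (left, right): the leftmost and rightmost subtree index per token."""
    left = list(range(len(heads)))
    right = list(range(len(heads)))
    for i in range(len(heads)):
        node = i
        # Walk up to the root, widening each ancestor's span to include token i.
        while heads[node] != node:
            node = heads[node]
            left[node] = min(left[node], i)
            right[node] = max(right[node], i)
    return left, right


def noun_phrases(words, heads, pos):
    """Collect a base phrase and a full subtree phrase for each noun or pronoun."""
    left, right = subtree_edges(heads)
    phrases = set()  # a set removes the duplicates mentioned above
    for i, tag in enumerate(pos):
        if tag in ("NOUN", "PRON"):
            phrases.add(" ".join(words[left[i]:i + 1]))         # up to the noun itself
            phrases.add(" ".join(words[left[i]:right[i] + 1]))  # including nested modifiers
    return phrases


result = noun_phrases(words, heads, pos)
for phrase in sorted(result):
    print(phrase)
```

With this particular hand-written parse, the set contains the same six phrases as the expected result in the question. On real text the outcome depends entirely on the parse the model produces, so treat this only as an illustration of the edge arithmetic.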