NLTK Named Entity Recognition to a Python list

Question

I'm using NLTK's ne_chunk to extract named entities from a text:

    my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

    nltk.ne_chunk(my_sent, binary=True)

But I can't figure out how to save these entities to a list. For example:

    print(Entity_list)
    ('WASHINGTON', 'New York', 'Loretta', 'Brooklyn', 'African')

Thanks.

alvas · Accepted Answer

nltk.ne_chunk returns a nested nltk.tree.Tree object, so you have to traverse the Tree object to get to the NEs.

Take a look at "Named Entity Recognition with Regular Expression: NLTK".

    >>> from nltk import ne_chunk, pos_tag, word_tokenize
    >>> from nltk.tree import Tree
    >>>
    >>> def get_continuous_chunks(text):
    ...     chunked = ne_chunk(pos_tag(word_tokenize(text)))
    ...     continuous_chunk = []
    ...     current_chunk = []
    ...     for i in chunked:
    ...         if isinstance(i, Tree):
    ...             # Adjacent NE subtrees are merged into one entity.
    ...             current_chunk.append(" ".join(token for token, pos in i.leaves()))
    ...         elif current_chunk:
    ...             # A non-NE token closes the entity collected so far.
    ...             named_entity = " ".join(current_chunk)
    ...             if named_entity not in continuous_chunk:
    ...                 continuous_chunk.append(named_entity)
    ...             current_chunk = []
    ...     # Flush an entity that ends the text.
    ...     if current_chunk:
    ...         named_entity = " ".join(current_chunk)
    ...         if named_entity not in continuous_chunk:
    ...             continuous_chunk.append(named_entity)
    ...     return continuous_chunk
    ...
    >>> my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
    >>> get_continuous_chunks(my_sent)
    ['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']

imanzabet · Answer

You can also extract the label of each named entity in the text using this code:

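    import nltk

    # `sentence` holds the input text, e.g. my_sent from the question.
    for sent in nltk.sent_tokenize(sentence):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
            # Only NE subtrees have a label; plain tokens are (word, tag) tuples.
            if hasattr(chunk, 'label'):
                print(chunk.label(), ' '.join(c[0] for c in chunk))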

Output:

    GPE WASHINGTON
    GPE New York
    PERSON Loretta E. Lynch
    GPE Brooklyn

Washington, New York and Brooklyn are GPE, which stands for geo-political entity, and Loretta E. Lynch is a PERSON.
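If the labels matter, one option is to collect the entity strings per label rather than print them. The following is only a sketch: `entities_by_label` is a hypothetical helper (not part of NLTK), and the tree is built by hand to stand in for `ne_chunk` output so that no tagger or chunker models are needed to run it.

```python
from collections import defaultdict

from nltk.tree import Tree


def entities_by_label(chunked):
    """Map each NE label to the entity strings found under it."""
    grouped = defaultdict(list)
    for node in chunked:
        # NE chunks are subtrees; unchunked tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            grouped[node.label()].append(text)
    return dict(grouped)


# Hand-built stand-in for the tree that ne_chunk would return (assumption).
sample = Tree('S', [
    Tree('GPE', [('WASHINGTON', 'NNP')]),
    ('--', ':'),
    Tree('GPE', [('New', 'NNP'), ('York', 'NNP')]),
    ('police', 'NN'),
    Tree('PERSON', [('Loretta', 'NNP'), ('E.', 'NNP'), ('Lynch', 'NNP')]),
])

result = entities_by_label(sample)
print(result)
```

With real input, you would pass the result of `nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))` instead of `sample`.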

b3000 · Answer

Since you get a tree as the return value, I guess you want to pick out the subtrees that are labeled NE.

Here is a simple example that gathers all of them in a list:

    import nltk

    my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."

    # POS tagging before chunking!
    parse_tree = nltk.ne_chunk(nltk.tag.pos_tag(my_sent.split()), binary=True)

    named_entities = []

    for t in parse_tree.subtrees():
        if t.label() == 'NE':
            named_entities.append(t)
            # named_entities.append(list(t))  # if you want to save a list of tagged words instead of a tree

    print(named_entities)

This gives:

[Tree('NE', [('WASHINGTON', 'NNP')]), Tree('NE', [('New', 'NNP'), ('York', 'NNP')])]

or as a list of lists:

[[('WASHINGTON', 'NNP')], [('New', 'NNP'), ('York', 'NNP')]]
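If plain strings are what you are after, as in the question, the tagged words in each chunk can be joined back together. A minimal sketch, with the list of lists above hard-coded so it runs without NLTK; the names `chunks` and `entity_names` are only illustrative.

```python
# The list-of-lists form shown above, hard-coded for illustration.
chunks = [[('WASHINGTON', 'NNP')], [('New', 'NNP'), ('York', 'NNP')]]

# Drop the POS tag and join the words of each chunk into one string.
entity_names = [" ".join(word for word, _tag in chunk) for chunk in chunks]
print(entity_names)
```

The same comprehension works on the Tree objects themselves, since iterating over an NE subtree yields its (word, tag) pairs.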

See also: How to navigate a nltk.tree.Tree?

elwhite · Answer

Use tree2conlltags from nltk.chunk. Also, ne_chunk needs POS tagging, which tags word tokens (so it needs word_tokenize).

    from nltk import word_tokenize, pos_tag, ne_chunk
    from nltk.chunk import tree2conlltags

    sentence = "Mark and John are working at Google."

    print(tree2conlltags(ne_chunk(pos_tag(word_tokenize(sentence)))))
    """[('Mark', 'NNP', 'B-PERSON'),
        ('and', 'CC', 'O'), ('John', 'NNP', 'B-PERSON'),
        ('are', 'VBP', 'O'), ('working', 'VBG', 'O'),
        ('at', 'IN', 'O'), ('Google', 'NNP', 'B-ORGANIZATION'),
        ('.', '.', 'O')] """

This gives you a list of tuples: [(token, pos_tag, name_entity_tag)]. If this list is not exactly what you want, it is certainly easier to parse the list you want out of this list than out of an nltk tree.
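As a rough illustration of parsing that list, the sketch below groups consecutive B-/I- tags into entity strings. The function name `group_iob_entities` is hypothetical (it is not part of NLTK), and the sample input is the tagged output shown above, hard-coded so the snippet runs without any NLTK models.

```python
def group_iob_entities(conll_tags):
    """Collect (label, text) pairs from (token, pos, iob_tag) triples."""
    entities = []
    tokens, label = [], None
    for token, _pos, iob in conll_tags:
        # An "I-" tag with the same label continues the current entity.
        if iob.startswith("I-") and tokens and iob[2:] == label:
            tokens.append(token)
            continue
        # Any other tag closes the entity in progress.
        if tokens:
            entities.append((label, " ".join(tokens)))
            tokens, label = [], None
        if iob != "O":
            # A "B-" tag (or a stray "I-" tag) opens a new entity.
            tokens, label = [token], iob[2:]
    if tokens:  # flush an entity that ends the sentence
        entities.append((label, " ".join(tokens)))
    return entities


tagged = [('Mark', 'NNP', 'B-PERSON'), ('and', 'CC', 'O'),
          ('John', 'NNP', 'B-PERSON'), ('are', 'VBP', 'O'),
          ('working', 'VBG', 'O'), ('at', 'IN', 'O'),
          ('Google', 'NNP', 'B-ORGANIZATION'), ('.', '.', 'O')]

entities = group_iob_entities(tagged)
print(entities)
```

Keeping the label next to each string makes it easy to filter afterwards, for example to keep only the PERSON entries.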

The code and details are from this link; check it out for more information.

You can also continue by extracting just the words, with the following function:

    import nltk
    from nltk.chunk import tree2conlltags

    def wordextractor(tuple1):
        # Bring the tuples back to lists to work with them.
        # Note: `pos` holds the NE (IOB) tag here, not the part-of-speech tag.
        words, tags, pos = zip(*tuple1)
        words = list(words)
        pos = list(pos)
        c = list()
        i = 0
        while i <= len(tuple1) - 1:
            # Get the words tagged B-PERSON or I-PERSON.
            if pos[i] == 'B-PERSON':
                c = c + [words[i]]
            elif pos[i] == 'I-PERSON':
                c = c + [words[i]]
            i = i + 1

        return c

    print(wordextractor(tree2conlltags(nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence))))))

Edit: added the output docstring. Edit: added output for B-PERSON only.

alexis · Answer

A Tree is a list. Chunks are subtrees, and non-chunked words are plain (word, tag) tuples. So let's go down the list, extract the words from each chunk, and join them.

    >>> chunked = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(my_sent)))
    >>>
    >>> [" ".join(w for w, t in elt) for elt in chunked if isinstance(elt, nltk.Tree)]
    ['WASHINGTON', 'New York', 'Loretta E. Lynch', 'Brooklyn']

Nic Scozzaro · Answer

You could also consider using spaCy:

    import spacy

    # The original answer used spacy.load('en'); spaCy 3 removed that shortcut,
    # so an installed model is loaded by its full name here.
    nlp = spacy.load('en_core_web_sm')
    doc = nlp('WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement.')

    print([ent for ent in doc.ents])
    # Output reported in the original answer (may vary by model):
    # [WASHINGTON, New York, the 1990s, Loretta E. Lynch, Brooklyn, African-Americans]