Sentence segmentation using spaCy

Question

I am new to spaCy and NLP. I am facing the following issue while doing sentence segmentation with spaCy.

The text I am trying to split into sentences contains a numbered list, with a space between the number and the actual text, as shown below.

    import spacy

    nlp = spacy.load('en_core_web_sm')
    text = "This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!"
    text_sentences = nlp(text)
    for sentence in text_sentences.sents:
        print(sentence.text)

The output (1., 2. and 3. are treated as separate lines) is as follows:

    This is first sentence.
    Next is numbered list.
    1.
    Hello World!
    2.
    Hello World2!
    3.
    Hello World!

However, if there is no space between the number and the actual text, the sentence segmentation works fine, as shown below:

    import spacy

    nlp = spacy.load('en_core_web_sm')
    text = "This is first sentence.\nNext is numbered list.\n1.Hello World!\n2.Hello World2!\n3.Hello World!"
    text_sentences = nlp(text)
    for sentence in text_sentences.sents:
        print(sentence.text)

The output (which is what I want) is:

    This is first sentence.
    Next is numbered list.
    1.Hello World!
    2.Hello World2!
    3.Hello World!

Please suggest whether the sentence detector can be customized to handle this.

gdaras · Accepted Answer

When you use one of spaCy's pretrained models, sentences are split based on the training data that was provided during the model's training procedure.

Of course, there are cases like yours where you may want to use custom sentence segmentation logic. This is possible by adding a component to the spaCy pipeline.

For your case, you can add a rule that prevents a sentence split whenever there is a "{number}." pattern.

A workaround for your problem:

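    import re
    import spacy

    # Written against the spaCy 2.x API: the 'en' shortcut and passing a
    # function directly to add_pipe are not supported in spaCy 3.x.
    nlp = spacy.load('en')
    # Matches a token that consists of exactly one digit.
    boundary = re.compile('^[0-9]$')

    def custom_seg(doc):
        prev = doc[0].text
        length = len(doc)
        for index, token in enumerate(doc):
            # A "." right after a digit token is a list marker, so the
            # token that follows it must not start a new sentence.
            if (token.text == '.' and boundary.match(prev) and index != (length - 1)):
                doc[index+1].sent_start = False
            prev = token.text
        return doc

    # Run the rule before the parser so the parser respects the preset boundaries.
    nlp.add_pipe(custom_seg, before='parser')
    text = u'This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!'
    doc = nlp(text)
    for sentence in doc.sents:
        print(sentence.text)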
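
The workaround above targets spaCy 2.x. In spaCy 3.x a custom component has to be registered under a name, and that name is what gets passed to add_pipe. The following is only a rough sketch of the same rule for 3.x, not the answer's original code: the component name numbered_list_segmenter and the helper split_sentences are made up for illustration, and it assumes the en_core_web_sm model is installed.

```python
import re

import spacy
from spacy.language import Language

# Matches a token that is exactly one digit, same pattern as the workaround.
LIST_NUMBER = re.compile(r"^[0-9]$")


@Language.component("numbered_list_segmenter")  # hypothetical component name
def numbered_list_segmenter(doc):
    """Mark the token after a "<digit>." pair as not starting a sentence."""
    # Start at 1 so there is a previous token; stop before the last token
    # so that doc[index + 1] always exists.
    for index in range(1, len(doc) - 1):
        if doc[index].text == "." and LIST_NUMBER.match(doc[index - 1].text):
            doc[index + 1].is_sent_start = False
    return doc


def split_sentences(text, model="en_core_web_sm"):
    """Hypothetical helper: load a pipeline, add the rule before the parser,
    and return the text of each detected sentence."""
    nlp = spacy.load(model)
    nlp.add_pipe("numbered_list_segmenter", before="parser")
    return [sent.text for sent in nlp(text).sents]
```

Calling split_sentences on the text from the question should keep each list number together with its item, although the remaining boundaries still depend on the model's parser. Like the original pattern, this only covers single-digit list numbers.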

Hope it helps!