How do I get a Token from a Lucene TokenStream?

Question

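I'm trying to use Apache Lucene for tokenization, and I'm baffled by the process of obtaining Tokens from a TokenStream.

The worst part is that I'm looking at the comments in the JavaDocs that address my question.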

http://lucene.apache.org/java/3_0_1/api/core/org/apache/lucene/analysis/TokenStream.html#incrementToken%28%29

Somehow, an AttributeSource is supposed to be used rather than Tokens. I'm totally at a loss.

Can anyone explain how to get token-like information from a TokenStream?

Adam Paynter · Accepted Answer

Yeah, it's a little convoluted (compared to the good old way), but this should do it:

    TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
    OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
    TermAttribute termAttribute = tokenStream.getAttribute(TermAttribute.class);

    while (tokenStream.incrementToken()) {
        int startOffset = offsetAttribute.startOffset();
        int endOffset = offsetAttribute.endOffset();
        String term = termAttribute.term();
    }

Edit: the new way

According to Donotello, TermAttribute has been deprecated in favor of CharTermAttribute. According to jpountz (and Lucene's documentation), addAttribute is preferable to getAttribute.

    TokenStream tokenStream = analyzer.tokenStream(fieldName, reader);
    OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);

    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        int startOffset = offsetAttribute.startOffset();
        int endOffset = offsetAttribute.endOffset();
        String term = charTermAttribute.toString();
    }

yegor256 · Answer

This is how it should be (a clean version of Adam's answer):

    TokenStream stream = analyzer.tokenStream(null, new StringReader(text));
    CharTermAttribute cattr = stream.addAttribute(CharTermAttribute.class);
    stream.reset();
    while (stream.incrementToken()) {
        System.out.println(cattr.toString());
    }
    stream.end();
    stream.close();

Flamingo · Answer

For the latest version of Lucene, 7.3.1:

    // Test the tokenizer
    Analyzer testAnalyzer = new CJKAnalyzer();
    String testText = "Test Tokenizer";
    TokenStream ts = testAnalyzer.tokenStream("context", new StringReader(testText));
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    try {
        ts.reset(); // Resets this stream to the beginning. (Required)
        while (ts.incrementToken()) {
            // Use AttributeSource.reflectAsString(boolean)
            // for token stream debugging.
            System.out.println("token: " + ts.reflectAsString(true));

            System.out.println("token start offset: " + offsetAtt.startOffset());
            System.out.println("  token end offset: " + offsetAtt.endOffset());
        }
        ts.end(); // Perform end-of-stream operations, e.g. set the final offset.
    } finally {
        ts.close(); // Release resources associated with this stream.
    }

Reference: https://lucene.apache.org/core/7_3_1/core/org/apache/lucene/analysis/package-summary.html

William Price · Answer

There are two variations in the OP's question:

1. What is "the process of obtaining Tokens from a TokenStream"?
2. "Can anyone explain how to get token-like information from a TokenStream?"

Recent versions of the Lucene documentation for Token say (emphasis added):

NOTE: As of 2.9 ... it is not necessary to use Token anymore, with the new TokenStream API it can be used as convenience class that implements all Attributes, which is especially useful to easily switch from the old to the new TokenStream API.

And the TokenStream documentation says of its API:

... has moved from being Token-based to Attribute-based ... the preferred way to store the information of a Token is to use AttributeImpls.

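The other answers to this question cover #2 above: how to get token-like information from a TokenStream in the "new" recommended way, using attributes. Reading through the documentation, the Lucene developers suggest that this change was made, in part, to reduce the number of individual objects being created at a time.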

But as some people have pointed out in the comments on those answers, they don't directly answer #1: how do you get a Token if you really want or need that type?

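With the same API change that makes TokenStream an AttributeSource, Token now implements Attribute and can be used with TokenStream.addAttribute, just as the other answers show for CharTermAttribute and OffsetAttribute. So they really did answer that part of the original question; they simply didn't show it.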

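It is important to note that while this approach lets you access a Token while you're looping, it is still only a single object, no matter how many logical tokens are in the stream. Every call to incrementToken() changes the state of the Token returned from addAttribute. So if your goal is to build a collection of different Token objects to use outside the loop, you will need to do extra work to make a new Token object as a (deep?) copy.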
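To make the single-object pitfall concrete, here is a minimal sketch that uses only the JDK. MutableToken, ToyTokenStream, and SharedTokenDemo are hypothetical stand-ins invented for this illustration; they are not Lucene classes and only mimic the incrementToken() contract of overwriting one shared object per step.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a reusable attribute/Token. Not a Lucene class.
final class MutableToken {
    String term;
    int startOffset;
    int endOffset;

    // Snapshot the current state into an independent object.
    MutableToken copy() {
        MutableToken c = new MutableToken();
        c.term = term;
        c.startOffset = startOffset;
        c.endOffset = endOffset;
        return c;
    }
}

// Hypothetical space-splitting stream that mimics the incrementToken() contract:
// each call overwrites the one shared MutableToken instead of allocating a new one.
final class ToyTokenStream {
    private final String text;
    private int pos = 0;
    final MutableToken token = new MutableToken(); // the single shared instance

    ToyTokenStream(String text) {
        this.text = text;
    }

    boolean incrementToken() {
        // Skip separators, then stop if the input is exhausted.
        while (pos < text.length() && text.charAt(pos) == ' ') pos++;
        if (pos >= text.length()) return false;
        int start = pos;
        while (pos < text.length() && text.charAt(pos) != ' ') pos++;
        token.term = text.substring(start, pos);
        token.startOffset = start;
        token.endOffset = pos;
        return true;
    }
}

final class SharedTokenDemo {
    // Pitfall: every list element is the same object, holding the last token's state.
    static List<MutableToken> collectReferences(String text) {
        ToyTokenStream ts = new ToyTokenStream(text);
        List<MutableToken> out = new ArrayList<>();
        while (ts.incrementToken()) out.add(ts.token);
        return out;
    }

    // Fix: take a snapshot on each iteration so the elements are independent.
    static List<MutableToken> collectCopies(String text) {
        ToyTokenStream ts = new ToyTokenStream(text);
        List<MutableToken> out = new ArrayList<>();
        while (ts.incrementToken()) out.add(ts.token.copy());
        return out;
    }

    public static void main(String[] args) {
        for (MutableToken t : collectReferences("foo bar baz")) {
            System.out.println("shared: " + t.term);
        }
        for (MutableToken t : collectCopies("foo bar baz")) {
            System.out.println("copied: " + t.term);
        }
    }
}
```

With real Lucene code, the analogous step is to copy the values you need inside the loop, as the earlier answers do with charTermAttribute.toString(), rather than keeping a reference to the attribute object itself.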