Java入力テキストからキーワードを抽出するためのライブラリ

Question

テキストのブロックからキーワードを抽出するためのJavaライブラリを探しています。

プロセスは次のようになります。

単語のクリーニングを停止する->ステミング->英語の言語統計情報に基づいてキーワードを検索する-単語が英語よりもテキスト内に出現する確率が、キーワード候補よりも英語の確率の方が高い場合。

このタスクを実行するライブラリはありますか？

sp00m · Accepted Answer

Apache Lucene を使用した可能な解決策を次に示します。私は最後のバージョンを使用しませんでした .6.2 one 、これが私が最もよく知っているバージョンであるためです。 /lucene-core-x.x.x.jarのほかに、ダウンロードしたアーカイブから/contrib/analyzers/common/lucene-analyzers-x.x.x.jarをプロジェクトに追加することを忘れないでください。これには、言語固有のアナライザー（特に英語の場合）が含まれています。

これはonlyで、それぞれの語幹に基づいて入力テキストの単語の頻度を見つけることに注意してください。これらの頻度と英語の統計を後で比較する必要があります（この答えが役立つかもしれません）。

データモデル

1つの語幹に1つのキーワード。異なる単語が同じ語幹を持つ場合があるため、termsセット。キーワードの頻度は、新しい用語が見つかるたびに増分されます（既に見つかっていても、セットは重複を自動的に削除します）。

public class Keyword implements Comparable<Keyword> { private final String stem; private final Set<String> terms = new HashSet<String>(); private int frequency = 0; public Keyword(String stem) { this.stem = stem; } public void add(String term) { terms.add(term); frequency++; } @Override public int compareTo(Keyword o) { // descending order return Integer.valueOf(o.frequency).compareTo(frequency); } @Override public boolean equals(Object obj) { if (this == obj) { return true; } else if (!(obj instanceof Keyword)) { return false; } else { return stem.equals(((Keyword) obj).stem); } } @Override public int hashCode() { return Arrays.hashCode(new Object[] { stem }); } public String getStem() { return stem; } public Set<String> getTerms() { return terms; } public int getFrequency() { return frequency; } }

ユーティリティ

単語をステミングするには：

public static String stem(String term) throws IOException { TokenStream tokenStream = null; try { // tokenize tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(term)); // stem tokenStream = new PorterStemFilter(tokenStream); // add each token in a set, so that duplicates are removed Set<String> stems = new HashSet<String>(); CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class); tokenStream.reset(); while (tokenStream.incrementToken()) { stems.add(token.toString()); } // if no stem or 2+ stems have been found, return null if (stems.size() != 1) { return null; } String stem = stems.iterator().next(); // if the stem has non-alphanumerical chars, return null if (!stem.matches("[a-zA-Z0-9-]+")) { return null; } return stem; } finally { if (tokenStream != null) { tokenStream.close(); } } }

コレクションを検索するには（潜在的なキーワードのリストで使用されます）：

public static <T> T find(Collection<T> collection, T example) { for (T element : collection) { if (element.equals(example)) { return element; } } collection.add(example); return example; }

芯

主な入力方法は次のとおりです。

public static List<Keyword> guessFromString(String input) throws IOException { TokenStream tokenStream = null; try { // hack to keep dashed words (e.g. "non-specific" rather than "non" and "specific") input = input.replaceAll("-+", "-0"); // replace any punctuation char but apostrophes and dashes by a space input = input.replaceAll("[\p{Punct}&&[^'-]]+", " "); // replace most common english contractions input = input.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\b", ""); // tokenize input tokenStream = new ClassicTokenizer(Version.LUCENE_36, new StringReader(input)); // to lowercase tokenStream = new LowerCaseFilter(Version.LUCENE_36, tokenStream); // remove dots from acronyms (and "'s" but already done manually above) tokenStream = new ClassicFilter(tokenStream); // convert any char to ASCII tokenStream = new ASCIIFoldingFilter(tokenStream); // remove english stop words tokenStream = new StopFilter(Version.LUCENE_36, tokenStream, EnglishAnalyzer.getDefaultStopSet()); List<Keyword> keywords = new LinkedList<Keyword>(); CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class); tokenStream.reset(); while (tokenStream.incrementToken()) { String term = token.toString(); // stem each term String stem = stem(term); if (stem != null) { // create the keyword or get the existing one if any Keyword keyword = find(keywords, new Keyword(stem.replaceAll("-0", "-"))); // add its corresponding initial token keyword.add(term.replaceAll("-0", "-")); } } // reverse sort by frequency Collections.sort(keywords); return keywords; } finally { if (tokenStream != null) { tokenStream.close(); } } }

例

Javaウィキペディアの記事の紹介部分でguessFromStringメソッドを使用すると、見つかった最初の10個の最も頻繁なキーワード（つまり、語幹）が次のようになります。

Java x12 [Java] compil x5 [compiled, compiler, compilers] Sun x5 [Sun] develop x4 [developed, developers] languag x3 [languages, language] implement x3 [implementation, implementations] applic x3 [application, applications] run x3 [run] Origin x3 [originally, original] gnu x3 [gnu]

termsセット（上記の括弧[...]で表示）を取得することにより、出力リストを反復処理して、各ステムの元の単語を確認します例）。

次は何ですか

stem frequency/frequency sumの比率を英語の統計の比率と比較し、管理していればループに入れてください。私も非常に興味があります:)

Mike B. · Answer

上記で提案されたコードの更新済みですぐに使用できるバージョン。
このコードはApache Lucene 5.x…6.xと互換性があります。

CardKeywordクラス：

import Java.util.HashSet; import Java.util.Set; /** * Keyword card with stem form, terms dictionary and frequency rank */ class CardKeyword implements Comparable<CardKeyword> { /** * Stem form of the keyword */ private final String stem; /** * Terms dictionary */ private final Set<String> terms = new HashSet<>(); /** * Frequency rank */ private int frequency; /** * Build keyword card with stem form * * @param stem */ public CardKeyword(String stem) { this.stem = stem; } /** * Add term to the dictionary and update its frequency rank * * @param term */ public void add(String term) { this.terms.add(term); this.frequency++; } /** * Compare two keywords by frequency rank * * @param keyword * @return int, which contains comparison results */ @Override public int compareTo(CardKeyword keyword) { return Integer.valueOf(keyword.frequency).compareTo(this.frequency); } /** * Get stem's hashcode * * @return int, which contains stem's hashcode */ @Override public int hashCode() { return this.getStem().hashCode(); } /** * Check if two stems are equal * * @param o * @return boolean, true if two stems are equal */ @Override public boolean equals(Object o) { if (this == o) return true; if (!(o instanceof CardKeyword)) return false; CardKeyword that = (CardKeyword) o; return this.getStem().equals(that.getStem()); } /** * Get stem form of keyword * * @return String, which contains getStemForm form */ public String getStem() { return this.stem; } /** * Get terms dictionary of the stem * * @return Set<String>, which contains set of terms of the getStemForm */ public Set<String> getTerms() { return this.terms; } /** * Get stem frequency rank * * @return int, which contains getStemForm frequency */ public int getFrequency() { return this.frequency; } }

KeywordsExtractorクラス：

import org.Apache.lucene.analysis.TokenStream; import org.Apache.lucene.analysis.core.LowerCaseFilter; import org.Apache.lucene.analysis.core.StopFilter; import org.Apache.lucene.analysis.en.EnglishAnalyzer; import org.Apache.lucene.analysis.en.PorterStemFilter; import org.Apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter; import org.Apache.lucene.analysis.standard.ClassicFilter; import org.Apache.lucene.analysis.standard.StandardTokenizer; import org.Apache.lucene.analysis.tokenattributes.CharTermAttribute; import Java.io.IOException; import Java.io.StringReader; import Java.util.*; /** * Keywords extractor functionality handler */ class KeywordsExtractor { /** * Get list of keywords with stem form, frequency rank, and terms dictionary * * @param fullText * @return List<CardKeyword>, which contains keywords cards * @throws IOException */ static List<CardKeyword> getKeywordsList(String fullText) throws IOException { TokenStream tokenStream = null; try { // treat the dashed words, don't let separate them during the processing fullText = fullText.replaceAll("-+", "-0"); // replace any punctuation char but apostrophes and dashes with a space fullText = fullText.replaceAll("[\p{Punct}&&[^'-]]+", " "); // replace most common English contractions fullText = fullText.replaceAll("(?:'(?:[tdsm]|[vr]e|ll))+\b", ""); StandardTokenizer stdToken = new StandardTokenizer(); stdToken.setReader(new StringReader(fullText)); tokenStream = new StopFilter(new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))), EnglishAnalyzer.getDefaultStopSet()); tokenStream.reset(); List<CardKeyword> cardKeywords = new LinkedList<>(); CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class); while (tokenStream.incrementToken()) { String term = token.toString(); String stem = getStemForm(term); if (stem != null) { CardKeyword cardKeyword = find(cardKeywords, new CardKeyword(stem.replaceAll("-0", "-"))); // treat the dashed words back, let look them pretty cardKeyword.add(term.replaceAll("-0", "-")); } } // reverse sort by frequency Collections.sort(cardKeywords); return cardKeywords; } finally { if (tokenStream != null) { try { tokenStream.close(); } catch (IOException e) { e.printStackTrace(); } } } } /** * Get stem form of the term * * @param term * @return String, which contains the stemmed form of the term * @throws IOException */ private static String getStemForm(String term) throws IOException { TokenStream tokenStream = null; try { StandardTokenizer stdToken = new StandardTokenizer(); stdToken.setReader(new StringReader(term)); tokenStream = new PorterStemFilter(stdToken); tokenStream.reset(); // eliminate duplicate tokens by adding them to a set Set<String> stems = new HashSet<>(); CharTermAttribute token = tokenStream.getAttribute(CharTermAttribute.class); while (tokenStream.incrementToken()) { stems.add(token.toString()); } // if stem form was not found or more than 2 stems have been found, return null if (stems.size() != 1) { return null; } String stem = stems.iterator().next(); // if the stem form has non-alphanumerical chars, return null if (!stem.matches("[a-zA-Z0-9-]+")) { return null; } return stem; } finally { if (tokenStream != null) { try { tokenStream.close(); } catch (IOException e) { e.printStackTrace(); } } } } /** * Find sample in collection * * @param collection * @param sample * @param <T> * @return <T> T, which contains the found object within collection if exists, otherwise the initially searched object */ private static <T> T find(Collection<T> collection, T sample) { for (T element : collection) { if (element.equals(sample)) { return element; } } collection.add(sample); return sample; } }

関数の呼び出し：

String text = "…"; List<CardKeyword> keywordsList = KeywordsExtractor.getKeywordsList(text);