Removing stop words from a string in Java

Question

I have a string containing many words, and a text file containing the stop words that need to be removed from that string. Suppose I have the string

s="I love this phone, its super fast and there's so much new and cool things with Jelly bean....but of recently I've seen some bugs."

After removing the stop words, the string should look like this:

"love phone, super fast much cool Jelly bean....but recently bugs."

I managed to get this working, but the problem I am facing is that whenever the string contains adjacent stop words, only the first one is removed, and I get this result:

"love phone, super fast there's much and cool with Jelly bean....but recently seen bugs"

Here is my stopwordslist.txt file: Stopwords

How can I solve this problem? Here is what I have done so far:

    int k = 0, i, j;
    ArrayList<String> wordsList = new ArrayList<String>();
    String sCurrentLine;
    String[] stopwords = new String[2000];
    try {
        FileReader fr = new FileReader("F:\\stopwordslist.txt");
        BufferedReader br = new BufferedReader(fr);
        while ((sCurrentLine = br.readLine()) != null) {
            stopwords[k] = sCurrentLine;
            k++;
        }
        String s = "I love this phone, its super fast and there's so much new and cool things with Jelly bean....but of recently I've seen some bugs.";
        StringBuilder builder = new StringBuilder(s);
        String[] words = builder.toString().split("\\s");
        for (String word : words) {
            wordsList.add(word);
        }
        for (int ii = 0; ii < wordsList.size(); ii++) {
            for (int jj = 0; jj < k; jj++) {
                if (stopwords[jj].contains(wordsList.get(ii).toLowerCase())) {
                    wordsList.remove(ii);
                    break;
                }
            }
        }
        for (String str : wordsList) {
            System.out.print(str + " ");
        }
    } catch (Exception ex) {
        System.out.println(ex);
    }

alain.janinm · Accepted Answer

From there, there are several solutions. For example, instead of removing a value you can set it to "" (the empty string). Or you can build a separate "result" list.
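The "result list" suggestion can be sketched roughly as follows. The class name `ResultListFilter`, the helper `removeStopWords`, and the sample stop words are made up for illustration, and the sketch assumes the stop words are stored in lower case. Because nothing is removed from the list being iterated, no index shifts, so adjacent stop words are both checked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ResultListFilter {

    // Hypothetical helper: copies every non-stop word into a new list
    // instead of removing elements from the list being iterated.
    static List<String> removeStopWords(List<String> words, Set<String> stopWords) {
        List<String> result = new ArrayList<>();
        for (String word : words) {
            // Assumption: stop words are stored in lower case.
            if (!stopWords.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample stop words chosen for illustration only.
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "this", "and", "so"));
        List<String> words = Arrays.asList("I love this phone and so much more".split("\\s+"));
        // prints: love phone much more
        System.out.println(String.join(" ", removeStopWords(words, stopWords)));
    }
}
```

Note that "and" and "so" sit next to each other in the sample sentence and are both dropped.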

geert3 · Answer

Here is a much more elegant solution (IMHO), using only regular expressions:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // instead of the ".....", add all your stopwords, separated by "|"
    // "\\b" is to account for word boundaries, i.e. not replace "his" in "this"
    // the "\\s?" is to suppress optional trailing white space
    Pattern p = Pattern.compile("\\b(I|this|its.....)\\b\\s?");
    Matcher m = p.matcher("I love this phone, its super fast and there's so much new and cool things with Jelly bean....but of recently I've seen some bugs.");
    String s = m.replaceAll("");
    System.out.println(s);
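Rather than typing the alternation by hand, the pattern could be built from a list of stop words. The following is only a sketch: the class `RegexStopWords` and its helpers are hypothetical names, the case-insensitive flag is an added assumption, and the list is assumed to be non-empty. `Pattern.quote` is used so that a stop word containing regex metacharacters is matched literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class RegexStopWords {

    // Hypothetical helper: builds one alternation pattern from a non-empty stop-word list.
    static Pattern buildPattern(List<String> stopWords) {
        String alternation = stopWords.stream()
                .map(Pattern::quote)               // match each stop word literally
                .collect(Collectors.joining("|"));
        // \b prevents matching "his" inside "this"; \s? swallows one trailing whitespace.
        return Pattern.compile("\\b(?:" + alternation + ")\\b\\s?", Pattern.CASE_INSENSITIVE);
    }

    static String removeStopWords(String text, List<String> stopWords) {
        return buildPattern(stopWords).matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample stop words chosen for illustration only.
        List<String> stopWords = Arrays.asList("I", "this", "its", "and", "so");
        // prints: love phone, super fast cool
        System.out.println(removeStopWords("I love this phone, its super fast and so cool", stopWords));
    }
}
```

With word boundaries, a contraction such as "I've" would still lose its "I" and leave "'ve" behind, so contractions may need their own entries in the list.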

robin · Answer

Try the program below.

    // Needs java.util.ArrayList, java.util.HashSet and java.util.Set.
    String s = "I love this phone, its super fast and there's so"
            + " much new and cool things with Jelly bean....but of recently I've seen some bugs.";
    String[] words = s.split(" ");
    ArrayList<String> wordsList = new ArrayList<String>();

    // Stop words are stored in upper case so the comparison is case-insensitive.
    Set<String> stopWordsSet = new HashSet<String>();
    stopWordsSet.add("I");
    stopWordsSet.add("THIS");
    stopWordsSet.add("AND");
    stopWordsSet.add("THERE'S");

    for (String word : words) {
        String wordCompare = word.toUpperCase();
        if (!stopWordsSet.contains(wordCompare)) {
            wordsList.add(word);
        }
    }

    for (String str : wordsList) {
        System.out.print(str + " ");
    }

Output: love phone, its super fast so much new cool things with Jelly bean....but of recently I've seen some bugs.

Navnath Chinchore · Answer

You can use the replaceAll function like this:

    String yourString = "I love this phone, its super fast and there's so much new and cool things with Jelly bean....but of recently I've seen some bugs.";
    // replaces every match of the regex "stop" with the empty string
    yourString = yourString.replaceAll("stop", "");

Darshan Lila · Answer

Try it the following way:

    String s = "I love this phone, its super fast and there's so much new and cool things with Jelly bean....but of recently I've seen some bugs.";
    String stopWords[] = {"love", "this", "cool"};
    for (int i = 0; i < stopWords.length; i++) {
        if (s.contains(stopWords[i])) {
            s = s.replaceAll(stopWords[i] + "\\s+", ""); // note this will remove spaces at the end
        }
    }
    System.out.println(s);

This way, the final output will not contain the words you don't want. Just put the list of stop words in an array and replace them in the target string.
Output for my stop words:

I phone, its super fast and there's so much new and things with Jelly bean....but of recently I've seen some bugs.

Vimal Bera · Answer

Why not use the approach below instead? It is easier to read and understand.

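    for (String word : words) {
        // replaceAll treats its first argument as a regex; replace() would take "\\s*" literally
        s = s.replaceAll(word + "\\s*", "");
    }
    System.out.println(s); // prints the string with the words removed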

SMA · Answer

Try using String's replaceAll API, like this:

    String myString = "I love this phone, its super fast and there's so much new and cool things with Jelly bean....but of recently I've seen some bugs.";
    String stopWords = "I|its|with|but";
    String afterStopWords = myString.replaceAll("(" + stopWords + ")\\s*", "");
    System.out.println(afterStopWords);

OUTPUT: love this phone, super fast and there's so much new and cool things Jelly bean....of recently 've seen some bugs.

Michal Lozinski · Answer

Try storing the stop words in a Set collection, then tokenize your string into a list. After that you can simply use removeAll to get the result.

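    // Needs java.util.ArrayList, Arrays, HashSet, List and Set.
    Set<String> stopwords = new HashSet<>(); // fill in the set with your file
    String s = "I love this phone, its super fast and there's so much new and cool things with Jelly bean....but of recently I've seen some bugs.";
    // Arrays.asList alone returns a fixed-size list, so copy it before calling removeAll
    List<String> listOfStrings = new ArrayList<>(Arrays.asList(s.split(" ")));
    listOfStrings.removeAll(stopwords);
    // StringUtils comes from Apache Commons Lang; String.join(" ", listOfStrings) also works on Java 8+
    String result = StringUtils.join(listOfStrings, " ");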

No for loops needed; they usually mean trouble.
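One thing to keep in mind with `removeAll` on a set is that the comparison is case-sensitive, so "I" would not match a stop word stored as "i". A possible variation, sketched here with hypothetical names (`SetStopWords`, `removeStopWords`) and assuming the stop words are stored in lower case, uses `removeIf` to normalize each token before the lookup:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class SetStopWords {

    // Hypothetical helper: tokenizes on whitespace and drops tokens found in the set.
    static String removeStopWords(String text, Set<String> stopWords) {
        // Copy into an ArrayList because the list returned by Arrays.asList is fixed-size.
        List<String> tokens = new ArrayList<>(Arrays.asList(text.split("\\s+")));
        // Assumption: the set holds lower-case stop words.
        tokens.removeIf(token -> stopWords.contains(token.toLowerCase(Locale.ROOT)));
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        // Sample stop words chosen for illustration only.
        Set<String> stopWords = new HashSet<>(Arrays.asList("i", "this", "and", "so"));
        // prints: love phone much more
        System.out.println(removeStopWords("I love this phone and so much more", stopWords));
    }
}
```

Tokens that carry punctuation, such as "phone," in the question's sentence, are compared as-is, so a stop word followed by a comma would not be removed by this sketch.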

Uttesh Kumar · Answer

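Recently one of my projects needed functionality to filter stop words and profanity/swear words from a given text or file. After going through a few blogs and articles, I wrote a simple library for filtering data/files and made it available on Maven. I hope this helps someone.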

https://github.com/uttesh/exude

     <dependency>
        <groupId>com.uttesh</groupId>
        <artifactId>exude</artifactId>
        <version>0.0.2</version>
    </dependency>

Inquisitor · Answer

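It looks like you stop as soon as one stop word has been removed from a sentence and then move on to the next stop word; you need to remove all the stop words in each sentence.

Try changing your code.

From: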

    for (int ii = 0; ii < wordsList.size(); ii++) {
        for (int jj = 0; jj < k; jj++) {
            if (stopwords[jj].contains(wordsList.get(ii).toLowerCase())) {
                wordsList.remove(ii);
                break;
            }
        }
    }

To something like:

    for (int ii = 0; ii < wordsList.size(); ii++) {
        for (int jj = 0; jj < k; jj++) {
            if (wordsList.get(ii).toLowerCase().contains(stopwords[jj])) {
                wordsList.remove(ii);
            }
        }
    }

Note that the break was removed and that stopword.contains(word) was changed to word.contains(stopword).
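Removing elements inside an index-based loop can still skip the element that slides into the freed slot, or read past the end of the list when the last word is removed. One alternative, offered only as a sketch and not as this answer's method, is to remove through an `Iterator`. The names `InPlaceRemoval` and `removeStopWords` are hypothetical, and the sketch deliberately swaps `contains` for an exact `equals` comparison so that a word is not dropped merely because it contains a stop word as a substring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class InPlaceRemoval {

    // Hypothetical helper: removes stop words from the list in place.
    static void removeStopWords(List<String> wordsList, String[] stopwords) {
        Iterator<String> it = wordsList.iterator();
        while (it.hasNext()) {
            String word = it.next().toLowerCase();
            for (String stopword : stopwords) {
                // Exact match; unused (null) array slots simply never match.
                if (word.equals(stopword)) {
                    it.remove();   // safe removal: the iterator keeps its position
                    break;         // this word is gone, continue with the next word
                }
            }
        }
    }

    public static void main(String[] args) {
        // Sample data chosen for illustration only.
        List<String> wordsList = new ArrayList<>(
                Arrays.asList("I", "love", "this", "phone", "and", "so", "much"));
        String[] stopwords = {"i", "this", "and", "so"};
        removeStopWords(wordsList, stopwords);
        // prints: [love, phone, much]
        System.out.println(wordsList);
    }
}
```

Here the `break` only leaves the inner loop over stop words after the current word has been removed; the outer loop still visits every remaining word.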