Print the 10 most frequently occurring words of a text, including and excluding stopwords

Question

This follows on from my question here, with my changes. I have the following code:

    import nltk
    from nltk.corpus import stopwords

    def content_text(text):
        stopwords = nltk.corpus.stopwords.words('english')
        content = [w for w in text if w.lower() in stopwords]
        return content

How can I print the 10 most frequently occurring words of a text that 1) include and 2) exclude stopwords?

Padraic Cunningham · Accepted Answer

I'm not sure about the "is stopwords" in the function; I imagine it needs to be "in". Either way, you can use a Counter dict with most_common(10) to get the 10 most frequent words:

    from collections import Counter
    from string import punctuation

    import nltk

    def content_text(text):
        stopwords = set(nltk.corpus.stopwords.words('english'))  # O(1) lookups
        with_stp = Counter()
        without_stp = Counter()
        with open(text) as f:
            for line in f:
                spl = line.split()
                # update the count of all words in the line that are in stopwords
                with_stp.update(w.lower().rstrip(punctuation) for w in spl if w.lower() in stopwords)
                # update the count of all words in the line that are not in stopwords
                # (lowercase before the test so that "The" is not counted as a content word)
                without_stp.update(w.lower().rstrip(punctuation) for w in spl if w.lower() not in stopwords)
        # return the ten most common (word, count) pairs from each counter
        return with_stp.most_common(10), without_stp.most_common(10)

    wth_stop, wthout_stop = content_text(...)

If you are passing in an nltk file object, just iterate over it:

    from collections import Counter

    import nltk

    def content_text(text):
        stopwords = set(nltk.corpus.stopwords.words('english'))
        with_stp = Counter()
        without_stp = Counter()
        for word in text:
            word = word.lower()
            if word in stopwords:
                # count the words that are stopwords
                with_stp.update([word])
            else:
                # count the words that are not stopwords
                without_stp.update([word])
        # return a list with the top ten most common words from each
        return [k for k, _ in with_stp.most_common(10)], [y for y, _ in without_stp.most_common(10)]

    print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt')))

The nltk methods include punctuation as tokens, so the result may not be what you want.
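One possible way to keep punctuation tokens out of the counts is to skip any token that is not purely alphabetic. The sketch below is only an illustration: the helper name top_words, the tiny STOPWORDS set and the sample tokens are assumptions made so the snippet runs without NLTK data. In practice the stopword set would come from nltk.corpus.stopwords.words('english').

```python
from collections import Counter

# Tiny stand-in stopword set so the sketch runs without NLTK data (assumption);
# in practice use set(nltk.corpus.stopwords.words('english')).
STOPWORDS = {"the", "a", "and", "of", "to", "is"}


def top_words(tokens, n=10, stopwords=STOPWORDS):
    """Return (top stopwords, top non-stopwords) as lists of (word, count)."""
    with_stp = Counter()
    without_stp = Counter()
    for token in tokens:
        word = token.lower()
        # skip punctuation and anything else that is not purely alphabetic
        if not word.isalpha():
            continue
        if word in stopwords:
            with_stp[word] += 1
        else:
            without_stp[word] += 1
    return with_stp.most_common(n), without_stp.most_common(n)


# Made-up sample tokens, shaped like the output of a word tokenizer
tokens = ["The", "cat", ",", "the", "dog", "and", "the", "cat", "."]
stop_top, content_top = top_words(tokens)
print(stop_top)
print(content_top)
```

Note that str.isalpha() also drops tokens such as "don't" or "well-known", so a looser filter may suit some texts better.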

igorushi · Answer

nltk has a FreqDist class for this:

    import nltk

    allWords = nltk.tokenize.word_tokenize(text)
    allWordDist = nltk.FreqDist(w.lower() for w in allWords)

    stopwords = set(nltk.corpus.stopwords.words('english'))
    # lowercase before the test so capitalised stopwords are excluded too
    allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stopwords)

To extract the 10 most common:

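    # most_common(10) returns a list of (word, count) pairs, so take the first item of each
    mostCommon = [word for word, _ in allWordDist.most_common(10)]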
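For reference, in NLTK 3 FreqDist subclasses collections.Counter, so most_common(n) returns a list of (word, count) tuples rather than a dict. A minimal sketch, using a plain Counter as a stand-in for FreqDist and made-up sample tokens (both assumptions), shows the shape of the result and how to pull out just the words:

```python
from collections import Counter

# Made-up sample tokens; Counter stands in for nltk.FreqDist here (assumption).
tokens = ["to", "be", "or", "not", "to", "be"]
dist = Counter(w.lower() for w in tokens)

pairs = dist.most_common(2)          # list of (word, count) tuples
words = [word for word, _ in pairs]  # just the words, most frequent first
print(pairs)
print(words)
```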

prahlad · Answer

You can try this:

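    # allWordsDist is assumed to be a FreqDist (or Counter) of the words
    # Python 3: print handles Unicode, so no explicit .encode('utf-8') is needed
    for word, frequency in allWordsDist.most_common(10):
        print('%s;%d' % (word, frequency))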