Sorted word frequency count using Python

Question

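I need to count the frequency of words in a text using Python. I thought of keeping the words in a dictionary and storing a count for each of them.

Now, if I have to sort the words by number of occurrences, can I do it with the same dictionary, instead of building a new dictionary that has the count as the key and an array of words as the value?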

Frédéric Hamidi · Accepted Answer

You can use the same dictionary:

>>> d = { "foo": 4, "bar": 2, "quux": 3 }
>>> sorted(d.items(), key=lambda item: item[1])

The second line prints:

[('bar', 2), ('quux', 3), ('foo', 4)]

If you only want the sorted list of words, do:

>>> [pair[0] for pair in sorted(d.items(), key=lambda item: item[1])]

That line prints:

['bar', 'quux', 'foo']
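The same call can also sort from most to least frequent. Below is a minimal sketch, assuming that ties should be broken alphabetically. That tie-breaking rule is an assumption rather than part of the answer above, and the extra "baz" entry is added only to show it:

```python
d = {"foo": 4, "bar": 2, "quux": 3, "baz": 3}

# Sort by count descending, then by word ascending so ties are deterministic.
by_count_desc = sorted(d.items(), key=lambda item: (-item[1], item[0]))
print(by_count_desc)
# [('foo', 4), ('baz', 3), ('quux', 3), ('bar', 2)]

# sorted() returns a new list; the dictionary itself is left unchanged.
words_only = [word for word, count in by_count_desc]
print(words_only)
# ['foo', 'baz', 'quux', 'bar']
```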

jathanism · Answer

Warning: This example requires Python 2.7 or higher.

Python's built-in Counter object is exactly what you're looking for. Counting words is even the first example in the documentation:

>>> # Tally occurrences of words in a list
>>> from collections import Counter
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
...     cnt[word] += 1
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})

As pointed out in the comments, Counter takes an iterable, so the example above is only for illustration and is equivalent to:

>>> mywords = ['red', 'blue', 'red', 'green', 'blue', 'blue']
>>> cnt = Counter(mywords)
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
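Since the question is about sorting by frequency, it may help to note that Counter also has a most_common() method, which returns (word, count) pairs ordered from most to least frequent. A short sketch using the same list:

```python
from collections import Counter

mywords = ['red', 'blue', 'red', 'green', 'blue', 'blue']
cnt = Counter(mywords)

# All words, most frequent first.
print(cnt.most_common())
# [('blue', 3), ('red', 2), ('green', 1)]

# Only the two most frequent words.
print(cnt.most_common(2))
# [('blue', 3), ('red', 2)]
```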

martineau · Answer

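You can use Counter and defaultdict from the Python 2.7 collections module in a two-step process. First, use Counter to create a dictionary in which each word is a key with its frequency count as the associated value.

Then use defaultdict to create an inverted dictionary in which the keys are the frequencies of occurrence and the associated values are lists of the words encountered that many times. Here's what I mean: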

from collections import Counter, defaultdict

wordlist = ['red', 'yellow', 'blue', 'red', 'green', 'blue', 'blue', 'yellow']

# invert a temporary Counter(wordlist) dictionary so keys are
# frequency of occurrence and values are lists of the words encountered
freqword = defaultdict(list)
for word, freq in Counter(wordlist).items():
    freqword[freq].append(word)

# print in order of occurrence (with sorted list of words)
for freq in sorted(freqword):
    print('count {}: {}'.format(freq, sorted(freqword[freq])))

Output:

count 1: ['green']
count 2: ['red', 'yellow']
count 3: ['blue']

user470379 · Answer

>>> d = {'a': 3, 'b': 1, 'c': 2, 'd': 5, 'e': 0}
>>> l = list(d.items())  # list() is needed on Python 3, where items() returns a view
>>> l.sort(key=lambda item: item[1])
>>> l
[('e', 0), ('b', 1), ('c', 2), ('a', 3), ('d', 5)]

Russell Asher · Answer

Finding the frequency of these items is easy if you have all the words in a list (which is easy to get with the string split function). Then:

# input_string is assumed to hold the text to analyze
list_of_words = input_string.split()  # splits the words up on whitespace
set_of_words = set(list_of_words)  # gives you all the unique words (no duplicates)
for word in set_of_words:
    # count how many times the word is in the list
    print(word + " appears: " + str(list_of_words.count(word)) + " times")

Fruitful · Answer

With the help of Stack Overflow, I wrote a similar program:

from string import punctuation
from operator import itemgetter

N = 100
words = {}

# lowercase every word and strip leading/trailing punctuation
words_gen = (word.strip(punctuation).lower()
             for line in open("poi_run.txt")
             for word in line.split())

for word in words_gen:
    words[word] = words.get(word, 0) + 1

# the N most frequent words, highest count first
top_words = sorted(words.items(), key=itemgetter(1), reverse=True)[:N]

for word, frequency in top_words:
    print("%s %d" % (word, frequency))

prisco.napoli · Answer

I wrote a similar program a few days ago. It takes two arguments: the file name (required) and N (optional).

from collections import Counter
import re
import sys

if sys.version_info < (2, 7):
    sys.exit("Must use Python 2.7 or greater")

if len(sys.argv) < 2:
    sys.exit('Usage: python %s filename N' % sys.argv[0])

n = 0
if len(sys.argv) > 2:
    try:
        n = int(sys.argv[2])
        if n <= 0:
            raise ValueError
    except ValueError:
        sys.exit("Invalid value for N: %s.\nN must be an integer greater than 0" % sys.argv[2])

filename = sys.argv[1]
try:
    with open(filename, "r") as input_text:
        wordcounter = Counter()
        for line in input_text:
            wordcounter.update(re.findall(r"\w+", line.lower()))
        # n == 0 means no N was given: print every word
        if n == 0:
            n = len(wordcounter)
        for word, frequency in wordcounter.most_common(n):
            print("%s %d" % (word, frequency))
except IOError:
    sys.exit("Cannot open file: %s" % filename)
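To see what the tokenizing step in this program does without needing a file, here is a small sketch. The sample_lines list is a made-up stand-in for the lines of an input file:

```python
import re
from collections import Counter

# Hypothetical sample lines standing in for the contents of a file.
sample_lines = ["The cat saw the dog.", "The dog didn't care!"]

wordcounter = Counter()
for line in sample_lines:
    # \w+ matches runs of letters, digits and underscores, so punctuation
    # is dropped and "didn't" is split into "didn" and "t".
    wordcounter.update(re.findall(r"\w+", line.lower()))

print(wordcounter.most_common(2))
# [('the', 3), ('dog', 2)]
```

Note that the apostrophe handling is a side effect of the \w+ pattern; a different pattern would be needed if contractions should stay whole.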

Gani Simsek · Answer

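I didn't know there was a Counter object for this kind of task. Here is how I did it back then, along the same lines as your approach. You can do the sorting on a representation of the same dictionary.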

# Takes a list and returns a list of (word, count) pairs sorted by descending count
def countWords(a_list):
    words = {}
    for i in range(len(a_list)):
        item = a_list[i]
        count = a_list.count(item)
        words[item] = count
    return sorted(words.items(), key=lambda item: item[1], reverse=True)

Example:

>>> countWords("the quick red fox jumped over the lazy brown dog".split())
[('the', 2), ('brown', 1), ('lazy', 1), ('jumped', 1), ('over', 1), ('fox', 1), ('dog', 1), ('quick', 1), ('red', 1)]
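Calling a_list.count(item) inside the loop rescans the whole list for every element, so the work grows quadratically with the length of the list. A minimal single-pass sketch with a plain dictionary follows; the function name count_words_single_pass is made up for this illustration:

```python
def count_words_single_pass(a_list):
    """Return (word, count) pairs sorted by count, highest first."""
    words = {}
    for item in a_list:
        # dict.get supplies 0 the first time a word is seen.
        words[item] = words.get(item, 0) + 1
    return sorted(words.items(), key=lambda pair: pair[1], reverse=True)


result = count_words_single_pass(
    "the quick red fox jumped over the lazy brown dog".split())
print(result[0])
# ('the', 2)
```

The order of the words that share a count depends on the dictionary ordering of the Python version in use, so only the first pair is shown.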

user3443599 · Answer

There are a few steps involved in this problem:

Clean up the punctuation.

Sort the array based on frequency.

import operator
import string

def wordCount(nums):
    # lowercase, remove punctuation, then split into words
    nums = nums.lower().translate(str.maketrans("", "", string.punctuation)).split()
    d = {}
    for i in nums:
        if i not in d:
            d[i] = 1
        else:
            d[i] = d[i] + 1
    # sort the (word, count) pairs by count, highest first
    sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True)
    for key, val in sorted_d:
        print(key, val)

wordCount("Hello, number of transaction which happened, for,")