Counting n-gram frequency in Python with NLTK

Question

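I have the following code. I know that I can use the apply_freq_filter function to filter out collocations that occur fewer than a given number of times. However, I don't know how to get the frequencies of all the n-gram tuples (bigrams, in my case) in a document before I decide what frequency to use for filtering. As you can see, I am using the nltk collocations class.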

    import nltk
    from nltk.collocations import *

    # Read the whole file into one string
    line = ""
    open_file = open('a_text_file', 'r')
    for val in open_file:
        line += val
    tokens = line.split()

    bigram_measures = nltk.collocations.BigramAssocMeasures()
    Finder = BigramCollocationFinder.from_words(tokens)
    # Drop bigrams that occur fewer than 3 times
    Finder.apply_freq_filter(3)
    # Python 3 print() call; the original post used the Python 2 print statement
    print(Finder.nbest(bigram_measures.pmi, 100))

Rkz · Accepted Answer

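The Finder.ngram_fd.viewitems() function works. (viewitems() exists only in Python 2; in Python 3, use Finder.ngram_fd.items() instead.)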
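As a minimal sketch of that idea in Python 3, the snippet below builds a finder and lists every bigram with its count before any filter is applied. The sample sentence is a hypothetical stand-in for the contents of 'a_text_file', and splitting on whitespace mirrors the tokenization in the question.

```python
from nltk.collocations import BigramCollocationFinder

# Hypothetical sample text standing in for the contents of 'a_text_file'.
text = "the cat sat on the mat and the cat ate the fish on the mat"
tokens = text.split()

finder = BigramCollocationFinder.from_words(tokens)

# ngram_fd is a FreqDist that maps each bigram tuple to its count.
bigram_counts = dict(finder.ngram_fd.items())

# Print bigrams from most to least frequent to help pick a cutoff
# for apply_freq_filter.
for bigram, count in finder.ngram_fd.most_common():
    print(bigram, count)
```

Looking at the counts first makes it easier to choose a threshold that keeps enough bigrams for the scoring step.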

Ram Narasimhan · Answer

NLTK comes with its own bigrams generator, as well as a convenient FreqDist() function.

    import nltk

    with open('a_text_file') as f:
        raw = f.read()

    # word_tokenize needs NLTK's Punkt tokenizer data to be downloaded
    tokens = nltk.word_tokenize(raw)

    # Create your bigrams
    bgs = nltk.bigrams(tokens)

    # Compute the frequency distribution for all the bigrams in the text
    fdist = nltk.FreqDist(bgs)
    for k, v in fdist.items():
        print(k, v)

Once you have access to the bigrams and their frequency distribution, you can filter according to your needs.
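For instance, one way to do that filtering might look like the sketch below. The sample sentence and the `min_count` threshold are hypothetical, chosen only for illustration, and whitespace splitting is used so that no tokenizer data is required.

```python
import nltk

# Hypothetical sample tokens; in practice these would come from your file.
tokens = "the cat sat on the mat and the cat ate the fish on the mat".split()

# Count every bigram in the token stream.
fdist = nltk.FreqDist(nltk.bigrams(tokens))

# Hypothetical cutoff: keep only bigrams seen at least this many times.
min_count = 2
frequent = {bg: c for bg, c in fdist.items() if c >= min_count}

print(frequent)
```

Because the full distribution stays available in `fdist`, the threshold can be adjusted and the filter rerun without recounting.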

Hope that helps.

Vahab · Answer

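    from nltk import FreqDist
    from nltk.util import ngrams

    def compute_freq():
        bigramfdist = FreqDist()
        threeramfdist = FreqDist()  # declared in the original post but never filled
        with open('corpus.txt', 'r') as textfile:
            for line in textfile:
                # Skip empty lines (a lone newline has length 1)
                if len(line) > 1:
                    tokens = line.strip().split(' ')
                    bigrams = ngrams(tokens, 2)
                    # Add this line's bigrams to the running counts
                    bigramfdist.update(bigrams)
        # Return the counts so the caller can inspect them
        return bigramfdist

    bigramfdist = compute_freq()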