How do I compute a word-word co-occurrence matrix with sklearn?

Question

I am looking for a module in sklearn that lets me derive a word-word co-occurrence matrix.

I can get the document-term matrix, but I don't know how to obtain a word-word matrix of co-occurrences.

titipata · Answer

Here is an example solution using CountVectorizer in scikit-learn. Following this post, you can obtain the word-word co-occurrence matrix with a simple matrix multiplication.

from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book', 'this cat good', 'cat good shit']
count_model = CountVectorizer(ngram_range=(1, 1))  # default unigram model
X = count_model.fit_transform(docs)
# X[X > 0] = 1  # run this line if you don't want extra within-text co-occurrence (see below)
Xc = (X.T @ X)  # co-occurrence matrix in sparse format
Xc.setdiag(0)  # sometimes you want to set a word's co-occurrence with itself to 0
print(Xc.todense())  # print the matrix in dense format

You can also look up the dictionary of words in count_model:

count_model.vocabulary_
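Since `vocabulary_` maps each term to its column index, sorting it by index recovers the row and column labels of the co-occurrence matrix. The following is a minimal sketch that reuses the documents from the example above; the helper name `terms_in_column_order` is made up for illustration and is not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book', 'this cat good', 'cat good shit']

count_model = CountVectorizer(ngram_range=(1, 1))
X = count_model.fit_transform(docs)
Xc = (X.T @ X).toarray()  # dense word-word co-occurrence counts


def terms_in_column_order(vocabulary):
    """Hypothetical helper: list the terms sorted by their column index."""
    return [term for term, _ in sorted(vocabulary.items(), key=lambda kv: kv[1])]


terms = terms_in_column_order(count_model.vocabulary_)

# Look up a single pair by name instead of by raw index.
i, j = terms.index('cat'), terms.index('good')
cat_good = int(Xc[i, j])
print(terms)
print('cat/good co-occurrence:', cat_good)
```

With the labels in hand, any entry of the matrix can be read by word rather than by position.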

Or, if you want to normalize by the diagonal entries (see the answer in the previous post):

import scipy.sparse as sp

Xc = (X.T @ X)
g = sp.diags(1. / Xc.diagonal())
Xc_norm = g @ Xc  # normalized co-occurrence matrix
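To make the effect of the diagonal normalization concrete, the sketch below applies it to presence/absence counts. That choice is an assumption made here so the numbers are easy to read; it is not part of the answer's code. Under that assumption the diagonal holds the number of documents containing each word, so row i of the result is the fraction of documents containing word i that also contain word j.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book', 'this cat good', 'cat good shit']

# binary=True is an assumption for this sketch: counts become 0 or 1.
model = CountVectorizer(binary=True)
X = model.fit_transform(docs)

Xc = X.T @ X
g = sp.diags(1.0 / Xc.diagonal())  # 1 / (number of documents containing each word)
Xc_norm = (g @ Xc).toarray()       # each row divided by its own diagonal entry

vocab = model.vocabulary_
# 'book' occurs in one document, and that document also contains 'this'.
row_book_col_this = float(Xc_norm[vocab['book'], vocab['this']])
# 'this' occurs in two documents, only one of which contains 'book'.
row_this_col_book = float(Xc_norm[vocab['this'], vocab['book']])
print(row_book_col_this, row_this_col_book)
```

Note that the normalized matrix is no longer symmetric, because each row is scaled by a different diagonal value.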

One extra note, following @Federico Caccia's answer: if you don't want spurious co-occurrences that come from within a single text, set every count greater than 1 to 1, e.g.

X[X > 0] = 1  # do this first, before computing the co-occurrence matrix
Xc = (X.T @ X)
...
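As an alternative to overwriting `X` in place, `CountVectorizer` accepts a `binary=True` option that sets every non-zero count to 1 when the documents are vectorized. The sketch below assumes document-level presence is what is wanted, and reuses the earlier documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['this this this book', 'this cat good', 'cat good shit']

# binary=True records presence/absence instead of raw term counts.
binary_model = CountVectorizer(binary=True)
Xb = binary_model.fit_transform(docs)

Xc_binary = (Xb.T @ Xb).toarray()

vocab = binary_model.vocabulary_
# 'this' appears three times in the first document, but the pair
# this/book is counted once because both words share one document.
this_book = int(Xc_binary[vocab['this'], vocab['book']])
# The diagonal now holds the number of documents containing each word.
this_doc_freq = int(Xc_binary[vocab['this'], vocab['this']])
print(this_book, this_doc_freq)
```

With raw counts the same this/book entry would be 3, which is the within-text inflation the note above is about.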

Federico Caccia · Answer

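@titipata I don't think your solution is a good metric, because it gives the same weight to real co-occurrences and to spurious ones. For example, suppose there are 5 texts and the words apple and house appear with these frequencies: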

text1: apple: 10, house: 1

text2: apple: 10, house: 0

text3: apple: 10, house: 0

text4: apple: 10, house: 0

text5: apple: 10, house: 0

The co-occurrence we would measure is 10 * 1 + 10 * 0 + 10 * 0 + 10 * 0 + 10 * 0 = 10, but it is merely spurious.

There is also another important case, such as the following:

text1: apple: 1, banana: 1

text2: apple: 1, banana: 1

text3: apple: 1, banana: 1

text4: apple: 1, banana: 1

text5: apple: 1, banana: 1

Here we get a co-occurrence of only 1 * 1 + 1 * 1 + 1 * 1 + 1 * 1 + 1 * 1 = 5, even though this co-occurrence is the one that really matters.

@Guiem Bosch In your case, co-occurrences are measured only when the two words are contiguous.

I propose using something like @titipata's solution to compute the matrix:

Xc = (Y.T * Y) # this is co-occurrence matrix in sparse csr format

where, instead of X, we use a matrix Y that has ones in the positions where X is greater than 0 and zeros everywhere else.

With this, the first example gives co-occurrence: 1 * 1 + 1 * 0 + 1 * 0 + 1 * 0 + 1 * 0 = 1, and the second example gives co-occurrence: 1 * 1 + 1 * 1 + 1 * 1 + 1 * 1 + 1 * 1 = 5, which is what we are really looking for.
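The arithmetic in the two examples can be checked with a few lines of NumPy. This is only a sketch: the arrays encode the frequencies listed above, and the function name `cooccurrence` is made up for illustration.

```python
import numpy as np

# Each array has one row per text and one column per word.
apple_house = np.array([[10, 1], [10, 0], [10, 0], [10, 0], [10, 0]])
apple_banana = np.array([[1, 1]] * 5)


def cooccurrence(counts, binarize):
    """Return the off-diagonal entry of Y.T @ Y for a two-word count matrix."""
    Y = (counts > 0).astype(int) if binarize else counts
    return int((Y.T @ Y)[0, 1])


raw_house = cooccurrence(apple_house, binarize=False)
bin_house = cooccurrence(apple_house, binarize=True)
raw_banana = cooccurrence(apple_banana, binarize=False)
bin_banana = cooccurrence(apple_banana, binarize=True)

print('apple/house  raw:', raw_house, 'binarized:', bin_house)
print('apple/banana raw:', raw_banana, 'binarized:', bin_banana)
```

With raw counts apple/house scores higher than apple/banana; after binarization the ordering flips, which matches the argument above.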

Guiem Bosch · Answer

You can use the ngram_range parameter of CountVectorizer or TfidfVectorizer.

Code example:

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2)) # by saying 2,2 you are telling you only want pairs of 2 words
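To see which features a bigram-only vectorizer learns, it can be fitted on a couple of short strings. The sample sentences below are chosen for illustration; the sketch relies only on the default tokenizer settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

samples = ['awesome unicorns are awesome', 'batman forever and ever']

bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
counts = bigram_vectorizer.fit_transform(samples)

# Each learned feature is a pair of adjacent tokens.
bigrams = sorted(bigram_vectorizer.vocabulary_)
print(bigrams)
print(counts.toarray())
```

Only adjacent word pairs become features, so words that share a document without being neighbours are not counted together.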

If you want to state explicitly which word co-occurrences to count, use the vocabulary parameter, i.e. vocabulary = {'awesome unicorns':0, 'batman forever':1}

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

Self-explanatory, ready-to-use code with predefined word-word co-occurrences. In this case we track the co-occurrences of awesome unicorns and batman forever.

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

samples = ['awesome unicorns are awesome', 'batman forever and ever', 'I love batman forever']
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), vocabulary={'awesome unicorns': 0, 'batman forever': 1})
co_occurrences = bigram_vectorizer.fit_transform(samples)
print('Printing sparse matrix:', co_occurrences)
print('Printing dense matrix (cols are vocabulary keys 0-> "awesome unicorns", 1-> "batman forever")', co_occurrences.todense())
sum_occ = np.sum(co_occurrences.todense(), axis=0)  # column totals across all samples
print('Sum of word-word occurrences:', sum_occ)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0 or later
print('Pretty printing of co_occurrences count:', list(zip(bigram_vectorizer.get_feature_names_out(), np.array(sum_occ)[0].tolist())))

The final output is ('awesome unicorns', 1), ('batman forever', 2), which corresponds exactly to the data in samples.