Getting bigrams and trigrams in Gensim Word2vec

Question

I am currently using unigrams in my Word2vec model, as follows.

    def review_to_sentences(review, tokenizer, remove_stopwords=False):
        # Returns a list of sentences, where each sentence is a list of words
        #
        # NLTK tokenizer to split the paragraph into sentences
        raw_sentences = tokenizer.tokenize(review.strip())
        sentences = []
        for raw_sentence in raw_sentences:
            # If a sentence is empty, skip it
            if len(raw_sentence) > 0:
                # Otherwise, call review_to_wordlist to get a list of words
                sentences.append(review_to_wordlist(raw_sentence, remove_stopwords))
        # Return the list of sentences (each sentence is a list of words,
        # so this returns a list of lists)
        return sentences

However, this way I miss the important bigrams and trigrams in my dataset.

E.g.,
"team work" -> I am currently getting it as "team", "work"
"New York" -> I am currently getting it as "New", "York"

So I want to capture the important bigrams, trigrams, etc. in my dataset and feed them into my Word2vec model.

I am new to word2vec and am struggling with how to do this. Please help me.

nitheism · Accepted Answer

First, you need to use gensim's Phrases class to get bigrams. It works as shown in the documentation:

    >>> bigram = Phraser(phrases)
    >>> sent = [u'the', u'mayor', u'of', u'new', u'york', u'was', u'there']
    >>> print(bigram[sent])
    [u'the', u'mayor', u'of', u'new_york', u'was', u'there']

To get trigrams and so on, apply Phrases again to the output of the bigram model you already have. For example:

trigram_model = Phrases(bigram_sentences)
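Why a second pass produces trigrams can be seen in a small pure-Python sketch. This is not gensim's implementation: `merge_phrases` is a hypothetical helper, and the phrase sets are picked by hand, whereas Phrases learns them from co-occurrence counts. The point is only that the second pass sees `new_york` as a single token, so pairing it with its neighbour yields a three-word phrase.

```python
def merge_phrases(tokens, phrases, delimiter="_"):
    """Greedily join adjacent token pairs that appear in `phrases`."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append(delimiter.join(pair))
            i += 2  # both tokens of the pair are consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged


sentence = ["i", "love", "new", "york", "city"]

# Pass 1: pairs of plain unigrams (hand-picked phrase set, for illustration only).
bigram_pass = merge_phrases(sentence, {("new", "york")})

# Pass 2: a pair may now contain an already-merged token, which gives a trigram.
trigram_pass = merge_phrases(bigram_pass, {("new_york", "city")})

print(bigram_pass)   # ['i', 'love', 'new_york', 'city']
print(trigram_pass)  # ['i', 'love', 'new_york_city']
```

Repeating the pass once more would, in the same way, allow phrases of up to four words.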

There is also a good notebook and video explaining how to use it: notebook, video

The most important part of it is how to use it on real sentences, like this:

    from gensim.models import Phrases

    # create the bigram model
    bigram_model = Phrases(unigram_sentences)

    # apply the trained model to every sentence, keeping each result as a list of tokens
    bigram_sentences = [bigram_model[unigram_sentence] for unigram_sentence in unigram_sentences]

    # get a trigram model out of the bigram sentences
    trigram_model = Phrases(bigram_sentences)

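I hope this helps, but next time please give more information about what you are using and so on.

P.S.: Now that you have edited the question: your code does nothing to detect bigrams, it only splits the text. You have to use Phrases to get words like New York as bigrams.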

brb · Answer

    from gensim.models import Phrases
    from gensim.models.phrases import Phraser

    documents = ["the mayor of new york was there",
                 "machine learning can be useful sometimes",
                 "new york mayor was present"]

    sentence_stream = [doc.split(" ") for doc in documents]
    print(sentence_stream)

    # delimiter=b' ' is the bytes form used by gensim 3.x; gensim 4.x expects a str (delimiter=' ')
    bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
    bigram_phraser = Phraser(bigram)
    print(bigram_phraser)

    for sent in sentence_stream:
        tokens_ = bigram_phraser[sent]
        print(tokens_)

Vivek Ananthan · Answer

Phrases and Phraser are what you are looking for.

    bigram = gensim.models.Phrases(data_words, min_count=1, threshold=10)  # higher threshold, fewer phrases
    trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
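To see why a higher threshold gives fewer phrases, here is a rough pure-Python sketch of the default scoring formula described in the gensim documentation (based on Mikolov et al.): `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`. The helper `phrase_scores` is hypothetical, and counting unigram and bigram entries together for `vocab_size` is my assumption about how gensim sizes its vocabulary, so exact scores may differ from what the library reports.

```python
from collections import Counter


def phrase_scores(sentences, min_count=1):
    """Score every adjacent word pair in a list of tokenized sentences."""
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    # Assumption: the vocabulary size counts unigram and bigram entries together.
    vocab_size = len(unigrams) + len(bigrams)
    return {
        (a, b): (n - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n in bigrams.items()
    }


sentences = [
    "the mayor of new york was there".split(),
    "machine learning can be useful sometimes".split(),
    "new york mayor was present".split(),
]

scores = phrase_scores(sentences, min_count=1)
threshold = 2
kept = {pair for pair, score in scores.items() if score > threshold}

print(scores[("new", "york")])  # 7.0
print(kept)                     # {('new', 'york')}
```

In this toy corpus only "new york" occurs more than once, so it is the only pair whose score is above zero. Raising `threshold` to 10 would drop it as well, which is the "higher threshold, fewer phrases" effect.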

Once you are done adding vocabulary, use Phraser for faster access and more efficient memory usage. It is not mandatory, but it is useful.

    bigram_mod = gensim.models.phrases.Phraser(bigram)
    trigram_mod = gensim.models.phrases.Phraser(trigram)

Thanks,