How to print the LDA topic models from gensim? (Python)

Question

Using gensim, I was able to extract topics from a set of documents with LSA, but how do I access the topics generated by the LDA model?

lda.print_topics(10) returns None, so iterating over the result of print_topics() raised the following error:

Traceback (most recent call last):
  File "/home/alvas/workspace/XLINGTOP/xlingtop.py", line 93, in <module>
    for top in lda.print_topics(2):
TypeError: 'NoneType' object is not iterable

Code:

from gensim import corpora, models, similarities
from gensim.models import hdpmodel, ldamodel
from itertools import izip

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# corpus_tfidf was not defined in the posted snippet; assumed to be the TF-IDF view of corpus
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

# I can print out the topics for LSA
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus]

for l, t in izip(corpus_lsi, corpus):
    print l, "#", t
print
for top in lsi.print_topics(2):
    print top

# I can print out the documents and which is the most probable topics for each doc.
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=50)
corpus_lda = lda[corpus]

for l, t in izip(corpus_lda, corpus):
    print l, "#", t
print

# But I am unable to print out the topics, how should i do it?
for top in lda.print_topics(10):
    print top

alvas · Answer

After some messing around, it seems that print_topics(numoftopics) for the ldamodel has a bug. So my workaround is to use print_topic(topicid):

>>> print lda.print_topics()
None
>>> for i in range(lda.num_topics):  # range(0, num_topics-1) would skip the last topic
...     print lda.print_topic(i)
0.083*response + 0.083*interface + 0.083*time + 0.083*human + 0.083*user + 0.083*survey + 0.083*computer + 0.083*eps + 0.083*trees + 0.083*system
...

user2597000 · Answer

I think the syntax of show_topics has changed over time:

show_topics(num_topics=10, num_words=10, log=False, formatted=True)

For num_topics topics, return the num_words most significant words (10 words per topic by default).

The topics are returned as a list: a list of strings if formatted is True, or a list of (probability, word) 2-tuples if it is False.

If log is True, the result is also written to the log.

Unlike LSA, there is no natural ordering between the topics in LDA. The returned subset of num_topics <= self.num_topics topics is therefore arbitrary and may change between two LDA training runs.
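Because the return shape has changed between gensim versions, it can help to normalize it before using it. The sketch below is only an illustration: the helper names normalize_topic and normalize_show_topics are hypothetical, and the two layouts it handles are the ones that appear in this thread, namely bare lists of (probability, word) pairs and (topic_id, [(word, probability), ...]) tuples. Check the documentation of your installed version for what it actually returns.

```python
import numbers


def normalize_topic(pairs):
    """Return [(word, weight), ...] regardless of the order used inside each pair."""
    normalized = []
    for first, second in pairs:
        if isinstance(first, str):
            # Layout used by later answers in this thread: (word, weight).
            normalized.append((first, float(second)))
        else:
            # Layout described in the docstring above: (weight, word).
            normalized.append((second, float(first)))
    return normalized


def normalize_show_topics(result):
    """Map a show_topics(formatted=False) result to {topic_id: [(word, weight), ...]}."""
    topics = {}
    for position, entry in enumerate(result):
        if len(entry) == 2 and isinstance(entry[0], numbers.Integral):
            # Entry carries its own topic id: (topic_id, pairs).
            topic_id, pairs = entry
        else:
            # Entry is just the list of pairs; fall back to its position in the
            # returned list (an assumption, not necessarily the model's topic id).
            topic_id, pairs = position, entry
        topics[topic_id] = normalize_topic(pairs)
    return topics
```

With a trained model this would be called as normalize_show_topics(lda.show_topics(formatted=False)), after which every topic can be handled the same way whichever layout the library produced.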

zanbri · Answer

Are you using any logging? print_topics prints to the log file, as stated in the docs.

As @mac389 says, lda.show_topics() is the way to print to the screen.
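A minimal sketch of both routes, assuming a trained model object that exposes show_topics(num_topics=..., num_words=...) as quoted earlier in this thread. The helper names enable_console_logging and print_topics_to_screen are made up for illustration; the logging format string is an ordinary Python logging configuration, not something specific to this thread.

```python
import logging


def enable_console_logging(level=logging.INFO):
    """Route log records at INFO and above to the console.

    Anything a library writes through the logging module, such as the
    topic dump discussed above, then becomes visible.
    """
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=level,
    )


def print_topics_to_screen(model, num_topics=10, num_words=10):
    """Print each entry returned by model.show_topics() and return the list."""
    topics = model.show_topics(num_topics=num_topics, num_words=num_words)
    for topic in topics:
        print(topic)
    return topics
```

Calling enable_console_logging() once at the start of the script and then lda.print_topics(10) should make the logged topics appear, while print_topics_to_screen(lda) prints them directly without relying on logging at all.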

xu2mao · Answer

You can use:

for i in lda_model.show_topics():
    print i[0], i[1]

Samuel Nde · Answer

I think it is always more helpful to see the topics as a list of words. The following code snippet helps achieve that goal. I assume you already have an LDA model called lda_model.

for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))

In the above code, I chose to display the first 30 words belonging to each topic. For simplicity, only the first two topics I get are shown.

Topic: 0
Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
Topic: 1
Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']

I don't really like the way the above topics look, so I usually modify the code as follows:

for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, '|'.join([w[0] for w in topic])))

...and the output (first two topics shown) looks like this:

Topic: 0
Words: associate|incident|time|task|pain|amcare|work|ppe|train|proper|report|standard|pmv|level|perform|wear|date|factor|overtime|location|area|yes|new|treatment|start|stretch|assign|condition|participate|environmental
Topic: 1
Words: work|associate|cage|aid|shift|leave|area|eye|incident|aider|hit|pit|manager|return|start|continue|pick|call|come|right|take|report|lead|break|paramedic|receive|get|inform|room|head

Shirish Kumar · Answer

Here is sample code to print topics:

import pickle

from gensim import corpora, models


def ExtractTopics(filename, numTopics=5):
    # filename is a pickle file where I have lists of lists containing bag of words
    with open(filename, "rb") as f:
        texts = pickle.load(f)

    # generate dictionary
    dictionary = corpora.Dictionary(texts)

    # remove words with low freq. 3 is an arbitrary number I have picked here
    low_occurrence_ids = [tokenid for tokenid, docfreq in dictionary.dfs.iteritems()
                          if docfreq <= 3]
    dictionary.filter_tokens(low_occurrence_ids)
    dictionary.compactify()
    corpus = [dictionary.doc2bow(t) for t in texts]

    # Generate LDA Model (no id2word is passed, so topics come back as token ids)
    lda = models.ldamodel.LdaModel(corpus, num_topics=numTopics)

    # We print the topics
    topics = lda.show_topics(num_topics=numTopics, formatted=False, topn=20)
    for i, topic in enumerate(topics, 1):
        print "Topic #" + str(i) + ":",
        for p, token_id in topic:
            print dictionary[int(token_id)],
        print ""

Maneet · Answer

I recently ran into a similar issue while working with Python 3 and Gensim 2.3.0. print_topics() and show_topics() raised no error but also printed nothing. It turns out that show_topics() returns a list.

topic_list = lda.show_topics()
print(topic_list)

Feng Mai · Answer

You can also export the top words of each topic to a CSV file. topn controls how many words per topic are exported.

import pandas as pd

top_words_per_topic = []
for t in range(lda_model.num_topics):
    top_words_per_topic.extend([(t, ) + x for x in lda_model.show_topic(t, topn=5)])

pd.DataFrame(top_words_per_topic, columns=['Topic', 'Word', 'P']).to_csv("top_words.csv")

The CSV file has the following format:

Topic  Word  P
0      w1    0.004437
0      w2    0.003553
0      w3    0.002953
0      w4    0.002866
0      w5    0.008813
1      w6    0.003393
1      w7    0.003289
1      w8    0.003197
...

Shivom Sharma · Answer

This code works fine, but I want to know the topic name instead of Topic: 0 and Topic: 1. How do I know which topic a given word belongs to?

for idx, topic in lda_model.show_topics(formatted=False, num_words=30):
    print('Topic: {} \nWords: {}'.format(idx, [w[0] for w in topic]))

Topic: 0
Words: ['associate', 'incident', 'time', 'task', 'pain', 'amcare', 'work', 'ppe', 'train', 'proper', 'report', 'standard', 'pmv', 'level', 'perform', 'wear', 'date', 'factor', 'overtime', 'location', 'area', 'yes', 'new', 'treatment', 'start', 'stretch', 'assign', 'condition', 'participate', 'environmental']
Topic: 1
Words: ['work', 'associate', 'cage', 'aid', 'shift', 'leave', 'area', 'eye', 'incident', 'aider', 'hit', 'pit', 'manager', 'return', 'start', 'continue', 'pick', 'call', 'come', 'right', 'take', 'report', 'lead', 'break', 'paramedic', 'receive', 'get', 'inform', 'room', 'head']
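An LDA topic is identified only by its index; a descriptive name has to be chosen by a person after reading the top words. Finding which topics a word appears in can be done from the same show_topics(formatted=False) output by inverting it. The following is a rough sketch, assuming the (topic_id, [(word, weight), ...]) layout used in the snippet above; build_word_index is a hypothetical helper and the weights in sample_topics are invented for illustration.

```python
from collections import defaultdict


def build_word_index(topics):
    """Map each word to the (topic_id, weight) pairs it appears in, heaviest first."""
    index = defaultdict(list)
    for topic_id, pairs in topics:
        for word, weight in pairs:
            index[word].append((topic_id, float(weight)))
    for entries in index.values():
        entries.sort(key=lambda item: item[1], reverse=True)
    return dict(index)


# Illustrative values only; in practice pass
# lda_model.show_topics(formatted=False, num_words=30) instead.
sample_topics = [
    (0, [("associate", 0.03), ("incident", 0.02), ("work", 0.01)]),
    (1, [("work", 0.04), ("associate", 0.02), ("cage", 0.01)]),
]

word_index = build_word_index(sample_topics)
print(word_index["work"])  # prints [(1, 0.04), (0, 0.01)]
```

A word that sits in the top list of several topics shows up with one entry per topic, so the first entry is the topic in which it carries the most weight. Words outside the top num_words of every topic are absent from the index.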

Nikita sharma · Answer

Use Gensim itself to clean up its own topic format.

from gensim.parsing.preprocessing import preprocess_string, strip_punctuation, strip_numeric

lda_topics = lda.show_topics(num_words=5)

topics = []
filters = [lambda x: x.lower(), strip_punctuation, strip_numeric]

for topic in lda_topics:
    print(topic)
    topics.append(preprocess_string(topic[1], filters))

print(topics)

Output:

(0, '0.020*"business" + 0.018*"data" + 0.012*"experience" + 0.010*"learning" + 0.008*"analytics"')
(1, '0.027*"data" + 0.020*"experience" + 0.013*"business" + 0.010*"role" + 0.009*"science"')
(2, '0.026*"data" + 0.016*"experience" + 0.012*"learning" + 0.011*"machine" + 0.009*"business"')
(3, '0.028*"data" + 0.015*"analytics" + 0.015*"experience" + 0.008*"business" + 0.008*"skills"')
(4, '0.014*"data" + 0.009*"learning" + 0.009*"machine" + 0.009*"business" + 0.008*"experience"')

[
 ['business', 'data', 'experience', 'learning', 'analytics'],
 ['data', 'experience', 'business', 'role', 'science'],
 ['data', 'experience', 'learning', 'machine', 'business'],
 ['data', 'analytics', 'experience', 'business', 'skills'],
 ['data', 'learning', 'machine', 'business', 'experience']
]
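The filters above discard the weights. If the weights are needed as well, one possible alternative is to parse the formatted string with a regular expression. This sketch assumes the weight*"word" layout shown in the output above; TERM_PATTERN and parse_topic_string are hypothetical names, and unquoted terms such as 0.083*response from older output would not match this pattern.

```python
import re

# Matches one term such as 0.020*"business" in a formatted topic string.
TERM_PATTERN = re.compile(r'(-?\d+\.\d+)\*"([^"]+)"')


def parse_topic_string(topic_string):
    """Return [(word, weight), ...] parsed from a formatted topic string."""
    return [(word, float(weight)) for weight, word in TERM_PATTERN.findall(topic_string)]


# First topic copied from the output above.
formatted_topic = (
    0,
    '0.020*"business" + 0.018*"data" + 0.012*"experience" '
    '+ 0.010*"learning" + 0.008*"analytics"',
)

topic_id, terms = formatted_topic[0], parse_topic_string(formatted_topic[1])
print(topic_id, [word for word, _ in terms])
```

When the model object is still available, calling show_topics(formatted=False) avoids the string parsing altogether; the regular expression is mainly useful for output that has already been saved as text.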