Efficient Term Document Matrix with NLTK

Question

I am trying to create a term document matrix with NLTK and pandas. I wrote the following function:

    def fnDTM_Corpus(xCorpus):
        '''to create a Term Document Matrix from a NLTK Corpus'''
        import pandas as pd
        fd_list = []
        for x in range(0, len(xCorpus.fileids())):
            fd_list.append(nltk.FreqDist(xCorpus.words(xCorpus.fileids()[x])))
        DTM = pd.DataFrame(fd_list, index = xCorpus.fileids())
        DTM.fillna(0,inplace = True)
        return DTM.T

To run it:

    import nltk
    from nltk.corpus import PlaintextCorpusReader

    corpus_root = 'C:/Data/'
    newcorpus = PlaintextCorpusReader(corpus_root, '.*')
    x = fnDTM_Corpus(newcorpus)

It works fine for a few small files in the corpus, but it gives me a MemoryError when I try to run it on a corpus of 4,000 files (about 2 KB each).

Am I missing something?

I am using 32-bit Python (on Windows 7, 64-bit OS, Core Quad CPU, 8 GB RAM). Do I really need to use 64-bit for a corpus of this size?
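As a rough back-of-the-envelope check on the memory question, the size of the dense matrix built by `fnDTM_Corpus` can be estimated from the number of documents and the vocabulary size. The sketch below assumes 8 bytes per cell (missing counts start out as NaN, so pandas stores the columns as float64) and a vocabulary of 50,000 distinct tokens. That vocabulary figure is purely hypothetical and is not taken from the question; `dense_matrix_bytes` is an illustrative helper name.

```python
def dense_matrix_bytes(n_docs, vocab_size, bytes_per_cell=8):
    """Bytes needed for a dense documents x terms matrix."""
    return n_docs * vocab_size * bytes_per_cell


n_docs = 4000            # from the question
assumed_vocab = 50_000   # hypothetical vocabulary size, not measured

size_gib = dense_matrix_bytes(n_docs, assumed_vocab) / 2**30
print(f"{size_gib:.2f} GiB")  # prints "1.49 GiB"
```

Under these assumptions a single copy of the matrix already approaches the address space a 32-bit Windows process gets by default (about 2 GB), before counting the intermediate `FreqDist` objects or the copy made by the transpose. The real figure depends on the actual vocabulary size.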

user1043144 · Accepted Answer

Thanks to Radim and Larsmans. My objective was to have a DTM like the one you get in R tm. I decided to use scikit-learn, partly inspired by this blog entry. This is the code I came up with.

I am posting it here in the hope that someone else will find it useful.

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    def fn_tdm_df(docs, xColNames=None, **kwargs):
        '''Create a term document matrix as a pandas DataFrame.

        **kwargs are passed on to CountVectorizer; if xColNames is given,
        it is used for the DataFrame's column names.'''
        # initialize the vectorizer
        vectorizer = CountVectorizer(**kwargs)
        x1 = vectorizer.fit_transform(docs)
        # create the DataFrame (terms as rows, documents as columns);
        # scikit-learn before 1.0 spells this method get_feature_names()
        df = pd.DataFrame(x1.toarray().transpose(),
                          index=vectorizer.get_feature_names_out())
        if xColNames is not None:
            df.columns = xColNames
        return df

To use it on a list of texts in a directory:

    import os

    DIR = 'C:/Data/'

    def fn_CorpusFromDIR(xDIR):
        '''Create a corpus from a directory.

        Input: directory
        Output: a dictionary with the file names ['ColNames']
                and the text of each file ['docs']'''
        files = os.listdir(xDIR)  # list once so docs and names stay aligned
        docs = []
        for f in files:
            # close each file after reading it
            with open(os.path.join(xDIR, f)) as fh:
                docs.append(fh.read())
        return dict(docs=docs, ColNames=['P_' + f[0:6] for f in files])

To create the DataFrame:

    corpus = fn_CorpusFromDIR(DIR)  # read the directory only once
    d1 = fn_tdm_df(docs=corpus['docs'],
                   xColNames=corpus['ColNames'],
                   stop_words=None,
                   decode_error='replace')  # older scikit-learn called this charset_error

duhaime · Answer

I know the OP wanted to create a tdm in NLTK, but the textmining package (pip install textmining) makes it very simple:

    import textmining

    def termdocumentmatrix_example():
        # Create some very short sample documents
        doc1 = 'John and Bob are brothers.'
        doc2 = 'John went to the store. The store was closed.'
        doc3 = 'Bob went to the store too.'
        # Initialize class to create term-document matrix
        tdm = textmining.TermDocumentMatrix()
        # Add the documents
        tdm.add_doc(doc1)
        tdm.add_doc(doc2)
        tdm.add_doc(doc3)
        # Write out the matrix to a csv file. Note that setting cutoff=1 means
        # that words which appear in 1 or more documents will be included in
        # the output (i.e. every word will appear in the output). The default
        # for cutoff is 2, since we usually aren't interested in words which
        # appear in a single document. For this example we want to see all
        # words however, hence cutoff=1.
        tdm.write_csv('matrix.csv', cutoff=1)
        # Instead of writing out the matrix you can also access its rows directly.
        # Let's print them to the screen.
        for row in tdm.rows(cutoff=1):
            print(row)

    termdocumentmatrix_example()

Output:

    ['and', 'the', 'brothers', 'to', 'are', 'closed', 'bob', 'john', 'was', 'went', 'store', 'too']
    [1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0]
    [0, 2, 0, 1, 0, 1, 0, 1, 1, 1, 2, 0]
    [0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1]
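To make the cutoff rule from the comments above concrete, here is a minimal standard-library sketch of the same idea: count terms per document, then keep only the terms that occur in at least `cutoff` documents. It is not the textmining implementation. The function name `term_document_rows` and the tokenizer (lowercased runs of ASCII letters) are assumptions, and the columns come out in alphabetical order rather than in the order shown above.

```python
import re
from collections import Counter


def term_document_rows(docs, cutoff=2):
    """Yield a header row of terms, then one row of counts per document."""
    # Assumed tokenizer: lowercase the text and take runs of ASCII letters.
    counts = [Counter(re.findall(r"[a-z]+", doc.lower())) for doc in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for c in counts for term in c)
    terms = sorted(t for t, df in doc_freq.items() if df >= cutoff)
    yield terms
    for c in counts:
        yield [c[t] for t in terms]  # a Counter returns 0 for absent terms


docs = ['John and Bob are brothers.',
        'John went to the store. The store was closed.',
        'Bob went to the store too.']

all_rows = list(term_document_rows(docs, cutoff=1))     # every term
shared_rows = list(term_document_rows(docs, cutoff=2))  # terms in 2+ documents
for row in shared_rows:
    print(row)
```

With `cutoff=1` all twelve terms are kept, as in the output above. With `cutoff=2` only the six terms shared by at least two documents remain.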

Alternatively, you can use pandas and sklearn [source]:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ['why hello there', 'omg hello pony', 'she went there? omg']
    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    # scikit-learn before 1.0 spells this method get_feature_names()
    df = pd.DataFrame(X.toarray(), columns=vec.get_feature_names_out())
    print(df)

Output:

       hello  omg  pony  she  there  went  why
    0      1    0     0    0      1     0    1
    1      1    1     1    0      0     0    0
    2      0    1     0    1      1     1    0

Ajay Ohri · Answer

An alternative approach using tokens and a DataFrame:

    import nltk
    import pandas as pd
    from urllib import request

    # run nltk.download() first to get the tokenizer data

    url = "http://www.gutenberg.org/files/2554/2554-0.txt"
    response = request.urlopen(url)
    raw = response.read().decode('utf8')
    type(raw)

    tokens = nltk.word_tokenize(raw)
    type(tokens)
    tokens[1:10]
    # ['Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

    tokens2 = pd.DataFrame(tokens)
    tokens2.columns = ['Words']
    tokens2.head()
    #        Words
    # 0        The
    # 1    Project
    # 2  Gutenberg
    # 3      EBook
    # 4         of

    tokens2.Words.value_counts().head()
    # ,      16178
    # .       9589
    # the     7436
    # and     6284
    # to      5278