Applying the spaCy parser to a Pandas DataFrame w/ multiprocessing

Question

Say I have a dataset like this:

    iris = pd.DataFrame(sns.load_dataset('iris'))

I can use spaCy and .apply to parse a string column into tokens (my real dataset has more than one word/token per entry, of course):

    import spacy  # (I have version 1.8.2)

    nlp = spacy.load('en')
    iris['species_parsed'] = iris['species'].apply(nlp)

Result:

       sepal_length  ...  species  species_parsed
    0           1.4  ...   setosa        (setosa)
    1           1.4  ...   setosa        (setosa)
    2           1.3  ...   setosa        (setosa)

I can also use this handy multiprocessing function (thanks to this blogpost) to run almost any apply function on a dataframe in parallel:

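    from multiprocessing import Pool, cpu_count

    import numpy as np
    import pandas as pd

    def parallelize_dataframe(df, func, num_partitions):
        # Split the dataframe into chunks, one per worker process
        df_split = np.array_split(df, num_partitions)
        pool = Pool(num_partitions)
        # Apply func to each chunk in parallel, then stitch the results back together
        df = pd.concat(pool.map(func, df_split))
        pool.close()
        pool.join()
        return df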

For example:

    def my_func(df):
        df['length_of_Word'] = df['species'].apply(lambda x: len(x))
        return df

    num_cores = cpu_count()
    iris = parallelize_dataframe(iris, my_func, num_cores)

Result:

       sepal_length  species  length_of_Word
    0           5.1   setosa               6
    1           4.9   setosa               6
    2           4.7   setosa               6

...But for some reason, I can't apply the spaCy parser to a dataframe using multiprocessing this way:

    def add_parsed(df):
        df['species_parsed'] = df['species'].apply(nlp)
        return df

    iris = parallelize_dataframe(iris, add_parsed, num_cores)

Result:

       sepal_length  species  length_of_Word  species_parsed
    0           5.1   setosa               6              ()
    1           4.9   setosa               6              ()
    2           4.7   setosa               6              ()

Is there some other way to do this? I love spaCy for NLP, but I have a lot of text data and would like to parallelize some of the processing functions, and this is where I got stuck.

Ed Rushton · Accepted Answer

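spaCy is highly optimized and does the multiprocessing for you. As a result, I think your best bet is to take the data out of the DataFrame and pass it to the spaCy pipeline as a list, rather than trying to use .apply directly.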

You then need to collate the results of the parse and put them back into the DataFrame.

So, in your example, you could use something like:

    tokens = []
    lemma = []
    pos = []

    for doc in nlp.pipe(df['species'].astype('unicode').values, batch_size=50, n_threads=3):
        if doc.is_parsed:
            tokens.append([n.text for n in doc])
            lemma.append([n.lemma_ for n in doc])
            pos.append([n.pos_ for n in doc])
        else:
            # We want to make sure that the lists of parsed results have the
            # same number of entries of the original Dataframe, so add some
            # blanks in case the parse fails
            tokens.append(None)
            lemma.append(None)
            pos.append(None)

    df['species_tokens'] = tokens
    df['species_lemma'] = lemma
    df['species_pos'] = pos

This approach works fine on small datasets, but it eats up memory, so it is not great if you want to process huge amounts of text.
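If memory does become a problem, one possible workaround is to feed the texts through the pipeline in chunks and write each chunk's results out immediately, instead of accumulating everything in lists. The following is only a rough sketch under some assumptions: `iter_chunks`, `parse_in_chunks` and `whitespace_pipe` are hypothetical helper names, and `whitespace_pipe` is a plain `str.split` stand-in so the snippet runs without a spaCy model installed. In real use you would presumably pass `nlp.pipe` as the `pipe` argument.

```python
import csv
import io
from itertools import islice


def iter_chunks(texts, chunk_size):
    """Yield successive lists of at most chunk_size items from texts."""
    it = iter(texts)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def parse_in_chunks(texts, pipe, chunk_size, writer):
    """Parse texts chunk by chunk and write each row out right away.

    pipe is any callable that takes a list of strings and yields one
    iterable of tokens per string (nlp.pipe has this shape).
    Only one chunk of parsed results is held in memory at a time.
    """
    n_rows = 0
    for chunk in iter_chunks(texts, chunk_size):
        for text, doc in zip(chunk, pipe(chunk)):
            # Store tokens as a space-joined string; str(tok) also works
            # for spaCy tokens, where it returns the token text.
            writer.writerow([text, " ".join(str(tok) for tok in doc)])
            n_rows += 1
    return n_rows


def whitespace_pipe(texts):
    """Hypothetical stand-in for nlp.pipe: splits on whitespace only."""
    for text in texts:
        yield text.split()


# Demo: write to an in-memory buffer; a real run would use an open file.
buffer = io.StringIO()
n = parse_in_chunks(
    ["setosa", "iris versicolor", "virginica"],
    whitespace_pipe,
    chunk_size=2,
    writer=csv.writer(buffer),
)
```

The trade-off is that the parsed results end up on disk rather than in a DataFrame column, so you would have to read them back (possibly also in chunks) for any later step.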