Using pre-trained word2vec with LSTM for word generation

Question

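LSTM/RNN can be used for text generation. This shows how to use pre-trained GloVe word embeddings in a Keras model.

How can I use pre-trained Word2Vec word embeddings with a Keras LSTM model? This post did help.
How can I predict/generate the next word when the model is given a sequence of words as its input?

Sample approach tried: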

    # Sample code to prepare Word2vec word embeddings
    import gensim
    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]
    sentences = [[word for word in document.lower().split()] for document in documents]

    Word_model = gensim.models.Word2Vec(sentences, size=200, min_count = 1, window = 5)

    # Code tried to prepare LSTM model for word generation
    from keras.layers.recurrent import LSTM
    from keras.layers.embeddings import Embedding
    from keras.models import Model, Sequential
    from keras.layers import Dense, Activation

    embedding_layer = Embedding(input_dim=Word_model.syn0.shape[0], output_dim=Word_model.syn0.shape[1], weights=[Word_model.syn0])

    model = Sequential()
    model.add(embedding_layer)
    model.add(LSTM(Word_model.syn0.shape[1]))
    model.add(Dense(Word_model.syn0.shape[0]))
    model.add(Activation('softmax'))
    model.compile(optimizer='sgd', loss='mse')

Sample code or pseudocode for training the LSTM and predicting with it would be appreciated.

Maxim · Answer

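I've created a gist with a simple generator that builds on top of your initial idea: it's an LSTM network wired to the pre-trained word2vec embeddings, trained to predict the next word in a sentence. The data is the list of abstracts from the arXiv website.

I'll highlight the most important parts here.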

Gensim Word2Vec

Your code is fine, except for the number of iterations used to train it. The default iter=5 seems rather low. Besides, it's definitely not the bottleneck: LSTM training takes much longer. iter=100 looks better.

    import gensim

    # gensim 3.x API; gensim 4 renamed size -> vector_size, iter -> epochs,
    # wv.syn0 -> wv.vectors, wv.vocab -> wv.key_to_index, wv.index2word -> wv.index_to_key.
    Word_model = gensim.models.Word2Vec(sentences, size=100, min_count=1, window=5, iter=100)
    pretrained_weights = Word_model.wv.syn0
    vocab_size, emdedding_size = pretrained_weights.shape
    print('Result embedding shape:', pretrained_weights.shape)
    print('Checking similar words:')
    for word in ['model', 'network', 'train', 'learn']:
        most_similar = ', '.join('%s (%.2f)' % (similar, dist)
                                 for similar, dist in Word_model.wv.most_similar(word)[:8])
        print('  %s -> %s' % (word, most_similar))

    def Word2idx(word):
        # Map a word to its row in the embedding matrix.
        return Word_model.wv.vocab[word].index

    def idx2Word(idx):
        # Map a row index back to its word.
        return Word_model.wv.index2word[idx]

The resulting embedding matrix is saved into the pretrained_weights array, which has shape (vocab_size, emdedding_size).

Keras model

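Your code is almost correct, except for the loss function. Since the model predicts the next word, this is a classification task, so the loss should be categorical_crossentropy or sparse_categorical_crossentropy. I've chosen the latter for efficiency reasons: it avoids one-hot encoding, which is pretty expensive for a big vocabulary.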

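    from keras.models import Sequential
    from keras.layers import Activation, Dense, Embedding, LSTM

    model = Sequential()
    # Embedding layer initialized with the word2vec vectors.
    model.add(Embedding(input_dim=vocab_size, output_dim=emdedding_size, weights=[pretrained_weights]))
    model.add(LSTM(units=emdedding_size))
    # One output unit per vocabulary word, turned into probabilities by softmax.
    model.add(Dense(units=vocab_size))
    model.add(Activation('softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')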

Note that the pre-trained weights are passed via the weights argument.
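To illustrate the loss choice, here is a minimal sketch in plain Python (not Keras code) showing that the two losses give the same value for a single prediction, and that the sparse form only needs the integer label. The helper functions and the toy probabilities are made up for illustration.

```python
import math

def categorical_crossentropy(one_hot, probs):
    """Cross-entropy against a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(label, probs):
    """Cross-entropy against an integer class index."""
    return -math.log(probs[label])

# Toy softmax output over a 5-word vocabulary (made-up numbers).
probs = [0.1, 0.6, 0.1, 0.1, 0.1]
label = 1  # index of the true next word
one_hot = [1.0 if i == label else 0.0 for i in range(len(probs))]

dense_loss = categorical_crossentropy(one_hot, probs)
sparse_loss = sparse_categorical_crossentropy(label, probs)
print(dense_loss, sparse_loss)
```

With the one-hot form, every label is a vector as long as the vocabulary; with the sparse form it is a single integer, which is where the saving comes from.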

Data preparation

To work with the sparse_categorical_crossentropy loss, both sentences and labels must be word indices. Short sentences must be padded with zeros to a common length.

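    import numpy as np

    # max_sentence_len is the common (maximum) sentence length, set elsewhere in the full script.
    # Rows start as all zeros, so shorter sentences stay zero-padded on the right.
    train_x = np.zeros([len(sentences), max_sentence_len], dtype=np.int32)
    train_y = np.zeros([len(sentences)], dtype=np.int32)
    for i, sentence in enumerate(sentences):
        # Input: every word but the last; label: the last word.
        for t, word in enumerate(sentence[:-1]):
            train_x[i, t] = Word2idx(word)
        train_y[i] = Word2idx(sentence[-1])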
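As a small worked example of the same encoding, the sketch below uses plain lists and a hypothetical five-entry vocabulary instead of the gensim model. Reserving index 0 for padding is an assumption of this sketch: in the snippet above index 0 also belongs to a real word, so padding and that word share an index.

```python
def encode(sentences, word2idx, max_len):
    """Return (inputs, labels): zero-padded index rows and last-word indices."""
    inputs, labels = [], []
    for sentence in sentences:
        # All words except the last form the input sequence.
        row = [word2idx[w] for w in sentence[:-1]][:max_len]
        row += [0] * (max_len - len(row))  # pad on the right with zeros
        inputs.append(row)
        # The last word is the prediction target.
        labels.append(word2idx[sentence[-1]])
    return inputs, labels

# Hypothetical toy vocabulary; index 0 is reserved for padding here.
toy_vocab = {"<pad>": 0, "graph": 1, "minors": 2, "a": 3, "survey": 4}
toy_sentences = [["graph", "minors", "a", "survey"], ["a", "survey"]]

toy_x, toy_y = encode(toy_sentences, toy_vocab, max_len=4)
print(toy_x)
print(toy_y)
```

Each row of toy_x has the same length regardless of the sentence length, which is what lets the sentences be stacked into one training array.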

Sample generation

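This is pretty straightforward: the model outputs a vector of probabilities, from which the next word is sampled and appended to the input. Note that the generated text is better and more diverse if the next word is sampled rather than picked as the argmax. The temperature-based random sampling I've used is described here.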

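    import numpy as np

    def sample(preds, temperature=1.0):
        # A non-positive temperature means greedy decoding.
        if temperature <= 0:
            return np.argmax(preds)
        # Rescale the distribution: p_i ** (1 / temperature), renormalized.
        preds = np.asarray(preds).astype('float64')
        preds = np.log(preds) / temperature
        exp_preds = np.exp(preds)
        preds = exp_preds / np.sum(exp_preds)
        # Draw one word index from the rescaled distribution.
        probas = np.random.multinomial(1, preds, 1)
        return np.argmax(probas)

    def generate_next(text, num_generated=10):
        word_idxs = [Word2idx(word) for word in text.lower().split()]
        for i in range(num_generated):
            # Feed the whole sequence as one batch item of shape (1, len(word_idxs));
            # a bare 1-D array would typically be treated as a batch of one-word sequences.
            prediction = model.predict(x=np.array([word_idxs]))
            idx = sample(prediction[-1], temperature=0.7)
            word_idxs.append(idx)
        return ' '.join(idx2Word(idx) for idx in word_idxs)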
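To make the effect of the temperature parameter concrete, here is a minimal sketch in plain Python that applies the same rescaling as sample (log, divide by the temperature, exponentiate, renormalize) to a made-up three-word distribution. The apply_temperature helper is hypothetical and leaves out the random draw so the result is deterministic.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution: p_i ** (1 / T), renormalized."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.5, 0.3, 0.2]                 # made-up model output
sharp = apply_temperature(probs, 0.5)   # T < 1: favors the most likely word
same = apply_temperature(probs, 1.0)    # T = 1: distribution unchanged
flat = apply_temperature(probs, 2.0)    # T > 1: closer to uniform

for name, dist in [("T=0.5", sharp), ("T=1.0", same), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in dist])
```

A temperature such as 0.7 sits between greedy argmax and sampling from the raw distribution: it still prefers likely words but leaves room for variety.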

Examples of the generated text:

    deep convolutional... -> deep convolutional arithmetic initialization step unbiased effectiveness
    simple and effective... -> simple and effective family of variables preventing compute automatically
    a nonconvex... -> a nonconvex technique compared layer converges so independent onehidden markov
    a... -> a function parameterization necessary both both intuitions with technique valpola utilizes

It doesn't make too much sense, but it is able to produce sentences that look at least grammatically sound (sometimes).

The link to the complete runnable script.