Building a model with one-hot encoding in Keras

Question

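I am working on a sentence classification problem and am trying to solve it with Keras. The vocabulary contains 36 unique words in total.

In this case the full vocabulary is [W1, W2, W3, ..., W36].

So if I have a sentence made up of words such as [W1 W2 W6 W7 W9] and encode it, I get an array like the one below.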

[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1] [0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]

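Its shape is (5, 36).

This is where I am stuck. The 20,000 NumPy arrays I have generated all have different shapes, namely (N, 36), where N is the number of words in the sentence. So I have 20,000 sentences for training and 100 for testing, and every sentence is labeled with a (1, 36) one-hot encoding.

I have X_train, x_test, y_train and y_test.

x_test and y_test have dimensions (1, 36).

Can anyone advise me on how to do this?

I have written some of the code, shown below: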

    model = Sequential()
    model.add(Dense(512, input_shape=(??????)))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes))
    model.add(Activation('softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

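Any help is greatly appreciated.

Update and response to @putonspectacles

Thank you very much for taking the time and effort to give such a detailed response. I tried your code with a few small changes that were needed to make it run. Please find it below: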

    num_classes = 5
    max_words = 20
    sentences = ["The cat is in the house",
                 "The green boy",
                 "computer programs are not alive while the children are"]
    labels = np.random.randint(0, num_classes, 3)
    y = to_categorical(labels, num_classes=num_classes)
    words = set(w for sent in sentences for w in sent.split())
    Word_map = {w : i+1 for (i, w) in enumerate(words)}
    # Changed the line below: the inner for loop iterates over sent.split() instead of sent
    sent_ints = [[Word_map[w] for w in sent.split()] for sent in sentences]
    vocab_size = len(words)
    print(vocab_size)
    # Changed the line below: the outer for loop iterates over sent_ints instead of sentences
    X = np.array([to_categorical(pad_sequences((sent,), max_words), vocab_size+1) for sent in sent_ints])
    print(X)
    print(y)
    model = Sequential()
    model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))
    model.add(LSTM(128))
    model.add(Dense(5, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X,y)

Without these changes the code does not run. When I run the code above, it prints the proper embeddings for X, followed by y, as shown below:

[[[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.] [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]] [[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 
0. 0.]]] [[[1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.] [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.] [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]]] [[0. 0. 0. 0. 1.] [1. 0. 0. 0. 0.] [0. 1. 0. 0. 0.]]

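However, the error I get is: "Error when checking input: expected dense_44_input to have 3 dimensions, but got array with shape (3, 1, 20, 16)"

If I change the input shape to model.add(Dense(512, input_shape=(None, max_words, vocab_size + 1))), I get:

"Input 0 is incompatible with layer lstm_27: expected ndim=3, found ndim=4"

I am working on resolving this issue. It would be great if you could point me in the right direction.

I have accepted the answer because it serves the purpose of embedding the words. Thanks again.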

orsonady · Accepted Answer

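Cool, you have cleared up the question. You want to classify sentences. I am assuming you want to do better than a bag-of-words encoding, that is, you want the order of the words to matter.

So we choose a new model: an RNN (the LSTM variant). This model effectively accumulates the importance of each word, in sequence, as it builds a representation of the sentence that best fits the task.

We do have to handle the preprocessing a bit differently, though. For efficiency (so that we can process many sentences together in a batch rather than one sentence at a time), we want every sentence to have the same number of words. So we choose a max_words, say 20, pad the shorter sentences until they reach that length, and truncate sentences that are longer than 20 words.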
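The padding and truncation rule can be sketched in plain Python. `pad_or_truncate` is a hypothetical helper written only for illustration, not a Keras function. It assumes the default behaviour of Keras `pad_sequences`, which pads with zeros at the front and, for sequences that are too long, drops items from the front.

```python
def pad_or_truncate(seq, max_words, pad_value=0):
    """Return a list of exactly max_words items (max_words must be >= 1)."""
    kept = list(seq)[-max_words:]  # keep only the last max_words items
    return [pad_value] * (max_words - len(kept)) + kept  # left-pad the rest

short = pad_or_truncate([4, 2, 7], max_words=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_words=5)
print(short)  # [0, 0, 4, 2, 7]
print(long)   # [3, 4, 5, 6, 7]
```

Reserving 0 as the padding value is why the word indices in this answer start at 1.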

Keras will help with that. We encode every word as an integer.

    import numpy as np
    from keras.preprocessing.sequence import pad_sequences
    from keras.utils import to_categorical
    from keras.models import Sequential
    from keras.layers import Embedding, Dense, LSTM

    num_classes = 5
    max_words = 20
    sentences = ["The cat is in the house",
                 "The green boy",
                 "computer programs are not alive while the children are"]
    # Random placeholder labels, one per sentence
    labels = np.random.randint(0, num_classes, 3)
    y = to_categorical(labels, num_classes=num_classes)
    # Map every distinct word to an integer starting at 1 (0 is kept for padding)
    words = set(w for sent in sentences for w in sent.split())
    Word_map = {w : i+1 for (i, w) in enumerate(words)}
    # Note: this line iterates over characters; see the edit at the end of this answer
    sent_ints = [[Word_map[w] for w in sent] for sent in sentences]
    vocab_size = len(words)

So "The green boy" might now be [1, 3, 5]. Next we pad and one-hot encode the input:

    # Pad to max_words length and one-hot encode with len(words) + 1 classes;
    # the + 1 is because we reserve 0 as the padding sentinel.
    # All sentences are padded in one call so that X has three dimensions.
    X = to_categorical(pad_sequences(sent_ints, maxlen=max_words), vocab_size + 1)
    print(X.shape)  # (3, 20, 16)
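Padding each sentence on its own as a one-element batch, as in `pad_sequences((sent,), max_words)`, adds an extra axis of length 1, which matches the shape (3, 1, 20, 16) in the error reported in the question's update. The sketch below reproduces both shapes with NumPy only. `pad` and `one_hot` are hypothetical stand-ins for `pad_sequences` and `to_categorical`, assumed to behave like their defaults, and the integer sequences are made-up values.

```python
import numpy as np

max_words, vocab_size = 20, 15
# Made-up integer encodings of three sentences (word indices 1..15)
sent_ints = [[1, 2, 3, 4, 5, 6], [1, 7, 8], [9, 10, 11, 12, 13, 14, 5, 15, 11]]

def pad(seq, maxlen):
    """Left-pad with 0 or keep the last maxlen items (stand-in for pad_sequences)."""
    kept = seq[-maxlen:]
    return [0] * (maxlen - len(kept)) + kept

def one_hot(int_array, num_classes):
    """One-hot encode along a new last axis (stand-in for to_categorical)."""
    return np.eye(num_classes)[np.asarray(int_array)]

# One sentence per call: each item has shape (1, 20, 16)
per_sentence = np.array([one_hot([pad(s, max_words)], vocab_size + 1) for s in sent_ints])
# All sentences in one call: shape (3, 20, 16)
whole_batch = one_hot([pad(s, max_words) for s in sent_ints], vocab_size + 1)

print(per_sentence.shape)  # (3, 1, 20, 16)
print(whole_batch.shape)   # (3, 20, 16)
```

If an array already has the extra axis, `np.squeeze(per_sentence, axis=1)` removes it and gives the same 3-D array.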

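Now on to the model: we add a Dense layer to turn those one-hot words into dense vectors. Then we use an LSTM to turn the word vectors of a sentence into a single dense sentence vector. Finally, we use a softmax activation to produce a probability distribution over the classes.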

    model = Sequential()
    model.add(Dense(512, input_shape=(max_words, vocab_size + 1)))
    model.add(LSTM(128))
    model.add(Dense(5, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

That should do it. You can then go on to training.

model.fit(X,y)
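To see why a Dense layer turns one-hot words into dense vectors: multiplying a one-hot vector by a weight matrix simply selects one row of that matrix. The NumPy sketch below illustrates this, along with the softmax used in the final layer. The sizes, the random weights and the example scores are arbitrary assumptions, and the bias that a Keras Dense layer adds by default is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_plus_pad, dense_units = 16, 4  # illustrative sizes only
W = rng.normal(size=(vocab_plus_pad, dense_units))  # stand-in for Dense weights

one_hot = np.zeros(vocab_plus_pad)
one_hot[3] = 1.0  # the word with integer id 3

dense_out = one_hot @ W  # what the Dense layer computes, without the bias
print(np.allclose(dense_out, W[3]))  # True: the product is row 3 of W

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1, -1.0, 0.0]))  # made-up class scores
print(probs.sum())
```

On a 3-D input such as (batch, max_words, vocab_size + 1), a Keras Dense layer applies the same weights along the last axis, so each word position is mapped independently before the LSTM reads the sequence.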

Edit:

This line:

    # We need to split each sentence into words; right now it is reading every
    # letter. Notice the sent.split() in the corrected version below.
    sent_ints = [[Word_map[w] for w in sent] for sent in sentences]

should be:

    sent_ints = [[Word_map[w] for w in sent.split()] for sent in sentences]
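The difference is easy to check in isolation. Iterating over a Python string yields single characters, while `split()` yields whitespace-separated words. In this small sketch `word_map` is a hypothetical hand-written mapping used only for illustration.

```python
sentence = "The green boy"
word_map = {"The": 1, "green": 2, "boy": 3}  # hypothetical mapping

as_characters = [w for w in sentence]      # 'T', 'h', 'e', ' ', ...
as_words = [w for w in sentence.split()]   # 'The', 'green', 'boy'

print(len(as_characters))               # 13
print(as_words)                         # ['The', 'green', 'boy']
print([word_map[w] for w in as_words])  # [1, 2, 3]

# Looking up single characters fails because the map only contains whole words
try:
    [word_map[w] for w in sentence]
    lookup_failed = False
except KeyError:
    lookup_failed = True
print(lookup_failed)  # True
```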