Neural network LSTM input shape from a dataframe

Question

I am trying to implement an LSTM with Keras.

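I know that an LSTM in Keras requires a 3D tensor of shape (nb_samples, timesteps, input_dim) as input. However, I am not entirely sure what my input should look like, because I have just one sample of T observations for each input rather than multiple samples, i.e. (nb_samples=1, timesteps=T, input_dim=N). Is it better to split each of my inputs into samples of length T/M? T is around a few million observations for me, so how long should each sample be in that case, i.e. how would I choose M?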

Also, am I right that this tensor should look something like:

    [[[a_11, a_12, ..., a_1M], [a_21, a_22, ..., a_2M], ..., [a_N1, a_N2, ..., a_NM]],
     [[b_11, b_12, ..., b_1M], [b_21, b_22, ..., b_2M], ..., [b_N1, b_N2, ..., b_NM]],
     ...,
     [[x_11, x_12, ..., x_1M], [x_21, x_22, ..., x_2M], ..., [x_N1, x_N2, ..., x_NM]]]

where M and N are defined as before and x corresponds to the last sample obtained from the splitting described above?

Finally, given a pandas dataframe with T observations in each column and N columns, one for each input, how can I create such an input to feed to Keras?

Andrew · Accepted Answer

Below is an example that sets up time series data to train an LSTM. The model output is meaningless, since I only set it up to demonstrate how to build the model.

    import pandas as pd
    import numpy as np

    # Get some time series data
    df = pd.read_csv("https://raw.githubusercontent.com/plotly/datasets/master/timeseries.csv")
    df.head()

Time series dataframe:

             Date      A       B       C      D      E      F      G
    0  2008-03-18  24.68  164.93  114.73  26.27  19.21  28.87  63.44
    1  2008-03-19  24.18  164.89  114.75  26.22  19.07  27.76  59.98
    2  2008-03-20  23.99  164.63  115.04  25.78  19.01  27.04  59.61
    3  2008-03-25  24.14  163.92  114.85  27.41  19.61  27.84  59.41
    4  2008-03-26  24.44  163.45  114.84  26.86  19.53  28.02  60.09

You can build your inputs into a vector and then use the pandas .cumsum() function to build up the sequence for the time series:

    # Assumption: columns A-F are the inputs; this matches the 6 input
    # features in the X_train shape printed further down
    input_cols = ['A', 'B', 'C', 'D', 'E', 'F']

    # Put your inputs into a single list
    df['single_input_vector'] = df[input_cols].apply(tuple, axis=1).apply(list)
    # Double-encapsulate list so that you can sum it in the next step and keep time steps as separate elements
    df['single_input_vector'] = df.single_input_vector.apply(lambda x: [list(x)])
    # Use .cumsum() to include previous row vectors in the current row list of vectors
    df['cumulative_input_vectors'] = df.single_input_vector.cumsum()
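To make the effect of the cumulative step concrete, here is a minimal sketch on toy data. It uses `itertools.accumulate` from the standard library rather than pandas, so it is an illustration of the same idea and not the code from this answer; the `rows` values are made up.

```python
from itertools import accumulate

# Toy stand-in for three dataframe rows with two input columns each (made-up values)
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Wrap each row in an outer list so that "+" concatenates time steps
# instead of merging the numbers of two rows into one flat list
wrapped = [[row] for row in rows]

# accumulate() with the default "+" builds growing prefixes, like a cumulative sum
prefixes = list(accumulate(wrapped))

for prefix in prefixes:
    print(len(prefix), prefix)
```

Each element of `prefixes` holds every row up to and including the current one, which is why the sequences end up with different lengths.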

The output can be set up in a similar way, but it will be a single vector instead of a sequence:

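    # Assumption: G is the single output column; this matches the output
    # dimension of 1 mentioned further down
    output_cols = ['G']

    # If your output is multi-dimensional, you need to capture those dimensions in one object
    # If your output is a single dimension, this step may be unnecessary
    df['output_vector'] = df[output_cols].apply(tuple, axis=1).apply(list)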

The input sequences have to be the same length to run them through the model, so you need to pad them to the maximum length of your cumulative vectors:

    # Pad your sequences so they are the same length
    from keras.preprocessing.sequence import pad_sequences

    max_sequence_length = df.cumulative_input_vectors.apply(len).max()
    # Save it as a list
    # dtype='float32' keeps the decimal values; pad_sequences defaults to int32
    padded_sequences = pad_sequences(df.cumulative_input_vectors.tolist(), max_sequence_length, dtype='float32').tolist()
    df['padded_input_vectors'] = pd.Series(padded_sequences).apply(np.asarray)
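For readers who want to see what the padding does, the following is a rough NumPy-only sketch of left-padding with zero rows, which is the default side that `pad_sequences` pads on. The helper name `pad_pre` and the toy sequences are hypothetical, and the sketch assumes no sequence is longer than `max_len`.

```python
import numpy as np

def pad_pre(sequences, max_len, n_features):
    """Hypothetical helper: left-pad each sequence with zero rows up to max_len."""
    out = np.zeros((len(sequences), max_len, n_features), dtype="float32")
    for i, seq in enumerate(sequences):
        seq = np.asarray(seq, dtype="float32")
        # Place the real time steps at the end; the leading rows stay zero
        out[i, max_len - len(seq):, :] = seq
    return out

# Made-up sequences of length 1, 2 and 3 with two features per time step
sequences = [
    [[1.0, 2.0]],
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
]

padded = pad_pre(sequences, max_len=3, n_features=2)
print(padded.shape)  # (3, 3, 2)
print(padded[0])
```

The shortest sequence ends up as two zero rows followed by its single real time step.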

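The training data can be pulled from the dataframe and put into numpy arrays. Note that the input data that comes out of the dataframe will not form a 3D array. It forms an array of arrays, which is not the same thing.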

You can use hstack and reshape to build a 3D input array.

    # Extract your training data
    X_train_init = np.asarray(df.padded_input_vectors)
    # Use hstack and reshape to make the inputs a 3D array
    X_train = np.hstack(X_train_init).reshape(len(df), max_sequence_length, len(input_cols))
    y_train = np.hstack(np.asarray(df.output_vector)).reshape(len(df), len(output_cols))

To prove it:

    >>> print(X_train_init.shape)
    (11,)
    >>> print(X_train.shape)
    (11, 11, 6)
    >>> print(X_train == X_train_init)
    False
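The difference between an array of arrays and a true 3D array can be reproduced on toy data. The sketch below uses `np.stack`, which is a swapped-in alternative to the hstack and reshape approach in this answer, and all values are made up.

```python
import numpy as np

# Build a 1D object array whose elements are 2D arrays,
# similar to what comes out of a dataframe column of arrays
array_of_arrays = np.empty(3, dtype=object)
for i in range(3):
    array_of_arrays[i] = np.full((2, 2), float(i))

print(array_of_arrays.shape)  # (3,)

# np.stack joins equal-shaped arrays along a new leading axis
stacked = np.stack(list(array_of_arrays))
print(stacked.shape)  # (3, 2, 2)
```

The object array only knows it holds three items, while the stacked array carries the sample, time step and feature axes that an LSTM input needs.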

Once you have the training data, you can define the dimensions of your input layer and output layer.

    # Get your input dimensions
    # Input length is the length for one input sequence (i.e. the number of rows for your sample)
    # Input dim is the number of dimensions in one input vector (i.e. number of input columns)
    input_length = X_train.shape[1]
    input_dim = X_train.shape[2]
    # Output dimensions is the shape of a single output vector
    # In this case it's just 1, but it could be more
    output_dim = len(y_train[0])

Build the model:

    from keras.models import Model, Sequential
    from keras.layers import LSTM, Dense

    # Build the model
    model = Sequential()

    # I arbitrarily picked the output dimensions as 4
    model.add(LSTM(4, input_dim = input_dim, input_length = input_length))
    # The max output value is > 1 so relu is used as final activation.
    model.add(Dense(output_dim, activation='relu'))

    model.compile(loss='mean_squared_error',
                  optimizer='sgd',
                  metrics=['accuracy'])

Finally, you can train the model and save the training log as history:

    # Set batch_size to 7 to show that it doesn't have to be a factor or multiple of your sample size
    history = model.fit(X_train, y_train,
                        batch_size=7, nb_epoch=3,
                        verbose = 1)

Output:

    Epoch 1/3
    11/11 [==============================] - 0s - loss: 3498.5756 - acc: 0.0000e+00
    Epoch 2/3
    11/11 [==============================] - 0s - loss: 3498.5755 - acc: 0.0000e+00
    Epoch 3/3
    11/11 [==============================] - 0s - loss: 3498.5757 - acc: 0.0000e+00

That's it. Use model.predict(X) to make predictions from the model, where X has the same format as X_train (apart from the number of samples).

Andrew · Answer

Tensor shape

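You're right that Keras expects a 3D tensor for an LSTM neural network, but I think the piece you are missing is that Keras expects each observation to be able to have multiple dimensions.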

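For example, in Keras I have used word vectors to represent documents for natural language processing. Each word in the document is represented by an n-dimensional numerical vector (so if n = 2, the word 'cat' would be represented by something like [0.31, 0.65]). To represent a single document, the word vectors are lined up in sequence (e.g. 'The cat sat.' = [[0.12, 0.99], [0.31, 0.65], [0.94, 0.04]]). A document would be a single sample in a Keras LSTM.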
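As a small illustration of the shapes involved, the sketch below puts the 'The cat sat.' vectors from the paragraph above into a NumPy array and adds the sample axis. The variable names are only for illustration.

```python
import numpy as np

# One document: 3 words (time steps), each a 2-dimensional word vector
document = np.array([[0.12, 0.99],   # 'The'
                     [0.31, 0.65],   # 'cat'
                     [0.94, 0.04]])  # 'sat'
print(document.shape)  # (3, 2)

# A batch containing that single document: (nb_samples, timesteps, input_dim)
batch = document[np.newaxis, ...]
print(batch.shape)  # (1, 3, 2)
```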

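This is analogous to your time series observations. A document is like a time series, and a word is like a single observation in your time series, but in your case the representation of each observation has just n = 1 dimension.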

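Because of that, I think your tensor should be something like [[[a1], [a2], ... , [aT]], [[b1], [b2], ..., [bT]], ..., [[x1], [x2], ..., [xT]]], where x is the last of the nb_samples samples, timesteps = T, and input_dim = 1, because each of your observations is only one number.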
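A minimal NumPy sketch of that layout, under the assumption that each series is a 1D array of T numbers. The toy length `T = 12`, the series values and the window length `M = 4` are made up; the last part only shows how the splitting into length-M samples asked about in the question could be expressed as a reshape, not a recommendation for how to choose M.

```python
import numpy as np

T = 12  # toy length; the question mentions millions of observations
series_a = np.arange(T, dtype="float32")
series_b = np.arange(T, dtype="float32") * 10

# Stack the series as samples, then add a trailing axis for input_dim = 1
X = np.stack([series_a, series_b])[..., np.newaxis]
print(X.shape)  # (2, 12, 1)

# Splitting one long series into non-overlapping windows of a hypothetical length M
M = 4
usable = (T // M) * M  # drop any remainder that does not fill a full window
windows = series_a[:usable].reshape(-1, M, 1)
print(windows.shape)  # (3, 4, 1)
```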

Batch size

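Batch size should be set to maximize throughput without exceeding the memory capacity of your machine, per this Cross Validated post. As far as I know, the number of samples does not need to be a multiple of the batch size, either when training the model or when making predictions from it.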
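The arithmetic behind that can be sketched as follows, assuming the final partial batch is simply run with fewer samples. The numbers 11 and 7 are the sample count and batch size from the accepted answer's example.

```python
import math

n_samples = 11
batch_size = 7

# Number of batches per epoch when the last batch may be smaller
n_batches = math.ceil(n_samples / batch_size)

# Size of each batch: full batches first, then whatever is left over
batch_sizes = [min(batch_size, n_samples - start)
               for start in range(0, n_samples, batch_size)]

print(n_batches, batch_sizes)  # 2 [7, 4]
```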

Examples

If you're looking for sample code, the Keras Github has a number of examples using LSTM and other network types that take sequenced input.