Using Tensorflow's Connectionist Temporal Classification (CTC) implementation

Question

I'm trying to use Tensorflow's CTC implementation from the contrib package (tf.contrib.ctc.ctc_loss), without success.

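First of all, does anyone know where I can read a good step-by-step tutorial? Tensorflow's documentation is very poor on this topic.
Do I have to provide the labels to ctc_loss with the blank label interleaved, or not?
I could not get my network to overfit, even with a training set of length 1 trained for over 200 epochs. :(
How can I calculate the label error rate using tf.edit_distance?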

Here is my code:

    with graph.as_default():

        max_length = X_train.shape[1]
        frame_size = X_train.shape[2]
        max_target_length = y_train.shape[1]

        # Batch size x time steps x data width
        data = tf.placeholder(tf.float32, [None, max_length, frame_size])
        data_length = tf.placeholder(tf.int32, [None])

        # Batch size x max_target_length
        target_dense = tf.placeholder(tf.int32, [None, max_target_length])
        target_length = tf.placeholder(tf.int32, [None])

        # Generating sparse tensor representation of target
        target = ctc_label_dense_to_sparse(target_dense, target_length)

        # Applying LSTM, returning output for each timestep (y_rnn1,
        # [batch_size, max_time, cell.output_size]) and the final state of shape
        # [batch_size, cell.state_size]
        y_rnn1, h_rnn1 = tf.nn.dynamic_rnn(
            tf.nn.rnn_cell.LSTMCell(num_hidden, state_is_tuple=True, num_proj=num_classes),  # num_proj=num_classes
            data,
            dtype=tf.float32,
            sequence_length=data_length,
        )

        # For sequence labelling, we want a prediction for each timestamp.
        # However, we share the weights for the softmax layer across all timesteps.
        # How do we do that? By flattening the first two dimensions of the output tensor.
        # This way time steps look the same as examples in the batch to the weight matrix.
        # Afterwards, we reshape back to the desired shape

        # Reshaping
        logits = tf.transpose(y_rnn1, perm=(1, 0, 2))

        # Get the loss by calculating ctc_loss
        # Also calculates
        # the gradient. This class performs the softmax operation for you, so inputs
        # should be e.g. linear projections of outputs by an LSTM.
        loss = tf.reduce_mean(tf.contrib.ctc.ctc_loss(logits, target, data_length))

        # Define our optimizer with learning rate
        optimizer = tf.train.RMSPropOptimizer(learning_rate).minimize(loss)

        # Decoding using beam search
        decoded, log_probabilities = tf.contrib.ctc.ctc_beam_search_decoder(
            logits, data_length, beam_width=10, top_paths=1)

Thanks!

Update (06/29/2016)

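Thank you, @jihyeon-seo! So, the input to the RNN has a shape like [num_batch, max_time_step, num_features]. We use dynamic_rnn to run the recurrent computation on that input, which outputs a tensor of shape [num_batch, max_time_step, num_hidden]. After that, we need an affine projection at each timestep with shared weights, so we reshape to [num_batch * max_time_step, num_hidden], multiply by a weight matrix of shape [num_hidden, num_classes], add a bias, undo the reshape, and transpose (so that we have [max_time_steps, num_batch, num_classes] for the CTC loss input). This result is the input to the ctc_loss function. Did I do everything correctly?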

This is the code:

    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)

    h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)

    # Reshaping to share weights across timesteps
    x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])

    self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1

    # Reshaping
    self._logits = tf.reshape(self._logits, [max_length, -1, num_classes])

    # Calculating loss
    loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)

    self.cost = tf.reduce_mean(loss)

Update (07/11/2016)

Thank you, @Xiv. Here is the code after the bug fix:

    cell = tf.nn.rnn_cell.MultiRNNCell([cell] * num_layers, state_is_tuple=True)

    h_rnn1, self.last_state = tf.nn.dynamic_rnn(cell, self.input_data, self.sequence_length, dtype=tf.float32)

    # Reshaping to share weights across timesteps
    x_fc1 = tf.reshape(h_rnn1, [-1, num_hidden])

    self._logits = tf.matmul(x_fc1, self._W_fc1) + self._b_fc1

    # Reshaping back to [batch, max_length, num_classes], then to time-major
    self._logits = tf.reshape(self._logits, [-1, max_length, num_classes])
    self._logits = tf.transpose(self._logits, (1, 0, 2))

    # Calculating loss
    loss = tf.contrib.ctc.ctc_loss(self._logits, self._targets, self.sequence_length)

    self.cost = tf.reduce_mean(loss)
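
To see why the reshape has to be followed by a transpose, here is a small framework-free sketch in plain Python (not Tensorflow code). It assumes row-major reshape semantics, and the helpers `reshape_2d` and `transpose` are hypothetical names used only for illustration. Each entry is tagged with the (batch, time) position it came from, so any scrambling is visible directly.

```python
# Sketch: reshaping straight to [max_time, batch] scrambles batch-major data.
batch_size, max_time = 2, 3

# Flattened batch-major output, as produced by reshaping
# [batch, max_time, num_hidden] to [batch * max_time, num_hidden].
flat = [(b, t) for b in range(batch_size) for t in range(max_time)]


def reshape_2d(items, rows, cols):
    """Row-major reshape of a flat list into a rows x cols nested list."""
    assert len(items) == rows * cols
    return [items[r * cols:(r + 1) * cols] for r in range(rows)]


def transpose(matrix):
    """Swap the two axes of a nested list."""
    return [list(column) for column in zip(*matrix)]


# Buggy version: reinterpret the flat data directly as [max_time, batch].
wrong = reshape_2d(flat, max_time, batch_size)

# Fixed version: restore [batch, max_time] first, then swap the axes.
right = transpose(reshape_2d(flat, batch_size, max_time))

for t in range(max_time):
    print("t =", t, "wrong:", wrong[t], "right:", right[t])
```

In the fixed version, row t holds timestep t of every sequence in the batch. In the direct reshape, a single row mixes different timesteps of the same sequence, so the loss would be computed on misaligned frames.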

Update (07/25/2016)

I published part of my code on GitHub, working with a single utterance. Feel free to use it! :)

Jihyeon Seo · Accepted Answer

I'm trying to do the same thing. Here is what I found that you may be interested in.

It was really hard to find a tutorial for CTC, but this example was helpful.

As for the blank label, the CTC layer assumes that the blank index is num_classes - 1, so you need to provide one additional class for the blank label.
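
To make the blank convention concrete, here is a minimal sketch in plain Python (not Tensorflow code). The alphabet and the helpers `encode_target` and `collapse_path` are hypothetical, chosen for illustration. It assumes the usual CTC setup, in which targets are passed without blanks and blanks appear only in frame-wise paths, where they separate repeated labels.

```python
from itertools import groupby

# Hypothetical alphabet for illustration; real label sets differ.
alphabet = ["a", "b", "c"]
num_classes = len(alphabet) + 1   # one extra class reserved for the blank
blank_index = num_classes - 1     # the convention described above

char_to_index = {ch: i for i, ch in enumerate(alphabet)}


def encode_target(text):
    """Encode a transcript as class indices, with no blanks interleaved."""
    return [char_to_index[ch] for ch in text]


def collapse_path(path, blank):
    """Map a frame-wise path to labels: merge repeats, then drop blanks."""
    merged = [label for label, _ in groupby(path)]
    return [label for label in merged if label != blank]


target = encode_target("abba")
# One possible frame-wise alignment of "abba"; a blank separates the two "b"s.
path = [0, 0, blank_index, 1, blank_index, 1, 1, 0]
print(target, collapse_path(path, blank_index))
```

The target for "abba" contains only the indices of real labels, and the path collapses back to exactly that target.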

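Also, the CTC loss applies the softmax itself. In your code, the RNN layer is connected directly to the CTC loss layer. The output of the RNN layer has already been passed through an activation internally, so you need to add one more layer (it can be the output layer) without an activation function, and then add the CTC loss layer.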
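
A rough way to see why the missing linear layer matters: assume the values fed to the softmax are bounded in [-1, 1], as they are for an LSTM output of the form o * tanh(c). The sketch below is plain Python, not Tensorflow code, and `num_classes = 30` is a hypothetical size. It compares the highest probability a bounded output can assign to the correct class with what an unbounded linear projection can reach.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


num_classes = 30  # hypothetical size, including the blank

# Best case for a bounded output: +1 for the correct class, -1 everywhere else.
bounded = [1.0] + [-1.0] * (num_classes - 1)
# A linear layer has no such limit, so the gap between logits can grow.
unbounded = [10.0] + [-10.0] * (num_classes - 1)

print("bounded:   %.3f" % softmax(bounded)[0])
print("unbounded: %.3f" % softmax(unbounded)[0])
```

Under these assumptions the bounded case cannot put much more than 20% of the probability mass on the correct class, which may be one reason a network wired this way struggles to overfit even a single example.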

Jon Rein · Answer

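See here for an example that uses bidirectional LSTM, CTC, and edit distance implementations to train a phoneme recognition model on the TIMIT corpus. If you train on that corpus's training set, you should be able to get the phoneme error rate down to 20-25% after 120 epochs or so.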
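
For reference, the label error rate is usually taken as the edit distance between the decoded sequence and the target, divided by the target length and averaged over sequences. As far as I know, tf.edit_distance with normalize=True computes that per-sequence normalized distance on sparse tensors. The sketch below is a plain-Python stand-in rather than the Tensorflow op; the function names and sample sequences are made up for illustration, and it assumes every target is non-empty.

```python
def edit_distance(hypothesis, truth):
    """Levenshtein distance between two label sequences."""
    previous = list(range(len(truth) + 1))
    for i, hyp_label in enumerate(hypothesis, start=1):
        current = [i]
        for j, true_label in enumerate(truth, start=1):
            substitution = previous[j - 1] + (hyp_label != true_label)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def label_error_rate(hypotheses, truths):
    """Mean of per-sequence edit distances, each normalized by target length."""
    rates = [edit_distance(h, t) / float(len(t)) for h, t in zip(hypotheses, truths)]
    return sum(rates) / len(rates)


# Made-up decoded outputs and targets, as lists of class indices.
decoded = [[0, 1, 1, 0], [2, 2]]
targets = [[0, 1, 2, 0], [2, 1, 2]]
print(label_error_rate(decoded, targets))
```

Here the first pair differs by one substitution (1/4) and the second by one missing label (1/3), so the result is the mean of those two rates.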