web-dev-qa-db-ja.com

'tf.data()' throwing入力がデータ不足になりました。トレーニングを中断する

tf.data()を使用してkeras apiでバッチでデータを生成しようとすると、奇妙な問題が発生します。それは、training_dataが不足しているというエラーをスローし続けます。

TensorFlow 2.1

_import numpy as np
import nibabel
import tensorflow as tf
from tensorflow.keras.layers import Conv3D, MaxPooling3D
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import Flatten
from tensorflow.keras import Model
import os
import random


"""Configure GPUs to prevent OOM errors"""
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

"""Retrieve file names"""
ad_files = os.listdir("/home/asdf/OASIS/3D/ad/")
cn_files = os.listdir("/home/asdf/OASIS/3D/cn/")

sub_id_ad = []
sub_id_cn = []

"""OASIS AD: 178 Subjects, 278 3T MRIs"""
"""OASIS CN: 588 Subjects, 1640 3T MRIs"""
"""Down-sampling CN to 278 MRIs"""
random.Random(129).shuffle(ad_files)
random.Random(129).shuffle(cn_files)

"""Split files for training"""
ad_train = ad_files[0:276]
cn_train = cn_files[0:276]

"""Shuffle Train data and Train labels"""
train = ad_train + cn_train
labels = np.concatenate((np.ones(len(ad_train)), np.zeros(len(cn_train))), axis=None)
random.Random(129).shuffle(train)
random.Random(129).shuffle(labels)
print(len(train))
print(len(labels))

"""Change working directory to OASIS/3D/all/"""
os.chdir("/home/asdf/OASIS/3D/all/")

"""Create tf data pipeline"""


def load_image(file, label):
    nifti = np.asarray(nibabel.load(file.numpy().decode('utf-8')).get_fdata())

    xs, ys, zs = np.where(nifti != 0)
    nifti = nifti[min(xs):max(xs) + 1, min(ys):max(ys) + 1, min(zs):max(zs) + 1]
    nifti = nifti[0:100, 0:100, 0:100]
    nifti = np.reshape(nifti, (100, 100, 100, 1))
    nifti = tf.convert_to_tensor(nifti, np.float64)
    return nifti, label


@tf.autograph.experimental.do_not_convert
def load_image_wrapper(file, labels):
    return tf.py_function(load_image, [file, labels], [tf.float64, tf.float64])


dataset = tf.data.Dataset.from_tensor_slices((train, labels))
dataset = dataset.shuffle(6, 129)
dataset = dataset.repeat(50)
dataset = dataset.map(load_image_wrapper, num_parallel_calls=6)
dataset = dataset.batch(6)
dataset = dataset.prefetch(buffer_size=1)
iterator = iter(dataset)
batch_images, batch_labels = iterator.get_next()

########################################################################################
with tf.device("/cpu:0"):
    with tf.device("/gpu:0"):
        model = tf.keras.Sequential()

        model.add(Conv3D(64,
                         input_shape=(100, 100, 100, 1),
                         data_format='channels_last',
                         kernel_size=(7, 7, 7),
                         strides=(2, 2, 2),
                         padding='valid',
                         activation='relu'))

    with tf.device("/gpu:1"):
        model.add(Conv3D(64,
                         kernel_size=(3, 3, 3),
                         padding='valid',
                         activation='relu'))

    with tf.device("/gpu:2"):
        model.add(Conv3D(128,
                         kernel_size=(3, 3, 3),
                         padding='valid',
                         activation='relu'))

        model.add(MaxPooling3D(pool_size=(2, 2, 2),
                               padding='valid'))

        model.add(Flatten())

        model.add(Dense(256, activation='relu'))
        model.add(Dense(1, activation='sigmoid'))


model.compile(loss=tf.keras.losses.binary_crossentropy,
              optimizer=tf.keras.optimizers.Adagrad(0.01),
              metrics=['accuracy'])


########################################################################################
model.fit(batch_images, batch_labels, steps_per_Epoch=92, epochs=50)
_

データセットを作成した後、繰り返しパラメーターをシャッフルして_num_of_epochs_、つまりこの場合は50に追加します。これは機能しますが、3番目のエポックの後でクラッシュし、この特定のインスタンスで私が何が間違っているのか理解できません。パイプラインの上部で繰り返しとシャッフルのステートメントを宣言することを想定していますか?

ここにエラーがあります:

_Epoch 3/50
92/6 [============================================================================================================================================================================================================================================================================================================================================================================================================================================================================] - 3s 36ms/sample - loss: 0.1902 - accuracy: 0.8043
Epoch 4/50
5/6 [========================>.....] - ETA: 0s - loss: 0.2216 - accuracy: 0.80002020-03-06 15:18:17.804126: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[BiasAddGrad_3/_54]]
2020-03-06 15:18:17.804137: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[sequential/conv3d_3/Conv3D/ReadVariableOp/_21]]
2020-03-06 15:18:17.804140: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[Conv3DBackpropFilterV2_3/_68]]
2020-03-06 15:18:17.804263: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[sequential/dense/MatMul/ReadVariableOp/_30]]
2020-03-06 15:18:17.804364: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
         [[BiasAddGrad_5/_62]]
2020-03-06 15:18:17.804561: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Out of range: End of sequence
         [[{{node IteratorGetNext}}]]
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_Epoch * epochs` batches (in this case, 4600 batches). You may need to use the repeat() f24/6 [========================================================================================================================] - 1s 36ms/sample - loss: 0.1673 - accuracy: 0.8750
Traceback (most recent call last):
  File "python_scripts/gpu_farm/tf_data_generator/3D_tf_data_generator.py", line 181, in <module>
    evaluation_ad = model.evaluate(ad_test, ad_test_labels, verbose=0)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 930, in evaluate
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 490, in evaluate
    use_multiprocessing=use_multiprocessing, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 426, in _model_iteration
    use_multiprocessing=use_multiprocessing)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_v2.py", line 646, in _process_inputs
    x, y, sample_weight=sample_weights)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 2383, in _standardize_user_data
    batch_size=batch_size)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 2489, in _standardize_tensors
    y, self._feed_loss_fns, feed_output_shapes)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_utils.py", line 810, in check_loss_and_target_compatibility
    ' while using as loss `' + loss_name + '`. '
ValueError: A target array with shape (5, 2) was passed for an output of shape (None, 1) while using as loss `binary_crossentropy`. This loss expects targets to have the same shape as the output.
_

更新:したがって、model.fit()は、model.fit(x=data, y=labels)を使用する場合、tf.data()とともに提供する必要があります。奇妙な問題の。これにより、_list out of index_エラーが削除されます。そして今、元のエラーに戻ります。ただし、これはテンソルフローの問題である可能性があるようです: https://github.com/tensorflow/tensorflow/issues/32

したがって、バッチサイズを6からより大きな数値に増やし、_steps_per_Epoch_を減らすと、_StartAbort: Out of range_エラーをスローすることなく、より多くのエポックを通過します

Update2:@ jkjung13の提案に従って、データセットmodel.fit()を使用する場合、model.fit(x=batch)は1つのパラメータを取ります。これは正しい実装です。

ただし、model.fit()datasetパラメータのみを使用している場合は、反復可能オブジェクトではなくxを指定する必要があります。

したがって、次のようになります。model.fit(dataset, epochs=50, steps_per_Epoch=46, validation_data=(v, v_labels))

そしてそれで私は新しいエラーを受け取ります: GitHub Issue

これを克服するために、データセットをnumpy_iterator()に変換しています:model.fit(dataset.as_numpy_iterator(), epochs=50, steps_per_Epoch=46, validation_data=(v, v_labels))

これにより問題は解決しますが、マルチプロセッシングなしの古いケラス_model.fit_generator_と同様に、パフォーマンスは驚くべきものになります。したがって、これは「tf.data」の目的全体を無効にします。

6
domin thomas

TF 2.1

これは現在、次のパラメーターで機能しています。

def load_image(file, label):
    nifti = np.asarray(nibabel.load(file.numpy().decode('utf-8')).get_fdata()).astype(np.float32)

    xs, ys, zs = np.where(nifti != 0)
    nifti = nifti[min(xs):max(xs) + 1, min(ys):max(ys) + 1, min(zs):max(zs) + 1]
    nifti = nifti[0:100, 0:100, 0:100]
    nifti = np.reshape(nifti, (100, 100, 100, 1))
    return nifti, label


@tf.autograph.experimental.do_not_convert
def load_image_wrapper(file, label):
    return tf.py_function(load_image, [file, label], [tf.float64, tf.float64])


dataset = tf.data.Dataset.from_tensor_slices((train, labels))
dataset = dataset.map(load_image_wrapper, num_parallel_calls=32)
dataset = dataset.prefetch(buffer_size=1)
dataset = dataset.apply(tf.data.experimental.prefetch_to_device('/device:GPU:0', 1))

# So, my dataset size is 522, i.e. 522 MRI images.
# I need to load the entire dataset as a batch.
# This should exceed 60GiBs of RAM, but it doesn't go over 12GiB of RAM.
# I'm not sure how tf.data batch() stores the data, maybe a custom file?
# And also add a repeat parameter to iterate with each Epoch.
dataset = dataset.batch(522, drop_remainder=True).repeat()

# Now initialise an iterator
iterator = iter(dataset)

# Create two objects, x & y, from batch
batch_image, batch_label = iterator.get_next()

##################################################################################
with tf.device("/cpu:0"):
    with tf.device("/gpu:0"):
        model = tf.keras.Sequential()

        model.add(Conv3D(64,
                         input_shape=(100, 100, 100, 1),
                         data_format='channels_last',
                         kernel_size=(7, 7, 7),
                         strides=(2, 2, 2),
                         padding='valid',
                         activation='relu'))

    with tf.device("/gpu:1"):
        model.add(Conv3D(64,
                         kernel_size=(3, 3, 3),
                         padding='valid',
                         activation='relu'))

    with tf.device("/gpu:2"):
        model.add(Conv3D(128,
                         kernel_size=(3, 3, 3),
                         padding='valid',
                         activation='relu'))

        model.add(MaxPooling3D(pool_size=(2, 2, 2),
                               padding='valid'))

        model.add(Flatten())

        model.add(Dense(256, activation='relu'))
        model.add(Dropout(0.7))
        model.add(Dense(1, activation='sigmoid'))

model.compile(loss=tf.keras.losses.binary_crossentropy,
              optimizer=tf.keras.optimizers.Adagrad(0.01),
              metrics=['accuracy'])
##################################################################################

# Now supply x=batch_image, y= batch_label to Keras' model.fit()
# And finally, supply your batchs_size here!
model.fit(batch_image, batch_label, epochs=100, batch_size=12)

##################################################################################

これにより、トレーニングの開始には約8分かかります。しかし、トレーニングが始まると、私は信じられないほどのスピードを目にしています!

Epoch 30/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.3526 - accuracy: 0.8640
Epoch 31/100
522/522 [==============================] - 15s 28ms/sample - loss: 0.3334 - accuracy: 0.8448
Epoch 32/100
522/522 [==============================] - 16s 31ms/sample - loss: 0.3308 - accuracy: 0.8697
Epoch 33/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2936 - accuracy: 0.8755
Epoch 34/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2935 - accuracy: 0.8851
Epoch 35/100
522/522 [==============================] - 14s 28ms/sample - loss: 0.3157 - accuracy: 0.8889
Epoch 36/100
522/522 [==============================] - 16s 31ms/sample - loss: 0.2910 - accuracy: 0.8851
Epoch 37/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2810 - accuracy: 0.8697
Epoch 38/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2536 - accuracy: 0.8966
Epoch 39/100
522/522 [==============================] - 16s 31ms/sample - loss: 0.2506 - accuracy: 0.9004
Epoch 40/100
522/522 [==============================] - 15s 28ms/sample - loss: 0.2353 - accuracy: 0.8927
Epoch 41/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2336 - accuracy: 0.9042
Epoch 42/100
522/522 [==============================] - 14s 26ms/sample - loss: 0.2243 - accuracy: 0.9234
Epoch 43/100
522/522 [==============================] - 15s 29ms/sample - loss: 0.2181 - accuracy: 0.9176

エポックあたりの古い12分と比較して、エポックあたり15秒!

これが実際に機能しているかどうか、およびテストデータにどのような影響があるかを確認するために、さらにテストを行います。エラーが発生した場合は、この投稿を更新します。

なぜこれが機能するのですか?何も思いつきません。 Kerasのドキュメントには何も見つかりませんでした。

0
domin thomas