Kを使用したK分割交差検証

Question

ニューラルネットワークの実行時間が非常に長いため、convn netのk分割交差検証は真剣に行われていないようです。私は小さなデータセットを持っていて、与えられた例 here を使用してk分割交差検証を行うことに興味があります。出来ますか？ありがとう。

nbarbosa · Answer

データジェネレーターで画像を使用している場合、Kerasとscikit-learnで10分割交差検証を行う1つの方法を次に示します。戦略は、各フォールドに従ってファイルをtraining、validation、およびtestサブフォルダーにコピーすることです。

import numpy as np import os import pandas as pd import shutil from sklearn.model_selection import StratifiedKFold from sklearn.metrics import accuracy_score, classification_report, confusion_matrix # used to copy files according to each fold def copy_images(df, directory): destination_directory = "{path to your data directory}/" + directory print("copying {} files to {}...".format(directory, destination_directory)) # remove all files from previous fold if os.path.exists(destination_directory): shutil.rmtree(destination_directory) # create folder for files from this fold if not os.path.exists(destination_directory): os.makedirs(destination_directory) # create subfolders for each class for c in set(list(df['class'])): if not os.path.exists(destination_directory + '/' + c): os.makedirs(destination_directory + '/' + c) # copy files for this fold from a directory holding all the files for i, row in df.iterrows(): try: # this is the path to all of your images kept together in a separate folder path_from = "{path to all of your images}" path_from = path_from + "{}.jpg" path_to = "{}/{}".format(destination_directory, row['class']) # move from folder keeping all files to training, test, or validation folder (the "directory" argument) shutil.copy(path_from.format(row['filename']), path_to) except Exception, e: print("Error when copying {}: {}".format(row['filename'], str(e))) # dataframe containing the filenames of the images (e.g., GUID filenames) and the classes df = pd.read_csv('{path to your data}.csv') df_y = df['class'] df_x = df del df_x['class'] skf = StratifiedKFold(n_splits = 10) total_actual = [] total_predicted = [] total_val_accuracy = [] total_val_loss = [] total_test_accuracy = [] for i, (train_index, test_index) in enumerate(skf.split(df_x, df_y)): x_train, x_test = df_x.iloc[train_index], df_x.iloc[test_index] y_train, y_test = df_y.iloc[train_index], df_y.iloc[test_index] train = pd.concat([x_train, y_train], axis=1) test = pd.concat([x_test, y_test], axis = 1) # take 20% of the training data from this fold for validation during training validation = train.sample(frac = 0.2) # make sure validation data does not include training data train = train[~train['filename'].isin(list(validation['filename']))] # copy the images according to the fold copy_images(train, 'training') copy_images(validation, 'validation') copy_images(test, 'test') print('**** Running fold '+ str(i)) # here you call a function to create and train your model, returning validation accuracy and validation loss val_accuracy, val_loss = create_train_model(); # append validation accuracy and loss for average calculation later on total_val_accuracy.append(val_accuracy) total_val_loss.append(val_loss) # here you will call a predict() method that will predict the images on the "test" subfolder # this function returns the actual classes and the predicted classes in the same order actual, predicted = predict() # append accuracy from the predictions on the test data total_test_accuracy.append(accuracy_score(actual, predicted)) # append all of the actual and predicted classes for your final evaluation total_actual = total_actual + actual total_predicted = total_predicted + predicted # this is optional, but you can also see the performance on each fold as the process goes on print(classification_report(total_actual, total_predicted)) print(confusion_matrix(total_actual, total_predicted)) print(classification_report(total_actual, total_predicted)) print(confusion_matrix(total_actual, total_predicted)) print("Validation accuracy on each fold:") print(total_val_accuracy) print("Mean validation accuracy: {}%".format(np.mean(total_val_accuracy) * 100)) print("Validation loss on each fold:") print(total_val_loss) print("Mean validation loss: {}".format(np.mean(total_val_loss))) print("Test accuracy on each fold:") print(total_test_accuracy) print("Mean test accuracy: {}%".format(np.mean(total_test_accuracy) * 100))

データジェネレーターを使用している場合、predict（）関数で、テスト時にbatch_size of 1を使用することで予測を同じ順序に保つ唯一の方法は次のとおりです。

generator = ImageDataGenerator().flow_from_directory( '{path to your data directory}/test', target_size = (img_width, img_height), batch_size = 1, color_mode = 'rgb', # categorical for a multiclass problem class_mode = 'categorical', # this will also ensure the same order shuffle = False)

このコードを使用すると、データジェネレーターを使用して10倍の交差検証を実行できました（そのため、すべてのファイルをメモリに保持する必要がありませんでした）。何百万もの画像がある場合、これは多くの作業になる可能性があり、テストセットが大きい場合はbatch_size = 1がボトルネックになる可能性がありますが、私のプロジェクトではこれはうまくいきました。