How to split/partition a dataset into training and test datasets for, e.g., cross validation?

Question

What is a good way to split a NumPy array randomly into training and testing/validation datasets? Something similar to the cvpartition or crossvalind functions in Matlab.

pberkes · Accepted Answer

If you want to split the data set once into two parts, you can use numpy.random.shuffle, or numpy.random.permutation if you need to keep track of the indices:

    import numpy

    # x is your dataset
    x = numpy.random.rand(100, 5)
    numpy.random.shuffle(x)  # shuffles the rows in place
    training, test = x[:80, :], x[80:, :]

or

    import numpy

    # x is your dataset
    x = numpy.random.rand(100, 5)
    indices = numpy.random.permutation(x.shape[0])
    training_idx, test_idx = indices[:80], indices[80:]
    training, test = x[training_idx, :], x[test_idx, :]
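When features and labels live in separate arrays, a single permutation can index both so that rows stay paired. A minimal sketch, assuming a hypothetical label array `y` and a seeded `numpy.random.default_rng` generator for reproducibility (neither appears in the answer above):

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded generator (assumption, for reproducibility)
x = rng.random((100, 5))            # stand-in dataset
y = np.arange(100)                  # hypothetical labels, one per row

indices = rng.permutation(x.shape[0])
training_idx, test_idx = indices[:80], indices[80:]

# The same index arrays are applied to both arrays, so rows and labels stay aligned.
x_train, x_test = x[training_idx], x[test_idx]
y_train, y_test = y[training_idx], y[test_idx]
```

Keeping the index arrays around also makes it possible to map any prediction back to its original row.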

There are many ways to repeatedly partition the same data set for cross validation. One strategy is to resample from the dataset, with repetition:

    import numpy

    # x is your dataset
    x = numpy.random.rand(100, 5)
    training_idx = numpy.random.randint(x.shape[0], size=80)
    test_idx = numpy.random.randint(x.shape[0], size=20)
    training, test = x[training_idx, :], x[test_idx, :]
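Note that drawing the training and test indices independently, as above, lets the same row land in both sets. One common variant (a swapped-in technique, not the answer's own code) is the bootstrap with an out-of-bag test set: sample the training rows with replacement, then test on the rows that were never drawn. A rough sketch, assuming a seeded generator:

```python
import numpy as np

rng = np.random.default_rng(0)      # seeded generator (assumption)
x = rng.random((100, 5))            # stand-in dataset
n = x.shape[0]

# Draw n training indices with replacement (duplicates are expected).
training_idx = rng.integers(0, n, size=n)
# Rows that were never drawn form the out-of-bag test set.
test_idx = np.setdiff1d(np.arange(n), training_idx)

training, test = x[training_idx], x[test_idx]
```

Repeating this with different seeds gives a family of train/test pairs in which no test row was seen during training.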

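Finally, sklearn contains several cross validation methods (k-fold, leave-n-out, ...). It also includes more advanced "stratified sampling" methods that create a partition of the data that is balanced with respect to some features, for example so that the classes appear in the same proportions in the training and test sets.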
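As an illustration of the k-fold case, here is a small sketch using `KFold` from `sklearn.model_selection`. The toy array, the number of folds, and the seed are assumptions chosen for the example:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape((10, 2))   # toy dataset with 10 rows

# Five folds; shuffle the rows once before cutting them into folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_sizes = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    fold_sizes.append((len(X_train), len(X_test)))
```

Each row is used for testing exactly once across the five folds.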

Paulo Malvar · Answer

There is another option that just entails using scikit-learn. As scikit's wiki describes, you can just use the following instructions:

    import numpy as np
    from sklearn.model_selection import train_test_split

    data, labels = np.arange(10).reshape((5, 2)), range(5)
    data_train, data_test, labels_train, labels_test = train_test_split(
        data, labels, test_size=0.20, random_state=42)

This way you can keep the labels in sync with the data you're trying to split into training and test.
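A quick way to see the pairing is to check that each row still sits next to its original label after the split. In this toy data, row i is `[2*i, 2*i + 1]`, so the label can be recovered from the first column. This check is only an illustration, not part of the answer:

```python
import numpy as np
from sklearn.model_selection import train_test_split

data, labels = np.arange(10).reshape((5, 2)), range(5)
data_train, data_test, labels_train, labels_test = train_test_split(
    data, labels, test_size=0.20, random_state=42)

# Recover each row's original position from its first column and compare with the label.
train_in_sync = np.array_equal(data_train[:, 0] // 2, labels_train)
test_in_sync = np.array_equal(data_test[:, 0] // 2, labels_test)
```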

offwhitelotus · Answer

Just a note. In case you want train, test, AND validation sets, you can do this:

    from sklearn.model_selection import train_test_split

    # get_my_X() and get_my_y() are placeholders for however you load your data
    X = get_my_X()
    y = get_my_y()
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)

These parameters give 70% of the data to training and 15% each to the test and validation sets. Hope this helps.
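To make the arithmetic concrete, here is a runnable sketch of the same two-step split on made-up data. The arrays stand in for `get_my_X()` and `get_my_y()`, and the `random_state` values and the `x_rest`/`y_rest` names are assumptions added for the example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape((100, 2))   # stand-in for get_my_X()
y = np.arange(100)                     # stand-in for get_my_y()

# Step 1: hold out 30% of the rows.
x_train, x_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.3, random_state=0)
# Step 2: cut the held-out 30% in half, giving 15% test and 15% validation.
x_test, x_val, y_test, y_val = train_test_split(
    x_rest, y_rest, test_size=0.5, random_state=0)

print(len(x_train), len(x_test), len(x_val))
```

The second `test_size` is a fraction of the held-out part, not of the whole dataset, which is why 0.5 yields 15% each.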

M. Mashaye · Answer

Since the sklearn.cross_validation module was deprecated, you can use:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.arange(10).reshape((5, 2)), range(5)
    X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, test_size=0.2, random_state=42)

Apogentus · Answer

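You may also consider a stratified division into training and testing sets. Stratified division also generates the training and testing sets randomly, but in such a way that the original class proportions are preserved. This makes the training and testing sets better reflect the properties of the original dataset.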

    import numpy as np

    def get_train_test_inds(y, train_proportion=0.7):
        '''Generates indices, making random stratified split into training set
        and testing sets with proportions train_proportion and
        (1-train_proportion) of initial sample.
        y is any iterable indicating classes of each observation in the sample.
        Initial proportions of classes inside training and
        testing sets are preserved (stratified sampling).
        '''
        y = np.array(y)
        train_inds = np.zeros(len(y), dtype=bool)
        test_inds = np.zeros(len(y), dtype=bool)
        values = np.unique(y)
        for value in values:
            # Shuffle the positions of this class, then cut them at the requested proportion
            value_inds = np.nonzero(y == value)[0]
            np.random.shuffle(value_inds)
            n = int(train_proportion * len(value_inds))

            train_inds[value_inds[:n]] = True
            test_inds[value_inds[n:]] = True

        return train_inds, test_inds

    y = np.array([1, 1, 2, 2, 3, 3])
    train_inds, test_inds = get_train_test_inds(y, train_proportion=0.5)
    print(y[train_inds])
    print(y[test_inds])

This code outputs:

    [1 2 3]
    [1 2 3]
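For comparison, scikit-learn's `train_test_split` accepts a `stratify` argument that performs a similar class-preserving split. A small sketch on the same labels, where the feature array `X` and the seed are assumptions made up for the example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1, 1, 2, 2, 3, 3])
X = np.arange(12).reshape((6, 2))   # hypothetical features, one row per label

# stratify=y keeps the class proportions of y in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0)

print(sorted(y_train), sorted(y_test))
```

With two samples per class and a 50/50 split, each side receives one sample of every class.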

Zahran · Answer

Thanks for the answer. I just modified it to avoid (1) replacement while sampling and (2) duplicated instances occurring in both training and testing:

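    import numpy as np

    # X is your dataset (a NumPy array)
    n_train = int(np.round(X.shape[0] * 0.8))  # slice bounds and sizes must be ints
    training_idx = np.random.choice(X.shape[0], n_train, replace=False)
    # alternative that also samples without replacement:
    training_idx = np.random.permutation(np.arange(X.shape[0]))[:n_train]
    test_idx = np.setdiff1d(np.arange(0, X.shape[0]), training_idx)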
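Both properties can be checked directly. The sketch below assumes a made-up dataset `X` and a seeded generator, and asserts that no index repeats within the training set and that none is shared with the test set:

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded generator (assumption)
X = rng.random((100, 5))                # stand-in dataset
n_train = int(round(X.shape[0] * 0.8))

training_idx = rng.choice(X.shape[0], size=n_train, replace=False)
test_idx = np.setdiff1d(np.arange(X.shape[0]), training_idx)

# (1) no replacement: every training index is unique
assert len(np.unique(training_idx)) == n_train
# (2) no overlap: training and test share no index
assert np.intersect1d(training_idx, test_idx).size == 0
```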

Colin · Answer

I wrote a function for my own project to do this (it doesn't use numpy, though):

    def partition(seq, chunks):
        """Splits the sequence into roughly equal sized chunks and returns them as a list"""
        result = []
        for i in range(chunks):
            chunk = []
            # take every chunks-th element, starting at offset i
            for element in seq[i:len(seq):chunks]:
                chunk.append(element)
            result.append(chunk)
        return result

If you want the chunks to be randomized, just shuffle the list before passing it in.
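A possible way to use this for cross validation is sketched below: shuffle first, partition, then let each chunk act as the test set once. The compact `partition` here reproduces the slicing logic of the function above, and the seed, the data, and the `splits` list are assumptions for illustration:

```python
import random

def partition(seq, chunks):
    """Split seq into `chunks` interleaved pieces (same slicing as the function above)."""
    return [list(seq[i::chunks]) for i in range(chunks)]

random.seed(0)                  # fixed seed (assumption, for reproducibility)
data = list(range(10))
random.shuffle(data)            # shuffle in place before partitioning
folds = partition(data, 5)

# Use each chunk once as the test set and the remaining chunks as the training set.
splits = []
for k, test in enumerate(folds):
    train = [item for j, fold in enumerate(folds) if j != k for item in fold]
    splits.append((train, test))
```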

prashanth · Answer

Here is code to split the data into n=5 folds in a stratified manner:

    # X = data array
    # y = class labels
    from sklearn.model_selection import StratifiedKFold

    skf = StratifiedKFold(n_splits=5)
    for train_index, test_index in skf.split(X, y):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
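A self-contained sketch of the same idea with the `sklearn.model_selection` API, using made-up data (20 rows, two balanced classes) so that the class balance inside each fold can be inspected:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(40).reshape((20, 2))     # made-up data array
y = np.array([0] * 10 + [1] * 10)      # made-up class labels, 10 per class

skf = StratifiedKFold(n_splits=5)
test_class_counts = []
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Count how many samples of each class ended up in this test fold.
    test_class_counts.append(np.bincount(y_test).tolist())
```

With 10 samples per class and 5 folds, every test fold should hold 2 samples of each class.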

rotem · Answer

After doing some reading and taking into account the (many..) different ways of splitting the data into train and test sets, I decided to time them!

I used 4 different methods (none of them uses the sklearn library, which I'm sure would give the best results, given that it is well designed and tested code):

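1. shuffle the whole matrix arr and then split the data into train and test
2. shuffle the indices and then assign them to x and y to split the data
3. same as method 2, but done in a more efficient way
4. use a pandas dataframe to split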

Method 3 won by far with the shortest time, followed by method 1. Methods 2 and 4 turned out to be really inefficient.

The code for the 4 different methods I timed:

    import numpy as np

    arr = np.random.rand(100, 3)
    X = arr[:, :2]
    Y = arr[:, 2]
    spl = 0.7
    N = len(arr)
    sample = int(spl * N)

    #%% Method 1: shuffle the whole matrix arr and then split
    # X and Y are views of arr, so shuffling arr in place reorders them too
    np.random.shuffle(arr)
    x_train, x_test, y_train, y_test = X[:sample, :], X[sample:, :], Y[:sample], Y[sample:]

    #%% Method 2: shuffle the indices and then apply them to X and Y
    # note: np.random.choice samples with replacement by default
    train_idx = np.random.choice(N, sample)
    Xtrain = X[train_idx]
    Ytrain = Y[train_idx]

    test_idx = [idx for idx in range(N) if idx not in train_idx]
    Xtest = X[test_idx]
    Ytest = Y[test_idx]

    #%% Method 3: shuffle indices without a for loop
    idx = np.random.permutation(arr.shape[0])  # can also use random.shuffle
    train_idx, test_idx = idx[:sample], idx[sample:]
    x_train, x_test, y_train, y_test = X[train_idx, :], X[test_idx, :], Y[train_idx], Y[test_idx]

    #%% Method 4: using pandas dataframe to split
    import pandas as pd

    # file_path is a placeholder for some csv file (I used some file with 3 columns)
    df = pd.read_csv(file_path, header=None)
    train = df.sample(frac=0.7, random_state=200)
    test = df.drop(train.index)

And regarding the times, the minimum time to execute out of 3 repetitions of 1000 loops is:

Method 1: 0.35883826200006297 seconds
Method 2: 1.7157016959999964 seconds
Method 3: 1.7876616719995582 seconds
Method 4: 0.07562861499991413 seconds
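A measurement of this kind ("minimum over 3 repetitions of 1000 loops") can be reproduced with `timeit.repeat` from the standard library. The sketch below times only method 3. The wrapper function `method_3` is a hypothetical helper, not the harness that produced the numbers above, and results will differ from machine to machine:

```python
import timeit
import numpy as np

arr = np.random.rand(100, 3)
N = len(arr)
sample = int(0.7 * N)

def method_3():
    """Permute the row indices once and slice features and targets with them."""
    idx = np.random.permutation(N)
    train_idx, test_idx = idx[:sample], idx[sample:]
    return arr[train_idx, :2], arr[test_idx, :2], arr[train_idx, 2], arr[test_idx, 2]

# 3 repetitions of 1000 loops each; keep the fastest repetition.
times = timeit.repeat(method_3, repeat=3, number=1000)
best = min(times)
print(f"Method 3: {best} seconds")
```

Taking the minimum rather than the mean reduces the influence of other processes running on the machine.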

I hope that's helpful!