How do I create test and train samples from one dataframe with pandas?

Question

I have a fairly large dataset in the form of a dataframe, and I was wondering how I could split the dataframe into two random samples (80% and 20%) for training and testing.

Thank you!

Andy Hayden · Accepted Answer

I would just use numpy's rand:

    In [11]: df = pd.DataFrame(np.random.randn(100, 2))

    In [12]: msk = np.random.rand(len(df)) < 0.8

    In [13]: train = df[msk]

    In [14]: test = df[~msk]

And just to see that this has worked:

    In [15]: len(test)
    Out[15]: 21

    In [16]: len(train)
    Out[16]: 79
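The exact counts change from run to run because the mask is drawn at random. A minimal sketch of a reproducible variant, assuming NumPy's `default_rng` generator and an arbitrary seed of 0 (neither appears in the answer above):

```python
import numpy as np
import pandas as pd

# Seeded generator so the same mask is produced on every run (the seed is arbitrary).
rng = np.random.default_rng(0)

df = pd.DataFrame(rng.standard_normal((100, 2)))

# Each row lands in train with probability 0.8, so the split is only roughly 80/20.
msk = rng.random(len(df)) < 0.8
train = df[msk]
test = df[~msk]

print(len(train), len(test))
```

The two parts always add up to the full dataframe, but their sizes are only 80/20 on average.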

gobrewers14 · Answer

scikit-learn's train_test_split is a good one.

    from sklearn.model_selection import train_test_split

    train, test = train_test_split(df, test_size=0.2)
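If the split needs to be repeatable, `train_test_split` also accepts a `random_state` argument. A small sketch, where the toy dataframe and the seed value 42 are assumptions made for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe standing in for the real data (assumption for illustration).
df = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) % 2})

# random_state fixes the shuffle; test_size=0.2 keeps 20% of the rows for testing.
train, test = train_test_split(df, test_size=0.2, random_state=42)

print(train.shape, test.shape)  # (80, 2) (20, 2)
```

Both outputs are still dataframes and keep their original index labels.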

PagMax · Answer

Pandas random sampling will also work:

    train = df.sample(frac=0.8, random_state=200)
    test = df.drop(train.index)
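One caveat: `drop` removes rows by index label, so this pattern relies on the index being unique. A sketch of a guard for that case, using a toy dataframe with repeated labels (an assumption for illustration):

```python
import pandas as pd

# Toy dataframe with repeated index labels (assumption for illustration).
df = pd.DataFrame({"value": range(10)}, index=[0, 1, 2, 3, 4] * 2)

# drop() removes every row that shares a label with train, so give each row
# a unique label before splitting.
df = df.reset_index(drop=True)

train = df.sample(frac=0.8, random_state=200)
test = df.drop(train.index)

print(len(train), len(test))  # 8 2
```

Without the `reset_index` call, the test set here could end up smaller than 20% because duplicated labels would be dropped together.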

Napitupulu Jon · Answer

I would use scikit-learn's own train_test_split and generate the split from the index:

    from sklearn.model_selection import train_test_split

    y = df.pop('output')
    X = df

    # Split the index labels, then look the rows up by label
    X_train, X_test, y_train, y_test = train_test_split(X.index, y, test_size=0.2)
    X.loc[X_train]  # returns the training dataframe

user1775015 · Answer

You can use the code below to create test and train samples:

    from sklearn.model_selection import train_test_split

    trainingSet, testSet = train_test_split(df, test_size=0.2)

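test_size sets the fraction of rows that goes into the test set; choose it according to how you want to divide the data between the test and train datasets.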

Abhi · Answer

There are many valid answers. Adding one more to the bunch:

    # gets a random 80% of the entire set
    X_train = X.sample(frac=0.8, random_state=1)
    # gets the left-out portion of the dataset
    X_test = X.loc[~X.index.isin(X_train.index)]

Apogentus · Answer

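You may also consider a stratified split into training and testing sets. A stratified split also generates the training and testing sets at random, but in such a way that the original class proportions are preserved. This makes the training and testing sets reflect the properties of the original dataset more closely.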

    import numpy as np

    def get_train_test_inds(y, train_proportion=0.7):
        '''Generates indices, making random stratified split into training set
        and testing sets with proportions train_proportion and
        (1-train_proportion) of initial sample.
        y is any iterable indicating classes of each observation in the sample.
        Initial proportions of classes inside training and
        testing sets are preserved (stratified sampling).
        '''
        y = np.array(y)
        train_inds = np.zeros(len(y), dtype=bool)
        test_inds = np.zeros(len(y), dtype=bool)
        values = np.unique(y)
        for value in values:
            value_inds = np.nonzero(y == value)[0]
            np.random.shuffle(value_inds)
            n = int(train_proportion * len(value_inds))

            train_inds[value_inds[:n]] = True
            test_inds[value_inds[n:]] = True

        return train_inds, test_inds

df[train_inds] and df[test_inds] give you the training and testing sets of your original DataFrame df.
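As an alternative sketch of the same idea, pandas 1.1 and later provide a `sample` method on groupby objects, which draws the same fraction from every class. The column names and the toy data below are assumptions for illustration:

```python
import pandas as pd

# Toy data: 80 rows of class "a" and 20 rows of class "b" (assumption).
df = pd.DataFrame({"label": ["a"] * 80 + ["b"] * 20, "x": range(100)})

# Sample 70% of the rows within each class, then take the rest as the test set.
train = df.groupby("label").sample(frac=0.7, random_state=0)
test = df.drop(train.index)

print(train["label"].value_counts().to_dict())
print(test["label"].value_counts().to_dict())
```

Both parts keep the original 4:1 ratio between the two classes.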

Pardhu Gopalam · Answer

    import pandas as pd
    from sklearn.model_selection import train_test_split

    datafile_name = 'path_to_data_file'
    data = pd.read_csv(datafile_name)

    target_attribute = data['column_name']

    # test_size=0.2 keeps 80% of the rows for training, as asked in the question
    X_train, X_test, y_train, y_test = train_test_split(data, target_attribute, test_size=0.2)

MikeL · Answer

If you need to split your data with respect to the labels column in your dataset, you can use this:

    import pandas as pd

    def split_to_train_test(df, label_column, train_frac=0.8):
        train_parts, test_parts = [], []
        labels = df[label_column].unique()
        for lbl in labels:
            lbl_df = df[df[label_column] == lbl]
            lbl_train_df = lbl_df.sample(frac=train_frac)
            lbl_test_df = lbl_df.drop(lbl_train_df.index)
            print('\n%s:\n---------\ntotal:%d\ntrain_df:%d\ntest_df:%d'
                  % (lbl, len(lbl_df), len(lbl_train_df), len(lbl_test_df)))
            train_parts.append(lbl_train_df)
            test_parts.append(lbl_test_df)
        # DataFrame.append was removed in pandas 2.0; concatenate once instead
        return pd.concat(train_parts), pd.concat(test_parts)

and use it:

    train, test = split_to_train_test(data, 'class', 0.7)

If you want to control the randomness of the split, you can also pass random_state to the sample call, or set a global random seed.

Anarcho-Chossid · Answer

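This is what I wrote when I needed to split a DataFrame. I considered using Andy's approach above, but I didn't like that I could not control the size of the datasets exactly (i.e., it would sometimes be 79, sometimes 81, and so on).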

    import random as rnd

    def make_sets(data_df, test_portion):
        tot_ix = range(len(data_df))
        test_ix = sorted(rnd.sample(tot_ix, int(test_portion * len(data_df))))
        train_ix = sorted(set(tot_ix) - set(test_ix))

        # .ix was removed from pandas; .iloc selects rows by position
        test_df = data_df.iloc[test_ix]
        train_df = data_df.iloc[train_ix]

        return train_df, test_df

    train_df, test_df = make_sets(data_df, 0.2)
    test_df.head()
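The same exact-size split can also be sketched with a NumPy permutation of the row positions. The helper name `split_exact`, the `seed` parameter, and the toy data are hypothetical and only serve the illustration:

```python
import numpy as np
import pandas as pd

def split_exact(data_df, test_portion, seed=None):
    """Return (train_df, test_df) with exactly int(test_portion * n) test rows."""
    rng = np.random.default_rng(seed)
    # Shuffle the row positions, then cut the shuffled array at the test size.
    positions = rng.permutation(len(data_df))
    n_test = int(test_portion * len(data_df))
    # Sorting restores the original row order within each part.
    test_df = data_df.iloc[np.sort(positions[:n_test])]
    train_df = data_df.iloc[np.sort(positions[n_test:])]
    return train_df, test_df

data_df = pd.DataFrame({"x": range(100)})  # toy data (assumption)
train_df, test_df = split_exact(data_df, 0.2, seed=0)
print(len(train_df), len(test_df))  # 80 20
```

Because the cut is made on positions, the sizes are fixed regardless of the seed.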

AHonarmand · Answer

To split into more than two sets, such as train, test, and validation, you can do:

    probs = np.random.rand(len(df))
    training_mask = probs < 0.7
    test_mask = (probs >= 0.7) & (probs < 0.85)
    validation_mask = probs >= 0.85

    df_training = df[training_mask]
    df_test = df[test_mask]
    df_validation = df[validation_mask]

This will put approximately 70% of the data in training, 15% in test, and 15% in validation.
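When the three parts need exact sizes rather than approximate ones, one option is to shuffle the rows once and slice by position. A minimal sketch, with the toy dataframe and the seed as assumptions:

```python
import pandas as pd

df = pd.DataFrame({"x": range(100)})  # toy data (assumption)

# Shuffle all rows once, then cut at the 70% and 85% marks.
shuffled = df.sample(frac=1, random_state=0)
n = len(shuffled)
train_end = int(0.70 * n)
test_end = int(0.85 * n)

df_training = shuffled.iloc[:train_end]
df_test = shuffled.iloc[train_end:test_end]
df_validation = shuffled.iloc[test_end:]

print(len(df_training), len(df_test), len(df_validation))  # 70 15 15
```

Every row ends up in exactly one of the three parts.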

Makio · Answer

Just select a range of rows from your df like this:

    row_count = df.shape[0]
    split_point = int(row_count * 1 / 5)
    test_data, train_data = df[:split_point], df[split_point:]

yannick_leo · Answer

There are many ways to create train/test and even validation samples.

Case 1: the classic way, train_test_split without any options:

    from sklearn.model_selection import train_test_split

    train, test = train_test_split(df, test_size=0.3)

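Case 2: a very small dataset (fewer than 500 rows). Use cross-validation so that you get a result for every row: at the end, you will have one prediction for each row of your available training set.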

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold

    # shuffle=True is required when random_state is set
    kf = KFold(n_splits=10, shuffle=True, random_state=0)
    y_hat_all = []
    for train_index, test_index in kf.split(X, y):
        reg = RandomForestRegressor(n_estimators=50, random_state=0)
        # X and y are assumed to be NumPy arrays; use .iloc for pandas objects
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        clf = reg.fit(X_train, y_train)
        y_hat = clf.predict(X_test)
        y_hat_all.append(y_hat)
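scikit-learn's `cross_val_predict` collects the same kind of out-of-fold predictions in a single call and returns them in the original row order. The sketch below uses synthetic data and swaps in `LinearRegression` to keep it fast; both are assumptions and not part of the case above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic regression data standing in for X and y (assumption).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(200)

# shuffle=True is required when random_state is set on KFold.
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# Each row is predicted by a model trained on the folds that exclude it.
y_hat_all = cross_val_predict(LinearRegression(), X, y, cv=kf)

print(y_hat_all.shape)  # (200,)
```

The result is one flat array aligned with `y`, rather than a list of per-fold arrays.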

Case 3a: unbalanced datasets for classification purposes. Following case 1, here is the equivalent solution:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3)

Case 3b: unbalanced datasets for classification purposes. Following case 2, here is the equivalent solution:

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import StratifiedKFold

    # shuffle=True is required when random_state is set
    kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    y_hat_all = []
    for train_index, test_index in kf.split(X, y):
        reg = RandomForestRegressor(n_estimators=50, random_state=0)
        # X and y are assumed to be NumPy arrays; use .iloc for pandas objects
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        clf = reg.fit(X_train, y_train)
        y_hat = clf.predict(X_test)
        y_hat_all.append(y_hat)

Case 4: you need to create train/test/validation sets on big data to tune hyperparameters (60% train, 20% test, and 20% validation).

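    from sklearn.model_selection import train_test_split

    # Hold out 40% of the rows, leaving 60% for training
    X_train, X_test_val, y_train, y_test_val = train_test_split(X, y, test_size=0.4)

    # Halve the held-out part; stratify on its own labels so the lengths match
    X_test, X_val, y_test, y_val = train_test_split(X_test_val, y_test_val, stratify=y_test_val, test_size=0.5)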
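To check the arithmetic of the two-step split: holding out 40% and then halving the held-out part yields 60/20/20. A small sketch on synthetic data (the arrays, the seed, and the name `X_rest` are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and binary labels standing in for X and y (assumption).
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# First split: 60% train, 40% held out. Second split: halve the held-out part.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
X_test, X_val, y_test, y_val = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0
)

print(len(X_train), len(X_test), len(X_val))  # 600 200 200
```

Note that the second call stratifies on `y_rest`, the labels of the held-out part, because `stratify` must have the same length as the data being split.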

kiran6 · Answer

You can use df.to_numpy() (df.as_matrix() in pandas versions before 1.0) to create a NumPy array and pass it.

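    from sklearn.model_selection import train_test_split

    Y = df.pop('target')  # 'target' is a placeholder: pop() needs the name of the label column
    X = df.to_numpy()     # as_matrix() was removed in pandas 1.0
    x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
    model.fit(x_train, y_train)
    model.score(x_test, y_test)  # assumes model is a scikit-learn style estimator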

Hakim · Answer

I think you also need to get a copy, not a slice of the dataframe, if you'd like to add columns later.

    msk = np.random.rand(len(df)) < 0.8
    train, test = df[msk].copy(deep=True), df[~msk].copy(deep=True)

Akash Jain · Answer

How about this? df is my dataframe.

    import math

    total_size = len(df)
    train_size = math.floor(0.66 * total_size)  # roughly 2/3 of my dataset

    # training dataset
    train = df.head(train_size)
    # test dataset
    test = df.tail(len(df) - train_size)

Johnny V · Answer

If you want one dataframe in and two dataframes out (not NumPy arrays), this should do the trick:

    import numpy as np

    def split_data(df, train_perc=0.8):
        # Note: this adds a boolean 'train' column to df in place
        df['train'] = np.random.rand(len(df)) < train_perc
        train = df[df.train == 1]
        test = df[df.train == 0]
        return {'train': train, 'test': test}

thebeancounter · Answer

A bit more elegant, to my taste, is to create a random column and then split by it. This way you get a split that is random and follows whatever proportions you need.

    import numpy as np

    def split_df(df, p=[0.8, 0.2]):
        df["Rand"] = np.random.choice(len(p), len(df), p=p)
        # Iterate over range(len(p)) so the parts come back in the same order as p
        r = [df[df["Rand"] == val] for val in range(len(p))]
        return r