How to split data into training and test sets using the sample function

Question

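I have just started using R and I am not sure how to incorporate my dataset into the following sample code:

    sample(x, size, replace = FALSE, prob = NULL)

I have a dataset that I need to put into a training (75%) and a test (25%) set. I am not sure what information I am supposed to put into x and size. Is x the dataset file, and is size the number of samples I have?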

dickoa · Answer

There are numerous approaches to partitioning data. For a more complete approach, take a look at the createDataPartition function in the caret package.

Here is a simple example:

    data(mtcars)

    ## 75% of the sample size
    smp_size <- floor(0.75 * nrow(mtcars))

    ## set the seed to make your partition reproducible
    set.seed(123)
    train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)

    train <- mtcars[train_ind, ]
    test <- mtcars[-train_ind, ]

TheMI · Answer

It can be done easily:

    set.seed(101)  # set the seed so that the same sample can be reproduced later

    # select 75% of the rows of the data as the sample
    sample <- sample.int(n = nrow(data), size = floor(.75 * nrow(data)), replace = FALSE)
    train <- data[sample, ]
    test <- data[-sample, ]

Using the caTools package:

    library(caTools)
    set.seed(101)
    sample = sample.split(data$anycolumn, SplitRatio = .75)
    train = subset(data, sample == TRUE)
    test = subset(data, sample == FALSE)

Katrina Malakhova · Answer

This is almost the same code, but with a nicer look:

    bound <- floor((nrow(df) / 4) * 3)      # define % of training and test set

    df <- df[sample(nrow(df)), ]            # shuffle the rows
    df.train <- df[1:bound, ]               # get training set
    df.test <- df[(bound + 1):nrow(df), ]   # get test set

Edwin · Answer

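I would use dplyr for this, which makes it very simple. It does require an id variable in your data set, which is a good idea anyway, not only for creating the sets but also for traceability during your project. Add one if the data does not contain it already.

    library(dplyr)  # provides %>%, sample_frac() and anti_join()

    mtcars$id <- 1:nrow(mtcars)
    train <- mtcars %>% dplyr::sample_frac(.75)
    test <- dplyr::anti_join(mtcars, train, by = 'id')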

hyunwoo jeong · Answer

Split 'a' into train (70%) and test (30%):

    # a is the original data frame; this assumes it has the default row names 1..n
    library(dplyr)
    train <- sample_frac(a, 0.7)
    sid <- as.numeric(rownames(train))  # because rownames() returns character
    test <- a[-sid, ]

Done.

pradnya chavan · Answer

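    library(caret)
    # note: the indices are computed from sub_train but applied to m_train,
    # so the two data frames must have the same rows in the same order
    intrain <- createDataPartition(y = sub_train$classe, p = 0.7, list = FALSE)
    training <- m_train[intrain, ]
    testing <- m_train[-intrain, ]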

AlexG · Answer

My solution is basically the same as dickoa's, but a little easier to interpret.

    data(mtcars)
    n = nrow(mtcars)
    trainIndex = sample(1:n, size = round(0.7 * n), replace = FALSE)
    train = mtcars[trainIndex, ]
    test = mtcars[-trainIndex, ]

Shayan Amani · Answer

A simpler way using the excellent dplyr library:

    library(dplyr)
    set.seed(275)  # to get repeatable data

    # Default is the data frame being split; assumes default row names 1..n
    data.train <- sample_frac(Default, 0.7)

    train_index <- as.numeric(rownames(data.train))
    data.test <- Default[-train_index, ]

user2502836 · Answer

If you type:

    ?sample

it will open a help page that explains what the parameters of the sample function mean.
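As a quick orientation, and only as a sketch: in `sample(x, size, ...)`, `x` is the vector of items to draw from (for splitting, usually the row numbers rather than the data frame itself) and `size` is how many items to draw. The names `n` and `train_rows` are illustrative, and mtcars stands in for your own data.

```r
# x: the items to draw from; here the row numbers 1..n of a data frame
n <- nrow(mtcars)

set.seed(42)  # arbitrary seed, only so the draw can be repeated

# size: how many row numbers to draw (75% of the rows, rounded down)
train_rows <- sample(x = seq_len(n), size = floor(0.75 * n), replace = FALSE)

# replace = FALSE means no row number is drawn twice
length(train_rows)         # number of training rows
anyDuplicated(train_rows)  # 0 when there are no repeats
```

The resulting row numbers can then be used as `mtcars[train_rows, ]` and `mtcars[-train_rows, ]`, as in the other answers.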

I am not an expert, but here is some code I have:

    data <- data.frame(matrix(rnorm(400), nrow = 100))
    # assign each row at random to one of four equally sized groups
    splitdata <- split(data, sample(rep(1:4, as.integer(nrow(data) / 4))))
    test <- splitdata[[4]]
    train <- rbind(splitdata[[1]], splitdata[[2]], splitdata[[3]])

This gives you 75% train and 25% test.

Johnny V · Answer

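My solution shuffles the rows, then takes the first 75% of the rows as train and the remaining 25% as test. Super simple!

    row_count <- nrow(orders_pivotted)
    shuffled_rows <- sample(row_count)
    train_count <- floor(row_count * 0.75)
    train <- orders_pivotted[head(shuffled_rows, train_count), ]
    # take all remaining rows so none is dropped when row_count is not a multiple of 4
    test <- orders_pivotted[tail(shuffled_rows, row_count - train_count), ]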

Yohan Obadia · Answer

Below is a function that creates a list of sub-samples of the same size. This is not exactly what you asked for, but it might be useful to others. In my case, I use it to build several classification trees on smaller samples in order to test for overfitting.

    df_split <- function(df, number) {
      sizedf <- nrow(df)
      bound <- sizedf / number  # rows per sub-sample; assumes sizedf is a multiple of number
      out <- list()
      for (i in 1:number) {
        out[i] <- list(df[((i * bound + 1) - bound):(i * bound), ])
      }
      return(out)
    }

Example:

    x <- matrix(c(1:10), ncol = 1)
    x
    #       [,1]
    #  [1,]    1
    #  [2,]    2
    #  [3,]    3
    #  [4,]    4
    #  [5,]    5
    #  [6,]    6
    #  [7,]    7
    #  [8,]    8
    #  [9,]    9
    # [10,]   10

    x.split <- df_split(x, 5)
    x.split
    # [[1]]
    # [1] 1 2
    # [[2]]
    # [1] 3 4
    # [[3]]
    # [1] 5 6
    # [[4]]
    # [1] 7 8
    # [[5]]
    # [1]  9 10
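If the number of rows is not an exact multiple of the number of groups, or if the groups should be random rather than consecutive, one possible base-R variation is to shuffle group labels and pass them to `split()`. This is a sketch rather than the author's function; `k`, `fold` and `parts` are illustrative names.

```r
set.seed(1)                  # arbitrary seed for repeatability
df <- data.frame(x = 1:10)   # toy data with 10 rows
k <- 3                       # desired number of sub-samples

# Repeat the labels 1..k until every row has one, then shuffle them
fold <- sample(rep(seq_len(k), length.out = nrow(df)))

# split() returns a list of k data frames, one per label
parts <- split(df, fold)
sapply(parts, nrow)          # group sizes differ by at most one row
```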

Yash Sharma · Answer

Sample R code using the caTools package looks like this:

    library(caTools)
    # data is your data frame
    split = sample.split(data$DependentcoloumnName, SplitRatio = 0.6)
    training_set = subset(data, split == TRUE)
    test_set = subset(data, split == FALSE)

Konstantin Mingoulin · Answer

Use base R. The runif function generates uniformly distributed values between 0 and 1. Whatever cutoff value you choose (train.size in the example below), approximately that proportion of randomly chosen records will fall below it.

    data(mtcars)
    set.seed(123)

    # desired proportion of records in the training set
    train.size <- .7

    # TRUE/FALSE vector: TRUE where the random value is below the cutoff
    train.ind <- runif(nrow(mtcars)) < train.size

    # train
    train.df <- mtcars[train.ind, ]

    # test
    test.df <- mtcars[!train.ind, ]
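Because each row is assigned independently, the realized share of training rows only approximates the cutoff, and the gap tends to be more noticeable on a small data set such as mtcars. The sketch below compares the requested and realized proportions, reusing the variable names from the example above.

```r
set.seed(123)
train.size <- 0.7

# TRUE for rows whose uniform draw falls below the cutoff
train.ind <- runif(nrow(mtcars)) < train.size

# Realized share of training rows; close to, but usually not exactly, 0.7
mean(train.ind)

# Every row lands in exactly one of the two sets
nrow(mtcars[train.ind, ]) + nrow(mtcars[!train.ind, ]) == nrow(mtcars)
```

If an exact count matters, sampling row numbers with a fixed `size` is the safer choice.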

user322203 · Answer

I came across this one, and it may help too.

    set.seed(12)
    # Sonar is presumably the data set from the mlbench package
    data = Sonar[sample(nrow(Sonar)), ]  # reshuffles the data
    bound = floor(0.7 * nrow(data))
    df_train = data[1:bound, ]
    df_test = data[(bound + 1):nrow(data), ]

camnesia · Answer

The scorecard package has a handy function for this, in which you can specify the ratio and the seed:

    library(scorecard)
    dt_list <- split_df(mtcars, ratio = 0.75, seed = 66)

The test and train data are stored in a list and can be accessed as dt_list$train and dt_list$test.

Abhishek · Answer

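    library(caTools)
    set.seed(101)  # used to create the same samples every time
    split1 = sample.split(data$anycol, SplitRatio = 2/3)
    train = subset(data, split1 == TRUE)
    test = subset(data, split1 == FALSE)

The sample.split() function returns a logical vector, stored here as split1, with one value per row: about 2/3 of the values are TRUE and the rest are FALSE. The rows where split1 is TRUE are copied into train and the remaining rows are copied into the test data frame.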

igoR87 · Answer

I can suggest using the rsample package:

    library(rsample)

    # choosing 75% of the data to be the training data
    data_split <- initial_split(data, prop = .75)

    # extracting training data and test data as two separate data frames
    data_train <- training(data_split)
    data_test <- testing(data_split)

Corentin · Answer

Assume df is your data frame and that you want to create a 75% train and a 25% test set.

    all <- 1:nrow(df)
    train_i <- sort(sample(all, round(nrow(df) * 0.75, digits = 0), replace = FALSE))
    test_i <- all[-train_i]

Then create the train and test data frames:

    df_train <- df[train_i, ]
    df_test <- df[test_i, ]

Joe · Answer

After looking through all the different methods posted here, I did not see anyone use TRUE/FALSE to select and unselect data, so I thought I would share a method that uses that technique.

    n = nrow(dataset)
    split = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.75, 0.25))

    training = dataset[split, ]
    testing = dataset[!split, ]

Explanation

There are multiple ways of selecting data in R. Most commonly, people use positive and negative indices to select and unselect, respectively. However, the same result can be achieved by using TRUE/FALSE to select and unselect.

Consider the following example.

    # let's explore ways to select every other element
    data = c(1, 2, 3, 4, 5)

    # using positive indices to select wanted elements
    data[c(1, 3, 5)]
    # [1] 1 3 5

    # using negative indices to remove unwanted elements
    data[c(-2, -4)]
    # [1] 1 3 5

    # using booleans to select wanted elements
    data[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
    # [1] 1 3 5

    # R recycles the TRUE/FALSE vector if it is not the correct dimension
    data[c(TRUE, FALSE)]
    # [1] 1 3 5
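Note that drawing TRUE/FALSE independently for each row gives a split that is only approximately 75/25. If an exact count is wanted while still indexing with a logical vector, one option is to derive the mask from sampled row numbers. This is a sketch; `mask` is an illustrative name and mtcars stands in for `dataset`.

```r
set.seed(7)  # arbitrary seed
n <- nrow(mtcars)

# TRUE for exactly floor(0.75 * n) randomly chosen rows, FALSE for the rest
mask <- seq_len(n) %in% sample(n, floor(0.75 * n))

training <- mtcars[mask, ]
testing  <- mtcars[!mask, ]
```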

Xavier Jiménez · Answer

    set.seed(123)
    # nrow() counts rows; length() on a data frame would count columns instead
    llwork <- sample(1:nrow(mydata), round(0.75 * nrow(mydata), digits = 0))
    wmydata <- mydata[llwork, ]
    tmydata <- mydata[-llwork, ]

dzeltzer · Answer

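If you are looking for reproducible results, be careful about using sample for splitting. If your data changes even slightly, the split will change even if you use set.seed. For example, imagine that the sorted list of IDs in your data is all the numbers from 1 to 10. If you drop just one observation, say 4, sampling by position yields a different result, because 5 through 10 have all moved up one place.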
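The effect can be seen on a tiny scale with the 1-to-10 example. In this sketch (all names are illustrative) the same sampled positions are applied before and after ID 4 is dropped:

```r
ids_before <- 1:10
ids_after  <- ids_before[ids_before != 4]  # drop the observation with ID 4

set.seed(1)
pos <- sample(length(ids_after), 4)        # four positions, valid in both vectors

ids_before[pos]  # IDs selected from the original data
ids_after[pos]   # same positions, but IDs from position 4 onward have shifted
```

Any sampled position at or beyond 4 now points to a different ID, so the two selections cannot be identical.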

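An alternative is to use a hash function to map the IDs to pseudo-random numbers and then sample on the modulus of those numbers. Such a sample is more stable, because each observation's assignment is determined by the hash of its ID and not by its relative position.

For example: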

    library(openssl)     # for md5
    library(data.table)  # for the demo data

    set.seed(1)  # this won't help `sample`

    population <- as.character(1e5:(1e6 - 1))  # some made up ID names

    N <- 1e4  # sample size

    sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
    sample2 <- sample1[-sample(N, 1)]  # randomly drop one observation from sample1

    # samples are all but identical
    sample1
    sample2
    nrow(merge(sample1, sample2))
    # [1] 9999

    # row splitting yields very different test sets, even though we've set the seed
    test <- sample(N - 1, N / 2, replace = FALSE)

    test1 <- sample1[test, .(id)]
    test2 <- sample2[test, .(id)]
    nrow(test1)
    # [1] 5000

    nrow(merge(test1, test2))
    # [1] 2653

    # to fix that, we can use a hash function and sample on the first hex digit of the hash
    md5_bit_mod <- function(x, m = 2L) {
      # Inputs:
      #   x: a character vector of ids
      #   m: the modulo divisor (modify for split proportions other than 50:50)
      # Output: remainders from dividing the first digit of the md5 hash of x by m
      as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
    }

    # hash splitting preserves the similarity, because the assignment of test/train
    # is determined by the hash of each obs., and not by its relative location in
    # the data, which may change
    test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
    test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]
    nrow(merge(test1a, test2a))
    # [1] 5057

    nrow(test1a)
    # [1] 5057

The sample size is not exactly 5000 because the assignment is probabilistic, but thanks to the law of large numbers this should not be a problem in large samples.
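For readers without openssl installed, the same idea can be sketched in base R. The `toy_hash_mod` function below is a hypothetical stand-in written for this illustration. It is not a real hash function and not part of any package, and it mixes values far less well than md5. It is only meant to show that the assignment depends on the ID alone.

```r
# Hypothetical toy hash: weighted sum of character codes, reduced modulo m.
# Assumption: a stand-in for a real hash such as md5, for illustration only.
toy_hash_mod <- function(ids, m = 4L) {
  vapply(ids, function(id) {
    codes <- utf8ToInt(id)
    as.integer(sum(codes * seq_along(codes)) %% m)
  }, integer(1), USE.NAMES = FALSE)
}

ids <- as.character(1001:1100)   # made-up IDs
bucket <- toy_hash_mod(ids)      # a bucket in 0..3 for every ID
is_test <- bucket == 0L          # bucket 0 is used as the test set

# Drop one observation and recompute: the remaining IDs keep their assignment
ids2 <- ids[-37]
is_test2 <- toy_hash_mod(ids2) == 0L
identical(is_test[-37], is_test2)
```

With `m = 4` one bucket out of four is used for testing, which targets a 25% share only if the hash spreads IDs evenly, something a toy function like this does not guarantee.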

See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo