異なる列を持つdata.framesを再バインドする効率的な方法

Question

列の異なるセットを持つデータフレームのリストがあります。行ごとに1つのデータフレームに結合したいと思います。私が使う plyr::rbind.fillそれを行う。私はこれをより効率的に行うものを探していますが、与えられた答えに似ています here

require(plyr) set.seed(45) sample.fun <- function() { nam <- sample(LETTERS, sample(5:15)) val <- data.frame(matrix(sample(letters, length(nam)*10,replace=TRUE),nrow=10)) setNames(val, nam) } ll <- replicate(1e4, sample.fun()) rbind.fill(ll)

Arun · Accepted Answer

UPDATE：代わりにこの更新された回答を参照してください。

UPDATE（eddi）：これは version 1.8.11 のfill引数として実装されましたrbind。例えば：

_DT1 = data.table(a = 1:2, b = 1:2) DT2 = data.table(a = 3:4, c = 1:2) rbind(DT1, DT2, fill = TRUE) # a b c #1: 1 1 NA #2: 2 2 NA #3: 3 NA 1 #4: 4 NA 2 _

FR＃479 が追加されました-rbind.fill（plyrから）機能のようにdata.frames/data.tablesのリストをマージします

注1：

このソリューションは_data.table_のrbindlist関数を使用してdata.tablesのリストを「バインド」します。このため、バージョン1.8.9を使用してくださいこのバグバージョン<1.8.9。

注2：

rbindlistは、data.frames/data.tablesのリストをバインドするとき、現在のところ、最初の列のデータ型を保持します。つまり、最初のdata.frameの列が文字で、2番目のdata.frameの同じ列が「ファクター」である場合、rbindlistはこの列が文字になります。したがって、data.frameがすべての文字列で構成されている場合、このメソッドを使用したソリューションはplyrメソッドと同じになります。そうでない場合、値は同じままですが、一部の列は係数ではなく文字になります。自分で「ファクター」に変換する必要があります。うまくいけば、この動作は将来変更されるでしょう。

そして、ここで_data.table_を使用しています（そしてplyrから_rbind.fill_とのベンチマーク比較）：

_require(data.table) rbind.fill.DT <- function(ll) { # changed sapply to lapply to return a list always all.names <- lapply(ll, names) unq.names <- unique(unlist(all.names)) ll.m <- rbindlist(lapply(seq_along(ll), function(x) { tt <- ll[[x]] setattr(tt, 'class', c('data.table', 'data.frame')) data.table:::settruelength(tt, 0L) invisible(alloc.col(tt)) tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_] setcolorder(tt, unq.names) })) } rbind.fill.PLYR <- function(ll) { rbind.fill(ll) } require(microbenchmark) microbenchmark(t1 <- rbind.fill.DT(ll), t2 <- rbind.fill.PLYR(ll), times=10) # Unit: seconds # expr min lq median uq max neval # t1 <- rbind.fill.DT(ll) 10.8943 11.02312 11.26374 11.34757 11.51488 10 # t2 <- rbind.fill.PLYR(ll) 121.9868 134.52107 136.41375 184.18071 347.74724 10 # for comparison change t2 to data.table setattr(t2, 'class', c('data.table', 'data.frame')) data.table:::settruelength(t2, 0L) invisible(alloc.col(t2)) setcolorder(t2, unique(unlist(sapply(ll, names)))) identical(t1, t2) # [1] TRUE _

plyrの_rbind.fill_は、この特定の_data.table_ソリューションを、リストサイズが約500になるまで超えていることに注意してください。

ベンチマークプロット：

seq(1000, 10000, by=1000)を使用したdata.framesのリスト長での実行のプロットを次に示します。これらの異なるリストの長さのそれぞれで10人の担当者でmicrobenchmarkを使用しました。

enter image description here

ベンチマークの要点：

ここにベンチマークの要点があります 、誰かが結果を複製したい場合に備えて。

Arun · Answer

data.tableのrbindlist（およびrbind）は、 v1.9.3での最近の変更/コミット（開発バージョン）で機能と速度が向上しました。 dplyrには、rbind.fillという名前のplyrのrbind_allの高速バージョンがあります。この答えは少し古すぎるようです。

rbindlistに関連するNEWSエントリは次のとおりです。

o 'rbindlist' gains 'use.names' and 'fill' arguments and is now implemented entirely in C. Closes #5249 -> use.names by default is FALSE for backwards compatibility (doesn't bind by names by default) -> rbind(...) now just calls rbindlist() internally, except that 'use.names' is TRUE by default, for compatibility with base (and backwards compatibility). -> fill by default is FALSE. If fill is TRUE, use.names has to be TRUE. -> At least one item of the input list has to have non-null column names. -> Duplicate columns are bound in the order of occurrence, like base. -> Attributes that might exist in individual items would be lost in the bound result. -> Columns are coerced to the highest SEXPTYPE, if they are different, if/when possible. -> And incredibly fast ;). -> Documentation updated in much detail. Closes DR #5158.

そのため、以下の比較的大きなデータで新しい（および高速バージョン）のベンチマークを行いました。

新しいベンチマーク：

バインド後の列の総数が500になるように、200〜300の範囲の列を持つ合計10,000個のdata.tablesを作成します。

データを作成する関数：

require(data.table) ## 1.9.3 commit 1267 require(dplyr) ## commit 1504 devel set.seed(1L) names = paste0("V", 1:500) foo <- function() { cols = sample(200:300, 1) data = setDT(lapply(1:cols, function(x) sample(10))) setnames(data, sample(names)[1:cols]) } n = 10e3L ll = vector("list", n) for (i in 1:n) { .Call("Csetlistelt", ll, i, foo()) }

そして、ここにタイミングがあります：

## Updated timings on data.table v1.9.5 - three consecutive runs: system.time(ans1 <- rbindlist(ll, fill=TRUE)) # user system elapsed # 1.993 0.106 2.107 system.time(ans1 <- rbindlist(ll, fill=TRUE)) # user system elapsed # 1.644 0.092 1.744 system.time(ans1 <- rbindlist(ll, fill=TRUE)) # user system elapsed # 1.297 0.088 1.389 ## dplyr's rbind_all - Timings for three consecutive runs system.time(ans2 <- rbind_all(ll)) # user system elapsed # 9.525 0.121 9.761 # user system elapsed # 9.194 0.112 9.370 # user system elapsed # 8.665 0.081 8.780 identical(ans1, setDT(ans2)) # [1] TRUE

Xachriel · Answer

rbind.fillとrbindlistの両方を並列化すると、まだ何かが得られます。並列化された関数で試してみると、バージョン1.8.9がブロックされたため、結果はdata.tableバージョン1.8.8で実行されます。したがって、結果はdata.tableとplyrの間で同一ではありませんが、data.tableまたはplyrソリューション内では同一です。意味平行plyrは非平行plyrと一致し、逆も同様です。

これがベンチマーク/スクリプトです。 parallel.rbind.fill.DTは恐ろしいように見えますが、これは私が引き出せる最速のものです。

require(plyr) require(data.table) require(ggplot2) require(rbenchmark) require(parallel) # data.table::rbindlist solutions rbind.fill.DT <- function(ll) { all.names <- lapply(ll, names) unq.names <- unique(unlist(all.names)) rbindlist(lapply(seq_along(ll), function(x) { tt <- ll[[x]] setattr(tt, 'class', c('data.table', 'data.frame')) data.table:::settruelength(tt, 0L) invisible(alloc.col(tt)) tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_] setcolorder(tt, unq.names) })) } parallel.rbind.fill.DT <- function(ll, cluster=NULL){ all.names <- lapply(ll, names) unq.names <- unique(unlist(all.names)) if(is.null(cluster)){ ll.m <- rbindlist(lapply(seq_along(ll), function(x) { tt <- ll[[x]] setattr(tt, 'class', c('data.table', 'data.frame')) data.table:::settruelength(tt, 0L) invisible(alloc.col(tt)) tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_] setcolorder(tt, unq.names) })) }else{ cores <- length(cluster) sequ <- as.integer(seq(1, length(ll), length.out = cores+1)) Call <- paste(paste("list", seq(cores), sep=""), " = ll[", c(1, sequ[2:cores]+1), ":", sequ[2:(cores+1)], "]", sep="", collapse=", ") ll <- eval(parse(text=paste("list(", Call, ")"))) rbindlist(clusterApply(cluster, ll, function(ll, unq.names){ rbindlist(lapply(seq_along(ll), function(x, ll, unq.names) { tt <- ll[[x]] setattr(tt, 'class', c('data.table', 'data.frame')) data.table:::settruelength(tt, 0L) invisible(alloc.col(tt)) tt[, c(unq.names[!unq.names %chin% colnames(tt)]) := NA_character_] setcolorder(tt, unq.names) }, ll=ll, unq.names=unq.names)) }, unq.names=unq.names)) } } # plyr::rbind.fill solutions rbind.fill.PLYR <- function(ll) { rbind.fill(ll) } parallel.rbind.fill.PLYR <- function(ll, cluster=NULL, magicConst=400){ if(is.null(cluster) | ceiling(length(ll)/magicConst) < length(cluster)){ rbind.fill(ll) }else{ cores <- length(cluster) sequ <- as.integer(seq(1, length(ll), length.out = ceiling(length(ll)/magicConst))) Call <- paste(paste("list", seq(cores), sep=""), " = ll[", c(1, sequ[2:(length(sequ)-1)]+1), ":", sequ[2:length(sequ)], "]", sep="", collapse=", ") ll <- eval(parse(text=paste("list(", Call, ")"))) rbind.fill(parLapply(cluster, ll, rbind.fill)) } } # Function to generate sample data of varying list length set.seed(45) sample.fun <- function() { nam <- sample(LETTERS, sample(5:15)) val <- data.frame(matrix(sample(letters, length(nam)*10,replace=TRUE),nrow=10)) setNames(val, nam) } ll <- replicate(10000, sample.fun()) cl <- makeCluster(4, type="SOCK") clusterEvalQ(cl, library(data.table)) clusterEvalQ(cl, library(plyr)) benchmark(t1 <- rbind.fill.PLYR(ll), t2 <- rbind.fill.DT(ll), t3 <- parallel.rbind.fill.PLYR(ll, cluster=cl, 400), t4 <- parallel.rbind.fill.DT(ll, cluster=cl), replications=5) stopCluster(cl) # Results for rbinding 10000 dataframes # done with 4 cores, i5 3570k and 16gb memory # test reps elapsed relative # rbind.fill.PLYR 5 321.80 16.682 # rbind.fill.DT 5 26.10 1.353 # parallel.rbind.fill.PLYR 5 28.00 1.452 # parallel.rbind.fill.DT 5 19.29 1.000 # checking are results equal t1 <- as.matrix(t1) t2 <- as.matrix(t2) t3 <- as.matrix(t3) t4 <- as.matrix(t4) t1 <- t1[order(t1[, 1], t1[, 2]), ] t2 <- t2[order(t2[, 1], t2[, 2]), ] t3 <- t3[order(t3[, 1], t3[, 2]), ] t4 <- t4[order(t4[, 1], t4[, 2]), ] identical(t2, t4) # TRUE identical(t1, t3) # TRUE identical(t1, t2) # FALSE, mismatch between plyr and data.table

ご覧のとおり、rbind.fillのパラレイジングはdata.tableに匹敵するものであり、データフレーム数が少ない場合でもdata.tableをパラレライズすることで速度をわずかに上げることができます。