



library(data.table)

N = 1e7L
DT = data.table(x = sample(letters, N, TRUE),
                y = sample(1000L, N, TRUE),
                val = runif(N))
setkey(DT, x, y)


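# Keyed join: binary search on the key (x, y)
SUBSET1 <- function(){
  a <- DT[.(c("a"), c(5L)), .N, nomatch = NULL]
}
# Same subset written in "vector scan" syntax
SUBSET2 <- function(){
  a <- DT[ x == "a" & y == 5L, .N, nomatch = NULL]
}

library(microbenchmark)
microbenchmark(SUBSET1(),
               SUBSET2(),
               times = 500 )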
  Unit: milliseconds
        expr    min      lq     mean  median     uq      max neval
   SUBSET1() 1.0328 1.27790 1.878415 1.53370 1.8924  20.5789   500
   SUBSET2() 2.4896 3.06665 4.476864 3.52685 4.3682 179.1607   500

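I don't understand why SUBSET2 is slower. Is there some kind of internal conversion from the "vector scan" form to a binary search, or is the "vector scan" actually executed as written (and therefore slower than the binary search)?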

Cédric Guilmin


DT[ x == "a" & y == 5L, .N, nomatch = NULL, verbose = TRUE]


Optimized subsetting with key 'x, y'
forder.c received 1 rows and 2 columns
forder took 0.001 sec
x is already ordered by these columns, no need to call reorder
i.x has same type (character) as x.x. No coercion needed.
i.y has same type (integer) as x.y. No coercion needed.
on= matches existing key, using key
Starting bmerge ...
bmerge done in 0.000s elapsed (0.000s cpu) 
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
Detected that j uses these columns: <none> 


DT[.("a", 5L), .N, nomatch = NULL, verbose = TRUE]

i.V1 has same type (character) as x.x. No coercion needed.
i.V2 has same type (integer) as x.y. No coercion needed.
on= matches existing key, using key
Starting bmerge ...
forder.c received 1 rows and 2 columns
bmerge done in 0.001s elapsed (0.000s cpu) 
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
Detected that j uses these columns: <none> 


DTrand = DT[sample(.N)]


DTrand[ x == "a" & y == 5L, .N, nomatch = NULL, verbose = TRUE]


Creating new index 'y__x'
Creating index y__x done in ... forder.c received 10000000 rows and 3 columns
forder took 0.424 sec
0.286s elapsed (1.117s cpu) 
Optimized subsetting with index 'y__x'
forder.c received 1 rows and 2 columns
forder took 0.002 sec
x is already ordered by these columns, no need to call reorder
i.y has same type (integer) as x.y. No coercion needed.
i.x has same type (character) as x.x. No coercion needed.
on= matches existing index, using index
Starting bmerge ...
bmerge done in 0.000s elapsed (0.000s cpu) 
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.001s cpu) 
Reorder irows for 'mult=="all" && !allGrp1' ... forder.c received 360 rows and 2 columns
0.000s elapsed (0.002s cpu) 
Detected that j uses these columns: <none> 
[1] 360


DTrand[ x == "a" & y == 5L, .N, nomatch = NULL, verbose = TRUE]


Optimized subsetting with index 'y__x'
forder.c received 1 rows and 2 columns
forder took 0 sec
x is already ordered by these columns, no need to call reorder
i.y has same type (integer) as x.y. No coercion needed.
i.x has same type (character) as x.x. No coercion needed.
on= matches existing index, using index
Starting bmerge ...
bmerge done in 0.000s elapsed (0.000s cpu) 
Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 
Reorder irows for 'mult=="all" && !allGrp1' ... forder.c received 360 rows and 2 columns
0.001s elapsed (0.001s cpu) 
Detected that j uses these columns: <none> 
[1] 360

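So even with DTrand, a naive benchmark would not be a true comparison: after the first run of the benchmark the table is indexed, and subsequent subsets use that index and binary search. See the vignette on secondary indices for more details.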
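To see this side effect directly, one option is to inspect the table's indices before and after the first `==` subset. The following is only a minimal sketch on a small hypothetical table (`small` stands in for DTrand and is not part of the original post); it assumes the default `datatable.auto.index` setting, and the exact index name reported may differ between data.table versions.

```r
library(data.table)

set.seed(42)
# Hypothetical small, unkeyed table in random order (stand-in for DTrand)
small = data.table(x = sample(letters, 1e5, TRUE),
                   y = sample(100L, 1e5, TRUE))

idx_before = indices(small)              # NULL: nothing is indexed yet
n_first = small[x == "a" & y == 5L, .N]  # with default options this subset may build an index
idx_after = indices(small)               # lists any index created as a side effect

print(idx_before)
print(idx_after)
```

If `idx_after` is non-empty, later `==` subsets on the same columns can reuse that index, which is why repeated timings stop measuring a vector scan.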


options(datatable.auto.index = FALSE)
setindex(DTrand, NULL)


microbenchmark(
  times = 50L,
  vector = DTrand[ x == "a" & y == 5L, .N, nomatch = NULL],
  binary = DT[     x == "a" & y == 5L, .N, nomatch = NULL]
)
# Unit: milliseconds
#    expr       min         lq       mean     median        uq        max neval
#  vector 101.43306 114.325340 134.154362 119.367909 128.05273 345.721296    50
#  binary   1.06033   1.160188   1.631119   1.367017   1.57334   5.508802    50

So the straight-up approach using `.()` is about twice as fast as the optimized approach using `==`, but `==` on the keyed table is in turn about 100 times faster than a true vector subset.
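Before comparing timings, it can be worth confirming that the three forms really return the same answer. The sketch below does that on a small hypothetical dataset (the names `keyed` and `shuffled` are illustrative, not from the original post), with auto-indexing switched off temporarily so that the subset on the shuffled table is not turned into an index lookup.

```r
library(data.table)

set.seed(1)
# Hypothetical small tables: same rows, one keyed and one in random order
keyed = data.table(x = sample(letters, 1e5, TRUE),
                   y = sample(100L, 1e5, TRUE))
shuffled = keyed[sample(.N)]   # copy in random order, no key and no index
setkey(keyed, x, y)            # sorts keyed by reference and marks (x, y) as its key

old = options(datatable.auto.index = FALSE)      # prevent index creation on shuffled
n_join   = keyed[.("a", 5L), .N, nomatch = NULL] # keyed join (binary search)
n_eq     = keyed[x == "a" & y == 5L, .N]         # == form on the keyed table
n_vector = shuffled[x == "a" & y == 5L, .N]      # == form with no key or index available
options(old)                                     # restore the previous setting

stopifnot(n_join == n_eq, n_eq == n_vector)
```

All three counts should agree; only the way the rows are located differs.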

