Subset data frame based on number of rows per group

Question

I have data like this, where some values of `name` occur more than three times:

    df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)

      name x
    1    a 1
    2    a 2
    3    a 3
    4    b 4
    5    b 5
    6    c 6
    7    c 7
    8    c 8
    9    c 9

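I want to subset (filter) the data based on the number of rows (observations) within each level of the `name` variable. If a certain level of `name` occurs more than, say, 3 times, I want to remove all rows belonging to that level. So in this example, we would drop the observations where `name == "c"`, since that group has more than 3 rows: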

      name x
    1    a 1
    2    a 2
    3    a 3
    4    b 4
    5    b 5

I wrote this code, but I can't get it to work.

    as.data.frame(table(unique(df)$name))
    subset(df, name > 3)

Henrik · Accepted Answer

First, two base R alternatives. One relies on `table`, the other on `ave` and `length`. Then, two `data.table` ways.

1. `table`

    # count rows per 'name', then keep the names that occur at most 3 times
    tt <- table(df$name)

    df2 <- subset(df, name %in% names(tt[tt <= 3]))
    # or
    df2 <- df[df$name %in% names(tt[tt <= 3]), ]

If you want a step-by-step walk-through:

    # count each 'name', assign result to an object 'tt'
    tt <- table(df$name)

    # which 'name' in 'tt' occur at most three times?
    # Result is a logical vector that can be used to subset the table 'tt'
    tt <= 3

    # from the table, select 'name' that occur <= 3 times
    tt[tt <= 3]

    # ...their names
    names(tt[tt <= 3])

    # rows of 'name' in the data frame that match "the <= 3 names"
    # the result is a logical vector that can be used to subset the data frame 'df'
    df$name %in% names(tt[tt <= 3])

    # subset data frame by a logical vector
    # 'TRUE' rows are kept, 'FALSE' rows are removed.
    # assign the result to a data frame with a new name
    df2 <- subset(df, name %in% names(tt[tt <= 3]))
    # or
    df2 <- df[df$name %in% names(tt[tt <= 3]), ]
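The steps above could also be wrapped in a small function. This is only a sketch: `keep_small_groups`, `group_col` and `max_n` are hypothetical names that do not appear in the answer, and the threshold of 3 is chosen to match the expected output in the question.

```r
# Hypothetical helper (not part of the original answer): keep only the rows
# whose group has at most `max_n` rows.
keep_small_groups <- function(data, group_col, max_n) {
  counts <- table(data[[group_col]])          # number of rows per group level
  keep   <- names(counts[counts <= max_n])    # levels that are small enough
  data[data[[group_col]] %in% keep, , drop = FALSE]
}

df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)
res <- keep_small_groups(df, "name", max_n = 3)
print(res)
```

Passing the column name as a string keeps the helper in base R; nothing here depends on non-standard evaluation.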

2. `ave` and `length`

As suggested by @flodel:

    # ave() returns, for every row, the size of that row's 'name' group
    df[ave(df$x, df$name, FUN = length) <= 3, ]
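To see why the `ave` one-liner works, it may help to look at the intermediate vector. A minimal sketch, assuming the `df` from the question; `group_size` is just an illustrative name, and the threshold `<= 3` is picked to reproduce the expected output.

```r
df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)

# ave() applies FUN within each group and writes the result back to the
# original row positions, so every row receives the size of its own group.
group_size <- ave(df$x, df$name, FUN = length)
print(group_size)  # 3 3 3 2 2 4 4 4 4

# Use the per-row group size as a filter on the rows.
res <- df[group_size <= 3, ]
print(res)
```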

3. `data.table`: `.N` and `.SD`:

    library(data.table)
    # .N is the number of rows in each 'name' group; .SD is that group's rows
    setDT(df)[, if (.N <= 3) .SD, by = name]
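A short sketch of what `.N` holds per group, and how the `if` drops whole groups. It assumes the `data.table` package is installed; `dt`, `sizes` and `res` are illustrative names, and the data are built with `data.table()` directly rather than converting `df` in place.

```r
library(data.table)

dt <- data.table(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)

# .N is the number of rows in the current group.
sizes <- dt[, .N, by = name]
print(sizes)

# When the condition is FALSE, `if` returns NULL and the group contributes
# no rows; otherwise the group's rows (.SD) are returned unchanged.
res <- dt[, if (.N <= 3) .SD, by = name]
print(res)
```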

4. `data.table`: `.N` and `.I`:

    setDT(df)
    # .I gives the row numbers of each group; keep them only for small groups
    df[df[, .I[.N <= 3], name]$V1]
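The `.I` variant can be read in two steps: first collect the row numbers to keep, then index by them. A sketch under the same assumptions as before (`data.table` installed, illustrative names `dt`, `idx`, `res`):

```r
library(data.table)

dt <- data.table(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)

# .I holds the row numbers (in the full table) of the current group's rows.
# Indexing .I with a single TRUE keeps all of them; a single FALSE keeps none.
idx <- dt[, .I[.N <= 3], by = name]
print(idx)  # the unnamed j result is automatically called V1

# Subset the original table by the collected row numbers.
res <- dt[idx$V1]
print(res)
```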

See also the related Q&A: Count number of observations/rows per group and add result to data frame.

Joe · Answer

Using the `dplyr` package:

    library(dplyr)

    df %>% group_by(name) %>% filter(n() < 4)

    # A tibble: 5 x 2
    # Groups:   name [2]
      name      x
      <fct> <int>
    1 a         1
    2 a         2
    3 a         3
    4 b         4
    5 b         5

`n()` returns the number of observations in the current group, so we can `group_by` `name` and then keep only the rows that belong to a group with fewer than 4 rows.
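To make the value that `filter()` tests visible, the group size can be attached as a column first. This is a sketch, assuming `dplyr` is installed; the column name `group_size` is made up for illustration.

```r
library(dplyr)

df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)

# Within a grouped data frame, n() is evaluated once per group, so every row
# receives the size of its own group.
with_size <- df %>%
  group_by(name) %>%
  mutate(group_size = n()) %>%
  ungroup()
print(with_size)

# The same condition used directly in filter(); ungroup() drops the grouping
# so later steps are not silently computed per group.
res <- df %>%
  group_by(name) %>%
  filter(n() < 4) %>%
  ungroup()
print(res)
```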

Cettt · Answer

Yet another way using the `dplyr` package is to use the `count` function and then do a semi join on the original data frame:

    library(dplyr)

    df %>%
      count(name) %>%
      filter(n <= 3) %>%
      semi_join(df, ., by = "name")
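The same pipeline can be split so the intermediate count table is visible. A sketch, assuming `dplyr` is installed; `small_groups` is an illustrative name.

```r
library(dplyr)

df <- data.frame(name = c("a", "a", "a", "b", "b", "c", "c", "c", "c"), x = 1:9)

# One row per 'name', with the group size in column n.
small_groups <- df %>%
  count(name) %>%
  filter(n <= 3)
print(small_groups)

# semi_join() keeps the rows of df that have a match in small_groups,
# without adding any columns from it.
res <- semi_join(df, small_groups, by = "name")
print(res)
```

Naming the intermediate table avoids the `.` placeholder, which some readers find harder to follow.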
