Select first and last row from grouped data

Question

Using dplyr, how do I select the top and bottom observations/rows of grouped data in one statement?

Data & Example

Given a data frame

    df <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
                     stopId=c("a","b","c","a","b","c","a","b","c"),
                     stopSequence=c(1,2,3,3,1,4,3,1,2))

I can get the top and bottom observations from each group using slice, but that takes two separate statements:

    firstStop <- df %>%
      group_by(id) %>%
      arrange(stopSequence) %>%
      slice(1) %>%
      ungroup

    lastStop <- df %>%
      group_by(id) %>%
      arrange(stopSequence) %>%
      slice(n()) %>%
      ungroup

Can I combine these two statements into one that selects both the top and bottom observations?

jeremycg · Accepted Answer

There is probably a faster way:

    df %>%
      group_by(id) %>%
      arrange(stopSequence) %>%
      filter(row_number()==1 | row_number()==n())
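As a variation on the filter approach, the sort step can be skipped by asking for the positions of the minimum and maximum directly. This is only a minimal sketch and not part of the original answer: the name first_last_minmax is made up for illustration, and it assumes ties within a group do not matter, since which.min and which.max return the first match.

```r
library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
  stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2)
)

# Within each id, pick the row holding the smallest stopSequence,
# followed by the row holding the largest one. No arrange() is needed.
first_last_minmax <- df %>%
  group_by(id) %>%
  slice(c(which.min(stopSequence), which.max(stopSequence))) %>%
  ungroup()

print(first_last_minmax)
```

Note that a group with a single row would appear twice in the result, because both positions point at the same row.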

Frank · Answer

Just for completeness: you can pass slice a vector of indices:

    df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))

which gives

      id stopId stopSequence
    1  1      a            1
    2  1      c            3
    3  2      b            1
    4  2      c            4
    5  3      b            1
    6  3      a            3
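One edge case worth a quick sketch: if a group has only one row, c(1, n()) is c(1, 1) and slice returns that row twice. Wrapping the index vector in unique() avoids the duplicate. The extra id = 4 row and the names df_single and first_last below are hypothetical additions for illustration, not part of the original data.

```r
library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
  stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2)
)

# Hypothetical extra group containing a single observation
df_single <- rbind(df, data.frame(id = 4, stopId = "a", stopSequence = 1))

# unique() collapses c(1, 1) to 1 for single-row groups,
# so each such group contributes exactly one row
first_last <- df_single %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(unique(c(1, n()))) %>%
  ungroup()

print(first_last)
```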

MichaelChirico · Answer

Not dplyr, but it's much more direct using data.table:

    library(data.table)
    setDT(df)
    df[ df[order(id, stopSequence), .I[c(1L,.N)], by=id]$V1 ]
    #    id stopId stopSequence
    # 1:  1      a            1
    # 2:  1      c            3
    # 3:  2      b            1
    # 4:  2      c            4
    # 5:  3      b            1
    # 6:  3      a            3

A more detailed explanation:

    # 1) get row numbers of first/last observations from each group
    #    * basically, we sort the table by id/stopSequence, then,
    #      grouping by id, name the row numbers of the first/last
    #      observations for each id; since this operation produces
    #      a data.table
    #    * .I is data.table shorthand for the row number
    #    * here, to be maximally explicit, I've named the variable V1
    #      as row_num to give other readers of my code a clearer
    #      understanding of what operation is producing what variable
    first_last = df[order(id, stopSequence), .(row_num = .I[c(1L,.N)]), by=id]
    idx = first_last$row_num

    # 2) extract rows by number
    df[idx]

Be sure to check out the Getting Started wiki, which covers the data.table basics.

hrbrmstr · Answer

Something like:

    library(dplyr)

    df <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
                     stopId=c("a","b","c","a","b","c","a","b","c"),
                     stopSequence=c(1,2,3,3,1,4,3,1,2))

    first_last <- function(x) {
      bind_rows(slice(x, 1), slice(x, n()))
    }

    df %>%
      group_by(id) %>%
      arrange(stopSequence) %>%
      do(first_last(.)) %>%
      ungroup

    ## Source: local data frame [6 x 3]
    ##
    ##   id stopId stopSequence
    ## 1  1      a            1
    ## 2  1      c            3
    ## 3  2      b            1
    ## 4  2      c            4
    ## 5  3      b            1
    ## 6  3      a            3

With do you can perform pretty much any number of operations on the group, but @jeremycg's answer is better suited to just this task.

mpalanco · Answer

I know the question specified dplyr. But since others have already posted solutions using other packages, I decided to have a go with other packages too.

Base package:

    df <- df[with(df, order(id, stopSequence, stopId)), ]
    merge(df[!duplicated(df$id), ],
          df[!duplicated(df$id, fromLast = TRUE), ],
          all = TRUE)

data.table:

    df <- setDT(df)
    df[order(id, stopSequence)][, .SD[c(1,.N)], by=id]

sqldf：

    library(sqldf)
    min <- sqldf("SELECT id, stopId, min(stopSequence) AS StopSequence
                  FROM df GROUP BY id
                  ORDER BY id, StopSequence, stopId")
    max <- sqldf("SELECT id, stopId, max(stopSequence) AS StopSequence
                  FROM df GROUP BY id
                  ORDER BY id, StopSequence, stopId")
    sqldf("SELECT * FROM min
           UNION
           SELECT * FROM max")

In one query:

    sqldf("SELECT *
           FROM (SELECT id, stopId, min(stopSequence) AS StopSequence
                 FROM df GROUP BY id
                 ORDER BY id, StopSequence, stopId)
           UNION
           SELECT *
           FROM (SELECT id, stopId, max(stopSequence) AS StopSequence
                 FROM df GROUP BY id
                 ORDER BY id, StopSequence, stopId)")

Output:

      id stopId StopSequence
    1  1      a            1
    2  1      c            3
    3  2      b            1
    4  2      c            4
    5  3      a            3
    6  3      b            1

sindri_baldur · Answer

Using data.table in 2018:

    # convert to data.table
    setDT(df)
    # order, group, filter
    df[order(stopSequence)][, .SD[c(1, .N)], by = id]

       id stopId stopSequence
    1:  1      a            1
    2:  1      c            3
    3:  2      b            1
    4:  2      c            4
    5:  3      b            1
    6:  3      a            3

Ronak Shah · Answer

Another base R alternative is to first order by id and stopSequence, split the resulting indices by id, select only the first and last index for each id, and subset the data frame using those indices.

    df[sapply(with(df, split(order(id, stopSequence), id)),
              function(x) c(x[1], x[length(x)])), ]

    #  id stopId stopSequence
    #1  1      a            1
    #3  1      c            3
    #5  2      b            1
    #6  2      c            4
    #8  3      b            1
    #7  3      a            3

Or similarly using by:

    df[unlist(with(df, by(order(id, stopSequence), id,
                          function(x) c(x[1], x[length(x)])))), ]

Sahir Moosvi · Answer

Another approach, using lapply and a dplyr statement. We can apply an arbitrary number of summary functions in the same statement:

    lapply(c(first, last),
           function(x) df %>% group_by(id) %>% summarize_all(funs(x))) %>%
      bind_rows()

For example, if you are also interested in rows with the max stopSequence value, you could do:

    lapply(c(first, last, max("stopSequence")),
           function(x) df %>% group_by(id) %>% summarize_all(funs(x))) %>%
      bind_rows()
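Since funs() is deprecated in recent dplyr releases, the lapply idea above can be sketched with across() instead. This is an assumption-laden sketch rather than the answerer's code: it assumes dplyr 1.0.0 or later, sorts by stopSequence first so that first and last refer to the stop order, and uses the made-up name first_last_rows.

```r
library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
  stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2)
)

# Apply each summary function (first, then last) to every non-grouping
# column, one data frame per function, then stack the results
first_last_rows <- lapply(list(first, last), function(f) {
  df %>%
    arrange(stopSequence) %>%
    group_by(id) %>%
    summarise(across(everything(), f))
}) %>%
  bind_rows()

print(first_last_rows)
```

The result lists the first rows for all ids, followed by the last rows for all ids. Because each column is summarised independently, this only returns genuine rows when the same function is applied to every column.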