Applying group_by and summarise (sum) but keeping columns with non-relevant conflicting data?

Question

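My question is very similar to "Applying group_by and summarise on data while keeping all the columns' info", but I would like to keep the columns that get excluded because their values conflict after grouping.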

    Label <- c("203c","203c","204a","204a","204a","204a","204a","204a","204a","204a")
    Type <- c("wholefish","flesh","flesh","fleshdelip","formula","formuladelip",
              "formula","formuladelip","wholefish", "wholefishdelip")
    Proportion <- c(1,1,0.67714,0.67714,0.32285,0.32285,0.32285,
                    0.32285, 0.67714,0.67714)
    N <- (1:10)
    C <- (1:10)
    Code <- c("c","a","a","b","a","b","c","d","c","d")

    df <- data.frame(Label,Type, Proportion, N, C, Code)
    df

       Label           Type Proportion  N  C Code
    1   203c      wholefish     1.0000  1  1    c
    2   203c          flesh     1.0000  2  2    a
    3   204a          flesh     0.6771  3  3    a
    4   204a     fleshdelip     0.6771  4  4    b
    5   204a        formula     0.3228  5  5    a
    6   204a   formuladelip     0.3228  6  6    b
    7   204a        formula     0.3228  7  7    c
    8   204a   formuladelip     0.3228  8  8    d
    9   204a      wholefish     0.6771  9  9    c
    10  204a wholefishdelip     0.6771 10 10    d

    total <- df %>%
      #where the Label and Code are the same the Proportion, N and C
      #should be added together respectively
      group_by(Label, Code) %>%
      #total proportion should add up to 1
      #my way of checking that the correct task has been completed
      summarise_if(is.numeric, sum)

    # A tibble: 6 x 5
    # Groups:   Label [?]
       Label   Code Proportion     N     C
      <fctr> <fctr>      <dbl> <int> <int>
    1   203c      a    1.00000     2     2
    2   203c      c    1.00000     1     1
    3   204a      a    0.99999     8     8
    4   204a      b    0.99999    10    10
    5   204a      c    0.99999    16    16
    6   204a      d    0.99999    18    18

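Up to this point I get what I want. The Type column is dropped because its values conflict within each group, but I would like to include it. This is the result I am after: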

    # A tibble: 6 x 5
    # Groups:   Label [?]
       Label   Code Proportion     N     C                        Type
      <fctr> <fctr>      <dbl> <int> <int>                      <fctr>
    1   203c      a    1.00000     2     2                   wholefish
    2   203c      c    1.00000     1     1                       flesh
    3   204a      a    0.99999     8     8               flesh_formula
    4   204a      b    0.99999    10    10     fleshdelip_formuladelip
    5   204a      c    0.99999    16    16           wholefish_formula
    6   204a      d    0.99999    18    18 wholefishdelip_formuladelip

I have tried several variations of ungroup(), mutate and unite, to no avail, so any suggestions would be appreciated.

Mako212 · Accepted Answer

Here is a data.table solution. I am assuming you want the mean() of Proportion, since these grouped proportions are most likely not additive.

    library(data.table)

    setDT(df)
    df[, .(Type = paste(Type, collapse = "_"),
           Proportion = mean(Proportion), N = sum(N), C = sum(C)),
       by = .(Label, Code)][order(Label)]

       Label Code                        Type Proportion  N  C
    1:  203c    c                   wholefish   1.000000  1  1
    2:  203c    a                       flesh   1.000000  2  2
    3:  204a    a               flesh_formula   0.499995  8  8
    4:  204a    b     fleshdelip_formuladelip   0.499995 10 10
    5:  204a    c           formula_wholefish   0.499995 16 16
    6:  204a    d formuladelip_wholefishdelip   0.499995 18 18
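The choice between sum() and mean() for Proportion can be checked by hand on a single group. The sketch below uses only the two Proportion values of the group Label "204a", Code "a" from the question's data; the variable names are made up for illustration.

```r
# Proportions of the two rows in group Label "204a", Code "a"
# (values copied from the question's data).
p <- c(0.67714, 0.32285)

total_share <- sum(p)   # adds the two shares together
mean_share  <- mean(p)  # averages them instead

print(total_share)
print(mean_share)
```

The sum corresponds to the 0.99999 the asker uses as a sanity check, and the mean corresponds to the 0.499995 shown in this answer's output, so which function fits depends on what Proportion is meant to represent.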

I am not sure this is the cleanest dplyr solution, but it works:

    df %>%
      group_by(Label, Code) %>%
      mutate(Type = paste(Type, collapse = "_")) %>%
      group_by(Label, Type, Code) %>%
      summarise(N = sum(N), C = sum(C), Proportion = mean(Proportion))

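The key here is to re-group once you have created the combined Type column, so that Type becomes a grouping variable and survives summarise().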

       Label                        Type   Code     N     C Proportion
      <fctr>                       <chr> <fctr> <int> <int>      <dbl>
    1   203c                       flesh      a     2     2   1.000000
    2   203c                   wholefish      c     1     1   1.000000
    3   204a               flesh_formula      a     8     8   0.499995
    4   204a     fleshdelip_formuladelip      b    10    10   0.499995
    5   204a           formula_wholefish      c    16    16   0.499995
    6   204a formuladelip_wholefishdelip      d    18    18   0.499995

Psidom · Answer

Here are two other options.

1) Nest the remaining columns into a single list column, then customize the summary by checking each column's data type:

    df %>%
        group_by(Label, Code) %>% nest() %>%
        mutate(data = map(data,
            ~ as.tibble(map(.x, ~ if(is.numeric(.x)) sum(.x) else paste(.x, collapse="_")))
        )) %>%
        unnest()

    # A tibble: 6 x 6
    #   Label   Code                        Type Proportion     N     C
    #  <fctr> <fctr>                       <chr>      <dbl> <int> <int>
    #1   203c      c                   wholefish    1.00000     1     1
    #2   203c      a                       flesh    1.00000     2     2
    #3   204a      a               flesh_formula    0.99999     8     8
    #4   204a      b     fleshdelip_formuladelip    0.99999    10    10
    #5   204a      c           formula_wholefish    0.99999    16    16
    #6   204a      d formuladelip_wholefishdelip    0.99999    18    18
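The per-column rule used inside the nested map() call can be looked at in isolation. This is a minimal base R sketch, not part of the original answer: collapse_column is a hypothetical helper name, and the group list simply holds the two rows of Label "204a", Code "a" from the question's data.

```r
# Hypothetical helper: sum numeric vectors, paste everything else together.
collapse_column <- function(x) {
  if (is.numeric(x)) sum(x) else paste(x, collapse = "_")
}

# One group from the question's data (Label "204a", Code "a").
group <- list(
  Type       = c("flesh", "formula"),
  Proportion = c(0.67714, 0.32285),
  N          = c(3L, 5L),
  C          = c(3L, 5L)
)

# Apply the rule to every column of the group.
result <- lapply(group, collapse_column)
str(result)
```

Each element of result is a single value, which is why the answer can turn the mapped list back into a one-row tibble per group.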

2) Summarize the numeric and non-numeric columns separately, then join the results:

    numeric <- df %>%
        group_by(Label, Code) %>%
        summarise_if(is.numeric, sum)

    character <- df %>%
        group_by(Label, Code) %>%
        summarise_if(~ is.character(.) || is.factor(.), ~ paste(., collapse="_"))

    inner_join(numeric, character, by = c("Label", "Code"))

    # A tibble: 6 x 6
    # Groups:   Label [?]
    #   Label   Code Proportion     N     C                        Type
    #  <fctr> <fctr>      <dbl> <int> <int>                       <chr>
    #1   203c      a    1.00000     2     2                       flesh
    #2   203c      c    1.00000     1     1                   wholefish
    #3   204a      a    0.99999     8     8               flesh_formula
    #4   204a      b    0.99999    10    10     fleshdelip_formuladelip
    #5   204a      c    0.99999    16    16           formula_wholefish
    #6   204a      d    0.99999    18    18 formuladelip_wholefishdelip

Jake Thompson · Answer

Here is a tidyverse solution that keeps the group_by statement the same. The key is to first apply mutate_if once per variable type (i.e., numeric, character), and then keep only the distinct rows.

    library(tidyverse)
    #> Loading tidyverse: ggplot2
    #> Loading tidyverse: tibble
    #> Loading tidyverse: tidyr
    #> Loading tidyverse: readr
    #> Loading tidyverse: purrr
    #> Loading tidyverse: dplyr
    #> Conflicts with tidy packages ----------------------------------------------
    #> filter(): dplyr, stats
    #> lag():    dplyr, stats

    Label <- c("203c", "203c", "204a", "204a", "204a", "204a", "204a", "204a",
               "204a", "204a")
    Type <- c("wholefish", "flesh", "flesh", "fleshdelip", "formula",
              "formuladelip", "formula", "formuladelip", "wholefish",
              "wholefishdelip")
    Proportion <- c(1, 1, 0.67714, 0.67714, 0.32285, 0.32285, 0.32285,
                    0.32285, 0.67714, 0.67714)
    N <- (1:10)
    C <- (1:10)
    Code <- c("c", "a", "a", "b", "a", "b", "c", "d", "c", "d")

    df <- data_frame(Label, Type, Proportion, N, C, Code)
    df
    #> # A tibble: 10 x 6
    #>    Label           Type Proportion     N     C  Code
    #>    <chr>          <chr>      <dbl> <int> <int> <chr>
    #>  1  203c      wholefish    1.00000     1     1     c
    #>  2  203c          flesh    1.00000     2     2     a
    #>  3  204a          flesh    0.67714     3     3     a
    #>  4  204a     fleshdelip    0.67714     4     4     b
    #>  5  204a        formula    0.32285     5     5     a
    #>  6  204a   formuladelip    0.32285     6     6     b
    #>  7  204a        formula    0.32285     7     7     c
    #>  8  204a   formuladelip    0.32285     8     8     d
    #>  9  204a      wholefish    0.67714     9     9     c
    #> 10  204a wholefishdelip    0.67714    10    10     d

    df %>%
      group_by(Label, Code) %>%
      mutate_if(is.numeric, sum) %>%
      mutate_if(is.character, funs(paste(unique(.), collapse = "_"))) %>%
      distinct()
    #> # A tibble: 6 x 6
    #> # Groups:   Label, Code [6]
    #>   Label                        Type Proportion     N     C  Code
    #>   <chr>                       <chr>      <dbl> <int> <int> <chr>
    #> 1  203c                   wholefish    1.00000     1     1     c
    #> 2  203c                       flesh    1.00000     2     2     a
    #> 3  204a               flesh_formula    0.99999     8     8     a
    #> 4  204a     fleshdelip_formuladelip    0.99999    10    10     b
    #> 5  204a           formula_wholefish    0.99999    16    16     c
    #> 6  204a formuladelip_wholefishdelip    0.99999    18    18     d
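For readers on a recent dplyr, where funs() is deprecated and the _if variants are superseded by across(), the same idea might be written as a single summarise() call. This is a sketch that assumes dplyr 1.0.0 or later and character (not factor) columns; it is not the answer author's code, and the result name total is chosen only for illustration.

```r
library(dplyr)

# The question's data, with strings kept as character vectors.
df <- data.frame(
  Label = c("203c", "203c", "204a", "204a", "204a", "204a", "204a", "204a",
            "204a", "204a"),
  Type = c("wholefish", "flesh", "flesh", "fleshdelip", "formula",
           "formuladelip", "formula", "formuladelip", "wholefish",
           "wholefishdelip"),
  Proportion = c(1, 1, 0.67714, 0.67714, 0.32285, 0.32285, 0.32285,
                 0.32285, 0.67714, 0.67714),
  N = 1:10,
  C = 1:10,
  Code = c("c", "a", "a", "b", "a", "b", "c", "d", "c", "d"),
  stringsAsFactors = FALSE
)

total <- df %>%
  group_by(Label, Code) %>%
  summarise(
    # Sum every numeric column within each Label/Code group.
    across(where(is.numeric), sum),
    # Paste the unique values of every remaining character column together.
    # across() skips the grouping columns Label and Code automatically.
    across(where(is.character), ~ paste(unique(.x), collapse = "_")),
    .groups = "drop"
  )

print(total)
```

Because summarise() already returns one row per group, the distinct() step is no longer needed in this variant.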

mrhellmann · Answer

This is adapted from https://stackoverflow.com/a/15935166/7547327. The resulting Type2 column uses a comma separator instead of an underscore.

    total <- df %>%
      group_by(Label, Code) %>%
      summarise(sums = sum(Proportion), Type2 = toString(Type))

Yields:

    # A tibble: 6 x 4
    # Groups:   Label [?]
       Label   Code    sums                        Type2
      <fctr> <fctr>   <dbl>                        <chr>
    1   203c      a 1.00000                        flesh
    2   203c      c 1.00000                    wholefish
    3   204a      a 0.99999               flesh, formula
    4   204a      b 0.99999     fleshdelip, formuladelip
    5   204a      c 0.99999           formula, wholefish
    6   204a      d 0.99999 formuladelip, wholefishdelip
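The comma separator comes from toString(), which joins a vector with ", ". If the underscore-separated form from the question is preferred, paste() with collapse = "_" can be swapped in. A small base R sketch, with variable names chosen only for illustration:

```r
# Two Type values from one group in the question's data.
types <- c("flesh", "formula")

comma_joined      <- toString(types)                # comma-and-space separated
underscore_joined <- paste(types, collapse = "_")   # underscore separated
same_as_tostring  <- paste(types, collapse = ", ")  # explicit equivalent of toString()

print(comma_joined)
print(underscore_joined)
```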