Filter rows that contain a certain string using dplyr

Question

I need to filter a data frame, using as the criterion the rows that contain the string RTB. I am using dplyr.

d.del <- df %.% group_by(TrackingPixel) %.% summarise(MonthDelivery = as.integer(sum(Revenue))) %.% arrange(desc(MonthDelivery))

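I know I can use the filter function in dplyr, but I don't know exactly how to tell it to check the content of a string.

In particular, I want to check the content of the TrackingPixel column. If the string contains the label RTB, I want to remove that row from the result.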

alex23lemm · Answer

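The answer to the question was already posted by @latemail in the comments above. The second and subsequent arguments of filter are logical conditions, so you can use a regular-expression match via grepl there, like this: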

dplyr::filter(df, !grepl("RTB",TrackingPixel))
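Applied to the pipeline in the question, this could look roughly like the sketch below. The data frame is made up for illustration, since the real df was not posted, and `%>%` is used in place of `%.%`, which later dplyr releases dropped in favour of `%>%`.

```r
library(dplyr)

# Hypothetical data: the original df was not provided in the question.
df <- data.frame(
  TrackingPixel = c("RTB_banner", "direct_1", "video_RTB", "direct_1"),
  Revenue       = c(10.5, 20.2, 5.0, 7.9),
  stringsAsFactors = FALSE
)

d.del <- df %>%
  filter(!grepl("RTB", TrackingPixel)) %>%  # drop rows whose TrackingPixel contains "RTB"
  group_by(TrackingPixel) %>%
  summarise(MonthDelivery = as.integer(sum(Revenue))) %>%
  arrange(desc(MonthDelivery))

print(d.del)
```

Note that grepl treats "RTB" as a regular expression; pass fixed = TRUE if the pattern should be matched literally.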

Since you have not provided the original data, I will add a toy example using the mtcars data set. Imagine you are only interested in cars produced by Mazda or Toyota.

    mtcars$type <- rownames(mtcars)
    dplyr::filter(mtcars, grepl('Toyota|Mazda', type))

       mpg cyl  disp  hp drat    wt  qsec vs am gear carb           type
    1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      Mazda RX4
    2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Mazda RX4 Wag
    3 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 Toyota Corolla
    4 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona

If you would like to do it the other way round, that is, to exclude Toyota and Mazda cars, the filter command looks like this:

dplyr::filter(mtcars, !grepl('Toyota|Mazda', type))

Keiku · Answer

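Solution

It is possible to use str_detect from the stringr package, which is included in the tidyverse. str_detect returns TRUE or FALSE for each element of the given vector, depending on whether it contains the specified string. You can filter on this logical vector. See "Introduction to stringr" for details about the stringr package.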

    library(tidyverse)
    # ─ Attaching packages ──────────────────── tidyverse 1.2.1 ─
    # ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
    # ✔ tibble  1.4.2     ✔ dplyr   0.7.4
    # ✔ tidyr   0.7.2     ✔ stringr 1.2.0
    # ✔ readr   1.1.1     ✔ forcats 0.3.0
    # ─ Conflicts ───────────────────── tidyverse_conflicts() ─
    # ✖ dplyr::filter() masks stats::filter()
    # ✖ dplyr::lag()    masks stats::lag()

    mtcars$type <- rownames(mtcars)
    mtcars %>%
      filter(str_detect(type, 'Toyota|Mazda'))
    #    mpg cyl  disp  hp drat    wt  qsec vs am gear carb           type
    # 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      Mazda RX4
    # 2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Mazda RX4 Wag
    # 3 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 Toyota Corolla
    # 4 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona
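The original question asks to remove matching rows rather than keep them. A minimal sketch of that variant simply negates the str_detect call with `!`. The copy named cars and the result name others are illustrative choices, not part of the original answer. As far as I know, stringr 1.4.0 and later also accept negate = TRUE in str_detect, but the `!` form works with older versions as well.

```r
library(dplyr)
library(stringr)

# Work on a copy so the built-in mtcars data set stays untouched.
cars <- mtcars
cars$type <- rownames(cars)

# Keep only the rows whose type does NOT match the pattern.
others <- cars %>%
  filter(!str_detect(type, "Toyota|Mazda"))

nrow(others)
```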

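The good things about stringr

We should use stringr::str_detect() rather than base::grepl(), for the following reasons.

The functions provided by the stringr package all start with the prefix str_, which makes the code easier to read.
The first argument of stringr functions is always the data (the string or character vector being operated on), followed by the parameters (thank you, Paolo).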

    object <- "stringr"

    # The functions with the same prefix `str_`.
    # The first argument is an object.
    stringr::str_count(object)               # -> 7
    stringr::str_sub(object, 1, 3)           # -> "str"
    stringr::str_detect(object, "str")       # -> TRUE
    stringr::str_replace(object, "str", "")  # -> "ingr"

    # The function names without common points.
    # The position of the argument of the object also does not match.
    base::nchar(object)           # -> 7
    base::substr(object, 1, 3)    # -> "str"
    base::grepl("str", object)    # -> TRUE
    base::sub("str", "", object)  # -> "ingr"

Benchmark

The results of the benchmark test are as follows. For a large data frame, str_detect is faster.

    library(rbenchmark)
    library(tidyverse)

    # The data. Data expo 09. ASA Statistics Computing and Graphics
    # http://stat-computing.org/dataexpo/2009/the-data.html
    df <- read_csv("Downloads/2008.csv")
    print(dim(df))
    # [1] 7009728      29

    benchmark(
      "str_detect" = {df %>% filter(str_detect(Dest, 'MCO|BWI'))},
      "grepl" = {df %>% filter(grepl('MCO|BWI', Dest))},
      replications = 10,
      columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))
    #         test replications elapsed relative user.self sys.self
    # 2      grepl           10  16.480    1.513    16.195    0.248
    # 1 str_detect           10  10.891    1.000     9.594    1.281
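The benchmark above depends on a downloaded file. A rough, self-contained way to try the comparison yourself is sketched below with synthetic data. The vector size and the list of airport codes are assumptions, not the Data Expo data, and it uses base system.time instead of rbenchmark. Timings depend on the machine and on package versions, so no numbers are shown here.

```r
library(stringr)

set.seed(1)
# Synthetic stand-in for the Dest column (assumption: not the real 2008.csv).
dest <- sample(c("MCO", "BWI", "JFK", "LAX", "ORD"), 1e6, replace = TRUE)

# Time each approach once on the same character vector.
t_grepl      <- system.time(r1 <- grepl("MCO|BWI", dest))
t_str_detect <- system.time(r2 <- str_detect(dest, "MCO|BWI"))

# Both approaches should flag exactly the same elements.
stopifnot(identical(r1, r2))

# Elapsed seconds for each approach.
print(c(grepl = t_grepl[["elapsed"]], str_detect = t_str_detect[["elapsed"]]))
```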

Nettle · Answer

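This answer is similar to the others, but uses the preferred stringr::str_detect together with rownames_to_column (from the tibble package, which the tidyverse attaches).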

    library(tidyverse)

    mtcars %>%
      rownames_to_column("type") %>%
      filter(stringr::str_detect(type, 'Toyota|Mazda'))
    #>             type  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    #> 1      Mazda RX4 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
    #> 2  Mazda RX4 Wag 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
    #> 3 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
    #> 4  Toyota Corona 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1

Created on 2018-06-26 by the reprex package (v0.2.0).

Tjebo · Answer

In case you want to find the string in any given column, have a look at:

Remove row if any column contains a specific string

It basically comes down to using filter_at or filter_all.
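As a minimal sketch of that idea, the example below removes every row in which any column contains "RTB". The data frame is hypothetical. The first variant uses filter_all with all_vars, in the style the answer names. The second uses if_any, which assumes dplyr 1.0.4 or later, where the scoped verbs are marked as superseded.

```r
library(dplyr)
library(stringr)

# Hypothetical data for illustration; all columns are character.
df <- data.frame(
  a = c("RTB_1", "x", "y"),
  b = c("u", "v", "has RTB"),
  stringsAsFactors = FALSE
)

# Scoped-verb style: keep rows where no column contains "RTB".
kept_scoped <- df %>%
  filter_all(all_vars(!str_detect(., "RTB")))

# Same result with if_any() (assumes dplyr >= 1.0.4).
kept_if_any <- df %>%
  filter(!if_any(everything(), ~ str_detect(.x, "RTB")))

print(kept_scoped)
```

Both variants assume character columns. With mixed column types, it may be safer to restrict the selection, for example with where(is.character) inside if_any.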