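I have a 640 x 2500 data frame containing numeric values and some NA values. My goal is to find, in each row, runs of at least 75 consecutive NA values. For each such run, I also want to replace the 50 cells before it and the 50 cells after it with NA.
Below is a scaled-down example for a single row.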
x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
# run of four NA:        ^   ^   ^   ^
I want to detect the run of four consecutive NAs and replace the three values before the run and the three values after it with NA:
c(1, 3, 4, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, 4, 3)
#          ^   ^   ^                   ^   ^   ^
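First, I tried to identify the consecutive NAs with rle, but running rle(is.na(df)) gives the error 'x' must be a vector of an atomic type. This happens even when I select a single row.
Unfortunately, I also don't know how to do the next step, converting the 50 cells before and after each run to NA.
Thanks in advance.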
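On the rle error: a likely explanation is that is.na() applied to a data frame (even a one-row one) returns a logical matrix, and rle() only accepts plain vectors. The sketch below, which uses a small hypothetical data frame df_small rather than the real data, reproduces the error and shows that flattening the row with unlist() first avoids it.

```r
# df_small is a made-up one-row data frame used only for illustration
df_small <- data.frame(matrix(c(1, 3, NA, NA, NA, NA, 6, 9), nrow = 1))

# is.na() on a data frame returns a logical matrix, which rle() rejects
err <- tryCatch(rle(is.na(df_small)), error = function(e) conditionMessage(e))
print(err)

# Flatten the row to a plain vector first, then rle() works
row_vec <- unlist(df_small[1, ], use.names = FALSE)
runs <- rle(is.na(row_vec))
print(runs$lengths) # run lengths: 2 4 2
print(runs$values)  # FALSE TRUE FALSE, so the middle run is the NA run
```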
Here is my solution, though I wonder whether there is a neater one than mine.
library(data.table)
library(dplyr) # provides %>%, as_tibble() and slice() used below
df <- matrix(nrow = 1, ncol = 16)
df[1, ] <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
df <- df %>%
  as.data.table() # dataset created
# A function to do what you need
NA_replacer <- function(x){
  Vector <- unlist(x) # pull the values into a vector
  NAs <- which(is.na(Vector)) # locate the positions of the NAs
  if(length(NAs) == 0) return(Vector) # nothing to do when the row has no NAs
  NAs_Position_1 <- cumsum(c(1, diff(NAs) - 1)) # consecutive NAs share one group id
  NAs_Position_2 <- rle(NAs_Position_1) # length of each group of consecutive NAs
  Groups <- with(NAs_Position_2, values[lengths == 4]) # groups made of exactly 4 NAs
  for(g in Groups){ # handle each qualifying stretch separately
    Run <- NAs[NAs_Position_1 == g] # positions of the NAs in this stretch
    Pad <- (Run[1] - 3):(Run[length(Run)] + 3) # the stretch plus 3 positions on each side
    Pad <- Pad[Pad >= 1 & Pad <= length(Vector)] # stay inside the vector at the edges
    Vector[Pad] <- NA
  }
  Vector
}
> df # the original dataset
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1: 1 3 4 5 4 3 NA NA NA NA 6 9 3 2 4 3
# the transformed dataset
apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose()
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1: 1 3 4 NA NA NA NA NA NA NA NA NA NA 2 4 3
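The grouping step inside NA_replacer may be easier to follow on a tiny input. The sketch below uses a hypothetical vector with two NA stretches (lengths 2 and 4) and prints the intermediate values; the variable names na_pos and group_id are chosen here for illustration and do not appear in the function itself.

```r
x <- c(NA, NA, 5, 4, 3, NA, NA, NA, NA, 6)

na_pos <- which(is.na(x)) # positions of the NAs: 1 2 6 7 8 9
# diff(na_pos) - 1 is 0 between neighbouring NAs and positive across a gap,
# so the cumulative sum stays constant within a stretch and jumps between stretches
group_id <- cumsum(c(1, diff(na_pos) - 1)) # 1 1 4 4 4 4
runs <- rle(group_id)

print(group_id)
print(runs$lengths) # 2 4: one stretch of two NAs, one stretch of four
print(na_pos[group_id == runs$values[runs$lengths == 4]]) # 6 7 8 9
```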
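As an aside, the speed is quite good on a dummy data frame of size 640 x 2500, where stretches of 75 or more NAs have to be located and the 50 values before and after each stretch replaced with NA.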
df <- matrix(nrow = 640,ncol = 2500)
for(i in 1:nrow(df)){
df[i,] <- c(1:100,rep(NA,75),rep(1,2325))
}
NA_replacer <- function(x){
  Vector <- unlist(x) # pull the values into a vector
  NAs <- which(is.na(Vector)) # locate the positions of the NAs
  if(length(NAs) == 0) return(Vector) # nothing to do when the row has no NAs
  NAs_Position_1 <- cumsum(c(1, diff(NAs) - 1)) # consecutive NAs share one group id
  NAs_Position_2 <- rle(NAs_Position_1) # length of each group of consecutive NAs
  Groups <- with(NAs_Position_2, values[lengths >= 75]) # groups made of 75 or more NAs
  for(g in Groups){ # handle each qualifying stretch separately
    Run <- NAs[NAs_Position_1 == g] # positions of the NAs in this stretch
    Pad <- (Run[1] - 50):(Run[length(Run)] + 50) # the stretch plus 50 positions on each side
    Pad <- Pad[Pad >= 1 & Pad <= length(Vector)] # stay inside the vector at the edges
    Vector[Pad] <- NA
  }
  Vector
}
# Check how many NAs are present in the first row of the dataset prior to applying the function
which(is.na(df %>%
as_tibble() %>%
slice(1) %>%
unlist())) %>% # run the code till here to get the indices of the NAs
length()
[1] 75
df <- apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose()
# Check how many NAs are present in the first row post applying the function
which(is.na(df %>%
slice(1) %>%
unlist())) %>% # run the code till here to get the indices of the NAs
length()
[1] 175
system.time(df <- apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose())
user system elapsed
0.216 0.002 0.220
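As for a neater version, one possibility is to work directly from rle(is.na(x)), which is what the question first attempted. The following is only a sketch under that assumption: pad_na_runs, min_run and pad are hypothetical names introduced here, and the function has not been timed against the solution above.

```r
# Hypothetical helper: widen every run of at least `min_run` NAs by `pad` cells per side
pad_na_runs <- function(x, min_run = 75, pad = 50) {
  r <- rle(is.na(x))
  ends <- cumsum(r$lengths)       # last index of each run
  starts <- ends - r$lengths + 1  # first index of each run
  hit <- which(r$values & r$lengths >= min_run) # NA runs that are long enough
  for (i in hit) {
    lo <- max(1, starts[i] - pad)         # clamp at the left edge
    hi <- min(length(x), ends[i] + pad)   # clamp at the right edge
    x[lo:hi] <- NA
  }
  x
}

# The scaled-down example from the question
x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
small <- pad_na_runs(x, min_run = 4, pad = 3)
print(small)

# Row-wise use on a matrix; apply() returns rows as columns, so transpose back
m <- matrix(1, nrow = 3, ncol = 300)
m[2, 101:175] <- NA
res <- t(apply(m, 1, pad_na_runs))
print(rowSums(is.na(res)))
```

Because the padding is computed from the original run positions, cells that become NA through padding do not themselves trigger further padding.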