Extending runs of a specific length

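I have a 640 x 2500 data frame containing numeric values and some NA values. My goal is to find, in each row, every run of at least 75 consecutive NA values. For each such run, I also want to replace the 50 cells before it and the 50 cells after it with NA.

Here is a scaled-down example for a single row.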

x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
#       run of four NA:  ^   ^   ^   ^

I want to detect the run of four consecutive NAs and replace the three values before and after the run with NA.

c(1, 3, 4, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, 4, 3)
#          ^   ^   ^                   ^   ^   ^

I first tried to identify the consecutive NAs with rle, but running rle(is.na(df)) gives the error 'x' must be a vector of an atomic type. This happens even when I select a single row.
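
The error most likely arises because is.na() applied to a data frame returns a logical matrix, while rle() accepts only plain vectors, and a single row selected from a data frame is still a data frame. The following is a minimal sketch of that behaviour, not a full solution; `df_small` is a made-up one-row data frame used only for illustration.

```r
# Hypothetical one-row data frame, for illustration only
df_small <- data.frame(matrix(c(1, NA, NA, 4, NA, 6), nrow = 1))

na_matrix <- is.na(df_small)   # a logical 1 x 6 matrix, not a vector

# rle() rejects the matrix; capture the error message instead of stopping
msg <- tryCatch(rle(na_matrix), error = function(e) conditionMessage(e))
print(msg)

# Flatten the row to a plain logical vector first, then encode it
runs <- rle(as.vector(na_matrix[1, ]))
print(runs$lengths)  # 1 2 1 1 1
print(runs$values)   # FALSE TRUE FALSE TRUE FALSE (TRUE marks a run of NA)
```

For a whole data frame, the same flattening would be applied one row at a time, for example on as.matrix(df).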

Unfortunately, I do not know what the next step would be for converting the 50 cells before and after each run to NA.

Thanks in advance.

r, na, run-length-encoding

2020/05/16 NickB

Here is my solution, although I wonder whether there is a neater one.

library(data.table)
library(dplyr) # provides %>%, as_tibble() and slice(), which are used below
df <- matrix(nrow = 1,ncol = 16)
df[1,] <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
df <- df %>%
  as.data.table() # dataset created

# A function to do what you need
NA_replacer <- function(x){
  Vector <- unlist(x) # pull the values into a vector

  NAs <- which(is.na(Vector)) # locate the positions of the NAs
  NAs_Position_1 <- cumsum(c(1, diff(NAs) - 1)) # Find those that are in sequential order
  NAs_Position_2 <- rle(NAs_Position_1) # Find their values

  NAs <- NAs[which(
    NAs_Position_1 %in% with(NAs_Position_2,
                             values[which(
                               lengths == 4)]))] # Keep only the positions of NAs that belong to a run of exactly 4

  if(length(NAs) > 0){ # Check whether a stretch of 4 NAs was found (assumes at most one such stretch per row)
    Padding <- c(NAs[1] - 3:1, # the 3 positions occurring before the first NA
                 NAs[length(NAs)] + 1:3) # the 3 positions occurring after the last NA
    Padding <- Padding[Padding >= 1 & Padding <= length(Vector)] # drop positions that fall outside the row
    Vector[Padding] <- NA
  }
  Vector
}
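
The grouping step inside NA_replacer is easy to misread, so here is a small worked sketch of it. Consecutive positions differ by 1, so diff() minus 1 is 0 inside a run and positive across a gap, and the cumulative sum therefore gives each run its own ID. The vector `x` is the example row from the question; `y`, which contains two runs, and the other variable names are made up for illustration.

```r
x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
na_pos <- which(is.na(x))                  # 7 8 9 10
run_id <- cumsum(c(1, diff(na_pos) - 1))   # 1 1 1 1 (a single run)
print(rle(run_id)$lengths)                 # 4

# Hypothetical vector with two separate runs of NA
y <- c(NA, NA, 5, 5, 5, NA, NA, NA, 2)
na_pos_y <- which(is.na(y))                    # 1 2 6 7 8
run_id_y <- cumsum(c(1, diff(na_pos_y) - 1))   # 1 1 4 4 4 (two run IDs)
runs_y <- rle(run_id_y)
print(runs_y$lengths)                          # 2 3
print(runs_y$values)                           # 1 4
```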

> df # the original dataset
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1:  1  3  4  5  4  3 NA NA NA  NA   6   9   3   2   4   3

# the transformed dataset
apply(df, 1, function(x) NA_replacer(x)) %>%
  as.data.table() %>%
  data.table::transpose()

   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1:  1  3  4 NA NA NA NA NA NA  NA  NA  NA  NA   2   4   3

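As an aside, the speed is quite good on a dummy data frame of size 640 x 2500. Here the function has to locate stretches of 75 or more NAs and replace the 50 values before and after them with NA.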

df <- matrix(nrow = 640,ncol = 2500)

for(i in 1:nrow(df)){
  df[i,] <- c(1:100,rep(NA,75),rep(1,2325))
}

NA_replacer <- function(x){
  Vector <- unlist(x) # pull the values into a vector

  NAs <- which(is.na(Vector)) # locate the positions of the NAs
  NAs_Position_1 <- cumsum(c(1, diff(NAs) - 1)) # Find those that are in sequential order
  NAs_Position_2 <- rle(NAs_Position_1) # Find their values

  NAs <- NAs[which(
    NAs_Position_1 %in% with(NAs_Position_2,
                             values[which(
                               lengths >= 75)]))] # Keep only the positions of NAs that belong to a run of 75 or more

  if(length(NAs) > 0){ # Check whether such a stretch was found (assumes at most one such stretch per row)
    Padding <- c(NAs[1] - 50:1, # the 50 positions occurring before the first NA
                 NAs[length(NAs)] + 1:50) # the 50 positions occurring after the last NA
    Padding <- Padding[Padding >= 1 & Padding <= length(Vector)] # drop positions that fall outside the row
    Vector[Padding] <- NA
  }
  Vector
}

# Check how many NAs are present in the first row of the dataset prior to applying the function
which(is.na(df %>%
              as_tibble() %>%
              slice(1) %>%
              unlist())) %>% # run the code till here to get the indices of the NAs
  length() 

[1] 75

df <- apply(df, 1, function(x) NA_replacer(x)) %>%
  as.data.table() %>%
  data.table::transpose()

# Check how many NAs are present in the first row post applying the function
which(is.na(df %>%
              slice(1) %>%
              unlist())) %>% # run the code till here to get the indices of the NAs
  length()

[1] 175

system.time(df <- apply(df, 1, function(x) NA_replacer(x)) %>%
              as.data.table() %>%
              data.table::transpose())
   user  system elapsed 
  0.216   0.002   0.220
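
As a possible alternative, the same idea can be sketched directly with rle(), which also covers rows with several qualifying runs and runs that sit close to either end of the row. This is a sketch under assumptions rather than a tested replacement: `extend_na_runs`, `min_run` and `pad` are hypothetical names, and the input is assumed to be a plain vector or a numeric matrix (a data frame would first go through as.matrix()).

```r
# Hypothetical helper: widen every NA run of at least `min_run` cells
# by `pad` extra NAs on each side, clipped to the bounds of the vector.
extend_na_runs <- function(x, min_run, pad) {
  runs   <- rle(is.na(x))
  ends   <- cumsum(runs$lengths)        # last index of each run
  starts <- ends - runs$lengths + 1     # first index of each run
  long   <- which(runs$values & runs$lengths >= min_run)
  for (i in long) {
    from <- max(starts[i] - pad, 1)
    to   <- min(ends[i] + pad, length(x))
    x[from:to] <- NA
  }
  x
}

# The example row from the question
x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
print(extend_na_runs(x, min_run = 4, pad = 3))
# 1 3 4 NA NA NA NA NA NA NA NA NA NA 2 4 3

# Row-wise over a small made-up matrix; apply() returns each processed
# row as a column, so t() restores the original orientation.
m <- matrix(1, nrow = 3, ncol = 300)
m[, 101:175] <- NA
out <- t(apply(m, 1, extend_na_runs, min_run = 75, pad = 50))
print(sum(is.na(out[1, ])))  # 175
```

Because the run boundaries are computed once from the original values, the padding added for one run is never counted towards the length of another.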

2020/05/16 Anurag N. Sharma