Appending new records with previously unseen string values to a data frame causes unseen factor levels, a warning, and NA results

Question

I have a data frame (14.5K rows x 15 columns) containing billing data from 2001 to 2007.

I append the new 2008 data to it with alltime <- rbind(alltime, all2008).

Unfortunately, that generates a warning:

> Warning message: In `[<-.factor`(`*tmp*`, ri, value = c(NA, NA, NA, NA, NA, NA, NA, : invalid factor level, NAs generated

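My guess is that there are new patients whose names were not in the previous data frame, so it does not know what level to give them. The same goes for new, unseen names in the "referring doctor" column.

What is the solution?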

Marek · Accepted Answer

It could be caused by a type mismatch between the two data.frames.

First of all, check the types (classes). For diagnostic purposes, do this:

    new2old <- rbind(alltime, all2008)  # this gives you a warning
    old2new <- rbind(all2008, alltime)  # this should be without warning
    cbind(
      alltime = sapply(alltime, class),
      all2008 = sapply(all2008, class),
      new2old = sapply(new2old, class),
      old2new = sapply(old2new, class)
    )

I expect there to be a row that looks like this:

                alltime  all2008   new2old  old2new
    ...         ...      ...       ...      ...
    some_column "factor" "numeric" "factor" "character"
    ...         ...      ...       ...      ...

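If so, the explanation is that rbind does not check that the types match. If you analyze the rbind.data.frame code, you can see that the first argument initializes the output types. If a column in the first data.frame is a factor, the output data.frame column is a factor with levels unique(c(levels(x1), levels(x2))). But when the corresponding column in the second data.frame is not a factor, levels(x2) is NULL, so the levels are not extended.

This means that your output data is wrong! There are NAs in place of the true values.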
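The warning names `[<-.factor`, so the effect can be reproduced on a single factor without involving rbind at all. The following is a minimal sketch with made-up values (the `doctor`, `broken`, and `fixed` names are illustrative, not from the question):

```r
# A factor only accepts values that are already among its levels.
doctor <- factor(c("X", "Y", "X"))

# Assigning an unseen value triggers the "invalid factor level" warning
# and stores NA; suppressWarnings() just keeps the demo quiet.
broken <- doctor
suppressWarnings(broken[4] <- "Z")
is.na(broken[4])

# Extending the levels first lets the same assignment succeed.
fixed <- doctor
levels(fixed) <- c(levels(fixed), "Z")
fixed[4] <- "Z"
as.character(fixed)
```

The second half shows why keeping the level sets in sync avoids the NAs.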

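I suppose that:

1) you created your old data with another R/RODBC version, so the types were created by different methods (different settings, perhaps the decimal separator), or
2) there are NULLs or some specific data in the problematic column, e.g. someone changed the column in the underlying database.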

Solution:

Find the wrong column, find out why it is wrong, and fix it. Eliminate the cause, not the symptoms.
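One way to find the offending column is to compare the column classes programmatically rather than by eye. This is only a sketch: `mismatched_columns` is a hypothetical helper, and `old` and `new` are made-up stand-ins for alltime and all2008.

```r
# Hypothetical helper: list the shared columns whose first class differs
# between two data frames.
mismatched_columns <- function(a, b) {
  shared <- intersect(names(a), names(b))
  first_class <- function(df) {
    vapply(df[shared], function(col) class(col)[1], character(1))
  }
  cls_a <- first_class(a)
  cls_b <- first_class(b)
  differs <- cls_a != cls_b
  data.frame(column = shared[differs],
             first = unname(cls_a[differs]),
             second = unname(cls_b[differs]),
             stringsAsFactors = FALSE)
}

# Made-up sample data: "amount" was read as text in the old data.
old <- data.frame(patient = c("Ann", "Bob"), amount = c("10", "20"),
                  stringsAsFactors = FALSE)
new <- data.frame(patient = "Carol", amount = 30, stringsAsFactors = FALSE)
mismatched_columns(old, new)
```

An empty result would suggest that the classes agree and the cause lies elsewhere, for example in the factor levels themselves.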

Steve Lianoglou · Answer

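The "easy" way is simply not to have your strings set as factors when importing text data.

The read.{table,csv,...} functions take a stringsAsFactors parameter, which defaults to TRUE in R versions before 4.0.0. You can set it to FALSE while you are importing and rbind-ing your data.

If you would like the column to be a factor at the end, you can do that too.

For example: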

    alltime <- read.table("alltime.txt", stringsAsFactors=FALSE)
    all2008 <- read.table("all2008.txt", stringsAsFactors=FALSE)
    alltime <- rbind(alltime, all2008)
    # If you want the doctor column to be a factor, make it so:
    alltime$doctor <- as.factor(alltime$doctor)

Raffael · Answer

1) Create the data frame with stringsAsFactors set to FALSE. This should resolve the factor issue.

2) Afterwards, do not use rbind: it messes up the column names if the data frame is empty. Simply do it like this:

df[nrow(df)+1,] <- c("d","gsgsgd",4)

    > df <- data.frame(a = character(0), b=character(0), c=numeric(0))
    > df[nrow(df)+1,] <- c("d","gsgsgd",4)
    Warnmeldungen:
    1: In `[<-.factor`(`*tmp*`, iseq, value = "d") :
      invalid factor level, NAs generated
    2: In `[<-.factor`(`*tmp*`, iseq, value = "gsgsgd") :
      invalid factor level, NAs generated
    > df <- data.frame(a = character(0), b=character(0), c=numeric(0), stringsAsFactors=F)
    > df[nrow(df)+1,] <- c("d","gsgsgd",4)
    > df
      a      b c
    1 d gsgsgd 4
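One caveat with the row-assignment idiom: c("d","gsgsgd",4) builds a character vector, so the 4 is coerced to "4" before it reaches the numeric column. A small variation, sketched here as a suggestion rather than as part of the original answer, passes a list so that each value keeps its own type:

```r
df <- data.frame(a = character(0), b = character(0), c = numeric(0),
                 stringsAsFactors = FALSE)

# list() keeps each value's own type, so column c stays numeric.
df[nrow(df) + 1, ] <- list("d", "gsgsgd", 4)

sapply(df, class)
```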

rcs · Answer

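As suggested in the previous answer, read the columns as character and do the conversion to factors after rbind. sqlFetch (I assume you are using RODBC) also has a stringsAsFactors or as.is argument that controls the conversion of character columns. The allowed values are the same as for read.table, e.g. as.is=TRUE or a column number.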
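Because the as.is values are said to follow read.table, the behavior can be tried without a database connection. The sketch below uses read.csv on inline text (the `csv_text` and `claims` names and the sample rows are made up); it does not call RODBC itself.

```r
# Inline CSV text stands in for a real file or database table.
csv_text <- "patient,referring_doctor\nAnn,X\nBob,Y"

# as.is = TRUE keeps string columns as character vectors
# instead of converting them to factors.
claims <- read.csv(text = csv_text, as.is = TRUE)
sapply(claims, class)
```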

JSawyer · Answer

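I had the same problem with type mismatches, especially with factors. I had to combine two incompatible datasets.

My solution is to convert the factors in both data frames to "character". It works like a charm :-)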

    convert.factors.to.strings.in.dataframe <- function(dataframe) {
      class.data <- sapply(dataframe, class)
      factor.vars <- class.data[class.data == "factor"]
      for (colname in names(factor.vars)) {
        dataframe[,colname] <- as.character(dataframe[,colname])
      }
      return (dataframe)
    }

If you want to see the types in your two data frames, run the following (changing the variable names):

    cbind("orig" = sapply(allSurveyData, class),
          "merge" = sapply(curSurveyDataMerge, class),
          "eq" = sapply(allSurveyData, class) == sapply(curSurveyDataMerge, class))

smci · Answer

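When creating a data frame, you can choose to make your string columns factors (stringsAsFactors=T) or keep them as strings.

In your case, do not make your string columns factors. Keep them as strings, and appending works fine. If you ultimately need them to be factors, do all the inserting and appending as strings first, then convert them to factors at the end.

If you make the string columns factors and then append rows containing unseen values, you get the warning you mentioned for each new unseen factor level, and the value is replaced with NA...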

    > df <- data.frame(patient=c('Ann','Bob','Carol'), referring_doctor=c('X','Y','X'), stringsAsFactors=T)
      patient referring_doctor
    1     Ann                X
    2     Bob                Y
    3   Carol                X
    > df <- rbind(df, c('Denise','Z'))
    Warning messages:
    1: In `[<-.factor`(`*tmp*`, ri, value = "Denise") :
      invalid factor level, NA generated
    2: In `[<-.factor`(`*tmp*`, ri, value = "Z") :
      invalid factor level, NA generated
    > df
      patient referring_doctor
    1     Ann                X
    2     Bob                Y
    3   Carol                X
    4    <NA>             <NA>

Instead, do not make your string columns factors. Keep them as strings, and appending works fine:

    > df <- data.frame(patient=c('Ann','Bob','Carol'), referring_doctor=c('X','Y','X'), stringsAsFactors=F)
    > df <- rbind(df, c('Denise','Z'))
      patient referring_doctor
    1     Ann                X
    2     Bob                Y
    3   Carol                X
    4  Denise                Z

To change the default behavior:

options(stringsAsFactors=F)

To convert individual columns to or from string or factor:

    df$col <- as.character(df$col)
    df$col <- as.factor(df$col)
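Putting the advice together, the "append as strings, convert at the end" workflow might look like the sketch below. The data is made up, and the alltime and all2008 names are borrowed from the question purely for illustration.

```r
alltime <- data.frame(patient = c("Ann", "Bob"), referring_doctor = c("X", "Y"),
                      stringsAsFactors = FALSE)
all2008 <- data.frame(patient = "Denise", referring_doctor = "Z",
                      stringsAsFactors = FALSE)

# Append while everything is still character, so no levels can be missing.
alltime <- rbind(alltime, all2008)

# Convert to a factor only once all rows are in place.
alltime$referring_doctor <- factor(alltime$referring_doctor)
levels(alltime$referring_doctor)
```

Because the factor is built after the append, its levels cover every value that actually occurs.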

trycash2 · Answer

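Here is a function that takes the common column names of two data frames and does an rbind: it finds the fields that are factors, adds the new factor levels, and then performs the rbind. This should take care of any factor issues.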

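    rbindCommonCols <- function(x, y) {
      # Keep only the columns present in both data frames;
      # drop = FALSE keeps a data frame even when one column is shared.
      commonColNames = intersect(colnames(x), colnames(y))
      x = x[, commonColNames, drop = FALSE]
      y = y[, commonColNames, drop = FALSE]

      # Find the columns that are a factor in at least one data frame.
      colClassesX = sapply(x, class)
      colClassesY = sapply(y, class)
      classMatch = paste(colClassesX, colClassesY, sep = "-")
      factorColIdx = grep("factor", classMatch)

      # Make those columns factors on both sides.
      for (n in factorColIdx) {
        x[,n] = as.factor(x[,n])
        y[,n] = as.factor(y[,n])
      }

      # Extend the levels on both sides to the union of the two level sets.
      for (n in factorColIdx) {
        x[,n] = factor(x[,n], levels = unique(c(levels(x[,n]), levels(y[,n]))))
        y[,n] = factor(y[,n], levels = unique(c(levels(y[,n]), levels(x[,n]))))
      }

      res = rbind(x, y)
      res
    }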
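The core of this approach, stripped down to a single column, is to give both factors the union of their levels before combining. A minimal sketch with made-up data (the `x`, `y`, `all_levels`, and `combined` names are illustrative):

```r
x <- data.frame(doctor = factor(c("X", "Y")))
y <- data.frame(doctor = factor(c("Y", "Z")))

# Give both columns the same level set before combining.
all_levels <- union(levels(x$doctor), levels(y$doctor))
x$doctor <- factor(x$doctor, levels = all_levels)
y$doctor <- factor(y$doctor, levels = all_levels)

combined <- rbind(x, y)
as.character(combined$doctor)
```

Since both inputs already share one level set, the combined column has nothing left to reconcile.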