Add an empty column to a Spark DataFrame

Question

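As mentioned in many other places on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately, this functionality is important to have (even though it is inefficient in a distributed environment), especially when trying to concatenate two DataFrames using unionAll.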

What is the most elegant workaround for adding a null column to a DataFrame to facilitate a unionAll?

My version looks like this:

    from pyspark.sql.types import StringType
    from pyspark.sql.functions import UserDefinedFunction

    # UDF that ignores its input and always returns None (typed as a string)
    to_none = UserDefinedFunction(lambda x: None, StringType())
    new_df = old_df.withColumn('new_column', to_none(old_df['any_col_from_old']))

zero323 · Accepted Answer

All you need here is a literal and a cast:

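    from pyspark.sql.functions import lit
    from pyspark.sql.types import StringType

    # lit(None) alone has no usable type; the cast gives the column a string type
    new_df = old_df.withColumn('new_column', lit(None).cast(StringType()))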

A full example:

    from pyspark.sql import Row

    # sc is an existing SparkContext; lit and StringType are imported as above
    row = Row("foo", "bar")  # row factory matching the schema printed below
    df = sc.parallelize([row(1, "2"), row(2, "3")]).toDF()
    df.printSchema()

    ## root
    ##  |-- foo: long (nullable = true)
    ##  |-- bar: string (nullable = true)

    new_df = df.withColumn('new_column', lit(None).cast(StringType()))
    new_df.printSchema()

    ## root
    ##  |-- foo: long (nullable = true)
    ##  |-- bar: string (nullable = true)
    ##  |-- new_column: string (nullable = true)

    new_df.show()

    ## +---+---+----------+
    ## |foo|bar|new_column|
    ## +---+---+----------+
    ## |  1|  2|      null|
    ## |  2|  3|      null|
    ## +---+---+----------+

A Scala equivalent can be found here: Create new Dataframe with empty/null field values
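For the unionAll use case from the question, the literal-and-cast approach can be wrapped in a small helper. The following is only a rough sketch, not part of the original answer: the names `missing_columns`, `add_null_columns`, `wanted`, `df_a` and `df_b` are made up for illustration, and it assumes you already know the target column names and types. Because unionAll matches columns by position rather than by name, the helper also selects the columns in a common order.

```python
def missing_columns(existing, wanted):
    """Return the (name, type) pairs from `wanted` whose name is not in `existing`.

    `existing` is a list of column names (e.g. df.columns); `wanted` is a list of
    (name, type) pairs describing the target schema. Order of `wanted` is kept.
    """
    present = set(existing)
    return [(name, dtype) for name, dtype in wanted if name not in present]


def add_null_columns(df, wanted):
    """Add every missing column as a typed null, then order columns as in `wanted`.

    Hypothetical helper: `wanted` uses type names accepted by Column.cast,
    such as "string" or "long".
    """
    # Imported here so that missing_columns stays usable without Spark installed.
    from pyspark.sql.functions import lit

    for name, dtype in missing_columns(df.columns, wanted):
        df = df.withColumn(name, lit(None).cast(dtype))
    # unionAll is positional, so both sides must share the same column order.
    return df.select([name for name, _ in wanted])
```

Assuming two DataFrames `df_a` and `df_b` that each hold a subset of the target columns, usage might look like this:

```python
wanted = [("foo", "long"), ("bar", "string"), ("new_column", "string")]
unioned = add_null_columns(df_a, wanted).unionAll(add_null_columns(df_b, wanted))
```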