SparkSQL: How to deal with null values in a user-defined function?

Question

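Given Table 1 with a single column "x" of type String, I want to create Table 2 with a column "y" that is the integer representation of the date strings in "x".

It is essential that null values are preserved in column "y".

Table 1 (DataFrame df1):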

+----------+
|         x|
+----------+
|2015-09-12|
|2015-09-13|
|      null|
|      null|
+----------+
root
 |-- x: string (nullable = true)

Table 2 (DataFrame df2):

+----------+--------+
|         x|       y|
+----------+--------+
|      null|    null|
|      null|    null|
|2015-09-12|20150912|
|2015-09-13|20150913|
+----------+--------+
root
 |-- x: string (nullable = true)
 |-- y: integer (nullable = true)

The following user-defined function (udf) converts the values in column "x" into those of column "y":

val extractDateAsInt = udf[Int, String](
  (d: String) => d.substring(0, 10).filterNot("-".toSet).toInt
)

It works for non-null input, but it cannot deal with null values.

I can do something like this:

val extractDateAsIntWithNull = udf[Int, String](
  (d: String) =>
    if (d != null) d.substring(0, 10).filterNot("-".toSet).toInt
    else 1
)

However, I have found no way to "produce" null values via udfs (of course, since an Int cannot be null).
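The limitation comes from Scala itself rather than from Spark: Int is a value type with no null inhabitant. The following is a minimal plain-Scala sketch (no Spark involved; the object and value names are illustrative only) of what happens when null is forced into an Int, next to the boxed Java type, which is a reference type.

```scala
object IntCannotBeNull {
  // `val n: Int = null` does not compile. Forcing the cast silently yields
  // the default value 0, so the information "missing" is lost.
  val forced: Int = null.asInstanceOf[Int]

  // The boxed Java type is a reference type and can hold null.
  val boxed: java.lang.Integer = null
}
```

This is why a placeholder such as 1 in the else branch above is the only thing an Int-returning function can hand back for a null input.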

My current solution for creating df2 (Table 2) is as follows:

// holds data of table 1
val df1 = ...

// keep only the entries of df1 that are not null
val dfNotNulls = df1.filter(df1("x").isNotNull)
  .withColumn("y", extractDateAsInt(df1("x")))
  .withColumnRenamed("x", "right_x")

// create df2 via a left outer join of df1 and dfNotNulls
val df2 = df1.join(
  dfNotNulls,
  df1("x") === dfNotNulls("right_x"),
  "leftouter"
).drop("right_x")

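Questions:

The current solution seems cumbersome (and is probably inefficient in terms of performance). Is there a better way?
@Spark-developers: Is there a type NullableInt planned or available, such that the following udf becomes possible (see code excerpt)?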

Code excerpt

// NullableInt is the hypothetical type asked about above; it does not exist.
val extractDateAsNullableInt = udf[NullableInt, String](
  (d: String) =>
    if (d != null) d.substring(0, 10).filterNot("-".toSet).toInt
    else null
)

zero323 · Accepted Answer

This is where Option comes in handy:

val extractDateAsOptionInt = udf((d: String) => d match {
  case null => None
  case s => Some(s.substring(0, 10).filterNot("-".toSet).toInt)
})

or, to make it slightly more robust in the general case:

import scala.util.Try

val extractDateAsOptionInt = udf((d: String) => Try(
  d.substring(0, 10).filterNot("-".toSet).toInt
).toOption)
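To make the difference between the two variants concrete, here is a minimal plain-Scala sketch of the function bodies alone, without the udf wrapper or Spark. The names DateParsing, viaMatch and viaTry are illustrative and not part of the answer.

```scala
import scala.util.Try

object DateParsing {
  // Variant 1: only null is mapped to None; malformed input still throws.
  def viaMatch(d: String): Option[Int] = d match {
    case null => None
    case s    => Some(s.substring(0, 10).filterNot("-".toSet).toInt)
  }

  // Variant 2: any non-fatal exception (null input, a string shorter than
  // 10 characters, non-digit characters) is turned into None.
  def viaTry(d: String): Option[Int] =
    Try(d.substring(0, 10).filterNot("-".toSet).toInt).toOption
}
```

Under this sketch, viaMatch("2015-09") throws a StringIndexOutOfBoundsException while viaTry("2015-09") returns None. The Try variant is more forgiving, but it also hides malformed data silently, which may or may not be what you want.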

All credit goes to Dmitriy Selivanov, who pointed out this solution as a (missing?) edit here.

An alternative is to handle null outside the UDF:

import org.apache.spark.sql.functions.{lit, udf, when}
import org.apache.spark.sql.types.IntegerType

// Plain Int-returning UDF; the when/otherwise guard below routes null rows around it.
val extractDateAsInt = udf(
  (d: String) => d.substring(0, 10).filterNot("-".toSet).toInt
)

df.withColumn("y",
  when($"x".isNull, lit(null))
    .otherwise(extractDateAsInt($"x"))
    .cast(IntegerType)
)

tristanbuckner · Answer

Scala actually has a nice factory function, Option(), that can make this even more concise:

val extractDateAsOptionInt = udf((d: String) =>
  Option(d).map(_.substring(0, 10).filterNot("-".toSet).toInt))

Internally, the Option object's apply method is just doing the null check for you:

def apply[A](x: A): Option[A] = if (x == null) None else Some(x)
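A small plain-Scala sketch of that behaviour, contrasting Option(...) with Some(...); the object and value names are illustrative only:

```scala
object OptionFactory {
  val raw: String = null

  // Option(...) runs the null check: a null input becomes None.
  val viaOption: Option[String] = Option(raw)

  // Some(...) performs no check: the null is wrapped, not removed.
  val viaSome: Option[String] = Some(raw)

  // map is skipped on None, so the parsing code never sees the null.
  val parsed: Option[Int] =
    viaOption.map(_.substring(0, 10).filterNot("-".toSet).toInt)
}
```

The distinction matters here because calling map on viaSome would run the parsing code on null and throw a NullPointerException.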

Martin Senne · Answer

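Supplementary code

Building on the nice answer by @zero323, I wrote the following code to make user-defined functions available that handle null values as described. I hope it is helpful for others!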

/**
 * Set of methods to construct [[org.apache.spark.sql.UserDefinedFunction]]s that
 * handle `null` values.
 */
object NullableFunctions {

  import org.apache.spark.sql.functions._
  import scala.reflect.runtime.universe.TypeTag
  // On Spark 2.0 and later this class lives in org.apache.spark.sql.expressions.
  import org.apache.spark.sql.UserDefinedFunction

  /**
   * Given a function A1 => RT, create a [[org.apache.spark.sql.UserDefinedFunction]] such that
   *   * if the input is null, None is returned. This creates a null value in the output Spark column.
   *   * if the input is non-null, Some(f(input)) is returned, thus creating f(input) as the value in the output column.
   * @param f function from A1 => RT
   * @tparam RT return type
   * @tparam A1 input parameter type
   * @return a [[org.apache.spark.sql.UserDefinedFunction]] with the behaviour described above
   */
  def nullableUdf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction = {
    udf[Option[RT], A1]((i: A1) => i match {
      case null => None
      case s => Some(f(s))
    })
  }

  /**
   * Given a function (A1, A2) => RT, create a [[org.apache.spark.sql.UserDefinedFunction]] such that
   *   * if one of the input parameters is null, None is returned.
   *     This creates a null value in the output Spark column.
   *   * if both input parameters are non-null, Some(f(input1, input2)) is returned, thus creating
   *     f(input1, input2) as the value in the output column.
   * @param f function from (A1, A2) => RT
   * @tparam RT return type
   * @tparam A1 type of the first input parameter
   * @tparam A2 type of the second input parameter
   * @return a [[org.apache.spark.sql.UserDefinedFunction]] with the behaviour described above
   */
  def nullableUdf[RT: TypeTag, A1: TypeTag, A2: TypeTag](f: Function2[A1, A2, RT]): UserDefinedFunction = {
    udf[Option[RT], A1, A2]((i1: A1, i2: A2) => (i1, i2) match {
      case (null, _) => None
      case (_, null) => None
      case (s1, s2) => Some(f(s1, s2))
    })
  }
}
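The null-lifting idea behind nullableUdf can be tried out without Spark. The following is a minimal plain-Scala sketch under that assumption: lift1 and lift2 are hypothetical names, and the sketch uses Option(...) in place of the pattern match on null, which is a substitution rather than the code above.

```scala
object NullSafeLift {
  // One argument: Option(a) is None for null, so f only runs on non-null input.
  def lift1[A, R](f: A => R): A => Option[R] =
    (a: A) => Option(a).map(f)

  // Two arguments: the for-comprehension yields None if either input is null.
  def lift2[A, B, R](f: (A, B) => R): (A, B) => Option[R] =
    (a: A, b: B) => for (x <- Option(a); y <- Option(b)) yield f(x, y)
}
```

For example, NullSafeLift.lift1((s: String) => s.length) returns Some(3) for "abc" and None for null. Wrapping such a lifted function in udf would presumably give the same column behaviour as nullableUdf, though that part is not exercised here.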