Spark DataFrameに列があるかどうかを検出する方法

Question

Spark SQLのJSONファイルからDataFrameを作成する場合、.selectを呼び出す前に特定の列が存在するかどうかを確認するにはどうすればよいですか

JSONスキーマの例：

{ "a": { "b": 1, "c": 2 } }

これは私がやりたいことです：

potential_columns = Seq("b", "c", "d") df = sqlContext.read.json(filename) potential_columns.map(column => if(df.hasColumn(column)) df.select(s"a.$column"))

しかし、hasColumnに適した関数が見つかりません。私が得た最も近いのは、列がこのやや厄介な配列にあるかどうかをテストすることです：

scala> df.select("a.*").columns res17: Array[String] = Array(b, c)

zero323 · Accepted Answer

存在すると仮定し、Tryで失敗させます。プレーンでシンプルで、任意のネストをサポートします。

import scala.util.Try import org.Apache.spark.sql.DataFrame def hasColumn(df: DataFrame, path: String) = Try(df(path)).isSuccess val df = sqlContext.read.json(sc.parallelize( """{"foo": [{"bar": {"foobar": 3}}]}""" :: Nil)) hasColumn(df, "foobar") // Boolean = false hasColumn(df, "foo") // Boolean = true hasColumn(df, "foo.bar") // Boolean = true hasColumn(df, "foo.bar.foobar") // Boolean = true hasColumn(df, "foo.bar.foobaz") // Boolean = false

またはさらに簡単：

val columns = Seq( "foobar", "foo", "foo.bar", "foo.bar.foobar", "foo.bar.foobaz") columns.flatMap(c => Try(df(c)).toOption) // Seq[org.Apache.spark.sql.Column] = List( // foo, foo.bar AS bar#12, foo.bar.foobar AS foobar#13)

同等のPython：

from pyspark.sql.utils import AnalysisException from pyspark.sql import Row def has_column(df, col): try: df[col] return True except AnalysisException: return False df = sc.parallelize([Row(foo=[Row(bar=Row(foobar=3))])]).toDF() has_column(df, "foobar") ## False has_column(df, "foo") ## True has_column(df, "foo.bar") ## True has_column(df, "foo.bar.foobar") ## True has_column(df, "foo.bar.foobaz") ## False

Jai Prakash · Answer

私が通常使用する別のオプションは

df.columns.contains("column-name-to-check")

これはブール値を返します

Daniel B. · Answer

実際、列を使用するためにselectを呼び出す必要はありません。データフレーム自体で呼び出すことができます。

// define test data case class Test(a: Int, b: Int) val testList = List(Test(1,2), Test(3,4)) val testDF = sqlContext.createDataFrame(testList) // define the hasColumn function def hasColumn(df: org.Apache.spark.sql.DataFrame, colName: String) = df.columns.contains(colName) // then you can just use it on the DF with a given column name hasColumn(testDF, "a") // <-- true hasColumn(testDF, "c") // <-- false

または、pimp my libraryパターンを使用して暗黙的なクラスを定義し、hasColumnメソッドをデータフレームで直接使用できるようにすることができます

implicit class DataFrameImprovements(df: org.Apache.spark.sql.DataFrame) { def hasColumn(colName: String) = df.columns.contains(colName) }

その後、次のように使用できます。

testDF.hasColumn("a") // <-- true testDF.hasColumn("c") // <-- false

Michael Lloyd Lee mlk · Answer

これに対する他のオプションは、df.columnsおよびpotential_columnsで配列操作（この場合はintersect）を実行することです。

// Loading some data (so you can just copy & paste right into spark-Shell) case class Document( a: String, b: String, c: String) val df = sc.parallelize(Seq(Document("a", "b", "c")), 2).toDF // The columns we want to extract val potential_columns = Seq("b", "c", "d") // Get the intersect of the potential columns and the actual columns, // we turn the array of strings into column objects // Finally turn the result into a vararg (: _*) df.select(potential_columns.intersect(df.columns).map(df(_)): _*).show

残念ながら、これは上記の内部オブジェクトシナリオでは機能しません。そのためのスキーマを調べる必要があります。

potential_columnsを完全修飾列名に変更します

val potential_columns = Seq("a.b", "a.c", "a.d") // Our object model case class Document( a: String, b: String, c: String) case class Document2( a: Document, b: String, c: String) // And some data... val df = sc.parallelize(Seq(Document2(Document("a", "b", "c"), "c2")), 2).toDF // We go through each of the fields in the schema. // For StructTypes we return an array of parentName.fieldName // For everything else we return an array containing just the field name // We then flatten the complete list of field names // Then we intersect that with our potential_columns leaving us just a list of column we want // we turn the array of strings into column objects // Finally turn the result into a vararg (: _*) df.select(df.schema.map(a => a.dataType match { case s : org.Apache.spark.sql.types.StructType => s.fieldNames.map(x => a.name + "." + x) case _ => Array(a.name) }).flatMap(x => x).intersect(potential_columns).map(df(_)) : _*).show

これは1レベルだけ深くなるので、汎用的にするには、さらに作業が必要になります。

Nitin Mathur · Answer

Tryは、決定を行う前にTry内の式を評価するため、最適ではありません。

大きなデータセットの場合、Scalaで以下を使用します。

df.schema.fieldNames.contains("column_name")

mfryar · Answer

Pythonソリューションを探してこれに遭遇した人のために、私は以下を使用します。

if 'column_name_to_check' in df.columns: # do something

Pythonを使用してdf.columns.contains('column-name-to-check')の@Jai Prakashの回答を試してみたところ、AttributeError: 'list' object has no attribute 'contains'を取得しました。

Shaun Ryan · Answer

ロードするときにスキーマ定義を使用してJSONを細断する場合、列を確認する必要はありません。 JSONソースにない場合は、null列として表示されます。

 val schemaJson = """ { "type": "struct", "fields": [ { "name": field1 "type": "string", "nullable": true, "metadata": {} }, { "name": field2 "type": "string", "nullable": true, "metadata": {} } ] } """ val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType] val djson = sqlContext.read .schema(schema ) .option("badRecordsPath", readExceptionPath) .json(dataPath)