How to read JSON with a schema in Spark DataFrames / Spark SQL

Question

Could someone please help me, or give me some good suggestions, on how to read this JSON with Spark SQL / DataFrames?

{ "billdate":"2016-08-08', "accountid":"xxx" "accountdetails":{ "total":"1.1" "category":[ { "desc":"one", "currentinfo":{ "value":"10" }, "subcategory":[ { "categoryDesc":"sub", "value":"10", "currentinfo":{ "value":"10" } }] }] } }

Thanks,

Ram Ghadiyaram · Accepted Answer

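Your JSON does not appear to be valid: the "billdate" value is closed with a single quote, and commas are missing after "accountid":"xxx" and after "total":"1.1". Please check it with http://www.jsoneditoronline.org/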

Also see an-introduction-to-json-support-in-spark-sql.html

If you want to register the data as a table, you can register it and print its schema like this:

    DataFrame df = sqlContext.read().json("/path/to/validjsonfile").toDF();
    df.registerTempTable("df");
    df.printSchema();

Below is a sample code snippet for selecting nested levels:

    DataFrame app = df.select("toplevel");
    app.registerTempTable("toplevel");
    app.printSchema();
    app.show();

    DataFrame appName = app.select("toplevel.sublevel");
    appName.registerTempTable("sublevel");
    appName.printSchema();
    appName.show();
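Applied to the JSON from the question, the same idea might look like the Scala sketch below. It assumes Spark 2.2 or later (for `SparkSession` and `json(Dataset[String])`), a local master, and a corrected copy of the question's JSON in which the single quote is replaced and the missing commas are added; that corrected document is only a guess at what the asker intended. The names `billJson`, `billSchema`, and `flat` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[*]").appName("read-bill-json").getOrCreate()
import spark.implicits._

// Assumed corrected form of the question's JSON (quotes fixed, commas added).
val billJson =
  """{"billdate":"2016-08-08","accountid":"xxx","accountdetails":{"total":"1.1","category":[{"desc":"one","currentinfo":{"value":"10"},"subcategory":[{"categoryDesc":"sub","value":"10","currentinfo":{"value":"10"}}]}]}}"""

// Explicit schema, built from the innermost struct outwards.
val infoType = StructType(Seq(StructField("value", StringType)))
val subcategoryType = StructType(Seq(
  StructField("categoryDesc", StringType),
  StructField("value", StringType),
  StructField("currentinfo", infoType)))
val categoryType = StructType(Seq(
  StructField("desc", StringType),
  StructField("currentinfo", infoType),
  StructField("subcategory", ArrayType(subcategoryType))))
val billSchema = StructType(Seq(
  StructField("billdate", StringType),
  StructField("accountid", StringType),
  StructField("accountdetails", StructType(Seq(
    StructField("total", StringType),
    StructField("category", ArrayType(categoryType)))))))

// Read the JSON string with the explicit schema instead of inferring it.
val df = spark.read.schema(billSchema).json(Seq(billJson).toDS())
df.printSchema()

// One output row per element of accountdetails.category, then pick nested fields.
val flat = df
  .select($"accountid", explode($"accountdetails.category").as("cat"))
  .select($"accountid", $"cat.desc".as("desc"), $"cat.currentinfo.value".as("value"))
flat.show()
```

Supplying the schema up front skips the inference pass over the data and keeps column types stable. To read from a file instead, replace `Seq(billJson).toDS()` with a path to a file that holds one JSON document per line.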

Scala example:

    {"name":"Michael", "cities":["palo alto", "menlo park"], "schools":[{"sname":"stanford", "year":2010}, {"sname":"berkeley", "year":2012}]}
    {"name":"Andy", "cities":["santa cruz"], "schools":[{"sname":"ucsb", "year":2011}]}
    {"name":"Justin", "cities":["portland"], "schools":[{"sname":"berkeley", "year":2014}]}

    val people = sqlContext.read.json("people.json")
    // people: org.apache.spark.sql.DataFrame

Reading top-level fields

    val names = people.select('name).collect()
    // names: Array[org.apache.spark.sql.Row] = Array([Michael], [Andy], [Justin])

    names.map(row => row.getString(0))
    // res88: Array[String] = Array(Michael, Andy, Justin)

Use the select() method to specify the top-level field, collect() to gather the result into an Array[Row], and getString() to access the column inside each Row.

Flattening and reading a JSON array

Each person has an array of "cities". Let's flatten these arrays and read all of their elements.

    val flattened = people.explode("cities", "city"){c: List[String] => c}
    // flattened: org.apache.spark.sql.DataFrame

    val allCities = flattened.select('city).collect()
    // allCities: Array[org.apache.spark.sql.Row]

    allCities.map(row => row.getString(0))
    // res92: Array[String] = Array(palo alto, menlo park, santa cruz, portland)

The explode() method expands, or flattens, the cities array into a new column named "city". We then use select() to pick the new column, collect() to gather it into an Array[Row], and getString() to access the data inside each Row.
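On Spark 2.x, `DataFrame.explode` is deprecated in favor of `select` combined with `functions.explode`. A minimal sketch of the same flattening with that function follows; it assumes Spark 2.2 or later and a local master, and it builds the data from inline strings (trimmed to the `name` and `cities` fields) so that it does not depend on a `people.json` file.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().master("local[*]").appName("explode-cities").getOrCreate()
import spark.implicits._

// Same people as in the example above, reduced to the fields used here.
val peopleJson = Seq(
  """{"name":"Michael","cities":["palo alto","menlo park"]}""",
  """{"name":"Andy","cities":["santa cruz"]}""",
  """{"name":"Justin","cities":["portland"]}""")
val people = spark.read.json(peopleJson.toDS())

// explode() emits one row per array element; as[String] avoids manual Row access.
val cities: Array[String] = people
  .select(explode($"cities").as("city"))
  .as[String]
  .collect()

cities.foreach(println)
```

Converting to `Dataset[String]` before `collect()` replaces the `row.getString(0)` step, at the cost of requiring `spark.implicits._` in scope.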

Reading an array of nested JSON objects without flattening

Next, read the "schools" data, which is an array of nested JSON objects. Each element of the array holds a school name and a year.

    val schools = people.select('schools).collect()
    // schools: Array[org.apache.spark.sql.Row]

    val schoolsArr = schools.map(row => row.getSeq[org.apache.spark.sql.Row](0))
    // schoolsArr: Array[Seq[org.apache.spark.sql.Row]]

    schoolsArr.foreach(schools => {
      schools.map(row => print(row.getString(0), row.getLong(1)))
      print("\n")
    })
    // (stanford,2010)(berkeley,2012)
    // (ucsb,2011)
    // (berkeley,2014)

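Use select() and collect() to pick the "schools" array and gather it into an Array[Row]. Each "schools" array is now of type List[Row], so we read it with the getSeq[Row]() method. Finally, we can read the details of each school by calling getString() for the school name and getLong() for the year.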
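If one row per school is more convenient than nested sequences, the array of structs can also be flattened with `functions.explode` and the struct fields selected by dotted path. This is a sketch under the same assumptions as before (Spark 2.2 or later, local master, inline data trimmed to `name` and `schools`), not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder().master("local[*]").appName("explode-schools").getOrCreate()
import spark.implicits._

val peopleJson = Seq(
  """{"name":"Michael","schools":[{"sname":"stanford","year":2010},{"sname":"berkeley","year":2012}]}""",
  """{"name":"Andy","schools":[{"sname":"ucsb","year":2011}]}""",
  """{"name":"Justin","schools":[{"sname":"berkeley","year":2014}]}""")
val people = spark.read.json(peopleJson.toDS())

// One row per (person, school); nested struct fields are addressed as school.<field>.
val schools = people
  .select($"name", explode($"schools").as("school"))
  .select($"name", $"school.sname".as("sname"), $"school.year".as("year"))

// JSON integers are inferred as LongType, hence Long for the year.
val rows: Array[(String, String, Long)] = schools.as[(String, String, Long)].collect()
rows.foreach(println)
```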

Raghavan · Answer

To read JSON files based on a schema in Spark 2.2, you can try the following code:

    import org.apache.spark.sql.types.{DataType, StructType}

    // Infer the schema from a sample file and serialize it as a JSON string
    val schema_json = spark.read.json("/user/Files/ActualJson.json").schema.json

    // Rebuild a StructType from the schema JSON
    val newSchema = DataType.fromJson(schema_json).asInstanceOf[StructType]

    // Read the JSON files using that schema
    val df = spark.read.schema(newSchema).json("Json_Files/Folder Path")