How to store custom objects in a Dataset?

Question

According to Introducing Spark Datasets:

As we look forward to Spark 2.0, we plan some exciting improvements to Datasets in particular.

When I try to store a custom type in a Dataset, I get an error like the following:

Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._

or:

java.lang.UnsupportedOperationException: No Encoder found for ....

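Are there any existing workarounds?

Note that this question exists only as an entry point for a Community Wiki answer. Feel free to update or improve both the question and the answer.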

zero323 · Answer

Use generic encoders.

There are currently two generic encoders available, kryo and javaSerialization. The latter is explicitly described as:

extremely inefficient and should only be used as the last resort.

Assuming the following class:
```
class Bar(i: Int) {
  override def toString = s"bar $i"
  def bar = i
}
```
you can use these encoders by adding an implicit encoder:
```
object BarEncoders {
  implicit def barEncoder: org.apache.spark.sql.Encoder[Bar] =
    org.apache.spark.sql.Encoders.kryo[Bar]
}
```
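which can be used together as follows:
```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object Main {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "test", new SparkConf())
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // provides toDS
    import BarEncoders._          // provides the implicit Encoder[Bar]

    val ds = Seq(new Bar(1)).toDS
    ds.show
    sc.stop()
  }
}
```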
It stores objects in a binary column, so converting to a DataFrame gives the following schema:
```
root
 |-- value: binary (nullable = true)
```
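The other generic encoder, javaSerialization, can be plugged in the same way. The sketch below is not part of the answer: it assumes a Serializable variant of Bar (called SerializableBar here purely for illustration, because Java serialization requires java.io.Serializable), a SparkSession entry point, and a local master. It is only meant to show that the pattern carries over and that the objects still come back intact.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

// Hypothetical variant of Bar: Java serialization needs java.io.Serializable.
class SerializableBar(i: Int) extends Serializable {
  override def toString = s"bar $i"
  def bar = i
}

object JavaSerializationSketch {
  def main(args: Array[String]): Unit = {
    // Assumed setup: a local session created only for this sketch.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("java-serialization-sketch")
      .getOrCreate()

    // Same idea as BarEncoders, but backed by Java serialization instead of Kryo.
    implicit val barEncoder: Encoder[SerializableBar] =
      Encoders.javaSerialization[SerializableBar]

    val ds = spark.createDataset(Seq(new SerializableBar(1), new SerializableBar(2)))
    ds.printSchema() // expected to show a single binary column, as with kryo

    // Collecting deserializes the bytes back into SerializableBar instances.
    println(ds.collect().map(_.bar).toList)

    spark.stop()
  }
}
```

Given the warning quoted above, this is probably worth reaching for only when Kryo cannot handle the class.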
It is also possible to encode tuples using the kryo encoder for a specific field:
```
import org.apache.spark.sql.Encoders

val longBarEncoder = Encoders.tuple(Encoders.scalaLong, Encoders.kryo[Bar])

spark.createDataset(Seq((1L, new Bar(1))))(longBarEncoder)
// org.apache.spark.sql.Dataset[(Long, Bar)] = [_1: bigint, _2: binary]
```
Note that this does not rely on implicit encoders: the encoder is passed explicitly, so it most likely won't work with the toDS method.

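Use implicit conversions:

Provide implicit conversions between a representation that can be encoded and the custom class, for example:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import scala.language.implicitConversions

object BarConversions {
  implicit def toInt(bar: Bar): Int = bar.bar
  implicit def toBar(i: Int): Bar = new Bar(i)
}

object Main {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "test", new SparkConf())
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    import BarConversions._

    type EncodedBar = Int

    // toInt converts each Bar to an Int before it reaches the RDD.
    val bars: RDD[EncodedBar] = sc.parallelize(Seq(new Bar(1)))
    val barsDS = bars.toDS

    barsDS.show
    barsDS.map(_.bar).show // toBar converts each Int back so that .bar resolves

    sc.stop()
  }
}
```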
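What the compiler does with these conversions can be seen without Spark. The following plain-Scala sketch redefines Bar so that it is self-contained; the object name ConversionSketch and the explicit Seq[EncodedBar] type argument are choices made for this illustration and are not part of the answer.

```scala
import scala.language.implicitConversions

// Same shape as the Bar class used in the answer, repeated for self-containment.
class Bar(i: Int) {
  override def toString = s"bar $i"
  def bar = i
}

object BarConversions {
  implicit def toInt(bar: Bar): Int = bar.bar
  implicit def toBar(i: Int): Bar = new Bar(i)
}

object ConversionSketch {
  import BarConversions._

  type EncodedBar = Int

  // The element type is Int, so the compiler wraps each Bar in toInt(...).
  val encoded: Seq[EncodedBar] = Seq[EncodedBar](new Bar(1), new Bar(2))

  // The ascription to Bar makes the compiler insert toBar(...).
  val restored: Seq[String] = encoded.map(i => (i: Bar).toString)

  def main(args: Array[String]): Unit = {
    println(encoded)  // List(1, 2)
    println(restored) // List(bar 1, bar 2)
  }
}
```

The point of the design is that only the Int representation is ever stored, so the built-in Int encoder is sufficient; the cost is that the schema says nothing about Bar.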

Related questions

How to create an encoder for an Option type constructor, e.g. Option[Int]?

ChoppyTheLumberjack · Answer

You can use UDTRegistration, and then case classes, tuples, etc. all work correctly with your user-defined type!

Say you want to use a custom Enum:

```
trait CustomEnum { def value: String }
case object Foo extends CustomEnum { val value = "F" }
case object Bar extends CustomEnum { val value = "B" }

object CustomEnum {
  def fromString(str: String): CustomEnum = Seq(Foo, Bar).find(_.value == str).get
}
```

Register it like this:

```
import org.apache.spark.sql.types.{DataType, StringType, UDTRegistration, UserDefinedType}
import org.apache.spark.unsafe.types.UTF8String

// First define a UDT class for it:
class CustomEnumUDT extends UserDefinedType[CustomEnum] {
  override def sqlType: DataType = StringType
  // Note that this will be a UTF8String type
  override def serialize(obj: CustomEnum): Any = UTF8String.fromString(obj.value)
  override def deserialize(datum: Any): CustomEnum = CustomEnum.fromString(datum.toString)
  override def userClass: Class[CustomEnum] = classOf[CustomEnum]
}

// Then register the UDT class!
// NOTE: you have to put this file into the org.apache.spark package!
UDTRegistration.register(classOf[CustomEnum].getName, classOf[CustomEnumUDT].getName)
```

Then use it!

```
case class UsingCustomEnum(id: Int, en: CustomEnum)

val seq = Seq(
  UsingCustomEnum(1, Foo),
  UsingCustomEnum(2, Bar),
  UsingCustomEnum(3, Foo)
).toDS()

seq.filter(_.en == Foo).show()
println(seq.collect().mkString(", ")) // mkString prints the elements rather than the array reference
```

Say you want to use a polymorphic record:

```
trait CustomPoly
case class FooPoly(id: Int) extends CustomPoly
case class BarPoly(value: String, secondValue: Long) extends CustomPoly
```

... and use it like this:

```
case class UsingPoly(id: Int, poly: CustomPoly)

val polySeq = Seq(
  UsingPoly(1, new FooPoly(1)),
  UsingPoly(2, new BarPoly("Blah", 123)),
  UsingPoly(3, new FooPoly(1))
).toDS

polySeq.filter(_.poly match {
  case FooPoly(value) => value == 1
  case _ => false
}).show()
```

You can write a custom UDT that encodes everything to bytes (I'm using Java serialization here, but it's probably better to instrument Spark's Kryo context).

First define the UDT class:

```
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.sql.types.{BinaryType, DataType, UserDefinedType}

class CustomPolyUDT extends UserDefinedType[CustomPoly] {
  override def sqlType: DataType = BinaryType

  override def serialize(obj: CustomPoly): Any = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    oos.writeObject(obj)
    oos.close() // flush any buffered data before reading the bytes
    bos.toByteArray
  }

  override def deserialize(datum: Any): CustomPoly = {
    val bis = new ByteArrayInputStream(datum.asInstanceOf[Array[Byte]])
    val ois = new ObjectInputStream(bis)
    val obj = ois.readObject()
    obj.asInstanceOf[CustomPoly]
  }

  override def userClass: Class[CustomPoly] = classOf[CustomPoly]
}
```

Register it:

```
// NOTE: The file you do this in has to be inside of the org.apache.spark package!
UDTRegistration.register(classOf[CustomPoly].getName, classOf[CustomPolyUDT].getName)
```

Then you can use it!

```
// As shown above:
case class UsingPoly(id: Int, poly: CustomPoly)

val polySeq = Seq(
  UsingPoly(1, new FooPoly(1)),
  UsingPoly(2, new BarPoly("Blah", 123)),
  UsingPoly(3, new FooPoly(1))
).toDS

polySeq.filter(_.poly match {
  case FooPoly(value) => value == 1
  case _ => false
}).show()
```
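The byte encoding that CustomPolyUDT relies on can be checked on its own, without Spark or the UDT machinery. The sketch below repeats the CustomPoly hierarchy so that it is self-contained; the helper names toBytes and fromBytes and the object PolyBytes are made up for this illustration. It assumes plain Java serialization, which works here because case classes are Serializable.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

trait CustomPoly
case class FooPoly(id: Int) extends CustomPoly
case class BarPoly(value: String, secondValue: Long) extends CustomPoly

object PolyBytes {
  // Mirrors the serialize side of the UDT: object -> bytes.
  def toBytes(obj: CustomPoly): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bos)
    try oos.writeObject(obj) finally oos.close()
    bos.toByteArray
  }

  // Mirrors the deserialize side of the UDT: bytes -> object.
  def fromBytes(bytes: Array[Byte]): CustomPoly = {
    val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try ois.readObject().asInstanceOf[CustomPoly] finally ois.close()
  }

  def main(args: Array[String]): Unit = {
    val records: Seq[CustomPoly] = Seq(FooPoly(1), BarPoly("Blah", 123L))
    val restored = records.map(r => fromBytes(toBytes(r)))
    println(restored == records) // true
  }
}
```

Because the concrete class travels inside the serialized bytes, both FooPoly and BarPoly survive the round trip even though the column type is just binary.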

Sarvesh Kumar Singh · Answer

Encoders work more or less the same in Spark 2.0, and Kryo is still the recommended serialization choice.

You can look at the following example with spark-shell:

```
scala> import spark.implicits._
import spark.implicits._

scala> import org.apache.spark.sql.Encoders
import org.apache.spark.sql.Encoders

scala> case class NormalPerson(name: String, age: Int) {
     |   def aboutMe = s"I am ${name}. I am ${age} years old."
     | }
defined class NormalPerson

scala> case class ReversePerson(name: Int, age: String) {
     |   def aboutMe = s"I am ${name}. I am ${age} years old."
     | }
defined class ReversePerson

scala> val normalPersons = Seq(
     |   NormalPerson("Superman", 25),
     |   NormalPerson("Spiderman", 17),
     |   NormalPerson("Ironman", 29)
     | )
normalPersons: Seq[NormalPerson] = List(NormalPerson(Superman,25), NormalPerson(Spiderman,17), NormalPerson(Ironman,29))

scala> val ds1 = sc.parallelize(normalPersons).toDS
ds1: org.apache.spark.sql.Dataset[NormalPerson] = [name: string, age: int]

scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]

scala> ds1.show()
+---------+---+
|     name|age|
+---------+---+
| Superman| 25|
|Spiderman| 17|
|  Ironman| 29|
+---------+---+

scala> ds2.show()
+----+---------+
|name|      age|
+----+---------+
|  25| Superman|
|  17|Spiderman|
|  29|  Ironman|
+----+---------+

scala> ds1.foreach(p => println(p.aboutMe))
I am Ironman. I am 29 years old.
I am Superman. I am 25 years old.
I am Spiderman. I am 17 years old.

scala> val ds2 = ds1.map(np => ReversePerson(np.age, np.name))
ds2: org.apache.spark.sql.Dataset[ReversePerson] = [name: int, age: string]

scala> ds2.foreach(p => println(p.aboutMe))
I am 17. I am Spiderman years old.
I am 25. I am Superman years old.
I am 29. I am Ironman years old.
```

Until now there were no Kryo encoders in scope, so our persons were stored column by column rather than as binary values. That changes once we provide implicit encoders that use Kryo serialization.

```
// Provide Encoders

scala> implicit val normalPersonKryoEncoder = Encoders.kryo[NormalPerson]
normalPersonKryoEncoder: org.apache.spark.sql.Encoder[NormalPerson] = class[value[0]: binary]

scala> implicit val reversePersonKryoEncoder = Encoders.kryo[ReversePerson]
reversePersonKryoEncoder: org.apache.spark.sql.Encoder[ReversePerson] = class[value[0]: binary]

// Encoders will be used since they are now present in scope

scala> val ds3 = sc.parallelize(normalPersons).toDS
ds3: org.apache.spark.sql.Dataset[NormalPerson] = [value: binary]

scala> val ds4 = ds3.map(np => ReversePerson(np.age, np.name))
ds4: org.apache.spark.sql.Dataset[ReversePerson] = [value: binary]

// now all our persons show up as binary values

scala> ds3.show()
+--------------------+
|               value|
+--------------------+
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
+--------------------+

scala> ds4.show()
+--------------------+
|               value|
+--------------------+
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
|[01 00 24 6C 69 6...|
+--------------------+

// Our instances still work as expected

scala> ds3.foreach(p => println(p.aboutMe))
I am Ironman. I am 29 years old.
I am Spiderman. I am 17 years old.
I am Superman. I am 25 years old.

scala> ds4.foreach(p => println(p.aboutMe))
I am 25. I am Superman years old.
I am 29. I am Ironman years old.
I am 17. I am Spiderman years old.
```

Akash Mahajan · Answer

For a Java Bean class, this can be useful:

```
import spark.sqlContext.implicits._
import org.apache.spark.sql.Encoders

implicit val encoder = Encoders.bean[MyClass](classOf[MyClass])
```

Now you can simply read the DataFrame as a Dataset of your custom class:

```
dataFrame.as[MyClass]
```

This creates an encoder for the custom class rather than a binary one.
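One way to see the difference is to compare the schemas the two encoders report. The sketch below is an illustration only: the answer never shows MyClass, so the bean defined here (with name and age properties generated through @BeanProperty) is hypothetical. It needs the Spark SQL library on the classpath but does not start a session.

```scala
import org.apache.spark.sql.Encoders
import scala.beans.BeanProperty

// Hypothetical bean: a no-arg constructor plus getters and setters.
class MyClass extends Serializable {
  @BeanProperty var name: String = _
  @BeanProperty var age: Int = _
}

object BeanSchemaSketch {
  def main(args: Array[String]): Unit = {
    val beanEncoder = Encoders.bean(classOf[MyClass])
    val kryoEncoder = Encoders.kryo[MyClass]

    // The bean encoder exposes one column per bean property.
    beanEncoder.schema.printTreeString()

    // The kryo encoder exposes a single opaque binary column.
    kryoEncoder.schema.printTreeString()
  }
}
```

With per-property columns, the data can be filtered and selected by column name, which is not possible when everything sits in one binary value.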

Jimmy Da · Answer

My examples will be in Java, but I don't imagine it being difficult to adapt to Scala.

I have been quite successful converting RDD<Fruit> to Dataset<Fruit> using spark.createDataset and Encoders.bean, as long as Fruit is a simple Java Bean.

Step 1: Create the simple Java Bean.

```
import java.io.Serializable;

public class Fruit implements Serializable {
    private String name = "default-fruit";
    private String color = "default-color";

    // AllArgsConstructor
    public Fruit(String name, String color) {
        this.name = name;
        this.color = color;
    }

    // NoArgsConstructor
    public Fruit() {
        this("default-fruit", "default-color");
    }

    // ...create getters and setters for above fields
    // you figure it out
}
```

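I'd stick to classes with primitive types and String as fields until the DataBricks folks beef up their Encoders. If you have a class with nested objects, create another simple Java Bean with all of its fields flattened, so you can use RDD transformations to map the complex type to the simpler one. Sure, it's a little extra work, but I imagine working with a flat schema will help a lot with performance.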
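The flattening step might look like the following. This is a sketch in Scala rather than Java, using case classes instead of beans to keep it short; Origin, NestedFruit, FlatFruit and their fields are all hypothetical and do not appear in the answer. It runs on a plain Seq, and the same function could be handed to an RDD's map.

```scala
// Hypothetical nested model.
case class Origin(country: String, farm: String)
case class NestedFruit(name: String, color: String, origin: Origin)

// Flat counterpart containing only String fields.
case class FlatFruit(name: String, color: String, originCountry: String, originFarm: String)

object FlattenSketch {
  // Copies every nested field into a top-level field of the flat type.
  def flatten(f: NestedFruit): FlatFruit =
    FlatFruit(f.name, f.color, f.origin.country, f.origin.farm)

  def main(args: Array[String]): Unit = {
    val nested = Seq(NestedFruit("Apple", "red", Origin("NZ", "Hill Farm")))
    // On an RDD[NestedFruit] this would be rdd.map(flatten).
    nested.map(flatten).foreach(println) // FlatFruit(Apple,red,NZ,Hill Farm)
  }
}
```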

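Step 2: Get your Dataset from the RDD

```
import com.google.common.collect.ImmutableList; // Guava
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().getOrCreate();
// Reuse the session's SparkContext instead of creating a second one.
JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

List<Fruit> fruitList = ImmutableList.of(
    new Fruit("Apple", "red"),
    new Fruit("orange", "orange"),
    new Fruit("grape", "purple"));
JavaRDD<Fruit> fruitJavaRDD = jsc.parallelize(fruitList);

RDD<Fruit> fruitRDD = fruitJavaRDD.rdd();
Encoder<Fruit> fruitBean = Encoders.bean(Fruit.class);
Dataset<Fruit> fruitDataset = spark.createDataset(fruitRDD, fruitBean);
```

And voila! Lather, rinse, repeat.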

Taeheon Kwon · Answer

For those who may be in my situation, I'm putting my answer here too.

To be specific,

I was reading 'Set typed data' from SQLContext, so the original data format is a DataFrame.

```
val sample = spark.sqlContext.sql("select 1 as a, collect_set(1) as b limit 1")
sample.show()
```

```
+---+---+
|  a|  b|
+---+---+
|  1|[1]|
+---+---+
```

Then convert it to an RDD using rdd.map() with the mutable.WrappedArray type:

```
import scala.collection.mutable

sample
  .rdd.map(r => (r.getInt(0), r.getAs[mutable.WrappedArray[Int]](1).toSet))
  .collect()
  .foreach(println)
```

Result:

```
(1,Set(1))
```

Matt · Answer

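In addition to the suggestions already given, another option I recently discovered is that you can declare your custom class with the trait org.apache.spark.sql.catalyst.DefinedByConstructorParams.

This works if the class has a constructor that uses types the ExpressionEncoder can understand, i.e. primitive values and standard collections. It can come in handy when you're not able to declare the class as a case class, but don't want to use Kryo to encode it every time it's included in a Dataset.

For example, I wanted to declare a case class that included a Breeze vector. The only encoder able to handle that would normally be Kryo. But when I declared a subclass that extended both Breeze DenseVector and DefinedByConstructorParams, the ExpressionEncoder understood that it could be serialized as an array of Doubles.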

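Here's how I declared it:

```
import org.apache.spark.sql.catalyst.DefinedByConstructorParams
import scala.language.implicitConversions

class SerializableDenseVector(values: Array[Double])
  extends breeze.linalg.DenseVector[Double](values)
  with DefinedByConstructorParams

// Note: the cast only succeeds if bv is already a SerializableDenseVector.
implicit def BreezeVectorToSerializable(bv: breeze.linalg.DenseVector[Double]): SerializableDenseVector =
  bv.asInstanceOf[SerializableDenseVector]
```

Now I can use SerializableDenseVector in a Dataset (directly, or as part of a Product) using a simple ExpressionEncoder and no Kryo. It works just like a Breeze DenseVector but serializes as an Array[Double].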