I am using the Spark Streaming + Kafka integration in Scala (Kafka broker version 0.10.1) with the code below, and it fails with the following exception:
16/11/13 12:55:20 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.NotSerializableException: org.apache.kafka.clients.consumer.ConsumerRecord
Serialization stack:
- object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord(topic = local1, partition = 0, offset = 10000, CreateTime = 1479012919187, checksum = 1713832959, serialized key size = -1, serialized value size = 1, key = null, value = a))
- element of array (index: 0)
- array (class [Lorg.apache.kafka.clients.consumer.ConsumerRecord;, size 11)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
Why does this happen, and how do I fix it?
Code:
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark._
import org.apache.commons.codec.StringDecoder
import org.apache.spark.streaming._

object KafkaConsumer_spark_test {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("KafkaConsumer_spark_test").setMaster("local[4]")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("./checkpoint")
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("local1")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
    stream.map(record => (record.key, record.value))
    stream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Exception:
16/11/13 12:55:20 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.NotSerializableException: org.apache.kafka.clients.consumer.ConsumerRecord
Serialization stack:
- object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord(topic = local1, partition = 0, offset = 10000, CreateTime = 1479012919187, checksum = 1713832959, serialized key size = -1, serialized value size = 1, key = null, value = a))
- element of array (index: 0)
- array (class [Lorg.apache.kafka.clients.consumer.ConsumerRecord;, size 11)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:313)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/11/13 12:55:20 ERROR TaskSetManager: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.kafka.clients.consumer.ConsumerRecord
Serialization stack:
- object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord(topic = local1, partition = 0, offset = 10000, CreateTime = 1479012919187, checksum = 1713832959, serialized key size = -1, serialized value size = 1, key = null, value = a))
- element of array (index: 0)
- array (class [Lorg.apache.kafka.clients.consumer.ConsumerRecord;, size 11); not retrying
16/11/13 12:55:20 ERROR JobScheduler: Error running job streaming job 1479012920000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.kafka.clients.consumer.ConsumerRecord
Serialization stack:
- object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord(topic = local1, partition = 0, offset = 10000, CreateTime = 1479012919187, checksum = 1713832959, serialized key size = -1, serialized value size = 1, key = null, value = a))
- element of array (index: 0)
- array (class [Lorg.apache.kafka.clients.consumer.ConsumerRecord;, size 11)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.streaming.kafka010.KafkaRDD.take(KafkaRDD.scala:122)
at org.apache.spark.streaming.kafka010.KafkaRDD.take(KafkaRDD.scala:50)
at org.apache.spark.streaming.dstream.DStream$$anonfun$print$2$$anonfun$foreachFunc$3$1.apply(DStream.scala:734)
at org.apache.spark.streaming.dstream.DStream$$anonfun$print$2$$anonfun$foreachFunc$3$1.apply(DStream.scala:733)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:245)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:245)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:245)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:244)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a not serializable result: org.apache.kafka.clients.consumer.ConsumerRecord
Serialization stack:
- object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord(topic = local1, partition = 0, offset = 10000, CreateTime = 1479012919187, checksum = 1713832959, serialized key size = -1, serialized value size = 1, key = null, value = a))
- element of array (index: 0)
- array (class [Lorg.apache.kafka.clients.consumer.ConsumerRecord;, size 11)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1871)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1884)
at org.apache.spark.streaming.kafka010.KafkaRDD.take(KafkaRDD.scala:122)
at org.apache.spark.streaming.kafka010.KafkaRDD.take(KafkaRDD.scala:50)
at org.apache.spark.streaming.dstream.DStream$$anonfun$print$2$$anonfun$foreachFunc$3$1.apply(DStream.scala:734)
at org.apache.spark.streaming.dstream.DStream$$anonfun$print$2$$anonfun$foreachFunc$3$1.apply(DStream.scala:733)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:51)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:245)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:245)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:245)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:244)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The ConsumerRecord object is what you receive from the DStream. When you try to print it, you get this error because that object is not serializable. Instead, extract the value from the ConsumerRecord and print that.
Instead of stream.print(), do:
stream.map(record => record.value().toString).print()
This should solve your problem.
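Putting the fix into the question's own code, a minimal sketch of the relevant part of main could look like this (same topic, stream and context as above; the key point is that the mapped stream, not the raw stream of ConsumerRecords, is what gets printed):
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
    // print serializable (String, String) pairs instead of raw ConsumerRecords
    val pairs = stream.map(record => (record.key, record.value))
    pairs.print()
    ssc.start()
    ssc.awaitTermination()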
GOTCHA
For anyone else who sees this exception: a call to checkpoint invokes persist with storageLevel = MEMORY_ONLY_SER, so don't call checkpoint until after you have called map.
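A minimal sketch of that ordering, assuming a DStream-level checkpoint (the 10-second interval is only an illustrative value, not from the original answer): map to plain values first, then checkpoint the mapped stream, so the persist triggered by checkpointing never has to serialize a ConsumerRecord.
    // map first, so only serializable String values reach the checkpoint/persist path
    val values = stream.map(record => record.value)
    values.checkpoint(Seconds(10))  // example interval only
    values.print()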
KafkaUtils.createDirectStream creates the stream as an org.apache.spark.streaming.dstream.DStream, not as an RDD. Spark Streaming creates the RDDs from it on the fly at run time. To get at them, use stream.foreachRDD to obtain each RDD, and then rdd.foreach to reach the individual objects inside it. Those objects are the Kafka ConsumerRecords; call their value() method to read the message from the Kafka topic:
stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // pull out the serializable message payload instead of the ConsumerRecord itself
    val value = record.value()
    println(value)
  }
}
ConsumerRecord does not implement serialization, so the error occurs whenever you perform an operation that requires it, such as persisting, windowing, or printing. To avoid the error, add the following settings:
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
sparkConf.registerKryoClasses(Array(classOf[ConsumerRecord[String, String]]))
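As a sketch of where those settings go in the question's program (same app name and master as above; registering the class is optional but keeps the serialized form compact), the Kryo configuration must be applied to the SparkConf before the StreamingContext is created:
    // configure Kryo before building the StreamingContext
    val conf = new SparkConf()
      .setAppName("KafkaConsumer_spark_test")
      .setMaster("local[4]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.registerKryoClasses(Array(classOf[ConsumerRecord[String, String]]))
    val ssc = new StreamingContext(conf, Seconds(1))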