collect() or toPandas() on a large DataFrame in pyspark/EMR

Question

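I have an EMR cluster consisting of a single "c3.8xlarge" machine. After reading several resources, I understood that because I am using pyspark I have to allow a decent amount of off-heap memory, so I configured the cluster as follows:

One executor: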

spark.executor.memory 6g
spark.executor.cores 10
spark.yarn.executor.memoryOverhead 4096

Driver:

spark.driver.memory 21g

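When I cache() the DataFrame, it takes about 3.6GB of memory.

Now, when I call collect() or toPandas() on the DataFrame, the process crashes.

I know that I am bringing a large amount of data into the driver, but I don't think it is that large, and I cannot figure out the reason for the crash.

When I call collect() or toPandas(), I get the following error: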

Py4JJavaError: An error occurred while calling o181.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 6.0 failed 4 times, most recent failure: Lost task 5.3 in stage 6.0 (TID 110, ip-10-0-47-207.prod.eu-west-1.hs.internal, executor 9): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Container marked as failed: container_1511879540686_0005_01_000016 on Host: ip-10-0-47-207.prod.eu-west-1.hs.internal. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1690)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1678)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1677)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1677)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1905)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1860)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1849)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:671)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2803)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2823)
    at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2800)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

==== Update ====

As @user6910411 suggested, I tried the solution mentioned here and got the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 2.0 failed 4 times, most recent failure: Lost task 7.3 in stage 2.0 (TID 41, ip-10-0-33-57.prod.eu-west-1.hs.internal, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 13.5 GB of 12 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1690)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1678)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1677)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1677)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1905)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1860)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1849)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:671)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

Any hint about what is going on here?

zero323 · Accepted Answer

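TL;DR I believe you are seriously underestimating the memory requirements.

Even assuming the data is fully cached, the storage info shows only a fraction of the peak memory required to bring the data back to the driver.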

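First, Spark SQL uses compressed columnar storage for caching. Depending on the data distribution and the compression algorithm, the in-memory size can be much smaller than the uncompressed Pandas output, let alone a plain List[Row]. The latter also stores column names, which increases memory usage further.
Second, data collection is indirect: the data is held on both the JVM side and the Python side. JVM memory can be released once the data has passed through the socket, but peak memory usage has to account for both.
Third, the plain toPandas implementation collects Rows first and then creates the Pandas DataFrame locally. This increases memory usage further (possibly doubling it). Luckily, this part is already addressed on master (Spark 2.3), which takes a more direct approach using Arrow serialization (SPARK-13534 - Enable Apache Arrow serialization for Spark DataFrame.toPandas).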
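
To put rough numbers on the three points above, here is a back-of-the-envelope sketch. Only the 3.6 GB cached size comes from the question; the function name and the multipliers (`expansion`, `row_overhead`) are invented placeholders, not measured values and not anything Spark reports.

```python
def estimate_peak_gb(cached_gb, expansion=3.0, row_overhead=2.0, naive_to_pandas=True):
    """Rough upper bound on memory needed to bring a cached DataFrame into Python.

    Every factor here is a hypothetical placeholder; measure your own data.
    """
    # Point 1: the cache is compressed and columnar, so decoded rows are larger.
    jvm_gb = cached_gb * expansion
    # Point 1 (continued): Python Row objects also carry column names.
    python_rows_gb = jvm_gb * row_overhead
    # Point 3: a naive toPandas builds the pandas DataFrame next to the collected rows.
    pandas_gb = jvm_gb if naive_to_pandas else 0.0
    # Point 2: at peak, the JVM copy and the Python copies can coexist.
    return jvm_gb + python_rows_gb + pandas_gb


print(f"{estimate_peak_gb(3.6):.1f} GB")
```

With these made-up factors, a 3.6 GB cache already comes out at roughly 43 GB at peak. The real figure depends entirely on the data, but it shows why the cached size is a poor guide to what collect() or toPandas() will need.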

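For a possible solution that does not depend on Apache Arrow, you can check Faster and Lower memory implementation toPandas on the Apache Spark Developer List.

Since the data is actually pretty large, I would consider writing it to Parquet and reading it back directly in Python using PyArrow (Reading and Writing the Apache Parquet Format), skipping all the intermediate stages completely.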
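
A minimal sketch of that Parquet round trip, assuming PyArrow and pandas are installed on the machine doing the read and that `path` is reachable from both Spark and plain Python (on EMR this usually means setting up HDFS or S3 access separately). The helper names are hypothetical.

```python
def write_parquet(df, path):
    """Write a Spark DataFrame to Parquet; executors write, nothing is collected."""
    df.write.mode("overwrite").parquet(path)


def read_parquet_as_pandas(path, columns=None):
    """Read a Parquet file or directory into pandas with PyArrow, outside Spark."""
    # Imported lazily so this module loads even where PyArrow is missing.
    import pyarrow.parquet as pq

    # `columns` lets you load only what you need, which cuts memory further.
    return pq.read_table(path, columns=columns).to_pandas()
```

Usage would look like `write_parquet(df, "/some/path")` inside the Spark job, then `pdf = read_parquet_as_pandas("/some/path")` in an ordinary Python process.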

Dafni Argyro Krystallidou · Answer

As mentioned above, calling toPandas() collects all records of the DataFrame to the driver program, so it should only be done on a small subset of the data. (https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html)
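
One way to apply this is sketched below, assuming `spark` is an active SparkSession and PyArrow is installed where PySpark runs. The helper name and the default row cap are hypothetical. The config key shown is the Spark 2.3/2.4 one; Spark 3.x renamed it to `spark.sql.execution.arrow.pyspark.enabled`.

```python
def to_pandas_subset(spark, df, max_rows=100000,
                     arrow_key="spark.sql.execution.arrow.enabled"):
    """Convert at most `max_rows` rows of a Spark DataFrame to pandas."""
    # Ask Spark to use Arrow for the transfer instead of collecting Rows.
    spark.conf.set(arrow_key, "true")
    # limit() caps how many rows ever reach the driver.
    return df.limit(max_rows).toPandas()
```

For example, `pdf = to_pandas_subset(spark, df, max_rows=50000)`. If you need a representative subset rather than the first rows, `df.sample(fraction=...)` before the conversion is an alternative.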