Why does a join fail with "java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]"?

Question

I'm using Spark 1.5.

I have two DataFrames of the following form:

scala> libriFirstTable50Plus3DF
res1: org.apache.spark.sql.DataFrame = [basket_id: string, family_id: int]

scala> linkPersonItemLessThan500DF
res2: org.apache.spark.sql.DataFrame = [person_id: int, family_id: int]

libriFirstTable50Plus3DF has 766,151 records, while linkPersonItemLessThan500DF has 26,694,353 records. Note that I'm using repartition(number) on linkPersonItemLessThan500DF because I intend to join the two later on. I follow up the above with:

val userTripletRankDF = linkPersonItemLessThan500DF
  .join(libriFirstTable50Plus3DF, Seq("family_id"))
  .take(20)
  .foreach(println(_))

This is the output I get:

16/12/13 15:07:10 INFO scheduler.TaskSetManager: Finished task 172.0 in stage 3.0 (TID 473) in 520 ms on mlhdd01.mondadori.it (199/200)
java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:
    at scala.concurrent.Await$.result(package.scala:107)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoin.doExecute(BroadcastHashJoin.scala:110)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
    at org.apache.spark.sql.execution.TungstenProject.doExecute(basicOperators.scala:86)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
    at org.apache.spark.sql.execution.ConvertToSafe.doExecute(rowFormatConverters.scala:63)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:190)
    at org.apache.spark.sql.execution.Limit.executeCollect(basicOperators.scala:207)
    at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1386)
    at org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1386)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
    at org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1904)
    at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1385)
    at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1315)
    at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1378)
    at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:178)
    at org.apache.spark.sql.DataFrame.show(DataFrame.scala:402)
    at org.apache.spark.sql.DataFrame.show(DataFrame.scala:363)
    at org.apache.spark.sql.DataFrame.show(DataFrame.scala:371)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:72)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:77)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:79)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:81)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:83)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:85)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:87)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:89)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:91)
    at $iwC$$iwC$$iwC.<init>(<console>:93)
    at $iwC$$iwC.<init>(<console>:95)
    at $iwC.<init>(<console>:97)
    at <init>(<console>:99)
    at .<init>(<console>:103)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I don't understand what the problem is. Is it as simple as increasing the waiting time? Is the join too intensive? Do I need more memory? Is the shuffle too intensive? Can anyone help?

T. Gawęda · Accepted Answer

This happens because Spark tries to do a broadcast hash join and one of the DataFrames is very large, so sending it takes a long time.

You can:

Increase the timeout by setting spark.sql.broadcastTimeout to a higher value (in seconds), for example 36000: spark.conf.set("spark.sql.broadcastTimeout", 36000)
persist() both DataFrames, after which Spark will use a shuffle join (reference: here)
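The question is about Spark 1.5, where spark.conf is not available (SparkSession only arrived in Spark 2.0). A minimal sketch of the same setting through SQLContext follows; the app name and the local master are placeholders chosen for illustration, and in a Spark 1.x spark-shell you would skip the setup lines and use the sc and sqlContext that the shell already provides.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// App name and local master are hypothetical, for illustration only.
val sc = new SparkContext(
  new SparkConf().setAppName("broadcast-timeout-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// The value is in seconds; 36000 is the example value from the answer above.
sqlContext.setConf("spark.sql.broadcastTimeout", "36000")

// Read the setting back to confirm it was stored.
println(sqlContext.getConf("spark.sql.broadcastTimeout"))
```

The setting has to be in place before the join is executed, since the timeout is read when the broadcast is awaited.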

PySpark

In PySpark, you can set the configuration when you build the Spark session, in the following manner:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("Your App")
    .config("spark.sql.broadcastTimeout", "36000")
    .getOrCreate()
)

Jacek Laskowski · Answer

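Just to add some code context to the very concise answer from @T. Gawęda.

In your Spark application, Spark SQL chose a broadcast hash join for the join because "libriFirstTable50Plus3DF has 766,151 records", which happened to be less than the so-called broadcast threshold (defaults to 10MB).

You can control the broadcast threshold using the spark.sql.autoBroadcastJoinThreshold configuration property.

spark.sql.autoBroadcastJoinThreshold: Configures the maximum size in bytes for a table that will be broadcast to all worker nodes when performing a join. By setting this value to -1 broadcasting can be disabled. Note that currently statistics are only supported for Hive Metastore tables where the command ANALYZE TABLE <tableName> COMPUTE STATISTICS noscan has been run.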
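As a rough sketch of the property in use, the snippet below turns automatic broadcasting off and prints the physical plan of a join so you can check which join operator was picked. The two tiny DataFrames are made-up stand-ins for the ones in the question, and the app name and local master are placeholders; the SQLContext API is used because the question targets Spark 1.5.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// App name and local master are hypothetical, for illustration only.
val sc = new SparkContext(
  new SparkConf().setAppName("no-auto-broadcast-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// -1 turns automatic broadcast joins off.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")

// Made-up rows with the same column names as in the question.
val persons = Seq((1, 10), (2, 20), (3, 30)).toDF("person_id", "family_id")
val baskets = Seq(("b1", 10), ("b2", 30)).toDF("basket_id", "family_id")

val joined = persons.join(baskets, Seq("family_id"))

// Print the physical plan; look for the join operator that was chosen.
joined.explain()
joined.show()
```

The operator name printed by explain() depends on the Spark version, so compare the plan before and after changing the property rather than expecting a particular name.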

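You can find that particular type of join in the stack trace:

org.apache.spark.sql.execution.joins.BroadcastHashJoin.doExecute(BroadcastHashJoin.scala:110)

The BroadcastHashJoin physical operator in Spark SQL uses a broadcast variable to distribute the smaller dataset to Spark executors (rather than shipping a copy of it with every task).

If you used explain to review the physical query plan, you would notice that the query uses the BroadcastExchangeExec physical operator. This is where you can see the underlying machinery for broadcasting the smaller table (and the timeout).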

override protected[sql] def doExecuteBroadcast[T](): broadcast.Broadcast[T] = {
  ThreadUtils.awaitResult(relationFuture, timeout).asInstanceOf[broadcast.Broadcast[T]]
}

doExecuteBroadcast is part of the SparkPlan contract that every physical operator in Spark SQL follows, and it allows for broadcasting if needed. BroadcastExchangeExec happens to need it.

The timeout parameter is what you are looking for.

private val timeout: Duration = {
  val timeoutValue = sqlContext.conf.broadcastTimeout
  if (timeoutValue < 0) {
    Duration.Inf
  } else {
    timeoutValue.seconds
  }
}

As you can see, you can disable the timeout completely (using a negative value), which means waiting indefinitely for the broadcast variable to be shipped to the executors, or use sqlContext.conf.broadcastTimeout, which is exactly the spark.sql.broadcastTimeout configuration property. The default value is 5 * 60 seconds, which you can see in the stack trace:

java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]
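The exception is raised by scala.concurrent.Await, which is why the top frames of the stack trace are in scala.concurrent rather than in Spark. The following is a minimal sketch in plain Scala, with no Spark involved, of the timeout rule quoted above and of an await that runs out of time. The names toTimeout and neverDone are invented for this illustration, and the never-completed promise merely stands in for a broadcast that does not finish in time.

```scala
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration._
import scala.util.Try

// Same rule as the quoted timeout value: a negative setting means "wait forever".
def toTimeout(seconds: Long): Duration =
  if (seconds < 0) Duration.Inf else seconds.seconds

// A promise that is never completed stands in for a broadcast that never finishes.
val neverDone = Promise[String]().future

// Waiting with a finite timeout (1 second here) fails with a TimeoutException.
val outcome = Try(Await.result(neverDone, toTimeout(1)))
println(outcome)
```

The exact wording of the exception message depends on the Scala version, so treat the printed text as informational. Note that passing a negative value here would block forever, which is the trade-off of disabling the timeout.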

Pedro H · Answer

In my case, it was caused by a broadcast over a large DataFrame:

df.join(broadcast(largeDF))

So, based on the previous answers, I fixed it by removing the broadcast:

df.join(largeDF)
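For context, broadcast here is the hint function from org.apache.spark.sql.functions (available since Spark 1.5). A small sketch, assuming made-up DataFrames named large and small and a placeholder local setup, shows the hint applied to the small side only, next to the same join without any hint:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.broadcast

// App name and local master are hypothetical, for illustration only.
val sc = new SparkContext(
  new SparkConf().setAppName("broadcast-hint-sketch").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Made-up data: "large" plays the big table, "small" the lookup table.
val large = Seq((1, 10), (2, 20), (3, 30), (4, 10)).toDF("person_id", "family_id")
val small = Seq(("b1", 10), ("b2", 30)).toDF("basket_id", "family_id")

// Hint that only the small side should be broadcast.
val hinted = large.join(broadcast(small), Seq("family_id"))

// No hint: Spark decides on its own, based on the configured threshold.
val plain = large.join(small, Seq("family_id"))

// Both joins return the same rows; only the physical plan may differ.
hinted.explain()
plain.explain()
```

If a hint is used at all, it generally belongs on the side that comfortably fits in memory, which is consistent with the fix described in this answer.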

lasclocker · Answer

In addition to increasing spark.sql.broadcastTimeout or calling persist() on both DataFrames,

you can try:

1. Disable broadcasting by setting spark.sql.autoBroadcastJoinThreshold to -1.

2. Increase the Spark driver memory by setting spark.driver.memory to a higher value.