Avoid performance impact of a single partition mode in Spark window functions

Question

My question concerns computing the differences between consecutive rows in a Spark DataFrame.

For example, I have:

>>> df.show()
+-----+----------+
|index|      col1|
+-----+----------+
|  0.0|0.58734024|
|  1.0|0.67304325|
|  2.0|0.85154736|
|  3.0| 0.5449719|
+-----+----------+

If I choose to calculate these using "Window" functions, I can do it like this:

>>> from pyspark.sql import Window
>>> winSpec = Window.partitionBy(df.index >= 0).orderBy(df.index.asc())
>>> import pyspark.sql.functions as f
>>> df.withColumn('diffs_col1', f.lag(df.col1, -1).over(winSpec) - df.col1).show()
+-----+----------+-----------+
|index|      col1| diffs_col1|
+-----+----------+-----------+
|  0.0|0.58734024|0.085703015|
|  1.0|0.67304325| 0.17850411|
|  2.0|0.85154736|-0.30657548|
|  3.0| 0.5449719|       null|
+-----+----------+-----------+

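Question: Here I explicitly put the whole DataFrame into a single partition. What is the performance impact of this? If there is one, why is that so, and how can I avoid it? I ask because when I do not specify a partition, I get the following warning: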

16/12/24 13:52:27 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

user6910411 · Accepted Answer

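In practice, the performance impact is almost the same as if you omitted the partitionBy clause altogether. All records are shuffled to a single partition, sorted locally, and iterated sequentially, one by one.

The only difference is the number of partitions created in total. Let's illustrate that with an example that uses a simple dataset with 10 partitions and 1000 records: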

df = spark.range(0, 1000, 1, 10).toDF("index").withColumn("col1", f.randn(42))

If you define a frame without a partitionBy clause:

w_unpart = Window.orderBy(f.col("index").asc())

and use it with lag:

df_lag_unpart = df.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1")
)

there will be only one partition in total:

df_lag_unpart.rdd.glom().map(len).collect()

[1000]

Compare that with a frame definition that uses a dummy partition key (simplified a bit compared to your code):

w_part = Window.partitionBy(f.lit(0)).orderBy(f.col("index").asc())

This uses a number of partitions equal to spark.sql.shuffle.partitions:

spark.conf.set("spark.sql.shuffle.partitions", 11)

df_lag_part = df.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_part) - f.col("col1")
)

df_lag_part.rdd.glom().count()

but only one of them is non-empty:

df_lag_part.rdd.glom().filter(lambda x: x).count()
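The single non-empty partition follows from how a shuffle assigns rows: each row goes to the partition selected by a hash of its key, modulo the partition count, so a constant key sends every row to the same place. The sketch below models this in plain Python. The `assign_partitions` helper is hypothetical, and Python's built-in `hash` is only a stand-in for Spark's own hash function, so the index of the chosen partition may differ from what Spark picks.

```python
from collections import Counter


def assign_partitions(keys, num_partitions):
    """Count rows per partition under hash partitioning (illustrative model)."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(i, 0) for i in range(num_partitions)]


# Constant key, as in Window.partitionBy(f.lit(0)): 1000 rows, 11 partitions.
sizes = assign_partitions([0] * 1000, 11)
non_empty = [s for s in sizes if s > 0]
print(len(sizes), non_empty)
```

Eleven partitions exist, but all 1000 rows land in one of them, so the window is still evaluated by a single task.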

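Unfortunately, there is no universal solution to this problem in PySpark. It is an inherent consequence of how window functions are implemented, combined with the distributed processing model.

Since the index column is sequential, you could generate an artificial partitioning key with a fixed number of records per block: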

rec_per_block = df.count() // int(spark.conf.get("spark.sql.shuffle.partitions"))

df_with_block = df.withColumn(
    "block", (f.col("index") / rec_per_block).cast("int")
)

and use it to define the frame specification:

w_with_block = Window.partitionBy("block").orderBy("index")

df_lag_with_block = df_with_block.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_with_block) - f.col("col1")
)

This uses the expected number of partitions:

df_lag_with_block.rdd.glom().count()

with a roughly uniform data distribution (hash collisions cannot be avoided):

df_lag_with_block.rdd.glom().map(len).collect()

[0, 180, 0, 90, 90, 0, 90, 90, 100, 90, 270]

but there are gaps at the block boundaries:

df_lag_with_block.where(f.col("diffs_col1").isNull()).count()
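The number of gaps equals the number of blocks, because lag has no previous row to read at the first record of each block. Here is a plain-Python sketch of the arithmetic, assuming the 1000-record example with spark.sql.shuffle.partitions set to 11 (the variable names are illustrative only):

```python
from collections import Counter

num_records = 1000
shuffle_partitions = 11  # value set earlier via spark.conf.set

rec_per_block = num_records // shuffle_partitions
# Same key as (index / rec_per_block).cast("int") for non-negative indices.
block_sizes = Counter(i // rec_per_block for i in range(num_records))

num_blocks = len(block_sizes)  # one null lag value per block
last_block_size = block_sizes[num_blocks - 1]
print(rec_per_block, num_blocks, last_block_size)
```

Integer division leaves a remainder, so there are 12 blocks (eleven of 90 records and one of 10) rather than 11. That is consistent with the partition sizes shown above, where every non-empty partition holds a whole number of blocks.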

Since the boundaries are easy to compute:

from itertools import chain

boundary_idxs = sorted(chain.from_iterable(
    # Here we depend on sequential identifiers
    # This could be generalized to any monotonically increasing
    # id by taking min and max per block
    (idx - 1, idx) for idx in
    df_lag_with_block.groupBy("block").min("index")
        .drop("block").rdd.flatMap(lambda x: x)
        .collect()))[2:]  # The first boundary doesn't carry useful information.

you can always select the affected rows:

missing = df_with_block.where(f.col("index").isin(boundary_idxs))

fill these separately:

# We use window without partitions here. Since number of records
# will be small this won't be a performance issue
# but will generate "Moving all data to a single partition" warning
missing_with_lag = missing.withColumn(
    "diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1")
).select("index", f.col("diffs_col1").alias("diffs_fill"))

and join:

combined = (df_lag_with_block
    .join(missing_with_lag, ["index"], "leftouter")
    .withColumn("diffs_col1", f.coalesce("diffs_col1", "diffs_fill")))

to get the desired result:

mismatched = combined.join(df_lag_unpart, ["index"], "outer").where(
    combined["diffs_col1"] != df_lag_unpart["diffs_col1"]
)
assert mismatched.count() == 0
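The whole procedure can also be checked without a cluster. What follows is a plain-Python model of the approach, not Spark code: `lag_diffs` is a hypothetical helper that mimics `f.lag("col1", 1).over(...) - f.col("col1")` on a list that is already sorted by index, and the random values are only stand-ins for col1.

```python
import random


def lag_diffs(rows):
    """For (index, value) rows sorted by index, map index -> previous value - value.

    The first row has no predecessor, so its difference is None (Spark's null).
    """
    out, prev = {}, None
    for idx, val in rows:
        out[idx] = None if prev is None else prev - val
        prev = val
    return out


rng = random.Random(42)
rows = [(i, rng.gauss(0, 1)) for i in range(1000)]
rec_per_block = len(rows) // 11

# Reference result: one global window (everything in a single partition).
expected = lag_diffs(rows)

# Blocked result: an independent window per block.
blocks = {}
for idx, val in rows:
    blocks.setdefault(idx // rec_per_block, []).append((idx, val))
blocked = {}
for block_rows in blocks.values():
    blocked.update(lag_diffs(block_rows))

# Boundary rows: the first index of each block and its predecessor,
# skipping the very first block, which has no predecessor.
starts = sorted(b[0][0] for b in blocks.values())[1:]
boundary = sorted({i for s in starts for i in (s - 1, s)})

# Recompute the lag on the small boundary set and fill the gaps (coalesce).
fill = lag_diffs([rows[i] for i in boundary])
combined = {i: d if d is not None else fill.get(i) for i, d in blocked.items()}

print(len(boundary), combined == expected)
```

Only the boundary rows pass through the unpartitioned window, so the expensive single-partition step touches two rows per block boundary instead of the full dataset.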