Spark DataFrame: does groupBy after orderBy maintain that order?

Question

I have a Spark 2.0 DataFrame, example, with the following structure:

    id, hour, count
    id1, 0, 12
    id1, 1, 55
    ..
    id1, 23, 44
    id2, 0, 12
    id2, 1, 89
    ..
    id2, 23, 34
    etc.

It contains 24 entries for each id (one for each hour of the day) and is ordered by id, then hour, using the orderBy function.

I have created an Aggregator, groupConcat:

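    import org.apache.spark.sql.{Encoder, Encoders, Row}
    import org.apache.spark.sql.expressions.Aggregator

    def groupConcat(separator: String, columnToConcat: Int) =
      new Aggregator[Row, String, String] with Serializable {
        // Start every buffer as the empty string.
        override def zero: String = ""

        // Append the separator and the value of the chosen column.
        override def reduce(b: String, a: Row) = b + separator + a.get(columnToConcat)

        // Join two partial buffers.
        override def merge(b1: String, b2: String) = b1 + b2

        // Drop the leading separator (assumes a one-character separator).
        override def finish(b: String) = b.substring(1)

        override def bufferEncoder: Encoder[String] = Encoders.STRING

        override def outputEncoder: Encoder[String] = Encoders.STRING
      }.toColumn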

It helps me concatenate columns into strings to obtain this final DataFrame:

    id, hourly_count
    id1, 12:55:..:44
    id2, 12:89:..:34
    etc.

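My question is: if I run example.orderBy($"id",$"hour").groupBy("id").agg(groupConcat(":",2) as "hourly_count"), is it guaranteed that the hourly counts will be ordered correctly within their respective buckets?

I have read that this is not necessarily the case for RDDs (see "Spark sort by key and then group by to get ordered iterable?"), but perhaps it is different for DataFrames?

If not, how can I work around it?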

Adair · Accepted Answer

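As others have pointed out, groupBy after orderBy does not maintain order. What you want is a window function: partition by id and order by hour. You can run collect_list over that window and then take the max (largest) of the resulting lists, because the lists are built cumulatively (the first hour has only itself in its list, the second hour has two elements, and so on).

Complete example code: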

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window
    import spark.implicits._

    val data = Seq(("id1", 0, 12), ("id1", 1, 55), ("id1", 23, 44),
                   ("id2", 0, 12), ("id2", 1, 89), ("id2", 23, 34))
      .toDF("id", "hour", "count")

    // Join the collected values with ":".
    val mergeList = udf { (strings: Seq[String]) => strings.mkString(":") }

    data
      // Cumulative list of counts per id, in hour order.
      .withColumn("collected", collect_list($"count")
        .over(Window.partitionBy("id").orderBy("hour")))
      .groupBy("id")
      // The largest cumulative list is the complete one.
      .agg(max($"collected").as("collected"))
      .withColumn("hourly_count", mergeList($"collected"))
      .select("id", "hourly_count")
      .show()

This keeps us within the DataFrame world. I also simplified the UDF code the OP was using.

Output:

    +---+------------+
    | id|hourly_count|
    +---+------------+
    |id1|    12:55:44|
    |id2|    12:89:34|
    +---+------------+
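Why taking the max of the cumulative lists returns the complete list can be sketched in plain Python, without Spark. This is only an illustration under one assumption: that arrays are compared element by element, with a prefix ordering before its extension, which is how Python compares lists. The helper names `cumulative_lists` and `hourly_counts` are made up for this sketch.

```python
from itertools import groupby

# Toy rows (id, hour, count), mirroring the sample data in the answer.
rows = [
    ("id1", 0, 12), ("id1", 1, 55), ("id1", 23, 44),
    ("id2", 0, 12), ("id2", 1, 89), ("id2", 23, 34),
]


def cumulative_lists(rows):
    """Emulate collect_list over a window partitioned by id, ordered by hour."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for key, group in groupby(ordered, key=lambda r: r[0]):
        running = []
        for _, _, count in group:
            running.append(count)
            # One cumulative list per input row, as the window produces.
            out.append((key, list(running)))
    return out


def hourly_counts(rows):
    """Keep the largest cumulative list per id and join it with ':'."""
    best = {}
    for key, collected in cumulative_lists(rows):
        # A list that is a prefix of another compares as smaller.
        if key not in best or collected > best[key]:
            best[key] = collected
    return {key: ":".join(map(str, vals)) for key, vals in best.items()}


result = hourly_counts(rows)
print(result)
```

Every shorter list for an id is a prefix of the full one, so the full list always wins the comparison, whatever order the rows arrive in.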

Shyam · Answer

If you want to work around this in Java (Scala and Python should be similar):

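    // Collect (hour, count) structs and sort them explicitly.
    // Note: the second argument `false` sorts descending; pass `true` for ascending hours.
    example.orderBy("hour")
        .groupBy("id")
        .agg(functions.sort_array(
                functions.collect_list(
                    functions.struct(example.col("hour"), example.col("count"))),
                false)
            .as("hourly_count"));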
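The idea behind sorting (hour, count) structs can be sketched in plain Python with tuples. This assumes structs are compared field by field, first field first, which is how Python compares tuples; `counts_by_hour` is a hypothetical helper for illustration only.

```python
import random

# (hour, count) pairs for one id, deliberately shuffled to mimic
# an arbitrary arrival order after a shuffle.
pairs = [(0, 12), (1, 55), (23, 44)]
random.shuffle(pairs)


def counts_by_hour(pairs, ascending=True):
    """Sort the pairs by hour (the first field), then keep only the counts."""
    ordered = sorted(pairs, reverse=not ascending)
    return [count for _, count in ordered]


print(counts_by_hour(pairs))
print(counts_by_hour(pairs, ascending=False))
```

Because the sort key travels with each value, the result no longer depends on the order in which the rows were collected.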

Kat · Answer

There are cases where the order is not maintained: sometimes it is, mostly it is not.

My DataFrame, running on Spark 1.6:

    df_group_sort = data.orderBy(times).groupBy(group_key).agg(
        F.sort_array(F.collect_list(times)),
        F.collect_list(times)
    )

To check the ordering, I compare the return values of

F.sort_array(F.collect_list(times))

and

F.collect_list(times)

For example (left: sort_array(collect_list()); right: collect_list()):

    2016-12-19 08:20:27.172000    2016-12-19 09:57:03.764000
    2016-12-19 08:20:30.163000    2016-12-19 09:57:06.763000
    2016-12-19 08:20:33.158000    2016-12-19 09:57:09.763000
    2016-12-19 08:20:36.158000    2016-12-19 09:57:12.763000
    2016-12-19 08:22:27.090000    2016-12-19 09:57:18.762000
    2016-12-19 08:22:30.089000    2016-12-19 09:57:33.766000
    2016-12-19 08:22:57.088000    2016-12-19 09:57:39.811000
    2016-12-19 08:23:03.085000    2016-12-19 09:57:45.770000
    2016-12-19 08:23:06.086000    2016-12-19 09:57:57.809000
    2016-12-19 08:23:12.085000    2016-12-19 09:59:56.333000
    2016-12-19 08:23:15.086000    2016-12-19 10:00:11.329000
    2016-12-19 08:23:18.087000    2016-12-19 10:00:14.331000
    2016-12-19 08:23:21.085000    2016-12-19 10:00:17.329000
    2016-12-19 08:23:24.085000    2016-12-19 10:00:20.326000

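The left column is always sorted, while the right column consists only of sorted blocks. Across different executions of take(), the order of the blocks in the right column differs.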
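The "sorted blocks" pattern can be reproduced with a toy model in plain Python. This is not how Spark's shuffle is implemented; it only assumes that each partition holds a contiguous sorted chunk and that the chunks are concatenated in an arbitrary order. `collect_after_shuffle` and `sorted_runs` are hypothetical names.

```python
import random


def collect_after_shuffle(values, num_partitions, seed):
    """Split sorted values into contiguous blocks, then concatenate the
    blocks in a random order, as if partitions arrived out of order."""
    rng = random.Random(seed)
    ordered = sorted(values)
    size = -(-len(ordered) // num_partitions)  # ceiling division
    blocks = [ordered[i:i + size] for i in range(0, len(ordered), size)]
    rng.shuffle(blocks)
    return [v for block in blocks for v in block]


def sorted_runs(seq):
    """Count the maximal ascending runs in seq."""
    runs = 1 if seq else 0
    for prev, nxt in zip(seq, seq[1:]):
        if nxt < prev:
            runs += 1
    return runs


collected = collect_after_shuffle(range(12), num_partitions=3, seed=1)
print(collected, "runs:", sorted_runs(collected))
```

Changing the seed changes the block order but never the order inside a block, which matches the behaviour described for the right-hand column.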

Ashish · Answer

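Depending on the number of partitions and the distribution of the data, the order may or may not be the same. You can solve this using the RDD itself.

For example:

I saved the sample data below in a file and loaded it into HDFS.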

    1,type1,300
    2,type1,100
    3,type2,400
    4,type2,500
    5,type1,400
    6,type3,560
    7,type2,200
    8,type3,800

Then I executed the following command:

    sc.textFile("/spark_test/test.txt")
      .map(x => x.split(","))
      .filter(x => x.length == 3)
      .groupBy(_(1))
      // Sort numerically by price, then keep only the ids.
      .mapValues(x => x.toList.sortBy(_(2).toInt).map(_(0)).mkString("~"))
      .collect()

Output:

Array[(String, String)] = Array((type3,6~8), (type1,2~1~5), (type2,7~3~4))

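That is, we grouped the data by type, then sorted each group by price, and concatenated the ids with "~" as the separator. The command above can be broken down as follows: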

    // Keep only well-formed rows with exactly three fields.
    val validData = sc.textFile("/spark_test/test.txt")
      .map(x => x.split(","))
      .filter(x => x.length == 3)

    // Group rows by type (second field).
    val groupedData = validData.groupBy(_(1))

    val sortedJoinedData = groupedData.mapValues(x => {
      val list = x.toList
      val sortedList = list.sortBy(_(2).toInt) // sort numerically by price
      val idOnlyList = sortedList.map(_(0))    // keep only the ids
      idOnlyList.mkString("~")
    })

    sortedJoinedData.collect()

We can then pick out a particular group with the command

sortedJoinedData.filter(_._1=="type1").collect()

Output:

Array[(String, String)] = Array((type1,2~1~5))
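The same group, sort within the group, then concatenate pattern can be checked locally in plain Python on the sample rows. This is a sketch of the logic only, not of the RDD API; `ids_by_type` is a hypothetical helper, and prices are parsed as integers so the sort is numeric.

```python
from collections import defaultdict

# The sample file contents from above, one record per line.
lines = [
    "1,type1,300", "2,type1,100", "3,type2,400", "4,type2,500",
    "5,type1,400", "6,type3,560", "7,type2,200", "8,type3,800",
]


def ids_by_type(lines):
    """Group rows by type, sort each group by price, join the ids with '~'."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split(",")
        if len(fields) != 3:
            continue  # skip malformed rows, like the filter step
        item_id, item_type, price = fields
        groups[item_type].append((int(price), item_id))
    # Sorting happens inside each group, so input order does not matter.
    return {
        item_type: "~".join(item_id for _, item_id in sorted(entries))
        for item_type, entries in groups.items()
    }


result = ids_by_type(lines)
print(result)
```

Because the sort is applied after grouping, the result is the same for any ordering of the input lines.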

ChoppyTheLumberjack · Answer

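No, the sort order is not necessarily maintained within groupByKey, but this is notoriously difficult to reproduce in memory on a single node. As mentioned earlier, the most common way it happens is when data has to be repartitioned for the groupByKey to take place. I managed to reproduce it by manually doing a repartition after the sort, and then passing the result into groupByKey.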

    import scala.util.Random
    import spark.implicits._

    case class Numbered(num: Int, group: Int, otherData: Int)

    // Configure Spark with "spark.sql.shuffle.partitions" = 2 or some other small number.

    val v =
      (1 to 100000)
        // Make far more groups than partitions. The extra integer is only there to
        // perturb the sort hash computation (so it is not monotonic; not sure it is needed).
        .map(Numbered(_, Random.nextInt(300), Random.nextInt(1000000))).toDS()
        // Be sure they are stored in a small number of partitions.
        .repartition(2)
        .sort($"num")
        // Repartition again with far more partitions than there are groups, so that
        // when things need to be merged they can come out of order.
        .repartition(200)
        .groupByKey(_.group)
        .mapGroups {
          case (g, nums) =>
            nums // adding .toSeq.sortBy(_.num) here is all you need to fix the problem
              .map(_.num)
              .mkString("~")
        }
        .collect()

    // Walk through the concatenated strings. If any number is smaller than
    // the number before it, you know that something is out of order.
    v.zipWithIndex.map { case (r, i) =>
      r.split("~").map(_.toInt).foldLeft(0) { case (prev, next) =>
        if (next < prev) {
          println(s"*** Next: ${next} less than ${prev} for dataset ${i + 1} ***")
        }
        next
      }
    }