Spark SQL replacement for the MySQL GROUP_CONCAT aggregate function

Question

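I have a table with two string-typed columns (username, friend). For each username, I want to collect all of its friends on one row, concatenated into a single string ('username1', 'friends1, friends2, friends3'). I know MySQL does this with GROUP_CONCAT. Is there a way to do this with Spark SQL?

Thanks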

zero323 · Accepted Answer

Before you proceed: this operation is yet another groupByKey. It has multiple legitimate applications, but it is relatively expensive, so use it only when you need it.

It is not exactly a concise or efficient solution, but you can use UserDefinedAggregateFunction, introduced in Spark 1.5.0:

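    import scala.collection.mutable.ArrayBuffer

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
    import org.apache.spark.unsafe.types.UTF8String

    object GroupConcat extends UserDefinedAggregateFunction {
      def inputSchema = new StructType().add("x", StringType)
      def bufferSchema = new StructType().add("buff", ArrayType(StringType))
      def dataType = StringType
      def deterministic = true

      // Start every group with an empty buffer
      def initialize(buffer: MutableAggregationBuffer) = {
        buffer.update(0, ArrayBuffer.empty[String])
      }

      // Append one non-null input value to the buffer
      def update(buffer: MutableAggregationBuffer, input: Row) = {
        if (!input.isNullAt(0))
          buffer.update(0, buffer.getSeq[String](0) :+ input.getString(0))
      }

      // Combine two partial buffers
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
        buffer1.update(0, buffer1.getSeq[String](0) ++ buffer2.getSeq[String](0))
      }

      // Produce the final comma-separated string
      def evaluate(buffer: Row) = UTF8String.fromString(
        buffer.getSeq[String](0).mkString(","))
    }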

Example usage:

    val df = sc.parallelize(Seq(
      ("username1", "friend1"),
      ("username1", "friend2"),
      ("username2", "friend1"),
      ("username2", "friend3")
    )).toDF("username", "friend")

    // The alias gives the column the name shown in the output below
    df.groupBy($"username").agg(GroupConcat($"friend").alias("friends")).show

    // +---------+---------------+
    // | username|        friends|
    // +---------+---------------+
    // |username1|friend1,friend2|
    // |username2|friend1,friend3|
    // +---------+---------------+
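If the four callbacks of the UDAF above are hard to picture, here is a minimal plain-Python sketch of the same lifecycle. It does not use Spark at all. The function names just mirror the UDAF methods, and the way the rows are split across two partitions is made up for illustration.

```python
# Plain-Python model of the UDAF lifecycle (no Spark involved).
# Function names mirror the UDAF methods; the partition split is hypothetical.

def initialize():
    # Every partition starts with an empty buffer.
    return []

def update(buffer, value):
    # Append one input value, skipping nulls like the UDAF does.
    return buffer if value is None else buffer + [value]

def merge(buffer1, buffer2):
    # Combine the partial buffers of two partitions.
    return buffer1 + buffer2

def evaluate(buffer):
    # Turn the final buffer into a comma-separated string.
    return ",".join(buffer)

# Hypothetical split of username1's rows across two partitions.
partitions = [["friend1", None], ["friend2"]]

# Phase 1: each partition builds its own partial buffer.
partials = []
for rows in partitions:
    buf = initialize()
    for value in rows:
        buf = update(buf, value)
    partials.append(buf)

# Phase 2: the partial buffers are merged, then evaluated once.
merged = initialize()
for partial in partials:
    merged = merge(merged, partial)

result = evaluate(merged)
print(result)  # friend1,friend2
```

The point of the split into update and merge is that Spark can aggregate each partition locally before combining the partial results.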

You can also create a Python wrapper, as shown in "Spark: How to map Python with Scala or Java User Defined Functions?"

In practice, it can be faster to extract the RDD, groupByKey, mkString, and rebuild the DataFrame.

You can get a similar effect by combining the collect_list function (Spark >= 1.6.0) with concat_ws:

    import org.apache.spark.sql.functions.{collect_list, concat_ws}

    df.groupBy($"username")
      .agg(concat_ws(",", collect_list($"friend")).alias("friends"))
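To see what the collect_list plus concat_ws combination computes, including how nulls are handled, here is a rough plain-Python sketch. The two helper functions only imitate the Spark functions of the same name (both skip nulls); they are not the Spark API, and the sample row with a missing friend is invented.

```python
from collections import defaultdict

def collect_list(rows):
    """Group values by key, dropping None (collect_list ignores nulls)."""
    groups = defaultdict(list)
    for key, value in rows:
        bucket = groups[key]  # the key exists even if all its values are None
        if value is not None:
            bucket.append(value)
    return dict(groups)

def concat_ws(sep, values):
    """Join values with sep, skipping None like concat_ws does."""
    return sep.join(v for v in values if v is not None)

# Sample rows; the None entry is a made-up missing friend.
rows = [
    ("username1", "friend1"),
    ("username1", "friend2"),
    ("username2", "friend1"),
    ("username2", None),
    ("username2", "friend3"),
]

friends = {user: concat_ws(",", vals) for user, vals in collect_list(rows).items()}
print(friends)
# {'username1': 'friend1,friend2', 'username2': 'friend1,friend3'}
```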

iec2011007 · Answer

You can try the collect_list function:

    sqlContext.sql("select A, collect_list(B), collect_list(C) from Table1 group by A")

Or you can register a UDF, something like:

    sqlContext.udf.register("myzip", (a: Long, b: Long) => (a + "," + b))

and then use this function in a query:

    sqlContext.sql("select A, collect_list(myzip(B, C)) from tbl group by A")

rikturr · Answer

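Here is a function you can use in PySpark:

    import pyspark.sql.functions as F
    from pyspark.sql.types import StringType

    def group_concat(col, distinct=False, sep=','):
        # collect_set drops duplicates, collect_list keeps them
        if distinct:
            collect = F.collect_set(col.cast(StringType()))
        else:
            collect = F.collect_list(col.cast(StringType()))
        return F.concat_ws(sep, collect)

    # group_concat is the local helper above, so pass it a Column
    table.groupby('username').agg(group_concat(F.col('friends')).alias('friends'))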

In SQL:

    select username, concat_ws(',', collect_list(friends)) as friends from table group by username
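The distinct and sep options are easier to compare side by side on a tiny dataset. The following is only a plain-Python imitation of the helper's behaviour, with invented sample rows. One deliberate difference: this sketch keeps the first-seen order when deduplicating, whereas Spark's collect_set makes no promise about element order.

```python
def group_concat(rows, distinct=False, sep=","):
    """Plain-Python imitation: group (key, value) pairs and join the values."""
    groups = {}
    for key, value in rows:
        bucket = groups.setdefault(key, [])
        if value is None:
            continue  # nulls are skipped
        text = str(value)  # mirrors the cast to StringType
        if distinct and text in bucket:
            continue  # drop duplicates, keeping first-seen order (assumption)
        bucket.append(text)
    return {key: sep.join(values) for key, values in groups.items()}

# Invented sample rows with a duplicate and non-string values.
rows = [("u1", 1), ("u1", 2), ("u1", 1), ("u2", 3)]

with_duplicates = group_concat(rows)
without_duplicates = group_concat(rows, distinct=True)
piped = group_concat(rows, sep="|")

print(with_duplicates)     # {'u1': '1,2,1', 'u2': '3'}
print(without_duplicates)  # {'u1': '1,2', 'u2': '3'}
print(piped)               # {'u1': '1|2|1', 'u2': '3'}
```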

ksindi · Answer

Here is one way to do it in PySpark < 1.6, which unfortunately doesn't support user-defined aggregate functions:

    byUsername = df.rdd.reduceByKey(lambda x, y: x + ", " + y)

And if you want to turn it back into a DataFrame:

    sqlContext.createDataFrame(byUsername, ["username", "friends"])
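For anyone unsure what reduceByKey does with that lambda, here is a small plain-Python sketch of the idea. The reduce_by_key helper is hypothetical and only imitates the RDD method on an in-memory list. It also keeps values in input order, which Spark does not guarantee.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

def reduce_by_key(func, pairs):
    """Imitate RDD.reduceByKey on an in-memory list of (key, value) pairs."""
    # sorted() is stable, so values keep their input order within each key.
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, reduce(func, (value for _, value in group)))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

pairs = [
    ("username1", "friend1"),
    ("username1", "friend2"),
    ("username2", "friend1"),
    ("username2", "friend3"),
]

by_username = reduce_by_key(lambda x, y: x + ", " + y, pairs)
print(by_username)
# [('username1', 'friend1, friend2'), ('username2', 'friend1, friend3')]
```

Note that a key with a single value comes back unchanged, because the function is only called when there are at least two values to combine.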

From 1.6 onward, you can use collect_list and then join the resulting list:

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    join_ = F.udf(lambda x: ", ".join(x), StringType())

    df.groupBy("username").agg(join_(F.collect_list("friend")).alias("friends"))

Christos Hadjinikolis · Answer

Language: Scala. Spark version: 1.5.2.

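I had the same problem and also tried to solve it with udfs but, unfortunately, that led to more problems later in the code because of type inconsistencies. I was able to work around this by first converting the DF to an RDD, then grouping by key and manipulating the data the way I needed, and then converting the RDD back to a DF, as follows: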

    val df = sc
      .parallelize(Seq(
        ("username1", "friend1"),
        ("username1", "friend2"),
        ("username2", "friend1"),
        ("username2", "friend3")))
      .toDF("username", "friend")

    // +---------+-------+
    // | username| friend|
    // +---------+-------+
    // |username1|friend1|
    // |username1|friend2|
    // |username2|friend1|
    // |username2|friend3|
    // +---------+-------+

    val dfGRPD = df.map(row => (row(0), row(1)))
      .groupByKey()
      .map { case (username: String, groupOfFriends: Iterable[String]) =>
        (username, groupOfFriends.mkString(",")) }
      .toDF("username", "groupOfFriends")

    // +---------+---------------+
    // | username| groupOfFriends|
    // +---------+---------------+
    // |username1|friend2,friend1|
    // |username2|friend3,friend1|
    // +---------+---------------+
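Notice that the friends in the output above are not in input order (friend2 comes before friend1). The order inside each group depends on how the rows arrive. As a rough plain-Python illustration, with a hypothetical group_and_join helper and made-up arrival orders, sorting the values before joining is one way to get a stable string:

```python
from collections import defaultdict

def group_and_join(pairs, sep=",", sort_values=False):
    """Group (username, friend) pairs and join each group's friends."""
    groups = defaultdict(list)
    for username, friend in pairs:
        groups[username].append(friend)
    return {
        username: sep.join(sorted(friends) if sort_values else friends)
        for username, friends in groups.items()
    }

# Two made-up arrival orders for the same rows.
arrival_a = [("username1", "friend1"), ("username1", "friend2")]
arrival_b = list(reversed(arrival_a))

as_is_a = group_and_join(arrival_a)
as_is_b = group_and_join(arrival_b)
stable_a = group_and_join(arrival_a, sort_values=True)
stable_b = group_and_join(arrival_b, sort_values=True)

print(as_is_a)   # {'username1': 'friend1,friend2'}
print(as_is_b)   # {'username1': 'friend2,friend1'}
print(stable_b)  # {'username1': 'friend1,friend2'}
```

In the Scala code, the equivalent tweak would be sorting groupOfFriends before calling mkString, if a deterministic order matters for you.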

Akshay Patel · Answer

Below is Python-based code that achieves group_concat functionality.

Input data:

Cust_No, Cust_Cars
1, Toyota
2, BMW
1, Audi
2, Hyundai

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType
    from pyspark.sql.functions import udf
    import pyspark.sql.functions as F

    spark = SparkSession.builder.master('yarn').getOrCreate()

    # Udf to join all list elements with "|"
    def combine_cars(car_list, sep='|'):
        collect = sep.join(car_list)
        return collect

    test_udf = udf(combine_cars, StringType())

    # car_list_per_customer is the DataFrame holding the input data above
    car_list_per_customer.groupBy("Cust_No") \
        .agg(F.collect_list("Cust_Cars").alias("car_list")) \
        .select("Cust_No", test_udf("car_list").alias("Final_List")) \
        .show(20, False)

Output data:

Cust_No, Final_List
1, Toyota|Audi
2, BMW|Hyundai
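As a quick sanity check that needs no cluster, the same grouping and joining can be traced in plain Python on the sample data. This is only a sketch of the logic: the dictionary stands in for groupBy plus collect_list, and it assumes the rows are processed in the order listed.

```python
def combine_cars(car_list, sep="|"):
    # Same joining logic as the udf, run as an ordinary function.
    return sep.join(car_list)

# The sample input rows as (Cust_No, Cust_Cars) pairs.
rows = [(1, "Toyota"), (2, "BMW"), (1, "Audi"), (2, "Hyundai")]

# Stand-in for groupBy("Cust_No") + collect_list("Cust_Cars").
car_lists = {}
for cust_no, car in rows:
    car_lists.setdefault(cust_no, []).append(car)

final_list = {cust_no: combine_cars(cars) for cust_no, cars in car_lists.items()}
print(final_list)  # {1: 'Toyota|Audi', 2: 'BMW|Hyundai'}
```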