How to calculate the median of a double-typed column with Spark sqlContext

I have provided a sample table below. For each group in the "Source" column, I want to get the median of the "value" column. The "Source" column is a string and the "value" column is a double.

scala> sqlContext.sql("SELECT * from tTab order by source").show

+---------------+-----+                                                         
|         Source|value|
+---------------+-----+
|131.183.222.110|  1.0|
| 131.183.222.85|  1.0|
| 131.183.222.85|  0.0|
| 131.183.222.85|  0.5|
| 131.183.222.85|  1.0|
| 131.183.222.85|  1.0|
|   43.230.146.7|  0.0|
|   43.230.146.7|  1.0|
|   43.230.146.7|  1.0|
|   43.230.146.8|  1.0|
|   43.230.146.8|  1.0| 
+---------------+-----+

scala> tTab.printSchema

root
 |-- Source: string (nullable = true)
 |-- value: double (nullable = true)

Expected answer:

+---------------+-----+
|         Source|value|
+---------------+-----+
|131.183.222.110|  1.0|
| 131.183.222.85|  1.0|
|   43.230.146.7|  1.0|
|   43.230.146.8|  1.0|
+---------------+-----+

The query below works when the "value" column is an Int, but because the data type of "value" is double here, it raises the error shown.

 sqlContext.sql("SELECT source , percentile(value,0.5) OVER (PARTITION BY source) AS Median from tTab ").show

Error:

org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (double, double). Possible choices: _FUNC_(bigint, array<double>)  _FUNC_(bigint, double)
    at org.apache.hadoop.hive.ql.exec.FunctionRegistry.getMethodInternal(FunctionRegistry.java:1164)
    at org.apache.hadoop.hive.ql.exec.DefaultUDAFEvaluatorResolver.getEvaluatorClass(DefaultUDAFEvaluatorResolver.java:83)
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge.getEvaluator(GenericUDAFBridge.java:56)
    at org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver.getEvaluator(AbstractGenericUDAFResolver.java:47)
    at org.apache.spark.sql.hive.HiveWindowFunction.evaluator$lzycompute(hiveUDFs.scala:351)
    at org.apache.spark.sql.hive.HiveWindowFunction.evaluator(hiveUDFs.scala:349)
    at org.apache.spark.sql.hive.HiveWindowFunction.returnInspector$lzycompute(hiveUDFs.scala:357)
    at org.apache.spark.sql.hive.HiveWindowFunction.returnInspector(hiveUDFs.scala:356)
    at org.apache.spark.sql.hive.HiveWindowFunction.dataType(hiveUDFs.scala:362)
    at org.apache.spark.sql.catalyst.expressions.WindowExpression.dataType(windowExpressions.scala:313)
    at org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:140)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$35$$anonfun$apply$15.applyOrElse(Analyzer.scala:856)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$35$$anonfun$apply$15.applyOrElse(Analyzer.scala:852)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$35.apply(Analyzer.scala:852)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$35.apply(Analyzer.scala:863)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.org$apache$spark$sql$catalyst$analysis$Analyzer$ExtractWindowExpressions$$addWindow(Analyzer.scala:849)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$16.applyOrElse(Analyzer.scala:957)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$$anonfun$apply$16.applyOrElse(Analyzer.scala:913)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.apply(Analyzer.scala:913)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions$.apply(Analyzer.scala:745)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:83)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:80)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:80)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:72)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:72)
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:916)
    at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:916)
    at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:914)
    at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:132)
    at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
    at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:725)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:20)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:25)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:29)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:31)
    at $iwC$$iwC$$iwC.<init>(<console>:33)
    at $iwC$$iwC.<init>(<console>:35)
    at $iwC.<init>(<console>:37)
    at <init>(<console>:39)
    at .<init>(<console>:43)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
    at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
    at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
    at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
    at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
    at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
    at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
    at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
    at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
    at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
    at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
    at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
    at org.apache.spark.repl.Main$.main(Main.scala:31)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Thank you very much!

6
Ahmad

For non-integer values you should use the percentile_approx UDF:

import org.apache.spark.mllib.random.RandomRDDs

val df = RandomRDDs.normalRDD(sc, 1000, 10, 1).map(Tuple1(_)).toDF("x")
df.registerTempTable("df")
sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df").show

// +--------------------+
// |                 _c0|
// +--------------------+
// |0.035379710486199915|
// +--------------------+

On a side note, you should use GROUP BY, not PARTITION BY. The latter is intended for window functions and has a different effect than what you expect:

SELECT source, percentile_approx(value, 0.5) FROM df GROUP BY source
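
Applied to the question's table, a minimal sketch (assuming tTab is registered as a temporary table on a Hive-enabled SQLContext, since percentile_approx is a Hive UDAF in this Spark version) would be:

    sqlContext.sql(
      "SELECT source, percentile_approx(value, 0.5) AS median FROM tTab GROUP BY source ORDER BY source"
    ).show

    // This should reproduce the per-source medians from the expected answer:
    // +---------------+------+
    // |         source|median|
    // +---------------+------+
    // |131.183.222.110|   1.0|
    // | 131.183.222.85|   1.0|
    // |   43.230.146.7|   1.0|
    // |   43.230.146.8|   1.0|
    // +---------------+------+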

See also: How to find the median using Spark

20
zero323

Here is how to do this using Spark's Scala DataFrame functions. This is how Imputer imputes the median in Spark >= 2.2 (https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala):

    df.select(colName)
      .stat
      .approxQuantile(colName, Array(0.5), 0.001) // median
      .head
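
Note that approxQuantile computes quantiles over an entire column, not per group. A minimal sketch of both the whole-column call and a per-group variant, assuming the question's tTab DataFrame and that percentile_approx is available as a SQL function in your Spark version:

    // Overall median of "value" across the whole table.
    val overallMedian: Double =
      tTab.stat.approxQuantile("value", Array(0.5), 0.001).head

    // Per-Source medians, going through a SQL expression in the DataFrame API.
    import org.apache.spark.sql.functions.expr

    tTab.groupBy("Source")
      .agg(expr("percentile_approx(value, 0.5)").as("median"))
      .show()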
1
saurzcode

Have you tried the DataFrame.describe() method?

https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrame.html#describe(java.lang.String...)

I am not sure whether it is exactly what you are looking for, but it may get you close.
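
As a quick sketch against the question's table (assuming the tTab DataFrame as above), describe() reports count, mean, stddev, min, and max for numeric columns, but not the median itself:

    // Summary statistics for the "value" column (no median is included).
    tTab.describe("value").show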

0
Chris Fregly