Passing a list as a parameter to a PySpark UDF

Question

I need to pass a list into a UDF. The list determines the score/category of a distance. For now, I am hard-coding all distances to the 4th score.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

a = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])

def cate(label, feature_list):
    if feature_list == 0:
        return label[4]

label_list = ["Great", "Good", "OK", "Please Move", "Dead"]
udf_score = udf(cate, StringType())
# This call fails: a plain Python list is passed where a column is expected
a.withColumn("category", udf_score(label_list, a["distances"])).show(10)

When I try something like this, I get the following error:

Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)

Prem · Accepted Answer

Hope this helps!

from pyspark.sql.functions import udf, col

# sample data
a = sqlContext.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])
label_list = ["Great", "Good", "OK", "Please Move", "Dead"]

def cate(label, feature_list):
    if feature_list == 0:
        return label[4]
    else:
        # without an 'else' branch, 'null' would be returned for these rows
        return 'I am not sure!'

def udf_score(label_list):
    # bind the list in a closure; the UDF itself only receives the column value
    return udf(lambda l: cate(label_list, l))

a.withColumn("category", udf_score(label_list)(col("distances"))).show()

The output is:

+------+---------+--------------+
|Letter|distances|      category|
+------+---------+--------------+
|     A|       20|I am not sure!|
|     B|       30|I am not sure!|
|     D|       80|I am not sure!|
+------+---------+--------------+
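The closure approach above still returns a fixed label. As a minimal sketch of how the list could actually drive the category, the plain-Python factory below maps a distance to a label by binary search. The name `make_categorizer` and the `cut_points` thresholds are hypothetical; they are not from the question and would need to be replaced with real boundaries.

```python
from bisect import bisect_right

def make_categorizer(labels, cut_points):
    """Return a one-argument function that maps a distance to a label.

    cut_points must be sorted and contain len(labels) - 1 values.
    """
    if len(cut_points) != len(labels) - 1:
        raise ValueError("need exactly len(labels) - 1 cut points")

    def categorize(distance):
        if distance is None:  # null column values reach a Python UDF as None
            return None
        # bisect_right returns how many cut points are <= distance,
        # which is the index of the matching label
        return labels[bisect_right(cut_points, distance)]

    return categorize

label_list = ["Great", "Good", "OK", "Please Move", "Dead"]
cut_points = [10, 25, 50, 75]  # hypothetical thresholds, not from the question

categorize = make_categorizer(label_list, cut_points)
print([categorize(d) for d in (20, 30, 80)])  # ['Good', 'OK', 'Dead']
```

Because `categorize` takes a single argument, it could presumably be wrapped as `udf(categorize, StringType())` and applied to `col("distances")` in the same way as the lambda in this answer.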

ags29 · Answer

Curry the function so that the only argument in the DataFrame call is the name of the column the function should run on:

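from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# label_list is captured by the lambda; only the column value is passed at call time
udf_score = udf(lambda x: cate(label_list, x), StringType())
a.withColumn("category", udf_score("distances")).show(10)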
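The binding done by the lambda above can be illustrated without Spark. The sketch below uses `functools.partial` from the standard library to fix the first argument of `cate`. The name `cate_with_labels` is made up for illustration, and `cate` is copied from the question with an added fallback branch.

```python
from functools import partial

def cate(label, feature_list):
    # Same signature as in the question: `label` is the list of labels,
    # `feature_list` is a single distance value.
    if feature_list == 0:
        return label[4]
    return "I am not sure!"

label_list = ["Great", "Good", "OK", "Please Move", "Dead"]

# Fix the first positional argument; the result takes only the distance.
cate_with_labels = partial(cate, label_list)

print(cate_with_labels(0))   # Dead
print(cate_with_labels(20))  # I am not sure!
```

Whether `udf` accepts a `partial` object directly may depend on the Spark version, so wrapping the call in a lambda, as this answer does, is the more conservative choice.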

Nav · Answer

I think passing the list as the default value of a parameter would help here:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# sample data
a = sqlContext.createDataFrame([("A", 20), ("B", 30), ("D", 80), ("E", 0)], ["Letter", "distances"])
label_list = ["Great", "Good", "OK", "Please Move", "Dead"]

# pass the list as the default value of a parameter
def cate(feature_list, label=label_list):
    if feature_list == 0:
        return label[4]
    else:
        # without an 'else' branch, 'null' would be returned for these rows
        return 'I am not sure!'

udfcate = udf(cate, StringType())
a.withColumn("category", udfcate("distances")).show()

Output:

+------+---------+--------------+
|Letter|distances|      category|
+------+---------+--------------+
|     A|       20|I am not sure!|
|     B|       30|I am not sure!|
|     D|       80|I am not sure!|
|     E|        0|          Dead|
+------+---------+--------------+
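One thing to keep in mind with the default-value approach: Python evaluates a default once, when the `def` statement runs. The plain-Python sketch below (no Spark needed) shows that rebinding `label_list` afterwards does not change what `cate` returns. The replacement list is invented for the demonstration.

```python
label_list = ["Great", "Good", "OK", "Please Move", "Dead"]

def cate(feature_list, label=label_list):
    # `label` defaults to the list object that existed when `def` ran
    if feature_list == 0:
        return label[4]
    return "I am not sure!"

# Rebinding the name later does not affect the default already captured.
label_list = ["A", "B", "C", "D", "E"]  # made-up replacement list

print(cate(0))              # Dead
print(cate(0, label_list))  # E
```

So if the labels need to change between runs, the function (and the UDF built from it) would have to be redefined, or the list passed through a closure as in the other answers.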