ValueError: Cannot convert column into bool

Question

I am trying to build a new column on a DataFrame as follows:

l = [(2, 1), (1, 1)]
df = spark.createDataFrame(l)

def calc_dif(x, y):
    if (x > y) and (x == 1):
        return x - y

dfNew = df.withColumn("calc", calc_dif(df["_1"], df["_2"]))
dfNew.show()

But I get:

Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-2807412651452069487.py", line 346, in <module>
Exception: Traceback (most recent call last):
  File "/tmp/zeppelin_pyspark-2807412651452069487.py", line 334, in <module>
  File "<stdin>", line 38, in <module>
  File "<stdin>", line 36, in calc_dif
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/column.py", line 426, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Why does this happen, and how can I fix it?

Alper t. Turker · Accepted Answer

Either use udf:

from pyspark.sql.functions import udf

@udf("integer")
def calc_dif(x, y):
    if (x > y) and (x == 1):
        return x - y

or when (recommended):

from pyspark.sql.functions import when

def calc_dif(x, y):
    # Return the Column expression; rows that fail the condition become null
    return when((x > y) & (x == 1), x - y)

The first one computes on Python objects, row by row; the second one computes on Spark Columns.

mkaran · Answer

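It is complaining because you give your calc_dif function the whole Column objects, not the actual data of the respective rows. Python's `and` then tries to convert a Column to a bool, which raises the error. You need to use a udf to wrap your calc_dif function: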

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

l = [(2, 1), (1, 1)]
df = spark.createDataFrame(l)

def calc_dif(x, y):
    # using the udf, calc_dif is called for every row in the dataframe
    # x and y are the values of the two columns
    if (x > y) and (x == 1):
        return x - y

udf_calc = udf(calc_dif, IntegerType())
dfNew = df.withColumn("calc", udf_calc("_1", "_2"))
dfNew.show()

# neither row satisfies (x > y) and (x == 1), so calc_dif returns None
+---+---+----+
| _1| _2|calc|
+---+---+----+
|  2|  1|null|
|  1|  1|null|
+---+---+----+
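To illustrate why `and` fails where `&` works, here is a minimal pure-Python sketch. The `Expr` class and the `_text` helper are hypothetical stand-ins, not PySpark's actual Column implementation; they only mimic the behaviour visible in the traceback, where the Column refuses to be converted to a bool (`__nonzero__` is the Python 2 name for `__bool__`).

```python
def _text(value):
    """Render an operand: expressions by their text, literals by repr."""
    return value.text if isinstance(value, Expr) else repr(value)


class Expr:
    """Hypothetical stand-in for a column expression (NOT PySpark's Column)."""

    def __init__(self, text):
        self.text = text

    # Comparisons build a new expression instead of returning True/False.
    def __gt__(self, other):
        return Expr(f"({self.text} > {_text(other)})")

    def __eq__(self, other):
        return Expr(f"({self.text} = {_text(other)})")

    # `&` is an overloadable operator, so it can combine two expressions.
    def __and__(self, other):
        return Expr(f"({self.text} AND {_text(other)})")

    # `and`, `or`, `not` and `if` call bool() on the operand; an unevaluated
    # expression has no single truth value, so this raises.
    def __bool__(self):
        raise ValueError("Cannot convert expression into bool")


x, y = Expr("x"), Expr("y")

combined = (x > y) & (x == 1)
print(combined.text)  # ((x > y) AND (x = 1))

try:
    (x > y) and (x == 1)
except ValueError as err:
    print("`and` failed:", err)
```

Because `&` binds more tightly than the comparison operators in Python, the parentheses around each comparison are required, in this sketch and in real DataFrame expressions alike.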

Anne · Answer

I got the same error when I tried to pass an RDD where a pandas object was needed. I could simply solve it with ".toPandas()".