How to explode columns?

Question

Given:

val df = Seq((1, Vector(2, 3, 4)), (1, Vector(2, 3, 4))).toDF("Col1", "Col2")

I have this DataFrame in Apache Spark:

+------+---------+
| Col1 | Col2    |
+------+---------+
|  1   |[2, 3, 4]|
|  1   |[2, 3, 4]|
+------+---------+

How do I convert it into this:

+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
|  1   |  2   |  3   |  4   |
|  1   |  2   |  3   |  4   |
+------+------+------+------+

sgvd · Accepted Answer

A solution that doesn't convert to and from an RDD:

df.select($"Col1", $"Col2"(0) as "Col2", $"Col2"(1) as "Col3", $"Col2"(2) as "Col4")

Or, arguably nicer:

val nElements = 3
df.select(($"Col1" +: Range(0, nElements).map(idx => $"Col2"(idx) as "Col" + (idx + 2))): _*)

The size of a Spark array column is not fixed. You could, for instance, have:

+----+------------+
|Col1|        Col2|
+----+------------+
|   1|   [2, 3, 4]|
|   1|[2, 3, 4, 5]|
+----+------------+

So there is no way to get the number of columns from the schema alone and create them automatically. If you know the size is always the same, you can set nElements like this:

val nElements = df.select("Col2").first.getList(0).size
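To see the pieces above working together, here is a minimal self-contained sketch. It assumes a local SparkSession and that every array has the same length as the one in the first row. The object name ExplodeFixedSize and the helper explodeColumns are made up for illustration and are not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ExplodeFixedSize {
  // Hypothetical helper: puts the first `nElements` entries of `arrayCol`
  // into columns named Col2, Col3, ... and keeps `keyCol` in front.
  def explodeColumns(df: DataFrame, keyCol: String, arrayCol: String, nElements: Int): DataFrame = {
    val elementCols = (0 until nElements).map(idx => col(arrayCol)(idx).as(s"Col${idx + 2}"))
    df.select(col(keyCol) +: elementCols: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("explode-columns").getOrCreate()
    import spark.implicits._

    val df = Seq((1, Seq(2, 3, 4)), (1, Seq(2, 3, 4))).toDF("Col1", "Col2")

    // Assumption: all arrays are as long as the array in the first row.
    val nElements = df.select("Col2").first.getSeq[Int](0).size

    explodeColumns(df, "Col1", "Col2", nElements).show()
    spark.stop()
  }
}
```

Everything stays in the DataFrame API, so Spark plans it as a plain projection with no round trip through an RDD.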

Shane Halloran · Answer

Just giving the PySpark version of sgvd's answer. If the array column is in Col2, this select statement moves the first nElements of each array in Col2 into their own columns:

from pyspark.sql import functions as F
df.select([F.col('Col2').getItem(i) for i in range(nElements)])

Yuan Zhao · Answer

Just to add to sgvd's solution:

If the size is not always the same, you can set nElements to the length of the longest array like this:

import org.apache.spark.sql.functions.{max, size}

val nElements = df.select(size('Col2).as("Col2_count"))
  .select(max("Col2_count"))
  .first.getInt(0)
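As a rough sketch of how the maximum size combines with the select from the accepted answer: the names ExplodeVariableSize, maxArraySize and explodeColumns below are hypothetical. The example assumes a non-empty DataFrame and explicitly turns ANSI mode off, in which case an index past the end of a shorter array yields null. With ANSI mode enabled, an out-of-range index may raise an error instead, so treat that part as an assumption about your configuration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max, size}

object ExplodeVariableSize {
  // Hypothetical helper: length of the longest array in `arrayCol`.
  // Costs one extra pass over the data and assumes at least one non-null array.
  def maxArraySize(df: DataFrame, arrayCol: String): Int =
    df.select(max(size(col(arrayCol)))).first.getInt(0)

  // Hypothetical helper: one output column per index up to the longest array.
  def explodeColumns(df: DataFrame, keyCol: String, arrayCol: String): DataFrame = {
    val n = maxArraySize(df, arrayCol)
    val elementCols = (0 until n).map(idx => col(arrayCol)(idx).as(s"Col${idx + 2}"))
    df.select(col(keyCol) +: elementCols: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explode-variable")
      .config("spark.sql.ansi.enabled", "false") // out-of-range index gives null
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, Seq(2, 3, 4)), (1, Seq(2, 3, 4, 5))).toDF("Col1", "Col2")
    explodeColumns(df, "Col1", "Col2").show()
    spark.stop()
  }
}
```

Note that computing the maximum triggers a Spark job of its own before the select runs, so on large data it may be worth caching the input or passing in a known upper bound.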

Carlos Vilchez · Answer

You can use a map:

import org.apache.spark.sql.Row
import scala.collection.mutable

df.map { case Row(col1: Int, col2: mutable.WrappedArray[Int]) =>
  (col1, col2(0), col2(1), col2(2))
}.toDF("Col1", "Col2", "Col3", "Col4").show()
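A variant of the map approach that reads the fields through Row accessors instead of pattern matching might look like the sketch below. It avoids depending on the concrete collection class Spark uses for array values. The names ExplodeWithMap and explodeThree are hypothetical, and the code assumes Col1 is an Int at position 0 and Col2 is an array of Int with at least three elements at position 1.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ExplodeWithMap {
  // Hypothetical helper: hardcodes three array elements, like the map above.
  def explodeThree(df: DataFrame): DataFrame = {
    val spark = df.sparkSession
    import spark.implicits._ // encoder for the (Int, Int, Int, Int) tuple

    df.map { row =>
      val arr = row.getSeq[Int](1) // Col2, assumed to be an array of Int
      (row.getInt(0), arr(0), arr(1), arr(2))
    }.toDF("Col1", "Col2", "Col3", "Col4")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("explode-map").getOrCreate()
    import spark.implicits._

    val df = Seq((1, Seq(2, 3, 4)), (1, Seq(2, 3, 4))).toDF("Col1", "Col2")
    explodeThree(df).show()
    spark.stop()
  }
}
```

Compared with the select-based answers, a map deserializes every row into JVM objects and fails on arrays shorter than three elements, so it is probably best kept for cases where the per-row logic is hard to express with column expressions.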