How to split a Vector into columns using PySpark

Question

Context: I have a DataFrame with two columns, Word and vector, where the type of the "vector" column is VectorUDT.

An example:

Word   | vector
assert | [435,323,324,212...]

And I want to get this:

Word   | v1  | v2   | v3  | v4  | v5 | v6 ......
assert | 435 | 5435 | 698 | 356 | ....

Question:

How can I split a column containing vectors into several columns, one per dimension, using PySpark?

Thanks in advance

zero323 · Accepted Answer

One possible approach is to convert to and from an RDD:

    from pyspark.ml.linalg import Vectors

    # sc is an existing SparkContext
    df = sc.parallelize([
        ("assert", Vectors.dense([1, 2, 3])),
        ("require", Vectors.sparse(3, {1: 2}))
    ]).toDF(["Word", "vector"])

    def extract(row):
        # Prepend the word to the vector's values as one flat tuple
        return (row.Word, ) + tuple(row.vector.toArray().tolist())

    df.rdd.map(extract).toDF(["Word"])  # Vector values will be named _2, _3, ...

    ## +-------+---+---+---+
    ## |   Word| _2| _3| _4|
    ## +-------+---+---+---+
    ## | assert|1.0|2.0|3.0|
    ## |require|0.0|2.0|0.0|
    ## +-------+---+---+---+
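The "require" row in the output above comes from a sparse vector that stores only index 1, yet all three columns are filled because toArray() expands it to a dense array first. As a rough illustration of that expansion and of the tuple that extract builds, here is a pure-Python sketch. It is not PySpark's implementation, and the names densify and extract_plain are hypothetical helpers introduced only for this example.

```python
def densify(size, nonzeros):
    """Expand a sparse {index: value} mapping into a dense list of floats."""
    dense = [0.0] * size
    for index, value in nonzeros.items():
        dense[index] = float(value)
    return dense


def extract_plain(word, values):
    """Mimic extract(): one flat tuple holding the word and every vector value."""
    return (word,) + tuple(values)


# Same shape as Vectors.sparse(3, {1: 2}) in the answer's example
row = extract_plain("require", densify(3, {1: 2}))
print(row)  # ('require', 0.0, 2.0, 0.0)
```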

An alternative solution is to create a UDF:

    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import ArrayType, DoubleType

    def to_array(col):
        def to_array_(v):
            return v.toArray().tolist()
        return udf(to_array_, ArrayType(DoubleType()))(col)

    (df
        .withColumn("xs", to_array(col("vector")))
        .select(["Word"] + [col("xs")[i] for i in range(3)]))

    ## +-------+-----+-----+-----+
    ## |   Word|xs[0]|xs[1]|xs[2]|
    ## +-------+-----+-----+-----+
    ## | assert|  1.0|  2.0|  3.0|
    ## |require|  0.0|  2.0|  0.0|
    ## +-------+-----+-----+-----+
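If Spark 3.0 or later is available, the built-in pyspark.ml.functions.vector_to_array can stand in for the Python UDF, and the resulting columns can be aliased to the v1, v2, ... names asked for in the question. The following is a minimal sketch under that version assumption; split_vector and vector_column_names are hypothetical helper names, not part of PySpark, and the vector length still has to be known in advance.

```python
def vector_column_names(prefix, size):
    """Build 1-based column names such as v1, v2, ..., v<size>."""
    return [f"{prefix}{i + 1}" for i in range(size)]


def split_vector(df, vector_col, size, prefix="v"):
    """Return df with vector_col replaced by `size` double columns.

    Assumes Spark >= 3.0; the imports live inside the function so the
    naming helper above can be used without Spark installed.
    """
    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    arr = vector_to_array(col(vector_col))
    names = vector_column_names(prefix, size)
    # Keep every non-vector column unchanged
    others = [col(c) for c in df.columns if c != vector_col]
    return df.select(*others, *[arr[i].alias(n) for i, n in enumerate(names)])


print(vector_column_names("v", 3))  # ['v1', 'v2', 'v3']
```

With the example DataFrame from this answer, split_vector(df, "vector", 3) would be expected to yield the columns Word, v1, v2 and v3.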

For the Scala equivalent, see Spark Scala: How to convert Dataframe[vector] to DataFrame[f1:Double, ..., fn: Double)].

Shuai Liu · Answer

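    from pyspark.sql.types import DoubleType, StructType

    def splitVecotr(df, new_features=['f1','f2']):
        # Copy the fields so the schema of the input DataFrame is not modified
        schema = StructType(list(df.schema.fields))
        cols = df.columns
        # new_features should be the same length as the vector column
        for col in new_features:
            schema = schema.add(col, DoubleType(), True)
        # Assumes the vector column is named "features" and spark is an active SparkSession
        return spark.createDataFrame(
            df.rdd.map(lambda row: [row[i] for i in cols] + row.features.toArray().tolist()),
            schema)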

This function converts the feature vector column into separate columns.