How do I convert an RDD with a SparseVector column to a DataFrame with a Vector column?

Question

I have an RDD of (String, SparseVector) tuples and I want to create a DataFrame from it. The goal is a DataFrame with the schema (label: string, features: vector), which is what most of the ml algorithm libraries require. I know this is possible because the HashingTF ml library outputs a vector when given a features column of a DataFrame.

    from pyspark.ml.feature import HashingTF
    from pyspark.sql.types import (ArrayType, DoubleType, StringType,
                                   StructField, StructType)

    # assuming temp_rdd is an RDD of (double, array(strings))
    # and COMBINATIONS is the desired number of features
    temp_df = sqlContext.createDataFrame(temp_rdd, StructType([
        StructField("label", DoubleType(), False),
        StructField("tokens", ArrayType(StringType()), False)
    ]))

    hashingTF = HashingTF(numFeatures=COMBINATIONS, inputCol="tokens", outputCol="features")

    ndf = hashingTF.transform(temp_df)
    ndf.printSchema()

    # outputs
    # root
    # |-- label: double (nullable = false)
    # |-- tokens: array (nullable = false)
    # |    |-- element: string (containsNull = true)
    # |-- features: vector (nullable = true)

So my question is: given an RDD of (String, SparseVector), can I somehow convert it to a DataFrame of (String, vector)? I tried the usual sqlContext.createDataFrame, but no DataType fits my needs.

    df = sqlContext.createDataFrame(rdd, StructType([
        StructField("label", StringType(), True),
        StructField("features", ?Type(), True)
    ]))

zero323 · Accepted Answer

You have to use VectorUDT here:

    from pyspark.sql.types import DoubleType, StructField, StructType
    # In Spark 1.x
    # from pyspark.mllib.linalg import SparseVector, VectorUDT
    from pyspark.ml.linalg import SparseVector, VectorUDT

    temp_rdd = sc.parallelize([
        (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
        (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])

    schema = StructType([
        StructField("label", DoubleType(), True),
        StructField("features", VectorUDT(), True)
    ])

    temp_rdd.toDF(schema).printSchema()

    ## root
    ##  |-- label: double (nullable = true)
    ##  |-- features: vector (nullable = true)

For completeness, the Scala equivalent:

    import org.apache.spark.sql.Row
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.types.{DoubleType, StructType}
    // In Spark 1.x
    // import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.ml.linalg.SQLDataTypes.VectorType

    val schema = new StructType()
      .add("label", DoubleType)
      // In Spark 1.x
      // .add("features", new VectorUDT())
      .add("features", VectorType)

    val temp_rdd: RDD[Row] = sc.parallelize(Seq(
      Row(0.0, Vectors.sparse(4, Seq((1, 1.0), (3, 5.5)))),
      Row(1.0, Vectors.sparse(4, Seq((0, -1.0), (2, 0.5))))
    ))

    spark.createDataFrame(temp_rdd, schema).printSchema

    // root
    //  |-- label: double (nullable = true)
    //  |-- features: vector (nullable = true)

meyerson · Answer

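While @zero323's answer https://stackoverflow.com/a/32745924/1333621 makes sense, and I wish it had worked for me, the RDD underlying the DataFrame created with sqlContext.createDataFrame(temp_rdd, schema) still contained SparseVector types. I had to do the following to convert them to DenseVector types. If someone knows a shorter or better way, I'd like to hear it.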

    from pyspark.sql import Row
    from pyspark.sql.types import DoubleType, StructField, StructType
    # In Spark 1.x, import these from pyspark.mllib.linalg instead
    from pyspark.ml.linalg import DenseVector, SparseVector, VectorUDT

    temp_rdd = sc.parallelize([
        (0.0, SparseVector(4, {1: 1.0, 3: 5.5})),
        (1.0, SparseVector(4, {0: -1.0, 2: 0.5}))])

    schema = StructType([
        StructField("label", DoubleType(), True),
        StructField("features", VectorUDT(), True)
    ])

    temp_rdd.toDF(schema).printSchema()
    df_w_ftr = temp_rdd.toDF(schema)

    print('original convertion method: ', df_w_ftr.take(5))
    print('\n')

    # Expand each SparseVector into a DenseVector before building the DataFrame
    temp_rdd_dense = temp_rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))
    print(type(temp_rdd_dense), type(temp_rdd))
    print('using map and toArray:', temp_rdd_dense.take(5))
    temp_rdd_dense.toDF().show()

Output:

    root
     |-- label: double (nullable = true)
     |-- features: vector (nullable = true)

    original convertion method:  [Row(label=0.0, features=SparseVector(4, {1: 1.0, 3: 5.5})), Row(label=1.0, features=SparseVector(4, {0: -1.0, 2: 0.5}))]

    <class 'pyspark.rdd.PipelinedRDD'> <class 'pyspark.rdd.RDD'>
    using map and toArray: [Row(features=DenseVector([0.0, 1.0, 0.0, 5.5]), label=0.0), Row(features=DenseVector([-1.0, 0.0, 0.5, 0.0]), label=1.0)]

    +------------------+-----+
    |          features|label|
    +------------------+-----+
    | [0.0,1.0,0.0,5.5]|  0.0|
    |[-1.0,0.0,0.5,0.0]|  1.0|
    +------------------+-----+
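The conversion in this answer hinges on toArray(), which expands the sparse (size, {index: value}) form into a full-length array. As a rough illustration of what that expansion does, here is a plain-Python sketch that does not use Spark at all. The helper name sparse_to_dense is hypothetical and is not part of pyspark; it only mimics the idea.

```python
def sparse_to_dense(size, entries):
    """Expand a sparse {index: value} mapping into a full-length list.

    Hypothetical helper for illustration only; in pyspark the
    corresponding step is SparseVector.toArray().
    """
    dense = [0.0] * size  # every position not listed in entries stays 0.0
    for index, value in entries.items():
        if not 0 <= index < size:
            raise IndexError(f"index {index} out of range for size {size}")
        dense[index] = float(value)
    return dense


# The two vectors used in the answers in this thread.
first = sparse_to_dense(4, {1: 1.0, 3: 5.5})
second = sparse_to_dense(4, {0: -1.0, 2: 0.5})
print(first)   # [0.0, 1.0, 0.0, 5.5]
print(second)  # [-1.0, 0.0, 0.5, 0.0]
```

Both forms hold the same values, which fits with printSchema() reporting the features column as vector whether the rows contain SparseVector or DenseVector objects.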

cipri.l · Answer

Here is an example in Scala for Spark 2.1:

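    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.DataFrame

    // assumes a SparkSession named sparkSession is in scope
    def featuresRDD2DataFrame(features: RDD[Vector]): DataFrame = {
      import sparkSession.implicits._
      // pair each vector with a dummy label so that toDF is available
      val rdd: RDD[(Double, Vector)] = features.map(x => (0.0, x))
      val df = rdd.toDF("label", "features").select("features")
      df
    }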

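The compiler did not recognize toDF() when called directly on the features RDD, which is why the code above wraps each vector in a tuple first.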