Spark, Scala, DataFrame: creating feature vectors

Question

I have a DataFrame that looks like this:

    userID,category,frequency
    1,cat1,1
    1,cat2,3
    1,cat9,5
    2,cat4,6
    2,cat9,2
    2,cat10,1
    3,cat1,5
    3,cat7,16
    3,cat8,2

There are 10 distinct categories. I would like to create a feature vector for each userID, filling the missing categories with zeros.

So the output would look like this:

    userID,feature
    1,[1,3,0,0,0,0,0,0,5,0]
    2,[0,0,0,6,0,0,0,0,2,1]
    3,[5,0,0,0,0,0,16,2,0,0]

This is only an illustrative example. In reality I have about 200,000 unique userIDs and 300 unique categories.

What is the most efficient way to create the feature DataFrame?

Odomontois · Accepted Answer

Suppose:

    val cs: SparkContext
    val sc: SQLContext
    val cats: DataFrame

Here userId and frequency are bigint columns, which correspond to _scala.Long_.

We start by creating an intermediate mapping RDD:

    // userId -> (category -> frequency)
    val catMaps = cats.rdd
      .groupBy(_.getAs[Long]("userId"))
      .map { case (id, rows) =>
        id -> rows
          .map { row => row.getAs[String]("category") -> row.getAs[Long]("frequency") }
          .toMap
      }

Then you can collect all the categories that are present, in lexicographic order:

    val catNames = cs.broadcast(catMaps.map(_._2.keySet).reduce(_ union _).toArray.sorted)

Or create the list manually:

    // Explicit method calls avoid the postfix-operator warning of `1 to 10 map {...} toArray`
    val catNames = cs.broadcast((1 to 10).map(n => s"cat$n").toArray)

Finally, we transform the maps into arrays, using 0 for categories that are not present:

    import sc.implicits._

    val catArrays = catMaps
      .map { case (id, catMap) => id -> catNames.value.map(catMap.getOrElse(_, 0L)) }
      .toDF("userId", "feature")

Now catArrays.show() prints something like:

    +------+--------------------+
    |userId|             feature|
    +------+--------------------+
    |     2|[0, 1, 0, 6, 0, 0...|
    |     1|[1, 0, 3, 0, 0, 0...|
    |     3|[5, 0, 0, 0, 16, ...|
    +------+--------------------+

This may not be the most elegant solution for DataFrames, as I am barely familiar with this area of Spark.

Note that you can create catNames manually to add zeros for the missing _cat3_, _cat5_, ...
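To make the effect of a manually built category list concrete, here is a minimal sketch that reproduces the same group-then-fill logic on plain Scala collections instead of an RDD. The object name `DenseFeatureSketch` and the helper `densify` are made up for illustration; the rows are the sample data from the question.

```scala
object DenseFeatureSketch {
  // Sample rows from the question as (userId, category, frequency).
  val rows: Seq[(Long, String, Long)] = Seq(
    (1L, "cat1", 1L), (1L, "cat2", 3L), (1L, "cat9", 5L),
    (2L, "cat4", 6L), (2L, "cat9", 2L), (2L, "cat10", 1L),
    (3L, "cat1", 5L), (3L, "cat7", 16L), (3L, "cat8", 2L)
  )

  // Manually built category list: cat1 .. cat10 in numeric order,
  // including categories that never occur in the data.
  val catNames: Array[String] = (1 to 10).map(n => s"cat$n").toArray

  // Group by user, then look up every category and default to 0 when absent.
  def densify(rows: Seq[(Long, String, Long)],
              names: Array[String]): Map[Long, Array[Long]] =
    rows.groupBy(_._1).map { case (id, rs) =>
      val catMap = rs.map { case (_, cat, freq) => cat -> freq }.toMap
      id -> names.map(catMap.getOrElse(_, 0L))
    }

  def main(args: Array[String]): Unit =
    densify(rows, catNames).toSeq.sortBy(_._1).foreach { case (id, v) =>
      println(s"$id,${v.mkString("[", ",", "]")}")
    }
}
// prints:
// 1,[1,3,0,0,0,0,0,0,5,0]
// 2,[0,0,0,6,0,0,0,0,2,1]
// 3,[5,0,0,0,0,0,16,2,0,0]
```

With the manual list the vectors have length 10 and match the output requested in the question, whereas the lexicographically sorted list collected from the data has only the 7 categories that actually occur and places cat10 before cat2.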

Also note that if catNames is collected from the data rather than created manually, the catMaps RDD is traversed twice, so you might want to .persist() it.
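A possible way to apply that advice is sketched below. It assumes the same column names and types as above (userId and frequency as bigint); the object name `CatMapsSketch`, the function name `cachedCatMaps`, and the choice of `MEMORY_AND_DISK` as storage level are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

object CatMapsSketch {
  // Builds userId -> (category -> frequency) and marks the result for caching,
  // because it is consumed twice: once to collect the category names and
  // once to build the feature arrays.
  def cachedCatMaps(cats: DataFrame): RDD[(Long, Map[String, Long])] =
    cats.rdd
      .groupBy(_.getAs[Long]("userId"))
      .map { case (id, rows) =>
        id -> rows
          .map(row => row.getAs[String]("category") -> row.getAs[Long]("frequency"))
          .toMap
      }
      // Storage level is an assumption; pick one that fits your cluster memory.
      .persist(StorageLevel.MEMORY_AND_DISK)
}
```

Once the feature DataFrame has been materialized, calling `unpersist()` on the returned RDD releases the cached blocks.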

zero323 · Answer

A slightly more DataFrame-centric solution:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.functions.{lit, sum, when}

    // Assumes a spark-shell style session: sc is the SparkContext and the
    // SQL implicits (toDF, $"...") are already in scope.
    val df = sc.parallelize(Seq(
      (1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
      (2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
      (3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2)
    )).toDF("userID", "category", "frequency")

    // Create a sorted array of categories
    val categories = df
      .select($"category")
      .distinct.map(_.getString(0))
      .collect
      .sorted

    // Prepare the vector assembler
    val assembler = new VectorAssembler()
      .setInputCols(categories)
      .setOutputCol("features")

    // Aggregation expressions: one conditional sum per category
    val exprs = categories.map(
      c => sum(when($"category" === c, $"frequency").otherwise(lit(0))).alias(c))

    val transformed = assembler.transform(
        df.groupBy($"userID").agg(exprs.head, exprs.tail: _*))
      .select($"userID", $"features")

And a UDAF alternative:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{
      MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.mllib.linalg.{Vectors, VectorUDT}
    import org.apache.spark.sql.types.{
      StructType, ArrayType, DoubleType, IntegerType}
    import scala.collection.mutable.WrappedArray

    // Aggregates (index, value) pairs into a dense vector of length n
    class VectorAggregate (n: Int) extends UserDefinedAggregateFunction {
      def inputSchema = new StructType()
        .add("i", IntegerType)
        .add("v", DoubleType)
      def bufferSchema = new StructType().add("buff", ArrayType(DoubleType))
      def dataType = new VectorUDT()
      def deterministic = true

      // Start from an all-zero buffer
      def initialize(buffer: MutableAggregationBuffer) = {
        buffer.update(0, Array.fill(n)(0.0))
      }

      // Add the incoming value at its index, skipping null indices
      def update(buffer: MutableAggregationBuffer, input: Row) = {
        if (!input.isNullAt(0)) {
          val i = input.getInt(0)
          val v = input.getDouble(1)
          val buff = buffer.getAs[WrappedArray[Double]](0)
          buff(i) += v
          buffer.update(0, buff)
        }
      }

      // Element-wise sum of two partial buffers
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
        val buff1 = buffer1.getAs[WrappedArray[Double]](0)
        val buff2 = buffer2.getAs[WrappedArray[Double]](0)
        for ((x, i) <- buff2.zipWithIndex) {
          buff1(i) += x
        }
        buffer1.update(0, buff1)
      }

      def evaluate(buffer: Row) = Vectors.dense(
        buffer.getAs[Seq[Double]](0).toArray)
    }

Example usage:

    import org.apache.spark.ml.feature.StringIndexer

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("category_idx")
      .fit(df)

    // The UDAF expects an integer index and a double value
    val indexed = indexer.transform(df)
      .withColumn("category_idx", $"category_idx".cast("integer"))
      .withColumn("frequency", $"frequency".cast("double"))

    val n = indexer.labels.size + 1

    val transformed = indexed
      .groupBy($"userID")
      .agg(new VectorAggregate(n)($"category_idx", $"frequency").as("vec"))

    transformed.show

    // +------+--------------------+
    // |userID|                 vec|
    // +------+--------------------+
    // |     1|[1.0,5.0,0.0,3.0,...|
    // |     2|[0.0,2.0,0.0,0.0,...|
    // |     3|[5.0,0.0,16.0,0.0...|
    // +------+--------------------+

In this case the order of the values is defined by indexer.labels:

    indexer.labels
    // Array[String] = Array(cat1, cat9, cat7, cat2, cat8, cat4, cat10)

In practice I would prefer the solution by Odomontois, so these are provided mostly for reference.

Marsellus Wallace · Answer

Given your input:

    val df = Seq((1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
                 (2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
                 (3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2))
      .toDF("userID", "category", "frequency")

    df.show

    +------+--------+---------+
    |userID|category|frequency|
    +------+--------+---------+
    |     1|    cat1|        1|
    |     1|    cat2|        3|
    |     1|    cat9|        5|
    |     2|    cat4|        6|
    |     2|    cat9|        2|
    |     2|   cat10|        1|
    |     3|    cat1|        5|
    |     3|    cat7|       16|
    |     3|    cat8|        2|
    +------+--------+---------+

Just run:

    val pivoted = df.groupBy("userID").pivot("category").avg("frequency")
    // Categories a user never touched come out as null; replace them with 0
    val dfZeros = pivoted.na.fill(0)

    dfZeros.show

    +------+----+-----+----+----+----+----+----+
    |userID|cat1|cat10|cat2|cat4|cat7|cat8|cat9|
    +------+----+-----+----+----+----+----+----+
    |     1| 1.0|  0.0| 3.0| 0.0| 0.0| 0.0| 5.0|
    |     3| 5.0|  0.0| 0.0| 0.0|16.0| 2.0| 0.0|
    |     2| 0.0|  1.0| 0.0| 6.0| 0.0| 0.0| 2.0|
    +------+----+-----+----+----+----+----+----+

Finally, use VectorAssembler to create an org.apache.spark.ml.linalg.Vector.
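The VectorAssembler step is not spelled out in this answer; one way it might look is sketched below. The sketch assumes Spark 2.x or later with a local SparkSession, and the names `PivotAssemblerSketch` and `assemble` are made up for illustration. Every column except userID is treated as a feature column.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}

object PivotAssemblerSketch {
  // Packs every column except userID into a single "features" vector column.
  def assemble(dfZeros: DataFrame): DataFrame = {
    val featureCols = dfZeros.columns.filter(_ != "userID")
    new VectorAssembler()
      .setInputCols(featureCols)
      .setOutputCol("features")
      .transform(dfZeros)
      .select("userID", "features")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-assembler-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "cat1", 1), (1, "cat2", 3), (1, "cat9", 5),
                 (2, "cat4", 6), (2, "cat9", 2), (2, "cat10", 1),
                 (3, "cat1", 5), (3, "cat7", 16), (3, "cat8", 2))
      .toDF("userID", "category", "frequency")

    // Same pivot and zero fill as in the answer above.
    val dfZeros = df.groupBy("userID").pivot("category").avg("frequency").na.fill(0)

    assemble(dfZeros).show(false)
    spark.stop()
  }
}
```

The order of the vector entries follows the order of the pivoted columns, so here it is lexicographic (cat1, cat10, cat2, ...).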

NOTE: I have not checked the performance of this yet...

EDIT: Possibly more complex, but likely more efficient!

    import org.apache.spark.ml.feature.StringIndexer
    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{collect_list, sort_array, struct, udf}

    // Turns a list of (idx, val) structs into a sparse vector of the given size
    def toSparseVectorUdf(size: Int) = udf[Vector, Seq[Row]] {
      (data: Seq[Row]) => {
        val indices = data.map(_.getDouble(0).toInt).toArray
        val values = data.map(_.getInt(1).toDouble).toArray
        Vectors.sparse(size, indices, values)
      }
    }

    val indexer = new StringIndexer().setInputCol("category").setOutputCol("idx")
    val indexerModel = indexer.fit(df)
    val totalCategories = indexerModel.labels.size
    val dataWithIndices = indexerModel.transform(df)

    // sort_array orders the structs by idx, so the sparse indices are ascending
    val data = dataWithIndices.groupBy("userId")
      .agg(sort_array(collect_list(struct($"idx", $"frequency".as("val")))).as("data"))
    val dataWithFeatures = data
      .withColumn("features", toSparseVectorUdf(totalCategories)($"data"))
      .drop("data")

    dataWithFeatures.show(false)

    +------+--------------------------+
    |userId|features                  |
    +------+--------------------------+
    |1     |(7,[0,1,3],[1.0,5.0,3.0]) |
    |3     |(7,[0,2,4],[5.0,16.0,2.0])|
    |2     |(7,[1,5,6],[2.0,6.0,1.0]) |
    +------+--------------------------+

NOTE: StringIndexer sorts categories by frequency, so the most frequent category ends up at index 0 in indexerModel.labels. Feel free to use your own mapping instead and pass it directly to toSparseVectorUdf.
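As a rough sketch of what passing your own mapping could look like: the variant below takes a category-to-index map instead of a size and skips StringIndexer entirely. The names `CustomMappingSketch`, `categoryIndex`, `toSparseVectorUdfWithMapping` and `withFeatures` are hypothetical, and the fixed cat1..cat10 mapping is an assumption based on the question. It also assumes that each (userID, category) pair occurs at most once and that every category is present in the mapping.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, collect_list, struct, udf}

object CustomMappingSketch {
  // Hypothetical fixed mapping: cat1 -> 0, cat2 -> 1, ..., cat10 -> 9.
  val categoryIndex: Map[String, Int] =
    (1 to 10).map(n => s"cat$n" -> (n - 1)).toMap

  // Turns a list of (category, frequency) structs into a sparse vector whose
  // positions are given by the supplied mapping. Fails on unknown categories.
  def toSparseVectorUdfWithMapping(mapping: Map[String, Int]): UserDefinedFunction =
    udf[Vector, Seq[Row]] { (data: Seq[Row]) =>
      val elements = data
        .map(r => (mapping(r.getString(0)), r.getAs[Number](1).doubleValue()))
        .sortBy(_._1) // sparse vectors need strictly increasing indices
      Vectors.sparse(mapping.size, elements)
    }

  // Collects the (category, frequency) pairs per user and builds the vector.
  def withFeatures(df: DataFrame): DataFrame =
    df.groupBy("userID")
      .agg(collect_list(struct(col("category"), col("frequency"))).as("data"))
      .withColumn("features", toSparseVectorUdfWithMapping(categoryIndex)(col("data")))
      .drop("data")
}
```

Calling `CustomMappingSketch.withFeatures(df)` on the input DataFrame from this answer should give vectors of size 10 in the cat1..cat10 order requested in the question, independent of how often each category occurs.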