How to merge multiple feature vectors in a DataFrame?

Question

Using Spark ML transformers, I arrived at a DataFrame where each row looks like this:

Row(object_id, text_features_vector, color_features, type_features)

where text_features is a sparse vector of term weights, color_features is a small 20-element (one-hot-encoded) dense vector of colors, and type_features is likewise a one-hot-encoded dense vector of types.

What is a good approach (using Spark's facilities) to merge these features into a single, large array so that I can measure things like the cosine distance between any two objects?

zero323 · Accepted Answer

You can use VectorAssembler:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.DataFrame

    // The DataFrame produced by the upstream transformers
    val df: DataFrame = ???

    // Concatenate the three vector columns into a single "features" column
    val assembler = new VectorAssembler()
      .setInputCols(Array("text_features", "color_features", "type_features"))
      .setOutputCol("features")

    val transformed = assembler.transform(df)
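For the cosine distance part of the question, here is a minimal sketch of how the assembled column could be used. It is an illustration, not part of the original answer. It assumes Spark 3.0 or later (where `Vector.dot` is available), a local SparkSession, and made-up toy rows; the `cosineDistance` helper and the `CosineDistanceSketch` object are hypothetical names.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

object CosineDistanceSketch {
  // Hypothetical helper: cosine distance = 1 - cosine similarity.
  // Yields NaN if either vector has zero norm.
  def cosineDistance(a: Vector, b: Vector): Double =
    1.0 - a.dot(b) / (Vectors.norm(a, 2.0) * Vectors.norm(b, 2.0))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cosine-distance-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data (assumed for illustration): sparse text weights plus
    // two short one-hot style dense vectors per object.
    val df = Seq(
      (1L, Vectors.sparse(5, Array(0, 3), Array(0.5, 1.2)),
        Vectors.dense(1.0, 0.0), Vectors.dense(0.0, 1.0)),
      (2L, Vectors.sparse(5, Array(0, 4), Array(0.7, 0.3)),
        Vectors.dense(1.0, 0.0), Vectors.dense(1.0, 0.0))
    ).toDF("object_id", "text_features", "color_features", "type_features")

    // Same assembler configuration as in the answer.
    val assembled = new VectorAssembler()
      .setInputCols(Array("text_features", "color_features", "type_features"))
      .setOutputCol("features")
      .transform(df)

    // Collect the assembled vectors to the driver, keyed by object_id.
    // This is only reasonable for small data.
    val byId = assembled.select("object_id", "features")
      .collect()
      .map(r => r.getLong(0) -> r.getAs[Vector]("features"))
      .toMap

    println(cosineDistance(byId(1L), byId(2L)))

    spark.stop()
  }
}
```

Collecting to the driver keeps the sketch short; for a large DataFrame you would more likely wrap the same helper in a UDF and apply it to joined pairs of rows.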

For a PySpark example, see: Encode and assemble multiple features in PySpark