How do I handle categorical features with spark-ml?

Question

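How do I handle categorical data with spark-ml, as opposed to spark-mllib?

Although the documentation is not very clear, it seems that classifiers such as RandomForestClassifier and LogisticRegression take a featuresCol argument, which names the column of features in the DataFrame, and a labelCol argument, which names the column of labeled classes in the DataFrame.

Obviously I want to use more than one feature in my prediction, so I tried using the VectorAssembler to put all my features into a single vector under featuresCol.

However, the VectorAssembler only accepts numeric types, boolean type, and vector type (according to the Spark website), so I can't put strings in my features vector.

How should I proceed?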

eliasah · Answer

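I just wanted to complete Holden's answer.

Since Spark 2.3.0, OneHotEncoder has been deprecated, and it is removed in 3.0.0. Please use OneHotEncoderEstimator instead. (From Spark 3.0.0 onward, OneHotEncoderEstimator itself is renamed to OneHotEncoder.)

In Scala: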

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer}

    // toDF is available out of the box in spark-shell; elsewhere, import spark.implicits._
    val df = Seq((0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3))
      .toDF("id", "category1", "category2")

    // Index the string column first; category2 is already numeric.
    val indexer = new StringIndexer()
      .setInputCol("category1")
      .setOutputCol("category1Index")
    val encoder = new OneHotEncoderEstimator()
      .setInputCols(Array(indexer.getOutputCol, "category2"))
      .setOutputCols(Array("category1Vec", "category2Vec"))

    val pipeline = new Pipeline().setStages(Array(indexer, encoder))

    pipeline.fit(df).transform(df).show
    // +---+---------+---------+--------------+-------------+-------------+
    // | id|category1|category2|category1Index| category1Vec| category2Vec|
    // +---+---------+---------+--------------+-------------+-------------+
    // |  0|        a|        1|           0.0|(2,[0],[1.0])|(4,[1],[1.0])|
    // |  1|        b|        2|           2.0|    (2,[],[])|(4,[2],[1.0])|
    // |  2|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
    // |  3|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
    // |  4|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
    // |  5|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
    // +---+---------+---------+--------------+-------------+-------------+

In Python:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

    # `spark` is the active SparkSession (predefined in the pyspark shell).
    df = spark.createDataFrame(
        [(0, "a", 1), (1, "b", 2), (2, "c", 3), (3, "a", 4), (4, "a", 4), (5, "c", 3)],
        ["id", "category1", "category2"])

    # Index the string column first; category2 is already numeric.
    indexer = StringIndexer(inputCol="category1", outputCol="category1Index")
    inputs = [indexer.getOutputCol(), "category2"]
    encoder = OneHotEncoderEstimator(inputCols=inputs, outputCols=["categoryVec1", "categoryVec2"])

    pipeline = Pipeline(stages=[indexer, encoder])
    pipeline.fit(df).transform(df).show()
    # +---+---------+---------+--------------+-------------+-------------+
    # | id|category1|category2|category1Index| categoryVec1| categoryVec2|
    # +---+---------+---------+--------------+-------------+-------------+
    # |  0|        a|        1|           0.0|(2,[0],[1.0])|(4,[1],[1.0])|
    # |  1|        b|        2|           2.0|    (2,[],[])|(4,[2],[1.0])|
    # |  2|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
    # |  3|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
    # |  4|        a|        4|           0.0|(2,[0],[1.0])|    (4,[],[])|
    # |  5|        c|        3|           1.0|(2,[1],[1.0])|(4,[3],[1.0])|
    # +---+---------+---------+--------------+-------------+-------------+

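Since Spark 1.4.0, MLlib also provides a OneHotEncoder feature, which maps a column of label indices to a column of binary vectors, each with at most a single one-value.

This encoding allows algorithms that expect continuous features, such as logistic regression, to use categorical features.

Let's consider the following DataFrame: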

    val df = Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
      .toDF("id", "category")

The first step is to create a StringIndexer fitted on the DataFrame:

    import org.apache.spark.ml.feature.StringIndexer

    // fit() learns the label-to-index mapping from the data.
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)

    val indexed = indexer.transform(df)

    indexed.show
    // +---+--------+-------------+
    // | id|category|categoryIndex|
    // +---+--------+-------------+
    // |  0|       a|          0.0|
    // |  1|       b|          2.0|
    // |  2|       c|          1.0|
    // |  3|       a|          0.0|
    // |  4|       a|          0.0|
    // |  5|       c|          1.0|
    // +---+--------+-------------+

You can then encode categoryIndex with OneHotEncoder:

    import org.apache.spark.ml.feature.OneHotEncoder

    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")

    val encoded = encoder.transform(indexed)

    encoded.select("id", "categoryVec").show
    // +---+-------------+
    // | id|  categoryVec|
    // +---+-------------+
    // |  0|(2,[0],[1.0])|
    // |  1|    (2,[],[])|
    // |  2|(2,[1],[1.0])|
    // |  3|(2,[0],[1.0])|
    // |  4|(2,[0],[1.0])|
    // |  5|(2,[1],[1.0])|
    // +---+-------------+
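The outputs above can be reproduced in plain Python, which may help explain why "b" is indexed as 2.0 and then encoded as the empty vector (2,[],[]). This is only a sketch of the default behaviour as I understand it (indices ordered by descending label frequency, last category dropped by the encoder); the helper names `string_index` and `one_hot_drop_last` are hypothetical, and the alphabetical tie-break is an assumption of the sketch.

```python
from collections import Counter


def string_index(values):
    """Map each label to a float index, most frequent label first."""
    counts = Counter(values)
    # ASSUMPTION: ties are broken alphabetically (no ties occur in this data).
    labels = sorted(counts, key=lambda v: (-counts[v], v))
    return {label: float(i) for i, label in enumerate(labels)}


def one_hot_drop_last(index, num_categories):
    """Return (size, indices, values), mimicking a sparse vector with the last category dropped."""
    size = num_categories - 1
    i = int(index)
    if i < size:
        return (size, [i], [1.0])
    # The last category is represented by the all-zero vector.
    return (size, [], [])


categories = ["a", "b", "c", "a", "a", "c"]
mapping = string_index(categories)
encoded = [one_hot_drop_last(mapping[c], len(mapping)) for c in categories]

for c, vec in zip(categories, encoded):
    print(c, mapping[c], vec)
```

"a" occurs three times, "c" twice and "b" once, so "b" receives the highest index and is the category that gets dropped.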

hamel · Answer

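I am going to provide an answer from another perspective, since I was also wondering about categorical features with regard to tree-based models in Spark ML (not MLlib), and the documentation is not that clear about how everything works.

When you transform a column in your dataframe using pyspark.ml.feature.StringIndexer, extra metadata gets stored in the dataframe that specifically marks the transformed feature as a categorical feature.

When you print the dataframe you will see a numeric value (an index that corresponds to one of your categorical values), and if you look at the schema you will see that your new transformed column is of type double. However, this new column you created with pyspark.ml.feature.StringIndexer.transform is not just a normal double column: it has extra metadata associated with it that is very important. You can inspect this metadata by looking at the metadata property of the appropriate field in your dataframe's schema (you can access the schema object of your dataframe through yourdataframe.schema).

This extra metadata has two important implications:

When you call .fit() on a tree-based model, it scans the metadata of your dataframe and recognizes the fields that you encoded as categorical with transformers such as pyspark.ml.feature.StringIndexer (other transformers, such as pyspark.ml.feature.VectorIndexer, have the same effect). Because of this, you DO NOT have to one-hot encode your features after transforming them with StringIndexer when using tree-based models in Spark ML (however, you still have to one-hot encode when using other models that do not naturally handle categoricals, such as linear regression).
Because this metadata is stored in the dataframe, you can use pyspark.ml.feature.IndexToString to convert the numeric indices back to the original categorical values (which are often strings) at any time.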
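To make the role of that metadata concrete, here is a toy model in plain Python (no Spark required). The dictionary layout, with an `ml_attr` entry holding `type` and `vals`, mirrors what I believe StringIndexer attaches to the column, but treat the exact keys as an assumption and check them on your Spark version. The helpers `is_categorical` and `index_to_string` are hypothetical and not part of any Spark API.

```python
# Toy stand-in for one schema field of an indexed dataframe.
# ASSUMPTION: the "ml_attr" / "type" / "vals" keys approximate Spark's layout.
field = {
    "name": "categoryIndex",
    "dataType": "double",
    "metadata": {
        "ml_attr": {"type": "nominal", "vals": ["a", "c", "b"], "name": "categoryIndex"}
    },
}

# A double column without any ML metadata, for comparison.
plain_double = {"name": "age", "dataType": "double", "metadata": {}}


def is_categorical(field):
    """True when the field carries nominal metadata, which is what a tree learner looks for."""
    return field["metadata"].get("ml_attr", {}).get("type") == "nominal"


def index_to_string(field, indices):
    """Map numeric indices back to labels, which is conceptually what IndexToString does."""
    labels = field["metadata"]["ml_attr"]["vals"]
    return [labels[int(i)] for i in indices]


print(is_categorical(field))
print(is_categorical(plain_double))
print(index_to_string(field, [0.0, 2.0, 1.0]))
```

Both columns have type double; only the metadata distinguishes the categorical one. In PySpark itself, the corresponding inspection would be along the lines of yourdataframe.schema["categoryIndex"].metadata.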

Holden · Answer

There is a component of the ML pipeline called StringIndexer that you can use to convert your strings to Doubles in a reasonable way. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.feature.StringIndexer has more documentation, and http://spark.apache.org/docs/latest/ml-guide.html shows how to construct pipelines.

Jim · Answer

I use the following method to one-hot encode a single column of a Spark DataFrame:

    def ohcOneColumn(df, colName, debug=False):
        """Replace colName with one boolean column per distinct value."""
        colsToFillNa = []

        if debug: print("Entering method ohcOneColumn")
        countUnique = df.groupBy(colName).count().count()
        if debug: print(countUnique)

        # Collect the distinct values to the driver once.
        collectOnce = df.select(colName).distinct().collect()
        for uniqueValIndex in range(countUnique):
            uniqueVal = collectOnce[uniqueValIndex][0]
            if debug: print(uniqueVal)
            newColName = str(colName) + '_' + str(uniqueVal) + '_TF'
            df = df.withColumn(newColName, df[colName] == uniqueVal)
            colsToFillNa.append(newColName)
        df = df.drop(colName)
        # Comparisons against null yield null; treat those as False.
        df = df.na.fill(False, subset=colsToFillNa)
        return df

I use the following method to one-hot encode whole Spark DataFrames:

    from pyspark.sql.functions import col, countDistinct, approxCountDistinct
    from pyspark.ml.feature import StringIndexer
    from pyspark.ml.feature import OneHotEncoderEstimator

    def detectAndLabelCat(sparkDf, minValCount=5, debug=False, excludeCols=['Target']):
        if debug: print("Entering method detectAndLabelCat")
        newDf = sparkDf
        colList = sparkDf.columns

        for colName in sparkDf.columns:
            uniqueVals = sparkDf.groupBy(colName).count()
            if debug: print(uniqueVals)
            countUnique = uniqueVals.count()
            dtype = str(sparkDf.schema[colName].dataType)
            #dtype = str(df.schema[nc].dataType)
            if (colName in excludeCols):
                if debug: print(str(colName) + ' is in the excluded columns list.')

            elif countUnique == 1:
                # A constant column carries no information.
                newDf = newDf.drop(colName)
                if debug:
                    print('dropping column ' + str(colName) + ' because it only contains one unique value.')
                #end if debug
            #elif (1==2):
            elif ((countUnique < minValCount) | (dtype=="String") | (dtype=="StringType")):
                # Low-cardinality or string column: treat as categorical.
                if debug:
                    print(len(newDf.columns))
                    oldColumns = newDf.columns
                newDf = ohcOneColumn(newDf, colName, debug=debug)
                if debug:
                    print(len(newDf.columns))
                    newColumns = set(newDf.columns) - set(oldColumns)
                    print('Adding:')
                    print(newColumns)
                    for newColumn in newColumns:
                        if newColumn in newDf.columns:
                            try:
                                newUniqueValCount = newDf.groupBy(newColumn).count().count()
                                print("There are " + str(newUniqueValCount) + " unique values in " + str(newColumn))
                            except Exception:
                                print('Uncaught error discussing ' + str(newColumn))
                        #else:
                        #    newColumns.remove(newColumn)

                    print('Dropping:')
                    print(set(oldColumns) - set(newDf.columns))

            else:
                if debug: print('Nothing done for column ' + str(colName))

                #end if countUnique == 1, elif countUnique other condition
            #end outer for
        return newDf
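To see what shape of output this approach produces, here is a plain-Python imitation of ohcOneColumn that works on a list of dict rows. It is only a sketch: `one_hot_rows` is a hypothetical helper, not Spark code, and it assumes the rows contain no missing values.

```python
def one_hot_rows(rows, col_name):
    """Replace col_name with one boolean field per distinct value, named <col>_<value>_TF."""
    # Distinct values in order of first appearance.
    distinct = []
    for row in rows:
        if row[col_name] not in distinct:
            distinct.append(row[col_name])

    encoded = []
    for row in rows:
        # Copy every field except the one being encoded.
        new_row = {k: v for k, v in row.items() if k != col_name}
        for value in distinct:
            new_row[f"{col_name}_{value}_TF"] = row[col_name] == value
        encoded.append(new_row)
    return encoded


rows = [
    {"id": 0, "category": "a"},
    {"id": 1, "category": "b"},
    {"id": 2, "category": "a"},
]
encoded = one_hot_rows(rows, "category")
print(encoded[0])  # {'id': 0, 'category_a_TF': True, 'category_b_TF': False}
```

Unlike OneHotEncoder, this yields one boolean column per distinct value and does not drop a last category, so a column with many distinct values produces a very wide DataFrame.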

Vadim Smolyakov · Answer

You can use the cast function to convert a string column in a Spark DataFrame to a numeric data type.

    from pyspark.sql import SQLContext
    from pyspark.sql.types import DoubleType, IntegerType

    # `sc` is the active SparkContext (predefined in the pyspark shell).
    sqlContext = SQLContext(sc)
    dataset = sqlContext.read.format('com.databricks.spark.csv').options(header='true').load('./data/titanic.csv')

    # Columns are read as strings by default; cast them to numeric types.
    dataset = dataset.withColumn("Age", dataset["Age"].cast(DoubleType()))
    dataset = dataset.withColumn("Survived", dataset["Survived"].cast(IntegerType()))

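In the example above, we read a CSV file in as a DataFrame, cast the default string data types to integer and double, and overwrite the original DataFrame. We can then use the VectorAssembler to merge the features into a single vector and apply our favorite Spark ML algorithm.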