Issue with OneHotEncoder for categorical features

Question

I want to encode 3 categorical features out of the 10 features in my dataset. I use preprocessing from sklearn to do so, as follows:

    from sklearn import preprocessing

    cat_features = ['color', 'director_name', 'actor_2_name']
    enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
    enc.fit(dataset.values)

However, I could not proceed because I am getting this error:

        array = np.array(array, dtype=dtype, order=order, copy=copy)
    ValueError: could not convert string to float: PG

I am surprised that it complains about the string when converting strings is exactly what it is supposed to do! Am I missing something here?

piman314 · Accepted Answer

If you read the docs for OneHotEncoder, you'll see that the input for fit is an "Input array of type int". So you need to do two steps to get your one-hot encoded data:

    from sklearn import preprocessing

    cat_features = ['color', 'director_name', 'actor_2_name']

    # Step 1: map each string label to an integer
    enc = preprocessing.LabelEncoder()
    enc.fit(cat_features)
    new_cat_features = enc.transform(cat_features)
    print(new_cat_features)  # [1 2 0]

    # Step 2: one-hot encode the integers
    new_cat_features = new_cat_features.reshape(-1, 1)  # Needs to be the correct shape
    # sparse=False is easier to read; the argument is named sparse_output from scikit-learn 1.2 onward
    ohe = preprocessing.OneHotEncoder(sparse=False)
    print(ohe.fit_transform(new_cat_features))

Output:

    [[ 0.  1.  0.]
     [ 0.  0.  1.]
     [ 1.  0.  0.]]
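To make the two steps less of a black box, here is a minimal pure-Python sketch of what they do conceptually. The helper names `label_encode` and `one_hot` are made up for illustration and are not part of scikit-learn; the sketch assumes labels are numbered in sorted order, which is consistent with the `[1 2 0]` result shown in the comment above.

```python
def label_encode(values):
    """Map each distinct value to an integer, numbering the values in sorted order."""
    classes = sorted(set(values))
    index = {value: i for i, value in enumerate(classes)}
    return [index[v] for v in values], classes


def one_hot(codes, n_classes):
    """Turn each integer code into a row with a single 1.0 at that position."""
    return [[1.0 if code == j else 0.0 for j in range(n_classes)] for code in codes]


cat_features = ['color', 'director_name', 'actor_2_name']
codes, classes = label_encode(cat_features)   # step 1: strings -> integers
rows = one_hot(codes, len(classes))           # step 2: integers -> indicator rows

print(codes)
for row in rows:
    print(row)
```

The real encoders also remember the fitted classes so that new data can be transformed consistently, which this sketch only hints at by returning `classes`.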

EDIT

As of 0.20 this became a bit easier, not only because OneHotEncoder now handles strings nicely, but also because ColumnTransformer lets you transform multiple columns easily. For example:

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    import numpy as np

    X = np.array([['Apple', 'red', 1, 'round', 0],
                  ['orange', 'orange', 2, 'round', 0.1],
                  ['bannana', 'yellow', 2, 'long', 0],
                  ['Apple', 'green', 1, 'round', 0.2]])

    ct = ColumnTransformer(
        [('oh_enc', OneHotEncoder(sparse=False), [0, 1, 3])],  # the column numbers I want to apply this to
        remainder='passthrough'  # This leaves the rest of my columns in place
    )
    print(ct.fit_transform(X))  # Notice the output is a string

Output:

    [['1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1' '0']
     ['0.0' '0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '0.0' '1.0' '2' '0.1']
     ['0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1.0' '0.0' '2' '0']
     ['1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1' '0.2']]

Fallou Tall · Answer

You can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:

    from sklearn.preprocessing import LabelBinarizer

    cat_features = ['color', 'director_name', 'actor_2_name']
    encoder = LabelBinarizer()
    new_cat_features = encoder.fit_transform(cat_features)
    new_cat_features

Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.

Source: Hands-On Machine Learning with Scikit-Learn and TensorFlow

HappyCoding · Answer

If the dataset is in a pandas data frame, using

pandas.get_dummies

will be more straightforward.

* corrected from pandas.get_getdummies to pandas.get_dummies
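For anyone who wants to see what that looks like, here is a small sketch of pandas.get_dummies on a toy data frame. The rows and the `duration` column are invented for illustration; only the `color` and `director_name` column names come from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
dataset = pd.DataFrame({
    "color": ["Color", "Black and White", "Color"],
    "director_name": ["A", "B", "A"],
    "duration": [120, 95, 101],
})

cat_features = ["color", "director_name"]

# Only the listed columns are expanded into indicator columns;
# the remaining columns are kept unchanged.
encoded = pd.get_dummies(dataset, columns=cat_features)
print(encoded.columns.tolist())
```

Each new column is named `<original column>_<category>`. The dtype of the indicator columns depends on the pandas version, so cast explicitly (for example with `.astype(int)`) if you need a specific numeric type.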

Abhishek Thakur · Answer

From the documentation:

    categorical_features : "all" or array of indices or mask
        Specify what features are treated as categorical.
        'all' (default): All features are treated as categorical.
        array of indices: Array of categorical feature indices.
        mask: Array of length n_features and with dtype=bool.

Column names of a pandas dataframe won't work. If your categorical features are column numbers 0, 2 and 6, use:

    from sklearn import preprocessing

    cat_features = [0, 2, 6]
    enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
    enc.fit(dataset.values)

Note also that if these categorical features are not already label encoded, you need to run LabelEncoder on them before using OneHotEncoder.

Bahman Engheta · Answer

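@Medo,

I encountered the same behavior and found it frustrating. As others have pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.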

Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py, and the very first line of that method is

    X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)

This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.

I agree that the documentation of sklearn.preprocessing.OneHotEncoder is very misleading in that regard.
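The order of operations is the whole problem, so here is a simplified pure-Python sketch of it. This is not scikit-learn's code: `select_after_float_check` is a made-up function that only imitates "convert everything to float first, select columns second", and the sample rows are invented.

```python
def select_after_float_check(rows, selected):
    # Step 1 (imitates the float check): every cell must convert to float,
    # including cells in columns that were never selected.
    as_float = [[float(cell) for cell in row] for row in rows]
    # Step 2: only now are the requested columns picked out.
    return [[row[i] for i in selected] for row in as_float]


# Column 1 holds strings, but we only ask for the numeric columns 0 and 2.
rows = [[1, "PG", 3.5], [2, "R", 4.0]]

try:
    select_after_float_check(rows, selected=[0, 2])
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```

Even though the string column is not among the selected ones, the conversion in step 1 already raises, which is the same kind of failure the question ran into.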

Little Bobby Tables · Answer

If, like me, you get frustrated by this, there is a simple fix. Simply use Category Encoders' OneHotEncoder. This is a Sklearn Contrib package, so it plays very nicely with the scikit-learn API.

This works as a direct replacement and does the boring label encoding for you.

    from category_encoders import OneHotEncoder

    cat_features = ['color', 'director_name', 'actor_2_name']
    enc = OneHotEncoder(cols=cat_features)  # category_encoders names this parameter cols
    enc.fit(dataset)  # pass the DataFrame itself so the column names can be resolved