How do I reverse a sklearn.OneHotEncoder transform to recover the original data?

Question

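I encoded my categorical data with sklearn.OneHotEncoder and fed it to a random forest classifier. Everything seems to work, and I got my predicted output back.

Is there a way to reverse the encoding and convert my output back to its original state?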

Mack · Answer

A good, systematic way to figure this out is to start with some test data and walk through the sklearn.OneHotEncoder source with it. If you don't much care how it works and just want a quick answer, skip to the end.

    import numpy as np

    X = np.array([
        [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
        [5, 1, 3, 7, 8, 12, 15, 19, 20, 8]
    ]).T

n_values_

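Lines 1763-1786 determine the n_values_ parameter. It is determined automatically if you set n_values='auto' (the default). Alternatively, you can specify a maximum value for all features (int) or a maximum value per feature (array). Let's assume we are using the default, so the following lines execute: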

    n_samples, n_features = X.shape   # 10, 2
    n_values = np.max(X, axis=0) + 1  # [100, 21]
    self.n_values_ = n_values

feature_indices_

Next, the feature_indices_ parameter is calculated.

    n_values = np.hstack([[0], n_values])  # [0, 100, 21]
    indices = np.cumsum(n_values)          # [0, 100, 121]
    self.feature_indices_ = indices

So feature_indices_ is simply the cumulative sum of n_values_ with a 0 prepended.
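The two steps above can be reproduced outside the encoder with plain NumPy. This is only a sketch of the arithmetic, not the library's code; the variable names `feature_indices` and `column_indices` are chosen here for illustration.

```python
import numpy as np

# Test data from the answer: 10 samples, 2 integer-coded features.
X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

# One slot per possible value 0..max for each feature.
n_values = X.max(axis=0) + 1

# Prepending 0 and taking the cumulative sum gives each feature's
# starting column in the full (uncompressed) one-hot layout.
feature_indices = np.cumsum(np.hstack([[0], n_values]))

# Column that each (sample, feature) value lands in.
column_indices = X + feature_indices[:-1]

print(n_values)           # [100  21]
print(feature_indices)    # [  0 100 121]
print(column_indices[0])  # [  3 105]
```

The offsets are what keep the second feature's value 5 (column 105) from colliding with a value 5 in the first feature (column 5).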

Sparse Matrix Construction

Next, a scipy.sparse.coo_matrix is constructed from the data. It is initialized from three arrays: the sparse data (all ones), the row indices, and the column indices.

    column_indices = (X + indices[:-1]).ravel()
    # array([  3, 105,  10, 101,  15, 103,  33, 107,  54, 108,
    #         55, 112,  78, 115,  79, 119,  80, 120,  99, 108])

    row_indices = np.repeat(np.arange(n_samples, dtype=np.int32), n_features)
    # array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype=int32)

    data = np.ones(n_samples * n_features)
    # array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
    #         1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

    out = sparse.coo_matrix((data, (row_indices, column_indices)),
                            shape=(n_samples, indices[-1]),
                            dtype=self.dtype).tocsr()
    # <10x121 sparse matrix of type '<type 'numpy.float64'>'
    #     with 20 stored elements in Compressed Sparse Row format>

Note that the coo_matrix is immediately converted to a scipy.sparse.csr_matrix. The coo_matrix is used as an intermediate format because it "facilitates fast conversion among sparse formats."
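As a standalone illustration of the same construction, the snippet below builds the matrix with SciPy directly. It assumes the offsets `[0, 100]` and the total width 121 derived earlier for this particular X; it is a sketch, not the encoder's implementation.

```python
import numpy as np
from scipy import sparse

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
n_samples, n_features = X.shape

# Assumed from the previous step: feature_indices_ == [0, 100, 121].
offsets = np.array([0, 100])
n_columns = 121

column_indices = (X + offsets).ravel()
row_indices = np.repeat(np.arange(n_samples), n_features)
data = np.ones(n_samples * n_features)

# COO takes (value, (row, column)) triplets; CSR is better for slicing.
coo = sparse.coo_matrix(
    (data, (row_indices, column_indices)),
    shape=(n_samples, n_columns),
)
csr = coo.tocsr()

print(csr.shape)  # (10, 121)
print(csr.nnz)    # 20
```

Every row holds exactly two ones, one per feature, which is why 10 samples produce 20 stored elements.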

active_features_

Now, if n_values='auto', the sparse csr matrix is compressed down to only the columns with active features. The sparse csr_matrix is returned if sparse=True; otherwise it is densified before being returned.

    if self.n_values == 'auto':
        mask = np.array(out.sum(axis=0)).ravel() != 0
        active_features = np.where(mask)[0]
        # array([  3,  10,  15,  33,  54,  55,  78,  79,  80,  99,
        #        101, 103, 105, 107, 108, 112, 115, 119, 120])
        out = out[:, active_features]
        # <10x19 sparse matrix of type '<type 'numpy.float64'>'
        #     with 20 stored elements in Compressed Sparse Row format>
        self.active_features_ = active_features

    return out if self.sparse else out.toarray()

Decoding

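Now let's work in reverse. We want to know how to recover X, given the sparse matrix that is returned together with the OneHotEncoder attributes described above. Assume we actually ran the code above by instantiating a new OneHotEncoder and calling fit_transform on our data X.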

    from sklearn import preprocessing

    ohc = preprocessing.OneHotEncoder()  # all default params
    out = ohc.fit_transform(X)

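The key insight for solving this problem is understanding the relationship between active_features_ and out.indices. For a csr_matrix, the indices array contains the column number of each data point. However, these column numbers are not guaranteed to be sorted. To sort them, we can use the sorted_indices method.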

    out.indices
    # array([12,  0, 10,  1, 11,  2, 13,  3, 14,  4,
    #        15,  5, 16,  6, 17,  7, 18,  8, 14,  9], dtype=int32)

    out = out.sorted_indices()
    out.indices
    # array([ 0, 12,  1, 10,  2, 11,  3, 13,  4, 14,
    #         5, 15,  6, 16,  7, 17,  8, 18,  9, 14], dtype=int32)

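We can see that before sorting, the indices are actually reversed within each row: the last column comes first and the first column comes last. This is evident from the first two elements, [12, 0]. The 0 corresponds to the 3 in the first column of X, because 3 is the smallest element there and was assigned to the first active column. The 12 corresponds to the 5 in the second column of X. Since the first column of X occupies 10 distinct encoded columns, the smallest element of the second column (1) gets index 10, the next smallest (3) gets index 11, and the third smallest (5) gets index 12. After sorting, the indices are ordered as we would expect.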
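The effect of sorted_indices can be seen in isolation on a tiny matrix. The sketch below builds a CSR matrix directly from its three internal arrays, with the columns of each row deliberately listed in descending order; the 2x4 example data is made up for illustration.

```python
import numpy as np
from scipy import sparse

# CSR internals: row i owns indices[indptr[i]:indptr[i + 1]].
data = np.ones(4)
indices = np.array([2, 0, 3, 1])  # row 0 -> columns 2, 0; row 1 -> columns 3, 1
indptr = np.array([0, 2, 4])
m = sparse.csr_matrix((data, indices, indptr), shape=(2, 4))

# sorted_indices() returns a copy whose column numbers ascend within each row.
s = m.sorted_indices()

print(s.indices)                            # [0 2 1 3]
print((m.toarray() == s.toarray()).all())   # True
```

Sorting changes only the storage order, not the matrix itself, which is why it is safe to sort before decoding.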

Next, let's look at active_features_:

    ohc.active_features_
    # array([  3,  10,  15,  33,  54,  55,  78,  79,  80,  99,
    #        101, 103, 105, 107, 108, 112, 115, 119, 120])

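Notice that there are 19 elements, which matches the number of distinct elements in our data (one element, 8, was repeated once). Notice also that they are arranged in order. The features from the first column of X are unchanged, and the features from the second column have simply had 100 added to them, which corresponds to ohc.feature_indices_[1].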

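Looking back at out.indices, we can see that the maximum column number is 18, which is one less than the 19 active features in our encoding. A little thought about this relationship shows that the positions in ohc.active_features_ correspond to the column numbers in out.indices. With this, we can decode: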

    import numpy as np

    decode_columns = np.vectorize(lambda col: ohc.active_features_[col])
    decoded = decode_columns(out.indices).reshape(X.shape)

This gives us:

    array([[  3, 105],
           [ 10, 101],
           [ 15, 103],
           [ 33, 107],
           [ 54, 108],
           [ 55, 112],
           [ 78, 115],
           [ 79, 119],
           [ 80, 120],
           [ 99, 108]])

We can then get back to the original feature values by subtracting the offsets from ohc.feature_indices_:

    recovered_X = decoded - ohc.feature_indices_[:-1]
    # array([[ 3,  5],
    #        [10,  1],
    #        [15,  3],
    #        [33,  7],
    #        [54,  8],
    #        [55, 12],
    #        [78, 15],
    #        [79, 19],
    #        [80, 20],
    #        [99,  8]])

Note that you need to know the original shape of X, which is simply (n_samples, n_features).

TL;DR

Given the sklearn.OneHotEncoder instance called ohc, the encoded data (a scipy.sparse.csr_matrix) output by ohc.fit_transform or ohc.transform called out, and the shape of the original data (n_samples, n_features), recover the original data X with:

    recovered_X = (
        np.array([ohc.active_features_[col]
                  for col in out.sorted_indices().indices])
        .reshape(n_samples, n_features)
        - ohc.feature_indices_[:-1]
    )
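The legacy attributes used above may not exist in the scikit-learn version you have installed (a later answer notes that active_features_ was deprecated in 0.20). To check the decoding logic without depending on them, the round trip can be emulated with NumPy and SciPy alone. This is a sketch that mimics the behaviour described in this answer; `feature_indices` and `active_features` are local stand-ins, not encoder attributes.

```python
import numpy as np
from scipy import sparse

X = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
n_samples, n_features = X.shape

# Encode, mimicking the n_values='auto' behaviour described above.
feature_indices = np.cumsum(np.hstack([[0], X.max(axis=0) + 1]))
cols = (X + feature_indices[:-1]).ravel()
rows = np.repeat(np.arange(n_samples), n_features)
full = sparse.coo_matrix(
    (np.ones(cols.size), (rows, cols)),
    shape=(n_samples, int(feature_indices[-1])),
).tocsr()
active_features = np.flatnonzero(np.asarray(full.sum(axis=0)).ravel())
out = full[:, active_features]

# Decode: each stored column number in `out` is a position in
# active_features, so fancy indexing replaces the per-element lookup.
decoded = active_features[out.sorted_indices().indices]
recovered_X = decoded.reshape(n_samples, n_features) - feature_indices[:-1]

print(out.shape)                       # (10, 19)
print(np.array_equal(recovered_X, X))  # True
```

The reshape relies on every row having exactly n_features stored entries, which holds for a one-hot encoding with no missing values.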

Bohumir Zamecnik · Answer

Just compute the dot product of the encoded values with ohe.active_features_. This works for both sparse and dense representations. Example:

    from sklearn.preprocessing import OneHotEncoder
    import numpy as np

    orig = np.array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])

    ohe = OneHotEncoder()
    encoded = ohe.fit_transform(orig.reshape(-1, 1))  # input needs to be column-wise

    decoded = encoded.dot(ohe.active_features_).astype(int)
    assert np.allclose(orig, decoded)

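The key insight is that the active_features_ attribute of the OHE model holds the original value for each binary column. We can therefore decode the binary-encoded number by simply computing a dot product with active_features_. For each data point there is exactly one 1, at the position of the original value.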
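The same idea can be checked with NumPy only, which avoids relying on the deprecated attribute. In the sketch below, `values` plays the role that active_features_ plays above; it is a hand-built stand-in, not something the encoder provides under that name.

```python
import numpy as np

orig = np.array([6, 9, 8, 2, 5, 4, 5, 3, 3, 6])

# np.unique returns the sorted distinct values (one per one-hot column)
# and, with return_inverse=True, the column position of every sample.
values, positions = np.unique(orig, return_inverse=True)

# Dense one-hot matrix: row i has a single 1 in column positions[i].
encoded = np.eye(len(values), dtype=int)[positions]

# Each row dotted with the column values picks out exactly one entry.
decoded = encoded @ values

print(values)   # [2 3 4 5 6 8 9]
print(decoded)  # [6 9 8 2 5 4 5 3 3 6]
```

Note that this works because there is a single feature. With several features each row contains several ones, so the dot product would add their column values together instead of returning them separately.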

Martin Thoma · Answer

How to one-hot encode

See https://stackoverflow.com/a/42874726/562769

    import numpy as np

    nb_classes = 6
    orig_data = [[2, 3, 4, 0]]


    def indices_to_one_hot(data, nb_classes):
        """Convert an iterable of indices to one-hot encoded labels."""
        targets = np.array(data).reshape(-1)
        return np.eye(nb_classes)[targets]

How to reverse

    def one_hot_to_indices(data):
        indices = []
        for el in data:
            # Position of the single 1 in this one-hot row.
            indices.append(list(el).index(1))
        return indices


    hot = indices_to_one_hot(orig_data, nb_classes)
    indices = one_hot_to_indices(hot)

    print(orig_data)
    print(indices)

This gives:

    [[2, 3, 4, 0]]
    [2, 3, 4, 0]

Shawn · Answer

If the features are dense, like [1,2,4,5,6], with a few numbers missing, we can map them to their corresponding positions.

    >>> import numpy as np
    >>> from scipy import sparse
    >>> def _sparse_binary(y):
    ...     # one-hot codes of y with scipy.sparse matrix.
    ...     row = np.arange(len(y))
    ...     col = y - y.min()
    ...     data = np.ones(len(y))
    ...     return sparse.csr_matrix((data, (row, col)))
    ...
    >>> y = np.random.randint(-2, 2, 8).reshape([4, 2])
    >>> y
    array([[ 0, -2],
           [-2,  1],
           [ 1,  0],
           [ 0, -2]])
    >>> yc = [_sparse_binary(y[:, i]) for i in range(2)]
    >>> for i in yc: print(i.todense())
    ...
    [[ 0.  0.  1.  0.]
     [ 1.  0.  0.  0.]
     [ 0.  0.  0.  1.]
     [ 0.  0.  1.  0.]]
    [[ 1.  0.  0.  0.]
     [ 0.  0.  0.  1.]
     [ 0.  0.  1.  0.]
     [ 1.  0.  0.  0.]]
    >>> [i.shape for i in yc]
    [(4, 4), (4, 4)]

This is a simple compromise, but it works and is easy to reverse with argmax(), e.g.:

    >>> np.argmax(yc[0].todense(), 1) + y.min(0)[0]
    matrix([[ 0],
            [-2],
            [ 1],
            [ 0]])

blueberryfields · Answer

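The short answer is "no". The encoder takes your categorical data and automatically transforms it into a reasonable set of numbers.

The longer answer is "not automatically". If you provide an explicit mapping using the n_values parameter, though, you can probably implement your own decoding on the other side. See the documentation for some hints on how that might be done.

That said, this is a fairly strange question. You may want to use a DictVectorizer instead.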

melqkiades · Answer

Since version 0.20 of scikit-learn, the active_features_ attribute of the OneHotEncoder class has been deprecated, so I suggest relying on the categories_ attribute instead.

The function below can help you recover the original data from a matrix that has been one-hot encoded:

    import itertools


    def reverse_one_hot(X, y, encoder):
        reversed_data = [{} for _ in range(len(y))]
        # Flatten the per-feature category arrays into one list that lines
        # up with the encoded columns.
        all_categories = list(itertools.chain(*encoder.categories_))
        category_names = ['category_{}'.format(i + 1)
                          for i in range(len(encoder.categories_))]
        category_lengths = [len(encoder.categories_[i])
                            for i in range(len(encoder.categories_))]

        for row_index, feature_index in zip(*X.nonzero()):
            category_value = all_categories[feature_index]
            category_name = get_category_name(
                feature_index, category_names, category_lengths)
            reversed_data[row_index][category_name] = category_value
            reversed_data[row_index]['target'] = y[row_index]

        return reversed_data


    def get_category_name(index, names, lengths):
        # Walk the cumulative column counts to find which feature owns `index`.
        counter = 0
        for i in range(len(lengths)):
            counter += lengths[i]
            if index < counter:
                return names[i]
        raise ValueError('The index is higher than the number of categorical values')

To test it, I created a small data set containing the ratings that users have given to items:

    import pandas

    data = [
        {'user_id': 'John', 'item_id': 'The Matrix', 'rating': 5},
        {'user_id': 'John', 'item_id': 'Titanic', 'rating': 1},
        {'user_id': 'John', 'item_id': 'Forrest Gump', 'rating': 2},
        {'user_id': 'John', 'item_id': 'Wall-E', 'rating': 2},
        {'user_id': 'Lucy', 'item_id': 'The Matrix', 'rating': 5},
        {'user_id': 'Lucy', 'item_id': 'Titanic', 'rating': 1},
        {'user_id': 'Lucy', 'item_id': 'Die Hard', 'rating': 5},
        {'user_id': 'Lucy', 'item_id': 'Forrest Gump', 'rating': 2},
        {'user_id': 'Lucy', 'item_id': 'Wall-E', 'rating': 2},
        {'user_id': 'Eric', 'item_id': 'The Matrix', 'rating': 2},
        {'user_id': 'Eric', 'item_id': 'Die Hard', 'rating': 3},
        {'user_id': 'Eric', 'item_id': 'Forrest Gump', 'rating': 5},
        {'user_id': 'Eric', 'item_id': 'Wall-E', 'rating': 4},
        {'user_id': 'Diane', 'item_id': 'The Matrix', 'rating': 4},
        {'user_id': 'Diane', 'item_id': 'Titanic', 'rating': 3},
        {'user_id': 'Diane', 'item_id': 'Die Hard', 'rating': 5},
        {'user_id': 'Diane', 'item_id': 'Forrest Gump', 'rating': 3},
    ]

    data_frame = pandas.DataFrame(data)
    data_frame = data_frame[['user_id', 'item_id', 'rating']]

If you are building a prediction model, remember to remove the dependent variable (in this case the rating) from the DataFrame before encoding it.

    ratings = data_frame['rating']
    data_frame.drop(columns=['rating'], inplace=True)

Then we proceed to do the encoding:

    from sklearn.preprocessing import OneHotEncoder

    ohc = OneHotEncoder()
    encoded_data = ohc.fit_transform(data_frame)
    print(encoded_data)

Which results in:

    (0, 2)    1.0
    (0, 6)    1.0
    (1, 2)    1.0
    (1, 7)    1.0
    (2, 2)    1.0
    (2, 5)    1.0
    (3, 2)    1.0
    (3, 8)    1.0
    (4, 3)    1.0
    (4, 6)    1.0
    (5, 3)    1.0
    (5, 7)    1.0
    (6, 3)    1.0
    (6, 4)    1.0
    (7, 3)    1.0
    (7, 5)    1.0
    (8, 3)    1.0
    (8, 8)    1.0
    (9, 1)    1.0
    (9, 6)    1.0
    (10, 1)   1.0
    (10, 4)   1.0
    (11, 1)   1.0
    (11, 5)   1.0
    (12, 1)   1.0
    (12, 8)   1.0
    (13, 0)   1.0
    (13, 6)   1.0
    (14, 0)   1.0
    (14, 7)   1.0
    (15, 0)   1.0
    (15, 4)   1.0
    (16, 0)   1.0
    (16, 5)   1.0

After encoding, we can reverse it using the reverse_one_hot function defined above, like this:

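    # reverse_one_hot is the function defined above; call it directly
    # (or through whatever module you saved it in).
    reverse_data = reverse_one_hot(encoded_data, ratings, ohc)
    print(pandas.DataFrame(reverse_data))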

Which gives us:

       category_1    category_2  target
    0        John    The Matrix       5
    1        John       Titanic       1
    2        John  Forrest Gump       2
    3        John        Wall-E       2
    4        Lucy    The Matrix       5
    5        Lucy       Titanic       1
    6        Lucy      Die Hard       5
    7        Lucy  Forrest Gump       2
    8        Lucy        Wall-E       2
    9        Eric    The Matrix       2
    10       Eric      Die Hard       3
    11       Eric  Forrest Gump       5
    12       Eric        Wall-E       4
    13      Diane    The Matrix       4
    14      Diane       Titanic       3
    15      Diane      Die Hard       5
    16      Diane  Forrest Gump       3
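For comparison, scikit-learn releases that provide categories_ also provide an inverse_transform method on OneHotEncoder, which may be all you need if you do not require the custom dictionary layout above. The following is a minimal sketch, assuming such a release is installed; the four sample rows are borrowed from the data set above.

```python
from sklearn.preprocessing import OneHotEncoder

# Two categorical features per row: user and item.
X = [
    ['John', 'The Matrix'],
    ['Lucy', 'Titanic'],
    ['Eric', 'Die Hard'],
    ['John', 'Wall-E'],
]

encoder = OneHotEncoder()
encoded = encoder.fit_transform(X)  # sparse matrix by default

# inverse_transform maps each one-hot row back to its original categories.
recovered = encoder.inverse_transform(encoded)

print(encoded.shape)              # (4, 7)
print(recovered.tolist() == X)    # True
```

The encoded width is 7 because there are 3 distinct users and 4 distinct items in this sample.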

S_Ymln · Answer

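The pandas approach: to convert categorical variables into binary variables, pd.get_dummies does the job. To convert them back, you can use pd.Series.idxmax() to find the index of the value where there is a 1. You can then map that to a list (indexed according to the original data) or to a dictionary.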

    import pandas as pd
    import numpy as np

    col = np.random.randint(1, 5, 20)
    df = pd.DataFrame({'A': col})
    df.head()

       A
    0  2
    1  2
    2  1
    3  1
    4  3

    df_dum = pd.get_dummies(df['A'])
    df_dum.head()

       1  2  3  4
    0  0  1  0  0
    1  0  1  0  0
    2  1  0  0  0
    3  1  0  0  0
    4  0  0  1  0

    df_n = df_dum.apply(lambda x: x.idxmax(), axis=1)
    df_n.head()

    0    2
    1    2
    2    1
    3    1
    4    3
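A deterministic variant of the same round trip is sketched below, using string categories and calling DataFrame.idxmax(axis=1) directly instead of a row-wise apply. The sample values are invented for illustration, and dtype=int is passed only so the indicator columns hold 0/1 regardless of the pandas version.

```python
import pandas as pd

s = pd.Series(['b', 'a', 'c', 'a'], name='A')

# One indicator column per distinct category.
dummies = pd.get_dummies(s, dtype=int)

# For every row, idxmax(axis=1) returns the label of the column holding the 1.
decoded = dummies.idxmax(axis=1)

print(dummies.columns.tolist())  # ['a', 'b', 'c']
print(decoded.tolist())          # ['b', 'a', 'c', 'a']
```

Because the column labels are the original category values, no separate lookup table is needed in this case.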

Nico · Answer

Use numpy.argmax() with axis=1.

Example:

    ohe_encoded = np.array([[0, 0, 1], [0, 1, 0], [0, 1, 0], [1, 0, 0]])
    ohe_encoded
    > array([[0, 0, 1],
             [0, 1, 0],
             [0, 1, 0],
             [1, 0, 0]])
    np.argmax(ohe_encoded, axis=1)
    > array([2, 1, 1, 0], dtype=int64)
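argmax returns column positions rather than the original labels. If the categories were not simply 0, 1, 2, one more indexing step recovers them. In the sketch below the `categories` array is hypothetical; in practice it would be whatever ordering was used to build the one-hot columns.

```python
import numpy as np

# Hypothetical category labels, in the same order as the one-hot columns.
categories = np.array(['cat', 'dog', 'fish'])

ohe_encoded = np.array([[0, 0, 1],
                        [0, 1, 0],
                        [0, 1, 0],
                        [1, 0, 0]])

# Column position of the 1 in each row, then a lookup into the labels.
positions = np.argmax(ohe_encoded, axis=1)
labels = categories[positions]

print(positions)  # [2 1 1 0]
print(labels)     # ['fish' 'dog' 'dog' 'cat']
```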