I have a huge dataset, and before machine learning modeling I need to remove the highly correlated descriptors (columns). How can I compute the column-wise correlation and drop columns against a threshold, i.e. remove every column/descriptor whose correlation with another exceeds 0.8? The reduced data should also retain its headers.
Example dataset:
GA PN PC MBP GR AP
0.033 6.652 6.681 0.194 0.874 3.177
0.034 9.039 6.224 0.194 1.137 3.4
0.035 10.936 10.304 1.015 0.911 4.9
0.022 10.11 9.603 1.374 0.848 4.566
0.035 2.963 17.156 0.599 0.823 9.406
0.033 10.872 10.244 1.015 0.574 4.871
0.035 21.694 22.389 1.015 0.859 9.259
0.035 10.936 10.304 1.015 0.911 4.5
Any help would be appreciated.
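For context, here is a minimal sketch of how the sample can be loaded for the answers below (data.txt is just a placeholder filename for the whitespace-separated table above):

import pandas as pd

# Load the whitespace-separated sample; the first row supplies the headers
df = pd.read_csv("data.txt", sep=r"\s+")

# Absolute pairwise Pearson correlation between all columns
print(df.corr().abs())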
Here is the approach I have used -
def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of deleted columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname]  # deleting the column from the dataset
    print(dataset)
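A brief usage sketch, assuming df is the descriptor DataFrame from the question (note that the function modifies the frame in place):

correlation(df, 0.8)   # drops one column of every pair whose (positive) correlation is >= 0.8 and prints the result
print(df.columns)      # headers of the remaining descriptors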
Hope this helps!
The method here worked well for me, and it is only a few lines of code: https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/
import numpy as np

# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop features
df = df.drop(to_drop, axis=1)
For a given dataframe df you can use the following:
import numpy as np

corr_matrix = df.corr().abs()
high_corr_var = np.where(corr_matrix > 0.8)
high_corr_var = [(corr_matrix.columns[x], corr_matrix.columns[y]) for x, y in zip(*high_corr_var) if x != y and x < y]
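This only lists the highly correlated column pairs; a minimal follow-up sketch that drops the second column of each pair while keeping the remaining headers (df_reduced is a name introduced here for illustration):

cols_to_drop = list({pair[1] for pair in high_corr_var})   # second member of each correlated pair, de-duplicated
df_reduced = df.drop(columns=cols_to_drop)
print(df_reduced.columns)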
I took the liberty of modifying TomDobbs' answer. The bug reported in the comments is now removed. Also, the new function filters out negative correlations as well.
def corr_df(x, corr_val):
    '''
    Obj: Drops features that are strongly correlated to other features.
         This lowers model complexity, and aids in generalizing the model.
    Inputs:
        df: features df (x)
        corr_val: Columns are dropped relative to the corr_val input (e.g. 0.8)
    Output: df that only includes uncorrelated features
    '''

    # Creates Correlation Matrix and Instantiates
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterates through Correlation Matrix Table to find correlated columns
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = item.values
            if abs(val) >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(i)

    drops = sorted(set(drop_cols))[::-1]

    # Drops the correlated columns
    for i in drops:
        col = x.iloc[:, (i+1):(i+2)].columns.values
        x = x.drop(col, axis=1)

    return x
Just plug your features dataframe into this function and set your correlation threshold. It auto-drops columns, but it also gives you a diagnostic of the columns it drops if you prefer to do that manually (see the short usage sketch after the code).
def corr_df(x, corr_val):
    '''
    Obj: Drops features that are strongly correlated to other features.
         This lowers model complexity, and aids in generalizing the model.
    Inputs:
        df: features df (x)
        corr_val: Columns are dropped relative to the corr_val input (e.g. 0.8)
    Output: df that only includes uncorrelated features
    '''

    # Creates Correlation Matrix and Instantiates
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterates through Correlation Matrix Table to find correlated columns
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = item.values
            if val >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(i)

    drops = sorted(set(drop_cols))[::-1]

    # Drops the correlated columns
    for i in drops:
        col = x.iloc[:, (i+1):(i+2)].columns.values
        df = x.drop(col, axis=1)

    return df
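A brief usage sketch, assuming df is the descriptor DataFrame from the question:

reduced = corr_df(df, 0.8)   # prints each correlated column pair with its correlation, then returns the reduced frame
print(reduced.columns)       # headers that survived the 0.8 cutoff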
A small revision of the solution posted by user3025698 that resolves an issue where the correlation between the first two columns was not captured, and fixes some data-type checking.
import numpy as np
import pandas as pd

def filter_df_corr(inp_data, corr_val):
    '''
    Returns an array or dataframe (based on type(inp_data)) adjusted to drop
    columns with high correlation to one another. Takes a second arg corr_val
    that defines the cutoff.

    ----------
    inp_data : np.array, pd.DataFrame
        Values to consider
    corr_val : float
        Value [0, 1] on which to base the correlation cutoff
    '''
    # Creates Correlation Matrix
    if isinstance(inp_data, np.ndarray):
        inp_data = pd.DataFrame(data=inp_data)
        array_flag = True
    else:
        array_flag = False
    corr_matrix = inp_data.corr()

    # Iterates through Correlation Matrix Table to find correlated columns
    drop_cols = []
    n_cols = len(corr_matrix.columns)

    for i in range(n_cols):
        for k in range(i+1, n_cols):
            val = corr_matrix.iloc[k, i]
            col = corr_matrix.columns[i]
            row = corr_matrix.index[k]
            if abs(val) >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col, "|", row, "|", round(val, 2))
                drop_cols.append(col)

    # Drops the correlated columns
    drop_cols = set(drop_cols)
    inp_data = inp_data.drop(columns=drop_cols)

    # Return same type as inp
    if array_flag:
        return inp_data.values
    else:
        return inp_data
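A short usage sketch; filter_df_corr accepts either a DataFrame or a NumPy array and returns the same type it was given (df is assumed to be the descriptor DataFrame from the question):

reduced_df = filter_df_corr(df, 0.8)           # DataFrame in, DataFrame out (headers preserved)
reduced_arr = filter_df_corr(df.values, 0.8)   # array in, array out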