Pythonの分散インフレ率

Question

私はPythonの簡単なデータセットの各列の分散インフレ率（VIF）を計算しようとしています：

a b c d 1 2 4 4 1 2 6 3 2 3 7 4 3 2 8 5 4 1 9 4

sdm library のvif関数を使用してRでこれを既に実行しました。これにより、次の結果が得られます。

a <- c(1, 1, 2, 3, 4) b <- c(2, 2, 3, 2, 1) c <- c(4, 6, 7, 8, 9) d <- c(4, 3, 4, 5, 4) df <- data.frame(a, b, c, d) vif_df <- vif(df) print(vif_df) Variables VIF a 22.95 b 3.00 c 12.95 d 3.00

ただし、python statsmodel vif function を使用して同じことを行うと、結果は次のようになります。

a = [1, 1, 2, 3, 4] b = [2, 2, 3, 2, 1] c = [4, 6, 7, 8, 9] d = [4, 3, 4, 5, 4] ck = np.column_stack([a, b, c, d]) vif = [variance_inflation_factor(ck, i) for i in range(ck.shape[1])] print(vif) Variables VIF a 47.136986301369774 b 28.931506849315081 c 80.31506849315096 d 40.438356164383549

入力が同じであっても、結果は大きく異なります。一般的に、statsmodel VIF関数の結果は間違っているように見えますが、これが私がそれを呼び出している方法によるものなのか、それとも関数自体の問題なのかはわかりません。

Statsmodel関数を誤って呼び出しているか、結果の矛盾を説明しているかどうかを誰かが理解できるように願っていました。関数の問題である場合、PythonにVIFの代替手段がありますか？

Drverzal · Accepted Answer

この理由は、PythonのOLSの違いによるものだと思います。 python分散インフレ率の計算で使用されるOLSは、デフォルトではインターセプトを追加しません。ただし、インターセプトが必要です。

あなたがしたいことは、定数を表すために1で満たされたマトリックスにもう1列、ckを追加することです。これが方程式の切片項になります。これが完了すると、値が適切に一致するはずです。

編集：ゼロを1に置き換えました

Alexander · Answer

他の人や、関数の作成者であるJosef Perktoldによる this post で述べられているように、variance_inflation_factorは、説明変数の行列に定数が存在することを期待しています。 statsmodelsのadd_constantを使用して、値を関数に渡す前に必要な定数をデータフレームに追加できます。

from statsmodels.stats.outliers_influence import variance_inflation_factor from statsmodels.tools.tools import add_constant df = pd.DataFrame( {'a': [1, 1, 2, 3, 4], 'b': [2, 2, 3, 2, 1], 'c': [4, 6, 7, 8, 9], 'd': [4, 3, 4, 5, 4]} ) X = add_constant(df) >>> pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns) const 136.875 a 22.950 b 3.000 c 12.950 d 3.000 dtype: float64

assignを使用して、データフレームの右端の列に定数を追加することもできます。

X = df.assign(const=1) >>> pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns) a 22.950 b 3.000 c 12.950 d 3.000 const 136.875 dtype: float64

ソースコード自体はかなり簡潔です。

def variance_inflation_factor(exog, exog_idx): """ exog : ndarray, (nobs, k_vars) design matrix with all explanatory variables, as for example used in regression exog_idx : int index of the exogenous variable in the columns of exog """ k_vars = exog.shape[1] x_i = exog[:, exog_idx] mask = np.arange(k_vars) != exog_idx x_noti = exog[:, mask] r_squared_i = OLS(x_i, x_noti).fit().rsquared vif = 1. / (1. - r_squared_i) return vif

また、すべてのVIFをシリーズとして返すようにコードを変更するのもかなり簡単です。

from statsmodels.regression.linear_model import OLS from statsmodels.tools.tools import add_constant def variance_inflation_factors(exog_df): ''' Parameters ---------- exog_df : dataframe, (nobs, k_vars) design matrix with all explanatory variables, as for example used in regression. Returns ------- vif : Series variance inflation factors ''' exog_df = add_constant(exog_df) vifs = pd.Series( [1 / (1. - OLS(exog_df[col].values, exog_df.loc[:, exog_df.columns != col].values).fit().rsquared) for col in exog_df], index=exog_df.columns, name='VIF' ) return vifs >>> variance_inflation_factors(df) const 136.875 a 22.950 b 3.000 c 12.950 Name: VIF, dtype: float64

T_T · Answer

このスレッドへの将来の来訪者（私のような）：

import numpy as np import scipy as sp a = [1, 1, 2, 3, 4] b = [2, 2, 3, 2, 1] c = [4, 6, 7, 8, 9] d = [4, 3, 4, 5, 4] ck = np.column_stack([a, b, c, d]) cc = sp.corrcoef(ck, rowvar=False) VIF = np.linalg.inv(cc) VIF.diagonal()

このコードは

array([22.95, 3. , 12.95, 3. ])

[編集]

コメントに応えて、DataFrameを可能な限り使用しようとしました（マトリックスを反転するにはnumpyが必要です）。

import pandas as pd import numpy as np a = [1, 1, 2, 3, 4] b = [2, 2, 3, 2, 1] c = [4, 6, 7, 8, 9] d = [4, 3, 4, 5, 4] df = pd.DataFrame({'a':a,'b':b,'c':c,'d':d}) df_cor = df.corr() pd.DataFrame(np.linalg.inv(df.corr().values), index = df_cor.index, columns=df_cor.columns)

コードは与える

 a b c d a 22.950000 6.453681 -16.301917 -6.453681 b 6.453681 3.000000 -4.080441 -2.000000 c -16.301917 -4.080441 12.950000 4.080441 d -6.453681 -2.000000 4.080441 3.000000

対角要素はVIFを提供します。

steven · Answer

variance_inflation_factorとadd_constantを扱いたくない場合。次の2つの機能を考慮してください。

1。statasmodelsで式を使用：

import pandas as pd import statsmodels.formula.api as smf def get_vif(exogs, data): '''Return VIF (variance inflation factor) DataFrame Args: exogs (list): list of exogenous/independent variables data (DataFrame): the df storing all variables Returns: VIF and Tolerance DataFrame for each exogenous variable Notes: Assume we have a list of exogenous variable [X1, X2, X3, X4]. To calculate the VIF and Tolerance for each variable, we regress each of them against other exogenous variables. For instance, the regression model for X3 is defined as: X3 ~ X1 + X2 + X4 And then we extract the R-squared from the model to calculate: VIF = 1 / (1 - R-squared) Tolerance = 1 - R-squared The cutoff to detect multicollinearity: VIF > 10 or Tolerance < 0.1 ''' # initialize dictionaries vif_dict, tolerance_dict = {}, {} # create formula for each exogenous variable for exog in exogs: not_exog = [i for i in exogs if i != exog] formula = f"{exog} ~ {' + '.join(not_exog)}" # extract r-squared from the fit r_squared = smf.ols(formula, data=data).fit().rsquared # calculate VIF vif = 1/(1 - r_squared) vif_dict[exog] = vif # calculate tolerance tolerance = 1 - r_squared tolerance_dict[exog] = tolerance # return VIF DataFrame df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict}) return df_vif

2。sklearnでLinearRegressionを使用：

# import warnings # warnings.simplefilter(action='ignore', category=FutureWarning) import pandas as pd from sklearn.linear_model import LinearRegression def sklearn_vif(exogs, data): # initialize dictionaries vif_dict, tolerance_dict = {}, {} # form input data for each exogenous variable for exog in exogs: not_exog = [i for i in exogs if i != exog] X, y = data[not_exog], data[exog] # extract r-squared from the fit r_squared = LinearRegression().fit(X, y).score(X, y) # calculate VIF vif = 1/(1 - r_squared) vif_dict[exog] = vif # calculate tolerance tolerance = 1 - r_squared tolerance_dict[exog] = tolerance # return VIF DataFrame df_vif = pd.DataFrame({'VIF': vif_dict, 'Tolerance': tolerance_dict}) return df_vif

例：

import seaborn as sns df = sns.load_dataset('car_crashes') exogs = ['alcohol', 'speeding', 'no_previous', 'not_distracted'] [In] %%timeit -n 100 get_vif(exogs=exogs, data=df) [Out] VIF Tolerance alcohol 3.436072 0.291030 no_previous 3.113984 0.321132 not_distracted 2.668456 0.374749 speeding 1.884340 0.530690 69.6 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 100 loops each) [In] %%timeit -n 100 sklearn_vif(exogs=exogs, data=df) [Out] VIF Tolerance alcohol 3.436072 0.291030 no_previous 3.113984 0.321132 not_distracted 2.668456 0.374749 speeding 1.884340 0.530690 15.7 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Saqib Mujtaba · Answer

ボストンデータの例：

[〜＃〜] vif [〜＃〜]は補助回帰によって計算されるため、実際の適合度に依存しません。

下記参照：

from patsy import dmatrices from statsmodels.stats.outliers_influence import variance_inflation_factor import statsmodels.api as sm # Break into left and right hand side; y and X y, X = dmatrices(formula="medv ~ crim + zn + nox + ptratio + black + rm ", data=boston, return_type="dataframe") # For each Xi, calculate VIF vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])] # Fit X to y result = sm.OLS(y, X).fit()

Chef1075 · Answer

この関数は、StackとCrossValidatedで見た他の投稿に基づいて作成しました。しきい値を超えている機能が表示され、機能が削除された新しいデータフレームが返されます。

from statsmodels.stats.outliers_influence import variance_inflation_factor from statsmodels.tools.tools import add_constant def calculate_vif_(df, thresh=5): ''' Calculates VIF each feature in a pandas dataframe A constant must be added to variance_inflation_factor or the results will be incorrect :param X: the pandas dataframe :param thresh: the max VIF value before the feature is removed from the dataframe :return: dataframe with features removed ''' const = add_constant(df) cols = const.columns variables = np.arange(const.shape[1]) vif_df = pd.Series([variance_inflation_factor(const.values, i) for i in range(const.shape[1])], index=const.columns).to_frame() vif_df = vif_df.sort_values(by=0, ascending=False).rename(columns={0: 'VIF'}) vif_df = vif_df.drop('const') vif_df = vif_df[vif_df['VIF'] > thresh] print 'Features above VIF threshold:
' print vif_df[vif_df['VIF'] > thresh] col_to_drop = list(vif_df.index) for i in col_to_drop: print 'Dropping: {}'.format(i) df = df.drop(columns=i) return df