scikit-learnのLinearRegressionでp値（有意性）を求める

Question

各係数のp値（有意性）を見つける方法

lm = sklearn.linear_model.LinearRegression() lm.fit(x,y)

JARH · Answer

これはやり過ぎですが、試してみましょう。まず、statsmodelを使ってp値がどうなるべきかを調べましょう。

import pandas as pd import numpy as np from sklearn import datasets, linear_model from sklearn.linear_model import LinearRegression import statsmodels.api as sm from scipy import stats diabetes = datasets.load_diabetes() X = diabetes.data y = diabetes.target X2 = sm.add_constant(X) est = sm.OLS(y, X2) est2 = est.fit() print(est2.summary())

そして私達は得る

 OLS Regression Results ============================================================================== Dep. Variable: y R-squared: 0.518 Model: OLS Adj. R-squared: 0.507 Method: Least Squares F-statistic: 46.27 Date: Wed, 08 Mar 2017 Prob (F-statistic): 3.83e-62 Time: 10:08:24 Log-Likelihood: -2386.0 No. Observations: 442 AIC: 4794. Df Residuals: 431 BIC: 4839. Df Model: 10 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ const 152.1335 2.576 59.061 0.000 147.071 157.196 x1 -10.0122 59.749 -0.168 0.867 -127.448 107.424 x2 -239.8191 61.222 -3.917 0.000 -360.151 -119.488 x3 519.8398 66.534 7.813 0.000 389.069 650.610 x4 324.3904 65.422 4.958 0.000 195.805 452.976 x5 -792.1842 416.684 -1.901 0.058 -1611.169 26.801 x6 476.7458 339.035 1.406 0.160 -189.621 1143.113 x7 101.0446 212.533 0.475 0.635 -316.685 518.774 x8 177.0642 161.476 1.097 0.273 -140.313 494.442 x9 751.2793 171.902 4.370 0.000 413.409 1089.150 x10 67.6254 65.984 1.025 0.306 -62.065 197.316 ============================================================================== Omnibus: 1.506 Durbin-Watson: 2.029 Prob(Omnibus): 0.471 Jarque-Bera (JB): 1.404 Skew: 0.017 Prob(JB): 0.496 Kurtosis: 2.726 Cond. No. 227. ==============================================================================

さて、これを再現しましょう。 Matrix Algebraを使った線形回帰分析をほとんど再現しているので、それはやり過ぎです。しかし、一体何でしょう。

lm = LinearRegression() lm.fit(X,y) params = np.append(lm.intercept_,lm.coef_) predictions = lm.predict(X) newX = pd.DataFrame({"Constant":np.ones(len(X))}).join(pd.DataFrame(X)) MSE = (sum((y-predictions)**2))/(len(newX)-len(newX.columns)) # Note if you don't want to use a DataFrame replace the two lines above with # newX = np.append(np.ones((len(X),1)), X, axis=1) # MSE = (sum((y-predictions)**2))/(len(newX)-len(newX[0])) var_b = MSE*(np.linalg.inv(np.dot(newX.T,newX)).diagonal()) sd_b = np.sqrt(var_b) ts_b = params/ sd_b p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-1))) for i in ts_b] sd_b = np.round(sd_b,3) ts_b = np.round(ts_b,3) p_values = np.round(p_values,3) params = np.round(params,4) myDF3 = pd.DataFrame() myDF3["Coefficients"],myDF3["Standard Errors"],myDF3["t values"],myDF3["Probabilites"] = [params,sd_b,ts_b,p_values] print(myDF3)

そしてこれは私たちに与えます。

 Coefficients Standard Errors t values Probabilites 0 152.1335 2.576 59.061 0.000 1 -10.0122 59.749 -0.168 0.867 2 -239.8191 61.222 -3.917 0.000 3 519.8398 66.534 7.813 0.000 4 324.3904 65.422 4.958 0.000 5 -792.1842 416.684 -1.901 0.058 6 476.7458 339.035 1.406 0.160 7 101.0446 212.533 0.475 0.635 8 177.0642 161.476 1.097 0.273 9 751.2793 171.902 4.370 0.000 10 67.6254 65.984 1.025 0.306

だから私たちはstatsmodelから値を再現することができます。

elyase · Answer

scikit-learnのLinearRegressionはこの情報を計算しませんが、クラスを簡単に拡張してそれを行うことができます。

from sklearn import linear_model from scipy import stats import numpy as np class LinearRegression(linear_model.LinearRegression): """ LinearRegression class after sklearn's, but calculate t-statistics and p-values for model coefficients (betas). Additional attributes available after .fit() are `t` and `p` which are of the shape (y.shape[1], X.shape[1]) which is (n_features, n_coefs) This class sets the intercept to 0 by default, since usually we include it in X. """ def __init__(self, *args, **kwargs): if not "fit_intercept" in kwargs: kwargs['fit_intercept'] = False super(LinearRegression, self)\ .__init__(*args, **kwargs) def fit(self, X, y, n_jobs=1): self = super(LinearRegression, self).fit(X, y, n_jobs) sse = np.sum((self.predict(X) - y) ** 2, axis=0) / float(X.shape[0] - X.shape[1]) se = np.array([ np.sqrt(np.diagonal(sse[i] * np.linalg.inv(np.dot(X.T, X)))) for i in range(sse.shape[0]) ]) self.t = self.coef_ / se self.p = 2 * (1 - stats.t.cdf(np.abs(self.t), y.shape[0] - X.shape[1])) return self

ここから盗まれた。

Pythonでのこの種の統計分析については、 statsmodels をご覧ください。

Pinna_be · Answer

あなたはsklearn.feature_selection.f_regressionを使うことができます。

scikit-learnページはここをクリック

Alex · Answer

Elyaseの答えのコード https://stackoverflow.com/a/27928411/424041 実際には動作しません。 sseがスカラーであることに注意してください。それからそれを反復しようとします。次のコードは修正版です。驚くほどきれいではありませんが、多かれ少なかれうまくいくと思います。

class LinearRegression(linear_model.LinearRegression): def __init__(self,*args,**kwargs): # *args is the list of arguments that might go into the LinearRegression object # that we don't know about and don't want to have to deal with. Similarly, **kwargs # is a dictionary of key words and values that might also need to go into the orginal # LinearRegression object. We put *args and **kwargs so that we don't have to look # these up and write them down explicitly here. Nice and easy. if not "fit_intercept" in kwargs: kwargs['fit_intercept'] = False super(LinearRegression,self).__init__(*args,**kwargs) # Adding in t-statistics for the coefficients. def fit(self,x,y): # This takes in numpy arrays (not matrices). Also assumes you are leaving out the column # of constants. # Not totally sure what 'super' does here and why you redefine self... self = super(LinearRegression, self).fit(x,y) n, k = x.shape yHat = np.matrix(self.predict(x)).T # Change X and Y into numpy matricies. x also has a column of ones added to it. x = np.hstack((np.ones((n,1)),np.matrix(x))) y = np.matrix(y).T # Degrees of freedom. df = float(n-k-1) # Sample variance. sse = np.sum(np.square(yHat - y),axis=0) self.sampleVariance = sse/df # Sample variance for x. self.sampleVarianceX = x.T*x # Covariance Matrix = [(s^2)(X'X)^-1]^0.5. (sqrtm = matrix square root. ugly) self.covarianceMatrix = sc.linalg.sqrtm(self.sampleVariance[0,0]*self.sampleVarianceX.I) # Standard erros for the difference coefficients: the diagonal elements of the covariance matrix. self.se = self.covarianceMatrix.diagonal()[1:] # T statistic for each beta. self.betasTStat = np.zeros(len(self.se)) for i in xrange(len(self.se)): self.betasTStat[i] = self.coef_[0,i]/self.se[i] # P-value for each beta. This is a two sided t-test, since the betas can be # positive or negative. self.betasPValue = 1 - t.cdf(abs(self.betasTStat),df)

Jules K · Answer

多変数回帰の場合、 @ JARH の答えに誤りがある可能性があります。（私はコメントするのに十分な評判がありません。）

次の行で

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-1))) for i in ts_b]、

t値は、len(newX)-1のカイ2乗分布ではなく、len(newX)-len(newX.columns)-1のカイ2乗分布に従います。

だからこれはする必要があります：

p_values =[2*(1-stats.t.cdf(np.abs(i),(len(newX)-len(newX.columns)-1))) for i in ts_b]

（詳しくは OLS回帰のt値を参照）

Afshin Amiri · Answer

p_valueはf統計情報の1つです。値を取得したい場合は、この数行のコードを使用してください。

import statsmodels.api as sm from scipy import stats diabetes = datasets.load_diabetes() X = diabetes.data y = diabetes.target X2 = sm.add_constant(X) est = sm.OLS(y, X2) print(est.fit().f_pvalue)

benaou mouad · Answer

p値を引き出す簡単な方法は、統計モデル回帰のコマンドを使用することです。

import statsmodels.api as sm mod = sm.OLS(Y,X) fii = mod.fit() p_values = fii.summary2().tables[1]['P>|t|']

操作可能な一連のp値が得られます（たとえば、各p値を評価することによって保持する順序をxhooseにします）。

Ali Mirzaei · Answer

P値にはscipyを使用できます。このコードはscipyドキュメントからのものです。

>>> from scipy import stats >>> import numpy as np >>> x = np.random.random(10) >>> y = np.random.random(10) >>> slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)