I want to fit a multiple linear regression and check its adjusted R-squared. The score function gives me the R-squared, but it is not adjusted.
import pandas as pd  # import the pandas module
import numpy as np
df = pd.read_csv('/Users/jeangelj/Documents/training/linexdata.csv', sep=',')
df
    AverageNumberofTickets  NumberofEmployees  ValueofContract       Industry
0                        1                 51            25750         Retail
1                        9                 68            25000       Services
2                       20                 67            40000       Services
3                        1                124            35000         Retail
4                        8                124            25000  Manufacturing
5                       30                134            50000       Services
6                       20                157            48000         Retail
7                        8                190            32000         Retail
8                       20                205            70000         Retail
9                       50                230            75000  Manufacturing
10                      35                265            50000  Manufacturing
11                      65                296            75000       Services
12                      35                336            50000  Manufacturing
13                      60                359            75000  Manufacturing
14                      85                403            81000       Services
15                      40                418            60000         Retail
16                      75                437            53000       Services
17                      85                451            90000       Services
18                      65                465            70000         Retail
19                      95                491           100000       Services
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X, y = df[['NumberofEmployees','ValueofContract']], df.AverageNumberofTickets
model.fit(X, y)
model.score(X, y)
>>0.87764337132340009
I checked by hand: 0.87764 is the R-squared, whereas 0.863248 is the adjusted R-squared.
There are many ways to compute R^2 and the adjusted R^2; a few of them follow (computed with the data you provided).
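For reference, these are the textbook definitions the computations rely on, assuming a model fitted with an intercept. The symbols n (number of observations) and p (number of predictors) are my own notation; for this data n = 20 and p = 2.

```latex
R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}},
\qquad
R^2_{\mathrm{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}

% Plugging in the R^2 reported by model.score(X, y):
R^2_{\mathrm{adj}} = 1 - (1 - 0.877643)\cdot\frac{19}{17} \approx 0.863248
```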
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X, y = df[['NumberofEmployees','ValueofContract']], df.AverageNumberofTickets
model.fit(X, y)
# compute with formulas from the theory
yhat = model.predict(X)
SS_Residual = sum((y-yhat)**2)
SS_Total = sum((y-np.mean(y))**2)
r_squared = 1 - (float(SS_Residual))/SS_Total
adjusted_r_squared = 1 - (1-r_squared)*(len(y)-1)/(len(y)-X.shape[1]-1)
print(r_squared, adjusted_r_squared)
# 0.877643371323 0.863248473832
# compute with sklearn linear_model; I could not find a function in the documentation that returns adjusted R-squared directly
print(model.score(X, y), 1 - (1-model.score(X, y))*(len(y)-1)/(len(y)-X.shape[1]-1))
# 0.877643371323 0.863248473832
# compute with statsmodels, by adding intercept manually
import statsmodels.api as sm
X1 = sm.add_constant(X)
result = sm.OLS(y, X1).fit()
# print(dir(result))
print(result.rsquared, result.rsquared_adj)
# 0.877643371323 0.863248473832
# compute with statsmodels, another way, using a formula (the intercept is added automatically)
import statsmodels.formula.api as smf  # separate alias so it does not shadow statsmodels.api as sm
result = smf.ols(formula="AverageNumberofTickets ~ NumberofEmployees + ValueofContract", data=df).fit()
# print(result.summary())
print(result.rsquared, result.rsquared_adj)
# 0.877643371323 0.863248473832
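If you need this more than once, the adjustment can be wrapped in a small helper. This is only a sketch: `adjusted_r2` and its parameter names are hypothetical (not part of sklearn or statsmodels), and it assumes the model includes an intercept.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 for a model with an intercept (hypothetical helper, not a library API)."""
    dof = n_samples - n_features - 1  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# R^2 from the question, 20 rows, 2 predictors
print(round(adjusted_r2(0.87764337132340009, 20, 2), 6))  # 0.863248
```

With the fitted sklearn model above you would call it as adjusted_r2(model.score(X, y), len(y), X.shape[1]).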