Pythonでのピアソン相関と有意性の計算

Question

入力として2つのリストを受け取り、 Pearson correlation と相関の意味を返す関数を探しています。

Sacha · Answer

scipy.stats を見てください。

from pydoc import help from scipy.stats.stats import pearsonr help(pearsonr) >>> Help on function pearsonr in module scipy.stats.stats: pearsonr(x, y) Calculates a Pearson correlation coefficient and the p-value for testing non-correlation. The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson's correlation requires that each dataset be normally distributed. Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation. Correlations of -1 or +1 imply an exact linear relationship. Positive correlations imply that as x increases, so does y. Negative correlations imply that as x increases, y decreases. The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. The p-values are not entirely reliable but are probably reasonable for datasets larger than 500 or so. Parameters ---------- x : 1D array y : 1D array the same length as x Returns ------- (Pearson's correlation coefficient, 2-tailed p-value) References ---------- http://www.statsoft.com/textbook/glosp.html#Pearson%20Correlation

winerd · Answer

ピアソンの相関は、numpyの corrcoef で計算できます。

import numpy numpy.corrcoef(list1, list2)[0, 1]

Salvador Dali · Answer

別の方法としては、 linregress のネイティブscipy関数があります。

勾配：回帰直線の勾配

切片：回帰直線の切片

r値：相関係数

p値：帰無仮説が勾配がゼロであるという仮説検定の両側p値

stderr：見積もりの標準誤差

そして、これが例です。

a = [15, 12, 8, 8, 7, 7, 7, 6, 5, 3] b = [10, 25, 17, 11, 13, 17, 20, 13, 9, 15] from scipy.stats import linregress linregress(a, b)

あなたを返します：

LinregressResult(slope=0.20833333333333337, intercept=13.375, rvalue=0.14499815458068521, pvalue=0.68940144811669501, stderr=0.50261704627083648)

Jeff Hammerbacher · Answer

Scipyをインストールしたくない場合は、 Programming Collective Intelligence から若干変更したこのクイックハックを使用しました。

（正確さのために編集されています。）

from itertools import imap def pearsonr(x, y): # Assume len(x) == len(y) n = len(x) sum_x = float(sum(x)) sum_y = float(sum(y)) sum_x_sq = sum(map(lambda x: pow(x, 2), x)) sum_y_sq = sum(map(lambda x: pow(x, 2), y)) psum = sum(imap(lambda x, y: x * y, x, y)) num = psum - (sum_x * sum_y/n) den = pow((sum_x_sq - pow(sum_x, 2) / n) * (sum_y_sq - pow(sum_y, 2) / n), 0.5) if den == 0: return 0 return num / den

dfrankow · Answer

次のコードは、定義の単純な解釈です。

import math def average(x): assert len(x) > 0 return float(sum(x)) / len(x) def pearson_def(x, y): assert len(x) == len(y) n = len(x) assert n > 0 avg_x = average(x) avg_y = average(y) diffprod = 0 xdiff2 = 0 ydiff2 = 0 for idx in range(n): xdiff = x[idx] - avg_x ydiff = y[idx] - avg_y diffprod += xdiff * ydiff xdiff2 += xdiff * xdiff ydiff2 += ydiff * ydiff return diffprod / math.sqrt(xdiff2 * ydiff2)

テスト：

print pearson_def([1,2,3], [1,5,7])

戻る

0.981980506062

これはExcel、この計算機、 SciPy （また NumPy ）に一致し、それぞれ0.981980506と0.9819805060619657、0.98198050606196574を返します。

R ：

> cor( c(1,2,3), c(1,5,7)) [1] 0.9819805

EDIT：コメント投稿者が指摘したバグを修正しました。

Martin Thoma · Answer

これは pandas.DataFrame.corr でもできます。

import pandas as pd a = [[1, 2, 3], [5, 6, 9], [5, 6, 11], [5, 6, 13], [5, 3, 13]] df = pd.DataFrame(data=a) df.corr()

これは与える

 0 1 2 0 1.000000 0.745601 0.916579 1 0.745601 1.000000 0.544248 2 0.916579 0.544248 1.000000

compski · Answer

私の答えはコードを書くのが一番簡単で手順を理解する Pearson Correlation Coefficient（PCC）を計算するのが一番だと思います。

import math # calculates the mean def mean(x): sum = 0.0 for i in x: sum += i return sum / len(x) # calculates the sample standard deviation def sampleStandardDeviation(x): sumv = 0.0 for i in x: sumv += (i - mean(x))**2 return math.sqrt(sumv/(len(x)-1)) # calculates the PCC using both the 2 functions above def pearson(x,y): scorex = [] scorey = [] for i in x: scorex.append((i - mean(x))/sampleStandardDeviation(x)) for j in y: scorey.append((j - mean(y))/sampleStandardDeviation(y)) # multiplies both lists together into 1 list (hence Zip) and sums the whole list return (sum([i*j for i,j in Zip(scorex,scorey)]))/(len(x)-1)

PCCの重要性は基本的にどのように強く相関している 2つの変数/リストがあるかを示すことです。 PCC値の範囲は-1から1であることに注意することが重要です。 0から1までの値は正の相関を表します。値0 =最高の変動（相関はまったくありません）。 -1から0までの値は負の相関を示します。

Martin F Thomsen · Answer

うーん、これらの応答の多くは長くて読みにくいコードを持っています...

配列を扱うときには、numpyをその気の利いた機能と共に使用することをお勧めします。

import numpy as np def pcc(X, Y): ''' Compute Pearson Correlation Coefficient. ''' # Normalise X and Y X -= X.mean(0) Y -= Y.mean(0) # Standardise X and Y X /= X.std(0) Y /= Y.std(0) # Compute mean product return np.mean(X*Y) # Using it on a random example from random import random X = np.array([random() for x in xrange(100)]) Y = np.array([random() for x in xrange(100)]) pcc(X, Y)

Mojtaba Khodadadi · Answer

これは、numpyを使ったPearson Correlation関数の実装です。

 def corr(data1, data2): "data1 & data2 should be numpy arrays." mean1 = data1.mean() mean2 = data2.mean() std1 = data1.std() std2 = data2.std() # corr = ((data1-mean1)*(data2-mean2)).mean()/(std1*std2) corr = ((data1*data2).mean()-mean1*mean2)/(std1*std2) return corr

Web Ster · Answer

Pythonのパンダを使ったピアソン係数の計算：あなたのデータにはリストが含まれているので、私はこのアプローチを試すことをお勧めします。データ構造を視覚化して必要に応じて更新することができるため、データとやり取りしてコンソールから操作するのは簡単です。データセットをエクスポートして保存し、後で分析するためにpythonコンソールから新しいデータを追加することもできます。このコードはより単純で、コード行数が少なくなっています。さらに分析するためにデータをスクリーニングするには、数行のコードが必要です。

例：

data = {'list 1':[2,4,6,8],'list 2':[4,16,36,64]} import pandas as pd #To Convert your lists to pandas data frames convert your lists into pandas dataframes df = pd.DataFrame(data, columns = ['list 1','list 2']) from scipy import stats # For in-built method to get PCC pearson_coef, p_value = stats.pearsonr(df["list 1"], df["list 2"]) #define the columns to perform calculations on print("Pearson Correlation Coefficient: ", pearson_coef, "and a P-value of:", p_value) # Results

ただし、データセットのサイズや分析前に必要になる可能性がある変換を確認するために、あなたは私のためにデータを投稿しませんでした。

Turtles Are Cute · Answer

これは、mkhの答えを変形したもので、それよりはるかに高速に実行されます。scipy.stats.pearsonrでは、numbaを使用します。

import numba @numba.jit def corr(data1, data2): M = data1.size sum1 = 0. sum2 = 0. for i in range(M): sum1 += data1[i] sum2 += data2[i] mean1 = sum1 / M mean2 = sum2 / M var_sum1 = 0. var_sum2 = 0. cross_sum = 0. for i in range(M): var_sum1 += (data1[i] - mean1) ** 2 var_sum2 += (data2[i] - mean2) ** 2 cross_sum += (data1[i] * data2[i]) std1 = (var_sum1 / M) ** .5 std2 = (var_sum2 / M) ** .5 cross_mean = cross_sum / M return (cross_mean - mean1 * mean2) / (std1 * std2)

Vipul Sharma · Answer

これはスパースベクトルに基づくピアソン相関の実装です。ここでのベクトルは（index、value）として表現されたタプルのリストとして表現されます。 2つのスパースベクトルは長さが異なっていてもかまいませんが、すべてのベクトルサイズで同じでなければなりません。これは、ほとんどの機能が単語の集まりであるためにベクトルサイズが非常に大きいため、通常スパースベクトルを使用して計算が行われるテキストマイニングアプリケーションに役立ちます。

def get_pearson_corelation(self, first_feature_vector=[], second_feature_vector=[], length_of_featureset=0): indexed_feature_dict = {} if first_feature_vector == [] or second_feature_vector == [] or length_of_featureset == 0: raise ValueError("Empty feature vectors or zero length of featureset in get_pearson_corelation") sum_a = sum(value for index, value in first_feature_vector) sum_b = sum(value for index, value in second_feature_vector) avg_a = float(sum_a) / length_of_featureset avg_b = float(sum_b) / length_of_featureset mean_sq_error_a = sqrt((sum((value - avg_a) ** 2 for index, value in first_feature_vector)) + (( length_of_featureset - len(first_feature_vector)) * ((0 - avg_a) ** 2))) mean_sq_error_b = sqrt((sum((value - avg_b) ** 2 for index, value in second_feature_vector)) + (( length_of_featureset - len(second_feature_vector)) * ((0 - avg_b) ** 2))) covariance_a_b = 0 #calculate covariance for the sparse vectors for Tuple in first_feature_vector: if len(Tuple) != 2: raise ValueError("Invalid feature frequency Tuple in featureVector: %s") % (Tuple,) indexed_feature_dict[Tuple[0]] = Tuple[1] count_of_features = 0 for Tuple in second_feature_vector: count_of_features += 1 if len(Tuple) != 2: raise ValueError("Invalid feature frequency Tuple in featureVector: %s") % (Tuple,) if Tuple[0] in indexed_feature_dict: covariance_a_b += ((indexed_feature_dict[Tuple[0]] - avg_a) * (Tuple[1] - avg_b)) del (indexed_feature_dict[Tuple[0]]) else: covariance_a_b += (0 - avg_a) * (Tuple[1] - avg_b) for index in indexed_feature_dict: count_of_features += 1 covariance_a_b += (indexed_feature_dict[index] - avg_a) * (0 - avg_b) #adjust covariance with rest of vector with 0 value covariance_a_b += (length_of_featureset - count_of_features) * -avg_a * -avg_b if mean_sq_error_a == 0 or mean_sq_error_b == 0: return -1 else: return float(covariance_a_b) / (mean_sq_error_a * mean_sq_error_b)

ユニットテスト

def test_get_get_pearson_corelation(self): vector_a = [(1, 1), (2, 2), (3, 3)] vector_b = [(1, 1), (2, 5), (3, 7)] self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 3), 0.981980506062, 3, None, None) vector_a = [(1, 1), (2, 2), (3, 3)] vector_b = [(1, 1), (2, 5), (3, 7), (4, 14)] self.assertAlmostEquals(self.sim_calculator.get_pearson_corelation(vector_a, vector_b, 5), -0.0137089240555, 3, None, None)

Oleg K · Answer

この記事を見てください。これは、（Python用の）パンダライブラリを使用して複数のファイルからの過去の外国為替通貨ペアデータに基づいて相関を計算し、次にseabornライブラリを使用してヒートマッププロットを生成するためのよく文書化された例です。

http://www.tradinggeeks.net/2015/08/calculating-correlation-in-python/

Joshua Richardson · Answer

特定の方向の相関（負または正の相関）を探すという文脈であなたの確率をどのように解釈するか疑問に思うかもしれません。これを手助けするために私が書いた関数です。それも正しいかもしれません！

http://www.vassarstats.net/rsig.html と http://en.wikipedia.org/wiki/Student%27s_t_distribution から集めた情報に基づいています。、ここに投稿された他の答えのおかげで。

# Given (possibly random) variables, X and Y, and a correlation direction, # returns: # (r, p), # where r is the Pearson correlation coefficient, and p is the probability # that there is no correlation in the given direction. # # direction: # if positive, p is the probability that there is no positive correlation in # the population sampled by X and Y # if negative, p is the probability that there is no negative correlation # if 0, p is the probability that there is no correlation in either direction def probabilityNotCorrelated(X, Y, direction=0): x = len(X) if x != len(Y): raise ValueError("variables not same len: " + str(x) + ", and " + \ str(len(Y))) if x < 6: raise ValueError("must have at least 6 samples, but have " + str(x)) (corr, prb_2_tail) = stats.pearsonr(X, Y) if not direction: return (corr, prb_2_tail) prb_1_tail = prb_2_tail / 2 if corr * direction > 0: return (corr, prb_1_tail) return (corr, 1 - prb_1_tail)

aumpen · Answer

私はこれを解決するための非常に単純で理解しやすい解決策を持っています。等しい長さの2つの配列の場合、ピアソン係数は次のように簡単に計算できます。

def manual_pearson(a,b): """ Accepts two arrays of equal length, and computes correlation coefficient. Numerator is the sum of product of (a - a_avg) and (b - b_avg), while denominator is the product of a_std and b_std multiplied by length of array. """ a_avg, b_avg = np.average(a), np.average(b) a_stdev, b_stdev = np.std(a), np.std(b) n = len(a) denominator = a_stdev * b_stdev * n numerator = np.sum(np.multiply(a-a_avg, b-b_avg)) p_coef = numerator/denominator return p_coef

Source · Answer

def pearson(x,y): n=len(x) vals=range(n) sumx=sum([float(x[i]) for i in vals]) sumy=sum([float(y[i]) for i in vals]) sumxSq=sum([x[i]**2.0 for i in vals]) sumySq=sum([y[i]**2.0 for i in vals]) pSum=sum([x[i]*y[i] for i in vals]) # Calculating Pearson correlation num=pSum-(sumx*sumy/n) den=((sumxSq-pow(sumx,2)/n)*(sumySq-pow(sumy,2)/n))**.5 if den==0: return 0 r=num/den return r