PySpark and PCA: How can I extract the eigenvectors of this PCA? How can I calculate how much variance they explain?

Question

I am reducing the dimensionality of a Spark DataFrame with a PCA model in pyspark (using the spark ml library), as follows:

    pca = PCA(k=3, inputCol="features", outputCol="pca_features")
    model = pca.fit(data)

where data is a Spark DataFrame with one column, labeled features, which is a DenseVector of 3 dimensions:

    data.take(1)
    Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1')

After fitting, I transform the data:

    transformed = model.transform(data)
    transformed.first()
    Row(features=DenseVector([0.4536,-0.43218, 0.9876]), label=u'class1', pca_features=DenseVector([-0.33256, 0.8668, 0.625]))

My question is: how can I extract the eigenvectors of this PCA? How can I calculate how much variance they explain?

desertnaut · Accepted Answer

[UPDATE: From Spark 2.2 onwards, PCA and SVD are both available in PySpark; see JIRA ticket SPARK-6227 and PCA & PCAModel for Spark ML 2.2. The original answer below still applies to older Spark versions.]

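Well, it seems incredible, but there is indeed no way to extract such information from a PCA decomposition (at least as of Spark 1.5). Then again, there have been many similar "complaints"; see here, for example, about not being able to extract the best parameters from a CrossValidatorModel.

Fortunately, some months ago I attended the 'Scalable Machine Learning' MOOC by AMPLab (Berkeley) and Databricks, i.e. the creators of Spark, where we implemented a full PCA pipeline "by hand" as part of the homework assignments. I have modified my functions from back then (rest assured, I got full credit :-) so that they take dataframes as input (instead of RDDs), in the same format as yours (i.e. Rows of DenseVectors containing the numerical features).

We first need to define an intermediate function, estimateCovariance, as follows: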

    import numpy as np

    def estimateCovariance(df):
        """Compute the covariance matrix for a given dataframe.

        Note:
            The multi-dimensional covariance array should be calculated using outer products.  Don't
            forget to normalize the data by first subtracting the mean.

        Args:
            df:  A Spark dataframe with a column named 'features', which (column) consists of DenseVectors.

        Returns:
            np.ndarray: A multi-dimensional array where the number of rows and columns both equal the
                length of the arrays in the input dataframe.
        """
        # Going through .rdd works on Spark 1.x and is required from Spark 2.0 on,
        # where DataFrame.map was removed.
        m = df.select(df['features']).rdd.map(lambda x: x[0]).mean()
        dfZeroMean = df.select(df['features']).rdd.map(lambda x: x[0]).map(lambda x: x-m)  # subtract the mean

        # Sum of outer products of the centered rows, divided by the number of rows
        return dfZeroMean.map(lambda x: np.outer(x,x)).sum()/df.count()
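The outer-product formulation can be sanity-checked locally before running it on a cluster. The following is a minimal NumPy-only sketch, not part of the original answer: `estimate_covariance_local` is a hypothetical stand-in for `estimateCovariance`, and the three rows are the Spark ML PCA documentation sample used in the test further down.

```python
import numpy as np

# Sample rows from the Spark ML PCA documentation example.
X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])

def estimate_covariance_local(X):
    """Local NumPy analogue of estimateCovariance (hypothetical helper name)."""
    centered = X - X.mean(axis=0)                             # subtract the per-feature mean
    outer_sum = sum(np.outer(row, row) for row in centered)   # sum of outer products
    return outer_sum / X.shape[0]                             # divide by n, not n - 1

cov_local = estimate_covariance_local(X)

# np.cov divides by n - 1 by default; bias=True switches to dividing by n.
cov_numpy = np.cov(X, rowvar=False, bias=True)
```

Both matrices should agree. Note the normalisation by n rather than n - 1: it rescales all eigenvalues by the same factor, so it does not change the eigenvectors or the fraction of variance explained.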

Then, we can write the main pca function as follows:

    from numpy.linalg import eigh

    def pca(df, k=2):
        """Computes the top `k` principal components, corresponding scores, and all eigenvalues.

        Note:
            All eigenvalues should be returned in sorted order (largest to smallest). `eigh` returns
            each eigenvector as a column.  This function should also return eigenvectors as columns.

        Args:
            df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
            k (int): The number of principal components to return.

        Returns:
            Tuple of (np.ndarray, RDD of np.ndarray, np.ndarray): A Tuple of (eigenvectors, `RDD` of
            scores, eigenvalues).  Eigenvectors is a multi-dimensional array where the number of
            rows equals the length of the arrays in the input `RDD` and the number of columns equals
            `k`.  The `RDD` of scores has the same number of rows as `data` and consists of arrays
            of length `k`.  Eigenvalues is an array of length d (the number of features).
        """
        cov = estimateCovariance(df)
        col = cov.shape[1]
        eigVals, eigVecs = eigh(cov)
        inds = np.argsort(eigVals)
        eigVecs = eigVecs.T[inds[-1:-(col+1):-1]]  # eigenvectors as rows, largest eigenvalue first
        components = eigVecs[0:k]
        eigVals = eigVals[inds[-1:-(col+1):-1]]  # sort eigenvals
        # .rdd is needed from Spark 2.0 on and is harmless on Spark 1.x
        score = df.select(df['features']).rdd.map(lambda x: x[0]).map(lambda x: np.dot(x, components.T))
        # Return the `k` principal components, `k` scores, and all eigenvalues
        return components.T, score, eigVals
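For readers without a cluster at hand, the same steps can be traced locally. This is a minimal NumPy-only sketch under stated assumptions: `pca_local` is a hypothetical name, the data are the Spark ML documentation sample used in the Test section, and the raw (uncentered) rows are projected, mirroring what `pca` does above.

```python
import numpy as np
from numpy.linalg import eigh

# Sample rows from the Spark ML PCA documentation example.
X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])

def pca_local(X, k=2):
    """Local analogue of pca(): top-k eigenvectors (as columns), scores, all eigenvalues."""
    cov = np.cov(X, rowvar=False, bias=True)   # 1/n normalisation, as in estimateCovariance
    eig_vals, eig_vecs = eigh(cov)             # ascending eigenvalues, eigenvectors as columns
    order = np.argsort(eig_vals)[::-1]         # indices that sort eigenvalues largest first
    components = eig_vecs[:, order[:k]]        # shape (n_features, k)
    scores = X @ components                    # project the raw rows, as pca() does
    return components, scores, eig_vals[order]

components, scores, eig_vals = pca_local(X, k=2)
explained_by_first = eig_vals[0] / eig_vals.sum()
```

Up to the sign of each column, `scores` should agree with the values reported in the Test section, and `explained_by_first` with the result of `varianceExplained(df, 1)`.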

Test

Let's first look at the results from the existing method, using the example data from the Spark ML PCA documentation (modified so that they are all DenseVectors):

    from pyspark.ml.feature import *
    from pyspark.mllib.linalg import Vectors
    data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
            (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
            (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
    df = sqlContext.createDataFrame(data,["features"])
    pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")
    model = pca_extracted.fit(df)
    model.transform(df).collect()

    [Row(features=DenseVector([0.0, 1.0, 0.0, 7.0, 0.0]), pca_features=DenseVector([1.6486, -4.0133])),
     Row(features=DenseVector([2.0, 0.0, 3.0, 4.0, 5.0]), pca_features=DenseVector([-4.6451, -1.1168])),
     Row(features=DenseVector([4.0, 0.0, 0.0, 6.0, 7.0]), pca_features=DenseVector([-6.4289, -5.338]))]

Then, with our method:

    comp, score, eigVals = pca(df)
    score.collect()

    [array([ 1.64857282,  4.0132827 ]),
     array([-4.64510433,  1.11679727]),
     array([-6.42888054,  5.33795143])]

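Let me stress that we do not use any collect() calls inside the functions we have defined: score is an RDD.

Notice that the signs of our second column are all opposite to the ones derived by the existing method. This is not a problem: according to the (freely downloadable) An Introduction to Statistical Learning, co-authored by Hastie & Tibshirani, p. 382:

Each principal component loading vector is unique, up to a sign flip. This means that two different software packages will yield the same principal component loading vectors, although the signs of those loading vectors may differ. The signs may differ because each principal component loading vector specifies a direction in p-dimensional space: flipping the sign has no effect as the direction does not change. [...] Similarly, the score vectors are unique up to a sign flip, since the variance of Z is the same as the variance of -Z.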
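When comparing the output of two implementations, it can help to apply one fixed sign convention to both before checking for equality. The sketch below is only an illustration: `align_signs` is a hypothetical helper, and "make the largest-magnitude entry of each column positive" is just one of several possible conventions. The two arrays are the scores reported above.

```python
import numpy as np

def align_signs(matrix):
    """Flip each column so that its largest-magnitude entry is positive.

    Hypothetical helper; any consistent sign convention would do.
    """
    matrix = np.asarray(matrix, dtype=float)
    rows = np.argmax(np.abs(matrix), axis=0)                    # dominant row index per column
    signs = np.sign(matrix[rows, np.arange(matrix.shape[1])])   # sign of each dominant entry
    signs[signs == 0] = 1.0                                     # leave all-zero columns untouched
    return matrix * signs                                       # broadcasts one sign per column

# Scores reported above by the built-in PCA and by the hand-written pca().
spark_scores = np.array([[1.6486, -4.0133],
                         [-4.6451, -1.1168],
                         [-6.4289, -5.338]])
manual_scores = np.array([[1.64857282, 4.0132827],
                          [-4.64510433, 1.11679727],
                          [-6.42888054, 5.33795143]])

same_up_to_sign = np.allclose(align_signs(spark_scores),
                              align_signs(manual_scores), atol=1e-3)
```

The same helper can be applied to a matrix of loading vectors (one component per column) instead of scores.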

Finally, now that we have the eigenvalues available, it is trivial to write a function for the fraction of the variance explained:

    def varianceExplained(df, k=1):
        """Calculate the fraction of variance explained by the top `k` eigenvectors.

        Args:
            df: A Spark dataframe with a 'features' column, which (column) consists of DenseVectors.
            k: The number of principal components to consider.

        Returns:
            float: A number between 0 and 1 representing the percentage of variance explained
                by the top `k` eigenvectors.
        """
        components, scores, eigenvalues = pca(df, k)
        return sum(eigenvalues[0:k])/sum(eigenvalues)

    varianceExplained(df,1)
    # 0.79439325322305299

As a test, we also check that the variance explained in our example data is 1.0 for k=5 (since the original data are 5-dimensional):

    varianceExplained(df,5)
    # 1.0
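Once the eigenvalues are available, the per-component and cumulative ratios can be used to choose `k`. A minimal sketch, assuming nothing beyond NumPy; `explained_variance_summary`, the 0.85 threshold, and the eigenvalues `[4, 3, 2, 1]` are all made up for illustration and are not taken from the example data.

```python
import numpy as np

def explained_variance_summary(eigenvalues, threshold=0.85):
    """Return per-component ratios, cumulative ratios, and the smallest k reaching `threshold`."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # largest first
    ratios = eigenvalues / eigenvalues.sum()         # fraction of variance per component
    cumulative = np.cumsum(ratios)                   # fraction explained by the top 1, 2, ... components
    k = int(np.searchsorted(cumulative, threshold)) + 1   # first position where cumulative >= threshold
    return ratios, cumulative, min(k, len(eigenvalues))   # clamp in case of rounding at the top end

# Hypothetical eigenvalues, for illustration only.
ratios, cumulative, k_needed = explained_variance_summary([4.0, 3.0, 2.0, 1.0], threshold=0.85)
```

With the eigenvalues returned by `pca(df)`, the first entry of `cumulative` corresponds to `varianceExplained(df, 1)`.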

This should do the job for you efficiently; feel free to ask for any clarification you may need.

[Developed and tested with Spark 1.5.0 & 1.5.1]

eliasah · Answer

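EDIT:

PCA and SVD are now both available in pyspark as of Spark 2.2, according to this resolved JIRA ticket, SPARK-6227.

Original answer:

The answer given by @desertnaut is excellent from a theoretical perspective, but I wanted to present another approach: computing the SVD and extracting the eigenvectors from it.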

    from pyspark.mllib.common import callMLlibFunc, JavaModelWrapper
    from pyspark.mllib.linalg.distributed import RowMatrix

    class SVD(JavaModelWrapper):
        """Wrapper around the SVD scala case class"""
        @property
        def U(self):
            """ Returns a RowMatrix whose columns are the left singular vectors of the SVD if computeU was set to be True."""
            u = self.call("U")
            if u is not None:
                return RowMatrix(u)

        @property
        def s(self):
            """Returns a DenseVector with singular values in descending order."""
            return self.call("s")

        @property
        def V(self):
            """ Returns a DenseMatrix whose columns are the right singular vectors of the SVD."""
            return self.call("V")

This defines our SVD object. We can now define our computeSVD method using the Java wrapper.

    def computeSVD(row_matrix, k, computeU=False, rCond=1e-9):
        """
        Computes the singular value decomposition of the RowMatrix.
        The given row matrix A of dimension (m X n) is decomposed into U * s * V'T where
        * s: DenseVector consisting of square root of the eigenvalues (singular values) in descending order.
        * U: (m X k) (left singular vectors) is a RowMatrix whose columns are the eigenvectors of (A X A')
        * v: (n X k) (right singular vectors) is a Matrix whose columns are the eigenvectors of (A' X A)
        :param k: number of singular values to keep. We might return less than k if there are numerically zero singular values.
        :param computeU: Whether or not to compute U. If set to be True, then U is computed by A * V * sigma^-1
        :param rCond: the reciprocal condition number. All singular values smaller than rCond * sigma(0) are treated as zero, where sigma(0) is the largest singular value.
        :returns: SVD object
        """
        # Delegate to the JVM-side RowMatrix.computeSVD through the Python wrapper object
        java_model = row_matrix._java_matrix_wrapper.call("computeSVD", int(k), computeU, float(rCond))
        return SVD(java_model)

Now, let's apply that to an example:

    from pyspark.ml.feature import *
    from pyspark.mllib.linalg import Vectors
    data = [(Vectors.dense([0.0, 1.0, 0.0, 7.0, 0.0]),),
            (Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
            (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),)]
    df = sqlContext.createDataFrame(data,["features"])

    pca_extracted = PCA(k=2, inputCol="features", outputCol="pca_features")

    model = pca_extracted.fit(df)
    features = model.transform(df) # this creates a DataFrame with the regular features and pca_features

    # We can now extract the pca_features to prepare our RowMatrix.
    pca_features = features.select("pca_features").rdd.map(lambda row : row[0])
    mat = RowMatrix(pca_features)

    # Once the RowMatrix is ready we can compute our Singular Value Decomposition
    svd = computeSVD(mat,2,True)
    svd.s
    # DenseVector([9.491, 4.6253])
    svd.U.rows.collect()
    # [DenseVector([0.1129, -0.909]), DenseVector([0.463, 0.4055]), DenseVector([0.8792, -0.0968])]
    svd.V
    # DenseMatrix(2, 2, [-0.8025, -0.5967, -0.5967, 0.8025], 0)

Sameer Mahajan · Answer

In Spark 2.2+ you can now easily get the explained variance as follows:

    from pyspark.ml.feature import VectorAssembler
    assembler = VectorAssembler(inputCols=<columns of your original dataframe>, outputCol="features")
    df = assembler.transform(<your original dataframe>).select("features")
    from pyspark.ml.feature import PCA
    pca = PCA(k=10, inputCol="features", outputCol="pcaFeatures")
    model = pca.fit(df)
    sum(model.explainedVariance)
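On these Spark versions the fitted model also exposes the principal components themselves, which covers the first half of the question. A minimal sketch, assuming `model` is a fitted `pyspark.ml.feature.PCAModel` like the one above; `pca_summary` is a hypothetical helper name, while `pc` and `explainedVariance` are the attributes documented on `PCAModel`. The helper only relies on those two attributes and their `toArray()` methods.

```python
import numpy as np

def pca_summary(model):
    """Collect eigenvectors and explained-variance ratios from a fitted PCAModel.

    `model` is assumed to be a pyspark.ml.feature.PCAModel; only its `pc`
    (a DenseMatrix) and `explainedVariance` (a DenseVector) attributes are used.
    """
    components = model.pc.toArray()               # shape (n_features, k), one component per column
    ratios = model.explainedVariance.toArray()    # proportion of variance per component, length k
    cumulative = np.cumsum(ratios)                # proportion explained by the top 1, 2, ... components
    return components, ratios, cumulative
```

A call would look like `components, ratios, cumulative = pca_summary(model)`; the last entry of `cumulative` is the same quantity as `sum(model.explainedVariance)` above.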

sergulaydore · Answer

The easiest answer to your question is to feed an identity matrix to your model.

    identity_input = [(Vectors.dense([1.0, .0, 0.0, .0, 0.0]),),
                      (Vectors.dense([.0, 1.0, .0, .0, .0]),),
                      (Vectors.dense([.0, 0.0, 1.0, .0, .0]),),
                      (Vectors.dense([.0, 0.0, .0, 1.0, .0]),),
                      (Vectors.dense([.0, 0.0, .0, .0, 1.0]),)]
    df_identity = sqlContext.createDataFrame(identity_input,["features"])
    identity_features = model.transform(df_identity)

This should give you the principal components.
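Why this works can be seen with a few lines of NumPy. The sketch below assumes that the model's transform is a plain matrix product of each row with the component matrix, with no centering, which is what the matching scores in desertnaut's answer suggest. The 3x2 `components` matrix and the `transform` function are hypothetical stand-ins, not Spark objects.

```python
import numpy as np

# Hypothetical loading matrix: 3 features, 2 components (orthonormal columns).
components = np.array([[1.0, 0.0],
                       [0.0, 0.6],
                       [0.0, 0.8]])

def transform(rows, components):
    """Stand-in for model.transform: project each input row onto the components."""
    return np.asarray(rows, dtype=float) @ components

# Feeding the identity matrix returns the component matrix itself:
# output row i holds the loadings of feature i on each component.
identity_features = transform(np.eye(3), components)
```

Each unit vector picks out one row of the component matrix, so the stacked outputs reproduce it entry for entry.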

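I think eliasah's answer is better in terms of the Spark framework, because desertnaut solves the problem using numpy's functions instead of Spark's actions. However, eliasah's answer is missing the normalization of the data. So, I'd add the following lines to eliasah's answer: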

    from pyspark.ml.feature import StandardScaler
    standardizer = StandardScaler(withMean=True, withStd=False,
                                  inputCol='features',
                                  outputCol='std_features')
    model = standardizer.fit(df)
    output = model.transform(df)
    pca_features = output.select("std_features").rdd.map(lambda row : row[0])
    mat = RowMatrix(pca_features)
    svd = computeSVD(mat,5,True)

In effect, svd.V and identity_features.select("pca_features").collect() should have identical values.

EDIT: I have summarized PCA and its use in Spark and sklearn here
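The link between the SVD of the mean-centered data and the eigen-decomposition of the covariance matrix can be checked locally. This is a minimal NumPy sketch, not Spark code: it uses the Spark ML documentation sample from the earlier answers, a covariance normalised by n, and compares only the top two components, since this data set has just two non-zero eigenvalues.

```python
import numpy as np

# Sample rows from the Spark ML PCA documentation example.
X = np.array([[0.0, 1.0, 0.0, 7.0, 0.0],
              [2.0, 0.0, 3.0, 4.0, 5.0],
              [4.0, 0.0, 0.0, 6.0, 7.0]])
n = X.shape[0]

centered = X - X.mean(axis=0)                    # the step StandardScaler(withMean=True) performs

# Route 1: SVD of the centered data; rows of Vt are the right singular vectors.
U, s, Vt = np.linalg.svd(centered, full_matrices=False)

# Route 2: eigen-decomposition of the covariance matrix.
cov = centered.T @ centered / n
eig_vals, eig_vecs = np.linalg.eigh(cov)         # ascending order
order = np.argsort(eig_vals)[::-1]               # largest eigenvalue first
top_vals = eig_vals[order[:2]]
top_vecs = eig_vecs[:, order[:2]]

# Squared singular values divided by n give the eigenvalues.
eigenvalues_from_svd = s[:2] ** 2 / n
```

The right singular vectors `Vt[:2].T` should match `top_vecs` up to the sign of each column. Without the centering step the SVD decomposes the raw data instead, and its singular vectors are generally not the principal components.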