Creating a Spark DataFrame from a numpy matrix

Question

This is my first time using pySpark (Spark 2), and I am trying to create a toy dataframe for a Logit model. I ran the tutorial successfully and would now like to pass my own data into it.

I have tried this:

    %pyspark

    import numpy as np
    from pyspark.ml.linalg import Vectors, VectorUDT
    from pyspark.mllib.regression import LabeledPoint

    df = np.concatenate([np.random.randint(0,2, size=(1000)),
                         np.random.randn(1000),
                         3*np.random.randn(1000)+2,
                         6*np.random.randn(1000)-2]).reshape(1000,-1)
    df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)

    mydf = spark.createDataFrame(df,["label", "features"])

But I cannot get rid of this error:

TypeError: Cannot convert type <class 'pyspark.ml.linalg.DenseVector'> into Vector

I am using the ML library for the vectors and the input is an array of doubles, so what is the catch? According to the documentation it should be fine.

Many thanks.

desertnaut · Accepted Answer

You are mixing functionality from ML and MLlib, which are not necessarily compatible. You don't need LabeledPoint when using spark-ml:

    sc.version
    # u'2.1.1'

    import numpy as np
    from pyspark.ml.linalg import Vectors

    df = np.concatenate([np.random.randint(0,2, size=(1000)),
                         np.random.randn(1000),
                         3*np.random.randn(1000)+2,
                         6*np.random.randn(1000)-2]).reshape(1000,-1)
    dff = map(lambda x: (int(x[0]), Vectors.dense(x[1:])), df)

    mydf = spark.createDataFrame(dff,schema=["label", "features"])

    mydf.show(5)
    # +-----+-------------+
    # |label|     features|
    # +-----+-------------+
    # |    1|[0.0,0.0,0.0]|
    # |    0|[0.0,1.0,1.0]|
    # |    0|[0.0,1.0,0.0]|
    # |    1|[0.0,0.0,1.0]|
    # |    0|[0.0,1.0,0.0]|
    # +-----+-------------+
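A side note on the toy data itself: np.concatenate joins the four length-1000 arrays end to end, so reshape(1000,-1) fills each row with four consecutive values of the same array rather than one value from each. That would explain why the features shown above contain only 0.0 and 1.0. The following is a minimal sketch, assuming the intent was one column per array; the variable names (n, label, x1, x2, x3, flat, data) are illustrative and not from the original post.

```python
import numpy as np

n = 1000
label = np.random.randint(0, 2, size=n)
x1 = np.random.randn(n)
x2 = 3 * np.random.randn(n) + 2
x3 = 6 * np.random.randn(n) - 2

# Layout used in the question: the arrays are joined end to end,
# so row 0 holds label[0:4], row 1 holds label[4:8], and so on.
flat = np.concatenate([label, x1, x2, x3]).reshape(n, -1)

# Assumed intent: one column per array, one row per observation.
data = np.column_stack([label, x1, x2, x3])

print(flat.shape, data.shape)
```

With this layout, each row of data can be mapped to an (int(x[0]), Vectors.dense(x[1:])) tuple in the same way as in the snippet above.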

PS: As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary machine learning API for Spark is now the DataFrame-based API in the spark.ml package. [ref.]

Jeff Hernandez · Answer

From Numpy to Pandas to Spark:

    import numpy as np
    import pandas as pd

    spark.createDataFrame(pd.DataFrame(np.random.rand(4,4), columns=list('abcd'))).show()

Output:

    +-------------------+-------------------+------------------+-------------------+
    |                  a|                  b|                 c|                  d|
    +-------------------+-------------------+------------------+-------------------+
    | 0.8026427193838694|0.16867056812634307|0.2284873209015007|0.17141853164400833|
    | 0.2559088794287595| 0.3896957084615589|0.3806810025185623| 0.9362280141470332|
    |0.41313827425060257| 0.8087580640179158|0.5547653674054028| 0.5386190454838264|
    | 0.2948395900484454| 0.4085807623354264|0.6814694724946697|0.32031773805256325|
    +-------------------+-------------------+------------------+-------------------+

Dat Tran · Answer

The problem is easy to solve. You are using the ml and the mllib APIs at the same time. Stick to one of them; otherwise you get this error.

This is the solution for the mllib API:

    import numpy as np
    from pyspark.mllib.linalg import Vectors, VectorUDT
    from pyspark.mllib.regression import LabeledPoint

    df = np.concatenate([np.random.randint(0,2, size=(1000)),
                         np.random.randn(1000),
                         3*np.random.randn(1000)+2,
                         6*np.random.randn(1000)-2]).reshape(1000,-1)
    df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)

    mydf = spark.createDataFrame(df,["label", "features"])

For the ml API, you don't need LabeledPoint anymore. Here is an example. I would suggest using the ml API, since the mllib API is going to be deprecated soon.