How do I run GraphX with Python / PySpark?

Question

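I am trying to run Spark GraphX from Python using PySpark. My installation appears to be correct, since I can run both the PySpark tutorials and the (Java) GraphX tutorials. Since GraphX is part of Spark, I would expect PySpark to be able to interface with it.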

The PySpark tutorials are here: http://spark.Apache.org/docs/0.9.0/quick-start.html http://spark.Apache.org/docs/0.9.0/python-programming-guide.html

For GraphX: http://spark.Apache.org/docs/0.9.0/graphx-programming-guide.html http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html

Can anyone convert the GraphX tutorial to Python?

Misty Nodine · Accepted Answer

It looks like the Python bindings for GraphX are delayed until at least Spark ~~1.4~~ ~~1.5~~ ∞. They are waiting behind the Java API.

You can track the status at SPARK-3789 [GRAPHX] Python bindings for GraphX - ASF JIRA.

zhibo · Answer

Take a look at GraphFrames (https://github.com/graphframes/graphframes). It wraps the GraphX algorithms under the DataFrames API and provides a Python interface.

Here is a quick example from https://graphframes.github.io/graphframes/docs/_site/quick-start.html.

First, start pyspark with the graphframes package loaded:

pyspark --packages graphframes:graphframes:0.1.0-spark1.6

Python code:

    from graphframes import *

    # Create a Vertex DataFrame with unique ID column "id"
    v = sqlContext.createDataFrame([
        ("a", "Alice", 34),
        ("b", "Bob", 36),
        ("c", "Charlie", 30),
    ], ["id", "name", "age"])

    # Create an Edge DataFrame with "src" and "dst" columns
    e = sqlContext.createDataFrame([
        ("a", "b", "friend"),
        ("b", "c", "follow"),
        ("c", "b", "follow"),
    ], ["src", "dst", "relationship"])

    # Create a GraphFrame
    g = GraphFrame(v, e)

    # Query: Get in-degree of each vertex.
    g.inDegrees.show()

    # Query: Count the number of "follow" connections in the graph.
    g.edges.filter("relationship = 'follow'").count()

    # Run PageRank algorithm, and show results.
    results = g.pageRank(resetProbability=0.01, maxIter=20)
    results.vertices.select("id", "pagerank").show()
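If you want to sanity-check the first two queries without a Spark cluster, the same toy graph is small enough to work through in plain Python. The sketch below is not part of GraphFrames; the helper names `in_degrees` and `count_relationship` are made up for illustration, and it only mirrors the in-degree and "follow" count queries (PageRank is left out, since its exact values depend on the GraphX formulation).

```python
from collections import Counter

# Same toy data as the GraphFrames example above.
vertices = [("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30)]
edges = [("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow")]


def in_degrees(edge_list):
    """Count incoming edges per destination vertex (hypothetical helper).

    Vertices with no incoming edges simply do not appear in the result.
    """
    return Counter(dst for _src, dst, _rel in edge_list)


def count_relationship(edge_list, relationship):
    """Count edges whose relationship label matches (hypothetical helper)."""
    return sum(1 for _src, _dst, rel in edge_list if rel == relationship)


in_deg = in_degrees(edges)
follow_count = count_relationship(edges, "follow")

print(dict(in_deg))    # in-degree per vertex that has at least one in-edge
print(follow_count)    # number of "follow" edges
```

On this data, "b" has two incoming edges, "c" has one, and there are two "follow" edges, which is what the GraphFrames queries should report as well.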

Wildfire · Answer

GraphX 0.9.0 does not have a Python API yet. One is expected in an upcoming release.