Primary keys with Apache Spark

Question

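I have a JDBC connection between Apache Spark and PostgreSQL and I want to insert some data into my database. When I use append mode I need to specify an id for each DataFrame.Row. Is there any way for Spark to create primary keys?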

zero323 · Accepted Answer

Scala:

If all you need is unique numbers, you can use zipWithUniqueId and recreate the DataFrame. First some imports and dummy data:

    import sqlContext.implicits._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StructType, StructField, LongType}

    val df = sc.parallelize(Seq(
      ("a", -1.0), ("b", -2.0), ("c", -3.0))).toDF("foo", "bar")

Extract the schema for further use:

    val schema = df.schema

Add the id field:

    val rows = df.rdd.zipWithUniqueId.map {
      case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)
    }

Create the DataFrame:

    val dfWithPK = sqlContext.createDataFrame(
      rows, StructType(StructField("id", LongType, false) +: schema.fields))

The same thing in Python:

    from pyspark.sql import Row
    from pyspark.sql.types import StructField, StructType, LongType

    row = Row("foo", "bar")

    df = sc.parallelize([row("a", -1.0), row("b", -2.0), row("c", -3.0)]).toDF()

    # Row factory with "id" in front; defined after df because it needs df.columns
    row_with_index = Row(*["id"] + df.columns)

    def make_row(columns):
        def _make_row(row, uid):
            row_dict = row.asDict()
            return row_with_index(*[uid] + [row_dict.get(c) for c in columns])
        return _make_row

    f = make_row(df.columns)

    df_with_pk = (df.rdd
        .zipWithUniqueId()
        .map(lambda x: f(*x))
        .toDF(StructType([StructField("id", LongType(), False)] + df.schema.fields)))

If you prefer consecutive numbers, you can replace zipWithUniqueId with zipWithIndex, but it is a little more expensive.
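To make the cost difference concrete, here is a plain-Python simulation (not Spark code) of how the two methods number rows. It follows the documented behavior: zipWithUniqueId gives item i of partition k the id i * n + k, where n is the number of partitions, while zipWithIndex needs the size of every preceding partition before it can assign consecutive ids. The function names and the partition layout are made up for illustration.

```python
from itertools import accumulate


def zip_with_unique_id(partitions):
    """Simulate RDD.zipWithUniqueId: item i of partition k gets id i * n + k."""
    n = len(partitions)
    return [[(item, i * n + k) for i, item in enumerate(part)]
            for k, part in enumerate(partitions)]


def zip_with_index(partitions):
    """Simulate RDD.zipWithIndex: consecutive ids across partitions."""
    # The sizes of all partitions must be known up front; in Spark this is
    # the extra pass over the data that makes zipWithIndex more expensive.
    sizes = [len(part) for part in partitions]
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [[(item, start + i) for i, item in enumerate(part)]
            for start, part in zip(offsets, partitions)]


# Hypothetical layout: three partitions of unequal size.
partitions = [["a", "b"], ["c"], ["d", "e", "f"]]

unique_ids = zip_with_unique_id(partitions)
# [[('a', 0), ('b', 3)], [('c', 1)], [('d', 2), ('e', 5), ('f', 8)]]

indexed = zip_with_index(partitions)
# [[('a', 0), ('b', 1)], [('c', 2)], [('d', 3), ('e', 4), ('f', 5)]]
```

Both variants produce unique ids; only the second is gap-free, and only the second depends on information from other partitions.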

Directly with the DataFrame API:

(universal across Scala, Python, Java and R, with pretty much the same syntax)

Previously I missed the monotonicallyIncreasingId function, which should work just fine as long as you don't require consecutive numbers:

    import org.apache.spark.sql.functions.monotonicallyIncreasingId

    df.withColumn("id", monotonicallyIncreasingId).show()
    // +---+----+-----------+
    // |foo| bar|         id|
    // +---+----+-----------+
    // |  a|-1.0|17179869184|
    // |  b|-2.0|42949672960|
    // |  c|-3.0|60129542144|
    // +---+----+-----------+

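While useful, monotonicallyIncreasingId is non-deterministic. Not only may the ids differ from execution to execution, but without additional tricks they cannot be used to identify rows when subsequent operations contain filters.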
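The large, gappy ids in the example output are easier to read once you know the layout described in the Spark documentation for the current implementation: the partition id sits in the upper 31 bits and the record number within the partition in the lower 33 bits. The following plain-Python sketch (not Spark code; the helper names and partition layouts are invented for illustration) decodes the three ids shown and then shows one reason the values are not stable: the same rows get different ids when they land in different partitions.

```python
RECORD_BITS = 33  # lower 33 bits hold the record number within a partition


def make_id(partition_id, record_number):
    """Compose an id the way the documented layout describes."""
    return (partition_id << RECORD_BITS) | record_number


def split_id(generated_id):
    """Return (partition_id, record_number) for a generated id."""
    return generated_id >> RECORD_BITS, generated_id & ((1 << RECORD_BITS) - 1)


# The three ids from the example output
decoded = [split_id(i) for i in (17179869184, 42949672960, 60129542144)]
# [(2, 0), (5, 0), (7, 0)]: first record of partitions 2, 5 and 7


def assign_ids(partitions):
    """Assign ids to rows given a (hypothetical) partition layout."""
    return {row: make_id(k, i)
            for k, part in enumerate(partitions)
            for i, row in enumerate(part)}


# Same rows, two hypothetical partition layouts
ids_a = assign_ids([["a"], ["b"], ["c"]])
ids_b = assign_ids([["a", "b"], ["c"]])
# ids_a == {'a': 0, 'b': 8589934592, 'c': 17179869184}
# ids_b == {'a': 0, 'b': 1, 'c': 8589934592}
```

Because the id depends on where a row happens to sit at evaluation time, anything that changes the partitioning between evaluations can change the id of a given row.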

Note:

It is also possible to use the rowNumber window function:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import rowNumber

    w = Window().orderBy()
    df.withColumn("id", rowNumber().over(w)).show()

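Unfortunately:

    WARN Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

So unless you have a natural way to partition your data and ensure uniqueness, it is not particularly useful at this moment.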
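One way to read "a natural way to partition your data and ensure uniqueness": a row number computed over a partitioned window restarts at 1 in every partition, so it is unique only in combination with the partitioning key. The plain-Python simulation below (not Spark code) illustrates that idea; the `region` column, the helper name and the data are hypothetical.

```python
from collections import defaultdict


def row_number_by_partition(rows, key, order):
    """Number rows 1, 2, 3, ... independently inside each group, mimicking
    a row number over a window partitioned by `key` and ordered by `order`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    numbered = []
    for group_key in sorted(groups):
        ordered = sorted(groups[group_key], key=lambda r: r[order])
        for n, row in enumerate(ordered, start=1):
            numbered.append({**row, "rn": n})
    return numbered


# Hypothetical data with a natural partitioning column, "region".
rows = [
    {"region": "eu", "foo": "a", "bar": -1.0},
    {"region": "us", "foo": "b", "bar": -2.0},
    {"region": "eu", "foo": "c", "bar": -3.0},
]

numbered = row_number_by_partition(rows, key="region", order="foo")

row_numbers = [r["rn"] for r in numbered]
# [1, 2, 1]: not unique on its own

composite_keys = [(r["region"], r["rn"]) for r in numbered]
# [('eu', 1), ('eu', 2), ('us', 1)]: unique as a pair
```

In this reading the primary key would be the pair (partition key, row number) rather than a single integer column.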

Allyn · Answer

    from pyspark.sql.functions import monotonically_increasing_id

    df.withColumn("id", monotonically_increasing_id()).show()

Note that the second argument of df.withColumn is monotonically_increasing_id(), not monotonically_increasing_id.

rocconnick · Answer

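I found the following solution to be relatively straightforward for the case where zipWithIndex() is the desired behavior, i.e. for those who want consecutive integers.

In this case we are using pyspark and relying on a dictionary comprehension to map the original row object to a new dictionary that fits a new schema including the unique index.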

    from pyspark.sql.types import StructType, StructField, IntegerType

    # Read the initial DataFrame without an index
    dfNoIndex = sqlContext.read.parquet(dataframePath)

    # Create a new schema with the "uuid" field placed in front.
    # IntegerType assumes the index fits in 32 bits.
    newSchema = StructType([StructField("uuid", IntegerType(), False)]
                           + dfNoIndex.schema.fields)

    # Zip with the index and map each row to a dictionary that includes the
    # new field. The lambda takes one (row, index) pair, since Python 3 no
    # longer allows tuple parameters, and items() is wrapped in list().
    df = (dfNoIndex.rdd
          .zipWithIndex()
          .map(lambda pair: {k: v for k, v in
                             list(pair[0].asDict().items()) + [("uuid", pair[1])]})
          .toDF(newSchema))