Apache Spark: map vs mapPartitions?

Question

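What is the difference between an RDD's map and mapPartitions methods? And does flatMap behave like map or like mapPartitions? Thanks.

(edit) That is, what is the difference (either semantically or in terms of execution) between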

    def map[A, B](rdd: RDD[A], fn: (A => B))
                 (implicit a: Manifest[A], b: Manifest[B]): RDD[B] = {
      rdd.mapPartitions({ iter: Iterator[A] => for (i <- iter) yield fn(i) },
        preservesPartitioning = true)
    }

and:

    def map[A, B](rdd: RDD[A], fn: (A => B))
                 (implicit a: Manifest[A], b: Manifest[B]): RDD[B] = {
      rdd.map(fn)
    }

Alexey Romanov · Accepted Answer

What is the difference between an RDD's map and mapPartitions methods?

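The map method converts each element of the source RDD into a single element of the result RDD by applying a function. mapPartitions converts each partition of the source RDD into multiple elements of the result (possibly none).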

And does flatMap behave like map or like mapPartitions?

Neither exactly: flatMap works on a single element at a time (like map) and produces multiple elements of the result (like mapPartitions).
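The three shapes can be compared side by side in a minimal plain-Scala sketch. It does not use Spark: an RDD is modeled here as a `Vector` of partitions, and the names `ShapeDemo`, `mapModel`, `flatMapModel`, and `mapPartitionsModel` are hypothetical helpers introduced only for illustration.

```scala
// Plain-Scala model of the three transformations. No Spark is involved:
// an "RDD" is modeled as a Vector of partitions (an assumption for illustration).
object ShapeDemo {
  type Partitions[A] = Vector[Vector[A]]

  // map: the function sees one element and returns exactly one element.
  def mapModel[A, B](parts: Partitions[A])(f: A => B): Partitions[B] =
    parts.map(p => p.map(f))

  // flatMap: the function sees one element and returns zero or more elements.
  def flatMapModel[A, B](parts: Partitions[A])(f: A => Iterable[B]): Partitions[B] =
    parts.map(p => p.flatMap(f))

  // mapPartitions: the function sees a whole partition as an iterator
  // and returns zero or more elements for that partition.
  def mapPartitionsModel[A, B](parts: Partitions[A])(f: Iterator[A] => Iterator[B]): Partitions[B] =
    parts.map(p => f(p.iterator).toVector)

  val data: Partitions[Int] = Vector(Vector(0, 1, 2), Vector(3))

  val doubled  = mapModel(data)(x => x * 2)                   // one output per input
  val repeated = flatMapModel(data)(x => List.fill(x)(x))     // x copies of x, so 0 disappears
  val sums     = mapPartitionsModel(data)(it => Iterator(it.sum)) // one output per partition
}
```

In this model, `doubled` keeps the element count, `repeated` changes the count per element, and `sums` collapses each partition to a single value.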

Ram Ghadiyaram · Answer

Important tip:

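Whenever you have heavyweight initialization that should be done once for many RDD elements rather than once per RDD element, and this initialization (such as creating objects from a third-party library) cannot be serialized so that Spark could transmit it across the cluster to the worker nodes, use mapPartitions() instead of map(). mapPartitions() lets the initialization run once per worker task/thread/partition instead of once per RDD data element. For example, see below.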

    val newRd = myRdd.mapPartitions(partition => {
      val connection = new DbConnection /* creates a db connection per partition */

      val newPartition = partition.map(record => {
        readMatchingFromDB(record, connection)
      }).toList // consumes the iterator, thus calls readMatchingFromDB

      connection.close() // close the db connection here
      newPartition.iterator // create a new iterator
    })

Q2. Does flatMap behave like map or like mapPartitions?

Yes. Please see "Example 2 using flatMap" below; it is self-explanatory.

Q1. What is the difference between an RDD's map and mapPartitions?

map applies the function at the per-element level, while mapPartitions applies the function at the partition level.

Example scenario: if a particular RDD partition has 100K elements, then with map the function used by the mapping transformation is invoked 100K times.

Conversely, with mapPartitions the function is called only once, but it receives all 100K records and returns all the responses in that one function call.

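Because map invokes the function once per element, switching to mapPartitions can give a performance gain, especially when the function does something expensive on every call that it would not need to repeat if all the elements were passed in at once.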
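The call counts in this scenario can be checked with a small plain-Scala sketch. It uses no Spark: a single partition is modeled as a `Vector`, and `CallCountDemo` and `addOneToAll` are hypothetical names. The plain `var` counters only work because everything runs locally in one JVM; on a real cluster they would not be shared with the driver.

```scala
// Local model of one partition with 100,000 elements (no Spark involved).
object CallCountDemo {
  val partition: Vector[Int] = (1 to 100000).toVector

  // map-style: the function body runs once per element.
  var perElementCalls = 0
  val viaMap: Vector[Int] = partition.map { x =>
    perElementCalls += 1
    x + 1
  }

  // mapPartitions-style: the function body runs once per partition.
  var perPartitionCalls = 0
  def addOneToAll(iter: Iterator[Int]): Iterator[Int] = {
    perPartitionCalls += 1 // any expensive setup would go here, paid once
    iter.map(_ + 1)        // per-element work still happens inside the iterator
  }
  val viaMapPartitions: Vector[Int] = addOneToAll(partition.iterator).toVector
}
```

Both variants produce the same elements. Only the number of times the user function is entered differs, which is why the gain depends on how much work the function does outside the per-element step.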

map

Applies a transformation function to each item of the RDD and returns the result as a new RDD.

Listing Variants

    def map[U: ClassTag](f: T => U): RDD[U]

Example:

    val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
    val b = a.map(_.length)
    val c = a.zip(b)
    c.collect
    res0: Array[(String, Int)] = Array((dog,3), (salmon,6), (salmon,6), (rat,3), (elephant,8))

mapPartitions

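This is a specialized map that is called only once for each partition. The entire content of the respective partition is available as a sequential stream of values via the input argument (Iterator[T]). The custom function must return yet another Iterator[U]. The combined result iterators are automatically converted into a new RDD. Note that the tuples (3,4) and (6,7) are missing from the result of Example 1 below because of the partitioning chosen.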

preservesPartitioning indicates whether the input function preserves the partitioner. It should be false unless this is a pair RDD and the input function does not modify the keys.

Listing Variants

    def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

Example 1

    val a = sc.parallelize(1 to 9, 3)
    def myfunc[T](iter: Iterator[T]) : Iterator[(T, T)] = {
      var res = List[(T, T)]()
      var pre = iter.next
      while (iter.hasNext) {
        val cur = iter.next;
        res .::= (pre, cur)
        pre = cur;
      }
      res.iterator
    }
    a.mapPartitions(myfunc).collect
    res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
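To see why (3,4) and (6,7) never appear, here is a plain-Scala sketch without Spark. It assumes the split the output above implies, namely (1,2,3), (4,5,6), (7,8,9). `PairDemo` and `adjacentPairs` are hypothetical names, and the pairing is written with `zip` instead of the loop in `myfunc`, so the order of the pairs differs from the Spark output.

```scala
// Plain-Scala model: pairs are built inside each partition only,
// so pairs that would straddle a partition boundary cannot exist.
object PairDemo {
  // Pairs each element with its successor within one iterator.
  def adjacentPairs[T](iter: Iterator[T]): Iterator[(T, T)] = {
    val v = iter.toVector
    v.zip(v.drop(1)).iterator
  }

  // Assumed split of 1 to 9 into 3 partitions.
  val partitions: Vector[Vector[Int]] =
    Vector(Vector(1, 2, 3), Vector(4, 5, 6), Vector(7, 8, 9))

  // What a per-partition function can see.
  val perPartition: Vector[(Int, Int)] =
    partitions.flatMap(p => adjacentPairs(p.iterator))

  // What a single pass over all nine elements would produce.
  val wholeSequence: Vector[(Int, Int)] =
    adjacentPairs((1 to 9).iterator).toVector

  // The pairs lost at the partition boundaries.
  val missing: Vector[(Int, Int)] = wholeSequence.diff(perPartition)
}
```

The function passed to mapPartitions never sees the last element of one partition together with the first element of the next, so boundary pairs are lost regardless of how the function is written.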

Example 2

    val x = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8, 9,10), 3)
    def myfunc(iter: Iterator[Int]) : Iterator[Int] = {
      var res = List[Int]()
      while (iter.hasNext) {
        val cur = iter.next;
        res = res ::: List.fill(scala.util.Random.nextInt(10))(cur)
      }
      res.iterator
    }
    x.mapPartitions(myfunc).collect
    // some of the numbers are not output at all. This is because the random number generated for them is zero.
    res8: Array[Int] = Array(1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 7, 7, 7, 9, 9, 10)

The above program can also be written using flatMap as follows.

Example 2 using flatMap

    val x = sc.parallelize(1 to 10, 3)
    x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect
    res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)

Conclusion:

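The mapPartitions transformation can be faster than map because it calls your function once per partition rather than once per element, so any expensive setup inside the function is paid once per partition.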

Further reading: foreach vs foreachPartitions: when to use what?

KrazyGautam · Answer

Map:

Processes one row at a time, very similar to the map() method of MapReduce.

You return from the transformation after every row.

MapPartitions

Processes the complete partition in one go.

You can return from the function only once, after processing the whole partition.

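All intermediate results need to be held in memory until you have processed the whole partition, at least when the function builds its output eagerly; returning a lazy iterator such as iter.map(...) avoids this.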

Gives you the equivalent of MapReduce's setup(), map() and cleanup() functions.
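The setup/map/cleanup correspondence can be sketched in plain Scala without Spark. `FakeConnection` is a hypothetical stand-in for an expensive resource such as a database connection, and `SetupCleanupDemo` and `processPartition` are likewise names introduced only for illustration. `processPartition` has the `Iterator[T] => Iterator[U]` shape that mapPartitions expects.

```scala
// Plain-Scala model of the setup / map / cleanup phases (no Spark involved).
object SetupCleanupDemo {
  // Hypothetical resource standing in for a DB connection or similar.
  final class FakeConnection {
    var closed = false
    def lookup(x: Int): String = {
      require(!closed, "connection already closed")
      s"row-$x"
    }
    def close(): Unit = { closed = true }
  }

  var opened = 0 // counts how many connections were created

  def processPartition(iter: Iterator[Int]): Iterator[String] = {
    val conn = new FakeConnection                        // setup(): once per partition
    opened += 1
    val results = iter.map(x => conn.lookup(x)).toList   // map(): once per element, forced eagerly
    conn.close()                                         // cleanup(): once per partition
    results.iterator
  }

  // Two partitions, three elements in total.
  val partitions: Vector[Vector[Int]] = Vector(Vector(1, 2), Vector(3))
  val output: Vector[Vector[String]] =
    partitions.map(p => processPartition(p.iterator).toVector)
}
```

The `.toList` forces every lookup before `close()` runs. Without it the iterator would still be lazy when the connection closes, and the lookups would fail once the results were consumed. The price is that each partition's results are held in memory at once.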

Map Vs mapPartitions http://bytepadding.com/big-data/spark/spark-map-vs-mappartitions/

Spark Map http://bytepadding.com/big-data/spark/spark-map/

Spark mapPartitions http://bytepadding.com/big-data/spark/spark-mappartitions/