Out-of-memory error when collecting data out of a Spark cluster

Question

There are plenty of questions on SO about out-of-memory errors in Spark, but I haven't found a solution to mine.

I have a simple workflow:

read in ORC files from Amazon S3
filter down to a small subset of rows
select a small subset of columns
collect into the driver node (so I can do additional operations in R)

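When I run the above and then cache the table to Spark memory, it takes up <2GB, which is tiny compared to the memory available to my cluster. Yet I get an OOM error when I try to collect the data to my driver node.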

I have tried running on the following setups:

local mode on a computer with 32 cores and 244GB of RAM
standalone mode with 10 x 6.2GB executors and a 61GB driver node

For each of these I have tried numerous configurations of executor.memory, driver.memory, and driver.maxResultSize, covering the full range of possible values within my available memory, but I always end up with an out-of-memory error at the collect stage: either java.lang.OutOfMemoryError: Java heap space, java.lang.OutOfMemoryError: GC overhead limit exceeded, or Error in invoke_method.spark_shell_connection(spark_connection(jobj), : No status is returned. (a sparklyr error indicating memory issues).

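Based on my [limited] understanding of Spark, caching a table before collecting should force all the calculations. That is, if the table sits happily in memory after caching at <2GB, then I shouldn't need much more than 2GB of memory to collect it into the driver node.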

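Note that the answers to this question contain some suggestions I have yet to try, but these are likely to impact performance (e.g. serializing the RDD), so I would like to avoid them if possible.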

My questions:

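How can a dataframe that takes up so little space after it has been cached cause memory problems?
Is there something obvious for me to check/change/troubleshoot to fix the problem, before I move on to additional options that may compromise performance?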

Thank you

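Edit: in response to @Shaido's comment below, calling cache via sparklyr executes a count(*) over the table [per the sparklyr documentation], so I believe the table should be sitting in memory, with all the calculations already run, before collect is called.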

Edit: some additional observations from following the suggestions below:

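As per the comments below, I have now tried writing the data to csv instead of collecting, to get an idea of the likely file size. This operation produces a set of csvs amounting to ~3GB, and takes only 2 seconds when run after caching.
If I set driver.maxResultSize to <1G, I get an error stating that the size of the serialized RDD is 1030 MB, larger than driver.maxResultSize.
If I monitor memory usage in Task Manager after calling collect, I see that usage just keeps going up until it reaches ~90GB, at which point the OOM error occurs. So for whatever reason, the amount of RAM used to perform the collect operation is ~100x greater than the size of the RDD I'm trying to collect.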
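As a rough check on the figures quoted above, the blow-up factor can be computed directly. This is only a back-of-the-envelope sketch in base R: the variable names are illustrative, and it compares peak process memory with the serialized result size, which are not strictly like-for-like quantities.

```r
# Figures taken from the observations above
serialized_mb <- 1030   # serialized result size reported in the maxResultSize error
peak_gb <- 90           # approximate peak memory seen in Task Manager before the OOM

# Ratio of peak memory to serialized size (GB converted to MB)
ratio <- (peak_gb * 1024) / serialized_mb
print(round(ratio))
```

The result is a little under 90, the same order of magnitude as the ~100x estimate.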

Edit: code added below, as requested in the comments.

#__________________________________________________________________________________________________________________________________
# Set parameters used for filtering rows
#__________________________________________________________________________________________________________________________________

firstDate <- '2017-07-01'
maxDate <- '2017-08-31'
advertiserID <- '4529611'
advertiserID2 <- '4601141'
advertiserID3 <- '4601141'

library(dplyr)
library(stringr)
library(sparklyr)

#__________________________________________________________________________________________________________________________________
# Configure & connect to spark
#__________________________________________________________________________________________________________________________________

Sys.setenv("SPARK_MEM"="100g")
Sys.setenv(HADOOP_HOME="C:/Users/Jay.Ruffell/AppData/Local/rstudio/spark/Cache/spark-2.0.1-bin-hadoop2.7/tmp/hadoop")

config <- spark_config()
config$sparklyr.defaultPackages <- "org.apache.hadoop:hadoop-aws:2.7.3" # used to connect to S3
Sys.setenv(AWS_ACCESS_KEY_ID="")
Sys.setenv(AWS_SECRET_ACCESS_KEY="") # setting these blank ensures that AWS uses the IAM roles associated with the cluster to define S3 permissions

# Specify memory parameters - have tried lots of different values here!
config$`sparklyr.shell.driver-memory` <- '50g'
config$`sparklyr.shell.executor-memory` <- '50g'
config$spark.driver.maxResultSize <- '50g'
sc <- spark_connect(master='local', config=config, version='2.0.1')

#__________________________________________________________________________________________________________________________________
# load data into spark from S3 ----
#__________________________________________________________________________________________________________________________________

#+++++++++++++++++++
# create spark table (not in memory yet) of all logfiles within logfiles path
#+++++++++++++++++++

spark_session(sc) %>%
  invoke("read") %>%
  invoke("format", "orc") %>%
  invoke("load", 's3a://nz-omg-ann-aipl-data-lake/aip-connect-256537/orc-files/dcm-log-files/dt2-facts') %>%
  invoke("createOrReplaceTempView", "alldatadf")
alldftbl <- tbl(sc, 'alldatadf') # create a reference to the sparkdf without loading into memory

#+++++++++++++++++++
# define variables used to filter table down to daterange
#+++++++++++++++++++

# Calculate firstDate & maxDate as unix timestamps
unixTime_firstDate <- as.numeric(as.POSIXct(firstDate))+1
unixTime_maxDate <- as.numeric(as.POSIXct(maxDate)) + 3600*24-1

# Convert daterange params into date_year, date_month & date_day values to pass to filter statement
dateRange <- as.character(seq(as.Date(firstDate), as.Date(maxDate), by=1))
years <- unique(substring(dateRange, first=1, last=4))
if(length(years)==1) years <- c(years, years)
year_y1 <- years[1]; year_y2 <- years[2]
months_y1 <- substring(dateRange[grepl(years[1], dateRange)], first=6, last=7)
minMonth_y1 <- min(months_y1)
maxMonth_y1 <- max(months_y1)
months_y2 <- substring(dateRange[grepl(years[2], dateRange)], first=6, last=7)
minMonth_y2 <- min(months_y2)
maxMonth_y2 <- max(months_y2)

# Repeat for 1 day prior to first date & one day after maxdate (because of the way logfile orc partitions are created,
# sometimes touchpoints can end up in the wrong folder by 1 day. So read in extra days, then filter by event time)
firstDateMinusOne <- as.Date(firstDate)-1
firstDateMinusOne_year <- substring(firstDateMinusOne, first=1, last=4)
firstDateMinusOne_month <- substring(firstDateMinusOne, first=6, last=7)
firstDateMinusOne_day <- substring(firstDateMinusOne, first=9, last=10)
maxDatePlusOne <- as.Date(maxDate)+1
maxDatePlusOne_year <- substring(maxDatePlusOne, first=1, last=4)
maxDatePlusOne_month <- substring(maxDatePlusOne, first=6, last=7)
maxDatePlusOne_day <- substring(maxDatePlusOne, first=9, last=10)

#+++++++++++++++++++
# Read in data, filter & select
#+++++++++++++++++++

# startTime <- proc.time()[3]
dftbl <- alldftbl %>% # create a reference to the sparkdf without loading into memory

  # filter by month and year, using ORC partitions for extra speed
  filter(((date_year==year_y1 & date_month>=minMonth_y1 & date_month<=maxMonth_y1) |
            (date_year==year_y2 & date_month>=minMonth_y2 & date_month<=maxMonth_y2) |
            (date_year==firstDateMinusOne_year & date_month==firstDateMinusOne_month & date_day==firstDateMinusOne_day) |
            (date_year==maxDatePlusOne_year & date_month==maxDatePlusOne_month & date_day==maxDatePlusOne_day))) %>%

  # filter to be within firstdate & maxdate. Note that event_time_char will be in UTC, so 12hrs behind.
  filter(event_time>=(unixTime_firstDate*1000000) & event_time<(unixTime_maxDate*1000000)) %>%

  # filter by advertiser ID
  filter(((advertiser_id==advertiserID | advertiser_id==advertiserID2 | advertiser_id==advertiserID3) & !is.na(advertiser_id)) |
           ((floodlight_configuration==advertiserID | floodlight_configuration==advertiserID2 |
               floodlight_configuration==advertiserID3) & !is.na(floodlight_configuration)) & user_id!="0") %>%

  # Define cols to keep
  transmute(time=as.numeric(event_time/1000000),
            user_id=as.character(user_id),
            action_type=as.character(if(fact_type=='click') 'C' else if(fact_type=='impression') 'I' else if(fact_type=='activity') 'A' else NA),
            lookup=concat_ws("_", campaign_id, ad_id, site_id_dcm, placement_id),
            activity_lookup=as.character(activity_id),
            sv1=as.character(segment_value_1),
            other_data=as.character(other_data)) %>%
  mutate(time_char=as.character(from_unixtime(time)))

# cache to memory
dftbl <- sdf_register(dftbl, "filtereddf")
tbl_cache(sc, "filtereddf")

#__________________________________________________________________________________________________________________________________
# Collect out of spark
#__________________________________________________________________________________________________________________________________

myDF <- collect(dftbl)

BalaramRaju · Answer

When you call collect on the dataframe, two things happen:

First, all the data has to be written to the output on the driver.
The driver has to collect the data from all the nodes and keep it in its memory.

Answer:

If you only want to load the data into the executors' memory, count() is also an action that loads the data into the executors' memory, where it can be used by other processes.

If you want to extract the data, try this along with the other properties when pulling the data: "--conf spark.driver.maxResultSize=10g".
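With sparklyr, the same property can be set through the connection config, mirroring the style of the code in the question. This is a minimal sketch: the 16g driver memory value is an illustrative assumption, and it presumes a local Spark installation is available. The result size limit is only useful if the driver heap is large enough to hold the collected data.

```r
library(sparklyr)

config <- spark_config()

# Equivalent of "--conf spark.driver.maxResultSize=10g"
config$spark.driver.maxResultSize <- "10g"

# Assumption for illustration: a driver heap larger than the result size limit
config$`sparklyr.shell.driver-memory` <- "16g"

sc <- spark_connect(master = "local", config = config)

# ... filter, cache and collect as in the question ...

spark_disconnect(sc)
```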

pasha701 · Answer

As mentioned above, "cache" is not an action. See RDD Persistence:

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes.

But "collect" is an action, and all computations (including "cache") start when "collect" is called.

You run the application in standalone mode, which means that the initial data loading and all the computations are performed in the same memory.

Most of the memory is used by the data download and the other computations, not by "collect".

You can check this by replacing "collect" with "count".
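A minimal sketch of that check in sparklyr follows. It assumes a local Spark installation is available, and the built-in mtcars data set and the table name demo_tbl are placeholders standing in for the filtered table from the question.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation is available to sparklyr
sc <- spark_connect(master = "local")

# mtcars stands in for the filtered table from the question
demo_tbl <- copy_to(sc, mtcars, "demo_tbl", overwrite = TRUE)
tbl_cache(sc, "demo_tbl")

# A count is an action: it runs the pipeline on the executors,
# but only a single number is returned to the driver
n_rows <- sdf_nrow(demo_tbl)
print(n_rows)

spark_disconnect(sc)
```

If the count on the real table also runs out of memory, the cost lies in loading and computing the data rather than in moving it to the driver. If the count succeeds and only collect fails, the transfer to the driver is the more likely culprit.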