I have a Spark EC2 cluster to which I submit PySpark programs from a Zeppelin notebook. I downloaded hadoop-aws-2.7.3.jar and aws-java-sdk-1.11.179.jar and placed them in the /opt/spark/jars directory of the Spark instances. I get: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
Why is Spark not seeing the jars? Do I have to place the jars on every slave and specify them in spark-defaults.conf on the master and the slaves as well? Is there anything that needs to be configured in Zeppelin so that it recognizes the new jar files?
I placed the jar files in /opt/spark/jars on the Spark master. I created spark-defaults.conf and added the lines:
spark.hadoop.fs.s3a.access.key [ACCESS KEY]
spark.hadoop.fs.s3a.secret.key [SECRET KEY]
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.driver.extraClassPath /opt/spark/jars/hadoop-aws-2.7.3.jar:/opt/spark/jars/aws-java-sdk-1.11.179.jar
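For completeness, the failing task in the trace further down runs on an executor ("executor 1"), while spark.driver.extraClassPath only affects the driver JVM. A minimal sketch of executor-side lines I have not set (assuming the two jars really do exist at /opt/spark/jars on every worker):
spark.executor.extraClassPath /opt/spark/jars/hadoop-aws-2.7.3.jar:/opt/spark/jars/aws-java-sdk-1.11.179.jar
# alternatively, let Spark ship the jars to the executors itself (comma-separated list)
spark.jars /opt/spark/jars/hadoop-aws-2.7.3.jar,/opt/spark/jars/aws-java-sdk-1.11.179.jar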
I have a Zeppelin interpreter that submits the Spark jobs to the Spark master.
I also placed the jars in /opt/spark/jars on the slaves, but did not create a spark-defaults.conf there.
%spark.pyspark
# importing necessary libraries
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import StringType
from pyspark.sql import SQLContext
from itertools import islice
from pyspark.sql.functions import col
# add AWS credentials
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "[ACCESS KEY]")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "[SECRET KEY]")
sc._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
# creating the context
sqlContext = SQLContext(sc)
# reading the first csv file and storing it in an RDD
rdd1 = sc.textFile("s3a://filepath/baby-names.csv").map(lambda line: line.split(","))
# removing the first row as it contains the header
rdd1 = rdd1.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it
)
# converting the RDD into a dataframe
df1 = rdd1.toDF(['year', 'name', 'percent', 'sex'])
# print the dataframe
df1.show()
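For reference, once the classpath issue is sorted out, the header handling above can be done by the DataFrame reader directly. A sketch, assuming the `spark` session that Zeppelin provides and that the first row of the CSV is a header:
# alternative sketch: let the DataFrame reader consume the header instead of islice
df1 = (spark.read
    .option("header", "true")   # treat the first row as column names
    .csv("s3a://filepath/baby-names.csv")
    .toDF("year", "name", "percent", "sex"))
df1.show()
Either way the s3a classes still have to be on the classpath, so this does not change the error below.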
The error that gets thrown:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, 10.11.93.90, executor 1): java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonServiceException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 34 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:153)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: com/amazonaws/AmazonServiceException
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.hadoop.conf.Configuration.getClassByNameOrNull(Configuration.java:2134)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2099)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:108)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.liftedTree1$1(HadoopRDD.scala:267)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:266)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:224)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:95)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: java.lang.ClassNotFoundException: com.amazonaws.AmazonServiceException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 34 more
Below is what worked for me.
My System Config:
Ubuntu 16.04.6 LTS, Python 3.7.7, OpenJDK 1.8.0_252, spark-2.4.5-bin-hadoop2.7
1. Set the PYSPARK_PYTHON path: add the following line to $SPARK_HOME/conf/spark-env.sh
PYSPARK_PYTHON=python_env_path/bin/python
2. Start pyspark:
pyspark --packages com.amazonaws:aws-java-sdk-pom:1.11.760,org.apache.hadoop:hadoop-aws:2.7.0 --conf spark.hadoop.fs.s3a.endpoint=s3.us-west-2.amazonaws.com
com.amazonaws:aws-java-sdk-pom:1.11.760: depends on your JDK version; hadoop-aws:2.7.0: depends on your Hadoop version; s3.us-west-2.amazonaws.com: depends on your S3 location.
3. Read the data from S3:
df2=spark.read.parquet("s3a://s3location_file_path")
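The same dependencies can also be requested from inside the program when the session is built instead of on the pyspark command line. A sketch, assuming Spark 2.4 and access to Maven Central (spark.jars.packages is only honored when the JVM is started, so it may not take effect in an already-running Zeppelin session):
from pyspark.sql import SparkSession

# sketch: same Maven coordinates as the --packages flag above
spark = (SparkSession.builder
    .appName("s3a-read")
    .config("spark.jars.packages",
            "com.amazonaws:aws-java-sdk-pom:1.11.760,org.apache.hadoop:hadoop-aws:2.7.0")
    .config("spark.hadoop.fs.s3a.endpoint", "s3.us-west-2.amazonaws.com")
    .getOrCreate())

df2 = spark.read.parquet("s3a://s3location_file_path")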
If none of the above works, cat and grep the jars for the missing class; the jar is most likely corrupt. For example, if the class AmazonServiceException is not found, run a grep over the jars that are already in place, like so:
grep "AmazonServiceException" *.jar
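If grepping binary jars directly is awkward, the same check can be done with Python's standard library. A sketch, assuming the jars live under /opt/spark/jars (a jar that cannot even be opened as a zip archive is likely corrupt):
import glob
import zipfile

# sketch: report which jars (if any) contain the missing class
for jar in sorted(glob.glob("/opt/spark/jars/*.jar")):
    try:
        entries = zipfile.ZipFile(jar).namelist()
    except zipfile.BadZipFile:
        print(jar, "-> cannot be read as a zip archive (possibly corrupt)")
        continue
    if any(entry.endswith("AmazonServiceException.class") for entry in entries):
        print(jar, "-> contains AmazonServiceException")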
Add the following to hadoop/etc/hadoop/core-site.xml:
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>***</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>***</value>
</property>
Inside the Hadoop installation directory, find the AWS jars (for a Mac install the directory is /usr/local/Cellar/hadoop/):
find . -type f -name "*aws*"
sudo cp hadoop/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar hadoop/share/hadoop/common/lib/
sudo cp hadoop/share/hadoop/tools/lib/hadoop-aws-2.7.5.jar hadoop/share/hadoop/common/lib/