Hadoop DistributedCache is deprecated - what is the preferred API?

Question

My map tasks need some configuration data, which I would like to distribute via the distributed cache.

The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class, roughly as follows:

    // In the driver
    JobConf conf = new JobConf(getConf(), WordCount.class);
    ...
    DistributedCache.addCacheFile(new Path(filename).toUri(), conf);

    // In the mapper
    Path[] myCacheFiles = DistributedCache.getLocalCacheFiles(job);
    ...

However, DistributedCache is marked as deprecated in Hadoop 2.2.0.

What is the new preferred way to achieve this? Is there an up-to-date example or tutorial covering this API?

user2371156 · Accepted Answer

The APIs for the distributed cache are found in the Job class itself. Check the documentation here: http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html The code should be something like:

    Job job = new Job();
    ...
    job.addCacheFile(new Path(filename).toUri());

In your mapper code:

    Path[] localPaths = context.getLocalCacheFiles();
    ...

tolgap · Answer

To expand on @jtravaglini's answer, the preferred way to use DistributedCache with YARN/MapReduce 2 is as follows:

In your driver, use Job.addCacheFile():

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        Job job = Job.getInstance(conf, "MyJob");
        job.setMapperClass(MyMapper.class);
        // ...

        // Mind the # sign after the absolute file location.
        // You will be using the name after the # sign as your
        // file name in your Mapper/Reducer
        job.addCacheFile(new URI("/user/yourname/cache/some_file.json#some"));
        job.addCacheFile(new URI("/user/yourname/cache/other_file.json#other"));

        return job.waitForCompletion(true) ? 0 : 1;
    }

In your Mapper/Reducer, override the setup(Context context) method:

    @Override
    protected void setup(Mapper<LongWritable, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        if (context.getCacheFiles() != null && context.getCacheFiles().length > 0) {
            File some_file = new File("./some");
            File other_file = new File("./other");

            // Do things to these two files, like read them
            // or parse as JSON or whatever.
        }
        super.setup(context);
    }

jtravaglini · Answer

The new DistributedCache API for YARN/MR2 is found in the org.apache.hadoop.mapreduce.Job class.

 Job.addCacheFile()

Unfortunately, there are not yet many comprehensive tutorial-style examples of this.

http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Job.html#addCacheFile%28java.net.URI%29

Jackie Jiang · Answer

I did not use Job.addCacheFile(). Instead, I used the -files option as before, like "-files /path/to/myfile.txt#myfile". Then, in the mapper or reducer code, I use the method below:

    /**
     * This method can be used with local execution or HDFS execution.
     *
     * @param context
     * @param symLink
     * @param throwExceptionIfNotFound
     * @return
     * @throws IOException
     */
    public static File findDistributedFileBySymlink(JobContext context, String symLink,
            boolean throwExceptionIfNotFound) throws IOException {
        URI[] uris = context.getCacheFiles();
        if (uris == null || uris.length == 0) {
            if (throwExceptionIfNotFound)
                throw new RuntimeException("Unable to find file with symlink '" + symLink
                        + "' in distributed cache");
            return null;
        }
        URI symlinkUri = null;
        for (URI uri : uris) {
            if (symLink.equals(uri.getFragment())) {
                symlinkUri = uri;
                break;
            }
        }
        if (symlinkUri == null) {
            if (throwExceptionIfNotFound)
                throw new RuntimeException("Unable to find file with symlink '" + symLink
                        + "' in distributed cache");
            return null;
        }
        // if we run this locally the file system URI scheme will be "file",
        // otherwise it should be a symlink
        return "file".equalsIgnoreCase(FileSystem.get(context.getConfiguration()).getScheme())
                ? new File(symlinkUri.getPath())
                : new File(symLink);
    }

Then in the mapper/reducer:

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);

        File file = HadoopUtils.findDistributedFileBySymlink(context, "myfile", true);
        // ... do work ...
    }

Note that if you use "-files /path/to/myfile.txt" directly, without a # suffix, you need to use "myfile.txt" to access the file, since that is the default symlink name.
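The naming rule described in the answers above (the part after # if there is one, otherwise the plain file name) can be illustrated with nothing but java.net.URI. This is a minimal sketch of the rule as stated in this thread, not Hadoop's own implementation; the class CacheLinkNames and the method linkName are hypothetical names made up for the illustration.

```java
import java.net.URI;

class CacheLinkNames {
    /**
     * Returns the name under which a cache file is expected to appear in the
     * task's working directory, following the rule described in this thread:
     * the URI fragment if present, otherwise the last path segment.
     * Assumes a hierarchical URI (one that has a path component).
     */
    static String linkName(URI cacheUri) {
        String fragment = cacheUri.getFragment();
        if (fragment != null && !fragment.isEmpty()) {
            return fragment;
        }
        String path = cacheUri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Explicit link name given after the # sign: prints "some"
        System.out.println(linkName(URI.create("/user/yourname/cache/some_file.json#some")));
        // No fragment, so the file name itself is used: prints "myfile.txt"
        System.out.println(linkName(URI.create("/path/to/myfile.txt")));
    }
}
```

A helper like this could be handy in a driver to log which local names the tasks should expect, but the authoritative behavior is whatever the Hadoop version in use actually does.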

Somum · Answer

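None of the solutions mentioned worked for me completely, possibly because the Hadoop version keeps changing; I am using Hadoop 2.6.4. Essentially, DistributedCache is deprecated, so I did not want to use it. Some of the posts suggest using addCacheFile(), but its usage has changed a bit. Here is how it worked for me: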

    job.addCacheFile(new URI("hdfs://X.X.X.X:9000/EnglishStop.txt#EnglishStop.txt"));

Here X.X.X.X can be the master IP address or localhost. EnglishStop.txt was stored in HDFS at the root location (/).

    hadoop fs -ls /

The output is

    -rw-r--r--   3 centos supergroup       1833 2016-03-12 20:24 /EnglishStop.txt
    drwxr-xr-x   - centos supergroup          0 2016-03-12 19:46 /test

Funny but convenient: the #EnglishStop.txt suffix means the file can now be accessed as "EnglishStop.txt" in the mapper. Here is the code for that:

    public void setup(Context context) throws IOException, InterruptedException {
        File stopwordFile = new File("EnglishStop.txt");
        FileInputStream fis = new FileInputStream(stopwordFile);
        BufferedReader reader = new BufferedReader(new InputStreamReader(fis));

        String stopWord;
        while ((stopWord = reader.readLine()) != null) {
            // stopWord is a word read from the cached file
        }
        reader.close();
    }

This just worked for me. You can read lines from the file stored in HDFS.
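Once the cached file is visible under its link name, reading it is ordinary local file I/O. As a minimal sketch of what the loop above might do with each line, the following collects the words into a set, using only the JDK. The class StopWordLoader and its methods are hypothetical names for this illustration, and one word per line in UTF-8 is an assumption about the file format.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class StopWordLoader {
    /** Reads a local file (assumed UTF-8, one word per line) into a set. */
    static Set<String> load(File file) throws IOException {
        return fromLines(Files.readAllLines(file.toPath(), StandardCharsets.UTF_8));
    }

    /** Trims each line and keeps the non-empty ones. */
    static Set<String> fromLines(List<String> lines) {
        Set<String> words = new HashSet<>();
        for (String line : lines) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // In a real task the input would come from load(new File("EnglishStop.txt")).
        Set<String> words = fromLines(Arrays.asList("the", " and ", "", "of"));
        System.out.println(words.size()); // prints 3
    }
}
```

Loading the set once in setup() and keeping it in a field means map() can then do a cheap contains() check per token.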

patapouf_ai · Answer

I had the same problem. Not only is DistributedCache deprecated, but so are getLocalCacheFiles and "new Job". So what worked for me is the following:

Driver:

    Configuration conf = getConf();
    Job job = Job.getInstance(conf);
    ...
    job.addCacheFile(new Path(filename).toUri());

Mapper/Reducer setup:

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        super.setup(context);

        URI[] files = context.getCacheFiles(); // getCacheFiles returns null
        Path file1path = new Path(files[0]);
        ...
    }
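Because the comment in the snippet above warns that getCacheFiles() can come back null, indexing files[0] directly is risky. A minimal sketch of a guard, written against a plain URI[] so it runs without Hadoop on the classpath; the class CacheFiles and the method first are hypothetical names for this illustration.

```java
import java.net.URI;
import java.util.Optional;

class CacheFiles {
    /** Returns the first cache URI, or empty if the array is null or has no entries. */
    static Optional<URI> first(URI[] files) {
        if (files == null || files.length == 0) {
            return Optional.empty();
        }
        return Optional.ofNullable(files[0]);
    }

    public static void main(String[] args) {
        // No cache files registered: prints "false"
        System.out.println(first(null).isPresent());
        // One cache file registered: prints "true"
        URI[] files = { URI.create("/user/yourname/cache/some_file.json#some") };
        System.out.println(first(files).isPresent());
    }
}
```

In setup(), the result of context.getCacheFiles() could be passed through such a check, failing with a clear message when nothing was registered in the driver.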