What is the fastest way to bulk load data into HBase programmatically?

Question

I have a plain text file with millions of lines that needs custom parsing, and I want to load it into an HBase table as fast as possible (using Hadoop or the HBase Java client).

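My current solution is based on a MapReduce job without the Reduce part. I use FileInputFormat to read the text file so that each line is passed to the map method of my Mapper class. At that point the line is parsed into a Put object, which is written to the context. TableOutputFormat then takes the Put object and inserts it into the table.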

This solution yields an average insertion rate of 1,000 rows per second, which is lower than I expected. My HBase setup runs in pseudo-distributed mode on a single server.

One interesting thing is that during the insertion of 1,000,000 rows, 25 mappers (tasks) are spawned, but they run serially (one after another). Is this normal?

Here is the code for my current solution:

    public static class CustomMap extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

        protected void map(LongWritable key, Text value, Context context) throws IOException {
            Map<String, String> parsedLine = parseLine(value.toString());

            Put row = new Put(Bytes.toBytes(parsedLine.get(keys[1])));
            for (String currentKey : parsedLine.keySet()) {
                row.add(Bytes.toBytes(currentKey), Bytes.toBytes(currentKey),
                        Bytes.toBytes(parsedLine.get(currentKey)));
            }

            try {
                context.write(new ImmutableBytesWritable(Bytes.toBytes(parsedLine.get(keys[1]))), row);
            } catch (InterruptedException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    }

    public int run(String[] args) throws Exception {
        if (args.length != 2) {
            return -1;
        }

        conf.set("hbase.mapred.outputtable", args[1]);

        // I got these conf parameters from a presentation about Bulk Load
        conf.set("hbase.hstore.blockingStoreFiles", "25");
        conf.set("hbase.hregion.memstore.block.multiplier", "8");
        conf.set("hbase.regionserver.handler.count", "30");
        conf.set("hbase.regions.percheckin", "30");
        conf.set("hbase.regionserver.globalMemcache.upperLimit", "0.3");
        conf.set("hbase.regionserver.globalMemcache.lowerLimit", "0.15");

        Job job = new Job(conf);
        job.setJarByClass(BulkLoadMapReduce.class);
        job.setJobName(NAME);
        TextInputFormat.setInputPaths(job, new Path(args[0]));
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(CustomMap.class);
        job.setOutputKeyClass(ImmutableBytesWritable.class);
        job.setOutputValueClass(Put.class);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(TableOutputFormat.class);

        job.waitForCompletion(true);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        Long startTime = Calendar.getInstance().getTimeInMillis();
        System.out.println("Start time : " + startTime);

        int errCode = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadMapReduce(), args);

        Long endTime = Calendar.getInstance().getTimeInMillis();
        System.out.println("End time : " + endTime);
        System.out.println("Duration milliseconds: " + (endTime - startTime));

        System.exit(errCode);
    }

QuinnG · Accepted Answer

I went through a process that is probably very similar to yours, trying to find an efficient way to load data from an MR job into HBase. What I found to work is using HFileOutputFormat as the OutputFormatClass of the MR job.

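Below is the basis of the code I used to generate the job, along with the Mapper map function that writes out the data. It was fast. We no longer use it, so I don't have numbers on hand, but it handled around 2.5 million records in under a minute.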
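To put that figure next to the 1,000 rows per second reported in the question, here is a rough back-of-the-envelope calculation. It is only a sketch: the class and method names are made up for illustration, the two measurements come from different clusters and data sets, and "under a minute" only gives a lower bound on the rate.

```java
// Back-of-the-envelope throughput comparison using only the figures quoted in this thread.
// All names here are illustrative; no Hadoop or HBase classes are involved.
class ThroughputComparison {

    /** Rows per second for a load of {@code rows} rows that took {@code seconds} seconds. */
    static double rowsPerSecond(long rows, double seconds) {
        return rows / seconds;
    }

    /** Seconds needed to load {@code rows} rows at a steady {@code rate} rows per second. */
    static double secondsToLoad(long rows, double rate) {
        return rows / rate;
    }

    public static void main(String[] args) {
        double tableOutputRate = 1_000.0;                           // question: TableOutputFormat
        double hfileRateLowerBound = rowsPerSecond(2_500_000L, 60); // answer: "under a minute"

        System.out.printf("HFileOutputFormat lower bound: %.0f rows/s%n", hfileRateLowerBound);
        System.out.printf("Speed-up: at least %.1fx%n", hfileRateLowerBound / tableOutputRate);
        System.out.printf("1,000,000 rows at 1,000 rows/s: %.0f s%n",
                secondsToLoad(1_000_000L, tableOutputRate));
    }
}
```

Taken at face value, that is at least about 41,000 rows per second, so the gap to the TableOutputFormat approach is more than an order of magnitude even allowing for different hardware.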

Here is the (stripped-down) function I wrote to generate the job for the MapReduce process that puts the data into HBase.

    private Job createCubeJob(...) {
        //Build and Configure Job
        Job job = new Job(conf);
        job.setJobName(jobName);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        job.setMapperClass(HiveToHBaseMapper.class);//Custom Mapper
        job.setJarByClass(CubeBuilderDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(HFileOutputFormat.class);

        TextInputFormat.setInputPaths(job, hiveOutputDir);
        HFileOutputFormat.setOutputPath(job, cubeOutputPath);

        Configuration hConf = HBaseConfiguration.create(conf);
        hConf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum);
        hConf.set("hbase.zookeeper.property.clientPort", hbaseZookeeperClientPort);

        HTable hTable = new HTable(hConf, tableName);

        HFileOutputFormat.configureIncrementalLoad(job, hTable);
        return job;
    }
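The call to HFileOutputFormat.configureIncrementalLoad matters because HFiles must be sorted by row key and must each fall within a single region. That call inspects the table's region boundaries and sets up a total-order partitioner and a sorting reducer accordingly. The following is a plain-JDK sketch of that idea only, not HBase code: the class name, the region start keys, and the row keys are all hypothetical, and String ordering stands in for HBase's byte-wise ordering (the two agree for the ASCII keys used here).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-JDK illustration only: no HBase classes are used here.
class RegionPartitionSketch {

    /**
     * Groups row keys by the region whose start key is the greatest one <= the row key,
     * keeping the keys of each region sorted. The region start keys are hypothetical
     * inputs and are assumed to include the empty key "" for the first region.
     */
    static Map<String, TreeSet<String>> partition(List<String> regionStartKeys,
                                                  List<String> rowKeys) {
        TreeMap<String, TreeSet<String>> regions = new TreeMap<>();
        for (String start : regionStartKeys) {
            regions.put(start, new TreeSet<>());
        }
        for (String rowKey : rowKeys) {
            // floorKey picks the region that owns this row key.
            String owner = regions.floorKey(rowKey);
            regions.get(owner).add(rowKey);
        }
        return regions;
    }

    public static void main(String[] args) {
        List<String> starts = Arrays.asList("", "g", "p");
        List<String> keys = Arrays.asList("zebra", "apple", "mango", "grape", "pear", "fig");
        partition(starts, keys).forEach((start, rows) ->
                System.out.println("region starting at \"" + start + "\" -> " + rows));
    }
}
```

Each group corresponds to what one reducer would write as sorted HFiles for one region. Note that a job like the one above only produces the HFiles under the output path; they still have to be handed to the table afterwards, for example with HBase's completebulkload tool (LoadIncrementalHFiles).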

This is the map function from my HiveToHBaseMapper class (slightly edited).

    public void map(WritableComparable key, Writable val, Context context)
            throws IOException, InterruptedException {
        try {
            Configuration config = context.getConfiguration();
            String[] strs = val.toString().split(Constants.Hive_RECORD_COLUMN_SEPARATOR);
            String family = config.get(Constants.CUBEBUILDER_CONFIGURATION_FAMILY);
            String column = strs[COLUMN_INDEX];
            // Parsed as a number: it is compared with 0 and written as a double below.
            double value = Double.parseDouble(strs[VALUE_INDEX]);
            String sKey = generateKey(strs, config);
            byte[] bKey = Bytes.toBytes(sKey);

            Put put = new Put(bKey);
            put.add(Bytes.toBytes(family), Bytes.toBytes(column),
                    (value <= 0) ? Bytes.toBytes(Double.MIN_VALUE) : Bytes.toBytes(value));

            ImmutableBytesWritable ibKey = new ImmutableBytesWritable(bKey);
            context.write(ibKey, put);

            context.getCounter(CubeBuilderContextCounters.CompletedMapExecutions).increment(1);
        } catch (Exception e) {
            // Failed records are only counted, not rethrown.
            context.getCounter(CubeBuilderContextCounters.FailedMapExecutions).increment(1);
        }
    }
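One detail of the mapper that is easy to misread is the expression (value <= 0) ? Bytes.toBytes(Double.MIN_VALUE) : Bytes.toBytes(value). Double.MIN_VALUE is the smallest positive double, not the most negative one, so non-positive values end up stored as a tiny positive number. The sketch below reproduces that rule with the JDK alone. The class and method names are invented for illustration, and using ByteBuffer in place of Bytes.toBytes(double) rests on the assumption that HBase also writes the 8 bytes of the IEEE 754 representation in big-endian order.

```java
import java.nio.ByteBuffer;

// Plain-JDK sketch; ByteBuffer is an assumed stand-in for HBase's Bytes.toBytes(double).
class CellValueSketch {

    /** Mirrors the mapper's rule: non-positive values are replaced by Double.MIN_VALUE. */
    static double clamp(double value) {
        return (value <= 0) ? Double.MIN_VALUE : value;
    }

    /** Encodes a double as 8 big-endian bytes (ByteBuffer's default byte order). */
    static byte[] encode(double value) {
        return ByteBuffer.allocate(Double.BYTES).putDouble(value).array();
    }

    /** Decodes 8 big-endian bytes back into a double. */
    static double decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getDouble();
    }

    public static void main(String[] args) {
        // The replacement value is positive, not negative.
        System.out.println("clamp(-3.5) > 0: " + (clamp(-3.5) > 0));
        // Positive values survive the encode/decode round trip unchanged.
        System.out.println("round trip: " + decode(encode(clamp(2.25))));
    }
}
```

Anyone reading such a table back needs to decode the cell with the matching 8-byte double conversion and treat Double.MIN_VALUE as the marker for "zero or negative input".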

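I'm sure this isn't going to be a copy-and-paste solution for you. Obviously, the data I was working with here didn't need any custom processing (that was done in an MR job before this one). The main thing I want to offer here is HFileOutputFormat; the rest is just an example of how I used it. :)
I hope it sets you on a solid path to a good solution.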

Praveen Sripati · Answer

"One interesting thing is that during the insertion of 1,000,000 rows, 25 mappers (tasks) are spawned, but they run serially (one after another). Is this normal?"

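The mapreduce.tasktracker.map.tasks.maximum parameter (default 2) determines the maximum number of map tasks that can run in parallel on a node. Unless it has been changed, you should see 2 map tasks running simultaneously on each node.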
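As a rough analogy for how such a slot limit shapes execution, the sketch below uses a fixed-size JDK thread pool in place of a node's map slots. It is not Hadoop code: the class and method names are made up, and the dummy tasks just sleep for a few milliseconds.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Plain-JDK analogy: a fixed thread pool stands in for the map slots of one node.
class MapSlotSketch {

    /** Number of scheduling "waves" needed to run {@code tasks} tasks on {@code slots} slots. */
    static int waves(int tasks, int slots) {
        return (tasks + slots - 1) / slots; // ceiling division
    }

    /** Runs {@code tasks} short dummy tasks on {@code slots} threads; returns peak concurrency. */
    static int peakConcurrency(int tasks, int slots) {
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(5); // stand-in for the work of one map task
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    running.decrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("waves for 25 tasks on 2 slots: " + waves(25, 2));
        System.out.println("peak concurrency with 2 slots: " + peakConcurrency(25, 2));
    }
}
```

With 2 slots the 25 tasks need 13 waves and never more than 2 run at once; with a single slot they would run strictly one after another, which is the pattern described in the question.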