How do I write the TIMESTAMP logical type (INT96) to Parquet using ParquetWriter?

Question

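I have a tool that uses org.apache.parquet.hadoop.ParquetWriter to convert CSV data files to Parquet data files.

Currently, it only handles int32, double, and string.

I need to support the Parquet timestamp logical type (annotated as int96), and I am lost on how to do that because I can't find a precise specification online.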

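This timestamp encoding (int96) appears to be rare and not well supported. I have found very little specification detail online. This github README states:

Timestamps saved as an int96 are made up of the nanoseconds in the day (first 8 bytes) and the Julian day (last 4 bytes).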

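Specifically:

Which Parquet Type do I use for the column in the MessageType schema? I assume I should use the primitive type PrimitiveTypeName.INT96, but I'm not sure whether there is a way to specify a logical type.
How do I write the data? That is, in what format do I write the timestamp to the group? For an INT96 timestamp, I assume I must write some binary type?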

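Here is a simplified version of my code that demonstrates what I am trying to do. Specifically, take a look at the "TODO" comments; these are the two points in the code that correspond to the questions above.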

    List<Type> fields = new ArrayList<>();
    fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.INT32, "int32_col", null));
    fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.DOUBLE, "double_col", null));
    // Parquet has no STRING primitive; strings are BINARY annotated as UTF8.
    fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.BINARY, "string_col", OriginalType.UTF8));

    // TODO:
    //   Specify the TIMESTAMP type.
    //   How? INT96 primitive type? Is there a logical timestamp type I can use w/ MessageType schema?
    fields.add(new PrimitiveType(Type.Repetition.OPTIONAL, PrimitiveTypeName.INT96, "timestamp_col", null));

    MessageType schema = new MessageType("input", fields);

    // initialize writer
    Configuration configuration = new Configuration();
    configuration.setQuietMode(true);
    GroupWriteSupport.setSchema(schema, configuration);
    ParquetWriter<Group> writer = new ParquetWriter<Group>(
        new Path("output.parquet"),
        new GroupWriteSupport(),
        CompressionCodecName.SNAPPY,
        ParquetWriter.DEFAULT_BLOCK_SIZE,
        ParquetWriter.DEFAULT_PAGE_SIZE,
        1048576,   // dictionary page size
        true,      // enable dictionary
        false,     // validating
        ParquetProperties.WriterVersion.PARQUET_1_0,
        configuration
    );

    // write CSV data (csv holds the path of the input file)
    CSVParser parser = CSVParser.parse(new File(csv), StandardCharsets.UTF_8, CSVFormat.TDF.withQuote(null));
    SimpleGroupFactory f = new SimpleGroupFactory(schema);
    for (CSVRecord csvRecord : parser) {
        Group group = f.newGroup();
        int colIndex = 0;
        for (String record : csvRecord) {
            // leave the optional field unset for empty or NULL values
            if (record == null || record.isEmpty() || record.equals("NULL")) {
                colIndex++;
                continue;
            }
            record = record.trim();
            switch (colIndex) {
                case 0: // int32
                    group.add(colIndex, Integer.parseInt(record));
                    break;
                case 1: // double
                    group.add(colIndex, Double.parseDouble(record));
                    break;
                case 2: // string
                    group.add(colIndex, record);
                    break;
                case 3:
                    // TODO: convert CSV string value to TIMESTAMP type (how?)
                    throw new NotImplementedException();
            }
            colIndex++;
        }
        writer.write(group);
    }
    writer.close();

James Wierzba · Accepted Answer

I figured it out, using this code from spark sql as a reference.

The INT96 binary encoding is split into two parts: the first 8 bytes are the nanoseconds since midnight, and the last 4 bytes are the Julian day.

    String value = "2019-02-13 13:35:05";

    final long NANOS_PER_HOUR = TimeUnit.HOURS.toNanos(1);
    final long NANOS_PER_MINUTE = TimeUnit.MINUTES.toNanos(1);
    final long NANOS_PER_SECOND = TimeUnit.SECONDS.toNanos(1);

    // Parse date
    // (the parser uses the JVM default time zone; the Calendar then exposes the fields in UTC)
    SimpleDateFormat parser = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
    Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
    cal.setTime(parser.parse(value));

    // Calculate Julian days and nanoseconds in the day
    LocalDate dt = LocalDate.of(cal.get(Calendar.YEAR), cal.get(Calendar.MONTH) + 1, cal.get(Calendar.DAY_OF_MONTH));
    int julianDays = (int) JulianFields.JULIAN_DAY.getFrom(dt);
    long nanos = (cal.get(Calendar.HOUR_OF_DAY) * NANOS_PER_HOUR)
            + (cal.get(Calendar.MINUTE) * NANOS_PER_MINUTE)
            + (cal.get(Calendar.SECOND) * NANOS_PER_SECOND);

    // Write INT96 timestamp: 8 bytes of nanos, then 4 bytes of Julian day, little-endian
    byte[] timestampBuffer = new byte[12];
    ByteBuffer buf = ByteBuffer.wrap(timestampBuffer);
    buf.order(ByteOrder.LITTLE_ENDIAN).putLong(nanos).putInt(julianDays);

    // This is the properly encoded INT96 timestamp
    Binary tsValue = Binary.fromReusedByteArray(timestampBuffer);
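
For anyone who wants to sanity-check the byte layout without a Parquet dependency, here is a minimal sketch of the same encoding using only java.time, plus a decoder for a round trip. The class name Int96Timestamp and its encode/decode methods are hypothetical names for illustration, not part of any library. It also assumes the string is taken as-is with no time zone shift, which differs from the Calendar-based code above; readers differ on whether INT96 values are interpreted as UTC or local time, so convert first if that matters for your consumers.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.JulianFields;

// Hypothetical helper: packs/unpacks the 12-byte INT96 timestamp layout.
final class Int96Timestamp {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    /** Encodes a "yyyy-MM-dd HH:mm:ss" string (assumed: no zone shift) into 12 bytes. */
    static byte[] encode(String value) {
        LocalDateTime ts = LocalDateTime.parse(value, FORMAT);
        long nanosOfDay = ts.toLocalTime().toNanoOfDay();
        int julianDay = (int) JulianFields.JULIAN_DAY.getFrom(ts.toLocalDate());
        // Little-endian: 8 bytes of nanoseconds in the day, then 4 bytes of Julian day.
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay)
                .putInt(julianDay)
                .array();
    }

    /** Decodes 12 bytes produced by encode back into a LocalDateTime. */
    static LocalDateTime decode(byte[] int96) {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();
        int julianDay = buf.getInt();
        LocalDate date = LocalDate.ofEpochDay(0).with(JulianFields.JULIAN_DAY, julianDay);
        return LocalDateTime.of(date, LocalTime.ofNanoOfDay(nanosOfDay));
    }

    public static void main(String[] args) {
        byte[] bytes = encode("2019-02-13 13:35:05");
        System.out.println(decode(bytes).format(FORMAT)); // prints 2019-02-13 13:35:05
    }
}
```

With parquet-mr on the classpath, the resulting bytes can presumably be wrapped with Binary.fromConstantByteArray(bytes) and added with group.add(colIndex, tsValue) to a column declared as PrimitiveTypeName.INT96 with no logical type annotation. The code above does not show that step, so treat it as an assumption to verify against your parquet-mr version.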