Add partitions to a Glue table via the AWS API?

Question

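I have an S3 bucket that is constantly being filled with new data, and I am using Athena and Glue to query that data. The problem is that Glue doesn't know when a new partition has been created, so queries never look at the data stored there. Making an API call to run a Glue crawler every time I need a new partition is too expensive, so I think the best solution is to tell Glue directly that a new partition has been added. I looked through the AWS documentation but had no luck. I am using Java with AWS. Any help?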

Aashish Ola · Answer

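I recommend using the batch_create_partition() Glue API to register new partitions. It doesn't require expensive operations such as MSCK REPAIR TABLE or re-crawling.

I had a similar use case, for which I wrote a Python script that does the following:

Step 1 - Fetch the table information and parse out the details needed to register partitions.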

    # Assumes `import sys`, a configured `logger`, and l_client = boto3.client('glue')

    # Fetch table information from the Glue catalog
    logger.info("Fetching table info for {}.{}".format(l_database, l_table))
    try:
        response = l_client.get_table(
            CatalogId=l_catalog_id,
            DatabaseName=l_database,
            Name=l_table
        )
    except Exception as error:
        logger.error("Exception while fetching table info for {}.{} - {}"
                     .format(l_database, l_table, error))
        sys.exit(-1)

    # Parse the table info required to create partitions
    input_format = response['Table']['StorageDescriptor']['InputFormat']
    output_format = response['Table']['StorageDescriptor']['OutputFormat']
    table_location = response['Table']['StorageDescriptor']['Location']
    serde_info = response['Table']['StorageDescriptor']['SerdeInfo']
    partition_keys = response['Table']['PartitionKeys']

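Step 2 - Generate a list of dictionaries, where each dictionary contains the information needed to create a single partition. Every dictionary has the same structure; only the partition-specific values change (year, month, day, hour).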

    from datetime import datetime, timedelta

    def generate_partition_input_list(start_date, num_of_days, table_location,
                                      input_format, output_format, serde_info):
        input_list = []  # One entry per partition to create

        today = datetime.utcnow().date()
        if start_date > today:  # Handles the case where future partitions were created manually
            start_date = today
        end_date = today + timedelta(days=num_of_days)  # Date up to which partitions are created
        logger.info("Partitions to be created from {} to {}".format(start_date, end_date))

        # date_range() is a helper that is not shown in this answer
        for input_date in date_range(start_date, end_date):
            # Zero-pad the partition values and convert them to strings
            year = '{:04d}'.format(input_date.year)
            month = '{:02d}'.format(input_date.month)
            day = '{:02d}'.format(input_date.day)

            for hour_num in range(24):  # One partition per hour of the day
                hour = '{:02d}'.format(hour_num)  # Keep the hour at two digits
                # Assumes table_location already ends with '/'
                part_location = "{}{}/{}/{}/{}/".format(table_location, year, month, day, hour)
                input_dict = {
                    'Values': [year, month, day, hour],
                    'StorageDescriptor': {
                        'Location': part_location,
                        'InputFormat': input_format,
                        'OutputFormat': output_format,
                        'SerdeInfo': serde_info
                    }
                }
                input_list.append(input_dict.copy())

        return input_list
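The function in Step 2 calls a `date_range` helper that the answer does not show. A minimal sketch of what it might look like, assuming it yields each date from `start_date` up to but not including `end_date` (the original helper may well have been inclusive; that detail is a guess):

```python
from datetime import date, timedelta


def date_range(start_date, end_date):
    """Yield every date from start_date (inclusive) to end_date (exclusive).

    Hypothetical helper: the answer uses date_range() without defining it,
    so the exclusive end bound is an assumption.
    """
    current = start_date
    while current < end_date:
        yield current
        current += timedelta(days=1)
```

If the last day must be covered as well, changing the comparison to `<=` would make the range inclusive.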

Step 3 - Call the batch_create_partition() API

    for each_input in break_list_into_chunks(partition_input_list, 100):
        create_partition_response = client.batch_create_partition(
            CatalogId=catalog_id,
            DatabaseName=l_database,
            TableName=l_table,
            PartitionInputList=each_input
        )

A single API call is limited to 100 partitions, so if you are creating more than 100 partitions you need to break the list into chunks and iterate over them.
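The `break_list_into_chunks` helper used in Step 3 is not shown in the answer. A minimal sketch of one possible version follows, together with a small wrapper that inspects the `Errors` list returned by `batch_create_partition`. The wrapper name `create_partitions_in_batches` and the decision to ignore `AlreadyExistsException` entries are assumptions for illustration, not part of the original script.

```python
def break_list_into_chunks(items, chunk_size):
    """Yield successive slices of items, each at most chunk_size long."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def create_partitions_in_batches(glue_client, database, table,
                                 partition_inputs, chunk_size=100):
    """Call batch_create_partition once per chunk and collect reported errors.

    glue_client is expected to be boto3.client('glue'). Partitions that
    already exist are skipped rather than treated as failures (an assumption
    made for this sketch).
    """
    failures = []
    for chunk in break_list_into_chunks(partition_inputs, chunk_size):
        response = glue_client.batch_create_partition(
            DatabaseName=database,
            TableName=table,
            PartitionInputList=chunk,
        )
        # Per-partition problems are reported in the response body
        for err in response.get('Errors', []):
            code = err.get('ErrorDetail', {}).get('ErrorCode')
            if code != 'AlreadyExistsException':
                failures.append(err)
    return failures
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real catalog.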

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html#Glue.Client.batch_create_partition

botchniaque · Answer

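You can configure your Glue crawler to be triggered every 5 minutes.
Alternatively, you can create a Lambda function that either runs on a schedule or is triggered by an event from your bucket (e.g. a putObject event); that function can call Athena to discover partitions: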
```
import boto3

athena = boto3.client('athena')

def lambda_handler(event, context):
    # Ask Athena to scan the table location and register any missing partitions
    athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE mytable",
        ResultConfiguration={
            'OutputLocation': "s3://some-bucket/_athena_results"
        }
    )
```

Finally, you can use Athena to add partitions manually. You can also run SQL queries through the API, as in the Lambda example above.

An example from the Athena manual:

    ALTER TABLE orders ADD
      PARTITION (dt = '2016-05-14', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_14_May_2016'
      PARTITION (dt = '2016-05-15', country = 'IN') LOCATION 's3://mystorage/path/to/INDIA_15_May_2016';
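To issue a statement like this from code rather than by hand, the query text can be built and submitted through the Athena API. The sketch below is one way to do it; the function names, the `IF NOT EXISTS` clause (added so that re-running is harmless), and the absence of any value escaping are all choices made for this illustration.

```python
def build_add_partition_query(table, partitions):
    """Build an ALTER TABLE ... ADD IF NOT EXISTS PARTITION statement.

    partitions is a list of (spec, location) pairs, where spec maps each
    partition column to its value. Values are quoted as strings; no escaping
    is attempted in this sketch, so only pass trusted input.
    """
    clauses = []
    for spec, location in partitions:
        cols = ", ".join("{} = '{}'".format(k, v) for k, v in spec.items())
        clauses.append("PARTITION ({}) LOCATION '{}'".format(cols, location))
    return "ALTER TABLE {} ADD IF NOT EXISTS {}".format(table, " ".join(clauses))


def run_athena_query(athena_client, query, database, output_location):
    """Submit a query and return its execution id.

    athena_client is expected to be boto3.client('athena'); the call is
    asynchronous, so the query may still be running when this returns.
    """
    response = athena_client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={'Database': database},
        ResultConfiguration={'OutputLocation': output_location},
    )
    return response['QueryExecutionId']
```

For example, `build_add_partition_query('orders', [({'dt': '2016-05-14', 'country': 'IN'}, 's3://mystorage/path/to/INDIA_14_May_2016')])` produces the first partition clause of the manual's example as a single statement.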

ravenblizzard · Answer

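This question is old, but I wanted to point out that you can have s3:ObjectCreated:Put notifications trigger a Lambda function that registers new partitions as data arrives on S3. You could even extend the function to handle removals based on object deletes, and so on. For more on S3 event notifications, see this AWS blog post: https://aws.amazon.com/blogs/aws/s3-event-notification/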
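A rough sketch of such a Lambda function is shown here. It assumes objects are laid out as `<table location>/YYYY/MM/DD/HH/<file>`, matching the year/month/day/hour partitions used earlier in this thread; `DATABASE`, `TABLE`, and the helper names are hypothetical placeholders. It reuses the table's own storage descriptor and relies on `batch_create_partition` reporting already-registered partitions in its `Errors` list instead of failing the whole call.

```python
import copy
from urllib.parse import unquote_plus

DATABASE = "my_database"  # hypothetical Glue database name
TABLE = "my_table"        # hypothetical Glue table name


def partition_from_key(s3_uri, table_location):
    """Return ([year, month, day, hour], partition_location), or None if the
    object does not sit under table_location/YYYY/MM/DD/HH/."""
    if not s3_uri.startswith(table_location):
        return None
    parts = s3_uri[len(table_location):].split("/")
    if len(parts) < 5:  # four partition folders plus the file name
        return None
    values = parts[:4]
    if not all(v.isdigit() for v in values):
        return None
    return values, table_location + "/".join(values) + "/"


def register_partitions(event, glue_client):
    """Register the partition of every object named in an S3 event."""
    table = glue_client.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
    descriptor = table["StorageDescriptor"]
    location = descriptor["Location"]
    if not location.endswith("/"):
        location += "/"

    inputs = {}  # keyed by partition values so duplicates collapse
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        parsed = partition_from_key("s3://{}/{}".format(bucket, key), location)
        if parsed is None:
            continue
        values, part_location = parsed
        part_descriptor = copy.deepcopy(descriptor)
        part_descriptor["Location"] = part_location
        inputs[tuple(values)] = {"Values": values,
                                 "StorageDescriptor": part_descriptor}

    if inputs:
        glue_client.batch_create_partition(
            DatabaseName=DATABASE,
            TableName=TABLE,
            PartitionInputList=list(inputs.values()),
        )
    return [list(v) for v in sorted(inputs)]


def lambda_handler(event, context):
    import boto3  # imported lazily so the helpers can be tested without AWS
    return register_partitions(event, boto3.client("glue"))
```

Because the Glue call is made on every matching event, this registers a partition the first time any object lands in it and is effectively a no-op afterwards; caching the set of known partitions would cut down on API calls for busy buckets.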