Complete scan of DynamoDB with boto3

Question

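My table is around 220 MB and holds 250k records. I'm trying to pull all of this data into Python. I realize this has to be a chunked batch process that I loop through, but I'm not sure how to make each batch start where the previous one left off.

Is there some way to filter my scan? From what I've read, filtering occurs after loading, and loading stops at 1 MB, so I wouldn't actually be able to scan in new objects.

Any assistance would be appreciated.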

    import boto3

    dynamodb = boto3.resource(
        'dynamodb',
        aws_session_token=aws_session_token,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name=region,
    )

    table = dynamodb.Table('widgetsTableName')

    data = table.scan()

Tay B · Answer

I think the Amazon DynamoDB documentation on table scanning answers your question.

In short, you need to check the response for LastEvaluatedKey. Here is an example based on your code:

    import boto3

    dynamodb = boto3.resource(
        'dynamodb',
        aws_session_token=aws_session_token,
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name=region,
    )

    table = dynamodb.Table('widgetsTableName')

    response = table.scan()
    data = response['Items']

    # Keep scanning from the last evaluated key until no key is returned
    while 'LastEvaluatedKey' in response:
        response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
        data.extend(response['Items'])

Jordon Phillips · Answer

boto3 offers paginators that handle all the pagination details for you. The documentation page for the scan paginator has the details. Basically, you would use it like so:

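    import boto3

    client = boto3.client('dynamodb')
    paginator = client.get_paginator('scan')

    # TableName is required by Scan; the name here is the one from the question
    for page in paginator.paginate(TableName='widgetsTableName'):
        items = page['Items']  # do something with the items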
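For a table of this size it may help to cap the page size or the total number of items fetched. The following is a minimal sketch, assuming the paginator's `PaginationConfig` argument with its `PageSize` and `MaxItems` keys. The helper name `iter_scan_items` is hypothetical, and the client is passed in as a parameter so the function can be tried without AWS access.

```python
def iter_scan_items(client, table_name, page_size=None, max_items=None):
    """Yield every item from a paginated Scan, one item at a time."""
    paginator = client.get_paginator('scan')

    config = {}
    if page_size is not None:
        config['PageSize'] = page_size   # page size requested per Scan call
    if max_items is not None:
        config['MaxItems'] = max_items   # stop after this many items overall

    pages = paginator.paginate(TableName=table_name, PaginationConfig=config)
    for page in pages:
        # A page may be empty; yield from simply moves on to the next one
        yield from page['Items']


# Hypothetical usage, assuming a configured boto3 client:
# client = boto3.client('dynamodb')
# for item in iter_scan_items(client, 'widgetsTableName', page_size=500):
#     ...
```

Because this is a generator, items are handed to the caller as each page arrives instead of being collected in one large list first.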

Abe Voelker · Answer

Building on Jordon Phillips's answer, here is how you would pass a FilterExpression along with the pagination:

    import boto3

    client = boto3.client('dynamodb')
    paginator = client.get_paginator('scan')
    operation_parameters = {
        'TableName': 'foo',
        'FilterExpression': 'bar > :x AND bar < :y',
        'ExpressionAttributeValues': {
            ':x': {'S': '2017-01-31T01:35'},
            ':y': {'S': '2017-01-31T02:08'},
        },
    }

    page_iterator = paginator.paginate(**operation_parameters)
    for page in page_iterator:
        items = page['Items']  # do something with the items
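On the question's concern about filtering: DynamoDB applies a FilterExpression after the items have been read, so a page can come back with few or no items while LastEvaluatedKey is still present, and the scan simply has to continue. Each Scan response reports `Count` (items returned) and `ScannedCount` (items read before filtering). A minimal sketch that tallies both across pages follows. The helper name `collect_filtered` is hypothetical, and it accepts any iterable of response pages.

```python
def collect_filtered(pages):
    """Gather items from Scan pages and tally returned vs. scanned counts."""
    items = []
    returned = 0
    scanned = 0
    for page in pages:
        # An empty page is normal when the filter rejects everything in it
        items.extend(page.get('Items', []))
        returned += page.get('Count', 0)
        scanned += page.get('ScannedCount', 0)
    return items, returned, scanned


# Hypothetical usage with the paginator from the answer above:
# items, returned, scanned = collect_filtered(
#     paginator.paginate(**operation_parameters))
```

A large gap between the scanned and returned totals suggests the filter is discarding most of what is read, which still consumes read capacity.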

Vincent · Answer

Code for stripping the DynamoDB type descriptors, as @kungphu mentioned:

    import boto3
    from boto3.dynamodb.types import TypeDeserializer
    from boto3.dynamodb.transform import TransformationInjector

    client = boto3.client('dynamodb')
    paginator = client.get_paginator('query')
    service_model = client._service_model.operation_model('Query')
    trans = TransformationInjector(deserializer=TypeDeserializer())

    # paginate() still needs the usual Query arguments (TableName, a key condition, ...)
    for page in paginator.paginate():
        # Rewrites the typed attribute values in the page in place
        trans.inject_attribute_value_output(page, service_model)

CJ_Spaz · Answer

I figured out that Boto3 captures the "LastEvaluatedKey" as part of the returned response. It can be used as the starting point for the next scan:

    data = table.scan(
        ExclusiveStartKey=data['LastEvaluatedKey']
    )

I plan to build a loop around this that runs until the response no longer includes a LastEvaluatedKey.

Dan Hook · Answer

I had some problems with Vincent's answer: the transformation was also applied to the LastEvaluatedKey, which messed up the pagination. I solved it as follows:

    import boto3
    from boto3.dynamodb.types import TypeDeserializer
    from boto3.dynamodb.transform import TransformationInjector

    client = boto3.client('dynamodb')
    paginator = client.get_paginator('scan')
    operation_model = client._service_model.operation_model('Scan')
    trans = TransformationInjector(deserializer=TypeDeserializer())
    operation_parameters = {
        'TableName': 'tablename',
    }
    items = []

    for page in paginator.paginate(**operation_parameters):
        # Save the untransformed key before the page is rewritten in place
        has_last_key = 'LastEvaluatedKey' in page
        if has_last_key:
            last_key = page['LastEvaluatedKey'].copy()
        trans.inject_attribute_value_output(page, operation_model)
        # Restore the original key so the paginator can continue from it
        if has_last_key:
            page['LastEvaluatedKey'] = last_key
        items.extend(page['Items'])
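An alternative that sidesteps the problem is to leave the page untouched and deserialize only the items. This is a minimal sketch, assuming the public `deserialize` method of `TypeDeserializer` from `boto3.dynamodb.types`. The helper name `plain_items` is hypothetical, and the deserialize callable is passed in so the helper itself has no boto3 dependency.

```python
def plain_items(page, deserialize):
    """Convert one page of typed items into plain Python dictionaries.

    `deserialize` turns a single typed attribute value, such as
    {'S': 'abc'}, into its Python equivalent. The page itself, including
    any LastEvaluatedKey, is not modified.
    """
    return [
        {name: deserialize(value) for name, value in item.items()}
        for item in page.get('Items', [])
    ]


# Hypothetical usage inside the pagination loop:
# deserializer = TypeDeserializer()
# items.extend(plain_items(page, deserializer.deserialize))
```

With the boto3 deserializer, numeric attributes come back as `decimal.Decimal` values, so they may need converting before further use.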

Richard · Answer

DynamoDB limits the scan method to 1 MB of data per call.

Documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html#DynamoDB.Client.scan

Here is an example loop that uses LastEvaluatedKey to fetch all the data from a DynamoDB table:

    import boto3

    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('your_table_name')

    has_items = True
    last_key = False

    while has_items:
        if last_key:
            data = table.scan(ExclusiveStartKey=last_key)
        else:
            data = table.scan()

        if 'LastEvaluatedKey' in data:
            has_items = True
            last_key = data['LastEvaluatedKey']
        else:
            has_items = False
            last_key = False

        # TODO do something with data['Items'] here.

YitzikC · Answer

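The two approaches suggested above both have problems: you either write lengthy, repetitive code that handles paging explicitly in a loop, or you use Boto paginators with low-level sessions and forgo the advantages of the higher-level Boto objects.

A solution that uses functional-style Python to provide a high-level abstraction lets you keep using the higher-level Boto methods while hiding the complexity of AWS paging: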

    import itertools
    import typing


    def iterate_result_pages(function_returning_response: typing.Callable, *args, **kwargs) -> typing.Generator:
        """A wrapper for functions using AWS paging, that returns a generator
        which yields a sequence of items for every response.

        Args:
            function_returning_response: A function (or callable) that returns an
                AWS response with 'Items' and optionally 'LastEvaluatedKey'.
                This could be a bound method of an object.

        Returns:
            A generator which yields the 'Items' field of the result for every response.
        """
        response = function_returning_response(*args, **kwargs)
        yield response["Items"]
        while "LastEvaluatedKey" in response:
            kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
            response = function_returning_response(*args, **kwargs)
            yield response["Items"]

        return


    def iterate_paged_results(function_returning_response: typing.Callable, *args, **kwargs) -> typing.Iterator:
        """A wrapper for functions using AWS paging, that returns an iterator of
        all the items in the responses. Items are yielded to the caller as soon
        as they are received.

        Args:
            function_returning_response: A function (or callable) that returns an
                AWS response with 'Items' and optionally 'LastEvaluatedKey'.
                This could be a bound method of an object.

        Returns:
            An iterator which yields one response item at a time.
        """
        return itertools.chain.from_iterable(
            iterate_result_pages(function_returning_response, *args, **kwargs))


    # Example, assuming 'table' is a Boto DynamoDB table object.
    # The callable to page through (here table.scan) goes first.
    all_items = list(iterate_paged_results(table.scan, ProjectionExpression='my_field'))