Write a Pandas DataFrame to Google Cloud Storage or BigQuery

Question

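Hello, and thank you for your time and consideration. I am developing a Jupyter Notebook in Google Cloud Platform / Datalab. I have created a Pandas DataFrame and would like to write it to Google Cloud Storage (GCS), BigQuery, or both. I have a bucket in GCS and have created the following objects with this code: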

    import gcp
    import gcp.storage as storage

    project = gcp.Context.default().project_id
    bucket_name = 'steve-temp'
    bucket_path = bucket_name
    bucket = storage.Bucket(bucket_path)
    bucket.exists()

I have tried various approaches based on the Google Datalab documentation, but they keep failing. Thanks.

Anthonios Partheniou · Accepted Answer

Try the following working example:

    from datalab.context import Context
    import google.datalab.storage as storage
    import google.datalab.bigquery as bq
    import pandas as pd

    # DataFrame to write (rows are lists; sets do not guarantee element order)
    simple_dataframe = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

    sample_bucket_name = Context.default().project_id + '-datalab-example'
    sample_bucket_path = 'gs://' + sample_bucket_name
    sample_bucket_object = sample_bucket_path + '/Hello.txt'
    bigquery_dataset_name = 'TestDataSet'
    bigquery_table_name = 'TestTable'

    # Define storage bucket
    sample_bucket = storage.Bucket(sample_bucket_name)

    # Create storage bucket if it does not exist
    if not sample_bucket.exists():
        sample_bucket.create()

    # Define BigQuery dataset and table
    dataset = bq.Dataset(bigquery_dataset_name)
    table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

    # Create BigQuery dataset
    if not dataset.exists():
        dataset.create()

    # Create or overwrite the existing table if it exists
    table_schema = bq.Schema.from_data(simple_dataframe)
    table.create(schema=table_schema, overwrite=True)

    # Write the DataFrame to GCS (Google Cloud Storage)
    %storage write --variable simple_dataframe --object $sample_bucket_object

    # Write the DataFrame to a BigQuery table
    table.insert(simple_dataframe)

I used this example, and the _table.py file from the datalab GitHub site, as a reference. You can find other datalab source code files at this link.
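To make the schema step in the example above more concrete, here is a rough sketch of what inferring a table schema from a DataFrame involves: each column's dtype is mapped to a BigQuery column type. The `KIND_TO_BQ_TYPE` mapping and the `sketch_schema` helper are hypothetical and written only for illustration; they are not the datalab implementation of `bq.Schema.from_data`, which may differ in detail.

```python
import pandas as pd

# Assumed mapping from NumPy dtype "kind" codes to BigQuery column types.
# Anything not listed here falls back to STRING.
KIND_TO_BQ_TYPE = {
    "i": "INTEGER",    # signed integers
    "u": "INTEGER",    # unsigned integers
    "f": "FLOAT",      # floating point
    "b": "BOOLEAN",    # booleans
    "M": "TIMESTAMP",  # datetime64
}


def sketch_schema(df):
    """Return a list of {'name', 'type'} dicts, one per DataFrame column."""
    return [
        {"name": str(col), "type": KIND_TO_BQ_TYPE.get(dtype.kind, "STRING")}
        for col, dtype in df.dtypes.items()
    ]


df = pd.DataFrame([[1, 2.5, "x"], [4, 5.5, "y"]], columns=["a", "b", "c"])
schema = sketch_schema(df)
```

Checking the inferred types before calling `table.create` can help catch a column that was read as text when a number was expected.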

Jan Krynauw · Answer

Using the Google Cloud Datalab documentation:

    import datalab.storage as gcs

    gcs.Bucket('bucket-name').item('to/data.csv').write_to(simple_dataframe.to_csv(), 'text/csv')

Ekaba Bisong · Answer

Writing a Pandas DataFrame to BigQuery

An update to @Anthonios Partheniou's answer.
The code is slightly different now, as of Nov. 29, 2017.

To define a BigQuery dataset

Pass a tuple containing project_id and dataset_id to bq.Dataset.

    # Define a BigQuery dataset
    bigquery_dataset_name = ('project_id', 'dataset_id')
    dataset = bq.Dataset(name=bigquery_dataset_name)

To define a BigQuery table

Pass a tuple containing project_id, dataset_id and the table name to bq.Table.

    # Define a BigQuery table
    bigquery_table_name = ('project_id', 'dataset_id', 'table_name')
    table = bq.Table(bigquery_table_name)

Create the dataset and table, then write to the table in BigQuery

    # Create BigQuery dataset
    if not dataset.exists():
        dataset.create()

    # Create or overwrite the existing table if it exists
    table_schema = bq.Schema.from_data(dataFrame_name)
    table.create(schema=table_schema, overwrite=True)

    # Write the DataFrame to a BigQuery table
    table.insert(dataFrame_name)

Porada Kev · Answer

There is a slightly simpler solution for this task using Dask. You can convert your DataFrame to a Dask DataFrame, which can be written to CSV on Cloud Storage:

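    import dask.dataframe as dd
    import pandas

    # df is your existing Pandas DataFrame
    ddf = dd.from_pandas(df, npartitions=1, sort=True)

    # Note: `gcs` is not defined in the original answer; it is presumably an
    # authenticated gcsfs.GCSFileSystem instance.
    ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False, sep=',', header=False,
               storage_options={'token': gcs.session.credentials})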

intotecho · Answer

Since 2017, Pandas has had a DataFrame-to-BigQuery function: pandas.DataFrame.to_gbq

The documentation has an example:

    import pandas_gbq as gbq

    # projectid is your Google Cloud project ID
    gbq.to_gbq(df, 'my_dataset.my_table', projectid, if_exists='fail')

The if_exists parameter can be set to 'fail', 'replace' or 'append'.
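As a minimal sketch of how the three modes might be used, the wrapper below checks the mode before calling pandas-gbq. With 'fail' the call raises an error if the table already exists, 'replace' drops and recreates the table before inserting, and 'append' adds rows to an existing table. The names `write_to_bigquery` and `VALID_MODES` are hypothetical helpers introduced for illustration; only `pandas_gbq.to_gbq` comes from the library.

```python
VALID_MODES = ("fail", "replace", "append")


def write_to_bigquery(df, destination_table, project_id, if_exists="fail"):
    """Write `df` to BigQuery via pandas-gbq, validating `if_exists` first."""
    if if_exists not in VALID_MODES:
        raise ValueError(
            f"if_exists must be one of {VALID_MODES}, got {if_exists!r}"
        )
    # Imported lazily so the validation above works even where
    # pandas-gbq is not installed.
    import pandas_gbq

    pandas_gbq.to_gbq(
        df, destination_table, project_id=project_id, if_exists=if_exists
    )


# Example call (needs credentials and a real project, so it is left commented out):
# write_to_bigquery(df, 'my_dataset.my_table', 'my-project', if_exists='append')
```

Validating the mode up front turns a typo such as 'overwrite' into an immediate local error rather than a failure partway through an upload.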

See also this example.

Theo · Answer

Uploading to Google Cloud Storage without writing a temporary file, using only the standard GCS module:

    from google.cloud import storage
    import os
    import pandas as pd

    # Only need this if you're running this code locally.
    os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'/your_GCP_creds/credentials.json'

    # Rows are lists; sets do not guarantee element order
    df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

    client = storage.Client()
    bucket = client.get_bucket('my-bucket-name')
    bucket.blob('upload_test/test.csv').upload_from_string(df.to_csv(), 'text/csv')
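A short note on why no temporary file is needed: when `to_csv` is called without a path, it returns the CSV text as a `str`, which can be passed straight to the upload call. By default it also writes the index as an unnamed first column, which may or may not be wanted in the uploaded file. The small frame below is illustrative only.

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])

# With no path argument, to_csv returns the CSV content as a string.
with_index = df.to_csv()

# index=False drops the leading, unnamed index column.
without_index = df.to_csv(index=False)

# Compare the header rows of the two variants.
header_with_index = with_index.splitlines()[0]
header_without_index = without_index.splitlines()[0]
```

Here `header_with_index` starts with an empty field for the index, while `header_without_index` contains only the three column names.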

dartdog · Answer

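I think you need to load it into a plain bytes variable and use %%storage write --variable $sample_bucketpath (see the documentation) in a separate cell... I'm still figuring it out... But that is roughly the inverse of what I needed to do to read a CSV file in. I don't know whether it makes a difference on write, but I had to use BytesIO to read the buffer created by the %%storage read command... Hope it helps, let me know!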
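The bytes round trip described above can be sketched without any storage service at all. This assumes the data is UTF-8 CSV; `payload` stands in for whatever bytes a storage read would return, and nothing here calls the Datalab magics.

```python
from io import BytesIO

import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["a", "b", "c"])

# Writing: serialize to CSV text, then encode to the bytes that would be uploaded.
payload = df.to_csv(index=False).encode("utf-8")

# Reading: wrap the downloaded bytes in BytesIO so pandas can treat them like a file.
restored = pd.read_csv(BytesIO(payload))
```

After the round trip, `restored` holds the same values and column names as `df`.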