How to generate a schema from a CSV for a PostgreSQL COPY

Question

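If you have a CSV with several dozen or more columns, how can you create a "schema" that can be used in a PostgreSQL CREATE TABLE statement, so that the table can then be loaded with the COPY tool?

There are plenty of examples of the COPY tool and of basic CREATE TABLE statements, but none of them go into detail about the case where the number of columns is too large to write the schema by hand.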

Daniel Mahler · Accepted Answer

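If the CSV is not excessively large and is available on your local machine, then csvkit is the simplest solution. It also contains a number of other utilities for working with CSVs, so it is a useful tool to know in general.

At its simplest, typing this into the shell: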

$ csvsql myfile.csv

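will print out the required CREATE TABLE SQL command, which can be saved to a file using output redirection.

If you also provide a connection string, csvsql will create the table and upload the file in one go: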

$ csvsql --db "$MY_DB_URI" --insert myfile.csv
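The --db option takes a SQLAlchemy connection string. As a minimal sketch of what such a string might look like for PostgreSQL, the snippet below builds one in Python. The helper name postgres_uri and all of the credentials are made up for illustration; the point is only that special characters in the user name or password need to be URL-encoded before they go into the URI.

```python
from urllib.parse import quote_plus


def postgres_uri(user, password, host, port, database):
    """Build a SQLAlchemy-style PostgreSQL URI (hypothetical helper).

    The user name and password are URL-encoded so that characters such as
    "@" or "/" do not get confused with the URI's own separators.
    """
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# All values here are placeholders, not real credentials.
uri = postgres_uri("alice", "p@ss/word", "localhost", 5432, "mydb")
print(uri)  # postgresql://alice:p%40ss%2Fword@localhost:5432/mydb
```

The printed value is the kind of string that could be exported as MY_DB_URI before running the csvsql command above.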

There are also options to specify the flavor of SQL and CSV you are working with. They are documented in the built-in help:

$ csvsql -h
usage: csvsql [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
              [-p ESCAPECHAR] [-z MAXFIELDSIZE] [-e ENCODING] [-S] [-H] [-v]
              [--zero] [-y SNIFFLIMIT]
              [-i {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}]
              [--db CONNECTION_STRING] [--query QUERY] [--insert]
              [--tables TABLE_NAMES] [--no-constraints] [--no-create]
              [--blanks] [--no-inference] [--db-schema DB_SCHEMA]
              [FILE [FILE ...]]

Generate SQL statements for one or more CSV files, create execute those
statements directly on a database, and execute one or more SQL queries.

positional arguments:
  FILE                  The CSV file(s) to operate on. If omitted, will accept
                        input on STDIN.

optional arguments:
  -h, --help            show this help message and exit
  -d DELIMITER, --delimiter DELIMITER
                        Delimiting character of the input CSV file.
  -t, --tabs            Specifies that the input CSV file is delimited with
                        tabs. Overrides "-d".
  -q QUOTECHAR, --quotechar QUOTECHAR
                        Character used to quote strings in the input CSV file.
  -u {0,1,2,3}, --quoting {0,1,2,3}
                        Quoting style used in the input CSV file. 0 = Quote
                        Minimal, 1 = Quote All, 2 = Quote Non-numeric, 3 =
                        Quote None.
  -b, --doublequote     Whether or not double quotes are doubled in the input
                        CSV file.
  -p ESCAPECHAR, --escapechar ESCAPECHAR
                        Character used to escape the delimiter if --quoting 3
                        ("Quote None") is specified and to escape the
                        QUOTECHAR if --doublequote is not specified.
  -z MAXFIELDSIZE, --maxfieldsize MAXFIELDSIZE
                        Maximum length of a single field in the input CSV
                        file.
  -e ENCODING, --encoding ENCODING
                        Specify the encoding the input CSV file.
  -S, --skipinitialspace
                        Ignore whitespace immediately following the delimiter.
  -H, --no-header-row   Specifies that the input CSV file has no header row.
                        Will create default headers.
  -v, --verbose         Print detailed tracebacks when errors occur.
  --zero                When interpreting or displaying column numbers, use
                        zero-based numbering instead of the default 1-based
                        numbering.
  -y SNIFFLIMIT, --snifflimit SNIFFLIMIT
                        Limit CSV dialect sniffing to the specified number of
                        bytes. Specify "0" to disable sniffing entirely.
  -i {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}, --dialect {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}
                        Dialect of SQL to generate. Only valid when --db is
                        not specified.
  --db CONNECTION_STRING
                        If present, a sqlalchemy connection string to use to
                        directly execute generated SQL on a database.
  --query QUERY         Execute one or more SQL queries delimited by ";" and
                        output the result of the last query as CSV.
  --insert              In addition to creating the table, also insert the
                        data into the table. Only valid when --db is
                        specified.
  --tables TABLE_NAMES  Specify one or more names for the tables to be
                        created. If omitted, the filename (minus extension) or
                        "stdin" will be used.
  --no-constraints      Generate a schema without length limits or null
                        checks. Useful when sampling big tables.
  --no-create           Skip creating a table. Only valid when --insert is
                        specified.
  --blanks              Do not coerce empty strings to NULL values.
  --no-inference        Disable type inference when parsing the input.
  --db-schema DB_SCHEMA
                        Optional name of database schema to create table(s)
                        in.

Several other tools also do schema inference, including:

Apache Spark
Pandas (Python)
Blaze (Python)
read.csv + your favorite db package in R

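Each of these can read a CSV (and other formats) into a tabular data structure, usually called a DataFrame or something similar, inferring the column types in the process. They then have other commands to either write out an equivalent SQL schema or upload the DataFrame directly into a specified database. The choice of tool will depend on the volume of data, how it is stored, the idiosyncrasies of your CSV, the target database, and the language you prefer to work in.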
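To make the inference step concrete, here is a minimal sketch that uses only the Python standard library rather than any of the tools listed above. It reads a CSV, guesses a type for each column from all of its non-empty values, and prints a CREATE TABLE statement. The function names (infer_type, create_table_sql), the small set of candidate types, and the sample data are assumptions made for illustration; real tools cover many more cases.

```python
import csv
import io
import re
from datetime import datetime

INT_RE = re.compile(r"[+-]?\d+", re.ASCII)
NUM_RE = re.compile(r"[+-]?\d*\.\d+", re.ASCII)


def _matches_format(value, fmt):
    """Return True if value parses with the given strptime format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# Candidate types, tried in order; the first one that accepts every
# non-empty value in a column wins. Anything else falls back to text.
CHECKS = [
    ("integer", lambda v: INT_RE.fullmatch(v) is not None),
    ("numeric", lambda v: (INT_RE.fullmatch(v) or NUM_RE.fullmatch(v)) is not None),
    ("date", lambda v: _matches_format(v, "%Y-%m-%d")),
    ("timestamp", lambda v: _matches_format(v, "%Y-%m-%d %H:%M:%S")),
]


def infer_type(values):
    """Guess a SQL type name for a list of string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "text"
    for type_name, check in CHECKS:
        if all(check(v) for v in non_empty):
            return type_name
    return "text"


def quote_ident(name):
    """Double-quote an identifier, doubling any embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_table_sql(csv_text, table_name):
    """Build a CREATE TABLE statement from CSV text that has a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = [
        f"    {quote_ident(name)} {infer_type([row[i] for row in data])}"
        for i, name in enumerate(header)
    ]
    return f"CREATE TABLE {quote_ident(table_name)} (\n" + ",\n".join(columns) + "\n);"


# Illustrative sample data, not taken from a real file.
SAMPLE = (
    "id,a_text,a_date,a_timestamp,an_array\n"
    '1,str 1,2016-08-01,2016-08-01 10:10:10,"{1,2}"\n'
    '2,str 2,2016-08-02,2016-08-02 10:10:10,"{1,2,3}"\n'
)
print(create_table_sql(SAMPLE, "new_table"))
```

For this sample the id column comes out as integer, a_date as date, a_timestamp as timestamp, and the remaining columns as text. Scanning every value, rather than only the first row, is what lets a single odd value push a column back to text.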

klin · Answer

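Basically, you should prepare the data (including its structure) outside the database, either with ready-made tools or using Python, Ruby, or a language of your choice. However, when that is not an option, you can do a lot using plpgsql.

Creating a table with text columns

Files in CSV format do not contain any information about column types, primary or foreign keys, and so on. You can relatively easily create a table with text columns and copy the data into it. After that, you have to alter the column types and add constraints manually.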

create or replace function import_csv(csv_file text, table_name text)
returns void language plpgsql as $$
begin
    -- load the raw lines of the file into a one-column temp table
    create temp table import (line text) on commit drop;
    execute format('copy import from %L', csv_file);

    -- turn the header (first line) into "col1 text, col2 text, ..."
    execute format('create table %I (%s);',
        table_name, concat(replace(line, ',', ' text, '), ' text'))
    from import limit 1;

    -- load the file again, this time as CSV into the new table
    execute format('copy %I from %L (format csv, header)', table_name, csv_file);
end $$;

Sample data in the file c:\data\test.csv:

id,a_text,a_date,a_timestamp,an_array
1,str 1,2016-08-01,2016-08-01 10:10:10,"{1,2}"
2,str 2,2016-08-02,2016-08-02 10:10:10,"{1,2,3}"
3,str 3,2016-08-03,2016-08-03 10:10:10,"{1,2,3,4}"

Import:

select import_csv('c:\data\test.csv', 'new_table');

select * from new_table;

 id | a_text |   a_date   |     a_timestamp     | an_array
----+--------+------------+---------------------+-----------
 1  | str 1  | 2016-08-01 | 2016-08-01 10:10:10 | {1,2}
 2  | str 2  | 2016-08-02 | 2016-08-02 10:10:10 | {1,2,3}
 3  | str 3  | 2016-08-03 | 2016-08-03 10:10:10 | {1,2,3,4}
(3 rows)

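Large csv files

The function above imports the data twice (into the temporary table and into the target table). For large files this can be a serious waste of time and an unnecessary load on the server. A solution is to split the csv file into two files, one with the header and the other with the data. The function then looks like this: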

create or replace function import_csv(header_file text, data_file text, table_name text)
returns void language plpgsql as $$
begin
    -- only the small header file goes through the temp table
    create temp table import (line text) on commit drop;
    execute format('copy import from %L', header_file);

    -- turn the header line into "col1 text, col2 text, ..."
    execute format('create table %I (%s);',
        table_name, concat(replace(line, ',', ' text, '), ' text'))
    from import;

    -- the data file has no header row, so it is loaded as plain CSV
    execute format('copy %I from %L (format csv)', table_name, data_file);
end $$;
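The split into a header file and a data file can be done with any tool. As a minimal sketch in Python, assuming the header fits on the first physical line (no quoted line breaks inside column names), the hypothetical helper split_csv below writes the first line to one file and streams the rest to another. The file names are placeholders.

```python
import os
import shutil
import tempfile


def split_csv(csv_path, header_path, data_path):
    """Write the first line of csv_path to header_path, the rest to data_path.

    newline="" keeps line endings exactly as they are in the source file.
    """
    with open(csv_path, "r", newline="", encoding="utf-8") as src:
        with open(header_path, "w", newline="", encoding="utf-8") as header_out:
            header_out.write(src.readline())
        with open(data_path, "w", newline="", encoding="utf-8") as data_out:
            # Copy in chunks so a large file is never held in memory at once.
            shutil.copyfileobj(src, data_out)


# Small demonstration in a temporary directory (placeholder file names).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "test.csv")
    header_path = os.path.join(tmp, "test_header.csv")
    data_path = os.path.join(tmp, "test_data.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        f.write("id,a_text\n1,str 1\n2,str 2\n")
    split_csv(csv_path, header_path, data_path)
    with open(header_path, newline="", encoding="utf-8") as f:
        header_text = f.read()
    with open(data_path, newline="", encoding="utf-8") as f:
        data_text = f.read()

print(repr(header_text))  # 'id,a_text\n'
print(repr(data_text))    # '1,str 1\n2,str 2\n'
```

Both resulting files still have to be readable by the PostgreSQL server process, because COPY ... FROM with a file name reads the file on the server side.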

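Altering column types

You can try to change the column types automatically based on their content. This can succeed if you are dealing with simple types and the data in the file consistently keeps a specific format. In general, however, it is a complicated task, and the functions listed below should be considered only as an example.

Determine a column type based on its content (edit the function to add the conversions you need):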

create or replace function column_type(val text)
returns text language sql as $$
    -- returns null when no pattern matches, i.e. the column stays text
    select
        case
            when val ~ '^[\+-]{0,1}\d+$' then 'integer'
            when val ~ '^[\+-]{0,1}\d*\.\d+$' then 'numeric'
            when val ~ '^\d\d\d\d-\d\d-\d\d$' then 'date'
            when val ~ '^\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d$' then 'timestamp'
        end
$$;

Alter the column types using the function above:

create or replace function alter_column_types(table_name text)
returns void language plpgsql as $$
declare
    rec record;
    qry text;
begin
    -- inspect the first row only: one (key, value) pair per column
    for rec in
        execute format(
            'select key, column_type(value) ctype
            from (
                select row_to_json(t) a_row
                from %I t
                limit 1
            ) s,
            json_each_text (a_row)',
            table_name)
    loop
        -- columns with no recognized type are left as text
        if rec.ctype is not null then
            -- append one "alter table" statement per column;
            -- %I quotes the column name in both places where it is used
            qry:= format(
                '%salter table %I alter %I type %s using %I::%s;',
                qry, table_name, rec.key, rec.ctype, rec.key, rec.ctype);
        end if;
    end loop;
    execute(qry);
end $$;

Use:

select alter_column_types('new_table');

\d new_table

       Table "public.new_table"
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 id          | integer                     |
 a_text      | text                        |
 a_date      | date                        |
 a_timestamp | timestamp without time zone |
 an_array    | text                        |

(Well, proper recognition of array types is quite complicated.)
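To see what statement text alter_column_types ends up executing, the following sketch reproduces the string building in Python, outside the database. It applies the same four patterns as column_type to the first sample row and prints one ALTER TABLE statement per recognized column. It only builds strings and never connects to PostgreSQL. The helper names are assumptions, and unlike format's %I, which quotes only when necessary, this sketch always double-quotes identifiers.

```python
import re

# Same patterns and order as the column_type SQL function. fullmatch()
# plays the role of the ^...$ anchors; re.ASCII keeps \d to the digits 0-9.
PATTERNS = [
    (re.compile(r"[+-]?\d+", re.ASCII), "integer"),
    (re.compile(r"[+-]?\d*\.\d+", re.ASCII), "numeric"),
    (re.compile(r"\d\d\d\d-\d\d-\d\d", re.ASCII), "date"),
    (re.compile(r"\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d", re.ASCII), "timestamp"),
]


def column_type(val):
    """Return a type name for val, or None if no pattern matches."""
    for pattern, type_name in PATTERNS:
        if pattern.fullmatch(val):
            return type_name
    return None


def quote_ident(name):
    """Always double-quote an identifier (a simplification of %I)."""
    return '"' + name.replace('"', '""') + '"'


def alter_statements(table_name, first_row):
    """Build one ALTER TABLE statement per column with a recognized type."""
    statements = []
    for column, value in first_row.items():
        ctype = column_type(value)
        if ctype is not None:
            col = quote_ident(column)
            statements.append(
                f"alter table {quote_ident(table_name)} "
                f"alter {col} type {ctype} using {col}::{ctype};"
            )
    return statements


# First row of the sample file, with every value still a string.
first_row = {
    "id": "1",
    "a_text": "str 1",
    "a_date": "2016-08-01",
    "a_timestamp": "2016-08-01 10:10:10",
    "an_array": "{1,2}",
}
statements = alter_statements("new_table", first_row)
for statement in statements:
    print(statement)
```

Three statements are printed, for id, a_date and a_timestamp; a_text and an_array match no pattern and therefore stay text, which is the same result as the \d output above. Because only the first row is inspected, a later row in a different format would make the real ALTER TABLE fail on its cast.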