Efficient query to get the greatest value per group from a big table

Question

Given the table:

        Column    |            Type
    --------------+-----------------------------
     id           | integer
     latitude     | numeric(9,6)
     longitude    | numeric(9,6)
     speed        | integer
     equipment_id | integer
     created_at   | timestamp without time zone
    Indexes:
        "geoposition_records_pkey" PRIMARY KEY, btree (id)

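The table has 20 million records, which is not, relatively speaking, a large number. But it makes sequential scans slow.

How can I get the last record (max(created_at)) of each _equipment_id_?

I have tried both of the following queries, along with several variants taken from the many answers I have read on this topic: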

    select max(created_at),equipment_id
    from geoposition_records
    group by equipment_id;

    select distinct on (equipment_id) equipment_id,created_at
    from geoposition_records
    order by equipment_id, created_at desc;

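I have also tried creating a btree index on _equipment_id,created_at_, but Postgres decides that a seqscan is faster. Forcing _enable_seqscan = off_ does not help either, since reading the index is at least as slow as the seq scan.

The query must run periodically and always return the latest record per _equipment_id_.

Using Postgres 9.3.

Explain/analyze (with 1.7 million records):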

    set enable_seqscan=true;
    explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
    "HashAggregate  (cost=47803.77..47804.34 rows=57 width=12) (actual time=1935.536..1935.556 rows=58 loops=1)"
    "  ->  Seq Scan on geoposition_records  (cost=0.00..39544.51 rows=1651851 width=12) (actual time=0.029..494.296 rows=1651851 loops=1)"
    "Total runtime: 1935.632 ms"

    set enable_seqscan=false;
    explain analyze select max(created_at),equipment_id from geoposition_records group by equipment_id;
    "GroupAggregate  (cost=0.00..2995933.57 rows=57 width=12) (actual time=222.034..11305.073 rows=58 loops=1)"
    "  ->  Index Scan using geoposition_records_equipment_id_created_at_idx on geoposition_records  (cost=0.00..2987673.75 rows=1651851 width=12) (actual time=0.062..10248.703 rows=1651851 loops=1)"
    "Total runtime: 11305.161 ms"

Erwin Brandstetter · Answer

A plain multicolumn B-tree index should work after all:

    CREATE INDEX foo_idx
    ON geoposition_records (equipment_id, created_at DESC NULLS LAST);

Why _DESC NULLS LAST_? See:

Unused index in range of dates query
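
In short, assuming created_at may contain NULLs (the table definition shown does not say otherwise): PostgreSQL sorts NULLs first in descending order by default, so the index and the query's ORDER BY both need to say DESC NULLS LAST for a single index probe to land on the latest non-null timestamp. As a rough check, one can look at the plan for a single device. The value 42 below is a made-up equipment_id, and the index foo_idx from above is assumed to exist.

```sql
-- Sketch only: 42 is a hypothetical equipment_id, foo_idx is assumed to exist.
-- The ORDER BY matches the index definition (created_at DESC NULLS LAST).
EXPLAIN (ANALYZE, BUFFERS)
SELECT created_at
FROM   geoposition_records
WHERE  equipment_id = 42
ORDER  BY created_at DESC NULLS LAST
LIMIT  1;
```

If the index is picked up, the plan would be expected to show a Limit node on top of an index scan (or index-only scan) on foo_idx rather than a Seq Scan.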

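Function

If you can't talk sense into the query planner, a function looping through the equipment table should do the trick. Looking up one equipment_id at a time uses the index. For a small number of them (57, judging from your _EXPLAIN ANALYZE_ output), that is fast.
Is it safe to assume you have an equipment table?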

    CREATE OR REPLACE FUNCTION f_latest_equip()
      RETURNS TABLE (equipment_id int, latest timestamp) AS
    $func$
    BEGIN
    FOR equipment_id IN
       SELECT e.equipment_id FROM equipment e ORDER BY 1
    LOOP
       SELECT g.created_at
       FROM   geoposition_records g
       WHERE  g.equipment_id = f_latest_equip.equipment_id  -- prepend function name to disambiguate
       ORDER  BY g.created_at DESC NULLS LAST
       LIMIT  1
       INTO   latest;

       RETURN NEXT;
    END LOOP;
    END
    $func$  LANGUAGE plpgsql STABLE;

Makes for a nice call, too:

    SELECT * FROM f_latest_equip();
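
The function relies on an equipment table with an equipment_id column, which the question does not show. If no such table exists, a minimal stand-in could be built once from the existing data. The table definition below is an assumption for illustration only, not something taken from the question.

```sql
-- Hypothetical stand-in; a real equipment table would likely carry more columns.
CREATE TABLE equipment (
   equipment_id integer PRIMARY KEY
);

-- One-time sequential scan over the big table to collect the distinct ids.
INSERT INTO equipment (equipment_id)
SELECT DISTINCT equipment_id
FROM   geoposition_records
WHERE  equipment_id IS NOT NULL;
```

A table built this way goes stale as soon as a new device starts reporting, so it would have to be kept up to date alongside the inserts.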

Correlated subqueries

Come to think of it, with this equipment table you can let lowly correlated subqueries do the dirty work, to great effect:

    SELECT equipment_id
         ,(SELECT created_at
           FROM   geoposition_records
           WHERE  equipment_id = eq.equipment_id
           ORDER  BY created_at DESC NULLS LAST
           LIMIT  1) AS latest
    FROM   equipment eq;

Performance is very good.

`LATERAL` join in Postgres 9.3+

    SELECT eq.equipment_id, r.latest
    FROM   equipment eq
    LEFT   JOIN LATERAL (
       SELECT created_at
       FROM   geoposition_records
       WHERE  equipment_id = eq.equipment_id
       ORDER  BY created_at DESC NULLS LAST
       LIMIT  1
       ) r(latest) ON true;

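Detailed explanation:

Optimize GROUP BY query to retrieve latest record per user

Performance is similar to the correlated subquery. A comparison of the performance of max(), _DISTINCT ON_, the function, the correlated subquery and LATERAL is in this:

SQL Fiddle.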
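
One practical difference between the two forms: a scalar correlated subquery can return only one column, while the LATERAL subquery can return the whole latest row. A minimal sketch, assuming the same equipment table with an equipment_id column as above (an assumption, since the question does not show it):

```sql
-- Sketch: latest position per device, not just the latest timestamp.
-- Relies on the index on (equipment_id, created_at DESC NULLS LAST).
SELECT eq.equipment_id, r.created_at, r.latitude, r.longitude, r.speed
FROM   equipment eq
LEFT   JOIN LATERAL (
   SELECT g.created_at, g.latitude, g.longitude, g.speed
   FROM   geoposition_records g
   WHERE  g.equipment_id = eq.equipment_id
   ORDER  BY g.created_at DESC NULLS LAST
   LIMIT  1
   ) r ON true;
```

The LEFT JOIN ... ON true keeps devices that have no position records at all; their r columns come back as NULL.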

Colin 't Hart · Answer

Attempt 1

If

I have a separate equipment table, and
I have an index on geoposition_records(equipment_id, created_at desc),

then the following works for me:

    select id as equipment_id,
       (select max(created_at)
        from geoposition_records
        where equipment_id = equipment.id
       ) as max_created_at
    from equipment;

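I wasn't able to force PG to run a fast query that determines both the list of _equipment_id_ values and the related max(created_at). But I'm going to try again tomorrow!

Attempt 2

I found this link: http://zogovic.com/post/44856908222/optimizing-postgresql-query-for-distinct-values Combining this technique with my query from attempt 1, I get: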

    WITH RECURSIVE equipment(id) AS (
       SELECT MIN(equipment_id) FROM geoposition_records
       UNION
       SELECT (
          SELECT equipment_id
          FROM geoposition_records
          WHERE equipment_id > equipment.id
          ORDER BY equipment_id
          LIMIT 1
       )
       FROM equipment
       WHERE id IS NOT NULL
    )
    SELECT id AS equipment_id,
       (SELECT MAX(created_at)
        FROM geoposition_records
        WHERE equipment_id = equipment.id
       ) AS max_created_at
    FROM equipment;

and this works FAST! But you need

this ultra-contorted query form, and
an index on geoposition_records(equipment_id, created_at desc).
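
One detail worth noting about the recursive query: the recursion stops by producing a NULL id once no larger _equipment_id_ exists, so the query as written should also return one trailing row of NULLs. A minimal sketch of the same technique with that row filtered out, and with a comment on each step, might look like this. The CTE name ids is made up here so that it cannot shadow a real equipment table.

```sql
WITH RECURSIVE ids(id) AS (
   -- Anchor: the smallest equipment_id, found with one probe into the index.
   SELECT min(equipment_id) FROM geoposition_records
   UNION ALL  -- ids strictly increase, so no de-duplication is needed
   -- Step: jump to the next larger equipment_id, again one index probe.
   -- Yields NULL when there is none, which ends the recursion.
   SELECT (SELECT g.equipment_id
           FROM   geoposition_records g
           WHERE  g.equipment_id > ids.id
           ORDER  BY g.equipment_id
           LIMIT  1)
   FROM   ids
   WHERE  ids.id IS NOT NULL
)
SELECT ids.id AS equipment_id,
      (SELECT max(g.created_at)
       FROM   geoposition_records g
       WHERE  g.equipment_id = ids.id) AS max_created_at
FROM   ids
WHERE  ids.id IS NOT NULL;  -- drop the NULL stop marker
```

With roughly 58 distinct devices this amounts to a few dozen index probes instead of a pass over every row, which is where the speedup comes from.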
