Optimizing a query with a small LIMIT, a predicate on one column, and ORDER BY on another

Question

I'm using Postgres 9.3.4 and I have four queries with very similar inputs but vastly different response times.

Query #1

    EXPLAIN ANALYZE
    SELECT posts.* FROM posts
    WHERE posts.source_id IN (19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600, 17804, 20717, 27598, 27599)
    AND posts.deleted_at IS NULL
    ORDER BY external_created_at desc
    LIMIT 100 OFFSET 0;

    QUERY PLAN
    ----------
     Limit  (cost=0.43..585.44 rows=100 width=1041) (actual time=326092.852..507360.199 rows=100 loops=1)
       ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14871916.35 rows=2542166 width=1041) (actual time=326092.301..507359.524 rows=100 loops=1)
             Filter: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
             Rows Removed by Filter: 6913925
     Total runtime: 507361.944 ms

Query #2

    EXPLAIN ANALYZE
    SELECT posts.* FROM posts
    WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627)
    AND posts.deleted_at IS NULL
    ORDER BY external_created_at desc
    LIMIT 100 OFFSET 0;

    QUERY PLAN
    ----------
     Limit  (cost=31239.64..31239.89 rows=100 width=1041) (actual time=2.004..2.038 rows=100 loops=1)
       ->  Sort  (cost=31239.64..31261.26 rows=8648 width=1041) (actual time=2.003..2.017 rows=100 loops=1)
             Sort Key: external_created_at
             Sort Method: top-N heapsort  Memory: 80kB
             ->  Index Scan using index_posts_on_source_id on posts  (cost=0.44..30909.12 rows=8648 width=1041) (actual time=0.024..1.063 rows=944 loops=1)
                   Index Cond: (source_id = ANY ('{5202,5203,661,659,662,627}'::integer[]))
                   Filter: (deleted_at IS NULL)
                   Rows Removed by Filter: 109
     Total runtime: 2.125 ms

Query #3

    EXPLAIN ANALYZE
    SELECT posts.* FROM posts
    WHERE posts.source_id IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
    AND posts.deleted_at IS NULL
    ORDER BY external_created_at desc
    LIMIT 100 OFFSET 0;

    QUERY PLAN
    ----------
     Limit  (cost=0.43..821.25 rows=100 width=1041) (actual time=19.224..55.599 rows=100 loops=1)
       ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14930351.58 rows=1818959 width=1041) (actual time=19.213..55.529 rows=100 loops=1)
             Filter: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
             Rows Removed by Filter: 414
     Total runtime: 55.683 ms

Query #4

    EXPLAIN ANALYZE
    SELECT posts.* FROM posts
    WHERE posts.source_id IN (18766, 18130, 18128, 18129, 19705, 28252, 18264, 18126, 18767, 27603, 28657, 28654, 28655, 19706, 18330)
    AND posts.deleted_at IS NULL
    ORDER BY external_created_at desc
    LIMIT 100 OFFSET 0;

    QUERY PLAN
    ----------
     Limit  (cost=0.43..69055.29 rows=100 width=1041) (actual time=26.094..320.626 rows=100 loops=1)
       ->  Index Scan using index_posts_on_external_created_at on posts  (cost=0.43..14930351.58 rows=21621 width=1041) (actual time=26.093..320.538 rows=100 loops=1)
             Filter: (source_id = ANY ('{18766,18130,18128,18129,19705,28252,18264,18126,18767,27603,28657,28654,28655,19706,18330}'::integer[]))
             Rows Removed by Filter: 6156
     Total runtime: 320.778 ms

All four are identical except that they look at posts for different source_ids.

Three of the four use the following index:

    CREATE INDEX index_posts_on_external_created_at ON posts USING btree (external_created_at DESC) WHERE (deleted_at IS NULL);

And #2 uses this index:

    CREATE INDEX index_posts_on_source_id ON posts USING btree (source_id);

What's interesting to me is that, of the three queries that use the index_posts_on_external_created_at index, two are quite fast while the other (#1) is extremely slow.

Query #2 has far fewer posts than the other three, which may explain why it uses the index_posts_on_source_id index instead. However, if I get rid of the index_posts_on_external_created_at index, the other three queries are forced to use index_posts_on_source_id and become extremely slow.

Here is my definition of the posts table:

    CREATE TABLE posts (
        id integer NOT NULL,
        source_id integer,
        message text,
        image text,
        external_id text,
        created_at timestamp without time zone,
        updated_at timestamp without time zone,
        external text,
        like_count integer DEFAULT 0 NOT NULL,
        comment_count integer DEFAULT 0 NOT NULL,
        external_created_at timestamp without time zone,
        deleted_at timestamp without time zone,
        poster_name character varying(255),
        poster_image text,
        poster_url character varying(255),
        poster_id text,
        position integer,
        location character varying(255),
        description text,
        video text,
        rejected_at timestamp without time zone,
        deleted_by character varying(255),
        height integer,
        width integer
    );

I have tried

    CLUSTER posts USING index_posts_on_external_created_at

which uses the index that is essentially ordered by external_created_at, and this seems to be the only effective approach I have found. However, I can't use it in production because it takes an exclusive lock on the table for several hours while it runs. I am on Heroku, so I can't install pg_repack or anything similar.

Why is query #1 so slow while the others are really fast? What can I do to mitigate this?

Edit: here are my queries without the LIMIT (the ORDER BY is still present).

Query #1

    EXPLAIN ANALYZE
    SELECT posts.* FROM posts
    WHERE posts.source_id IN (19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600, 17804, 20717, 27598, 27599)
    AND posts.deleted_at IS NULL
    ORDER BY external_created_at desc;

    QUERY PLAN
    ----------
     Sort  (cost=7455044.81..7461163.56 rows=2447503 width=1089) (actual time=94903.143..95110.898 rows=238975 loops=1)
       Sort Key: external_created_at
       Sort Method: external merge  Disk: 81440kB
       ->  Bitmap Heap Scan on posts  (cost=60531.78..1339479.50 rows=2447503 width=1089) (actual time=880.150..90988.460 rows=238975 loops=1)
             Recheck Cond: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
             Rows Removed by Index Recheck: 5484857
             Filter: (deleted_at IS NULL)
             Rows Removed by Filter: 3108465
             ->  Bitmap Index Scan on index_posts_on_source_id  (cost=0.00..59919.90 rows=3267549 width=0) (actual time=877.904..877.904 rows=3347440 loops=1)
                   Index Cond: (source_id = ANY ('{19082,19075,20705,18328,19110,24965,18329,27600,17804,20717,27598,27599}'::integer[]))
     Total runtime: 95534.724 ms

Query #2

    EXPLAIN ANALYZE
    SELECT posts.* FROM posts
    WHERE posts.source_id IN (5202, 5203, 661, 659, 662, 627)
    AND posts.deleted_at IS NULL
    ORDER BY external_created_at desc;

    QUERY PLAN
    ----------
     Sort  (cost=36913.72..36935.85 rows=8852 width=1089) (actual time=212.450..212.549 rows=944 loops=1)
       Sort Key: external_created_at
       Sort Method: quicksort  Memory: 557kB
       ->  Index Scan using index_posts_on_source_id on posts  (cost=0.44..32094.90 rows=8852 width=1089) (actual time=1.732..209.590 rows=944 loops=1)
             Index Cond: (source_id = ANY ('{5202,5203,661,659,662,627}'::integer[]))
             Filter: (deleted_at IS NULL)
             Rows Removed by Filter: 109
     Total runtime: 214.507 ms

Query #3

    EXPLAIN ANALYZE
    SELECT posts.* FROM posts
    WHERE posts.source_id IN (14790, 14787, 32928, 14796, 14791, 15503, 14789, 14772, 15506, 14794, 15543, 31615, 15507, 15508, 14800)
    AND posts.deleted_at IS NULL
    ORDER BY external_created_at desc;

    QUERY PLAN
    ----------
     Sort  (cost=5245032.87..5249894.14 rows=1944508 width=1089) (actual time=131032.952..134342.372 rows=1674072 loops=1)
       Sort Key: external_created_at
       Sort Method: external merge  Disk: 854864kB
       ->  Bitmap Heap Scan on posts  (cost=48110.86..1320005.55 rows=1944508 width=1089) (actual time=605.648..91351.334 rows=1674072 loops=1)
             Recheck Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
             Rows Removed by Index Recheck: 5304550
             Filter: (deleted_at IS NULL)
             Rows Removed by Filter: 879414
             ->  Bitmap Index Scan on index_posts_on_source_id  (cost=0.00..47624.73 rows=2596024 width=0) (actual time=602.744..602.744 rows=2553486 loops=1)
                   Index Cond: (source_id = ANY ('{14790,14787,32928,14796,14791,15503,14789,14772,15506,14794,15543,31615,15507,15508,14800}'::integer[]))
     Total runtime: 136176.868 ms

Query #4

    EXPLAIN ANALYZE
    SELECT posts.* FROM posts
    WHERE posts.source_id IN (18766, 18130, 18128, 18129, 19705, 28252, 18264, 18126, 18767, 27603, 28657, 28654, 28655, 19706, 18330)
    AND posts.deleted_at IS NULL
    ORDER BY external_created_at desc;

    QUERY PLAN
    ----------
     Sort  (cost=102648.92..102704.24 rows=22129 width=1089) (actual time=15225.250..15256.931 rows=51408 loops=1)
       Sort Key: external_created_at
       Sort Method: external merge  Disk: 35456kB
       ->  Index Scan using index_posts_on_source_id on posts  (cost=0.45..79869.91 rows=22129 width=1089) (actual time=3.975..14803.320 rows=51408 loops=1)
             Index Cond: (source_id = ANY ('{18766,18130,18128,18129,19705,28252,18264,18126,18767,27603,28657,28654,28655,19706,18330}'::integer[]))
             Filter: (deleted_at IS NULL)
             Rows Removed by Filter: 54
     Total runtime: 15397.453 ms

Postgres memory settings:

    name                       | setting | unit
    ---------------------------+---------+------
    default_statistics_target  | 100     |
    effective_cache_size       | 16384   | 8kB
    maintenance_work_mem       | 16384   | kB
    max_connections            | 100     |
    random_page_cost           | 4       | NULL
    seq_page_cost              | 1       | NULL
    shared_buffers             | 16384   | 8kB
    work_mem                   | 1024    | kB

Database statistics:

    Total posts: 20,997,027
    Posts where deleted_at is null: 15,665,487
    Distinct source_ids: 22,245
    Max number of rows per single source_id: 1,543,950
    Min number of rows per single source_id: 1
    Most source_ids in a single query: 21
    Distinct external_created_at: 11,146,151

Erwin Brandstetter · Accepted Answer

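General advice

All the usual advice on performance optimization applies. The default settings are very conservative, and some of your resource settings are way too low for a table with millions of rows (work_mem in particular). You need to configure your RDBMS so that it can make sensible use of the available RAM. The Postgres Wiki is a good starting point. More than that is beyond the scope of a single question here.

However, the query suggested below needs only very moderate resource settings.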
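For reference, the values quoted in the question work out to work_mem = 1 MB (1024 kB) and shared_buffers = 128 MB (16384 × 8 kB). A minimal sketch of how one might inspect these settings and try a larger work_mem for a single session follows. The 32MB figure is an arbitrary illustrative value, not a recommendation taken from the answer; a sensible number depends on available RAM and on how many connections sort at the same time.

```sql
-- Show the memory-related settings quoted in the question,
-- with the unit that pg_settings reports for each.
SELECT name, setting, unit
FROM   pg_settings
WHERE  name IN ('work_mem', 'shared_buffers',
                'effective_cache_size', 'maintenance_work_mem');

-- Raise work_mem for the current session only.
-- '32MB' is an assumed example value, not a tuned recommendation.
SET work_mem = '32MB';

-- Confirm the session-level value.
SHOW work_mem;
```

A session-level SET affects only the current connection, which makes it a low-risk way to check whether a sort that currently spills to disk ("external merge" in the plans above) fits in memory with a larger value.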

Also, increase the statistics target for source_id to get more detailed statistics on this important column:

    ALTER TABLE posts ALTER COLUMN source_id SET STATISTICS 2000;  -- or similar

Then:

    ANALYZE posts;
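To see what the planner knows about source_id once the ANALYZE has run, the pg_stats view can be queried. This is only a sketch for checking the effect of the higher statistics target; the schema name public is an assumption, since the question does not say which schema the table lives in.

```sql
-- Planner statistics for posts.source_id.
-- most_common_vals / most_common_freqs list the source_ids the planner
-- tracks individually; a higher statistics target makes these lists longer.
SELECT n_distinct,
       most_common_vals,
       most_common_freqs
FROM   pg_stats
WHERE  schemaname = 'public'      -- assumed schema
AND    tablename  = 'posts'
AND    attname    = 'source_id';
```

If the source_ids used in the slow query appear in most_common_vals with realistic frequencies, the row estimates in EXPLAIN should move closer to the actual counts.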

More:

Keep PostgreSQL from choosing a bad query plan

You can also optimize storage further (for a minor improvement):

Configuring PostgreSQL for read performance

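Query

The query itself is hard to optimize. For advanced performance optimization, see this related question by @ypercube:

Can spatial index help a "range - order by - limit" query

There is a simple approach if ...

the number of distinct source_id values per query is fairly small, and
the LIMIT is also fairly small

... which, according to the added details, is true in your case.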

The only index needed for the query below:

    CREATE INDEX posts_special_idx ON posts (source_id, external_created_at DESC)
    WHERE deleted_at IS NULL;
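On a production table of this size, a plain CREATE INDEX blocks writes to the table while it builds. A possible variant uses the CONCURRENTLY option. Whether it suits this setup is an assumption; it is not part of the statement above. CREATE INDEX CONCURRENTLY cannot run inside a transaction block, and a failed build leaves an invalid index behind, which the second query looks for.

```sql
-- Build the same partial index without blocking writes on posts.
-- Must be run outside of BEGIN ... COMMIT.
CREATE INDEX CONCURRENTLY posts_special_idx
ON posts (source_id, external_created_at DESC)
WHERE deleted_at IS NULL;

-- List any invalid indexes left on posts by a failed concurrent build.
-- An index returned here should be dropped and rebuilt.
SELECT indexrelid::regclass AS index_name
FROM   pg_index
WHERE  indrelid = 'posts'::regclass
AND    NOT indisvalid;
```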

Example based on query #1:

    SELECT p.*
    FROM   unnest('{19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600
                  , 17804, 20717, 27598, 27599}'::int[]) s(source_id)
         , LATERAL (
       SELECT *
       FROM   posts
       WHERE  source_id = s.source_id
       AND    deleted_at IS NULL
       ORDER  BY external_created_at DESC
       LIMIT  100
       ) p
    ORDER  BY p.external_created_at DESC
    LIMIT  100;

This emulates a loose index scan. Related:

Optimize GROUP BY query to retrieve latest record per user

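With n being the number of source_ids (and luckily never > 21), Postgres fetches the top 100 rows by external_created_at DESC for each source_id from the index. That is very fast in itself, but up to (n-1) * 100 of those rows are surplus. Given your value frequencies:

    22,245 source_ids with 1 to 1,543,950 rows each, 20,997,027 rows in total

(It was not clear whether all of these numbers include "deleted" rows, but only about 25 % of rows are "deleted".)

... I would expect some source_ids to have fewer than 100 rows to begin with. So you only have to sort 21 * 100 = 2100 rows in the worst case (typically far fewer) to keep the top 100. Performance should not be bad at all once you have configured Postgres with decent resource settings.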
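One way to check this reasoning against real data is to run the LATERAL query under EXPLAIN (ANALYZE, BUFFERS). The sketch below assumes that posts_special_idx has already been created. With the twelve source_ids of query #1, the inner subquery can return at most 12 × 100 = 1,200 rows to the final sort, and one would hope to see an index scan on posts_special_idx inside the nested loop. No output is shown here because it depends entirely on the data.

```sql
-- Show the actual plan, timings and buffer usage for the LATERAL query.
-- Expectation (not guaranteed): one index scan on posts_special_idx per
-- source_id, each stopped after at most 100 rows by the inner LIMIT.
EXPLAIN (ANALYZE, BUFFERS)
SELECT p.*
FROM   unnest('{19082, 19075, 20705, 18328, 19110, 24965, 18329, 27600
              , 17804, 20717, 27598, 27599}'::int[]) s(source_id)
     , LATERAL (
   SELECT *
   FROM   posts
   WHERE  source_id = s.source_id
   AND    deleted_at IS NULL
   ORDER  BY external_created_at DESC
   LIMIT  100
   ) p
ORDER  BY p.external_created_at DESC
LIMIT  100;
```

Comparing the "rows" figure of the outer Sort node with the 1,200-row upper bound is a quick sanity check that the inner LIMIT is being applied per source_id.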

If you have a source table holding all distinct source_ids, it makes sense to use it to eliminate non-existent source_ids early:

    SELECT p.*
    FROM   source s, LATERAL ( ... ) p
    WHERE  s.source_id IN (19082, 19075, 20705, ...)
    ORDER  BY ...

With up to 21 IN values this form is fine, but consider this related question:

Optimizing a Postgres query with a large IN

If you know the minimum external_created_at in the result, or the maximum number of rows from a single source_id ...

jcaron · Answer

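The reason queries #3 and #4 run faster than #1 is probably that, in the order in which rows are fetched (the index order, by external_created_at), 100 records matching the source_id condition turn up very quickly because those source_ids have many recent records, whereas #1 has to scan a huge number of rows before it finds 100 matches. The plans bear this out: the filter removed 414 and 6,156 rows for #3 and #4, against 6,913,925 rows for #1.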
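A hedged way to test this explanation is to look at how many live rows each source_id has and how recent they are. The sketch below uses the source_ids from query #1. Without a suitable index it may itself take a while on a table of this size (the unlimited version of query #1 in the question took about 95 seconds), so it is meant as a one-off diagnostic rather than something to run routinely.

```sql
-- Per-source summary of non-deleted posts for the source_ids of query #1.
-- If "newest" for these sources is much older than the newest rows in the
-- table overall, a scan of index_posts_on_external_created_at has to skip
-- many non-matching rows before it reaches 100 matches.
SELECT source_id,
       count(*)                 AS live_rows,
       min(external_created_at) AS oldest,
       max(external_created_at) AS newest
FROM   posts
WHERE  source_id IN (19082, 19075, 20705, 18328, 19110, 24965,
                     18329, 27600, 17804, 20717, 27598, 27599)
AND    deleted_at IS NULL
GROUP  BY source_id
ORDER  BY newest DESC;
```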

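Why the planner picks that index rather than the one on source_id depends on several factors, such as the spread of the source_id values, the proportion of records where deleted_at is NULL, and the STATISTICS settings on the table. In most cases it seems to estimate that the partial index (which helps satisfy the deleted_at IS NULL condition) is more useful than the index on source_id.

You should probably add a partial index on source_id with the condition deleted_at IS NULL. You can create it concurrently to avoid locking:

    CREATE INDEX CONCURRENTLY posts_source_id_where_deleted_at_is_null_idx ON posts(source_id) WHERE deleted_at IS NULL;

Hopefully the planner will then always use this index, which should give the fastest execution plan.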

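Alternatively, you can use a CTE (WITH query), which acts as an optimization fence, to split the condition into two parts: first source_id (which will use the existing index), then deleted_at (which filters the result of the first query). However, if a large proportion of rows have deleted_at IS NOT NULL, this will be less efficient than the new partial index.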
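A minimal sketch of the CTE variant, using the source_ids from query #1. It relies on the fact that in Postgres 9.3 a CTE is always materialized, so the planner cannot push the deleted_at filter or the ORDER BY ... LIMIT down into it. From PostgreSQL 12 on, the CTE would have to be written as AS MATERIALIZED to get the same effect. The CTE name by_source is made up for illustration.

```sql
-- Step 1 (inside the CTE): fetch all rows for the listed source_ids;
--   the planner is expected to use index_posts_on_source_id here.
-- Step 2 (outer query): filter on deleted_at, then sort and limit.
WITH by_source AS (
   SELECT *
   FROM   posts
   WHERE  source_id IN (19082, 19075, 20705, 18328, 19110, 24965,
                        18329, 27600, 17804, 20717, 27598, 27599)
   )
SELECT *
FROM   by_source
WHERE  deleted_at IS NULL
ORDER  BY external_created_at DESC
LIMIT  100;
```

Note that the CTE materializes every row for the listed source_ids before anything is filtered or limited. For query #1 the plans in the question show over three million index matches, so this form can still be expensive for large sources.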