Optimizing a complex Postgres query with a filter

Question

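I asked about this query before and got a really insightful answer. However, I would now like to segment the query further on PostgreSQL 9.6.3, and it starts to slow down again. I'm not sure whether a partial index helps here, because the new condition is not on a boolean.

This is the base query, which performs very well: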

    EXPLAIN ANALYZE
    SELECT posts.*
    FROM unnest('{17858,50909,52659,50914,50916,51696,52661,52035,17860,53315,54027,53305}'::int[]) s(source_id),
    LATERAL (
        SELECT "posts".*
        FROM "posts"
        WHERE (source_id = s.source_id)
          AND ("posts"."deleted_at" IS NOT NULL)
          AND "posts"."rejected_at" IS NULL
        ORDER BY posts.external_created_at DESC
        LIMIT 100
    ) posts
    ORDER BY posts.external_created_at DESC
    LIMIT 100 OFFSET 1;

                                   QUERY PLAN
    ------------------------------------------------------------------------
     Limit  (cost=30895.79..30896.04 rows=100 width=1043) (actual time=5.299..5.337 rows=100 loops=1)
       ->  Sort  (cost=30895.78..30920.78 rows=10000 width=1043) (actual time=5.297..5.325 rows=101 loops=1)
             Sort Key: posts.external_created_at DESC
             Sort Method: top-N heapsort  Memory: 110kB
             ->  Nested Loop  (cost=0.56..30512.87 rows=10000 width=1043) (actual time=0.085..4.077 rows=738 loops=1)
                   ->  Function Scan on unnest s  (cost=0.00..1.00 rows=100 width=4) (actual time=0.011..0.016 rows=12 loops=1)
                   ->  Limit  (cost=0.56..303.12 rows=100 width=1043) (actual time=0.018..0.298 rows=62 loops=12)
                         ->  Index Scan using index_posts_for_moderation_queue on posts  (cost=0.56..7628.00 rows=2521 width=1043) (actual time=0.017..0.285 rows=62 loops=12)
                               Index Cond: (source_id = s.source_id)
     Planning time: 0.443 ms
     Execution time: 5.433 ms
    (11 rows)

And this is the modified version with the filter, which is much slower:

    EXPLAIN ANALYZE
    SELECT posts.*
    FROM unnest('{17858,50909,52659,50914,50916,51696,52661,52035,17860,53315,54027,53305}'::int[]) s(source_id),
    LATERAL (
        SELECT "posts".*
        FROM "posts"
        WHERE (source_id = s.source_id)
          AND ("posts"."deleted_at" IS NOT NULL)
          AND "posts"."deleted_by" = 'User'
          AND "posts"."rejected_at" IS NULL
        ORDER BY posts.external_created_at DESC
        LIMIT 100
    ) posts
    ORDER BY posts.external_created_at DESC
    LIMIT 100 OFFSET 0;

                                   QUERY PLAN
    ------------------------------------------------------------------------
     Limit  (cost=551390.03..551390.28 rows=100 width=1043) (actual time=769.522..769.522 rows=0 loops=1)
       ->  Sort  (cost=551390.03..551391.78 rows=700 width=1043) (actual time=769.521..769.521 rows=0 loops=1)
             Sort Key: posts.external_created_at DESC
             Sort Method: quicksort  Memory: 25kB
             ->  Nested Loop  (cost=5513.47..551363.28 rows=700 width=1043) (actual time=769.508..769.508 rows=0 loops=1)
                   ->  Function Scan on unnest s  (cost=0.00..1.00 rows=100 width=4) (actual time=0.012..0.022 rows=12 loops=1)
                   ->  Limit  (cost=5513.47..5513.48 rows=7 width=1043) (actual time=64.122..64.122 rows=0 loops=12)
                         ->  Sort  (cost=5513.47..5513.48 rows=7 width=1043) (actual time=64.120..64.120 rows=0 loops=12)
                               Sort Key: posts.external_created_at DESC
                               Sort Method: quicksort  Memory: 25kB
                               ->  Bitmap Heap Scan on posts  (cost=5485.28..5513.37 rows=7 width=1043) (actual time=64.104..64.104 rows=0 loops=12)
                                     Recheck Cond: ((source_id = s.source_id) AND (deleted_at IS NOT NULL) AND (rejected_at IS NULL) AND ((deleted_by)::text = 'User'::text))
                                     Rows Removed by Index Recheck: 1
                                     Heap Blocks: exact=9
                                     ->  BitmapAnd  (cost=5485.28..5485.28 rows=7 width=0) (actual time=64.098..64.098 rows=0 loops=12)
                                           ->  Bitmap Index Scan on index_posts_for_moderation_queue  (cost=0.00..59.47 rows=2521 width=0) (actual time=0.028..0.028 rows=168 loops=12)
                                                 Index Cond: (source_id = s.source_id)
                                           ->  Bitmap Index Scan on index_posts_on_deleted_by  (cost=0.00..5425.55 rows=291865 width=0) (actual time=76.855..76.855 rows=334200 loops=10)
                                                 Index Cond: ((deleted_by)::text = 'User'::text)
     Planning time: 0.348 ms
     Execution time: 769.660 ms
    (21 rows)

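The only difference between the two is that the second adds the AND "posts"."deleted_by" = 'User' condition to the lateral subquery.

The catch is the 'User' value: deleted_by is not a boolean and can contain anything.

Is there a way to optimize this query further so that it stays fast even when the deleted_by filter is set?

Here are the table structure, indexes, and settings: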

    CREATE TABLE posts (
        id integer NOT NULL,
        source_id integer,
        message text,
        image text,
        external_id text,
        created_at timestamp without time zone,
        updated_at timestamp without time zone,
        external text,
        like_count integer DEFAULT 0 NOT NULL,
        comment_count integer DEFAULT 0 NOT NULL,
        external_created_at timestamp without time zone,
        deleted_at timestamp without time zone,
        poster_name character varying(255),
        poster_image text,
        poster_url character varying(255),
        poster_id text,
        position integer,
        location character varying(255),
        description text,
        video text,
        rejected_at timestamp without time zone,
        deleted_by character varying(255),
        height integer,
        width integer
    );

    CREATE INDEX index_posts_on_source_id_and_external_created_at ON posts USING btree (source_id, external_created_at DESC) WHERE deleted_at IS NOT NULL AND rejected_at IS NULL;
    CREATE INDEX index_posts_on_deleted_at ON posts USING btree (deleted_at);
    CREATE INDEX index_posts_on_deleted_by ON posts USING btree (deleted_by);
    CREATE INDEX index_posts_on_source_id ON posts USING btree (source_id);

The first index above is the result of the answer to my previous question.

Postgres memory settings:

    name, setting, unit
    'default_statistics_target','100',''
    'effective_cache_size','16384','8kB'
    'maintenance_work_mem','16384','kB'
    'max_connections','100',''
    'random_page_cost','4',NULL
    'seq_page_cost','1',NULL
    'shared_buffers','16384','8kB'
    'work_mem','1024','kB'

Database statistics:

    Total Posts: 20,997,027
    Posts where deleted_at is null: 15,665,487
    Distinct source_id's: 22,245
    Max number of rows per single source_id: 1,543,950
    Min number of rows per single source_id: 1
    Most source_ids in a single query: 21
    Distinct external_created_at: 11,146,151

Edit

I tried the simplified query from Evan's answer with a different set of source IDs, and it is still quite slow:

    EXPLAIN ANALYZE
    SELECT *
    FROM posts AS p
    WHERE source_id IN (159469,120669,120668,120670,120671,120674,120662,120661,120664,109450,109448,109447,108039,159468,157810)
      AND deleted_at IS NOT NULL
      AND deleted_by = 'Filter'
      AND rejected_at IS NULL
    ORDER BY external_created_at DESC
    LIMIT 100;

                                   QUERY PLAN
    ------------------------------------------------------------------------
     Limit  (cost=74114.14..74114.19 rows=100 width=1060) (actual time=2794.981..2794.981 rows=0 loops=1)
       ->  Sort  (cost=74114.14..74115.48 rows=2678 width=1060) (actual time=2794.981..2794.981 rows=0 loops=1)
             Sort Key: external_created_at DESC
             Sort Method: quicksort  Memory: 25kB
             ->  Bitmap Heap Scan on posts p  (cost=68759.42..74093.67 rows=2678 width=1060) (actual time=2794.977..2794.977 rows=0 loops=1)
                   Recheck Cond: ((source_id = ANY ('{159469,120669,120668,120670,120671,120674,120662,120661,120664,109450,109448,109447,108039,159468,157810}'::integer[])) AND (deleted_at IS NOT NULL) AND (rejected_at IS NULL) AND ((deleted_by)::text = 'Filter'::text))
                   Rows Removed by Index Recheck: 32326
                   Heap Blocks: exact=16019
                   ->  BitmapAnd  (cost=68759.42..68759.42 rows=2678 width=0) (actual time=2745.376..2745.376 rows=0 loops=1)
                         ->  Bitmap Index Scan on index_posts_for_moderation_queue  (cost=0.00..830.64 rows=52637 width=0) (actual time=42.319..42.319 rows=272192 loops=1)
                               Index Cond: (source_id = ANY ('{159469,120669,120668,120670,120671,120674,120662,120661,120664,109450,109448,109447,108039,159468,157810}'::integer[]))
                         ->  Bitmap Index Scan on index_posts_on_deleted_by  (cost=0.00..67928.46 rows=6942897 width=0) (actual time=2651.123..2651.123 rows=7863994 loops=1)
                               Index Cond: ((deleted_by)::text = 'Filter'::text)
     Planning time: 0.856 ms
     Execution time: 2795.033 ms
    (15 rows)

The reason I am using LATERAL is explained in another earlier question, where this query was optimized.

Evan Carroll · Accepted Answer

Just from looking at the query, I would clean it up first. Try the following instead.

Stop using double quotes. None of these identifiers need to be double-quoted.
Don't write ", LATERAL". That is SQL-89 join syntax, and it is time to update it. These are all CROSS JOIN LATERAL.
Don't use a string literal for the int array. Use ARRAY[].
Don't use CROSS JOIN LATERAL if the query can be rewritten with INNER JOIN.
Don't use INNER JOIN on literals if the query can be rewritten with WHERE x IN ().
Don't use WHERE x IN if the list comes from SQL. Use EXISTS (this doesn't apply here, but while we're at it...).

Try this.

    EXPLAIN ANALYZE
    SELECT p.*
    FROM posts AS p
    WHERE p.source_id IN (17858,50909,52659,50914,50916,51696,52661,52035,17860,53315,54027,53305)
      AND p.deleted_at IS NOT NULL
      AND p.deleted_by = 'User'
      AND p.rejected_at IS NULL
    ORDER BY p.external_created_at DESC
    LIMIT 100;

Update

The big problem with that query is simply deleted_by. Here is my advice.

These are your current indexes:

    CREATE INDEX index_posts_on_source_id_and_external_created_at ON posts USING btree (source_id, external_created_at DESC) WHERE deleted_at IS NOT NULL AND rejected_at IS NULL;
    CREATE INDEX index_posts_on_deleted_at ON posts USING btree (deleted_at);
    CREATE INDEX index_posts_on_deleted_by ON posts USING btree (deleted_by);
    CREATE INDEX index_posts_on_source_id ON posts USING btree (source_id);

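There is no reason to have both index_posts_on_source_id_and_external_created_at and index_posts_on_source_id. Both lead with source_id, so drop index_posts_on_source_id; it only slows down inserts. (Because the first index is partial, this holds only as long as every query that filters on source_id also includes deleted_at IS NOT NULL AND rejected_at IS NULL.)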
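Before dropping an index it can be worth checking whether anything still uses it. The following is a minimal sketch, not part of the original answer: it reads the standard pg_stat_user_indexes view, whose idx_scan counter only reflects activity since the statistics were last reset, so a low number is a hint rather than proof.

```sql
-- How often has each index on posts been scanned, and how large is it?
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'posts'
ORDER BY idx_scan;

-- If index_posts_on_source_id turns out to be unused, drop it without
-- blocking writes. DROP INDEX CONCURRENTLY cannot run inside a transaction block.
DROP INDEX CONCURRENTLY index_posts_on_source_id;
```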

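That leaves the big problem, deleted_by. There are two ways to fix it.

One is a compound index, so that Postgres does not have to run two index scans and merge the bitmaps.
The other is a predicate (partial) index.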
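The two options could look roughly like this. Both definitions are sketches under assumptions: the index names are made up for illustration, and carrying over the existing partial predicate (deleted_at IS NOT NULL AND rejected_at IS NULL) is my choice, not something the answer specifies.

```sql
-- Option 1: compound index. An equality match on source_id and deleted_by
-- leaves the entries ordered by external_created_at, so the inner
-- ORDER BY ... LIMIT 100 can be answered from the index alone.
CREATE INDEX index_posts_on_source_id_deleted_by_created_at
    ON posts USING btree (source_id, deleted_by, external_created_at DESC)
    WHERE deleted_at IS NOT NULL AND rejected_at IS NULL;

-- Option 2: predicate index for one specific deleted_by value.
-- One such index is needed per value, so this only makes sense
-- when the set of values being filtered on is small and known.
CREATE INDEX index_posts_deleted_by_user
    ON posts USING btree (source_id, external_created_at DESC)
    WHERE deleted_at IS NOT NULL
      AND rejected_at IS NULL
      AND deleted_by = 'User';
```

On a table of this size, adding CONCURRENTLY after CREATE INDEX avoids blocking writes while the index is built.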

If deleted_by can only take a handful of values, consider making it an enum type, which removes the string comparison.
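A sketch of the enum conversion follows. It assumes deleted_by only ever holds 'User' and 'Filter', which are simply the two values that appear in the question; the real set is not stated, and the type name post_deleted_by is hypothetical. The ALTER TABLE rewrites the table under an exclusive lock and fails if any row contains a value missing from the enum, so it should be tried on a copy first.

```sql
-- Hypothetical enum type; extend the list with every value actually in use.
CREATE TYPE post_deleted_by AS ENUM ('User', 'Filter');

-- Convert the column in place. The cast goes through text because
-- the column is currently character varying(255).
ALTER TABLE posts
    ALTER COLUMN deleted_by TYPE post_deleted_by
    USING deleted_by::text::post_deleted_by;
```

Existing queries such as deleted_by = 'User' keep working unchanged, because the untyped literal is resolved to the enum type.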

jjanes · Answer

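PostgreSQL thinks that using the index index_posts_on_deleted_by will help, when in fact it is counterproductive.

Figuring out exactly what goes wrong in the planner to confuse it would take some fairly deep digging. Part of it is probably simply that it does not know how popular each source_id value is and overestimates the row count by roughly an order of magnitude (2521 rows estimated versus 168 actual in the slow plan), but I don't think that is the whole story. The quick and dirty solution is to keep that index from being used. If no other queries need index_posts_on_deleted_by, you can just drop it. Otherwise, you can keep this particular query from using it by changing: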

 AND posts.deleted_by = 'User'

to

 AND posts.deleted_by||'' = 'User'
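One way to check whether index_posts_on_deleted_by really is the culprit, before committing to either change, is to drop it inside a transaction and roll back. This is a sketch rather than part of the answer above. It relies on DDL being transactional in PostgreSQL; note that DROP INDEX holds an exclusive lock on posts until the ROLLBACK, so it is best tried on a staging copy or during a quiet period. The source_id list is shortened for illustration.

```sql
BEGIN;

-- Hide the index from the planner for this transaction only.
DROP INDEX index_posts_on_deleted_by;

EXPLAIN ANALYZE
SELECT posts.*
FROM unnest(ARRAY[17858,50909,52659]) AS s(source_id)
CROSS JOIN LATERAL (
    SELECT p.*
    FROM posts AS p
    WHERE p.source_id = s.source_id
      AND p.deleted_at IS NOT NULL
      AND p.deleted_by = 'User'
      AND p.rejected_at IS NULL
    ORDER BY p.external_created_at DESC
    LIMIT 100
) AS posts
ORDER BY posts.external_created_at DESC
LIMIT 100;

-- Restore the index; nothing is changed permanently.
ROLLBACK;
```

If the plan falls back to the fast index scan on source_id, either remedy (dropping the index or the ||'' rewrite) should have the same effect on this query.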

Alternatively, the index suggested by Lennart might be enough to give the planner a more attractive-looking option. Another index that would certainly help is:

(source_id, deleted_by, external_created_at)

or

(deleted_by, source_id, external_created_at)

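However, neither of these supports the query that omits the deleted_by clause, so you would have to keep both the existing index and the new one in order to support both queries.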