How to optimize an IN query on an indexed column

Question

I have a table with more than 50M records. One of its fields is COLOR_CODE, and I created an index on the COLOR_CODE column like this:

"mytable_colorcode_idx" btree (color_code)

I noticed that the following query takes a long time to run:

SELECT count(total_amount) FROM mytable WHERE color_code in ('red','green') and sale_date = '1970'

However, when I use an OR clause instead, the execution time drops:

SELECT count(total_amount) FROM mytable WHERE color_code = 'red' or color_code = 'green' and sale_date = '1970'

Query plan for IN

    explain analyze SELECT count(total_amount) FROM mytable WHERE color_code in ('red','green') and sale_date = '1970'
                                        QUERY PLAN
    ------------------------------------------------------------------------------------------
     Aggregate  (cost=2074238.07..2074238.08 rows=1 width=8) (actual time=63520.150..63520.150 rows=1 loops=1)
       ->  Bitmap Heap Scan on mytable  (cost=53504.73..2069923.27 rows=1725919 width=6) (actual time=3509.920..63080.519 rows=1727037 loops=1)
             Recheck Cond: ((color_code)::text = ANY ('{red,green}'::text[]))
             Rows Removed by Index Recheck: 5067635
             Filter: (sale_date = 1970)
             Heap Blocks: exact=38679 lossy=496680
             ->  Bitmap Index Scan on mytable_colorcode_idx  (cost=0.00..53073.26 rows=1725919 width=0) (actual time=3501.777..3501.777 rows=1727037 loops=1)
                   Index Cond: ((color_code)::text = ANY ('{red,green}'::text[]))
     Planning time: 0.165 ms
     Execution time: 63524.100 ms
    (10 rows)

Query plan for OR

    explain analyze SELECT count(total_amount) FROM mytable WHERE color_code = 'red' or color_code = 'green' and sale_date = '1970'
                                        QUERY PLAN
    ------------------------------------------------------------------------------------------
     Aggregate  (cost=2081265.36..2081265.37 rows=1 width=8) (actual time=18895.998..18895.998 rows=1 loops=1)
       ->  Bitmap Heap Scan on mytable  (cost=56223.06..2076956.39 rows=1723588 width=6) (actual time=161.335..18468.146 rows=1727037 loops=1)
             Recheck Cond: (((color_code)::text = 'red'::text) OR ((color_code)::text = 'green'::text))
             Rows Removed by Index Recheck: 5067635
             Filter: (((color_code)::text = 'red'::text) OR (((color_code)::text = 'green'::text) AND (sale_date = 1970)))
             Heap Blocks: exact=38679 lossy=496680
             ->  BitmapOr  (cost=56223.06..56223.06 rows=1725919 width=0) (actual time=153.683..153.684 rows=0 loops=1)
                   ->  Bitmap Index Scan on mytable_colorcode_idx  (cost=0.00..663.35 rows=20655 width=0) (actual time=3.935..3.935 rows=26768 loops=1)
                         Index Cond: ((color_code)::text = 'red'::text)
                   ->  Bitmap Index Scan on mytable_colorcode_idx  (cost=0.00..54697.91 rows=1705264 width=0) (actual time=149.745..149.746 rows=1700269 loops=1)
                         Index Cond: ((color_code)::text = 'green'::text)
     Planning time: 0.162 ms
     Execution time: 18896.785 ms
    (13 rows)

Update

After adding an index on (color_code, total_count, sale_date), I noticed that the index is not used at all. Instead, the planner does a sequential scan.

    "mytable_color_total_count_sale_Date_idx" btree (color_code, total_count, sale_date)
                                        QUERY PLAN
    ------------------------------------------------------------------------------------------
     Finalize Aggregate  (cost=2099755.26..2099755.27 rows=1 width=8) (actual time=97066.585..97066.586 rows=1 loops=1)
       ->  Gather  (cost=2099755.04..2099755.25 rows=2 width=8) (actual time=97063.512..97069.838 rows=3 loops=1)
             Workers Planned: 2
             Workers Launched: 2
             ->  Partial Aggregate  (cost=2098755.04..2098755.05 rows=1 width=8) (actual time=97061.531..97061.532 rows=1 loops=3)
                   ->  Parallel Seq Scan on mytable  (cost=0.00..2096119.69 rows=1054140 width=6) (actual time=27782.491..96730.232 rows=841604 loops=3)
                         Filter: ((sale_date = 1970) AND ((color_code)::text = ANY ('{red,green}'::text[])))
                         Rows Removed by Filter: 4196103
     Planning time: 0.161 ms
     Execution time: 97069.896 ms
    (10 rows)

Question

Is there a way to optimize the IN query other than rewriting it with OR clauses?

Lennart · Answer

You cannot meaningfully compare the performance of:

_WHERE color_code in ('red','green') and sale_date = '1970'_

and:

_WHERE color_code = 'red' or color_code = 'green' and sale_date = '1970'_

because they are not logically equivalent (they return different results). A simple example:

    with T (color_code, sale_date) as (
        values ('red', '1970'), ('green','1969')
    )
    select * from T
    where color_code in ('green', 'red') and sale_date = '1970';
     color_code | sale_date
    ------------+-----------
     red        | 1970
    (1 row)

whereas:

    with T (color_code, sale_date) as (
        values ('red', '1970'), ('green','1969')
    )
    select * from T
    where color_code = 'green' or color_code = 'red' and sale_date = '1970';
     color_code | sale_date
    ------------+-----------
     red        | 1970
     green      | 1969
    (2 rows)

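In other words, AND binds more tightly than OR, so your "optimized" expression _A OR B AND C_ is evaluated as _A OR (B AND C)_, whereas the original IN expression is evaluated as _(A OR B) AND C_.

To make the comparison valid, the query has to be changed to: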

_select * from T where (color_code = 'green' or color_code = 'red') and sale_date = '1970'; _

My guess is that you will not see much of a performance difference between that and your original expression.
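As a rough sketch of how that like-for-like comparison could be run against the table from the question, the two logically equivalent forms can be placed side by side. The table and column names come from the question; adding the BUFFERS option is my own suggestion, so that cached and uncached page reads can be told apart.

```sql
-- Parenthesized OR form: logically equivalent to the IN form below.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(total_amount)
FROM mytable
WHERE (color_code = 'red' OR color_code = 'green')
  AND sale_date = '1970';

-- Original IN form, for a like-for-like comparison.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(total_amount)
FROM mytable
WHERE color_code IN ('red', 'green')
  AND sale_date = '1970';
```

Running each statement more than once and comparing the later runs should give a fairer picture than a single cold execution.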

That said, I would suggest an index like:

_CREATE INDEX ... ON ... (sale_date, color_code) _
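A minimal sketch of what that could look like for the table in the question. The index name mytable_saledate_colorcode_idx is hypothetical, and the ANALYZE step is my addition to refresh planner statistics after the index is built.

```sql
-- Hypothetical index name; the column order follows the suggestion above.
CREATE INDEX mytable_saledate_colorcode_idx
    ON mytable (sale_date, color_code);

-- Refresh planner statistics so the new index is costed sensibly.
ANALYZE mytable;
```

On a busy 50M-row table, CREATE INDEX CONCURRENTLY may be worth considering instead, since it builds the index without blocking writes, at the cost of a slower build.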

jjanes · Answer

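I think the timing difference you are seeing is just a caching effect that depends on which query you ran first. It is probably not a real difference caused by how the query is written. (As Lennart explained, the queries are not actually equivalent because the OR part lacks parentheses. That difference matters in general, but it makes no difference in your exact example, because all of the rows appear to satisfy _sale_date = '1970'_ anyway.)

There are a couple of things you can do to speed up both formulations of this query.

First, look at this line: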

_ Heap Blocks: exact=38679 lossy=496680 _

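This means your work_mem is not large enough to hold the entire bitmap. Every one of those lossy blocks needs to have all of its rows rechecked, which takes time. Increasing work_mem avoids this and speeds up the query. Ideally, the number of lossy blocks drops to zero (at which point the "lossy" label is no longer shown).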
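One way to try this without touching the server configuration is to raise work_mem for the current session only. This is a sketch: the value '256MB' is purely illustrative and is not derived from the plans above, so the right setting for this table would have to be found by experiment.

```sql
-- Check the current setting first.
SHOW work_mem;

-- Session-only change; '256MB' is an illustrative guess, not a recommendation.
SET work_mem = '256MB';

-- Re-run the query and inspect the "Heap Blocks" line:
-- the goal is for the lossy count to disappear.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(total_amount)
FROM mytable
WHERE color_code IN ('red', 'green')
  AND sale_date = '1970';

-- Return to the server default for the rest of the session.
RESET work_mem;
```

Because work_mem applies per sort or hash operation and per backend, a large value set globally can add up quickly, which is why a session-level SET is used here.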

Second, an index on mytable (color_code, sale_date, total_count) would make an index-only scan possible. That is because all of the data the query needs would be in the index, so the table would not have to be visited at all (assuming the table is well vacuumed).
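A sketch of that second option follows. The index name is hypothetical. Note one assumption: the queries in the question aggregate total_amount, while the index above lists total_count, so this sketch uses total_amount on the guess that the index must contain the column the query actually reads. Substitute whichever column is correct for the real table.

```sql
-- Hypothetical index name. Third column assumed to be the one the query
-- aggregates (total_amount in the question's SELECT).
CREATE INDEX mytable_color_date_amount_idx
    ON mytable (color_code, sale_date, total_amount);

-- Index-only scans rely on the visibility map, which VACUUM maintains.
VACUUM (ANALYZE) mytable;

-- Look for "Index Only Scan" in the plan, ideally with a low "Heap Fetches" count.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(total_amount)
FROM mytable
WHERE color_code IN ('red', 'green')
  AND sale_date = '1970';
```

If the plan still shows a high Heap Fetches number, the table probably has many pages not yet marked all-visible, and another VACUUM pass may help.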

These two options are mutually exclusive: if you get an index-only scan instead of a bitmap scan, work_mem no longer matters.