Selecting the top 10 from an indexed field of a large table takes too long

Question

I have a table with 165M records, like this:

Performance
id        integer
installs  integer
hour      timestamp without time zone

I also have an index on hour:

CREATE INDEX hour_idx ON performance USING btree (hour DESC NULLS LAST);

However, selecting the top 10 records ordered by hour takes 6 minutes:

EXPLAIN ANALYZE select hour from performance order by hour desc limit 10

This returns:

Limit  (cost=7952135.23..7952135.25 rows=10 width=8) (actual time=376313.958..376313.964 rows=10 loops=1)
  ->  Sort  (cost=7952135.23..8368461.00 rows=166530310 width=8) (actual time=376313.957..376313.960 rows=10 loops=1)
        Sort Key: hour
        Sort Method: top-N heapsort  Memory: 25kB
        ->  Seq Scan on performance  (cost=0.00..4353475.10 rows=166530310 width=8) (actual time=0.006..327149.828 rows=192330557 loops=1)
Planning time: 0.070 ms
Execution time: 376330.573 ms

Why does it take so long? If there is a descending index on the date field, shouldn't fetching the data be very fast?

Jeremy Schneider · Accepted Answer

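In the sample code above, the index is explicitly created with NULLS LAST, while the query implicitly asks for NULLS FIRST (the default for ORDER BY .. DESC). PostgreSQL would therefore have to re-sort the data if it used the index. As a result, the index would actually make the query many times slower than the (already slow) table scan.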

rds-9.6.5 root@db1=> create table performance (id integer, installs integer, hour timestamp without time zone);
CREATE TABLE
Time: 28.100 ms
rds-9.6.5 root@db1=> with generator as (select generate_series(1,166530) i)
[more] - > insert into performance (
[more] ( > select
[more] ( > i id,
[more] ( > (random()*1000)::integer installs,
[more] ( > (now() - make_interval(secs => i))::timestamp hour
[more] ( > from generator
[more] ( > );
INSERT 0 166530
Time: 244.872 ms
rds-9.6.5 root@db1=> create index hour_idx
[more] - > on performance
[more] - > using btree
[more] - > (hour desc nulls last);
CREATE INDEX
Time: 67.089 ms
rds-9.6.5 root@db1=> vacuum analyze performance;
VACUUM
Time: 43.552 ms

Adding a WHERE clause on the hour column encourages the planner to use the index. Note, however, that it still has to re-sort the data coming from the index.

rds-9.6.5 root@db1=> explain select hour from performance where hour>now() order by hour desc limit 10;
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Limit  (cost=4.45..4.46 rows=1 width=8)
   ->  Sort  (cost=4.45..4.46 rows=1 width=8)
         Sort Key: hour DESC
         ->  Index Only Scan using hour_idx on performance  (cost=0.42..4.44 rows=1 width=8)
               Index Cond: (hour > now())
(5 rows)

Time: 0.789 ms

If you explicitly add NULLS LAST to the query, the index is used as expected.

rds-9.6.5 root@db1=> explain select hour from performance order by hour desc NULLS LAST limit 10;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.68 rows=10 width=8)
   ->  Index Only Scan using hour_idx on performance  (cost=0.42..4334.37 rows=166530 width=8)
(2 rows)

Time: 0.526 ms

Alternatively, if you drop the (non-default) NULLS LAST from the index, the unchanged query uses it as expected.

rds-9.6.5 root@db1=> drop index hour_idx;
DROP INDEX
Time: 4.124 ms
rds-9.6.5 root@db1=> create index hour_idx
[more] - > on performance
[more] - > using btree
[more] - > (hour desc);
CREATE INDEX
Time: 69.220 ms
rds-9.6.5 root@db1=> explain select hour from performance order by hour desc limit 10;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.68 rows=10 width=8)
   ->  Index Only Scan using hour_idx on performance  (cost=0.42..4334.37 rows=166530 width=8)
(2 rows)

Time: 0.725 ms

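You can also drop DESC from the index. PostgreSQL can scan an index both forward and backward, so a single-column index usually does not need to be reversed. You only need to make sure that the sort direction and the NULLS FIRST/LAST placement form a matching combination.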

rds-9.6.5 root@db1=> drop index hour_idx;
DROP INDEX
Time: 3.837 ms
rds-9.6.5 root@db1=> create index hour_idx
[more] - > on performance
[more] - > using btree
[more] - > (hour);
CREATE INDEX
Time: 94.815 ms
rds-9.6.5 root@db1=> explain select hour from performance order by hour desc limit 10;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.68 rows=10 width=8)
   ->  Index Only Scan Backward using hour_idx on performance  (cost=0.42..4334.37 rows=166530 width=8)
(2 rows)

Time: 0.740 ms
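To summarize which combinations line up, here is a rough sketch of the two index definitions discussed in this answer and the ORDER BY clauses each can serve without a separate sort step. The index names hour_asc_idx and hour_desc_nl_idx are hypothetical, chosen only so both definitions can coexist in one illustration; the actual plan still depends on your data and statistics, so check it with EXPLAIN.

```sql
-- Hypothetical index names, used only for this illustration.

-- Default b-tree ordering is ASC NULLS LAST.
CREATE INDEX hour_asc_idx ON performance (hour);
--   forward scan  matches: ORDER BY hour        (ASC, NULLS LAST by default)
--   backward scan matches: ORDER BY hour DESC   (NULLS FIRST by default)

-- The ordering used by the index in the question.
CREATE INDEX hour_desc_nl_idx ON performance (hour DESC NULLS LAST);
--   forward scan  matches: ORDER BY hour DESC NULLS LAST
--   backward scan matches: ORDER BY hour ASC NULLS FIRST

-- Writing the NULLS placement out in the query makes the intended match explicit.
EXPLAIN SELECT hour FROM performance ORDER BY hour DESC NULLS FIRST LIMIT 10;
EXPLAIN SELECT hour FROM performance ORDER BY hour DESC NULLS LAST LIMIT 10;
```

With both indexes present, the first query would be expected to scan hour_asc_idx backward and the second to scan hour_desc_nl_idx forward.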

Will Crawford · Answer

If most of your queries select non-NULL values of hour, you should consider creating a partial index on those values:

CREATE INDEX hour_not_null_idx ON performance (hour) WHERE hour IS NOT NULL;

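If you query for specific values of hour, as Jeremy showed in his answer, or add hour IS NOT NULL to the WHERE clause, you get the same result, and the partial index will probably save you a little space as well: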

# explain select hour from performance where hour > now() order by hour desc limit 10;
 Limit  (cost=0.42..5.30 rows=10 width=8)
   ->  Index Only Scan Backward using hour_not_null_idx on performance  (cost=0.42..8.72 rows=17 width=8)
         Index Cond: (hour > now())
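The other variant mentioned above, adding hour IS NOT NULL to the WHERE clause, might look like the sketch below. It assumes the partial index hour_not_null_idx already exists; whether the planner actually picks it depends on your data and statistics, so treat the comments as an expectation to verify rather than a guarantee.

```sql
-- Assumes the partial index hour_not_null_idx shown earlier has been created.
-- The IS NOT NULL predicate matches the index's WHERE clause, so the planner
-- is allowed to use the partial index and read it backward for DESC order.
EXPLAIN
SELECT hour
FROM performance
WHERE hour IS NOT NULL
ORDER BY hour DESC
LIMIT 10;
```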

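If the column never contains NULL values, you should declare it NOT NULL (I assume you know how to do this with ALTER TABLE) and then create the index without NULLS LAST, since that no longer matters anyway. You then get the same benefit: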

william=# create index hour_idx on performance using btree ( hour );
CREATE INDEX
william=# explain select hour from performance order by hour desc limit 10;
                                               QUERY PLAN
--------------------------------------------------------------------------------------------------------
 Limit  (cost=0.42..0.73 rows=10 width=8)
   ->  Index Only Scan Backward using hour_idx on performance  (cost=0.42..5238.37 rows=166530 width=8)
(2 rows)
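For completeness, the NOT NULL declaration could look roughly like this. It is a minimal sketch assuming the performance table from the question; the preliminary count query is an illustrative addition, not part of the original answer.

```sql
-- Illustrative check: SET NOT NULL fails if any NULLs are present,
-- so confirm that this returns 0 first.
SELECT count(*) AS null_hours
FROM performance
WHERE hour IS NULL;

-- Declare the column NOT NULL. On a table this size expect it to take a while:
-- PostgreSQL normally scans the whole table to validate the constraint and
-- holds an ACCESS EXCLUSIVE lock on the table while it does so.
ALTER TABLE performance ALTER COLUMN hour SET NOT NULL;
```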