Postgres is performing a sequential scan instead of an index scan

Question

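I have a table with about 10 million rows and an index on a date field. When I try to extract the unique values of the indexed field, Postgres runs a sequential scan even though the result set has only 26 items. Why is the optimizer picking this plan? And what can I do to avoid it?

From other answers, I suspect this is as much related to the query as to the index.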

    explain select "labelDate" from pages group by "labelDate";
                                  QUERY PLAN
    -----------------------------------------------------------------------
     HashAggregate  (cost=524616.78..524617.04 rows=26 width=4)
       Group Key: "labelDate"
       ->  Seq Scan on pages  (cost=0.00..499082.42 rows=10213742 width=4)
    (3 rows)

Table structure:

    http=# \d pages
                                Table "public.pages"
         Column      |          Type          |            Modifiers
    -----------------+------------------------+----------------------------------
     pageid          | integer                | not null default nextval('...
     createDate      | integer                | not null
     archive         | character varying(16)  | not null
     label           | character varying(32)  | not null
     wptid           | character varying(64)  | not null
     wptrun          | integer                | not null
     url             | text                   |
     urlShort        | character varying(255) |
     startedDateTime | integer                |
     renderStart     | integer                |
     onContentLoaded | integer                |
     onLoad          | integer                |
     PageSpeed       | integer                |
     rank            | integer                |
     reqTotal        | integer                | not null
     reqHTML         | integer                | not null
     reqJS           | integer                | not null
     reqCSS          | integer                | not null
     reqImg          | integer                | not null
     reqFlash        | integer                | not null
     reqJSON         | integer                | not null
     reqOther        | integer                | not null
     bytesTotal      | integer                | not null
     bytesHTML       | integer                | not null
     bytesJS         | integer                | not null
     bytesCSS        | integer                | not null
     bytesImg        | integer                | not null
     bytesFlash      | integer                | not null
     bytesJSON       | integer                | not null
     bytesOther      | integer                | not null
     numDomains      | integer                | not null
     labelDate       | date                   |
     TTFB            | integer                |
     reqGIF          | smallint               | not null
     reqJPG          | smallint               | not null
     reqPNG          | smallint               | not null
     reqFont         | smallint               | not null
     bytesGIF        | integer                | not null
     bytesJPG        | integer                | not null
     bytesPNG        | integer                | not null
     bytesFont       | integer                | not null
     maxageMore      | smallint               | not null
     maxage365       | smallint               | not null
     maxage30        | smallint               | not null
     maxage1         | smallint               | not null
     maxage0         | smallint               | not null
     maxageNull      | smallint               | not null
     numDomElements  | integer                | not null
     numCompressed   | smallint               | not null
     numHTTPS        | smallint               | not null
     numGlibs        | smallint               | not null
     numErrors       | smallint               | not null
     numRedirects    | smallint               | not null
     maxDomainReqs   | smallint               | not null
     bytesHTMLDoc    | integer                | not null
     fullyLoaded     | integer                |
     cdn             | character varying(64)  |
     SpeedIndex      | integer                |
     visualComplete  | integer                |
     gzipTotal       | integer                | not null
     gzipSavings     | integer                | not null
     siteid          | numeric                |
    Indexes:
        "pages_pkey" PRIMARY KEY, btree (pageid)
        "pages_date_url" UNIQUE CONSTRAINT, btree ("urlShort", "labelDate")
        "idx_pages_cdn" btree (cdn)
        "idx_pages_labeldate" btree ("labelDate") CLUSTER
        "idx_pages_urlshort" btree ("urlShort")
    Triggers:
        pages_label_date BEFORE INSERT OR UPDATE ON pages FOR EACH ROW EXECUTE PROCEDURE fix_label_date()

ypercubeᵀᴹ · Accepted Answer

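This is a known issue with Postgres optimization. If the distinct values are few, as in your case, and you are on version 8.4 or later, a very fast workaround using a recursive query is described here: Loose Indexscan.

Your query could be rewritten as follows (LATERAL requires version 9.3 or later):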

    -- "labelDate" must be double-quoted because the column name is mixed-case
    WITH RECURSIVE pa AS
    (
      ( SELECT "labelDate" FROM pages ORDER BY "labelDate" LIMIT 1 )
      UNION ALL
      SELECT n."labelDate"
      FROM pa AS p
         , LATERAL
           ( SELECT "labelDate"
             FROM pages
             WHERE "labelDate" > p."labelDate"
             ORDER BY "labelDate"
             LIMIT 1
           ) AS n
    )
    SELECT "labelDate"
    FROM pa ;

Erwin Brandstetter has a thorough explanation and several variations of the query in this answer (on a related but different problem): Optimize GROUP BY query to retrieve latest record per user

Erwin Brandstetter · Answer

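The best query depends heavily on data distribution.

It has been established that you have many rows per date. Since your case boils down to only 26 values in the result, all of the following solutions will be blazingly fast as soon as the index is used.
(With more distinct values, the case would get more interesting.)

There is no need to involve pageid at all (as you commented).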

Index

All you need is a simple btree index on "labelDate".
With some NULL values in the column, a partial index helps some more (and is smaller):

    CREATE INDEX pages_labeldate_nonull_idx ON pages ("labelDate") WHERE "labelDate" IS NOT NULL;

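You later clarified:

"0% NULL, but only after fixing things up when importing."

A partial index may still make sense to rule out intermediate states of rows with NULL values. It would avoid needless updates to the index (and the resulting bloat).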
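
A minimal sketch of what it takes for the planner to consider the partial index: the query's WHERE clause has to imply the index predicate. This assumes the partial index exists on the pages table from the question; the actual plan chosen still depends on statistics and version, so check it with EXPLAIN.

```sql
-- The WHERE clause implies the index predicate ("labelDate" IS NOT NULL),
-- so the planner may consider the partial index for this query.
EXPLAIN
SELECT "labelDate"
FROM   pages
WHERE  "labelDate" IS NOT NULL
ORDER  BY "labelDate"
LIMIT  1;

-- No predicate on "labelDate": the partial index cannot be used here,
-- because rows with NULL are not covered by it.
-- (A full index on "labelDate" can still be used.)
EXPLAIN
SELECT "labelDate"
FROM   pages
ORDER  BY "labelDate"
LIMIT  1;
```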

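Query

Based on a provisional range

If your dates appear in a continuous range with not too many gaps, we can use the nature of the data type date to our advantage. There is only a finite, countable number of values between two given values. If the gaps are few, this will be the fastest: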

    SELECT d."labelDate"
    FROM  (
       SELECT generate_series(min("labelDate")::timestamp
                            , max("labelDate")::timestamp
                            , interval '1 day')::date AS "labelDate"
       FROM   pages
       ) d
    WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");

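Why the cast to timestamp in generate_series()? See:

Generating time series between two dates in PostgreSQL

Min and max can be picked from the index cheaply. If you know the minimum and/or maximum possible date, it gets a bit cheaper still. Example: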

    SELECT d."labelDate"
    FROM  (SELECT date '2011-01-01' + g AS "labelDate"
           FROM   generate_series(0, now()::date - date '2011-01-01' - 1) g) d
    WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");

Or, for an immutable interval:

    SELECT d."labelDate"
    FROM  (SELECT date '2011-01-01' + g AS "labelDate"
           FROM   generate_series(0, 363) g) d
    WHERE  EXISTS (SELECT FROM pages WHERE "labelDate" = d."labelDate");
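
Whether the range approach pays off depends on how many candidate dates have to be probed. A rough sketch for checking that up front, using only the min and max that the index provides cheaply (the alias names min_date, max_date and days_in_range are made up for illustration):

```sql
-- Number of candidate dates the range query would probe with EXISTS.
-- Subtracting two dates yields an integer number of days.
SELECT min_date
     , max_date
     , max_date - min_date + 1 AS days_in_range
FROM  (
   SELECT min("labelDate") AS min_date
        , max("labelDate") AS max_date
   FROM   pages
   ) sub;
```

If days_in_range is close to the expected number of distinct dates (26 in the question), there are few gaps and the range-based queries should do well. If it is much larger, most probes find nothing and the loose index scan is probably the better fit.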

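Loose index scan

This performs very well with any distribution of dates (as long as there are many rows per date). It is basically what @ypercube already provided. But there are some fine points, and we need to make sure our favorite index can be used everywhere.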

    WITH RECURSIVE p AS (
       ( -- parentheses required for LIMIT
       SELECT "labelDate"
       FROM   pages
       WHERE  "labelDate" IS NOT NULL
       ORDER  BY "labelDate"
       LIMIT  1
       )
       UNION ALL
       SELECT (SELECT "labelDate"
               FROM   pages
               WHERE  "labelDate" > p."labelDate"
               ORDER  BY "labelDate"
               LIMIT  1)
       FROM   p
       WHERE  "labelDate" IS NOT NULL
       )
    SELECT "labelDate"
    FROM   p
    WHERE  "labelDate" IS NOT NULL;

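The first (non-recursive) term of the CTE p is effectively the same as
```
SELECT min("labelDate") FROM pages
```
But the verbose form makes sure the partial index is used. Plus, this form is typically a bit faster in my experience (and in my tests).
For only a single column, a correlated subquery in the recursive term of the rCTE should be a bit faster. This requires excluding rows that result in NULL for "labelDate". See:
Optimize GROUP BY query to retrieve latest record per user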
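
A small sketch of why the correlated-subquery form needs the NULL filters while the LATERAL form does not. It uses date 'infinity' as a stand-in for "past the last date in the table", which is an assumption made only for this illustration:

```sql
-- A scalar subquery that finds no row evaluates to NULL, so this returns
-- exactly one row containing NULL. In the rCTE, that NULL row is what ends
-- the recursion, and it has to be filtered from the final result.
SELECT (SELECT "labelDate"
        FROM   pages
        WHERE  "labelDate" > date 'infinity'
        ORDER  BY "labelDate"
        LIMIT  1) AS next_date;

-- With LATERAL (9.3+), a subquery that finds no row removes the row from
-- the join, so this returns zero rows and no NULL filter is needed.
SELECT n."labelDate" AS next_date
FROM  (SELECT date 'infinity' AS "labelDate") p
     , LATERAL (SELECT "labelDate"
                FROM   pages
                WHERE  "labelDate" > p."labelDate"
                ORDER  BY "labelDate"
                LIMIT  1) n;
```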

Asides

Unquoted, legal, lower-case identifiers make your life easier.
Order the columns in your table definition favorably to save some disk space:

Calculating and saving space in PostgreSQL
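
A hedged sketch of what both asides could look like together. The table pages_sketch and its column names are hypothetical, and only a handful of columns are shown. The idea is to group fixed-width columns of the same size and put variable-length columns last, so that less alignment padding is needed between columns. Actual savings depend on the real table and have not been measured here.

```sql
-- Hypothetical, abbreviated variant of the table; names are illustrative only.
-- Lower-case identifiers need no double quotes anywhere.
CREATE TABLE pages_sketch (
   page_id     integer  NOT NULL   -- 4 bytes
 , label_date  date                -- 4 bytes
 , req_total   integer  NOT NULL   -- 4 bytes
 , req_gif     smallint NOT NULL   -- 2 bytes
 , req_jpg     smallint NOT NULL   -- 2 bytes
 , site_id     numeric             -- variable length
 , url         text                -- variable length
);

-- Queries against it can then be written without quoting:
SELECT label_date FROM pages_sketch GROUP BY label_date;
```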