Why is an indexed DISTINCT ON so much slower than an inner join?

Question

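I have two tables, customers and purchases. There are many (thousands of) purchases per customer. Usually I only need the latest purchase of each customer, so I added a latest_purchase_id column and update it with a trigger whenever a purchase is added (see https://dba.stackexchange.com/a/243988/186435).

I would rather not use a trigger, so I tried a DISTINCT ON query backed by an index instead, but it is much slower and I don't understand why.

Table customers: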

           Column       |  Type   |                       Modifiers                        | Storage | Stats target | Description
    --------------------+---------+--------------------------------------------------------+---------+--------------+-------------
     id                 | integer | not null default nextval('customers_id_seq'::regclass) | plain   |              |
     latest_purchase_id | integer |                                                        | plain   |              |
    Indexes:
        "customers_pkey" PRIMARY KEY, btree (id)
        "customers_latest_purchase_id" btree (latest_purchase_id)
    Foreign-key constraints:
        "customers_latest_purchase_fk" FOREIGN KEY (latest_purchase_id) REFERENCES purchases(id) DEFERRABLE INITIALLY DEFERRED
    Referenced by:
        TABLE "purchases" CONSTRAINT "purchases_customer_fk" FOREIGN KEY (customer_id) REFERENCES customers(id) DEFERRABLE INITIALLY DEFERRED
    Has OIDs: no

Table purchases:

       Column    |  Type   |                       Modifiers                        | Storage | Stats target | Description
    -------------+---------+--------------------------------------------------------+---------+--------------+-------------
     id          | integer | not null default nextval('purchases_id_seq'::regclass) | plain   |              |
     customer_id | integer |                                                        | plain   |              |
    Indexes:
        "purchases_pkey" PRIMARY KEY, btree (id)
        "purchases_customer_id_id" btree (customer_id, id)
        "purchases_customer_id" btree (customer_id)
    Foreign-key constraints:
        "purchases_customer_fk" FOREIGN KEY (customer_id) REFERENCES customers(id) DEFERRABLE INITIALLY DEFERRED
    Referenced by:
        TABLE "customers" CONSTRAINT "customers_latest_purchase_id" FOREIGN KEY (latest_purchase_id) REFERENCES purchases(id) DEFERRABLE INITIALLY DEFERRED
    Has OIDs: no

The DISTINCT ON query:

    EXPLAIN ANALYZE
    SELECT DISTINCT ON (customer_id) id, customer_id
    FROM purchases
    ORDER BY customer_id DESC, id DESC;

    Result  (cost=0.43..162516.37 rows=381 width=8) (actual time=0.050..1478.196 rows=823 loops=1)
      ->  Unique  (cost=0.43..162516.37 rows=381 width=8) (actual time=0.047..1477.754 rows=823 loops=1)
            ->  Index Only Scan Backward using purchases_customer_id_id on purchases  (cost=0.43..157850.96 rows=1866163 width=8) (actual time=0.045..1066.759 rows=1866132 loops=1)
                  Heap Fetches: 1363529
    Planning Time: 0.096 ms
    Execution Time: 1478.408 ms

The query based on an INNER JOIN via latest_purchase:

    EXPLAIN ANALYZE
    SELECT c.id, p.id
    FROM customers c
    JOIN purchases p ON c.latest_purchase = p.id;

    Nested Loop  (cost=0.43..43877.27 rows=7594 width=8) (actual time=0.508..112.665 rows=755 loops=1)
      ->  Seq Scan on customers d  (cost=0.00..213.94 rows=7594 width=8) (actual time=0.006..2.861 rows=7594 loops=1)
      ->  Index Only Scan using customers_purchase_pkey on purchases p  (cost=0.43..5.75 rows=1 width=4) (actual time=0.014..0.014 rows=0 loops=7594)
            Index Cond: (id = c.latest_purchase)
            Heap Fetches: 583
    Planning Time: 1.032 ms
    Execution Time: 112.861 ms

Erwin Brandstetter · Accepted Answer

This is the answer:

"There are many (thousands of) purchases per customer."

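_DISTINCT ON_ is fast for _few_ purchases per customer. With thousands of purchases per customer, the index-only scan in the first plan has to read 1,866,132 rows just to return 823. See:

Select first row in each GROUP BY group?

This should be _much_ faster: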

    SELECT c.id AS customer_id, p.id AS purchase_id
    FROM   customers c
    LEFT   JOIN LATERAL (
       SELECT p.id
       FROM   purchases p
       WHERE  p.customer_id = c.id
       ORDER  BY p.id DESC
       LIMIT  1
       ) p ON true;

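Subtle difference: every customer is included in the result, including customers without any purchases (for those, purchase_id is NULL).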
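If the result should only contain customers that have at least one purchase, which is closer to what the DISTINCT ON query returns, the same lateral subquery can be combined with an inner join. This is a minimal sketch using only the tables and columns from the question. It will still differ from the DISTINCT ON output for purchases whose customer_id has no row in customers.

```sql
-- Same "latest purchase per customer" lookup as above, but customers
-- without any purchase are dropped, because the lateral subquery
-- returns no row for them.
SELECT c.id AS customer_id, p.id AS purchase_id
FROM   customers c
CROSS  JOIN LATERAL (
   SELECT p.id
   FROM   purchases p
   WHERE  p.customer_id = c.id
   ORDER  BY p.id DESC
   LIMIT  1
   ) p;
```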

The existing index "purchases_customer_id_id" btree (customer_id, id) is well suited to this. An index on _(customer_id, id DESC)_ would be slightly better still.
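As a sketch, the suggested index could be created like this. The index name is hypothetical; only the column list comes from the answer.

```sql
-- Hypothetical index name; the column list (customer_id, id DESC)
-- is the one suggested in the answer.
CREATE INDEX purchases_customer_id_id_desc
    ON purchases (customer_id, id DESC);
```

Since the existing (customer_id, id) index already serves the query, it may be worth comparing EXPLAIN ANALYZE output before and after to see whether the extra index pays for itself.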

See:

Optimize GROUP BY query to retrieve latest row per user

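Aside 1:

The first plan shows _rows=823_, the second shows _rows=755_. This indicates that there are _purchases.customer_id_ values without a match in table customers, which should not normally happen. Add an FK constraint from _purchases.customer_id_ to _customers.id_ and make _purchases.customer_id NOT NULL_ to enforce referential integrity.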
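One way to check for such rows before tightening the constraints is sketched below. It assumes only the tables and columns shown in the question. The table definition there already lists a "purchases_customer_fk" constraint, so only the NOT NULL step is shown.

```sql
-- Purchases whose customer_id has no matching row in customers.
-- Rows with a NULL customer_id are reported here as well.
SELECT p.customer_id, count(*) AS purchases
FROM   purchases p
LEFT   JOIN customers c ON c.id = p.customer_id
WHERE  c.id IS NULL
GROUP  BY p.customer_id;

-- Once no NULL values remain, forbid them going forward.
-- This statement fails if any row still has a NULL customer_id.
ALTER TABLE purchases ALTER COLUMN customer_id SET NOT NULL;
```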

Aside 2:

Both query plans show a lot of _Heap Fetches_. Are you vacuuming enough? See:

How does PostgreSQL perform ORDER BY with a B-tree index on the field?
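A quick way to check the vacuum state is sketched below, assuming the table names from the question. Index-only scans can skip the heap only for pages marked all-visible in the visibility map, and VACUUM is what maintains that map.

```sql
-- When were the tables last vacuumed, and how many dead rows remain?
SELECT relname, last_vacuum, last_autovacuum, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname IN ('customers', 'purchases');

-- Manual vacuum; this also updates the visibility map that
-- index-only scans rely on to avoid heap fetches.
VACUUM (ANALYZE) purchases;
```

Rerunning EXPLAIN ANALYZE afterwards should show whether the Heap Fetches count drops.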