Forcing the PostgreSQL query planner to use an indexed nested loop instead of a hash join

Question

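I have a problem with StackOverflow-schema data loaded into PostgreSQL 9.3.4. The planner chooses a hash join instead of an indexed nested loop, so the query takes about 10 times longer than it needs to. For example, when the query selects 500 users, a hash join is used instead of the id and type indexes on the post_tokenized table.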

    explain
    select creation_Epoch, user_screen_name, chunk
    from post_tokenized as tokenized_tbl
    join posts as posts_tbl on posts_tbl.id = tokenized_tbl.id
    where type = 'tag'
      and user_screen_name is not null
      and owner_user_id in (
            select id from users
            where reputation > 100000
            order by reputation asc
            limit 500)
      and tokenized_tbl.id in (
            select id from posts
            where owner_user_id in (
                  select id from users
                  where reputation > 100000
                  order by reputation asc
                  limit 500))

    Hash Join  (cost=29570.13..751852.55 rows=119954 width=21)
      Hash Cond: (tokenized_tbl.id = posts_tbl.id)
      ->  Index Scan using type_index_post_tokenized on post_tokenized tokenized_tbl  (cost=0.44..646219.29 rows=20281711 width=8)
            Index Cond: (type = 'tag'::text)
      ->  Hash  (cost=29561.73..29561.73 rows=637 width=25)
            ->  Hash Join  (cost=15576.75..29561.73 rows=637 width=25)
                  Hash Cond: (posts_tbl.id = posts.id)
                  ->  Nested Loop  (cost=48.20..12824.71 rows=106853 width=21)
                        ->  HashAggregate  (cost=47.76..52.76 rows=500 width=4)
                              ->  Limit  (cost=0.43..41.51 rows=500 width=8)
                                    ->  Index Scan using reputation_index_users on users  (cost=0.43..211.57 rows=2570 width=8)
                                          Index Cond: (reputation > 100000)
                        ->  Index Scan using owner_user_id_index_posts on posts posts_tbl  (cost=0.44..23.40 rows=214 width=25)
                              Index Cond: (owner_user_id = users.id)
                              Filter: (user_screen_name IS NOT NULL)
                  ->  Hash  (cost=14181.63..14181.63 rows=107754 width=4)
                        ->  HashAggregate  (cost=13104.09..14181.63 rows=107754 width=4)
                              ->  Nested Loop  (cost=48.20..12834.71 rows=107754 width=4)
                                    ->  HashAggregate  (cost=47.76..52.76 rows=500 width=4)
                                          ->  Limit  (cost=0.43..41.51 rows=500 width=8)
                                                ->  Index Scan using reputation_index_users on users users_1  (cost=0.43..211.57 rows=2570 width=8)
                                                      Index Cond: (reputation > 100000)
                                    ->  Index Scan using owner_user_id_index_posts on posts  (cost=0.44..23.40 rows=216 width=8)
                                          Index Cond: (owner_user_id = users_1.id)

However, if I reduce the number of users to 200, the indexed nested loop is used, which is much faster.

    explain
    select creation_Epoch, user_screen_name, chunk
    from post_tokenized as tokenized_tbl
    join posts as posts_tbl on posts_tbl.id = tokenized_tbl.id
    where type = 'tag'
      and user_screen_name is not null
      and owner_user_id in (
            select id from users
            where reputation > 100000
            order by reputation asc
            limit 200)
      and tokenized_tbl.id in (
            select id from posts
            where owner_user_id in (
                  select id from users
                  where reputation > 100000
                  order by reputation asc
                  limit 200))

    Nested Loop  (cost=6633.63..466114.15 rows=47982 width=21)
      ->  Hash Join  (cost=6291.07..11836.00 rows=102 width=25)
            Hash Cond: (posts_tbl.id = posts.id)
            ->  Nested Loop  (cost=19.80..5189.72 rows=42741 width=21)
                  ->  HashAggregate  (cost=19.36..21.36 rows=200 width=4)
                        ->  Limit  (cost=0.43..16.86 rows=200 width=8)
                              ->  Index Scan using reputation_index_users on users  (cost=0.43..211.57 rows=2570 width=8)
                                    Index Cond: (reputation > 100000)
                  ->  Index Scan using owner_user_id_index_posts on posts posts_tbl  (cost=0.44..23.70 rows=214 width=25)
                        Index Cond: (owner_user_id = users.id)
                        Filter: (user_screen_name IS NOT NULL)
            ->  Hash  (cost=5732.50..5732.50 rows=43102 width=4)
                  ->  HashAggregate  (cost=5301.48..5732.50 rows=43102 width=4)
                        ->  Nested Loop  (cost=19.80..5193.72 rows=43102 width=4)
                              ->  HashAggregate  (cost=19.36..21.36 rows=200 width=4)
                                    ->  Limit  (cost=0.43..16.86 rows=200 width=8)
                                          ->  Index Scan using reputation_index_users on users users_1  (cost=0.43..211.57 rows=2570 width=8)
                                                Index Cond: (reputation > 100000)
                              ->  Index Scan using owner_user_id_index_posts on posts  (cost=0.44..23.70 rows=216 width=8)
                                    Index Cond: (owner_user_id = users_1.id)
      ->  Bitmap Heap Scan on post_tokenized tokenized_tbl  (cost=342.56..4448.69 rows=502 width=8)
            Recheck Cond: (id = posts_tbl.id)
            Filter: (type = 'tag'::text)
            ->  Bitmap Index Scan on id_index_post_tokenized  (cost=0.00..342.44 rows=43656 width=0)
                  Index Cond: (id = posts_tbl.id)

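How can I get the same plan (the indexed nested loop) when 500 users are selected? I have tried tuning the following parameters: cpu_tuple_cost, seq_page_cost, random_page_cost, and effective_cache_size (ref), but I could not find a way to change the plan. The plan seems to change as the number of requested users grows, but tests in my environment show that the query would be much faster if Postgres kept the same plan even for 500 users.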

Erwin Brandstetter · Accepted Answer

This closely related answer on SO should answer your main question:
Set enable_seqscan = off in a single SELECT query

In the same way, you can use the following to disable hash joins for the current transaction:

    SET LOCAL enable_hashjoin = off;

But that is not my advice. Read the answer over there.
And this one, too, about statistics and cost settings.
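
For a one-off comparison, the transaction-scoped setting could be exercised roughly as follows. This is a minimal sketch, not a recommendation: it assumes the tables and query from the question, and wrapping the test in EXPLAIN ANALYZE and ROLLBACK is an assumption about how one would want to experiment. SET LOCAL only lasts until the end of the enclosing transaction, and enable_hashjoin = off discourages hash joins rather than strictly forbidding them.

```sql
BEGIN;

-- Takes effect only until this transaction ends.
SET LOCAL enable_hashjoin = off;

-- EXPLAIN ANALYZE actually runs the query and reports real timings.
EXPLAIN ANALYZE
SELECT creation_Epoch, user_screen_name, chunk
FROM   post_tokenized AS tokenized_tbl
JOIN   posts AS posts_tbl ON posts_tbl.id = tokenized_tbl.id
WHERE  type = 'tag'
AND    user_screen_name IS NOT NULL
AND    owner_user_id IN (
          SELECT id FROM users
          WHERE  reputation > 100000
          ORDER  BY reputation ASC
          LIMIT  500)
AND    tokenized_tbl.id IN (
          SELECT id FROM posts
          WHERE  owner_user_id IN (
                    SELECT id FROM users
                    WHERE  reputation > 100000
                    ORDER  BY reputation ASC
                    LIMIT  500));

-- Nothing was modified; ROLLBACK simply ends the transaction
-- and with it the SET LOCAL override.
ROLLBACK;
```

Comparing this output with the plan obtained without the override shows whether the planner's cost estimates or its row estimates are what push it toward the hash join.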

More importantly, untangle the query first:

    SELECT creation_Epoch, user_screen_name, chunk
    FROM  (
       SELECT id AS owner_user_id
       FROM   users
       WHERE  reputation > 100000
       ORDER  BY reputation
       LIMIT  500
       ) u
    JOIN   posts p USING (owner_user_id)
    JOIN   post_tokenized t USING (id)
    WHERE  type = 'tag'
    AND    user_screen_name IS NOT NULL;

This should be considerably faster, and it makes it easier for the query planner to choose the best plan (given appropriate cost settings and table statistics).
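
One way to check the "table statistics" part is to refresh the statistics and then compare estimated and actual row counts for the simplified query. The following is a sketch under the assumption that the three tables from the question are the only ones involved; it does not change any cost settings.

```sql
-- Refresh planner statistics for the tables used in the query.
ANALYZE users;
ANALYZE posts;
ANALYZE post_tokenized;

-- Run the simplified query and show estimated vs. actual row counts
-- per plan node, plus buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT creation_Epoch, user_screen_name, chunk
FROM  (
   SELECT id AS owner_user_id
   FROM   users
   WHERE  reputation > 100000
   ORDER  BY reputation
   LIMIT  500
   ) u
JOIN   posts p USING (owner_user_id)
JOIN   post_tokenized t USING (id)
WHERE  type = 'tag'
AND    user_screen_name IS NOT NULL;
```

Large gaps between the "rows=" estimate and the actual row count on a node usually point to the statistics or cost settings as the place to look before resorting to planner overrides.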