Improving performance of a gnarly nested view join

Question

I have a moderately sized database spread across several tables. The rough architecture is:

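Input data (data ID, session ID, and a few statistically important fields)
Input file (data ID and a blob)
Stage 1 output file (data ID and a blob)
Stage 2 output file (data ID and a blob)
Category 1 results (data ID and some booleans)
Category 2 results (data ID and some integers)
Category 3 results (data ID and some integers)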

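Each table has about 200,000 rows.

There is also a view that basically glues all of these together, so that I can SELECT a set of IDs (usually chosen by session ID) and display all the related data on one page.

The view works, and index usage in the query plan seems sane, but the results are not fast: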

    > EXPLAIN ANALYZE SELECT * FROM overlay WHERE test_session=12345;
                                  QUERY PLAN
    ------------------------------------------------------------------------
     Merge Right Join  (cost=7.19..74179.49 rows=10 width=305) (actual time=10680.129..10680.494 rows=4 loops=1)
       Merge Cond: (p.data_id = d.id)
       ->  Merge Join  (cost=7.19..75077.04 rows=183718 width=234) (actual time=0.192..10434.995 rows=173986 loops=1)
             Merge Cond: (p.data_id = input_file.data_id)
             ->  Merge Join  (cost=7.19..69917.74 rows=183718 width=222) (actual time=0.173..9255.653 rows=173986 loops=1)
                   Merge Cond: (p.data_id = stage1_output_file.data_id)
                   ->  Merge Join  (cost=5.50..62948.54 rows=183718 width=186) (actual time=0.153..8081.949 rows=173986 loops=1)
                         Merge Cond: (p.data_id = stage2_output_file.data_id)
                         ->  Merge Join  (cost=3.90..55217.36 rows=183723 width=150) (actual time=0.132..6918.814 rows=173986 loops=1)
                               Merge Cond: (p.data_id = stage3_output_file.data_id)
                               ->  Nested Loop  (cost=2.72..47004.01 rows=183723 width=114) (actual time=0.111..5753.105 rows=173986 loops=1)
                                     Join Filter: (p.impression = istr.id)
                                     ->  Merge Join  (cost=1.68..30467.90 rows=183723 width=102) (actual time=0.070..2675.733 rows=173986 loops=1)
                                           Merge Cond: (p.data_id = s.data_id)
                                           ->  Merge Join  (cost=1.68..19031.56 rows=183723 width=58) (actual time=0.049..1501.546 rows=173986 loops=1)
                                                 Merge Cond: (p.data_id = t.data_id)
                                                 ->  Index Scan using Category1_Results_pkey on Category1_Results p  (cost=0.00..7652.17 rows=183723 width=18) (actual time=0.025..315.531 rows=173986 loops=1)
                                                 ->  Index Scan using Category3_Results_pkey on Category3_Results t  (cost=0.00..8624.43 rows=183787 width=40) (actual time=0.016..321.460 rows=173986 loops=1)
                                           ->  Index Scan using Category2_Results_pkey on Category2_Results s  (cost=0.00..8681.47 rows=183787 width=44) (actual time=0.014..320.794 rows=173986 loops=1)
                                     ->  Materialize  (cost=1.04..1.08 rows=4 width=20) (actual time=0.001..0.007 rows=4 loops=173986)
                                           ->  Seq Scan on Category1_impression_str istr  (cost=0.00..1.04 rows=4 width=20) (actual time=0.005..0.012 rows=4 loops=1)
                               ->  Index Scan using Stage3_Output_file_pkey on Stage3_Output_file stage3  (cost=0.00..8178.35 rows=183871 width=36) (actual time=0.015..317.698 rows=173986 loops=1)
                         ->  Index Scan using analysis_file_pkey on analysis_file Stage2_Output  (cost=0.00..8168.99 rows=183718 width=36) (actual time=0.014..317.776 rows=173986 loops=1)
                   ->  Index Scan using Stage1_output_file_pkey on Stage1_output_file stg1  (cost=0.00..8199.07 rows=183856 width=36) (actual time=0.014..321.648 rows=173986 loops=1)
             ->  Index Scan using input_file_pkey on input_file input  (cost=0.00..8618.05 rows=183788 width=36) (actual time=0.014..328.968 rows=173986 loops=1)
       ->  Materialize  (cost=0.00..39.59 rows=10 width=75) (actual time=0.046..0.150 rows=4 loops=1)
             ->  Nested Loop Left Join  (cost=0.00..39.49 rows=10 width=75) (actual time=0.039..0.128 rows=4 loops=1)
                   Join Filter: (t.id = d.input_quality)
                   ->  Index Scan using input_data_exists_index on input_data d  (cost=0.00..28.59 rows=10 width=45) (actual time=0.013..0.025 rows=4 loops=1)
                         Index Cond: (test_session = 1040)
                   ->  Seq Scan on quality_codes t  (cost=0.00..1.04 rows=4 width=38) (actual time=0.002..0.009 rows=4 loops=4)
     Total runtime: 10680.902 ms

The view underlying this is the "full results" view, defined as:

    SELECT p.data_id, p.x2, istr.str AS impression, input.h, p.x3, p.x3, p.x4,
           s.x5, s.x6, s.x7, s.x8, s.x9, s.x10, s.x11, s.x12, s.x13, s.x14,
           t.x15, t.x16, t.x17, t.x18, t.x19, t.x20, t.x21, t.x22, t.x23,
           input.data AS input,
           stage1_output_file.data AS stage1,
           stage2_output_file.data AS stage2,
           stage3_output_file.data AS stage3
    FROM   category1_results p,
           category1_impression_str istr,
           input_file input,
           stage1_output_file,
           stage2_output_file,
           stage3_output_file,
           category2_results s,
           category3_results t
    WHERE  p.impression = istr.id
    AND    p.data_id = input.data_id
    AND    p.data_id = stage1_output_file.data_id
    AND    p.data_id = stage2_output_file.data_id
    AND    p.data_id = stage3_output_file.data_id
    AND    p.data_id = s.data_id
    AND    p.data_id = t.data_id;

The overlay view, from which the query plan above was generated, is defined as:

    SELECT d.data_id, d.test_session, d.a, d.b, t.c, d.d, d.e, d.f, r.*
    FROM   input_data d
    LEFT   JOIN quality_codes t ON t.id = d.input_quality
    LEFT   JOIN full_results r ON r.data_id = d.data_id
    WHERE  NOT d.deleted;

It looks like the whole data set is being joined at most stages of the chain, which I'm fairly sure is the performance problem. Any suggestions on how to optimize this pig?

Erwin Brandstetter · Accepted Answer

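I'm guessing here, but the LEFT JOIN to the view probably makes the planner compute the result of the whole view before joining it to the first part of the query.

Try inlining the query from the view and using JOIN instead of LEFT JOIN, then see whether the planner finds a faster way: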

    SELECT d.data_id, d.test_session, d.a, d.b, q.c, d.d, d.e, d.f
         , p.data_id AS p_data_id, p.x2, c.str AS impression, i.h
         , p.x3, p.x3, p.x4
         , s.x5, s.x6, s.x7, s.x8, s.x9, s.x10, s.x11, s.x12, s.x13, s.x14
         , t.x15, t.x16, t.x17, t.x18, t.x19, t.x20, t.x21, t.x22, t.x23
         , i.data AS input
         , s1.data AS stage1, s2.data AS stage2, s3.data AS stage3
    FROM   input_data d
    JOIN   category1_results  p  USING (data_id)
    JOIN   input_file         i  USING (data_id)
    JOIN   stage1_output_file s1 USING (data_id)
    JOIN   stage2_output_file s2 USING (data_id)
    JOIN   stage3_output_file s3 USING (data_id)
    JOIN   category2_results  s  USING (data_id)
    JOIN   category3_results  t  USING (data_id)
    JOIN   category1_impression_str c ON p.impression = c.id
    LEFT   JOIN quality_codes q ON q.id = d.input_quality
    WHERE  NOT d.deleted;

I tidied up the syntax to make it more manageable, and added an alias for the second data_id column so that the query can run.

This should run considerably faster. If it does, you can try adding back the rows that go missing because of the INNER JOIN, like this:

    SELECT DISTINCT ON (1,2,3,4,5,6,7,8) *
    FROM  (
       <<query>>

       UNION ALL
       SELECT d.data_id, d.test_session, d.a, d.b, t.c, d.d, d.e, d.f
            , NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL
            , NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL
            , NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL, NULL
       FROM   input_data d
       LEFT   JOIN quality_codes t ON t.id = d.input_quality
       WHERE  NOT d.deleted
       ) x
    ORDER  BY 1,2,3,4,5,6,7,8, 9 NULLS LAST;  -- p.data_id is otherwise not null

voretaq7 · Answer

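After staring at this for a few days, one possible solution is to denormalize the tables and put the session ID on all of them. That lets the query planner quickly cut the JOINs down to a smaller subset of rows.

The big downside is that it denormalizes the database. That is probably not a deal-breaker, but it is something I would like to avoid if possible...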
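
For illustration, a minimal sketch of what that denormalization might look like in PostgreSQL, shown for two of the tables. It assumes test_session is an integer column on input_data; the new column and the index names are hypothetical, and the same steps would be repeated for the remaining tables. Whether the planner then chooses a better plan is not guaranteed, so it is worth checking with EXPLAIN ANALYZE.

```sql
-- Assumption: input_data.test_session is an integer.
-- The added columns and index names are hypothetical.
ALTER TABLE category1_results ADD COLUMN test_session integer;
ALTER TABLE category2_results ADD COLUMN test_session integer;

-- Backfill the session ID from input_data.
UPDATE category1_results p
SET    test_session = d.test_session
FROM   input_data d
WHERE  d.data_id = p.data_id;

UPDATE category2_results s
SET    test_session = d.test_session
FROM   input_data d
WHERE  d.data_id = s.data_id;

-- Index the new column so each table can be filtered by session directly.
CREATE INDEX category1_results_test_session_idx
    ON category1_results (test_session);
CREATE INDEX category2_results_test_session_idx
    ON category2_results (test_session);

-- Each table is restricted to one session before it is joined.
EXPLAIN ANALYZE
SELECT p.data_id, p.x2, s.x5
FROM   category1_results p
JOIN   category2_results s ON s.data_id = p.data_id
WHERE  p.test_session = 12345
AND    s.test_session = 12345;
```

The copied column has to be kept in sync with input_data (for example by setting it at insert time), which is part of the cost of denormalizing.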