Postgres: Why is a CTE slower than a subquery?

Question

I have a slightly complex query that splits strings and outputs each word as a record.

I ran a simple test, with one version using a CTE and the other using a subquery, and was surprised to find that the CTE took twice as long to execute.

Here is the gist of what the query does:

    -- 1. translate matches characters from comment to given list (of symbols) and replaces them with commas.
    -- 2. string_to_array splits string by comma and puts in an array
    -- 3. unnest unpacks the array into rows
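As a small illustration of these three steps, here is a sketch that runs them on a literal string. The sample text and the shortened symbol list (space, comma, period) are assumptions made for the example, not data from nps_reports.

```sql
-- Hypothetical input; the symbol list is shortened to three characters.
SELECT unnest(
         string_to_array(
           translate('good, fast. cheap', ' ,.', ',,,'),  -- gives 'good,,fast,,cheap'
           ',',   -- split on commas
           ''     -- empty fields are returned as NULL
         )
       ) AS word;
```

Adjacent separators produce empty fields, which the third argument of string_to_array turns into NULL. This yields five rows (good, NULL, fast, NULL, cheap), which is why both queries below filter on Word IS NOT NULL.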

Inline subquery

    SELECT sub_query.Word, sub_query._created_at
    FROM (
        SELECT unnest(string_to_array(translate(nps_reports.comment::text, ' ,.<>?/;:@#~[{]}=+-_)("*&^%$£!`\|}'::text, ',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'::text), ','::text, ''::text)) AS Word,
               nps_reports.comment,
               nps_reports._id,
               nps_reports._created_at
        FROM nps_reports
        WHERE nps_reports.comment::text <> 'undefined'::text
    ) sub_query
    WHERE sub_query.Word IS NOT NULL
      AND NOT (sub_query.Word IN (SELECT stop_words.stop_Word FROM stop_words))
    ORDER BY sub_query._created_at DESC;

CTE

    WITH split AS (
        SELECT unnest(string_to_array(translate(nps_reports.comment::text, ' ,.<>?/;:@#~[{]}=+-_)("*&^%$£!`\|}'::text, ',,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,'::text), ','::text, ''::text)) AS Word,
               nps_reports.comment,
               nps_reports._id,
               nps_reports._created_at
        FROM nps_reports
        WHERE nps_reports.comment::text <> 'undefined'::text
    )
    SELECT split.Word, split._created_at
    FROM split
    WHERE split.Word IS NOT NULL
      AND NOT (split.Word IN (SELECT stop_words.stop_Word FROM stop_words))
    ORDER BY split._created_at DESC;

And here is the EXPLAIN output for each:

Subquery EXPLAIN

    Sort (cost=15921589.76..16082302.91 rows=64285258 width=40) (actual time=16299.150..17697.914 rows=4394788 loops=1)
      Sort Key: sub_query._created_at DESC
      Sort Method: external merge Disk: 116112kB
      Buffers: shared hit=22915 read=7627, temp read=34281 written=34281
      -> Subquery Scan on sub_query (cost=2.49..2311035.10 rows=64285258 width=40) (actual time=0.177..13274.895 rows=4394788 loops=1)
           Filter: ((sub_query.Word IS NOT NULL) AND (NOT (hashed SubPlan 1)))
           Rows Removed by Filter: 3676303
           Buffers: shared hit=22915 read=7627
           -> Seq Scan on nps_reports (cost=0.00..695825.11 rows=129216600 width=88) (actual time=0.073..9781.244 rows=8071091 loops=1)
                Filter: ((comment)::text <> 'undefined'::text)
                Rows Removed by Filter: 844360
                Buffers: shared hit=22914 read=7627
           SubPlan 1
             -> Seq Scan on stop_words (cost=0.00..2.19 rows=119 width=4) (actual time=0.016..0.034 rows=119 loops=1)
                  Buffers: shared hit=1
    Planning time: 0.115 ms
    Execution time: 18451.245 ms

CTE EXPLAIN

    Sort (cost=17213755.76..17374468.91 rows=64285258 width=40) (actual time=44008.467..45508.786 rows=4394788 loops=1)
      Sort Key: split._created_at DESC
      Sort Method: external merge Disk: 116112kB
      Buffers: shared hit=23031 read=7531, temp read=34281 written=353942
      CTE split
        -> Seq Scan on nps_reports (cost=0.00..695825.11 rows=129216600 width=135) (actual time=0.057..10451.951 rows=8071091 loops=1)
             Filter: ((comment)::text <> 'undefined'::text)
             Rows Removed by Filter: 844360
             Buffers: shared hit=23027 read=7531
      -> CTE Scan on split (cost=2.49..2907375.99 rows=64285258 width=40) (actual time=0.162..37888.364 rows=4394788 loops=1)
           Filter: ((Word IS NOT NULL) AND (NOT (hashed SubPlan 2)))
           Rows Removed by Filter: 3676303
           Buffers: shared hit=23028 read=7531, temp written=319661
           SubPlan 2
             -> Seq Scan on stop_words (cost=0.00..2.19 rows=119 width=4) (actual time=0.009..0.030 rows=119 loops=1)
                  Buffers: shared hit=1
    Planning time: 0.649 ms
    Execution time: 46297.825 ms

Evan Carroll · Accepted Answer

PostgreSQL's CTEs are optimization fences. That means the query planner does not push optimizations across a CTE boundary.
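Note that this describes PostgreSQL before version 12. From version 12 on, a CTE that is non-recursive, free of side effects and referenced only once is inlined by default, and the MATERIALIZED and NOT MATERIALIZED keywords let you choose explicitly. The following is a minimal sketch, assuming PostgreSQL 12 or later. The demo_reports table and its rows are hypothetical stand-ins for nps_reports, and the split expression is simplified.

```sql
-- Hypothetical demo table standing in for nps_reports (PostgreSQL 12+).
CREATE TEMP TABLE demo_reports (comment text, _created_at timestamptz);
INSERT INTO demo_reports VALUES
    ('fast delivery', now()),
    ('undefined',     now());

-- NOT MATERIALIZED asks the planner to inline the CTE like a subquery;
-- writing MATERIALIZED instead keeps the pre-12 fence behaviour.
EXPLAIN
WITH split AS NOT MATERIALIZED (
    SELECT unnest(string_to_array(comment, ' ')) AS word, _created_at
    FROM demo_reports
    WHERE comment <> 'undefined'
)
SELECT word, _created_at
FROM split
WHERE word IS NOT NULL
ORDER BY _created_at DESC;
```

With NOT MATERIALIZED the plan should no longer contain a CTE Scan node. In the question's CTE plan, that node is where the extra temp writes (temp written=319661) appear.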

There is a blog entry about this.

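I think you want to write it like this instead, though; a lot of the original is just silly. Here we use CROSS JOIN LATERAL rather than the complex wrapping, and NOT EXISTS rather than NOT IN: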

    SELECT Word, _created_at
    FROM nps_reports
    CROSS JOIN LATERAL unnest(regexp_split_to_array(
        nps_reports.comment,
        '[^a-zA-Z0-9]+'
    )) AS Word
    WHERE nps_reports.comment <> 'undefined'
      AND nps_reports.comment IS NOT NULL
      AND NOT EXISTS (
          SELECT 1
          FROM stop_words
          WHERE stop_words.stop_Word = Word
      )
    ORDER BY _created_at DESC;
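The answer does not say why NOT EXISTS is preferred. One commonly cited reason is NULL handling: if the subquery behind NOT IN returns even one NULL, the test can never be true. Whether stop_words actually contains NULLs is not known from the question, so the sketch below uses a made-up stop_list.

```sql
-- Hypothetical stop list containing a NULL entry.
WITH stop_list(stop_word) AS (VALUES ('the'), (NULL))
SELECT
    'fox' NOT IN (SELECT stop_word FROM stop_list)               AS via_not_in,
    NOT EXISTS (SELECT 1 FROM stop_list WHERE stop_word = 'fox') AS via_not_exists;
```

Here via_not_in is NULL, which a WHERE clause treats as not true, so the row would be dropped. via_not_exists is true.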

That said, it looks like you are reinventing full-text search (FTS), so this is a bad idea too.

turnip · Answer

@Evan Carroll has explained why the CTE takes longer, so here is an improved query. It is faster than all of the solutions above.

See this question for more details.

    -- create custom dict (you don't necessarily need to do this)
    CREATE TEXT SEARCH DICTIONARY simple_with_stop_words (TEMPLATE = pg_catalog.simple, STOPWORDS = english);
    CREATE TEXT SEARCH CONFIGURATION public.simple_with_stop_words (COPY = pg_catalog.simple);
    ALTER TEXT SEARCH CONFIGURATION public.simple_with_stop_words ALTER MAPPING FOR asciiword WITH simple_with_stop_words;

    -- the actual query
    SELECT token.Word, nps._created_at
    FROM nps_reports nps
    CROSS JOIN LATERAL UNNEST(to_tsvector('simple_with_stop_words', nps.comment)) token(Word)
    WHERE nps.comment IS NOT NULL
      AND nps.comment <> 'undefined'
      AND nps.language = 'en-US';

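This makes use of PostgreSQL's to_tsvector function, which performs some processing depending on the configuration it is given. If you use it with the built-in simple configuration instead of the custom one I created, it just splits any string into words.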
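To see what to_tsvector produces with the built-in simple configuration, here is a small sketch on a made-up sentence. It assumes PostgreSQL 9.6 or later, where unnest accepts a tsvector.

```sql
-- Hypothetical sample text; 'simple' lowercases tokens and removes no stop words.
SELECT lexeme
FROM unnest(to_tsvector('simple', 'The service was great, great value!'));
```

This returns the lexemes great, service, the, value and was. A tsvector stores each distinct lexeme once along with its positions, so the repeated "great" appears as a single row. That differs from the translate/string_to_array approach, which emits one row per occurrence.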

It also uses the LATERAL keyword, which is available in Postgres 9.3 and later. It lets you pass arguments from the left side of the join to the right side, i.e. pass comment into UNNEST.
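The following sketch shows LATERAL on its own, using an inline VALUES list as a hypothetical stand-in for nps_reports. The function on the right side of the join is evaluated once per row of r and can read r.comment.

```sql
-- Hypothetical rows; each comment is split into one row per word.
SELECT r.id, w.word
FROM (VALUES (1, 'fast delivery'), (2, 'too slow')) AS r(id, comment)
CROSS JOIN LATERAL unnest(string_to_array(r.comment, ' ')) AS w(word)
ORDER BY r.id;
```

This produces four rows, two for each id. For function calls in FROM the LATERAL keyword is optional, since such functions may refer to earlier FROM items anyway, but writing it makes the dependency explicit.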

Running it over the whole database takes about 10 seconds. Compare that with the previous fastest method (the subquery), which took 18 seconds.