Gistインデックスを使用するPostgres LIKEクエリは、フルスキャンと同じくらい遅い

Question

私が持っているのは、UNC共有からのファイルのパス、拡張子、および名前を格納する非常に単純なデータベースです。テストのために、約1.5行のmio行を挿入しました。以下のクエリは、ご覧のようにGistインデックスを使用していますが、戻るのに5秒かかります。予想されるのは、数ミリ秒（100など）です。

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles WHERE parentpath LIKE 'somevalue'

クエリで%%を使用すると、シーケンシャルスキャン（？！）を使用しても、それほど長くはかかりません

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles WHERE parentpath LIKE '%a%'

name（ファイル名）列にも同じ設定があります。その列で同様のクエリを実行すると、%%を使用している場合でも、半分の時間しかかかりません。

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles WHERE name LIKE '%a%'

私がすでに試みたことは、ここでは簡単に書くことはできません。私が何をしても、約1 mio行から始まると遅くなります。基本的に何も削除されないため、バキュームと再インデックス付けはまったく役に立ちません。「特定の人間の言語の単語」だけでなく、関心のある列で任意の文字を検索できるようにする必要があるため、LIKE %%およびGINまたはGistインデックス以外の種類の検索を実際に使用することはできません。

これが数百万行以上でも約100ミリ秒で機能するはずだと私の予想は間違っていますか？

さらに詳しい情報

DB作成スクリプト

テキストやその他のインデックスをまったく使用せずに再試行します。1.7mioの一意のエントリ

EXPLAIN ANALYZE select * from residentfiles where name like '%12345%' Seq Scan on residentfiles (cost=0.00..78162.76 rows=33225 width=232) (actual time=0.076..3195.965 rows=45301 loops=1) Filter: ((name)::text ~~ '%12345%'::text) Rows Removed by Filter: 1604780+ Planning time: 0.596 ms Execution time: 3318.595 ms

トライグラムジンインデックスで試してください：

CREATE INDEX IF NOT EXISTS name_gin_idx ON residentfiles USING gin (name gin_trgm_ops); CREATE INDEX IF NOT EXISTS parentpath_gin_idx ON residentfiles USING gin (parentpath gin_trgm_ops); CREATE INDEX IF NOT EXISTS ext_gin_idx ON residentfiles USING gin (extension gin_trgm_ops); EXPLAIN ANALYZE select * from residentfiles where name like '%12345%' Aggregate (cost=53717.59..53717.60 rows=1 width=0) (actual time=1694.223..1694.224 rows=1 loops=1) -> Bitmap Heap Scan on residentfiles (cost=341.89..53631.82 rows=34308 width=0) (actual time=72.010..1615.007 rows=46532 loops=1) Recheck Cond: ((name)::text ~~ '%12345%'::text) Rows Removed by Index Recheck: 111 Heap Blocks: exact=46372 -> Bitmap Index Scan on name_gin_idx (cost=0.00..333.31 rows=34308 width=0) (actual time=52.287..52.287 rows=46643 loops=1) Index Cond: ((name)::text ~~ '%12345%'::text) Planning time: 10.881 ms Execution time: 1694.755 ms

varchar_patternで試してください：

CREATE INDEX idx_varchar_pattern_parentpath ON residentfiles (parentpath varchar_pattern_ops); CREATE INDEX idx_varchar_pattern_name ON residentfiles (name varchar_pattern_ops); CREATE INDEX idx_varchar_pattern_extension ON residentfiles (extension varchar_pattern_ops); EXPLAIN ANALYZE select * from residentfiles where name like '%12345%' Aggregate (cost=89718.74..89718.75 rows=1 width=0) (actual time=1995.206..1995.207 rows=1 loops=1) -> Seq Scan on residentfiles (cost=0.00..89574.98 rows=57507 width=0) (actual time=0.060..1913.114 rows=52232 loops=1) Filter: ((name)::text ~~ '%12345%'::text) Rows Removed by Filter: 1852103 Planning time: 8.280 ms Execution time: 1995.255 ms

MarcoP · Answer

私の意見では、テストの実施方法を共有しないと、答えを出すのは非常に困難です。私の意味の例を見てみましょう。 postgres 11を使用して申し訳ありませんが、結論は同じです。

これは新しいdbであり、インスタンスに対して何も実行されていません。

test=# CREATE EXTENSION pg_trgm; CREATE EXTENSION test=# create table test_trgmidx (col1 varchar(30), col2 varchar(50)); CREATE TABLE test=# CREATE INDEX trgm_idx_test_col2 ON test_trgmidx USING Gist (col2 Gist_trgm_ops); CREATE INDEX

非常に単純なループを使用して500000行を挿入します。

test=# \i loop_long.sql DO test=# select count(1) from test_trgmidx; count -------- 500000 (1 row) test=# select * from test_trgmidx limit 20; col1 | col2 ------------+------------ ABCD1EFGH | abcd1efgh ABCD2EFGH | abcd2efgh ABCD3EFGH | abcd3efgh ABCD4EFGH | abcd4efgh ABCD5EFGH | abcd5efgh ABCD6EFGH | abcd6efgh ABCD7EFGH | abcd7efgh ABCD8EFGH | abcd8efgh ABCD9EFGH | abcd9efgh ABCD10EFGH | abcd10efgh ABCD11EFGH | abcd11efgh ABCD12EFGH | abcd12efgh ABCD13EFGH | abcd13efgh ABCD14EFGH | abcd14efgh ABCD15EFGH | abcd15efgh ABCD16EFGH | abcd16efgh ABCD17EFGH | abcd17efgh ABCD18EFGH | abcd18efgh ABCD19EFGH | abcd19efgh ABCD20EFGH | abcd20efgh (20 rows)

次に、インスタンスを再起動してクリーンなバッファキャッシュを作成し、最初の選択の説明を2回実行して、キャッシュが結果にどのように影響を与えているかを確認します。

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------- Index Scan using trgm_idx_test_col2 on public.test_trgmidx (cost=0.29..8.30 rows=1 width=28) (actual time=4.586..4.912 rows=1 loops=1) Output: col1, col2 Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text) Buffers: shared hit=19 read=237 Planning Time: 0.303 ms Execution Time: 4.934 ms (6 rows) test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------- Index Scan using trgm_idx_test_col2 on public.test_trgmidx (cost=0.29..8.30 rows=1 width=28) (actual time=2.096..2.298 rows=1 loops=1) Output: col1, col2 Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text) Buffers: shared hit=232 Planning Time: 0.072 ms Execution Time: 2.317 ms (6 rows)

最初の実行ではディスクから行を取得する必要があり（読み取り= 237）、2番目の実行ではバッファキャッシュにアクセスするだけでよい（共有ヒット= 232、読み取りなし）のは明らかです。次に、2番目の選択に対して同じことを行い、インスタンスを再起動して、Explainを2回実行します。

test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Seq Scan on public.test_trgmidx (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.039..89.906 rows=111111 loops=1) Output: col1, col2 Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text) Rows Removed by Filter: 388889 Buffers: shared read=3676 Planning Time: 0.719 ms Execution Time: 94.942 ms (7 rows) test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Seq Scan on public.test_trgmidx (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.015..61.741 rows=111111 loops=1) Output: col1, col2 Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text) Rows Removed by Filter: 388889 Buffers: shared hit=3676 Planning Time: 0.081 ms Execution Time: 65.878 ms (7 rows)

ご覧のとおり、最初の実行ではディスクから、2回目の実行ではバッファから読み取りが行われます。 OSバッファーのため、状況はさらに複雑になります。 OSキャッシュをクリーンアップして、さまざまな結果を返しながら再実行することは可能ですか？

# free total used free shared buff/cache available Mem: 7914604 929920 4105056 93960 2879628 6748994 Swap: 4063228 0 4063228 # echo 3 > /proc/sys/vm/drop_caches # free total used free shared buff/cache available Mem: 7914604 802204 6846392 93960 266008 6951156 Swap: 4063228 0 4063228 #

列buff/cacheが2879628から266008にドロップダウンするのを確認します。次に、説明を再度実行します（選択に2回）。

postgres=# \c test You are now connected to database "test" as user "postgres". test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh'; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------- Index Scan using trgm_idx_test_col2 on public.test_trgmidx (cost=0.29..8.30 rows=1 width=28) (actual time=130.858..140.403 rows=1 loops=1) Output: col1, col2 Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text) Buffers: shared hit=19 read=237 Planning Time: 38.448 ms Execution Time: 140.466 ms (6 rows) test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like 'abcd345678efgh'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------- Index Scan using trgm_idx_test_col2 on public.test_trgmidx (cost=0.29..8.30 rows=1 width=28) (actual time=4.386..4.759 rows=1 loops=1) Output: col1, col2 Index Cond: ((test_trgmidx.col2)::text ~~ 'abcd345678efgh'::text) Buffers: shared hit=232 Planning Time: 0.115 ms Execution Time: 4.787 ms (6 rows) test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%'; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Seq Scan on public.test_trgmidx (cost=0.00..9926.00 rows=106061 width=28) (actual time=9.214..161.243 rows=111111 loops=1) Output: col1, col2 Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text) Rows Removed by Filter: 388889 Buffers: shared hit=1 read=3675 Planning Time: 0.090 ms Execution Time: 165.354 ms (7 rows) test=# explain (analyze, buffers, verbose) select * from test_trgmidx where col2 like '%d2%'; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Seq Scan on public.test_trgmidx (cost=0.00..9926.00 rows=106061 width=28) (actual time=0.013..62.043 rows=111111 loops=1) Output: col1, col2 Filter: ((test_trgmidx.col2)::text ~~ '%d2%'::text) Rows Removed by Filter: 388889 Buffers: shared hit=3676 Planning Time: 0.078 ms Execution Time: 66.101 ms (7 rows)

統計情報の違いを確認できます。最初にディスクから読み取り、次にバッファキャッシュから読み取ります。これらのすべての行は、私の意見では、私の経験では、1つの説明をどのように行ったかに関するすべての情報なしに、環境で何が起こっているのかを明確に理解することはほとんど不可能だと言っています。そして、すべての情報を持っている場合でも、すべての変数が役割を果たすため、答えを見つけるのが非常に難しい場合があります。私の2セント

Erwin Brandstetter · Answer

それはすべて依存します...

まあ、いくつかのことが確かです：

提示された使用例では、トライグラムGINインデックスが最高の読み取りパフォーマンスを提供するはずです。 Gist（遅い）でもtext_pattern_opsでもない（左アンカー/先行パターンにのみ適用可能）。したがって、このインデックスに集中してください。
```
CREATE INDEX name_gin_trgm_idx ON residentfiles USING gin (name gin_trgm_ops); 
```
などを参照してください。
- LIKEクエリパフォーマンスバリエーション
Postgres 9.5は古くなっています。執筆時点での現在のメジャーバージョンはPostgres 11です。ビッグデータ全般、およびGIN（およびGist）インデックス以降、majorの改善が行われています特に-特にトライグラムGINインデックスの場合。早急にアップグレードを検討してください。
- それが失敗した場合、少なくとも最新のマイナーバージョンにアップグレードします Postgresプロジェクトの推奨事項
追加されたLIMIT nはゲームチェンジャーです。 sのコメントでのみ述べましたが、それはタスクを完全に変更します。共通のパターンが与えられた場合（あなたの嫌な例 'a'のように）、Postgresは十分な一致がポップアップするまで少数のデータページをスキャンしてフィルタリングするという正当な理由で、順次スキャンに切り替えます。
- 念のため：ORDER BYを上に追加すると、状況は再び変わります。
ハードウェアとサーバーの構成は、重要です。しかし、あなたはほとんど何も開示しません...
サイズが重要です。ベーステーブルを再設計することで、数パーセントを得ることができます。見る：
- PostgreSQLのスペースの計算と保存

Arkhena · Answer

Postgresの「like」クエリを高速化するために、特別な演算子varchar_pattern_opsおよびtext_pattern_ops 使用すべきです。

このようなインデックスの作成方法を説明するドキュメントのそのページを参照してください。

ごきげんよう

Michał Zaborowski · Answer

質問をありがとう。この場合、私は pg_trgm を使用します。これがサンプルと説明のあるブログ投稿です： https://niallburkley.com/blog/index-columns-for-like-in-postgres/

要するに：

CREATE EXTENSION pg_trgm; CREATE INDEX residentfiles_parentpath_trgm_idx ON residentfiles USING gin (parentpath gin_trgm_ops); EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM residentfiles WHERE name LIKE '%a%'

一方、ここではその投稿を繰り返さないでください。意味のある応答になるはずです。そうでない場合-正しい方向に案内してください。