Efficiently select the start and end of multiple contiguous ranges in a PostgreSQL query

Question

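I have about a billion rows of data in a table with a name and an integer in the range 1 to 88. For a given name, every int is unique, and not every possible integer in the range is present, so there are gaps.

This query generates an example case: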

    -- what I have:
    SELECT *
    FROM (
        VALUES ('foo', 2), ('foo', 3), ('foo', 4), ('foo', 10), ('foo', 11), ('foo', 13),
               ('bar', 1), ('bar', 2), ('bar', 3)
    ) AS baz ("name", "int")

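I'd like to generate a lookup table with one row for each name and each sequence of contiguous integers. Each such row would contain:

name: the value of the name column
start: the first integer in the contiguous sequence
end: the last integer in the contiguous sequence
span: end - start + 1

This query generates example output for the example above: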

    -- what I need:
    SELECT *
    FROM (
        VALUES ('foo', 2, 4, 3), ('foo', 10, 11, 2), ('foo', 13, 13, 1),
               ('bar', 1, 3, 3)
    ) AS contiguous_ranges ("name", "start", "end", span)

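Because I have so many rows, the more efficient the better. That said, I only need to run this query once, so efficiency is not an absolute requirement.

Thanks in advance!

Edit:

I should add that PL/pgSQL solutions are welcome (please explain any fancy tricks; I'm still new to PL/pgSQL).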

Jack says try topanswers.xyz · Answer

How about using with recursive?

Test view:

    create view v as
    select *
    from (
        values ('foo', 2), ('foo', 3), ('foo', 4), ('foo', 10), ('foo', 11), ('foo', 13),
               ('bar', 1), ('bar', 2), ('bar', 3)
    ) as baz ("name", "int");

Query:

    with recursive t("name", "int") as (
        select "name", "int", 1 as span
        from v
        union all
        select "name", v."int", t.span+1 as span
        from v join t using ("name")
        where v."int"=t."int"+1
    )
    select "name", "start", "start"+span-1 as "end", span
    from (
        select "name", ("int"-span+1) as "start", max(span) as span
        from (
            select "name", "int", max(span) as span
            from t
            group by "name", "int"
        ) z
        group by "name", ("int"-span+1)
    ) z;

Result:

     name | start | end | span
    ------+-------+-----+------
     foo  |     2 |   4 |    3
     foo  |    13 |  13 |    1
     bar  |     1 |   3 |    3
     foo  |    10 |  11 |    2
    (4 rows)

I'd be interested to know how that performs on your billion-row table.

nate c · Answer

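You can do it with window functions. The basic idea is to use the lead and lag window functions to pull in the rows before and after the current row. From those, we can compute whether the current row is the start or the end of a sequence: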

    create temp view temp_view as
    select
        n,
        val,
        (lead <> val + 1 or lead is null) as islast,
        (lag <> val - 1 or lag is null) as isfirst,
        (lead <> val + 1 or lead is null) and (lag <> val - 1 or lag is null) as orphan
    from (
        select
            n,
            lead(val, 1) over (partition by n order by n, val),
            lag(val, 1) over (partition by n order by n, val),
            val
        from test
        order by n, val
    ) as t;

    select * from temp_view;

      n  | val | islast | isfirst | orphan
    -----+-----+--------+---------+--------
     bar |   1 | f      | t       | f
     bar |   2 | f      | f       | f
     bar |   3 | t      | f       | f
     bar |  24 | t      | t       | t
     bar |  42 | t      | t       | t
     foo |   2 | f      | t       | f
     foo |   3 | f      | f       | f
     foo |   4 | t      | f       | f
     foo |  10 | f      | t       | f
     foo |  11 | t      | f       | f
     foo |  13 | t      | t       | t
    (11 rows)

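(I used a view so the logic below is easier to follow.) Now we know whether each row is a start or an end. What remains is to collapse each start/end pair into a single row: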

    select
        n as "name",
        first,
        coalesce(last, first) as last,
        coalesce(last - first + 1, 1) as span
    from (
        select
            n,
            val as first,
            -- this will not be excellent perf. since we're calling the view
            -- for each row sequence found. Changing view into temp table
            -- will probably help with lots of values.
            (
                select min(val)
                from temp_view as last
                where islast = true
                    -- need this since isfirst=true, islast=true on an orphan sequence
                    and last.orphan = false
                    and first.val < last.val
                    and first.n = last.n
            ) as last
        from (select * from temp_view where isfirst = true) as first
    ) as t;

     name | first | last | span
    ------+-------+------+------
     bar  |     1 |    3 |    3
     bar  |    24 |   24 |    1
     bar  |    42 |   42 |    1
     foo  |     2 |    4 |    3
     foo  |    10 |   11 |    2
     foo  |    13 |   13 |    1
    (6 rows)

Looks correct to me :)

A-K · Answer

If this were SQL Server, I would add one more column named previousInt:

    SELECT *
    FROM (
        VALUES ('foo', 2, NULL), ('foo', 3, 2), ('foo', 4, 3),
               ('foo', 10, 4), ('foo', 11, 10), ('foo', 13, 11),
               ('bar', 1, NULL), ('bar', 2, 1), ('bar', 3, 2)
    ) AS baz ("name", "int", "previousInt")

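I would use a CHECK constraint to make sure that previousInt < int, an FK constraint so that (name, previousInt) refers to (name, int), and a couple more constraints to ensure watertight data integrity. With that in place, selecting the gaps is trivial: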

    SELECT NAME, PreviousInt, Int
    from YourTable
    WHERE PreviousInt < Int - 1;

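To speed it up, I might create a filtered index that includes only the gaps. This means all the gaps are precomputed, so selects are very fast, and the constraints guarantee the integrity of the precomputed data. I use solutions like this a lot; they are all over my system.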
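A minimal sketch of this design, run against an in-memory SQLite database from Python so that it is self-contained. The table name baz, the column names i and prev_i, and the index name baz_gaps are assumptions made for illustration, not part of the answer above. SQLite calls a filtered index a partial index, and the exact syntax in SQL Server or PostgreSQL differs slightly. The "couple more constraints" mentioned above are not covered here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

conn.executescript("""
CREATE TABLE baz (
    name   TEXT    NOT NULL,
    i      INTEGER NOT NULL,
    prev_i INTEGER,                 -- NULL for the first value of a name
    PRIMARY KEY (name, i),
    CHECK (prev_i < i),             -- the predecessor must be smaller
    FOREIGN KEY (name, prev_i) REFERENCES baz (name, i)
);
-- Partial ("filtered") index that holds only the rows following a gap.
CREATE INDEX baz_gaps ON baz (name, i) WHERE prev_i < i - 1;
""")

# Rows are inserted in ascending order so each predecessor already exists.
rows = [
    ("foo", 2, None), ("foo", 3, 2), ("foo", 4, 3),
    ("foo", 10, 4), ("foo", 11, 10), ("foo", 13, 11),
    ("bar", 1, None), ("bar", 2, 1), ("bar", 3, 2),
]
conn.executemany("INSERT INTO baz VALUES (?, ?, ?)", rows)

# Each result row is (name, last value before the gap, first value after it).
gaps = conn.execute(
    "SELECT name, prev_i, i FROM baz WHERE prev_i < i - 1 ORDER BY name, i"
).fetchall()
print(gaps)
```

Note that this query returns the gaps rather than the contiguous ranges; the ranges asked for in the question still have to be derived from the values on either side of each gap.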

ypercubeᵀᴹ · Answer

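Another window-function solution. I have no idea about its efficiency, so I've added the execution plan at the end (although with so few rows it is probably not worth much). If you want to play around with it: SQL-Fiddle test

Table and data: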

    CREATE TABLE baz
    ( name VARCHAR(10) NOT NULL
    , i INT NOT NULL
    , UNIQUE (name, i)
    ) ;

    INSERT INTO baz
    VALUES
        ('foo', 2), ('foo', 3), ('foo', 4), ('foo', 10), ('foo', 11), ('foo', 13),
        ('bar', 1), ('bar', 2), ('bar', 3) ;

Query:

    SELECT a.name AS name
         , a.i AS start
         , b.i AS "end"
         , b.i-a.i+1 AS span
    FROM
        ( SELECT name, i
               , ROW_NUMBER() OVER (PARTITION BY name ORDER BY i) AS rn
          FROM baz AS a
          WHERE NOT EXISTS
                ( SELECT *
                  FROM baz AS prev
                  WHERE prev.name = a.name
                    AND prev.i = a.i - 1
                )
        ) AS a
      JOIN
        ( SELECT name, i
               , ROW_NUMBER() OVER (PARTITION BY name ORDER BY i) AS rn
          FROM baz AS a
          WHERE NOT EXISTS
                ( SELECT *
                  FROM baz AS next
                  WHERE next.name = a.name
                    AND next.i = a.i + 1
                )
        ) AS b
      ON b.name = a.name
     AND b.rn = a.rn ;

Query plan:

    Merge Join  (cost=442.74..558.76 rows=18 width=46)
      Merge Cond: ((a.name)::text = (a.name)::text)
      Join Filter: ((row_number() OVER (?)) = (row_number() OVER (?)))
      ->  WindowAgg  (cost=221.37..238.33 rows=848 width=42)
            ->  Sort  (cost=221.37..223.49 rows=848 width=42)
                  Sort Key: a.name, a.i
                  ->  Merge Anti Join  (cost=157.21..180.13 rows=848 width=42)
                        Merge Cond: (((a.name)::text = (prev.name)::text) AND (((a.i - 1)) = prev.i))
                        ->  Sort  (cost=78.60..81.43 rows=1130 width=42)
                              Sort Key: a.name, ((a.i - 1))
                              ->  Seq Scan on baz a  (cost=0.00..21.30 rows=1130 width=42)
                        ->  Sort  (cost=78.60..81.43 rows=1130 width=42)
                              Sort Key: prev.name, prev.i
                              ->  Seq Scan on baz prev  (cost=0.00..21.30 rows=1130 width=42)
      ->  Materialize  (cost=221.37..248.93 rows=848 width=50)
            ->  WindowAgg  (cost=221.37..238.33 rows=848 width=42)
                  ->  Sort  (cost=221.37..223.49 rows=848 width=42)
                        Sort Key: a.name, a.i
                        ->  Merge Anti Join  (cost=157.21..180.13 rows=848 width=42)
                              Merge Cond: (((a.name)::text = (next.name)::text) AND (((a.i + 1)) = next.i))
                              ->  Sort  (cost=78.60..81.43 rows=1130 width=42)
                                    Sort Key: a.name, ((a.i + 1))
                                    ->  Seq Scan on baz a  (cost=0.00..21.30 rows=1130 width=42)
                              ->  Sort  (cost=78.60..81.43 rows=1130 width=42)
                                    Sort Key: next.name, next.i
                                    ->  Seq Scan on baz next  (cost=0.00..21.30 rows=1130 width=42)

Carlos S · Answer

You could look into the Tabibitosan method:

https://community.Oracle.com/docs/DOC-915680
http://rwijk.blogspot.com/2014/01/tabibitosan.html
https://www.xaprb.com/blog/2006/03/22/find-contiguous-ranges-with-sql/

Basically:

    SQL> create table mytable (nr)
         as
         select 1 from dual union all
         select 2 from dual union all
         select 3 from dual union all
         select 6 from dual union all
         select 7 from dual union all
         select 11 from dual union all
         select 18 from dual union all
         select 19 from dual union all
         select 20 from dual union all
         select 21 from dual union all
         select 22 from dual union all
         select 25 from dual
         /

    Table created.

    SQL> with tabibitosan as
         ( select nr
                , nr - row_number() over (order by nr) grp
           from mytable
         )
         select min(nr)
              , max(nr)
         from tabibitosan
         group by grp
         order by grp
         /

       MIN(NR)    MAX(NR)
    ---------- ----------
             1          3
             6          7
            11         11
            18         22
            25         25

    5 rows selected.

I think this one performs better:

    SQL> r
    select min(nr) as range_start
          ,max(nr) as range_end
    from (-- our previous query
          select nr
                ,rownum
                ,nr - rownum grp
          from (select nr
                from mytable
                order by 1
               )
         )
    group by grp
    order by 1

    RANGE_START  RANGE_END
    ----------- ----------
              1          3
              6          7
             11         11
             18         22
             25         25
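Applied to the data from the question, the same trick needs the row number to restart for each name. The following is a minimal sketch under stated assumptions: it runs the SQL through Python's bundled SQLite (window functions need SQLite 3.25 or later) so that it is self-contained, and the table name baz, the column i, and the output aliases are chosen here for illustration. The SQL uses only standard window-function syntax, so it should carry over to PostgreSQL with little or no change, but that has not been measured on a billion rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE baz (name TEXT NOT NULL, i INTEGER NOT NULL, UNIQUE (name, i))"
)
conn.executemany(
    "INSERT INTO baz VALUES (?, ?)",
    [("foo", 2), ("foo", 3), ("foo", 4), ("foo", 10), ("foo", 11), ("foo", 13),
     ("bar", 1), ("bar", 2), ("bar", 3)],
)

# Within one name, i minus its row number is constant across a run of
# consecutive integers, so grouping by that difference isolates each run.
QUERY = """
SELECT name,
       min(i)              AS range_start,
       max(i)              AS range_end,
       max(i) - min(i) + 1 AS span
FROM (
    SELECT name, i,
           i - row_number() OVER (PARTITION BY name ORDER BY i) AS grp
    FROM baz
) AS numbered
GROUP BY name, grp
ORDER BY name, range_start
"""

ranges = conn.execute(QUERY).fetchall()
for row in ranges:
    print(row)
```

For foo, the differences are 1, 1, 1, 6, 6, 7, which gives the three runs 2 to 4, 10 to 11, and 13 to 13.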

C Perkins · Answer

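This solution is inspired by nate c's answer, which uses window functions and the OVER clause. Interestingly, that answer falls back on subqueries with outer references. The row consolidation can instead be completed with another level of window functions. It may not look pretty, but I assume it is more efficient because it relies on the built-in logic of the window functions.

I realized from nate's solution that the initial set of rows already produces the flags needed to 1) select the start and end values of each range and 2) eliminate the extra rows in between. The query has subqueries nested two levels deep only because of the restrictions on how column aliases can be used with window functions. Logically, the results could be produced with just one nested subquery.

A few other notes: the code below is for SQLite3. The SQLite dialect closely follows PostgreSQL's, so the two are very similar and the code may even work unaltered. I added framing restrictions to the OVER clauses, since the lag() and lead() functions only need a single-row window, before and after the current row respectively (so there is no need to keep the default frame of all preceding rows). I also chose the names first and last because the word end is reserved.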

    create temp view test as
    with cte(name, int) AS (
        select * from (
            values ('foo', 2), ('foo', 3), ('foo', 4), ('foo', 10), ('foo', 11), ('foo', 13),
                   ('bar', 1), ('bar', 2), ('bar', 3)
        )
    )
    select * from cte;

    SELECT name,
           int AS first,
           endpoint AS last,
           (endpoint - int + 1) AS span
    FROM (
        SELECT name,
               int,
               CASE
                   WHEN prev <> 1 AND next <> -1 -- orphan
                       THEN int
                   WHEN next = -1 -- start of range
                       THEN lead(int) OVER (PARTITION BY name ORDER BY int
                                            ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING)
                   ELSE null
               END AS endpoint
        FROM (
            SELECT name,
                   int,
                   coalesce(int - lag(int) OVER (PARTITION BY name ORDER BY int
                                                 ROWS BETWEEN 1 PRECEDING AND CURRENT ROW),
                            0) AS prev,
                   coalesce(int - lead(int) OVER (PARTITION BY name ORDER BY int
                                                  ROWS BETWEEN CURRENT ROW AND 1 FOLLOWING),
                            0) AS next
            FROM test
        ) AS mark_boundaries
        WHERE NOT (prev = 1 AND next = -1) -- discard values within range
    ) as raw_ranges
    WHERE endpoint IS NOT null
    ORDER BY name, first

The result is exactly the same as in the other answers:

     name | first | last | span
    ------+-------+------+------
     bar  |     1 |    3 |    3
     foo  |     2 |    4 |    3
     foo  |    10 |   11 |    2
     foo  |    13 |   13 |    1

user unknown · Answer

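A rough plan:

1. Select the minimum for each name (group by name).
2. Select minimum2 for each name, where min2 > min1 and not exists (subquery: SEL min2-1).
3. Sel max val1 > min val1 where max val1 < min val2.

Repeat from 2 until no more updates happen. From there it gets complicated, Gordian, with grouping over the max of the mins and the min of the maxes. I guess I would go for a programming language.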
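If the work does move into a programming language, a single pass over the rows sorted by (name, int) is enough. The following is a minimal sketch in Python, not the plan above translated literally: it assumes the rows arrive already ordered (for example from a query with ORDER BY name, int), and the function name contiguous_ranges is made up for illustration.

```python
from typing import Iterable, Iterator, Optional, Tuple

Row = Tuple[str, int]
Range = Tuple[str, int, int, int]  # (name, start, end, span)


def contiguous_ranges(rows: Iterable[Row]) -> Iterator[Range]:
    """Yield one (name, start, end, span) per run of consecutive integers.

    Assumes rows are sorted by (name, value) and values are unique per name.
    """
    current: Optional[Tuple[str, int, int]] = None  # (name, start, end) so far
    for name, value in rows:
        if current and current[0] == name and value == current[2] + 1:
            current = (name, current[1], value)  # extend the current run
        else:
            if current:
                yield (*current, current[2] - current[1] + 1)
            current = (name, value, value)  # start a new run
    if current:
        yield (*current, current[2] - current[1] + 1)


rows = sorted([
    ("foo", 2), ("foo", 3), ("foo", 4), ("foo", 10), ("foo", 11), ("foo", 13),
    ("bar", 1), ("bar", 2), ("bar", 3),
])
result = list(contiguous_ranges(rows))
for r in result:
    print(r)
```

Because the generator keeps only the current run in memory, it can consume a streamed, sorted result set without loading the whole table.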

PS: A nice sample table with a few sample values would be welcome, one that everybody can use, so that not everyone has to create their test data from scratch.