I need to efficiently compute the median of a sequence of numbers in Google BigQuery. Is this possible?
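Yes, it is possible with the PERCENTILE_CONT window function.
It returns a value based on linear interpolation between the values of the group, after ordering them per the ORDER BY clause.
The percentile argument must be between 0 and 1.
This window function requires ORDER BY in the OVER clause.
So an example query looks like this (max() is there only so the query works with GROUP BY across the whole group; it is not used as math logic, so don't let it confuse you):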
#legacySQL
SELECT room,
       -- every row in a room carries the same median, so MAX only collapses them
       MAX(median)
FROM (
  SELECT room,
         PERCENTILE_CONT(0.5) OVER (PARTITION BY room ORDER BY temperature) AS median
  FROM
    -- in legacy SQL, comma-separated subqueries are combined like UNION ALL
    (SELECT 1 AS room, 11 AS temperature),
    (SELECT 1 AS room, 12 AS temperature),
    (SELECT 1 AS room, 14 AS temperature),
    (SELECT 1 AS room, 19 AS temperature),
    (SELECT 1 AS room, 13 AS temperature),
    (SELECT 2 AS room, 20 AS temperature),
    (SELECT 2 AS room, 21 AS temperature),
    (SELECT 2 AS room, 29 AS temperature),
    (SELECT 3 AS room, 30 AS temperature)
)
GROUP BY room
This returns:
+------+-------------+
| room | temperature |
+------+-------------+
|    1 |          13 |
|    2 |          21 |
|    3 |          30 |
+------+-------------+
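To check the table above by hand, here is a minimal Python sketch of the linear-interpolation rule described for PERCENTILE_CONT. The function name `percentile_cont` and the `rooms` dictionary are illustrative only. The sketch assumes the usual definition, where the percentile sits at position `fraction * (n - 1)` in the sorted values; it is not BigQuery's implementation.

```python
import math

def percentile_cont(values, fraction):
    """Linearly interpolated percentile of a non-empty list (fraction in [0, 1])."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    # Fractional position of the requested percentile in the sorted list.
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    # Interpolate between the two neighbouring values.
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight

# Same sample data as the query above, grouped by room.
rooms = {
    1: [11, 12, 14, 19, 13],
    2: [20, 21, 29],
    3: [30],
}
medians = {room: percentile_cont(temps, 0.5) for room, temps in rooms.items()}
print(medians)
```

Every room here has an odd number of readings, so the median is the middle value and no interpolation actually happens. With an even count such as [20, 21, 29, 30], the same function returns 25.0, halfway between the two middle values.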
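An alternative, if you don't need absolutely accurate results and an approximation is fine: you can combine the NTH and QUANTILES aggregate functions. The advantage of this method is that it is much more scalable than the analytic window function; the downside is that it gives approximate results.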
#legacySQL
SELECT room,
       -- 101 quantiles span 0%..100%; NTH is 1-based, so position 51 is the 50% point
       NTH(51, QUANTILES(temperature, 101))
FROM
  (SELECT 1 AS room, 11 AS temperature),
  (SELECT 1 AS room, 12 AS temperature),
  (SELECT 1 AS room, 14 AS temperature),
  (SELECT 1 AS room, 19 AS temperature),
  (SELECT 1 AS room, 13 AS temperature),
  (SELECT 2 AS room, 20 AS temperature),
  (SELECT 2 AS room, 21 AS temperature),
  (SELECT 2 AS room, 29 AS temperature),
  (SELECT 3 AS room, 30 AS temperature)
GROUP BY room
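This returns:
+------+-------------+
| room | temperature |
+------+-------------+
|    1 |          13 |
|    2 |          21 |
|    3 |          30 |
+------+-------------+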
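It is easy to be off by one when picking the median out of the quantile list, so here is a rough Python sketch of the indexing only. The helpers `quantile_boundaries` and `nth` are hypothetical names. The boundaries are computed exactly with a nearest-rank rule, which is an assumption made for illustration; BigQuery's QUANTILES is approximate and its internal algorithm is not reproduced here.

```python
def quantile_boundaries(values, buckets):
    """Return `buckets` boundaries from minimum to maximum (exact, nearest rank)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    # Boundary i sits at fraction i / (buckets - 1) of the sorted data.
    return [ordered[round(i * last / (buckets - 1))] for i in range(buckets)]

def nth(position, items):
    """1-based lookup, mirroring how NTH counts positions."""
    return items[position - 1]

temperatures = [11, 12, 14, 19, 13]  # room 1 from the example above
boundaries = quantile_boundaries(temperatures, 101)

# Position 1 is the minimum, position 101 the maximum, position 51 the median.
print(nth(1, boundaries), nth(51, boundaries), nth(101, boundaries))
```

With 101 boundaries, the entries correspond to 0%, 1%, ..., 100%, so the 50% point is the 51st entry when counting from 1.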
2018 update, with more metrics:
BigQuery SQL: Average, geometric mean, remove outliers, median
For my own memory purposes, here are working queries with taxi data.
Approximate quantiles:
SELECT MONTH(pickup_datetime) month, NTH(51, QUANTILES(tip_amount,101)) median
FROM [nyc-tlc:green.trips_2015]
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1
This gives the same results as PERCENTILE_DISC:
SELECT month, FIRST(median) median
FROM (
SELECT MONTH(pickup_datetime) month, tip_amount, PERCENTILE_DISC(0.5) OVER(PARTITION BY month ORDER BY tip_amount) median
FROM [nyc-tlc:green.trips_2015]
WHERE tip_amount > 0
)
GROUP BY 1
ORDER BY 1
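PERCENTILE_DISC differs from PERCENTILE_CONT in that it always returns a value that actually occurs in the data. The following Python sketch contrasts the two under the standard SQL definitions (discrete: the first value whose cumulative share of rows reaches the requested fraction; continuous: linear interpolation). The function names and the `tips` list are made up for illustration and are not taken from the taxi dataset.

```python
import math

def percentile_disc(values, fraction):
    """Smallest actual value whose cumulative share of rows is >= fraction."""
    ordered = sorted(values)
    # 1-based rank of the first row whose cumulative distribution reaches fraction.
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]

def percentile_cont(values, fraction):
    """Linearly interpolated percentile at position fraction * (n - 1)."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

tips = [1.0, 2.0, 3.0, 10.0]  # hypothetical tip amounts, even count

print(percentile_disc(tips, 0.5))  # an existing value from the list
print(percentile_cont(tips, 0.5))  # interpolated between the two middle values
```

With an odd number of rows both functions return the same middle value, so the difference only shows up for even counts or percentiles that fall between two rows.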
Standard SQL:
#StandardSQL
SELECT DATE_TRUNC(DATE(pickup_datetime), MONTH) month, APPROX_QUANTILES(tip_amount,1000)[OFFSET(500)] median
FROM `nyc-tlc.green.trips_2015`
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1