scikit-learn-信頼区間のあるROC曲線

Question

_scikit-learn_とfpr、tpr、thresholds = metrics.roc_curve(y_true,y_pred, pos_label=1)を使用してROC曲線を取得できます。ここで、_y_true_はに基づく値のリストです。私のゴールドスタンダード（つまり、ネガティブの場合は_0_、ポジティブの場合は_1_）、_y_pred_は対応するスコアのリストです（例：_0.053497243_、_0.008521122_、 _0.022781548_、_0.101885263_、_0.012913795_、_0.0_、_0.042881547_ [...]）

その曲線に信頼区間を追加する方法を見つけようとしていますが、sklearnでそれを行う簡単な方法が見つかりませんでした。

ogrisel · Accepted Answer

bootstrap roc計算（元のy_true/y_predから新しいバージョンのy_true/y_predを置き換えてサンプルを作成し、再計算できます）毎回roc_curveの新しい値）、この方法で信頼区間を推定します。

トレインテスト分割によって引き起こされる変動を考慮に入れるために、 ShuffleSplit CVイテレータを何度も使用し、モデルをトレイン分割に適合させ、モデルごとにy_predを生成して、 roc_curvesの経験的分布も収集し、最後にそれらの信頼区間を計算します。

編集：Pythonでのブートストラップ

これは、単一モデルの予測からROCAUCスコアをブートストラップするための例です。スタックオーバーフローの回答としてわかりやすくするために、ROC AUCをブートストラップすることを選択しましたが、代わりにbootstrap曲線全体に適合させることができます：

import numpy as np from scipy.stats import sem from sklearn.metrics import roc_auc_score y_pred = np.array([0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04]) y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0 ]) print("Original ROC area: {:0.3f}".format(roc_auc_score(y_true, y_pred))) n_bootstraps = 1000 rng_seed = 42 # control reproducibility bootstrapped_scores = [] rng = np.random.RandomState(rng_seed) for i in range(n_bootstraps): # bootstrap by sampling with replacement on the prediction indices indices = rng.randint(0, len(y_pred), len(y_pred)) if len(np.unique(y_true[indices])) < 2: # We need at least one positive and one negative sample for ROC AUC # to be defined: reject the sample continue score = roc_auc_score(y_true[indices], y_pred[indices]) bootstrapped_scores.append(score) print("Bootstrap #{} ROC area: {:0.3f}".format(i + 1, score))

いくつかの無効なリサンプルを拒否する必要があることがわかります。ただし、多くの予測がある実際のデータでは、これは非常にまれなイベントであり、信頼区間に大きな影響を与えることはありません（rng_seedを変更して確認できます）。

ヒストグラムは次のとおりです。

import matplotlib.pyplot as plt plt.hist(bootstrapped_scores, bins=50) plt.title('Histogram of the bootstrapped ROC AUC scores') plt.show()

Histogram of the bootstrapped ROC AUC scores

再サンプリングされたスコアは[0-1]の範囲で打ち切られ、最後のビンに多数のスコアが発生することに注意してください。

信頼区間を取得するには、サンプルを並べ替えます。

sorted_scores = np.array(bootstrapped_scores) sorted_scores.sort() # Computing the lower and upper bound of the 90% confidence interval # You can change the bounds percentiles to 0.025 and 0.975 to get # a 95% confidence interval instead. confidence_lower = sorted_scores[int(0.05 * len(sorted_scores))] confidence_upper = sorted_scores[int(0.95 * len(sorted_scores))] print("Confidence interval for the score: [{:0.3f} - {:0.3}]".format( confidence_lower, confidence_upper))

これは：

Confidence interval for the score: [0.444 - 1.0]

信頼区間は非常に広いですが、これはおそらく私の予測の選択（9つの予測のうち3つの間違い）と予測の総数が非常に少ないことの結果です。

プロットに関するもう1つの注意点：スコアは量子化されています（多くの空のヒストグラムビン）。これは、予測の数が少ないためです。スコア（またはy_pred値）にガウスノイズを少し導入して、分布を滑らかにし、ヒストグラムの見栄えを良くすることができます。ただし、平滑化帯域幅の選択には注意が必要です。

最後に、前述のように、この信頼区間はトレーニングセットに固有です。モデルクラスとパラメーターによって誘発されるROCの変動性をより正確に推定するには、代わりに反復交差検定を実行する必要があります。ただし、ランダムなトレイン/テスト分割ごとに新しいモデルをトレーニングする必要があるため、これは多くの場合、はるかにコストがかかります。

Raul · Answer

DeLongソリューション[ブートストラップなし]

ここでいくつか示唆されているように、pROCアプローチは素晴らしいでしょう。 pROC documentation によると、信頼区間はDeLongを介して計算されます。

DeLongは、AUCの不確実性を評価するための漸近的に正確な方法です（DeLong et al。（1988））。バージョン1.9以降、pROCはSun and Xu（2014）によって提案されたアルゴリズムを使用します。このアルゴリズムは、O（N log N）の複雑さを持ち、常にブートストラップよりも高速です。デフォルトでは、pROCは可能な限りDeLongメソッドを選択します。

Yandex Data Schoolは、パブリックリポジトリにFastDeLongを実装しています。

https://github.com/yandexdataschool/roc_comparison

したがって、この例で使用されているDeLongの実装に対するすべてのクレジット。したがって、DeLongを介してCIを取得する方法は次のとおりです。

#!/usr/bin/env python3 # -*- coding: utf-8 -*- """ Created on Tue Nov 6 10:06:52 2018 @author: yandexdataschool Original Code found in: https://github.com/yandexdataschool/roc_comparison updated: Raul Sanchez-Vazquez """ import numpy as np import scipy.stats from scipy import stats # AUC comparison adapted from # https://github.com/Netflix/vmaf/ def compute_midrank(x): """Computes midranks. Args: x - a 1D numpy array Returns: array of midranks """ J = np.argsort(x) Z = x[J] N = len(x) T = np.zeros(N, dtype=np.float) i = 0 while i < N: j = i while j < N and Z[j] == Z[i]: j += 1 T[i:j] = 0.5*(i + j - 1) i = j T2 = np.empty(N, dtype=np.float) # Note(kazeevn) +1 is due to Python using 0-based indexing # instead of 1-based in the AUC formula in the paper T2[J] = T + 1 return T2 def compute_midrank_weight(x, sample_weight): """Computes midranks. Args: x - a 1D numpy array Returns: array of midranks """ J = np.argsort(x) Z = x[J] cumulative_weight = np.cumsum(sample_weight[J]) N = len(x) T = np.zeros(N, dtype=np.float) i = 0 while i < N: j = i while j < N and Z[j] == Z[i]: j += 1 T[i:j] = cumulative_weight[i:j].mean() i = j T2 = np.empty(N, dtype=np.float) T2[J] = T return T2 def fastDeLong(predictions_sorted_transposed, label_1_count, sample_weight): if sample_weight is None: return fastDeLong_no_weights(predictions_sorted_transposed, label_1_count) else: return fastDeLong_weights(predictions_sorted_transposed, label_1_count, sample_weight) def fastDeLong_weights(predictions_sorted_transposed, label_1_count, sample_weight): """ The fast version of DeLong's method for computing the covariance of unadjusted AUC. Args: predictions_sorted_transposed: a 2D numpy.array[n_classifiers, n_examples] sorted such as the examples with label "1" are first Returns: (AUC value, DeLong covariance) Reference: @article{Sun2014fast, title={Fast Implementation of DeLong's Algorithm for Comparing the Areas Under Correlated Receiver Oerating Characteristic Curves}, author={Xu Sun and Weichao Xu}, journal={IEEE Signal Processing Letters}, volume={21}, number={11}, pages={1389--1393}, year={2014}, publisher={IEEE} } """ # Short variables are named as they are in the paper m = label_1_count n = predictions_sorted_transposed.shape[1] - m positive_examples = predictions_sorted_transposed[:, :m] negative_examples = predictions_sorted_transposed[:, m:] k = predictions_sorted_transposed.shape[0] tx = np.empty([k, m], dtype=np.float) ty = np.empty([k, n], dtype=np.float) tz = np.empty([k, m + n], dtype=np.float) for r in range(k): tx[r, :] = compute_midrank_weight(positive_examples[r, :], sample_weight[:m]) ty[r, :] = compute_midrank_weight(negative_examples[r, :], sample_weight[m:]) tz[r, :] = compute_midrank_weight(predictions_sorted_transposed[r, :], sample_weight) total_positive_weights = sample_weight[:m].sum() total_negative_weights = sample_weight[m:].sum() pair_weights = np.dot(sample_weight[:m, np.newaxis], sample_weight[np.newaxis, m:]) total_pair_weights = pair_weights.sum() aucs = (sample_weight[:m]*(tz[:, :m] - tx)).sum(axis=1) / total_pair_weights v01 = (tz[:, :m] - tx[:, :]) / total_negative_weights v10 = 1. - (tz[:, m:] - ty[:, :]) / total_positive_weights sx = np.cov(v01) sy = np.cov(v10) delongcov = sx / m + sy / n return aucs, delongcov def fastDeLong_no_weights(predictions_sorted_transposed, label_1_count): """ The fast version of DeLong's method for computing the covariance of unadjusted AUC. Args: predictions_sorted_transposed: a 2D numpy.array[n_classifiers, n_examples] sorted such as the examples with label "1" are first Returns: (AUC value, DeLong covariance) Reference: @article{Sun2014fast, title={Fast Implementation of DeLong's Algorithm for Comparing the Areas Under Correlated Receiver Oerating Characteristic Curves}, author={Xu Sun and Weichao Xu}, journal={IEEE Signal Processing Letters}, volume={21}, number={11}, pages={1389--1393}, year={2014}, publisher={IEEE} } """ # Short variables are named as they are in the paper m = label_1_count n = predictions_sorted_transposed.shape[1] - m positive_examples = predictions_sorted_transposed[:, :m] negative_examples = predictions_sorted_transposed[:, m:] k = predictions_sorted_transposed.shape[0] tx = np.empty([k, m], dtype=np.float) ty = np.empty([k, n], dtype=np.float) tz = np.empty([k, m + n], dtype=np.float) for r in range(k): tx[r, :] = compute_midrank(positive_examples[r, :]) ty[r, :] = compute_midrank(negative_examples[r, :]) tz[r, :] = compute_midrank(predictions_sorted_transposed[r, :]) aucs = tz[:, :m].sum(axis=1) / m / n - float(m + 1.0) / 2.0 / n v01 = (tz[:, :m] - tx[:, :]) / n v10 = 1.0 - (tz[:, m:] - ty[:, :]) / m sx = np.cov(v01) sy = np.cov(v10) delongcov = sx / m + sy / n return aucs, delongcov def calc_pvalue(aucs, sigma): """Computes log(10) of p-values. Args: aucs: 1D array of AUCs sigma: AUC DeLong covariances Returns: log10(pvalue) """ l = np.array([[1, -1]]) z = np.abs(np.diff(aucs)) / np.sqrt(np.dot(np.dot(l, sigma), l.T)) return np.log10(2) + scipy.stats.norm.logsf(z, loc=0, scale=1) / np.log(10) def compute_ground_truth_statistics(ground_truth, sample_weight): assert np.array_equal(np.unique(ground_truth), [0, 1]) order = (-ground_truth).argsort() label_1_count = int(ground_truth.sum()) if sample_weight is None: ordered_sample_weight = None else: ordered_sample_weight = sample_weight[order] return order, label_1_count, ordered_sample_weight def delong_roc_variance(ground_truth, predictions, sample_weight=None): """ Computes ROC AUC variance for a single set of predictions Args: ground_truth: np.array of 0 and 1 predictions: np.array of floats of the probability of being class 1 """ order, label_1_count, ordered_sample_weight = compute_ground_truth_statistics( ground_truth, sample_weight) predictions_sorted_transposed = predictions[np.newaxis, order] aucs, delongcov = fastDeLong(predictions_sorted_transposed, label_1_count, ordered_sample_weight) assert len(aucs) == 1, "There is a bug in the code, please forward this to the developers" return aucs[0], delongcov alpha = .95 y_pred = np.array([0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04]) y_true = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0 ]) auc, auc_cov = delong_roc_variance( y_true, y_pred) auc_std = np.sqrt(auc_cov) lower_upper_q = np.abs(np.array([0, 1]) - (1 - alpha) / 2) ci = stats.norm.ppf( lower_upper_q, loc=auc, scale=auc_std) ci[ci > 1] = 1 print('AUC:', auc) print('AUC COV:', auc_cov) print('95% AUC CI:', ci)

出力：

AUC: 0.8 AUC COV: 0.028749999999999998 95% AUC CI: [0.46767194, 1.]

また、この実装がpROCから取得したRの結果と一致することも確認しました。

library(pROC) y_true = c(0, 1, 0, 0, 1, 1, 0, 1, 0) y_pred = c(0.21, 0.32, 0.63, 0.35, 0.92, 0.79, 0.82, 0.99, 0.04) # Build a ROC object and compute the AUC roc = roc(y_true, y_pred) roc

出力：

Call: roc.default(response = y_true, predictor = y_pred) Data: y_pred in 5 controls (y_true 0) < 4 cases (y_true 1). Area under the curve: 0.8

次に

# Compute the Confidence Interval ci(roc)

出力

95% CI: 0.4677-1 (DeLong)