In this blog we will learn about validation metrics. One of the most useful is the ROC curve.
By the end of this blog you will know why accuracy breaks down on imbalanced data, how to read a confusion matrix, and how the ROC curve and AUC help you pick a threshold and compare models.
Suppose we have a model that predicts whether a patient has cancer or not. The label is 1 (cancer) or 0 (benign). Our model is a binary probabilistic classifier: instead of a hard label, it outputs a probability that the patient belongs to the positive class.
If you are a doctor, you want this model to be as accurate as possible. Cancer is serious. So choosing the right validation metric matters a lot.
Why Accuracy Is Not Enough
The obvious first choice is accuracy: the number of correctly classified samples divided by total samples.
Now suppose 70% of your 1000 test samples are cancer positive. A naive model that predicts 1 for everyone gets 70% accuracy. If the split was 90/10, it gets 90%. That is a broken metric on skewed data.
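A quick sketch of that failure mode, using made-up numbers (not the dataset introduced below): a do-nothing classifier on a 90/10 split.

```python
# hypothetical skewed test set: 900 positive and 100 negative samples (90/10 split)
y_true_skewed = [1] * 900 + [0] * 100

# a "model" that predicts positive for everyone, no matter what
y_naive = [1] * len(y_true_skewed)

accuracy = sum(yt == yp for yt, yp in zip(y_true_skewed, y_naive)) / len(y_true_skewed)
print(accuracy)  # 0.9 -> looks impressive, but the model has learned nothing
```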
True Positive, True Negative, False Positive, False Negative
Four values that actually tell you what is going on:

- True Positive (TP): the patient has cancer and the model says cancer.
- True Negative (TN): the patient is healthy and the model says healthy.
- False Positive (FP): the patient is healthy but the model says cancer.
- False Negative (FN): the patient has cancer but the model says healthy.
False negatives are the dangerous ones here. Telling a cancer patient they are fine is a serious mistake. We want FN as low as possible. A few FPs (sending healthy patients for more tests) are more acceptable.
Sample data
y_true = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0.1, 0.3, 0.2, 0.6, 0.8, 0.05, 0.9,
          0.5, 0.3, 0.66, 0.3, 0.2, 0.85, 0.15, 0.99]

The model gives probabilities, not hard labels. We need to choose a threshold to turn probabilities into 0 or 1.
Threshold value
A threshold is the cutoff: above it we predict positive, below it we predict negative. At threshold 0.5:
threshold = 0.5
y_output = [1 if x >= threshold else 0 for x in y_pred]
# [0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]

from sklearn import metrics

accuracy = metrics.accuracy_score(y_true, y_output)
# 0.733

73% accuracy. Not bad looking. But we already know accuracy is misleading here.
Computing the four values
def true_positive(y_true, y_pred):
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)

def true_negative(y_true, y_pred):
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 0)

def false_positive(y_true, y_pred):
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)

def false_negative(y_true, y_pred):
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)

tp = true_positive(y_true, y_output)   # 4
tn = true_negative(y_true, y_output)   # 7
fp = false_positive(y_true, y_output)  # 3
fn = false_negative(y_true, y_output)  # 1

Confusion Matrix
A confusion matrix puts those four values into a 2×2 table:
|  | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | True Negative (7) | False Positive (3) |
| Actual Positive | False Negative (1) | True Positive (4) |
metrics.confusion_matrix(y_true, y_output)
# array([[7, 3],
#        [1, 4]])

TPR and FPR
From the confusion matrix we derive two rates. The first is the True Positive Rate (TPR), also called sensitivity: how good the model is at catching actual positive cases.

$\text{TPR} = \frac{TP}{TP + FN}$
The second is the False Positive Rate (FPR): how often the model incorrectly flags a negative case as positive.

$\text{FPR} = \frac{FP}{FP + TN}$
You can also compute Specificity (True Negative Rate), which is just $1 - \text{FPR}$:

$\text{TNR} = \frac{TN}{TN + FP} = 1 - \text{FPR}$
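To make these definitions concrete, here is a short sketch using the tp, tn, fp, fn counts computed above at threshold 0.5:

```python
tpr = tp / (tp + fn)   # 4 / 5  = 0.8 -> we catch 80% of the cancer patients
fpr = fp / (fp + tn)   # 3 / 10 = 0.3 -> we wrongly flag 30% of the healthy patients
tnr = 1 - fpr          # 0.7
print(tpr, fpr, tnr)
```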
ROC Curve
ROC stands for Receiver Operating Characteristic. Instead of picking one threshold and computing one (TPR, FPR) pair, we compute it for every possible threshold and plot them all.
thresholds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
tprs = []
fprs = []

for threshold in thresholds:
    y_output = [1 if x >= threshold else 0 for x in y_pred]
    tp = true_positive(y_true, y_output)
    tn = true_negative(y_true, y_output)
    fp = false_positive(y_true, y_output)
    fn = false_negative(y_true, y_output)
    tprs.append(tp / (tp + fn))
    fprs.append(fp / (tn + fp))

import pandas as pd

df = pd.DataFrame({'Threshold': thresholds, 'TPR': tprs, 'FPR': fprs})
print(df)

As we raise the threshold, the model needs more confidence before it predicts positive. TPR drops because some actual positives no longer clear the bar, and FPR drops because fewer negatives get wrongly flagged.
| Threshold | TPR | FPR |
|---|---|---|
| 0.0 | 1.0 | 1.0 |
| 0.1 | 1.0 | 0.9 |
| 0.2 | 1.0 | 0.7 |
| 0.3 | 0.8 | 0.6 |
| 0.4 | 0.8 | 0.3 |
| 0.5 | 0.8 | 0.3 |
| 0.6 ★ | 0.8 | 0.2 |
| 0.7 | 0.6 | 0.1 |
| 0.8 | 0.6 | 0.1 |
| 0.9 | 0.4 | 0.0 |
| 1.0 | 0.0 | 0.0 |
ROC curve
Plot (FPR, TPR) for every threshold. The filled area is the AUC.
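The original post shows this as a figure. A minimal sketch to reproduce it, assuming matplotlib and the tprs and fprs lists computed above:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.fill_between(fprs, tprs, alpha=0.3)   # shaded area = AUC
plt.plot(fprs, tprs, marker='o')          # one point per threshold
plt.plot([0, 1], [0, 1], linestyle='--')  # diagonal = random guessing
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
```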
Finding the best threshold
We want the point on the curve closest to the top-left corner (0, 1). That is where TPR is highest and FPR is lowest. Threshold 0.6 gives (FPR=0.2, TPR=0.8) which is the closest to the ideal point.
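One way to make "closest to (0, 1)" precise is the Euclidean distance from each (FPR, TPR) point to that corner; a sketch:

```python
import math

# distance from each (FPR, TPR) point to the ideal corner (0, 1)
distances = [math.sqrt(fpr ** 2 + (1 - tpr) ** 2) for fpr, tpr in zip(fprs, tprs)]
best = distances.index(min(distances))
print(thresholds[best])  # 0.6 on this data
```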
AUC (Area Under the Curve)
AUC is the area under the ROC curve. It summarises the model's performance across all thresholds into a single number: a perfect model scores 1.0, while a model that guesses at random sits around 0.5.
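As a sketch, we can get this number two ways: integrate our own coarse (FPR, TPR) points with metrics.auc (trapezoidal rule), or let scikit-learn sweep every threshold itself with roc_auc_score:

```python
# area under our 11-point curve from the threshold loop above
print(metrics.auc(fprs, tprs))                # about 0.83 on this data

# computed directly from the raw probabilities
print(metrics.roc_auc_score(y_true, y_pred))  # 0.83
```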
Comparing two models
AUC is also useful for comparing models. Here we have two sets of predictions for the same ground truth: y_pred from above is model 1, and y_pred1 below is model 2.
y_pred1 = [0.6, 0.4, 0.1, 0.3, 0.9, 0.15, 0.95,
           0.7, 0.4, 0.5, 0.6, 0.4, 0.95, 0.10, 0.80]

model1_auc = metrics.roc_auc_score(y_true, y_pred)
model2_auc = metrics.roc_auc_score(y_true, y_pred1)

print("AUC model1:", model1_auc)  # 0.83
print("AUC model2:", model2_auc)  # 0.77

Model 1 has AUC 0.83 vs model 2 at 0.77. Higher AUC means the model is better at ranking positive samples above negative ones across all thresholds. Model 1 wins.
To learn about precision and recall, read this blog.
Questions? me@arshad-kazi.com