In this blog we will learn about validation metrics. One of the most useful is the ROC curve.
By the end of this blog you will know why accuracy breaks down on imbalanced data, how to read a confusion matrix, and how the ROC curve and AUC help you pick a threshold and compare models.
Suppose we have a model that predicts whether a patient has cancer or not. The label is 1 (cancer) or 0 (benign). Our model is a binary probabilistic classifier: instead of a hard label, it outputs a probability that the patient belongs to the positive class.
If you are a doctor, you want this model to be as accurate as possible. Cancer is serious. So choosing the right validation metric matters a lot.
Why Accuracy Is Not Enough
The obvious first choice is accuracy: the number of correctly classified samples divided by total samples.
Now suppose 70% of your 1000 test samples are cancer positive. A naive model that predicts 1 for everyone gets 70% accuracy. If the split was 90/10, it gets 90%. That is a broken metric on skewed data.
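A quick sketch of that failure mode, using made-up numbers (not the dataset introduced below): a do-nothing classifier on a 90/10 split.

```python
# hypothetical skewed test set: 900 positive and 100 negative samples (90/10 split)
y_true_skewed = [1] * 900 + [0] * 100

# a "model" that predicts positive for everyone, no matter what
y_naive = [1] * len(y_true_skewed)

accuracy = sum(yt == yp for yt, yp in zip(y_true_skewed, y_naive)) / len(y_true_skewed)
print(accuracy)  # 0.9 -> looks impressive, but the model has learned nothing
```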
True Positive, True Negative, False Positive, False Negative
Four values that actually tell you what is going on:

- True Positive (TP): the patient has cancer and the model says cancer.
- True Negative (TN): the patient is healthy and the model says healthy.
- False Positive (FP): the patient is healthy but the model says cancer.
- False Negative (FN): the patient has cancer but the model says healthy.
False negatives are the dangerous ones here. Telling a cancer patient they are fine is a serious mistake. We want FN as low as possible. A few FPs (sending healthy patients for more tests) are more acceptable.
Sample data
y_true = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1]
y_pred = [0.1, 0.3, 0.2, 0.6, 0.8, 0.05, 0.9,
          0.5, 0.3, 0.66, 0.3, 0.2, 0.85, 0.15, 0.99]

The model gives probabilities, not hard labels. We need to choose a threshold to turn probabilities into 0 or 1.
Threshold value
A threshold is the cutoff: above it we predict positive, below it we predict negative. At threshold 0.5:
threshold = 0.5
y_output = [1 if x >= threshold else 0 for x in y_pred]
# [0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1]

from sklearn import metrics

accuracy = metrics.accuracy_score(y_true, y_output)
# 0.733

73% accuracy. Not bad looking. But we already know accuracy is misleading here.
Computing the four values
def true_positive(y_true, y_pred):
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)

def true_negative(y_true, y_pred):
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 0)

def false_positive(y_true, y_pred):
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)

def false_negative(y_true, y_pred):
    return sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)

tp = true_positive(y_true, y_output)   # 4
tn = true_negative(y_true, y_output)   # 7
fp = false_positive(y_true, y_output)  # 3
fn = false_negative(y_true, y_output)  # 1

Confusion Matrix
A confusion matrix puts those four values into a 2×2 table:
|  | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | True Negative (7) | False Positive (3) |
| Actual Positive | False Negative (1) | True Positive (4) |
metrics.confusion_matrix(y_true, y_output)
# array([[7, 3],
#        [1, 4]])

TPR and FPR
From the confusion matrix we derive two rates. The first is the True Positive Rate (TPR), also called sensitivity: how good the model is at catching actual positive cases.

$\text{TPR} = \frac{TP}{TP + FN}$
The second is the False Positive Rate (FPR): how often the model incorrectly flags a negative case as positive.

$\text{FPR} = \frac{FP}{FP + TN}$
You can also compute Specificity (True Negative Rate), which is just $1 - \text{FPR}$:

$\text{TNR} = \frac{TN}{TN + FP} = 1 - \text{FPR}$
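To make these definitions concrete, here is a short sketch using the tp, tn, fp, fn counts computed above at threshold 0.5:

```python
tpr = tp / (tp + fn)   # 4 / 5  = 0.8 -> we catch 80% of the cancer patients
fpr = fp / (fp + tn)   # 3 / 10 = 0.3 -> we wrongly flag 30% of the healthy patients
tnr = 1 - fpr          # 0.7
print(tpr, fpr, tnr)
```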
ROC Curve
ROC stands for Receiver Operating Characteristic. Instead of picking one threshold and computing one (TPR, FPR) pair, we compute it for every possible threshold and plot them all.
thresholds = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
tprs = []
fprs = []

for threshold in thresholds:
    y_output = [1 if x >= threshold else 0 for x in y_pred]
    tp = true_positive(y_true, y_output)
    tn = true_negative(y_true, y_output)
    fp = false_positive(y_true, y_output)
    fn = false_negative(y_true, y_output)
    tprs.append(tp / (tp + fn))
    fprs.append(fp / (tn + fp))

import pandas as pd

df = pd.DataFrame({'Threshold': thresholds, 'TPR': tprs, 'FPR': fprs})
print(df)

As we raise the threshold, the model needs more confidence before it predicts positive. TPR drops because some actual positives no longer clear the bar, and FPR drops because fewer negatives get wrongly flagged.
| Threshold | TPR | FPR |
|---|---|---|
| 0.0 | 1.0 | 1.0 |
| 0.1 | 1.0 | 0.9 |
| 0.2 | 1.0 | 0.7 |
| 0.3 | 0.8 | 0.6 |
| 0.4 | 0.8 | 0.3 |
| 0.5 | 0.8 | 0.3 |
| 0.6 ★ | 0.8 | 0.2 |
| 0.7 | 0.6 | 0.1 |
| 0.8 | 0.6 | 0.1 |
| 0.9 | 0.4 | 0.0 |
| 1.0 | 0.0 | 0.0 |
ROC curve
Plot (FPR, TPR) for every threshold. The filled area is the AUC.
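The original post shows this as a figure. A minimal sketch to reproduce it, assuming matplotlib and the tprs and fprs lists computed above:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 6))
plt.fill_between(fprs, tprs, alpha=0.3)   # shaded area = AUC
plt.plot(fprs, tprs, marker='o')          # one point per threshold
plt.plot([0, 1], [0, 1], linestyle='--')  # diagonal = random guessing
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()
```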
Finding the best threshold
We want the point on the curve closest to the top-left corner (0, 1). That is where TPR is highest and FPR is lowest. Threshold 0.6 gives (FPR=0.2, TPR=0.8) which is the closest to the ideal point.
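One way to make "closest to (0, 1)" precise is the Euclidean distance from each (FPR, TPR) point to that corner; a sketch:

```python
import math

# distance from each (FPR, TPR) point to the ideal corner (0, 1)
distances = [math.sqrt(fpr ** 2 + (1 - tpr) ** 2) for fpr, tpr in zip(fprs, tprs)]
best = distances.index(min(distances))
print(thresholds[best])  # 0.6 on this data
```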
AUC (Area Under the Curve)
AUC is the area under the ROC curve. It summarises the model's performance across all thresholds into a single number: a perfect model scores 1.0, while a model that guesses at random sits around 0.5.
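As a sketch, we can get this number two ways: integrate our own coarse (FPR, TPR) points with metrics.auc (trapezoidal rule), or let scikit-learn sweep every threshold itself with roc_auc_score:

```python
# area under our 11-point curve from the threshold loop above
print(metrics.auc(fprs, tprs))                # about 0.83 on this data

# computed directly from the raw probabilities
print(metrics.roc_auc_score(y_true, y_pred))  # 0.83
```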
Comparing two models
AUC is also useful for comparing models. Here we have two sets of predictions for the same ground truth: y_pred from above is model 1, and y_pred1 below is model 2.
y_pred1 = [0.6, 0.4, 0.1, 0.3, 0.9, 0.15, 0.95,
           0.7, 0.4, 0.5, 0.6, 0.4, 0.95, 0.10, 0.80]

model1_auc = metrics.roc_auc_score(y_true, y_pred)
model2_auc = metrics.roc_auc_score(y_true, y_pred1)

print("AUC model1:", model1_auc)  # 0.83
print("AUC model2:", model2_auc)  # 0.77

Model 1 has AUC 0.83 vs model 2 at 0.77. Higher AUC means the model is better at ranking positive samples above negative ones across all thresholds. Model 1 wins.
To learn about precision and recall, read this blog.
Questions? me@arshad-kazi.com