When Accuracy Fails
Imagine a hospital develops a screening test for a rare disease that affects 1% of the population. A radiologist's assistant writes a classifier that simply predicts "no disease" for every patient. On a test set of 10,000 patients (100 have the disease, 9,900 don't), this classifier achieves 99% accuracy.
Is this a good model? Of course not. It fails every patient who actually needs help. Yet accuracy — the standard metric — gives it a nearly perfect score.
Accuracy fails whenever class frequencies are unequal or error types have different costs. The confusion matrix is the tool that exposes what accuracy hides.
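To see this concretely, here is a minimal sketch of that scenario (the 1%/99% split is from the example above; `accuracy_score` and `recall_score` are standard scikit-learn metrics):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 10,000 patients: 100 with the disease (1%), 9,900 without
y_true = np.array([1] * 100 + [0] * 9900)
y_pred = np.zeros_like(y_true)  # the "always predict no disease" classifier

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks excellent
print(recall_score(y_true, y_pred))    # 0.0  -- finds none of the 100 sick patients
```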
The Confusion Matrix
For a binary classifier, every prediction falls into one of four cells:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | TP (True Positive) | FN (False Negative) |
| Actually Negative | FP (False Positive) | TN (True Negative) |
Definitions with the medical test analogy (positive = has disease):
- TP: patient has disease, test says disease ✓
- FP: patient is healthy, test says disease ✗ (false alarm)
- TN: patient is healthy, test says healthy ✓
- FN: patient has disease, test says healthy ✗ (missed case)
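Before reaching for a library, it helps to see that these four cells are just counts. A minimal sketch in plain Python (the `confusion_cells` helper is ours, purely for illustration):

```python
def confusion_cells(y_true, y_pred):
    """Count the four cells for binary labels (1 = positive, 0 = negative)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

print(confusion_cells([1, 0, 1, 0], [1, 1, 0, 0]))  # (1, 1, 1, 1)
```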
Precision: "When I Raise an Alarm, Am I Right?"
Precision measures the quality of positive predictions:

Precision = TP / (TP + FP)

- TP: true positives
- FP: false positives, cases incorrectly flagged as positive
High precision means: when the model says "positive," you can trust it. Low precision means the model raises many false alarms.
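As a sketch, the formula translates directly to code (the `precision` function name is ours):

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): of everything flagged positive, how much was right."""
    return tp / (tp + fp) if (tp + fp) else 0.0

print(precision(tp=9, fp=1))   # 0.9  -- 9 of 10 alarms were real
print(precision(tp=9, fp=91))  # 0.09 -- mostly false alarms
```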
Recall: "Of All Real Positives, How Many Did I Find?"
Recall (also called sensitivity or true positive rate) measures coverage:

Recall = TP / (TP + FN)

- TP: true positives
- FN: false negatives, positive cases the model missed
High recall means: the model catches most real positives. Low recall means it misses many.
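The same direct translation works for recall (again, the helper name is ours):

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN): of all real positives, how many were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

print(recall(tp=9, fn=1))   # 0.9  -- caught 9 of 10 real positives
print(recall(tp=9, fn=91))  # 0.09 -- missed almost everything
```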
F1: Harmonic Mean of Precision and Recall
If you want a single number that reflects both, the F1 score uses the harmonic mean:

F1 = 2 × precision × recall / (precision + recall)

The harmonic mean penalizes imbalance: a model with precision = 0.9 and recall = 0.1 gets F1 = 2(0.9)(0.1)/(0.9 + 0.1) = 0.18, far below the arithmetic mean of 0.5. This is intentional. A model that gets one thing right and completely fails the other is not useful.
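You can verify the 0.18 figure, and the gap to the arithmetic mean, with a few lines (plain arithmetic, nothing assumed beyond the formula above):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.1))     # 0.18 -- the harmonic mean punishes the imbalance
print((0.9 + 0.1) / 2)  # 0.5  -- the arithmetic mean would hide it
```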
Worked Numerical Example
A fraud detection system processes 1,000 transactions: 50 are fraudulent, 950 are legitimate.
| | Predicted Fraud | Predicted Legit |
|---|---|---|
| Actually Fraud | TP = 40 | FN = 10 |
| Actually Legit | FP = 5 | TN = 945 |
- Accuracy = (40 + 945) / 1000 = 98.5% (sounds great)
- Precision = 40 / (40 + 5) = 88.9% (of fraud alerts, 89% were real)
- Recall = 40 / (40 + 10) = 80.0% (20% of fraud went undetected)
- F1 = 2(0.889)(0.8)/(0.889+0.8) = 84.2%
Accuracy of 98.5% masked the fact that 1 in 5 fraudulent transactions slipped through.
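To reproduce these numbers, plug the four counts from the table straight into the formulas (a quick sanity-check script; the counts are from the example above):

```python
tp, fp, fn, tn = 40, 5, 10, 945

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.985 precision=0.889 recall=0.800 f1=0.842
```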
When to Optimize Precision vs. Recall
The right tradeoff depends on the cost of each error type:

- Optimize for recall when a missed positive is expensive: disease screening, fraud detection, safety-critical alerts. A false alarm costs a follow-up check; a missed case can be catastrophic.
- Optimize for precision when a false alarm is expensive: spam filtering (a real email lost to the spam folder), automated takedowns, alerting pipelines where noise breeds alert fatigue.

In practice you rarely swap models to pick a side; you move the decision threshold, as sketched below.
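Most classifiers emit a score, and raising the cutoff trades recall for precision. The sketch below uses a synthetic dataset and logistic regression purely as stand-ins (and, to stay short, scores the training set itself); the pattern applies to any classifier that outputs probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic imbalanced data: roughly 10% positives
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression().fit(X, y)
scores = model.predict_proba(X)[:, 1]  # probability of the positive class

# Sweep the decision threshold: higher cutoff -> fewer, more confident alarms
for threshold in (0.3, 0.5, 0.7):
    y_hat = (scores >= threshold).astype(int)
    p = precision_score(y, y_hat, zero_division=0)
    r = recall_score(y, y_hat)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```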
Multi-Class Extension
For a K-class problem, the confusion matrix is K×K. Cell (i, j) counts examples of true class i predicted as class j. The diagonal holds all correct predictions.
- Macro-averaging: compute precision/recall per class, then average. This treats all classes equally regardless of size.
- Weighted averaging: average per-class precision/recall weighted by class frequency. This is the better summary when classes vary in size.
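A small sketch of the difference (the toy labels are made up; `average=` is the scikit-learn parameter that selects the mode):

```python
from sklearn.metrics import precision_score

# 3-class toy labels: class 2 is rare (support of 1)
y_true = [0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 1, 0, 1, 1, 0, 2]

# Macro weighs the rare class equally; weighted discounts it by frequency
print(precision_score(y_true, y_pred, average="macro"))     # ~0.806
print(precision_score(y_true, y_pred, average="weighted"))  # 0.75
```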
Code: Confusion Matrix in Python
```python
from sklearn.metrics import confusion_matrix, classification_report

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Confusion matrix: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [2 4]]  -> TN=3, FP=1, FN=2, TP=4

# Full report: precision, recall, F1 per class
print(classification_report(y_true, y_pred))
```
`classification_report` is your go-to tool during model development. Always report it alongside accuracy for any classification problem with class imbalance.