When Accuracy Fails
Imagine a hospital develops a screening test for a rare disease that affects 1% of the population. A radiologist's assistant writes a classifier that simply predicts "no disease" for every patient. On a test set of 10,000 patients (100 have the disease, 9,900 don't), this classifier achieves 99% accuracy.
Is this a good model? Of course not. It fails every patient who actually needs help. Yet accuracy — the standard metric — gives it a nearly perfect score.
Accuracy fails whenever class frequencies are unequal or error types have different costs. The confusion matrix is the tool that exposes what accuracy hides.
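To see this concretely, here is a minimal sketch of that scenario (the 1%/99% split is from the example above; `accuracy_score` and `recall_score` are standard scikit-learn metrics):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 10,000 patients: 100 with the disease (1%), 9,900 without
y_true = np.array([1] * 100 + [0] * 9900)
y_pred = np.zeros_like(y_true)  # the "always predict no disease" classifier

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks excellent
print(recall_score(y_true, y_pred))    # 0.0  -- finds none of the 100 sick patients
```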
The Confusion Matrix
For a binary classifier, every prediction falls into one of four cells:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | TP (True Positive) | FN (False Negative) |
| Actually Negative | FP (False Positive) | TN (True Negative) |
Definitions with the medical test analogy (positive = has disease):
- TP: patient has disease, test says disease ✓
- FP: patient is healthy, test says disease ✗ (false alarm)
- TN: patient is healthy, test says healthy ✓
- FN: patient has disease, test says healthy ✗ (missed case)
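Before reaching for a library, it helps to see that these four cells are just counts. A minimal sketch in plain Python (the `confusion_cells` helper is ours, purely for illustration):

```python
def confusion_cells(y_true, y_pred):
    """Count the four cells for binary labels (1 = positive, 0 = negative)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

print(confusion_cells([1, 0, 1, 0], [1, 1, 0, 0]))  # (1, 1, 1, 1)
```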
Precision: "When I Raise an Alarm, Am I Right?"
Precision measures the quality of positive predictions:

Precision = TP / (TP + FP)

- TP: true positives
- FP: false positives, cases incorrectly flagged as positive
High precision means: when the model says "positive," you can trust it. Low precision means the model raises many false alarms.
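As a sketch, the formula translates directly to code (the `precision` function name is ours):

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): of everything flagged positive, how much was right."""
    return tp / (tp + fp) if (tp + fp) else 0.0

print(precision(tp=9, fp=1))   # 0.9  -- 9 of 10 alarms were real
print(precision(tp=9, fp=91))  # 0.09 -- mostly false alarms
```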
Recall: "Of All Real Positives, How Many Did I Find?"
Recall (also called sensitivity or true positive rate) measures coverage:

Recall = TP / (TP + FN)

- TP: true positives
- FN: false negatives, positive cases the model missed
High recall means: the model catches most real positives. Low recall means it misses many.
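The same direct translation works for recall (again, the helper name is ours):

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN): of all real positives, how many were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

print(recall(tp=9, fn=1))   # 0.9  -- caught 9 of 10 real positives
print(recall(tp=9, fn=91))  # 0.09 -- missed almost everything
```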
F1: Harmonic Mean of Precision and Recall
If you want a single number that reflects both, the F1 score uses the harmonic mean:

F1 = 2 × precision × recall / (precision + recall)

The harmonic mean penalizes imbalance: a model with precision = 0.9 and recall = 0.1 gets F1 = 2(0.9)(0.1)/(0.9 + 0.1) = 0.18, far below the arithmetic mean of 0.5. This is intentional. A model that gets one thing right and completely fails the other is not useful.
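You can verify the 0.18 figure, and the gap to the arithmetic mean, with a few lines (plain arithmetic, nothing assumed beyond the formula above):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(0.9, 0.1))     # 0.18 -- the harmonic mean punishes the imbalance
print((0.9 + 0.1) / 2)  # 0.5  -- the arithmetic mean would hide it
```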
Worked Numerical Example
A fraud detection system processes 1,000 transactions: 50 are fraudulent, 950 are legitimate.
| | Predicted Fraud | Predicted Legit |
|---|---|---|
| Actually Fraud | TP = 40 | FN = 10 |
| Actually Legit | FP = 5 | TN = 945 |
- Accuracy = (40 + 945) / 1000 = 98.5% (sounds great)
- Precision = 40 / (40 + 5) = 88.9% (of fraud alerts, 89% were real)
- Recall = 40 / (40 + 10) = 80.0% (20% of fraud went undetected)
- F1 = 2(0.889)(0.8)/(0.889+0.8) = 84.2%
Accuracy of 98.5% masked the fact that 1 in 5 fraudulent transactions slipped through.
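To reproduce these numbers, plug the four counts from the table straight into the formulas (a quick sanity-check script; the counts are from the example above):

```python
tp, fp, fn, tn = 40, 5, 10, 945

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy=0.985 precision=0.889 recall=0.800 f1=0.842
```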
When to Optimize Precision vs. Recall
The right tradeoff depends on the cost of each error type:

- Optimize for recall when a missed positive is expensive: disease screening, fraud detection, safety-critical alerts. A false alarm costs a follow-up check; a missed case can be catastrophic.
- Optimize for precision when a false alarm is expensive: spam filtering (a real email lost to the spam folder), automated takedowns, alerting pipelines where noise breeds alert fatigue.

In practice you rarely swap models to pick a side; you move the decision threshold, as sketched below.
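Most classifiers emit a score, and raising the cutoff trades recall for precision. The sketch below uses a synthetic dataset and logistic regression purely as stand-ins (and, to stay short, scores the training set itself); the pattern applies to any classifier that outputs probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic imbalanced data: roughly 10% positives
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression().fit(X, y)
scores = model.predict_proba(X)[:, 1]  # probability of the positive class

# Sweep the decision threshold: higher cutoff -> fewer, more confident alarms
for threshold in (0.3, 0.5, 0.7):
    y_hat = (scores >= threshold).astype(int)
    p = precision_score(y, y_hat, zero_division=0)
    r = recall_score(y, y_hat)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```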
Multi-Class Extension
For a K-class problem, the confusion matrix is K×K. Cell (i, j) counts examples of true class i predicted as class j. The diagonal holds all correct predictions.
- Macro-averaging: compute precision/recall per class, then average. This treats all classes equally regardless of size.
- Weighted averaging: average per-class precision/recall weighted by class frequency. This is the better summary when classes vary in size.
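A small sketch of the difference (the toy labels are made up; `average=` is the scikit-learn parameter that selects the mode):

```python
from sklearn.metrics import precision_score

# 3-class toy labels: class 2 is rare (support of 1)
y_true = [0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 1, 0, 1, 1, 0, 2]

# Macro weighs the rare class equally; weighted discounts it by frequency
print(precision_score(y_true, y_pred, average="macro"))     # ~0.806
print(precision_score(y_true, y_pred, average="weighted"))  # 0.75
```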
Code: Confusion Matrix in Python
```python
from sklearn.metrics import confusion_matrix, classification_report

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Confusion matrix: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [2 4]]  -> TN=3, FP=1, FN=2, TP=4

# Full report: precision, recall, F1 per class
print(classification_report(y_true, y_pred))
```
`classification_report` is your go-to tool during model development. Always report it alongside accuracy for any classification problem with class imbalance.