
Confusion matrices and class metrics


Confusion Matrices - When Accuracy Lies

The class imbalance trap, the 2×2 confusion matrix built from scratch on a medical test example, deriving precision and recall from first principles, and a decision framework for choosing which metric to optimize.


Quick refresher

Classification accuracy

Accuracy is the fraction of predictions that were correct: accuracy = (number correct) / (total examples). For a balanced dataset where each class is equally common, accuracy is a useful summary metric. It treats all errors as equally costly.

Example

A classifier makes 100 predictions: 40 are class A (all correct), 60 are class B (55 correct, 5 wrong).

Accuracy = (40+55)/100 = 95%.

This looks great — but if class B is 'cancer,' we missed 5 cancer cases.
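As a quick sketch, the same arithmetic in plain Python:

# 40 class-A predictions all correct, 55 of 60 class-B predictions correct
correct = 40 + 55
total = 100
print(correct / total)  # 0.95; the 5 errors are invisible in this single number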

When Accuracy Fails

Imagine a hospital develops a screening test for a rare disease that affects 1% of the population. A radiologist's assistant writes a classifier that simply predicts "no disease" for every patient. On a test set of 10,000 patients (100 have the disease, 9,900 don't), this classifier achieves 99% accuracy.

Is this a good model? Of course not. It fails every patient who actually needs help. Yet accuracy — the standard metric — gives it a nearly perfect score.

Accuracy fails whenever class frequencies are unequal or error types have different costs. The confusion matrix is the tool that exposes what accuracy hides.
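To see the trap in code, here is a minimal sketch with a synthetic split matching the example above (not a real screening system):

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 100 diseased patients (label 1) and 9,900 healthy (label 0), as in the example
y_true = np.array([1] * 100 + [0] * 9900)
y_pred = np.zeros_like(y_true)  # the "always predict no disease" classifier

print(accuracy_score(y_true, y_pred))  # 0.99: near-perfect accuracy
print(recall_score(y_true, y_pred))    # 0.0: it finds none of the 100 real cases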

The Confusion Matrix

For a binary classifier, every prediction falls into one of four cells:

                     Predicted Positive     Predicted Negative
Actually Positive    TP (True Positive)     FN (False Negative)
Actually Negative    FP (False Positive)    TN (True Negative)

Definitions with the medical test analogy (positive = has disease):

  • TP (true positive): patient has disease, test says disease ✓
  • FP (false positive): patient is healthy, test says disease ✗ (false alarm)
  • TN (true negative): patient is healthy, test says healthy ✓
  • FN (false negative): patient has disease, test says healthy ✗ (missed case)
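The four cells can be tallied directly from labels and predictions. A minimal sketch (count_cells is a hypothetical helper, not a library function):

def count_cells(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem (positive = has disease)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

# One example of each cell: caught case, false alarm, missed case, correct all-clear
print(count_cells([1, 0, 1, 0], [1, 1, 0, 0]))  # (1, 1, 1, 1)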

Precision: "When I Raise an Alarm, Am I Right?"

The precision measures the quality of positive predictions:

\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}

where TP = true positives and FP = false positives (cases incorrectly flagged as positive).

High precision means: when the model says "positive," you can trust it. Low precision means the model raises many false alarms.
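A one-line sketch of the formula (the function name precision is ours, purely for illustration):

def precision(tp, fp):
    # Of all positive predictions, the fraction that were actually positive
    return tp / (tp + fp)

print(precision(tp=8, fp=2))  # 0.8: 2 of every 10 alarms are false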

Recall: "Of All Real Positives, How Many Did I Find?"

The recall (also called sensitivity or true positive rate) measures coverage:

\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}

where TP = true positives and FN = false negatives (positive cases the model missed).

High recall means: the model catches most real positives. Low recall means it misses many.
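The matching sketch for recall (again, a hypothetical helper for illustration):

def recall(tp, fn):
    # Of all actual positives, the fraction the model found
    return tp / (tp + fn)

print(recall(tp=8, fn=2))  # 0.8: 2 of every 10 real positives were missed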

F1: Harmonic Mean of Precision and Recall

If you want a single number that reflects both, the F1 score uses the harmonic mean:

F_1 = \frac{2 \cdot P \cdot R}{P + R}

where P = precision and R = recall.

The harmonic mean penalizes imbalance: a model with precision = 0.9 and recall = 0.1 gets F_1 = 2(0.9)(0.1)/(0.9 + 0.1) = 0.18 — far below the arithmetic mean of 0.5. This is intentional. A model that gets one thing right and completely fails the other is not useful.
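A small sketch reproducing the penalty described above:

def f1(p, r):
    # Harmonic mean: pulled toward the smaller of the two inputs
    return 2 * p * r / (p + r)

print(f1(0.9, 0.1))     # ~0.18, far below the arithmetic mean
print((0.9 + 0.1) / 2)  # 0.5, the arithmetic mean for comparison
print(f1(0.8, 0.8))     # ~0.8, a balanced model is not penalized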

Worked Numerical Example

A fraud detection system processes 1,000 transactions: 50 are fraudulent, 950 are legitimate.

                  Predicted Fraud    Predicted Legit
Actually Fraud    TP = 40            FN = 10
Actually Legit    FP = 5             TN = 945
  • Accuracy = (40 + 945) / 1000 = 98.5% (sounds great)
  • Precision = 40 / (40 + 5) = 88.9% (of fraud alerts, 89% were real)
  • Recall = 40 / (40 + 10) = 80.0% (20% of fraud went undetected)
  • F1 = 2(0.889)(0.8)/(0.889+0.8) = 84.2%

Accuracy of 98.5% masked the fact that 1 in 5 fraudulent transactions slipped through.
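All four numbers follow directly from the cell counts. A minimal sketch of the arithmetic:

tp, fp, fn, tn = 40, 5, 10, 945  # fraud example above

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # ~0.985, ~0.889, 0.8, ~0.842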

When to Optimize Precision vs. Recall

The right tradeoff depends on the cost of each error type:

  • Optimize for recall when a false negative is the expensive mistake (a missed cancer case, undetected fraud) and following up on a false alarm is comparatively cheap.
  • Optimize for precision when a false positive is the expensive mistake (an unnecessary follow-up procedure, a blocked legitimate transaction) and a missed case is more tolerable.

Multi-Class Extension

For a K-class problem, the confusion matrix is K×K. Cell (i, j) counts examples of true class i predicted as class j. The diagonal holds all correct predictions.

Macro-averaging: compute precision/recall per class, then average (treats all classes equally regardless of size). Weighted averaging: average precision/recall weighted by class frequency (better summary when classes vary in size).
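A minimal sketch of the two averaging modes on a made-up 3-class example, using scikit-learn's precision_score:

from sklearn.metrics import precision_score

# Made-up 3-class labels; class 2 is deliberately rare
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 2]

print(precision_score(y_true, y_pred, average='macro'))     # ~0.85: each class weighted equally
print(precision_score(y_true, y_pred, average='weighted'))  # ~0.82: weighted by class frequency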

Code: Confusion Matrix in Python

from sklearn.metrics import confusion_matrix, classification_report

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [2 4]]  → TN=3, FP=1, FN=2, TP=4

# Full report: precision, recall, F1 per class
print(classification_report(y_true, y_pred))

classification_report is your go-to tool during model development. Always report it alongside accuracy for any classification problem with class imbalance.

Quiz

Question 1 of 3

A cancer test has: TP=90, FP=10, TN=880, FN=20. What is the precision?