Evaluation & Model Assessment
Lesson 6 ⏱ 12 min

Calibration: does probability mean probability?

Video coming soon

Does Your Model's 70% Actually Mean 70%?

Demonstrates calibration curves on a weather prediction model and shows how Platt scaling and temperature scaling fix overconfident neural networks.

⏱ ~7 min

🧮 Quick refresher

Regularization and overfitting

Regularization prevents overfitting by penalizing model complexity. Techniques like L2 weight decay and dropout act as implicit regularizers, keeping model predictions from becoming overconfident on training noise.

Example

A student who memorizes the textbook can ace the practice exam but freeze on a new question — regularization forces the student to learn principles, not answers.

The 70% That Wasn't 70%

In 2012, weather forecasters in the US were remarkably well calibrated: when they said "70% chance of rain," it rained about 70% of the time. Machine learning models, by contrast, are often terribly calibrated — their numeric confidence scores don't correspond to real-world frequencies.

This matters enormously in any high-stakes domain. A medical diagnosis system that outputs "92% probability of malignant tumor" needs that 92% to be real — not an artifact of the model architecture or training procedure. Calibration is the bridge between a model's internal scores and trustworthy probabilities.

What Calibration Means

A calibrated model satisfies:

$$P(Y = 1 \mid \hat{p} = p) = p \quad \forall\, p \in [0, 1]$$

where:

  • $\hat{p}$: predicted probability
  • $Y$: true outcome (0 or 1)

In plain English: among all samples where the model predicts probability p, the fraction that are actually positive should equal p.

The Reliability Diagram

A reliability diagram (also called a calibration curve) is the standard visualization; a code sketch of the construction follows the list below:

  1. Divide predictions into equal-width bins by predicted probability.
  2. For each bin, compute the mean predicted probability (conf) and the fraction of true positives (acc).
  3. Plot (conf, acc) per bin. A perfectly calibrated model lies on the diagonal y = x.
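
Here is a minimal NumPy sketch of those three steps; the function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def reliability_curve(y_true, y_prob, n_bins=10):
    """Per-bin mean confidence, empirical accuracy, and sample weight."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to an equal-width bin (index 0 .. n_bins-1)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    conf, acc, weight = [], [], []
    for m in range(n_bins):
        mask = bin_ids == m
        if not mask.any():
            continue  # skip empty bins
        conf.append(y_prob[mask].mean())   # mean predicted probability
        acc.append(y_true[mask].mean())    # fraction of true positives
        weight.append(mask.mean())         # |B_m| / n
    return np.array(conf), np.array(acc), np.array(weight)
```

Plotting acc against conf and comparing with the diagonal gives the reliability diagram.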

Typical patterns:

  • Inverse-sigmoid curve (below the diagonal at high confidence): the model is overconfident — it predicts extreme probabilities (0.05 or 0.95) more often than reality warrants. Common in neural networks and Naive Bayes.
  • Sigmoid curve (above the diagonal at high confidence): the model is underconfident — it hedges too much. Common in maximum-margin methods such as SVMs and boosted trees.

Expected Calibration Error

$$\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|$$

where:

  • $M$: number of bins
  • $B_m$: set of samples in bin m
  • $n$: total number of samples
  • $\text{acc}(B_m)$: fraction of positives in bin m
  • $\text{conf}(B_m)$: mean predicted probability in bin m

The expected calibration error (ECE) summarizes the reliability diagram in a single number. A perfectly calibrated model has ECE = 0.

Example: Model with 10 bins, 1000 test samples.

| Bin | Conf (avg) | Acc (fraction) | \|Acc − Conf\| | Samples | Weight |
|-----|------------|----------------|----------------|---------|--------|
| 0.1 | 0.05       | 0.04           | 0.01           | 80      | 0.08   |
| 0.3 | 0.28       | 0.35           | 0.07           | 120     | 0.12   |
| 0.7 | 0.72       | 0.58           | 0.14           | 200     | 0.20   |
| 0.9 | 0.91       | 0.71           | 0.20           | 300     | 0.30   |
| …   | …          | …              | …              | …       | …      |

ECE ≈ 0.08×0.01 + 0.12×0.07 + 0.20×0.14 + 0.30×0.20 + … — the listed bins alone contribute ≈ 0.097, and the high-confidence bins, holding many samples, dominate the sum.
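
Continuing the sketch above, ECE is just the weighted average of the per-bin gaps (reusing the illustrative reliability_curve helper):

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    conf, acc, weight = reliability_curve(y_true, y_prob, n_bins)
    # Weighted average of per-bin |accuracy - confidence| gaps
    return float(np.sum(weight * np.abs(acc - conf)))
```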

Platt Scaling

Platt scaling was originally proposed for SVMs, which produce uncalibrated margin scores rather than probabilities.

Procedure:

  1. Hold out a calibration set (not used in training).
  2. Collect the model's raw output scores on the calibration set.
  3. Fit a logistic regression on the held-out scores: $\hat{p} = \sigma(A \cdot f(x) + B)$, where $f(x)$ is the raw score and A and B are learned parameters.
  4. At inference, apply this logistic transformation after the base model.

Platt scaling is fast and works well when the calibration set is reasonably sized (hundreds of samples minimum).
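
A minimal sketch using scikit-learn's LogisticRegression, which learns A and B from (score, label) pairs; the calibration data below is synthetic and all variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in calibration set: raw model scores and true labels, held out from training
scores_cal = rng.normal(size=500)
y_cal = (scores_cal + rng.normal(scale=2.0, size=500) > 0).astype(int)

# Platt scaling: fit p = sigma(A * f(x) + B) on the calibration scores
platt = LogisticRegression()
platt.fit(scores_cal.reshape(-1, 1), y_cal)

# At inference, map a raw score to a calibrated probability
p_hat = platt.predict_proba(np.array([[1.3]]))[:, 1]
```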

Temperature Scaling

For deep neural networks, a simpler and often more effective approach is temperature scaling:

$$\hat{p}_k = \frac{\exp(z_k / T)}{\sum_j \exp(z_j / T)}$$

where:

  • $z_k$: logit for class k
  • $T$: temperature ($T > 1$ softens, $T < 1$ sharpens)
  • $\hat{p}_k$: calibrated probability for class k

Training T: minimize the negative log-likelihood on the calibration set with a simple 1D optimization (e.g., BFGS). Only one parameter to learn — no risk of overfitting the calibration set.
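
As a sketch of that 1D fit, assuming calibration-set logits and integer labels as NumPy arrays (SciPy's bounded scalar minimizer stands in for BFGS here):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(T, logits, labels):
    # Negative log-likelihood of the temperature-scaled softmax
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    # One-parameter search for T on the held-out calibration set
    result = minimize_scalar(nll, bounds=(0.05, 10.0),
                             args=(logits, labels), method="bounded")
    return result.x
```

Since dividing every logit by the same T preserves their ordering, the argmax class is unchanged, which is why temperature scaling costs no accuracy.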

Why Calibration Matters

| Domain | Calibration failure consequence |
|--------|---------------------------------|
| Medical diagnosis | "95% benign" leads to skipped biopsy → missed cancer |
| Weather forecasting | Emergency decisions based on overstated certainty |
| Credit risk | Miscalibrated default probabilities → wrong loan pricing |
| RLHF reward models | Overconfident rewards mislead policy gradient updates |
| Autonomous driving | Overconfident obstacle detection → dangerous behavior |

Interactive example

Move a temperature slider to adjust model confidence and watch the reliability diagram and ECE update.

Coming soon

Summary

  • Calibration means predicted probabilities match observed frequencies.
  • Reliability diagrams plot mean confidence vs. actual fraction positive — perfect calibration = diagonal line.
  • ECE summarizes calibration error into one weighted number.
  • Platt scaling fits logistic regression on raw scores; good for SVMs and simple models.
  • Temperature scaling divides neural network logits by T; fast, one-parameter, no accuracy loss.
  • Calibration is crucial whenever model outputs drive real-world decisions — not just ranking.

Quiz

1 / 3

A weather model outputs 0.9 probability of rain for 1000 days. It actually rains on only 600 of those days. This model is: