The 70% That Wasn't 70%
In 2012, weather forecasters in the US were remarkably well calibrated: when they said "70% chance of rain," it rained about 70% of the time. Machine learning models, by contrast, are often terribly calibrated — their numeric confidence scores don't correspond to real-world frequencies.
This matters enormously in any high-stakes domain. A medical diagnosis system that outputs "92% probability of malignant tumor" needs that 92% to be real — not an artifact of the model architecture or training procedure. Calibration is the bridge between a model's internal scores and trustworthy probabilities.
What Calibration Means
A perfectly calibrated model satisfies:

$$P(Y = 1 \mid \hat{p} = p) = p \quad \text{for all } p \in [0, 1]$$

where:
- $\hat{p}$: predicted probability
- $Y$: true outcome (0 or 1)
In plain English: among all samples where the model predicts probability p, the fraction that are actually positive should equal p.
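As a quick sanity check of the definition, here is a minimal sketch on synthetic data (all names illustrative): outcomes are drawn so the "model" is perfectly calibrated by construction, so the observed positive rate near any predicted probability should match that probability.

```python
import numpy as np

# Synthetic sanity check: draw outcomes so that the "model"
# is perfectly calibrated by construction.
rng = np.random.default_rng(0)
p_hat = rng.uniform(0, 1, size=100_000)        # predicted probabilities
y = rng.uniform(0, 1, size=100_000) < p_hat    # true outcomes (0 or 1)

# Among samples predicted near p = 0.7, the positive rate should be ~0.7.
near_07 = np.abs(p_hat - 0.7) < 0.05
print(y[near_07].mean())   # prints approximately 0.70
```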
The Reliability Diagram
A reliability diagram (calibration curve) is the standard visualization:
- Divide predictions into equal-width bins by predicted probability.
- For each bin $B_m$: compute the mean predicted probability (conf) and the fraction of true positives (acc).
- Plot (conf, acc). A perfectly calibrated model lies on the diagonal y = x.
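A minimal implementation of those three steps, assuming NumPy arrays `p_hat` (predicted probabilities) and `y` (binary labels):

```python
import numpy as np

def reliability_diagram(p_hat, y, n_bins=10):
    """Per-bin mean confidence and empirical accuracy for equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Step 1: assign each prediction to an equal-width bin.
    idx = np.clip(np.digitize(p_hat, edges) - 1, 0, n_bins - 1)
    conf, acc = [], []
    for m in range(n_bins):
        in_bin = idx == m
        if in_bin.any():
            conf.append(p_hat[in_bin].mean())  # step 2: mean predicted probability
            acc.append(y[in_bin].mean())       # step 2: fraction of true positives
    # Step 3: plot acc against conf; y = x is perfect calibration.
    return np.array(conf), np.array(acc)
```

scikit-learn ships an equivalent helper, `sklearn.calibration.calibration_curve`, if you prefer not to roll your own.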
Typical patterns:
- Sigmoid curve (below diagonal at high confidence): the model is overconfident — it predicts extreme probabilities (0.05 or 0.95) more often than reality warrants. Common in neural networks.
- Inverse sigmoid (above diagonal at high confidence): the model is underconfident — it hedges too much. Common in Naive Bayes.
Expected Calibration Error
The Expected Calibration Error (ECE) summarizes the reliability diagram into one number:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \,\bigl|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\bigr|$$

where:
- $M$: number of bins
- $B_m$: set of samples in bin m
- $n$: total samples
- $\mathrm{acc}(B_m)$: fraction of positives in bin m
- $\mathrm{conf}(B_m)$: mean predicted probability in bin m

A perfectly calibrated model has ECE = 0.
Example: Model with 10 bins, 1000 test samples.
| Bin | Conf (avg) | Acc (fraction) | \|Acc − Conf\| | Samples | Weight |
|-----|-----------|----------------|------------|---------|--------|
| 0.1 | 0.05 | 0.04 | 0.01 | 80 | 0.08 |
| 0.3 | 0.28 | 0.35 | 0.07 | 120 | 0.12 |
| 0.7 | 0.72 | 0.58 | 0.14 | 200 | 0.20 |
| 0.9 | 0.91 | 0.71 | 0.20 | 300 | 0.30 |
| … | … | … | … | … | … |
ECE ≈ 0.08×0.01 + 0.12×0.07 + 0.20×0.14 + 0.30×0.20 + … ≈ 0.097 from these four bins alone; the high-confidence bins, which hold many samples, dominate the sum.
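The same binning yields ECE directly. A sketch reusing the conventions above (equal-width bins, arrays `p_hat` and `y`):

```python
import numpy as np

def expected_calibration_error(p_hat, y, n_bins=10):
    """Bin-weighted average of |acc - conf| (the ECE formula above)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(p_hat, edges) - 1, 0, n_bins - 1)
    n = len(p_hat)
    ece = 0.0
    for m in range(n_bins):
        in_bin = idx == m
        if in_bin.any():
            weight = in_bin.sum() / n                            # |B_m| / n
            gap = abs(y[in_bin].mean() - p_hat[in_bin].mean())   # |acc - conf|
            ece += weight * gap
    return ece
```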
Platt Scaling
Platt scaling was originally proposed for SVMs, which produce uncalibrated margin scores.
Procedure:
- Hold out a calibration set (not used in training).
- Collect the model's raw output scores on the calibration set.
- Fit logistic regression: $\hat{p}(x) = \sigma(A \cdot f(x) + B)$, where $f(x)$ is the model's raw score and A and B are learned parameters.
- At inference, apply this logistic transformation after the base model.
Platt scaling is fast and works well when the calibration set is reasonably sized (hundreds of samples minimum).
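A sketch with scikit-learn's `LogisticRegression` standing in for the A, B fit; `scores_cal` and `y_cal` are hypothetical arrays holding the base model's raw scores and true labels on the held-out calibration set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores_cal, y_cal):
    """Fit sigma(A * f(x) + B) on held-out scores; coef_ is A, intercept_ is B."""
    lr = LogisticRegression()
    lr.fit(np.asarray(scores_cal).reshape(-1, 1), y_cal)
    return lr

def platt_predict(lr, scores):
    """Apply the learned logistic transformation to new raw scores."""
    return lr.predict_proba(np.asarray(scores).reshape(-1, 1))[:, 1]
```

One caveat: scikit-learn regularizes by default (C=1.0), whereas Platt's original formulation is unregularized; set `C` to a large value if you want the closer match.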
Temperature Scaling
For deep neural networks, a simpler and often more effective approach is temperature scaling:

$$\hat{p}_k = \frac{\exp(z_k / T)}{\sum_j \exp(z_j / T)}$$

where:
- $z_k$: logit for class k
- $T$: temperature (T > 1 softens, T < 1 sharpens)
- $\hat{p}_k$: calibrated probability for class k
Training T: minimize the negative log-likelihood (NLL) on the calibration set with a simple 1D optimization (e.g., BFGS). Only one parameter is learned, so there is no risk of overfitting to the calibration set.
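A sketch of that 1D fit, assuming a `logits` array of shape (n_samples, n_classes) and integer `labels` from the calibration set; it uses SciPy's bounded scalar minimizer rather than BFGS, which serves the same purpose here:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import logsumexp

def nll(T, logits, labels):
    """Negative log-likelihood of the calibration set at temperature T."""
    z = logits / T
    log_probs = z - logsumexp(z, axis=1, keepdims=True)   # log-softmax
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    """1D search for the T that minimizes NLL on the calibration set."""
    result = minimize_scalar(nll, bounds=(0.05, 10.0), args=(logits, labels),
                             method="bounded")
    return result.x
```

Because dividing every logit by the same T never changes the argmax, accuracy is untouched; only the confidence of the predictions moves.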
Why Calibration Matters
| Domain | Calibration failure consequence |
|---|---|
| Medical diagnosis | "95% benign" leads to skipped biopsy → missed cancer |
| Weather forecasting | Emergency decisions based on overstated certainty |
| Credit risk | Miscalibrated default probabilities → wrong loan pricing |
| RLHF reward models | Overconfident rewards mislead policy gradient updates |
| Autonomous driving | Overconfident obstacle detection → dangerous behavior |
Summary
- Calibration means predicted probabilities match observed frequencies.
- Reliability diagrams plot mean confidence vs. actual fraction positive — perfect calibration = diagonal line.
- ECE summarizes calibration error into one weighted number.
- Platt scaling fits logistic regression on raw scores; good for SVMs and simple models.
- Temperature scaling divides neural network logits by T; fast, one-parameter, no accuracy loss.
- Calibration is crucial whenever model outputs drive real-world decisions — not just ranking.