Evaluation & Model Assessment
Lesson 6 ⏱ 12 min

Calibration: does probability mean probability?

Video coming soon

Does Your Model's 70% Actually Mean 70%?

Demonstrates calibration curves on a weather prediction model and shows how Platt scaling and temperature scaling fix overconfident neural networks.

⏱ ~7 min

🧮 Quick refresher

Regularization and overfitting

Regularization prevents overfitting by penalizing model complexity. Techniques like L2 weight decay and dropout act as implicit regularizers, keeping model predictions from becoming overconfident on training noise.

Example

A student who memorizes the textbook can ace the practice exam but freeze on a new question — regularization forces the student to learn principles, not answers.

The 70% That Wasn't 70%

In 2012, weather forecasters in the US were remarkably well calibrated: when they said "70% chance of rain," it rained about 70% of the time. Machine learning models, by contrast, are often terribly calibrated — their numeric confidence scores don't correspond to real-world frequencies.

This matters enormously in any high-stakes domain. A medical diagnosis system that outputs "92% probability of malignant tumor" needs that 92% to be real — not an artifact of the model architecture or training procedure. Calibration is the bridge between a model's internal scores and trustworthy probabilities.

What Calibration Means

A calibrated model satisfies:

$$P(Y = 1 \mid \hat{p} = p) = p \quad \forall\, p \in [0, 1]$$

where:

  • $\hat{p}$: predicted probability
  • $Y$: true outcome (0 or 1)

In plain English: among all samples where the model predicts probability p, the fraction that are actually positive should equal p.

The Reliability Diagram

A reliability diagram (also called a calibration curve) is the standard visualization; a code sketch of the construction follows the list below:

  1. Divide predictions into equal-width bins by predicted probability.
  2. For each bin, compute the mean predicted probability (conf) and the fraction of true positives (acc).
  3. Plot (conf, acc) per bin. A perfectly calibrated model lies on the diagonal y = x.
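
Here is a minimal NumPy sketch of those three steps; the function and variable names are illustrative, not from any particular library:

```python
import numpy as np

def reliability_curve(y_true, y_prob, n_bins=10):
    """Per-bin mean confidence, empirical accuracy, and sample weight."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to an equal-width bin (index 0 .. n_bins-1)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    conf, acc, weight = [], [], []
    for m in range(n_bins):
        mask = bin_ids == m
        if not mask.any():
            continue  # skip empty bins
        conf.append(y_prob[mask].mean())   # mean predicted probability
        acc.append(y_true[mask].mean())    # fraction of true positives
        weight.append(mask.mean())         # |B_m| / n
    return np.array(conf), np.array(acc), np.array(weight)
```

Plotting acc against conf and comparing with the diagonal gives the reliability diagram.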

Typical patterns:

  • Inverse-sigmoid curve (below the diagonal at high confidence): the model is overconfident — it predicts extreme probabilities (0.05 or 0.95) more often than reality warrants. Common in neural networks and Naive Bayes.
  • Sigmoid curve (above the diagonal at high confidence): the model is underconfident — it hedges too much. Common in maximum-margin methods such as SVMs and boosted trees.

Expected Calibration Error

$$\text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right|$$

where:

  • $M$: number of bins
  • $B_m$: set of samples in bin m
  • $n$: total number of samples
  • $\text{acc}(B_m)$: fraction of positives in bin m
  • $\text{conf}(B_m)$: mean predicted probability in bin m

The expected calibration error (ECE) summarizes the reliability diagram in a single number. A perfectly calibrated model has ECE = 0.

Example: Model with 10 bins, 1000 test samples.

| Bin | Conf (avg) | Acc (fraction) | \|Acc − Conf\| | Samples | Weight |
|-----|------------|----------------|----------------|---------|--------|
| 0.1 | 0.05       | 0.04           | 0.01           | 80      | 0.08   |
| 0.3 | 0.28       | 0.35           | 0.07           | 120     | 0.12   |
| 0.7 | 0.72       | 0.58           | 0.14           | 200     | 0.20   |
| 0.9 | 0.91       | 0.71           | 0.20           | 300     | 0.30   |
| …   | …          | …              | …              | …       | …      |

ECE ≈ 0.08×0.01 + 0.12×0.07 + 0.20×0.14 + 0.30×0.20 + … — the listed bins alone contribute ≈ 0.097, and the high-confidence bins, holding many samples, dominate the sum.
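
Continuing the sketch above, ECE is just the weighted average of the per-bin gaps (reusing the illustrative reliability_curve helper):

```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    conf, acc, weight = reliability_curve(y_true, y_prob, n_bins)
    # Weighted average of per-bin |accuracy - confidence| gaps
    return float(np.sum(weight * np.abs(acc - conf)))
```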

Platt Scaling

Platt scaling was originally proposed for SVMs, which produce uncalibrated margin scores rather than probabilities.

Procedure:

  1. Hold out a calibration set (not used in training).
  2. Collect the model's raw output scores on the calibration set.
  3. Fit a logistic regression on the held-out scores: $\hat{p} = \sigma(A \cdot f(x) + B)$, where $f(x)$ is the raw score and A and B are learned parameters.
  4. At inference, apply this logistic transformation after the base model.

Platt scaling is fast and works well when the calibration set is reasonably sized (hundreds of samples minimum).
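
A minimal sketch using scikit-learn's LogisticRegression, which learns A and B from (score, label) pairs; the calibration data below is synthetic and all variable names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in calibration set: raw model scores and true labels, held out from training
scores_cal = rng.normal(size=500)
y_cal = (scores_cal + rng.normal(scale=2.0, size=500) > 0).astype(int)

# Platt scaling: fit p = sigma(A * f(x) + B) on the calibration scores
platt = LogisticRegression()
platt.fit(scores_cal.reshape(-1, 1), y_cal)

# At inference, map a raw score to a calibrated probability
p_hat = platt.predict_proba(np.array([[1.3]]))[:, 1]
```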

Temperature Scaling

For deep neural networks, a simpler and often more effective approach is temperature scaling:

$$\hat{p}_k = \frac{\exp(z_k / T)}{\sum_j \exp(z_j / T)}$$

where:

  • $z_k$: logit for class k
  • $T$: temperature ($T > 1$ softens, $T < 1$ sharpens)
  • $\hat{p}_k$: calibrated probability for class k

Training T: minimize the negative log-likelihood on the calibration set with a simple 1D optimization (e.g., BFGS). Only one parameter to learn — no risk of overfitting the calibration set.
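
As a sketch of that 1D fit, assuming calibration-set logits and integer labels as NumPy arrays (SciPy's bounded scalar minimizer stands in for BFGS here):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def nll(T, logits, labels):
    # Negative log-likelihood of the temperature-scaled softmax
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels):
    # One-parameter search for T on the held-out calibration set
    result = minimize_scalar(nll, bounds=(0.05, 10.0),
                             args=(logits, labels), method="bounded")
    return result.x
```

Since dividing every logit by the same T preserves their ordering, the argmax class is unchanged, which is why temperature scaling costs no accuracy.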

Why Calibration Matters

| Domain | Calibration failure consequence |
|--------|---------------------------------|
| Medical diagnosis | "95% benign" leads to skipped biopsy → missed cancer |
| Weather forecasting | Emergency decisions based on overstated certainty |
| Credit risk | Miscalibrated default probabilities → wrong loan pricing |
| RLHF reward models | Overconfident rewards mislead policy gradient updates |
| Autonomous driving | Overconfident obstacle detection → dangerous behavior |

Interactive example

Move a temperature slider to adjust model confidence and watch the reliability diagram and ECE update.

Coming soon

Summary

  • Calibration means predicted probabilities match observed frequencies.
  • Reliability diagrams plot mean confidence vs. actual fraction positive — perfect calibration = diagonal line.
  • ECE summarizes calibration error into one weighted number.
  • Platt scaling fits logistic regression on raw scores; good for SVMs and simple models.
  • Temperature scaling divides neural network logits by T; fast, one-parameter, no accuracy loss.
  • Calibration is crucial whenever model outputs drive real-world decisions — not just ranking.

Quiz

1 / 3

A weather model outputs 0.9 probability of rain for 1000 days. It actually rains on only 600 of those days. This model is: