Why Single Models Are High-Variance
Train the same neural network architecture twice on the same data with different random seeds. You'll get two models with similar average performance — but different individual predictions on specific examples. Neither model is perfect, and their mistakes are somewhat different.
This is prediction variance: the predictions depend on which particular random seed, data shuffle, and minibatch ordering the training run happened to use.
The key observation: if errors are uncorrelated, averaging cancels them out.
Imagine five friends each independently guess how many jelly beans are in a jar. Each person's guess is off by a different amount in a different direction. If you average their guesses, the random errors partially cancel. The average is usually closer to the truth than most individual guesses — not because any one person is smarter, but because the errors don't all point in the same direction.
The Variance Reduction Proof
Let $\hat{y}_1, \dots, \hat{y}_N$ be N model predictions, each with variance $\sigma^2$ and zero bias (centered on the true value). If their errors are uncorrelated, the variance of the average is

$$\mathrm{Var}(\bar{y}) = \mathrm{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N}\hat{y}_i\right) = \frac{N\sigma^2}{N^2} = \frac{\sigma^2}{N}$$

where:
- $\bar{y}$: the ensemble prediction (average of N models)
- $N$: the number of ensemble members
With N=5 models, variance is reduced 5×. With N=10, reduced 10×. The law of large numbers is working in our favor: averaging independent estimates converges to the true value.
The critical assumption: independence (uncorrelated errors). If all models fail identically, the average fails too. The diversity strategy determines how well this assumption holds.
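As a sanity check, here's a short NumPy simulation (an illustrative sketch, not part of the lesson's code) showing both the 1/N reduction and how correlated errors erode it:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value, sigma, N, trials = 10.0, 2.0, 5, 100_000

# Independent errors: variance of the ensemble mean drops by a factor of N
indep = rng.normal(true_value, sigma, size=(trials, N))
print(indep.mean(axis=1).var())   # ~ sigma**2 / N = 0.8

# Correlated errors (pairwise correlation rho): averaging helps far less
rho = 0.9
shared = rng.normal(0.0, sigma, size=(trials, 1))
noise = rng.normal(0.0, sigma, size=(trials, N))
corr = true_value + np.sqrt(rho) * shared + np.sqrt(1 - rho) * noise
print(corr.mean(axis=1).var())    # ~ rho*sigma**2 + (1-rho)*sigma**2/N = 3.68
```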
Strategy 1: Bagging (Bootstrap Aggregating)
Train N models, each on a different bootstrap sample: a random sample of the same size as the training set, drawn with replacement (sketched in the code below).
Why it creates diversity: Each bootstrap sample excludes about 36.8% of the original training examples on average, because the probability that a given example is never drawn in $n$ draws is $(1 - 1/n)^n \approx e^{-1} \approx 0.368$. Each model sees slightly different data → slightly different learned parameters → somewhat different errors.
Random forests are the canonical bagging example: many decision trees, each trained on a bootstrap sample with random feature subsets.
Neural network bagging is effective but expensive: N× training time and N× inference time. Worthwhile for high-stakes applications where accuracy matters more than speed.
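Here's a minimal sketch of the bootstrap-sampling step in PyTorch (the training loop itself is omitted):

```python
import torch
from torch.utils.data import DataLoader, Subset

def bootstrap_loaders(dataset, n_models, batch_size=64):
    """Build one DataLoader per ensemble member, each over a bootstrap sample."""
    loaders = []
    for _ in range(n_models):
        # Draw len(dataset) indices with replacement; duplicates are intentional,
        # and ~36.8% of examples are left out of each sample on average
        idx = torch.randint(len(dataset), (len(dataset),)).tolist()
        loaders.append(DataLoader(Subset(dataset, idx),
                                  batch_size=batch_size, shuffle=True))
    return loaders
```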
Strategy 2: Model Averaging
Train N models with:
- Different random initialization seeds
- Different architectures (vary depth, width, activations)
- Different hyperparameters (learning rate, dropout rate, weight decay)
Then average predictions at inference. No bootstrapping needed.
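A sketch of the training side, assuming two hypothetical helpers: make_model() builds a fresh (possibly varied) network and train(model) fits it in place:

```python
import torch

def train_ensemble(make_model, train, n_models=5):
    models = []
    for seed in range(n_models):
        # A different seed gives a different initialization, dropout masks,
        # and minibatch ordering, so the models make somewhat different errors
        torch.manual_seed(seed)
        model = make_model()
        train(model)
        models.append(model.eval())
    return models
```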
Worked example: 3-model ensemble
| Model | Predicted probabilities: [cat, dog, bird] |
|---|---|
| Model 1 | [0.80, 0.15, 0.05] |
| Model 2 | [0.65, 0.25, 0.10] |
| Model 3 | [0.72, 0.20, 0.08] |
| Ensemble average | [0.723, 0.200, 0.077] |
The ensemble's average is typically more accurate than any individual model's prediction, and its probability estimates tend to be better calibrated.
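You can reproduce the averaged row directly:

```python
import torch

probs = torch.tensor([[0.80, 0.15, 0.05],
                      [0.65, 0.25, 0.10],
                      [0.72, 0.20, 0.08]])
print(probs.mean(dim=0))  # tensor([0.7233, 0.2000, 0.0767])
```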
Strategy 3: Snapshot Ensembling
What if you can't afford N separate training runs? Snapshot ensembling (Huang et al., 2017) gets an ensemble from a single training run using cyclic learning rates:
- Train with cosine annealing that resets periodically (SGDR)
- Save a checkpoint at the end of each cycle (when lr ≈ 0 and the model is well-converged)
- Each checkpoint represents a different local minimum the model settled into
- Ensemble the checkpoints
Result: M ensemble members (one per cycle) at the cost of a single training run, as the sketch below shows. The diversity comes from the different loss minima explored across cycles.
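A minimal sketch using PyTorch's built-in warm-restart scheduler, where train_one_epoch(model, optimizer) is a hypothetical helper that runs one epoch of training:

```python
import copy
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

def snapshot_ensemble(model, optimizer, train_one_epoch,
                      n_cycles=5, epochs_per_cycle=10):
    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=epochs_per_cycle)
    snapshots = []
    for epoch in range(n_cycles * epochs_per_cycle):
        train_one_epoch(model, optimizer)
        scheduler.step()  # cosine decay within a cycle, then the lr resets
        if (epoch + 1) % epochs_per_cycle == 0:
            # End of a cycle: lr is near zero, model sits in a local minimum
            snapshots.append(copy.deepcopy(model).eval())
    return snapshots
```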
Strategy 4: Monte Carlo Dropout
Recall from Lesson 7-4 that dropout randomly zeros activations during training. At inference, it's usually disabled (all neurons active). But what if we keep dropout enabled at inference and run multiple forward passes?
Each forward pass uses a different random dropout mask, effectively sampling from a different model. Average T such passes:

$$\hat{y} = \frac{1}{T}\sum_{t=1}^{T} f(x; m_t)$$

where:
- $\hat{y}$: the ensemble prediction
- $T$: the number of forward passes
- $f(x; m_t)$: the prediction with the t-th random dropout mask $m_t$
This is the Monte Carlo dropout approximation (Gal & Ghahramani, 2016) — it approximates Bayesian inference and provides uncertainty estimates. Running T=30 passes yields both a prediction and a per-class variance estimate from the model that was already trained, at the cost of extra forward passes rather than any extra training.
Code: Ensemble Prediction in PyTorch
```python
import torch

# Model averaging: given a list of trained models
def ensemble_predict(models, x):
    probs = torch.stack([
        torch.softmax(model(x), dim=-1)
        for model in models
    ])  # shape: (N_models, batch, num_classes)
    return probs.mean(dim=0)  # average over models

# Monte Carlo dropout: enable dropout at inference
def mc_dropout_predict(model, x, T=30):
    model.train()  # enables dropout (note: this also puts batch norm in training mode)
    with torch.no_grad():
        preds = torch.stack([
            torch.softmax(model(x), dim=-1) for _ in range(T)
        ])
    model.eval()
    mean = preds.mean(0)
    variance = preds.var(0)  # uncertainty estimate per class
    return mean, variance
```
For production ensembles, consider caching model outputs for static datasets rather than running all models on every request.
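A quick usage sketch, assuming models is a list of trained classifiers and x is an input batch (the 0.05 threshold is arbitrary, chosen for illustration):

```python
# Model averaging across the ensemble
avg_probs = ensemble_predict(models, x)
pred_class = avg_probs.argmax(dim=-1)

# MC dropout on a single model: mean prediction plus per-class uncertainty
mean, variance = mc_dropout_predict(models[0], x, T=30)
flag_for_review = variance.max(dim=-1).values > 0.05
```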