Ensemble Methods - Averaging Away the Variance

The bias-variance decomposition for ensembles, why uncorrelated errors cancel when averaged, three practical ensemble strategies with their tradeoffs, and how dropout at test time approximates an infinite ensemble cheaply.


Quick refresher

Bias-variance tradeoff

Model error decomposes into bias (consistently wrong in the same direction) and variance (wrong in different ways on different samples). High-variance models are individually unreliable but their errors tend to cancel when averaged. Ensembles reduce variance without increasing bias.
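
In symbols, for squared-error loss this is the standard decomposition (with $f$ the true function, $\hat{f}$ the trained model, and $\sigma_\epsilon^2$ the irreducible noise):

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\text{Var}\big(\hat{f}(x)\big)}_{\text{variance}} + \sigma_\epsilon^2$$

Ensembling attacks the middle term while leaving the bias term essentially unchanged.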

Example

Five models each predict 55% for an example with true label 1.

Average prediction: 55%.

All five are wrong in the same direction — averaging doesn't help here.

But if one predicts 30%, one 80%, one 50%, one 70%, and one 40%, averaging gives 54%. The ensemble is still on the correct side of the 50% decision boundary, and its squared error is lower than the average of the five models' individual squared errors, because errors in opposite directions partially cancel.

Why Single Models Are High-Variance

Train the same neural network architecture twice on the same data with different random seeds. You'll get two models with similar average performance — but different individual predictions on specific examples. Neither model is perfect, and their mistakes are somewhat different.

This is variance: the predictions depend on which particular random seed, data shuffling, and minibatch ordering the training run happened to use.

The key observation: if errors are uncorrelated, averaging cancels them out.

Imagine five friends each independently guess how many jelly beans are in a jar. Each person's guess is off by a different amount in a different direction. If you average their guesses, the random errors partially cancel. The average is almost always closer to the truth than any single guess — not because any one person is smarter, but because the errors don't all point in the same direction.

The Variance Reduction Proof

Let $x_1, \dots, x_N$ be N model predictions, each with variance $\sigma^2$ and zero bias (centered on the true value). Since independent predictions have uncorrelated errors, the variance of their sum is the sum of their variances, giving:

$$\text{Var}(\bar{x}) = \text{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right) = \frac{1}{N^2} \cdot N\sigma^2 = \frac{\sigma^2}{N}$$

where:

  • $\bar{x}$: the ensemble prediction (the average of the N models)
  • $N$: the number of ensemble members

With N=5 models, variance is reduced 5×. With N=10, reduced 10×. The law of large numbers is working in our favor: averaging independent estimates converges to the true value.

The critical assumption: independence (uncorrelated errors). If all models fail identically, the average fails too. The diversity strategy determines how well this assumption holds.
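
A quick numerical check of the $\sigma^2/N$ claim. This is a sketch, not part of the lesson's code: it simulates N independent "models" as noisy estimates of a true value and compares the variance of a single model with the variance of their average.

import torch

torch.manual_seed(0)
sigma, N, trials = 0.5, 5, 100_000

# rows = trials, columns = N independent "model" predictions (truth + noise)
preds = 1.0 + sigma * torch.randn(trials, N)

print(preds[:, 0].var())        # single model:     ~ sigma^2     = 0.25
print(preds.mean(dim=1).var())  # ensemble average: ~ sigma^2 / N = 0.05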

Strategy 1: Bagging (Bootstrap Aggregating)

Train N models, each on a different bootstrap sample: a random sample the same size as the training set, drawn from it with replacement.

Why it creates diversity: Each bootstrap sample excludes about 36.8% of the original training examples (on average). Each model sees slightly different data → slightly different learned parameters → somewhat different errors.

Random forests are the canonical bagging example: many decision trees, each trained on a bootstrap sample with random feature subsets.

Neural network bagging is effective but expensive: N× training time and N× inference time. Worthwhile for high-stakes applications where accuracy matters more than speed.
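
A sketch of the bootstrap-sampling side in PyTorch. The `dataset`, `make_model`, and `train` names are placeholders, not from the lesson:

import torch
from torch.utils.data import DataLoader, Subset

def bootstrap_loader(dataset, batch_size=64):
    # draw len(dataset) indices with replacement -> a bootstrap sample
    idx = torch.randint(0, len(dataset), (len(dataset),)).tolist()
    return DataLoader(Subset(dataset, idx), batch_size=batch_size, shuffle=True)

# each ensemble member trains on its own bootstrap sample
models = []
for _ in range(5):
    model = make_model()                     # hypothetical model factory
    train(model, bootstrap_loader(dataset))  # hypothetical training loop
    models.append(model)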

Strategy 2: Model Averaging

Train N models with:

  • Different random initialization seeds
  • Different architectures (vary depth, width, activations)
  • Different hyperparameters (learning rate, dropout rate, weight decay)

Then average predictions at inference. No bootstrapping needed.

Worked example: 3-model ensemble

| Model | Predicted probabilities [cat, dog, bird] |
| --- | --- |
| Model 1 | [0.80, 0.15, 0.05] |
| Model 2 | [0.65, 0.25, 0.10] |
| Model 3 | [0.72, 0.20, 0.08] |
| Ensemble average | [0.723, 0.200, 0.077] |

The ensemble's average is typically more accurate than any individual model's prediction, and its probability estimates are better calibrated.
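
The ensemble row is just a per-class mean; a two-line sketch to verify the arithmetic:

import torch

probs = torch.tensor([[0.80, 0.15, 0.05],
                      [0.65, 0.25, 0.10],
                      [0.72, 0.20, 0.08]])
print(probs.mean(dim=0))  # tensor([0.7233, 0.2000, 0.0767])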

Strategy 3: Snapshot Ensembling

What if you can't afford N separate training runs? Snapshot ensembling (Huang et al., 2017) gets an ensemble from a single training run using cyclic learning rates:

  1. Train with cosine annealing that resets periodically (SGDR)
  2. Save a checkpoint at the end of each cycle (when lr ≈ 0 and the model is well-converged)
  3. Each checkpoint represents a different local minimum the model settled into
  4. Ensemble the checkpoints

Result: N ensemble members at the cost of a single training run. The diversity comes from the different loss minima explored across cycles. A minimal sketch of the loop appears below.
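
This sketch assumes `model` exists and uses a hypothetical `train_one_epoch` helper; PyTorch's `CosineAnnealingWarmRestarts` scheduler implements the SGDR schedule named above:

import copy
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# cosine annealing that restarts every T_0 = 50 epochs (SGDR)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=50)

snapshots = []
for epoch in range(250):                   # 5 cycles -> 5 snapshots
    train_one_epoch(model, optimizer)      # hypothetical training loop
    scheduler.step()
    if (epoch + 1) % 50 == 0:              # end of a cycle: lr near its minimum
        snapshots.append(copy.deepcopy(model.state_dict()))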

Strategy 4: Monte Carlo Dropout

Lesson 7-4 introduced dropout, which randomly zeros activations during training. At inference it's usually disabled (all neurons active). But what if we keep dropout enabled at inference and run multiple forward passes?

Each forward pass uses a different random dropout mask, effectively sampling from a different model. Average T such passes:

$$\hat{y} = \frac{1}{T} \sum_{t=1}^{T} f_t(x)$$

where:

  • $\hat{y}$: the ensemble prediction
  • $T$: the number of forward passes
  • $f_t(x)$: the prediction under the t-th random dropout mask

This is the Monte Carlo dropout approximation (Gal & Ghahramani, 2016) — it approximates Bayesian inference and provides uncertainty estimates. Running T=30 passes gives both a prediction and a per-class variance estimate from the already-trained model; the only cost is T forward passes at inference time.

Code: Ensemble Prediction in PyTorch

import torch

# Model averaging: given a list of trained models
def ensemble_predict(models, x):
    probs = torch.stack([
        torch.softmax(model(x), dim=-1)
        for model in models
    ])  # shape: (N_models, batch, num_classes)
    return probs.mean(dim=0)  # average over models

# Monte Carlo dropout: enable dropout (and only dropout) at inference.
# Calling model.train() would also put BatchNorm layers into training
# mode and corrupt their running statistics, so flip just the Dropout
# modules (extend the isinstance check if the model uses Dropout2d/3d).
def mc_dropout_predict(model, x, T=30):
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()  # sample a fresh random mask on every forward pass
    with torch.no_grad():
        preds = torch.stack([
            torch.softmax(model(x), dim=-1) for _ in range(T)
        ])  # shape: (T, batch, num_classes)
    model.eval()  # restore pure inference mode
    mean = preds.mean(0)
    variance = preds.var(0)  # per-class uncertainty estimate
    return mean, variance

For production ensembles, consider caching model outputs for static datasets rather than running all models on every request.

Quiz

Question 1 of 3

If N independent models each have prediction variance σ², what is the variance of their average prediction?