Regularization
Lesson 4 ⏱ 10 min

Dropout


Dropout - Training an Exponential Ensemble

How randomly zeroing neurons during training prevents co-adaptation, and why inverted dropout makes inference clean.

⏱ ~6 min

🧮 Quick refresher

Expected value of a random variable

The expected value E[X] is the average outcome of a random process. If X = 1 with probability 0.5 and X = 0 with probability 0.5, then E[X] = 0.5.

Example

Consider a neuron with activation 2.0 and dropout rate p = 0.5.

During training it is active half the time: E[activation] = 0.5 * 2.0 = 1.0.

At test time we want this expected value, so we scale by (1-p) = 0.5.
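The arithmetic above can be checked in a couple of lines of plain Python (the variable names are illustrative, not from any library):

```python
# A neuron under classic dropout: active with probability (1 - p),
# zeroed with probability p.
activation = 2.0
p = 0.5  # dropout rate

# Expected activation during training: E[m * h] = (1 - p) * h
expected_train = (1 - p) * activation

# Classic dropout matches this at test time by scaling the always-on
# activation by the survival probability (1 - p).
test_scaled = (1 - p) * activation

print(expected_train)  # 1.0
print(test_scaled)     # 1.0
```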

Imagine you are on a team where one person does all the important work. Everyone else just goes along for the ride - they have learned they do not need to try because that one person handles everything. Now imagine randomly removing team members from each meeting. Suddenly everyone has to be capable and independent, because they never know who else will be there. That is dropout.

The Mechanism

During each training forward pass, randomly set each neuron's activation to zero with probability p. The remaining activations are kept unchanged.

With p = 0.5 (50% dropout), roughly half the neurons are silenced on any given step. Critically, a different random subset is dropped each training step.

\tilde{h}_i = m_i \cdot h_i, \quad m_i \sim \text{Bernoulli}(1-p)

h_i: activation of neuron i
m_i: binary mask - 1 if the neuron survives, 0 if it is dropped. Think of it as a coin flip: heads (probability 1-p) means the neuron stays, tails (probability p) means it is set to zero.
p: dropout probability

The Bernoulli(1-p) distribution is just a weighted coin flip: it equals 1 with probability (1-p) and 0 with probability p. So if p = 0.3, each neuron independently has a 30% chance of being zeroed on each forward pass. The network cannot develop dependencies of the form "neuron 47 always reinforces neuron 12, so neuron 12 does not need to be useful on its own."
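A minimal sketch of this mechanism in plain Python (the function name `apply_dropout` is illustrative; this is classic dropout with no rescaling yet):

```python
import random

def apply_dropout(activations, p, rng):
    """Classic dropout: zero each activation independently with
    probability p. The Bernoulli(1 - p) mask is sampled fresh on
    every call, mirroring a fresh mask on every forward pass."""
    # rng.random() < (1 - p) is True with probability (1 - p),
    # i.e. the neuron survives the coin flip.
    return [h if rng.random() < (1 - p) else 0.0 for h in activations]

rng = random.Random(0)
h = [0.5, 2.0, -1.3, 0.8]
print(apply_dropout(h, p=0.5, rng=rng))  # a random subset is now 0.0
print(apply_dropout(h, p=0.5, rng=rng))  # a *different* random subset
```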

Why This Regularizes

Without dropout, a high-capacity network can develop co-adaptations: complex feature detectors that only work when many specific neurons all fire together.

Dropout breaks these co-adaptations. If neuron A's signal is only useful when combined with neurons B and C (which might be zeroed), then A learns to carry useful information independently. The result is more robust, redundant feature representations - exactly what you want for generalization.

You can also think of it as noise injection during training. The network must learn to be correct despite random corruptions of its own activations. Robustness to corruption leads to robustness to distribution shift.

Test Time: Rescaling

There is a subtlety at test time. During training with p = 0.5, only about 50% of neurons were active on any given step. At test time, all neurons are active. Activations are roughly twice as large as what the network was trained to expect.

Two equivalent approaches fix this:

Classic dropout: at test time, multiply each neuron's output by (1-p). With p = 0.5, halve all activations.

Inverted dropout (modern standard): at train time, divide surviving activations by (1-p). With p = 0.5, multiply active neurons by 2 during training. At test time, use the network as-is with no modification needed.

\tilde{h}_i = \frac{m_i \cdot h_i}{1-p} \quad \text{(inverted dropout, train time)}

(1-p): survival probability - fraction of neurons kept active

PyTorch and TensorFlow both use inverted dropout by default. It is cleaner for deployment - inference is just the normal network.
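The two modes can be sketched in one function (plain Python, illustrative names; real frameworks implement the same idea as a layer such as PyTorch's `nn.Dropout`):

```python
import random

def inverted_dropout(activations, p, training, rng=None):
    """Inverted dropout: rescale survivors by 1/(1 - p) at train time
    so that E[output] matches the plain activation, and inference
    needs no adjustment at all."""
    if not training:
        return list(activations)  # test time: the normal network
    keep = 1.0 - p
    # Survivors are scaled up by 1/keep; dropped neurons emit 0.
    return [h / keep if rng.random() < keep else 0.0 for h in activations]

rng = random.Random(42)
h = [0.5, 2.0, -1.3]
print(inverted_dropout(h, p=0.5, training=True, rng=rng))  # survivors doubled
print(inverted_dropout(h, p=0.5, training=False))          # unchanged
```

Averaged over many training steps, each output's expectation equals the undropped activation, which is exactly why inference can skip the rescaling.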

Interactive example (coming soon): dropout visualization - toggle neurons on/off and watch the feature map change.

Choosing Dropout Rates

Not all layers need the same rate:

  • Large hidden layers (1024, 2048 neurons): p = 0.5 is common and usually effective.
  • Smaller layers (128, 256 neurons): p = 0.1 to 0.2; aggressive dropout hurts too much.
  • Output layer: never apply dropout to the output layer.
  • Convolutional layers: spatial dropout (dropping entire feature maps) works better than element-wise dropout.
  • Small networks: if the network does not have excess capacity, dropout can hurt performance.
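These rules of thumb could be encoded in a small helper. This is a hypothetical sketch of the guidelines above, not a function from any framework, and the width threshold is an assumption:

```python
def suggested_dropout(layer_width, is_output=False):
    """Rough, illustrative encoding of the rate guidelines above.
    Real rates should be tuned per model via validation performance."""
    if is_output:
        return 0.0   # never apply dropout to the output layer
    if layer_width >= 1024:
        return 0.5   # large hidden layers: p = 0.5 is a common default
    return 0.2       # smaller layers: stay in the 0.1-0.2 range

print(suggested_dropout(2048))                 # 0.5
print(suggested_dropout(256))                  # 0.2
print(suggested_dropout(10, is_output=True))   # 0.0
```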

Relationship to Data Augmentation

Both dropout and data augmentation create variations during training:

  • Data augmentation: varies the input (flip the image, add noise, crop differently).
  • Dropout: varies the network (randomly disable neurons).

Both force the model to learn invariant features that work across the variation. They are complementary and are often used together in image classification pipelines.

Interactive example (coming soon): dropout comparison - training vs. validation loss with and without dropout over epochs.

Quiz

1 / 3

During training with dropout rate p=0.5, each neuron...