Regularization
Lesson 4 ⏱ 10 min

Dropout


Dropout - Training an Exponential Ensemble

How randomly zeroing neurons during training prevents co-adaptation, and why inverted dropout makes inference clean.

⏱ ~6 min

🧮 Quick refresher

Expected value of a random variable

The expected value E[X] is the average outcome of a random process. If X = 1 with probability 0.5 and X = 0 with probability 0.5, then E[X] = 0.5.

Example

Consider a neuron with activation 2.0 and dropout rate p = 0.5.

During training it is active half the time: E[activation] = 0.5 * 2.0 = 1.0.

At test time we want this expected value, so we scale by (1-p) = 0.5.
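The arithmetic above can be checked in a couple of lines of plain Python (the variable names are illustrative, not from any library):

```python
# A neuron under classic dropout: active with probability (1 - p),
# zeroed with probability p.
activation = 2.0
p = 0.5  # dropout rate

# Expected activation during training: E[m * h] = (1 - p) * h
expected_train = (1 - p) * activation

# Classic dropout matches this at test time by scaling the always-on
# activation by the survival probability (1 - p).
test_scaled = (1 - p) * activation

print(expected_train)  # 1.0
print(test_scaled)     # 1.0
```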

Imagine you are on a team where one person does all the important work. Everyone else just goes along for the ride - they have learned they do not need to try because that one person handles everything. Now imagine randomly removing team members from each meeting. Suddenly everyone has to be capable and independent, because they never know who else will be there. That is dropout.

The Mechanism

During each training forward pass, randomly set each neuron's activation to zero with probability p. The remaining activations are kept unchanged.

With p = 0.5 (50% dropout), roughly half the neurons are silenced on any given step. Critically, a different random subset is dropped each training step.

\tilde{h}_i = m_i \cdot h_i, \quad m_i \sim \text{Bernoulli}(1-p)

h_i: activation of neuron i
m_i: binary mask - 1 if the neuron survives, 0 if it is dropped. Think of it as a coin flip: heads (probability 1-p) means the neuron stays, tails (probability p) means it is set to zero.
p: dropout probability

The Bernoulli(1-p) distribution is just a weighted coin flip: it equals 1 with probability (1-p) and 0 with probability p. So if p = 0.3, each neuron independently has a 30% chance of being zeroed on each forward pass. The network cannot develop dependencies of the form "neuron 47 always reinforces neuron 12, so neuron 12 does not need to be useful on its own."
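A minimal sketch of this mechanism in plain Python (the function name `apply_dropout` is illustrative; this is classic dropout with no rescaling yet):

```python
import random

def apply_dropout(activations, p, rng):
    """Classic dropout: zero each activation independently with
    probability p. The Bernoulli(1 - p) mask is sampled fresh on
    every call, mirroring a fresh mask on every forward pass."""
    # rng.random() < (1 - p) is True with probability (1 - p),
    # i.e. the neuron survives the coin flip.
    return [h if rng.random() < (1 - p) else 0.0 for h in activations]

rng = random.Random(0)
h = [0.5, 2.0, -1.3, 0.8]
print(apply_dropout(h, p=0.5, rng=rng))  # a random subset is now 0.0
print(apply_dropout(h, p=0.5, rng=rng))  # a *different* random subset
```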

Why This Regularizes

Without dropout, a high-capacity network can develop co-adaptations: complex feature detectors that only work when many specific neurons all fire together.

Dropout breaks these co-adaptations. If neuron A's signal is only useful when combined with neurons B and C (which might be zeroed), then A learns to carry useful information independently. The result is more robust, redundant feature representations - exactly what you want for generalization.

You can also think of it as noise injection during training. The network must learn to be correct despite random corruptions of its own activations. Robustness to corruption leads to robustness to distribution shift.

Test Time: Rescaling

There is a subtlety at test time. During training with p = 0.5, only about 50% of neurons were active on any given step. At test time, all neurons are active. Activations are roughly twice as large as what the network was trained to expect.

Two equivalent approaches fix this:

Classic dropout: at test time, multiply each neuron's output by (1-p). With p = 0.5, halve all activations.

Inverted dropout (modern standard): at train time, divide surviving activations by (1-p). With p = 0.5, multiply active neurons by 2 during training. At test time, use the network as-is with no modification needed.

\tilde{h}_i = \frac{m_i \cdot h_i}{1-p} \quad \text{(inverted dropout, train time)}

(1-p): survival probability - fraction of neurons kept active

PyTorch and TensorFlow both use inverted dropout by default. It is cleaner for deployment - inference is just the normal network.
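The two modes can be sketched in one function (plain Python, illustrative names; real frameworks implement the same idea as a layer such as PyTorch's `nn.Dropout`):

```python
import random

def inverted_dropout(activations, p, training, rng=None):
    """Inverted dropout: rescale survivors by 1/(1 - p) at train time
    so that E[output] matches the plain activation, and inference
    needs no adjustment at all."""
    if not training:
        return list(activations)  # test time: the normal network
    keep = 1.0 - p
    # Survivors are scaled up by 1/keep; dropped neurons emit 0.
    return [h / keep if rng.random() < keep else 0.0 for h in activations]

rng = random.Random(42)
h = [0.5, 2.0, -1.3]
print(inverted_dropout(h, p=0.5, training=True, rng=rng))  # survivors doubled
print(inverted_dropout(h, p=0.5, training=False))          # unchanged
```

Averaged over many training steps, each output's expectation equals the undropped activation, which is exactly why inference can skip the rescaling.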

Interactive example (coming soon): dropout visualization - toggle neurons on/off and watch the feature map change.

Choosing Dropout Rates

Not all layers need the same rate:

  • Large hidden layers (1024, 2048 neurons): p = 0.5 is common and usually effective.
  • Smaller layers (128, 256 neurons): p = 0.1 to 0.2; aggressive dropout hurts too much.
  • Output layer: never apply dropout to the output layer.
  • Convolutional layers: spatial dropout (dropping entire feature maps) works better than element-wise dropout.
  • Small networks: if the network does not have excess capacity, dropout can hurt performance.
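These rules of thumb could be encoded in a small helper. This is a hypothetical sketch of the guidelines above, not a function from any framework, and the width threshold is an assumption:

```python
def suggested_dropout(layer_width, is_output=False):
    """Rough, illustrative encoding of the rate guidelines above.
    Real rates should be tuned per model via validation performance."""
    if is_output:
        return 0.0   # never apply dropout to the output layer
    if layer_width >= 1024:
        return 0.5   # large hidden layers: p = 0.5 is a common default
    return 0.2       # smaller layers: stay in the 0.1-0.2 range

print(suggested_dropout(2048))                 # 0.5
print(suggested_dropout(256))                  # 0.2
print(suggested_dropout(10, is_output=True))   # 0.0
```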

Relationship to Data Augmentation

Both dropout and data augmentation create variations during training:

  • Data augmentation: varies the input (flip the image, add noise, crop differently).
  • Dropout: varies the network (randomly disable neurons).

Both force the model to learn invariant features that work across the variation. They are complementary and are often used together in image classification pipelines.

Interactive example (coming soon): dropout comparison - training vs. validation loss with and without dropout over epochs.

Quiz

1 / 3

During training with dropout rate p=0.5, each neuron...