Why Initialization Matters
A neural network at initialization has random weights. Those random weights get multiplied together across every layer to produce activations and gradients. This is where a subtle but catastrophic failure can happen before training even begins.
Consider a 10-layer network. If each layer multiplies the signal magnitude by 1.5 on average, after 10 layers the magnitude is $1.5^{10} \approx 57.7$. If each layer multiplies by 0.7, after 10 layers the magnitude is $0.7^{10} \approx 0.028$.
Exploding activations: values grow exponentially, gradients blow up, training fails immediately. Vanishing activations: values shrink to near zero, gradients also near zero, no learning happens.
The right initialization keeps signal magnitudes approximately stable as they pass through the network.
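The compounding shows up immediately in code. Below is a minimal sketch of the effect; the width, depth, and the two weight scales (0.10 and 0.04, giving per-layer factors of about 1.6 and 0.64) are illustrative choices, not values from the text.

```python
import torch

def signal_through_depth(weight_std, depth=10, width=256):
    """Push a unit-variance random input through `depth` random linear layers
    and report the standard deviation of the final activations."""
    torch.manual_seed(0)
    x = torch.randn(1024, width)                    # batch of inputs, std ≈ 1
    for _ in range(depth):
        W = torch.randn(width, width) * weight_std  # zero-mean random weights
        x = x @ W
    return x.std().item()

# Per-layer scale factor is roughly weight_std * sqrt(width).
print(signal_through_depth(0.10))  # factor ≈ 1.6 per layer → explodes (~1.6^10 ≈ 110)
print(signal_through_depth(0.04))  # factor ≈ 0.64 per layer → vanishes (~0.64^10 ≈ 0.01)
```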
Deriving the Variance Condition
Consider a single layer $y = Wx$, where $W$ is $n \times m$ and $x$ is $m \times 1$. Output neuron $i$:

$$y_i = \sum_{j=1}^{m} W_{ij}\, x_j$$

Assume $W$ and $x$ are independent and both zero-mean. Each term $W_{ij} x_j$ then has variance $\mathrm{Var}(W)\,\mathrm{Var}(x)$. Summing $m$ independent terms:

$$\mathrm{Var}(y_i) = m \cdot \mathrm{Var}(W) \cdot \mathrm{Var}(x)$$

- $\mathrm{Var}(y_i)$: variance of output neuron $i$
- $m$ = fan_in: number of input neurons (input dimension)
- $\mathrm{Var}(W)$: variance of each weight (all weights share this)
- $\mathrm{Var}(x)$: variance of the input features

To keep $\mathrm{Var}(y_i) = \mathrm{Var}(x)$ (no signal amplification), we need:

$$\mathrm{Var}(W) = \frac{1}{m} = \frac{1}{\text{fan\_in}}$$
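A quick Monte Carlo check of the relation above; a minimal sketch where the layer sizes and variances are arbitrary illustrative values.

```python
import torch

torch.manual_seed(0)
m, n = 512, 512            # fan_in, fan_out (arbitrary)
var_w, var_x = 0.01, 4.0   # weight and input variances (arbitrary)

W = torch.randn(n, m) * var_w ** 0.5        # zero-mean weights with Var(W) = var_w
x = torch.randn(m, 10_000) * var_x ** 0.5   # many input vectors with Var(x) = var_x
y = W @ x

print(y.var().item())      # measured output variance
print(m * var_w * var_x)   # predicted: m * Var(W) * Var(x) = 512 * 0.01 * 4 = 20.48
```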
Xavier/Glorot Initialization
The forward-pass condition gives Var(W) = 1/fan_in. The backward-pass (gradient) condition gives Var(W) = 1/fan_out. Xavier/Glorot initialization takes the compromise:
$$\mathrm{Var}(W) = \frac{2}{\text{fan\_in} + \text{fan\_out}}$$

- $\text{fan\_in}$: number of input neurons to this layer
- $\text{fan\_out}$: number of output neurons from this layer
In practice, this is usually implemented as a uniform distribution:
$$W_{ij} \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{\text{fan\_in} + \text{fan\_out}}},\; +\sqrt{\frac{6}{\text{fan\_in} + \text{fan\_out}}}\right)$$

Each weight is drawn uniformly from this interval. (A uniform distribution on $[-a, a]$ has variance $a^2/3$, which is how the $\sqrt{6}$ bound reproduces the variance above.)
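A quick check that the uniform bound gives the target variance; the layer sizes below are illustrative.

```python
import torch

fan_in, fan_out = 256, 128                    # illustrative layer sizes
bound = (6.0 / (fan_in + fan_out)) ** 0.5

# Sample a weight matrix from U(-bound, +bound)
W = torch.empty(fan_out, fan_in).uniform_(-bound, bound)

print(W.var().item())            # measured weight variance
print(2.0 / (fan_in + fan_out))  # target: 2 / (fan_in + fan_out) ≈ 0.0052
```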
Xavier is designed for tanh and sigmoid activations — functions that are roughly linear near zero, preserving the variance analysis above.
He Initialization
ReLU introduces a new wrinkle. For a zero-mean Gaussian input, ReLU zeroes out roughly 50% of values (all the negatives), which halves the signal's second moment:

$$\mathbb{E}\big[\mathrm{ReLU}(y)^2\big] = \tfrac{1}{2}\,\mathbb{E}\big[y^2\big] = \tfrac{1}{2}\,\mathrm{Var}(y) \quad \text{for zero-mean } y$$
Each layer with ReLU halves the signal variance. He initialization (Kaiming initialization) corrects for this by doubling the weight variance:
$$\mathrm{Var}(W) = \frac{2}{\text{fan\_in}}$$

- $\text{fan\_in}$: number of input neurons to this layer
The factor 2 in the numerator exactly compensates for ReLU's 50% kill rate, keeping variance stable across layers.
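Both facts, the 50% kill rate and the factor-2 correction, can be checked directly; a minimal sketch with an illustrative width of 256.

```python
import torch

torch.manual_seed(0)
fan_in = 256                       # illustrative width
x = torch.randn(10_000, fan_in)    # zero-mean, unit-variance inputs

# ReLU zeroes the negative half, halving the mean square of the signal
z = torch.relu(x)
print((z ** 2).mean().item())      # ≈ 0.5

# He-initialized weights: Var(W) = 2 / fan_in
W = torch.randn(fan_in, fan_in) * (2.0 / fan_in) ** 0.5
y = torch.relu(x @ W)
print((y ** 2).mean().item())      # ≈ 1.0: the factor 2 cancels the ReLU halving
```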
Worked Numerical Comparison
5-layer network, each layer has fan_in = fan_out = 256:
Bad initialization: Var(W) = 1 (too large)
| Layer | Var(activation) |
|---|---|
| 1 | 256 × 1 × 1 = 256 |
| 2 | 256 × 1 × 256 = 65,536 |
| 3 | ~16,777,216 |
| 4 | ~4.3 × 10⁹ |
| 5 | ~1.1 × 10¹² |

The variance keeps multiplying by 256 at every layer; extend the stack by a few dozen layers and the activations overflow float32 outright.
He initialization: Var(W) = 2/256 ≈ 0.0078 (for ReLU)
| Layer | Var(activation) |
|---|---|
| 1 | 256 × 0.0078 × 1 ≈ 2 → ReLU halves it → ≈ 1.0 |
| 2 | ≈ 1.0 |
| 3 | ≈ 1.0 |
| 4 | ≈ 1.0 |
| 5 | ≈ 1.0 |
Signal variance stays stable through the entire forward pass.
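The tables can be reproduced with a short simulation. One caveat: the bad-initialization table above ignores the ReLU halving, so the simulated growth factor is roughly 128 per layer rather than 256, but the qualitative picture is identical. The batch size here is an arbitrary choice.

```python
import torch

def layerwise_signal(weight_var, depth=5, width=256, batch=4096):
    """Forward a random batch through `depth` ReLU layers and record the
    mean-squared activation (the quantity the variance analysis tracks)."""
    torch.manual_seed(0)
    x = torch.randn(batch, width)                        # unit-variance input
    signal = []
    for _ in range(depth):
        W = torch.randn(width, width) * weight_var ** 0.5
        x = torch.relu(x @ W)
        signal.append((x ** 2).mean().item())
    return signal

print(layerwise_signal(weight_var=1.0))        # explodes: roughly ×128 per layer
print(layerwise_signal(weight_var=2.0 / 256))  # He scaling: stays near 1.0
```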
Which to Use
| Activation | Initialization | Why |
|---|---|---|
| tanh, sigmoid | Xavier/Glorot | Designed for near-linear activations |
| ReLU | He (Kaiming) | Corrects for 50% kill rate |
| GELU, Swish | He (usually) | Similar to ReLU, safe default |
| Linear (no activation) | Xavier or He | Both work; Xavier common |
Code: Initialization in PyTorch
```python
import torch.nn as nn

# PyTorch's default for nn.Linear is Kaiming uniform (He-like),
# so a freshly constructed layer is already reasonably initialized.
layer = nn.Linear(256, 256)

# Explicit re-initialization (each call overwrites the weights in place)
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')  # He, normal distribution
nn.init.xavier_uniform_(layer.weight)                       # Xavier, uniform distribution

# Custom initialization for a whole model
def init_weights(module):
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
model.apply(init_weights)
```
`kaiming_normal_` is He initialization with a normal distribution; `kaiming_uniform_` uses a uniform distribution. For most ReLU-based networks either works: the difference is minor in practice when BatchNorm is present.
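For the tanh row of the table above, PyTorch exposes the activation-specific scaling through `nn.init.calculate_gain`, which the Xavier initializers accept as a `gain` argument; a brief sketch:

```python
import torch.nn as nn

layer = nn.Linear(256, 256)

# Xavier scaled by the recommended gain for tanh (5/3)
gain = nn.init.calculate_gain('tanh')
nn.init.xavier_uniform_(layer.weight, gain=gain)

# For ReLU, the kaiming_* initializers already apply the sqrt(2) gain internally
nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')
```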