You've set up your network architecture. Every parameter starts as a random number. But how random? What distribution should you sample from?
This sounds like an implementation detail — surely the network will learn its way to good weights regardless. It won't. The initial scale of weights determines whether training succeeds at all, especially for deep networks. Let's see exactly why.
Poor weight initialization can cause training to fail before it starts. A network initialized with weights that are slightly too large will explode activations through every layer; slightly too small and all gradients vanish. Xavier and Kaiming initialization are the principled solutions that every modern deep learning library uses by default.
The Problem: Weights Are Multiplied
A network's forward pass is a long chain of matrix multiplications. Each layer takes the previous layer's output, multiplies it by a weight matrix, and passes the result forward. If the weight matrices are slightly too large, the multiplication amplifies signals. Slightly too small, and it shrinks them.
Neither effect sounds catastrophic for a single layer. But they're applied again, and again, and again — once per layer. Amplification or shrinkage compounds exponentially.
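A toy calculation makes the compounding concrete (plain Python; the per-layer gains of 1.1 and 0.9 are arbitrary stand-ins for "slightly too large" and "slightly too small"):

```python
# A per-layer gain only slightly away from 1.0 compounds exponentially with depth.
for gain in (1.1, 1.0, 0.9):
    signal = 1.0
    for _ in range(50):        # apply the same gain once per "layer"
        signal *= gain
    print(f"gain {gain}: signal after 50 layers = {signal:.3g}")
# 1.1 ends near 117; 0.9 ends near 0.005; neither is usable
```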
Variance Propagation: The Math
Consider one layer of neurons. Each output neuron computes:

$$y = \sum_{i=1}^{n} w_i x_i$$

- $y$: output of one neuron
- $w_i$: weight connecting input $i$ to this neuron
- $x_i$: $i$-th input value

Assume the $w_i$ and $x_i$ are independent, each with mean 0 and variances $\sigma_w^2$ and $\sigma_x^2$.

Since each term $w_i x_i$ is independent with variance $\sigma_w^2 \sigma_x^2$, and we're summing $n$ such terms:

$$\operatorname{Var}(y) = n \, \sigma_w^2 \, \sigma_x^2$$

- $\operatorname{Var}(y)$: variance of the output neuron

Biologists will recognize this structure from the variance of a sum of independent random variables — this is the same variance propagation rule used in error propagation analysis.

This one equation governs everything. Layer by layer, variance is multiplied by $n \, \sigma_w^2$.
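A quick Monte Carlo check of this formula (a NumPy sketch; the fan-in, variances, and sample count below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                          # number of inputs feeding each neuron
sigma_w2, sigma_x2 = 0.5, 2.0    # weight and input variances (arbitrary)
trials = 20_000                  # number of independent neurons to simulate

w = rng.normal(0.0, np.sqrt(sigma_w2), size=(trials, n))
x = rng.normal(0.0, np.sqrt(sigma_x2), size=(trials, n))
y = (w * x).sum(axis=1)          # y = sum_i w_i * x_i, once per trial

print(np.var(y))                 # measured Var(y): close to 100
print(n * sigma_w2 * sigma_x2)   # predicted n * sigma_w^2 * sigma_x^2 = 100
```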
The Explosion Case
Say $n = 100$ neurons per layer, with weights drawn from $\mathcal{N}(0, 1)$ so $\sigma_w^2 = 1$:

$$n \, \sigma_w^2 = 100 \cdot 1 = 100$$

Each layer multiplies variance by 100. Starting with $\operatorname{Var}(x) = 1$, after $L$ layers:

$$\operatorname{Var}\!\left(y^{(L)}\right) = 100^L$$

- $\operatorname{Var}\!\left(y^{(L)}\right)$: variance after $L$ layers
| Layers | Variance | Std Dev |
|---|---|---|
| 1 | 100 | 10 |
| 2 | 10,000 | 100 |
| 5 | 10¹⁰ | 10⁵ |
| 10 | 10²⁰ | 10¹⁰ |
After 10 layers, activations have standard deviation ten billion. Any loss computed on these values is numerically meaningless — usually NaN within the first forward pass.
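A small simulation of this case (a NumPy sketch: width 100, depth 10, $\sigma_w = 1$, and a batch of unit-variance inputs, matching the table above):

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 100, 10
x = rng.normal(0.0, 1.0, size=(512, n))       # inputs with variance 1

h = x
for layer in range(1, L + 1):
    W = rng.normal(0.0, 1.0, size=(n, n))     # sigma_w = 1: far too large
    h = h @ W                                 # linear layer, no activation
    print(f"layer {layer:2d}: std = {h.std():.3e}")
# std grows ~10x per layer, reaching ~1e10 by layer 10
```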
The Vanishing Case
Now try small weights: $\sigma_w = 0.01$, so $\sigma_w^2 = 10^{-4}$:

$$n \, \sigma_w^2 = 100 \cdot 10^{-4} = 0.01$$

After 10 layers: $\operatorname{Var}\!\left(y^{(10)}\right) = 0.01^{10} = 10^{-20}$. Standard deviation: $10^{-10}$. All activations collapse to near-zero.
A network with vanished activations learns nothing. Every neuron produces a near-zero output regardless of the input — no signal can propagate.
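To see the "learns nothing" part, here is a PyTorch sketch (depth 10, width 100, and $\sigma_w = 0.01$ follow the example above; the squared-output loss is an arbitrary stand-in just to produce gradients):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = []
for _ in range(10):                                      # 10 layers of width 100
    layer = nn.Linear(100, 100)
    nn.init.normal_(layer.weight, mean=0.0, std=0.01)    # sigma_w = 0.01: too small
    nn.init.zeros_(layer.bias)
    layers.append(layer)
net = nn.Sequential(*layers)

x = torch.randn(64, 100)
out = net(x)
out.pow(2).mean().backward()                  # arbitrary loss, just to get gradients

print(out.std().item())                       # activations: ~1e-10
print(net[0].weight.grad.abs().max().item())  # first-layer gradient: effectively zero
```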
The Goldilocks Condition
We want variance to be preserved through the network: $\operatorname{Var}(y) = \operatorname{Var}(x)$. From the propagation formula:

$$n \, \sigma_w^2 \, \sigma_x^2 = \sigma_x^2$$

Solving:

$$\sigma_w^2 = \frac{1}{n} \qquad \Longrightarrow \qquad \sigma_w = \frac{1}{\sqrt{n}}$$

Initialize each weight as $w \sim \mathcal{N}\!\left(0, \tfrac{1}{n}\right)$ and the variance stays constant through every layer. Activations neither grow nor shrink.

Let's verify numerically. With $n = 100$ and $\sigma_w^2 = \tfrac{1}{100} = 0.01$:

$$n \, \sigma_w^2 = 100 \cdot 0.01 = 1$$

Variance is exactly preserved. After 10 layers: $1^{10} = 1$. ✓
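The same simulation as the explosion case, now with $\sigma_w = 1/\sqrt{n}$ (a NumPy sketch, same width and depth as before):

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 100, 10
x = rng.normal(0.0, 1.0, size=(512, n))                  # Var(x) = 1

h = x
for layer in range(1, L + 1):
    W = rng.normal(0.0, np.sqrt(1.0 / n), size=(n, n))   # sigma_w^2 = 1/n
    h = h @ W
    print(f"layer {layer:2d}: variance = {h.var():.3f}")
# variance stays close to 1.0 at every layer
```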
Why This Still Isn't Complete
The analysis above assumed linear layers (no activation functions). With nonlinearities, the story changes:
ReLU zeros out half of all activations (those where the input is negative). This effectively halves the variance at every layer. You'd need to compensate by doubling the weight variance: $\sigma_w^2 = \tfrac{2}{n}$.
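A sketch of that compensation (NumPy; the 20-layer, 100-unit ReLU stack is an arbitrary example). It tracks the root-mean-square of the activations, since post-ReLU outputs are no longer zero-mean:

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 100, 20
x = rng.normal(0.0, 1.0, size=(512, n))            # unit-scale inputs

for var_w, label in [(1.0 / n, "1/n"), (2.0 / n, "2/n")]:
    h = x
    for _ in range(L):
        W = rng.normal(0.0, np.sqrt(var_w), size=(n, n))
        h = np.maximum(h @ W, 0.0)                 # linear layer + ReLU
    rms = np.sqrt((h ** 2).mean())
    print(f"sigma_w^2 = {label}: activation RMS after {L} layers = {rms:.4f}")
# 1/n shrinks by ~1/sqrt(2) per layer (about 0.001 after 20); 2/n stays near 1
```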
Tanh and sigmoid compress values into a bounded range: $(-1, 1)$ for tanh, $(0, 1)$ for sigmoid. For small inputs near zero, they're approximately linear (tanh has slope ≈ 1 there). For large inputs, they saturate (slope ≈ 0). If activations are too large, all units saturate and gradients vanish — even with well-initialized weights.
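And a sketch of the saturation effect (NumPy; the two input scales are arbitrary): feed tanh pre-activations with standard deviation 1 versus 10 and compare how many units saturate and how much local gradient survives.

```python
import numpy as np

rng = np.random.default_rng(0)

for std in (1.0, 10.0):
    z = rng.normal(0.0, std, size=100_000)   # pre-activations at this scale
    a = np.tanh(z)
    grad = 1.0 - a ** 2                      # tanh'(z): the local gradient
    saturated = (np.abs(a) > 0.99).mean()
    print(f"input std {std:4.1f}: {saturated:.0%} of units saturated, "
          f"mean local gradient = {grad.mean():.3f}")
```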
The next lesson derives the two most important initialization schemes — Xavier (for tanh/sigmoid) and Kaiming, also known as He (for ReLU) — from exactly this analysis.
Interactive example
Track activation variance across layers — adjust weight scale and layer count
Coming soon
A Quick Reference
```python
import torch.nn as nn

layer = nn.Linear(256, 256)   # example layer; any nn.Linear works here

# BAD: Default Normal — explodes at scale 1
nn.init.normal_(layer.weight, mean=0, std=1)

# GOOD: Fan-in normalization
n = layer.weight.shape[1]  # fan-in
nn.init.normal_(layer.weight, mean=0, std=(1/n)**0.5)

# BETTER: Use the derived schemes (next lesson)
nn.init.xavier_normal_(layer.weight)    # for tanh/sigmoid
nn.init.kaiming_normal_(layer.weight)   # for ReLU
```
The derived schemes apply exactly the analysis above, with careful corrections for the specific activation function used.
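As a quick sanity check (a sketch; the 256-unit square layer is an arbitrary example), the standard deviations those built-ins produce line up with the formulas derived here: $\sqrt{1/n}$ for Xavier on a square layer and $\sqrt{2/n}$ for Kaiming:

```python
import torch.nn as nn

layer = nn.Linear(256, 256)    # square layer, so fan_in = fan_out = 256
n = layer.weight.shape[1]      # fan-in

nn.init.xavier_normal_(layer.weight)
print(layer.weight.std().item(), (1 / n) ** 0.5)   # both ~0.0625

nn.init.kaiming_normal_(layer.weight)
print(layer.weight.std().item(), (2 / n) ** 0.5)   # both ~0.0884
```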