The previous lesson showed that we need to keep variance stable through a linear layer. This was derived for a purely linear network. Two refinements make this practical: correcting for the forward-backward asymmetry (Xavier), and correcting for activation functions that kill neurons (He).
Xavier and Kaiming (He) initialization are the defaults in PyTorch and every other major deep learning library. They are why torch.nn.Linear and torch.nn.Conv2d work out of the box without requiring you to set the initial weights manually.
Xavier Initialization: Forward and Backward
The variance propagation analysis has two sides. The forward pass cares about the fan-in $n_{\text{in}}$; the backward pass cares about the fan-out $n_{\text{out}}$.
Forward pass condition: to keep activation variance stable:

$$\mathrm{Var}(w) = \frac{1}{n_{\text{in}}}$$

- $n_{\text{in}}$: fan-in, the number of inputs to each neuron in this layer
Backward pass condition: gradients flow backward through the transpose of each weight matrix. By an identical analysis, to keep gradient variance stable:

$$\mathrm{Var}(w) = \frac{1}{n_{\text{out}}}$$

- $n_{\text{out}}$: fan-out, the number of outputs from each neuron in this layer
These two conditions can't both be satisfied exactly unless $n_{\text{in}} = n_{\text{out}}$. Glorot and Bengio (2010) proposed the harmonic mean:

$$\mathrm{Var}(w) = \frac{2}{n_{\text{in}} + n_{\text{out}}}$$

- $\mathrm{Var}(w)$: the Xavier initialization variance
This approximately preserves variance in both directions. For the uniform version, recall that $\mathrm{Var}(U(-a, a)) = \frac{a^2}{3}$, so we need $a = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}$:

- $a$: half-width of the uniform distribution
Xavier uniform: $W \sim U\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}},\; \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right)$
Xavier normal: $W \sim \mathcal{N}\left(0,\; \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)$
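As a quick sanity check, the Xavier variance formula can be compared against PyTorch's initializer empirically (the layer sizes below are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

fan_in, fan_out = 400, 300
w = torch.empty(fan_out, fan_in)  # PyTorch stores Linear weights as (out, in)
nn.init.xavier_uniform_(w)

target = 2.0 / (fan_in + fan_out)  # Var(w) = 2 / (n_in + n_out)
print(f"target variance:    {target:.6f}")
print(f"empirical variance: {w.var().item():.6f}")  # close to target
```

The empirical variance of the 300×400 weight matrix lands within a few percent of the formula.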
Xavier was designed for symmetric, saturating activations like tanh and sigmoid. The key assumption: the derivative of the activation function at zero is approximately 1 (true for tanh: $\tanh'(0) = 1$).
He Initialization: Correcting for ReLU
ReLU violates the assumption that the activation derivative is 1. $\mathrm{ReLU}(x) = \max(0, x)$ outputs zero for all negative inputs. For a layer whose inputs have mean 0, roughly half the values are negative, so roughly half the ReLU outputs are exactly zero.
This means the variance passed through ReLU is $\frac{1}{2}$ of the pre-activation variance, not all of it. The forward variance analysis becomes:

$$\mathrm{Var}(a^{(l)}) = \frac{1}{2}\, n_{\text{in}}\, \mathrm{Var}(w)\, \mathrm{Var}(a^{(l-1)})$$

For this to equal $\mathrm{Var}(a^{(l-1)})$:
$$\mathrm{Var}(w) = \frac{2}{n_{\text{in}}}$$

- $\mathrm{Var}(w)$: the He initialization variance
This is He initialization (He et al., 2015, also called Kaiming initialization). The factor of 2 compensates for ReLU zeroing half the neurons.
He normal: $W \sim \mathcal{N}\left(0,\; \frac{2}{n_{\text{in}}}\right)$
He uniform: $W \sim U\left(-\sqrt{\frac{6}{n_{\text{in}}}},\; \sqrt{\frac{6}{n_{\text{in}}}}\right)$
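A one-layer sketch makes the factor of 2 concrete: with He weights, the pre-activation variance doubles and ReLU halves it back, so the mean-square of the signal stays near 1. The sizes and seed here are arbitrary:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_in, n_out = 512, 512
x = torch.randn(10_000, n_in)  # zero-mean inputs with E[x^2] ~ 1

w = torch.empty(n_out, n_in)
nn.init.kaiming_normal_(w, nonlinearity='relu')  # Var(w) = 2 / n_in

z = x @ w.T          # pre-activation: Var(z) ~ n_in * (2 / n_in) * 1 = 2
h = torch.relu(z)    # ReLU keeps half the mass: E[h^2] ~ Var(z) / 2 = 1

print(f"E[z^2] = {z.pow(2).mean().item():.3f}")  # ~ 2.0
print(f"E[h^2] = {h.pow(2).mean().item():.3f}")  # ~ 1.0
```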
Numerical Verification: 10-Layer ReLU Network
Let's verify both initializations through 10 layers.
Xavier with ReLU (wrong choice):
Each layer (taking $n_{\text{in}} = n_{\text{out}} = n$, so Xavier gives $\mathrm{Var}(w) = \frac{1}{n}$):

$$\mathrm{Var}(a^{(l)}) = \frac{1}{2}\, n \cdot \frac{1}{n} \cdot \mathrm{Var}(a^{(l-1)}) = \frac{1}{2}\, \mathrm{Var}(a^{(l-1)})$$

After 10 layers: $\left(\frac{1}{2}\right)^{10} \approx 0.001$. Variance shrinks 1000×: vanishing activations.
He initialization with ReLU:
Each layer:

$$\mathrm{Var}(a^{(l)}) = \frac{1}{2}\, n \cdot \frac{2}{n} \cdot \mathrm{Var}(a^{(l-1)}) = \mathrm{Var}(a^{(l-1)})$$

After 10 layers: $1^{10} = 1$. Variance is stable. ✓
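The 10-layer experiment can be run directly. This sketch pushes random inputs through a stack of ReLU layers and tracks the mean-square activation; the width, depth, and seed are arbitrary choices:

```python
import torch
import torch.nn as nn

def propagate(init_fn, depth=10, width=512, n_samples=4096):
    """Push random inputs through `depth` ReLU layers; return final E[h^2]."""
    torch.manual_seed(0)
    h = torch.randn(n_samples, width)
    for _ in range(depth):
        w = torch.empty(width, width)
        init_fn(w)
        h = torch.relu(h @ w.T)
    return h.pow(2).mean().item()

xavier = propagate(nn.init.xavier_normal_)  # Var(w) = 1/width for square layers
he = propagate(lambda w: nn.init.kaiming_normal_(w, nonlinearity='relu'))

print(f"Xavier + ReLU after 10 layers: {xavier:.4f}")  # shrinks toward (1/2)^10
print(f"He + ReLU after 10 layers:     {he:.4f}")      # stays near 1
```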
Variants for Leaky ReLU
Leaky ReLU has slope $a$ for negative inputs instead of 0. This means the negative half of the inputs passes through scaled by $a$ rather than being zeroed, so it contributes $a^2$ times its variance: the factor $\frac{1}{2}$ becomes $\frac{1 + a^2}{2}$. The correction factor becomes:

$$\mathrm{Var}(w) = \frac{2}{(1 + a^2)\, n_{\text{in}}}$$

- $a$: negative slope parameter of Leaky ReLU
For standard Leaky ReLU with $a = 0.01$: $1 + a^2 = 1.0001$, which barely changes anything. For $a = 0.2$: $1 + a^2 = 1.04$, a modest correction.
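PyTorch exposes this correction through `calculate_gain` and the `a` argument of the Kaiming initializers. A small check of both, with arbitrary tensor sizes:

```python
import math
import torch
import torch.nn as nn

# calculate_gain('leaky_relu', a) returns sqrt(2 / (1 + a^2))
for a in (0.01, 0.2):
    gain = nn.init.calculate_gain('leaky_relu', a)
    print(f"a={a}: gain={gain:.4f} vs sqrt(2/(1+a^2))={math.sqrt(2 / (1 + a**2)):.4f}")

# kaiming_normal_ applies the same gain: Var(w) = 2 / ((1 + a^2) * fan_in)
w = torch.empty(256, 400)  # fan_in = 400
nn.init.kaiming_normal_(w, a=0.2, nonlinearity='leaky_relu')
print(f"std: {w.std().item():.4f}  target: {math.sqrt(2 / (1.04 * 400)):.4f}")
```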
PyTorch Defaults and How to Override
```python
import torch.nn as nn

layer = nn.Linear(256, 128)

# Xavier (Glorot): for tanh/sigmoid networks
nn.init.xavier_uniform_(layer.weight)
nn.init.xavier_normal_(layer.weight)

# He (Kaiming): for ReLU networks
nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

# PyTorch defaults by layer type:
#   nn.Linear    → Kaiming uniform (He) with fan_in mode
#   nn.Conv2d    → Kaiming uniform (He) with fan_in mode
#   nn.Embedding → Normal(0, 1): you often want to override this
```
One subtlety: kaiming_uniform_ in PyTorch defaults to mode='fan_in', using only the input dimension. Pass mode='fan_out' to use the output dimension instead, which is useful when you care more about gradient variance than forward activation variance.
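The mode difference is easy to see on a non-square weight matrix (the shapes here are arbitrary):

```python
import torch
import torch.nn as nn

w = torch.empty(128, 512)  # Linear weight shape: (out_features, in_features)

nn.init.kaiming_normal_(w, mode='fan_in')   # Var(w) = 2/512
print(f"fan_in  std: {w.std().item():.4f}")  # ~ sqrt(2/512) ≈ 0.0625

nn.init.kaiming_normal_(w, mode='fan_out')  # Var(w) = 2/128
print(f"fan_out std: {w.std().item():.4f}")  # ~ sqrt(2/128) ≈ 0.125
```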
The practical rule: use He/Kaiming for any network with ReLU or Leaky ReLU. Use Xavier for tanh, sigmoid, or linear activations. For transformers with Layer Normalization, initialization matters less because LayerNorm keeps statistics stable regardless — but Xavier or He are still sensible defaults.