Normalization & Initialization
Lesson 7 ⏱ 14 min

Xavier and He initialization: the math

Video coming soon

Xavier and He Initialization Derived

Step-by-step derivation of Xavier initialization for tanh networks and He initialization for ReLU networks, with numerical verification that variance stays stable through 10 layers.

⏱ ~7 min

🧮 Quick refresher

Variance of a uniform distribution

A uniform distribution U(a, b) has variance (b-a)²/12. To get Var(W) = c from a uniform distribution symmetric around 0, use U(-L, L) where L = √(3c).

Example

To initialize with Var(W) = 2/300: L = √(3 × 2/300) = √(1/50) ≈ 0.141.

Use U(-0.141, +0.141).
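
This relationship is easy to check numerically. A minimal sketch in PyTorch (the sample count is arbitrary): draw from U(-L, L) with L = √(3c) and confirm the sample variance comes out near c.

import torch

c = 2 / 300                          # target variance Var(W)
L = (3 * c) ** 0.5                   # half-width L = sqrt(3c)
samples = torch.empty(1_000_000).uniform_(-L, L)
print(L)                             # ≈ 0.141
print(samples.var().item())          # ≈ 0.00667 ≈ 2/300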

The previous lesson showed that we need Var(W) = 1/n to keep variance stable through a linear layer. This was derived for a purely linear network. Two refinements make this practical: correcting for the forward-backward asymmetry (Xavier), and correcting for activation functions that kill neurons (He).

Xavier and Kaiming (He) initialization are the defaults in PyTorch and every other major deep learning library. They are why torch.nn.Linear and torch.nn.Conv2d work out of the box without requiring you to set the initial weights manually.

Xavier Initialization: Forward and Backward

The variance propagation analysis has two sides. The forward pass cares about keeping the variance of the activations stable; the backward pass cares about keeping the variance of the gradients stable.

Forward pass condition: to keep activation variance stable,

\text{Var}(W) = \frac{1}{n_\text{in}}

where n_in is the fan-in: the number of inputs to each neuron in this layer.

Backward pass condition: gradients flow backward through the transpose of each weight matrix. By an identical analysis, to keep gradient variance stable,

\text{Var}(W) = \frac{1}{n_\text{out}}

where n_out is the fan-out: the number of outputs from each neuron in this layer.

These two conditions can't both be satisfied exactly unless n_in = n_out. Glorot and Bengio (2010) proposed the harmonic mean of the two target variances:

\text{Var}(W) = \frac{2}{n_\text{in} + n_\text{out}}

This is the Xavier initialization variance.

This approximately preserves variance in both directions. For the uniform version, recall that Var(U(-L, L)) = L^2/3, so we need L^2/3 = 2/(n_in + n_out):

L = \sqrt{\frac{6}{n_\text{in} + n_\text{out}}}

where L is the half-width of the uniform distribution.

Xavier uniform: W \sim U\left(-\sqrt{\frac{6}{n_\text{in}+n_\text{out}}},\ +\sqrt{\frac{6}{n_\text{in}+n_\text{out}}}\right)

Xavier normal: W \sim \mathcal{N}(0, \sigma^2) with \sigma = \sqrt{\frac{2}{n_\text{in}+n_\text{out}}}

Xavier was designed for activations like tanh and sigmoid. The key assumption: the derivative of the activation function at zero is approximately 1 (true for tanh: tanh'(0) = 1).
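
To see these numbers concretely, here is a small sketch (the fan sizes are arbitrary) that computes the Xavier uniform bound by hand and compares it against what nn.init.xavier_uniform_ produces:

import torch
import torch.nn as nn

fan_in, fan_out = 300, 100
bound = (6 / (fan_in + fan_out)) ** 0.5    # L = sqrt(6 / (n_in + n_out)) ≈ 0.1225
target_var = 2 / (fan_in + fan_out)        # Xavier variance = 0.005

W = torch.empty(fan_out, fan_in)           # PyTorch stores Linear weights as (out, in)
nn.init.xavier_uniform_(W)
print(bound, W.abs().max().item())         # every entry lies inside ±bound
print(target_var, W.var().item())          # empirical variance ≈ 2/(n_in + n_out)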

He Initialization: Correcting for ReLU

ReLU violates the assumption that the activation derivative is 1. ReLU(x) = max(0, x) outputs zero for all negative inputs. For a layer whose inputs have mean 0, roughly half the values are negative, so roughly half the ReLU outputs are exactly zero.

This means the effective fan-in is n_in/2, not n_in. The forward variance analysis becomes:

\text{Var}(y) \approx \frac{n_\text{in}}{2} \cdot \text{Var}(W) \cdot \text{Var}(x)

For this to equal Var(x):

\text{Var}(W) = \frac{2}{n_\text{in}}

This is He initialization (He et al., 2015, also called Kaiming initialization). The factor of 2 compensates for ReLU zeroing half the neurons.

He normal: W \sim \mathcal{N}(0, \sigma^2) with \sigma = \sqrt{\frac{2}{n_\text{in}}}

He uniform: W \sim U\left(-\sqrt{\frac{6}{n_\text{in}}},\ +\sqrt{\frac{6}{n_\text{in}}}\right)
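
The factor of 2 is easy to check empirically. A minimal sketch: for zero-mean, symmetric inputs, ReLU zeroes half the mass, so only half of the mean square survives.

import torch

z = torch.randn(1_000_000)                 # zero-mean, unit-variance pre-activations
print(z.pow(2).mean().item())              # ≈ 1.0
print(torch.relu(z).pow(2).mean().item())  # ≈ 0.5, half the mean square survives the ReLU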

Numerical Verification: 10-Layer ReLU Network

Let's verify both initializations through 10 layers.

Xavier with ReLU (wrong choice): Var(W) = 2/(n_in + n_out) ≈ 1/n (taking n_in ≈ n_out ≈ n)

Each layer: Var(y) ≈ (n/2) × (1/n) × Var(x) = 0.5 × Var(x)

After 10 layers: 0.5^10 ≈ 0.001. Variance shrinks 1000× — vanishing activations.

He initialization with ReLU: Var(W) = 2/n

Each layer: Var(y) ≈ (n/2) × (2/n) × Var(x) = 1.0 × Var(x)

After 10 layers: 1.0^10 = 1.0. Variance is stable. ✓
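
The same check runs in a few lines of PyTorch. This is a minimal sketch (layer width, depth, and batch size are arbitrary) that pushes standard-normal inputs through 10 ReLU layers and tracks the mean-square activation, the quantity the variance analysis follows:

import torch

torch.manual_seed(0)
n, depth, batch = 512, 10, 4096

def run(weight_var):
    h = torch.randn(batch, n)                       # mean square ≈ 1 at the input
    for layer in range(1, depth + 1):
        W = torch.randn(n, n) * weight_var ** 0.5   # Var(W) = weight_var
        h = torch.relu(h @ W)
        print(f"  layer {layer:2d}: mean square = {h.pow(2).mean().item():.4f}")

print("Xavier-style Var(W) = 1/n, halves every layer:")
run(1 / n)
print("He Var(W) = 2/n, stays near 1:")
run(2 / n)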

Variants for Leaky ReLU

Leaky ReLU has slope a for negative inputs instead of 0, so the negative half of the inputs is scaled by a rather than zeroed. The surviving fraction of the variance is (1 + a^2)/2 instead of 1/2, and the correction becomes:

\text{Var}(W) = \frac{2}{(1+a^2) \cdot n_\text{in}}

where a is the negative-slope parameter of Leaky ReLU.

For standard Leaky ReLU with a = 0.01: denominator = 1.0001 × n_in ≈ n_in — barely changes. For a = 0.2: denominator = 1.04 × n_in — a modest correction.
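
This correction is what PyTorch's gain machinery encodes: nn.init.calculate_gain('leaky_relu', a) returns sqrt(2 / (1 + a^2)), and the Kaiming initializers use Var(W) = gain^2 / fan. A small sketch:

import torch.nn as nn

for a in (0.01, 0.2):
    gain = nn.init.calculate_gain('leaky_relu', a)   # sqrt(2 / (1 + a^2))
    print(a, gain, gain ** 2)                        # gain^2 ≈ 2.00 for a=0.01, ≈ 1.92 for a=0.2

# kaiming_normal_ with the matching arguments applies exactly this variance:
layer = nn.Linear(300, 100)
nn.init.kaiming_normal_(layer.weight, a=0.2, nonlinearity='leaky_relu')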

PyTorch Defaults and How to Override

import torch.nn as nn

layer = nn.Linear(300, 100)   # example layer; PyTorch initializes it on construction

# Xavier (Glorot) — use with tanh/sigmoid
nn.init.xavier_uniform_(layer.weight)
nn.init.xavier_normal_(layer.weight)

# He (Kaiming) — use with ReLU
nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

# PyTorch defaults by layer type:
# nn.Linear    → Kaiming uniform (He) with fan_in mode
# nn.Conv2d    → Kaiming uniform (He) with fan_in mode
# nn.Embedding → Normal(0, 1) — you often want to override this

One subtlety: kaiming_uniform_ in PyTorch defaults to mode='fan_in', using only the input dimension. Pass mode='fan_out' to use the output dimension — useful if you're more concerned about gradient variance than forward variance.

The practical rule: use He/Kaiming for any network with ReLU or Leaky ReLU. Use Xavier for tanh, sigmoid, or linear activations. For transformers with Layer Normalization, initialization matters less because LayerNorm keeps statistics stable regardless — but Xavier or He are still sensible defaults.
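
If you do want to override the defaults across a whole model, one common pattern (a sketch, not the only way) is to pass an init function to Module.apply after construction:

import torch.nn as nn

def init_weights(m):
    # Re-initialize every Linear layer with He/Kaiming; leave other modules alone.
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.apply(init_weights)   # calls init_weights on every submodule recursively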

Quiz

Question 1 of 3

Xavier initialization uses Var(W) = 2/(fan_in + fan_out) rather than just 1/fan_in because...