Normalization & Initialization
Lesson 4 ⏱ 12 min

Layer normalization


Layer Normalization: No Batch Required

Derives LayerNorm by flipping the normalization axis from batch to features, explains why transformers exclusively use LayerNorm, and compares Pre-LN vs Post-LN placement.


Quick refresher

Axis of computation

When you compute statistics like mean or variance over a set of numbers, the 'axis' determines which numbers you're summarizing together. BatchNorm averages across examples (the batch axis). LayerNorm averages across features (the feature axis).

Example

For a matrix with rows = examples and columns = features: BatchNorm takes each column's mean; LayerNorm takes each row's mean.
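In NumPy terms, the two choices are just different reduction axes (a quick illustrative sketch):

```python
import numpy as np

# rows = examples, columns = features
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

batchnorm_means = X.mean(axis=0)  # one mean per feature (column): [2.5, 3.5, 4.5]
layernorm_means = X.mean(axis=1)  # one mean per example (row):    [2.0, 5.0]
```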

BatchNorm has a critical dependency: it needs a batch of examples to compute its statistics. At inference time we patched this with running averages, but the awkwardness reveals a deeper limitation. What if we want to normalize with no batch at all?

LayerNorm solves this by normalizing along a different axis entirely.

Layer normalization stabilizes transformer training — it is a required component in every modern transformer implementation, from BERT to GPT to LLaMA. Without it, deep transformers diverge during training. This is the normalization technique you will use most if you work with language models.

Flipping the Axis

In BatchNorm, for each feature, we compute statistics by looking across all examples in the batch. In LayerNorm, for each example, we compute statistics by looking across all of its features.

For a single example with d features, x = [x_1, x_2, \ldots, x_d]:

Step 1 — Mean across features:

\mu = \frac{1}{d} \sum_{i=1}^{d} x_i

where μ is the mean of all features for this single example and d is the number of features.

Step 2 — Variance across features:

\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2

where σ² is the variance of all features for this single example.

Step 3 — Normalize:

\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \varepsilon}}

where x̂ᵢ is the normalized i-th feature and ε is a small constant for numerical stability.

Step 4 — Scale and shift:

y_i = \gamma_i \cdot \hat{x}_i + \beta_i

where yᵢ is the LayerNorm output for feature i, γᵢ is a learned scale for feature i, and βᵢ is a learned shift for feature i.

Notice: everything is computed using one example's own features. No other examples involved. This means LayerNorm works identically whether your batch size is 1 or 1000.
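The four steps above can be sketched as a single-example function in NumPy (a minimal version for illustration; real framework implementations apply the same computation over the last axis of a batched tensor):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm for a single example: every statistic comes from x itself."""
    mu = x.mean()                          # Step 1: mean across features
    var = ((x - mu) ** 2).mean()           # Step 2: variance across features
    x_hat = (x - mu) / np.sqrt(var + eps)  # Step 3: normalize
    return gamma * x_hat + beta            # Step 4: per-feature scale and shift

d = 4
x = np.array([3.0, 7.0, 2.0, 8.0])
y = layer_norm(x, gamma=np.ones(d), beta=np.zeros(d))
```

With γ = 1 and β = 0 this reproduces the plain normalized values; the learned parameters then let the network undo or reshape the normalization where useful.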

Worked Example

Single example: x = [3, 7, 2, 8] (4 features).

Mean:

\mu = \frac{3+7+2+8}{4} = 5

Variance:

\sigma^2 = \frac{(3-5)^2 + (7-5)^2 + (2-5)^2 + (8-5)^2}{4} = \frac{4+4+9+9}{4} = 6.5

Normalize (ε = 0 for clarity):

\hat{x} = \left[\frac{3-5}{\sqrt{6.5}},\; \frac{7-5}{\sqrt{6.5}},\; \frac{2-5}{\sqrt{6.5}},\; \frac{8-5}{\sqrt{6.5}}\right] \approx [-0.784,\; 0.784,\; -1.177,\; 1.177]

Verify: mean of x̂ = 0 ✓, variance = 1 ✓.

No other examples were needed. This is the core advantage.
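You can check the arithmetic directly (ε omitted, as above):

```python
import numpy as np

x = np.array([3.0, 7.0, 2.0, 8.0])
mu = x.mean()                    # 5.0
var = ((x - mu) ** 2).mean()     # 6.5
x_hat = (x - mu) / np.sqrt(var)  # ≈ [-0.784, 0.784, -1.177, 1.177]
```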

Why Transformers Use LayerNorm

A transformer processes sequences: each input is a sequence of tokens, and each token is represented by an embedding vector. The shape of the activations at each layer is [B, T, d], where B is the batch size, T is the sequence length, and d is the embedding dimension.

Why not BatchNorm? If you tried BatchNorm, you'd compute statistics for position t by looking at that position across all B sequences. But position 3 of sentence A ("the dog sat") and position 3 of sentence B ("photosynthesis involves") are semantically unrelated — mixing their statistics is meaningless. Furthermore, T varies between sequences, making it unclear how to define a "batch" across positions at all.

LayerNorm fits naturally: for each token (each [example, position] pair), normalize over the dd-dimensional embedding. Each token is self-contained. Variable sequence lengths are no problem.
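A minimal sketch of this: given activations of shape [B, T, d], take statistics over the last axis only, so each token normalizes itself:

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, d = 2, 5, 8                  # batch, sequence length, embedding dim
x = rng.normal(size=(B, T, d))

# Normalize each token's d-dimensional embedding independently.
mu = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True)
x_hat = (x - mu) / np.sqrt(var + 1e-5)

# Every [example, position] slice now has mean ≈ 0 and variance ≈ 1,
# regardless of what the other sequences in the batch contain.
```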

Pre-LN vs Post-LN

The original transformer placed LayerNorm after adding the residual:

Post-LN: x → [Multi-Head Attention] → +residual → LayerNorm → output

Modern transformers use Pre-LN, which places LayerNorm before the sub-layer:

Pre-LN: x → LayerNorm → [Multi-Head Attention] → +residual → output

Why does the placement matter? In Post-LN, gradients must flow through the addition and then through LayerNorm before reaching the attention weights. For very deep networks (100+ layers), this can cause gradient instability early in training, requiring careful learning rate warm-up schedules.

In Pre-LN, each residual block starts with normalized inputs. Gradients can flow directly through the residual connection without passing through normalization. Training is more stable from the start — no warm-up required.
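The two placements can be sketched as follows, where `sublayer` is a hypothetical stand-in for attention or the MLP and `ln` is a parameter-free LayerNorm over the last axis:

```python
import numpy as np

def ln(x, eps=1e-5):
    # Parameter-free LayerNorm over the last (feature) axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Original transformer: normalize AFTER the residual addition,
    # so gradients must pass through ln on their way back.
    return ln(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Modern transformers: normalize the sub-layer's input; the
    # residual path (x + ...) is untouched, so gradients flow
    # through it directly.
    return x + sublayer(ln(x))
```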

BatchNorm vs LayerNorm: When to Use Which

Criterion                             BatchNorm      LayerNorm
Works with batch size 1               ✗              ✓
Works for variable-length sequences   ✗              ✓
Separate train/inference behavior     ✓ (complex)    ✗ (same)
Best for CNNs with large batches      ✓
Best for transformers/RNNs                           ✓
Regularization effect                 Strong         Mild

The fundamental rule: BatchNorm when you have large fixed-size batches and no sequence structure (image CNNs). LayerNorm when you have sequences, variable lengths, or small batches (transformers, language models, on-device inference).
