Normalization & Initialization
Lesson 4 ⏱ 12 min

Layer normalization


Layer Normalization: No Batch Required

Derives LayerNorm by flipping the normalization axis from batch to features, explains why transformers exclusively use LayerNorm, and compares Pre-LN vs Post-LN placement.


Quick refresher

Axis of computation

When you compute statistics like mean or variance over a set of numbers, the 'axis' determines which numbers you're summarizing together. BatchNorm averages across examples (the batch axis). LayerNorm averages across features (the feature axis).

Example

For a matrix with rows = examples and columns = features: BatchNorm takes each column's mean; LayerNorm takes each row's mean.
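In NumPy terms, the two choices are just different reduction axes (a quick illustrative sketch):

```python
import numpy as np

# rows = examples, columns = features
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

batchnorm_means = X.mean(axis=0)  # one mean per feature (column): [2.5, 3.5, 4.5]
layernorm_means = X.mean(axis=1)  # one mean per example (row):    [2.0, 5.0]
```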

BatchNorm has a critical dependency: it needs a batch of examples to compute its statistics. At inference time we patched this with running averages, but the awkwardness reveals a deeper limitation. What if we want to normalize with no batch at all?

LayerNorm solves this by normalizing along a different axis entirely.

Layer normalization stabilizes transformer training — it is a required component in every modern transformer implementation, from BERT to GPT to LLaMA. Without it, deep transformers diverge during training. This is the normalization technique you will use most if you work with language models.

Flipping the Axis

In BatchNorm, for each feature, we compute statistics by looking across all examples in the batch. In LayerNorm, for each example, we compute statistics by looking across all of its features.

For a single example with d features, x = [x_1, x_2, \ldots, x_d]:

Step 1 — Mean across features:

\mu = \frac{1}{d} \sum_{i=1}^{d} x_i

where μ is the mean of all features for this single example and d is the number of features.

Step 2 — Variance across features:

\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2

where σ² is the variance of all features for this single example.

Step 3 — Normalize:

\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \varepsilon}}

where x̂ᵢ is the normalized i-th feature and ε is a small constant for numerical stability.

Step 4 — Scale and shift:

y_i = \gamma_i \cdot \hat{x}_i + \beta_i

where yᵢ is the LayerNorm output for feature i, γᵢ is a learned scale for feature i, and βᵢ is a learned shift for feature i.

Notice: everything is computed using one example's own features. No other examples involved. This means LayerNorm works identically whether your batch size is 1 or 1000.
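The four steps above can be sketched as a single-example function in NumPy (a minimal version for illustration; real framework implementations apply the same computation over the last axis of a batched tensor):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm for a single example: every statistic comes from x itself."""
    mu = x.mean()                          # Step 1: mean across features
    var = ((x - mu) ** 2).mean()           # Step 2: variance across features
    x_hat = (x - mu) / np.sqrt(var + eps)  # Step 3: normalize
    return gamma * x_hat + beta            # Step 4: per-feature scale and shift

d = 4
x = np.array([3.0, 7.0, 2.0, 8.0])
y = layer_norm(x, gamma=np.ones(d), beta=np.zeros(d))
```

With γ = 1 and β = 0 this reproduces the plain normalized values; the learned parameters then let the network undo or reshape the normalization where useful.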

Worked Example

Single example: x = [3, 7, 2, 8] (4 features).

Mean:

\mu = \frac{3+7+2+8}{4} = 5

Variance:

\sigma^2 = \frac{(3-5)^2 + (7-5)^2 + (2-5)^2 + (8-5)^2}{4} = \frac{4+4+9+9}{4} = 6.5

Normalize (ε = 0 for clarity):

\hat{x} = \left[\frac{3-5}{\sqrt{6.5}},\; \frac{7-5}{\sqrt{6.5}},\; \frac{2-5}{\sqrt{6.5}},\; \frac{8-5}{\sqrt{6.5}}\right] \approx [-0.784,\; 0.784,\; -1.177,\; 1.177]

Verify: mean of x̂ = 0 ✓, variance = 1 ✓.

No other examples were needed. This is the core advantage.
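You can check the arithmetic directly (ε omitted, as above):

```python
import numpy as np

x = np.array([3.0, 7.0, 2.0, 8.0])
mu = x.mean()                    # 5.0
var = ((x - mu) ** 2).mean()     # 6.5
x_hat = (x - mu) / np.sqrt(var)  # ≈ [-0.784, 0.784, -1.177, 1.177]
```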

Why Transformers Use LayerNorm

A transformer processes sequences: each input is a sequence of tokens, and each token is represented by an embedding vector. The shape of the activations at each layer is [B, T, d], where B is the batch size, T is the sequence length, and d is the embedding dimension.

Why not BatchNorm? If you tried BatchNorm, you'd compute statistics for position t by looking at that position across all B sequences. But position 3 of sentence A ("the dog sat") and position 3 of sentence B ("photosynthesis involves") are semantically unrelated — mixing their statistics is meaningless. Furthermore, T varies between sequences, making it unclear how to define a "batch" across positions at all.

LayerNorm fits naturally: for each token (each [example, position] pair), normalize over the dd-dimensional embedding. Each token is self-contained. Variable sequence lengths are no problem.
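A minimal sketch of this: given activations of shape [B, T, d], take statistics over the last axis only, so each token normalizes itself:

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, d = 2, 5, 8                  # batch, sequence length, embedding dim
x = rng.normal(size=(B, T, d))

# Normalize each token's d-dimensional embedding independently.
mu = x.mean(axis=-1, keepdims=True)
var = x.var(axis=-1, keepdims=True)
x_hat = (x - mu) / np.sqrt(var + 1e-5)

# Every [example, position] slice now has mean ≈ 0 and variance ≈ 1,
# regardless of what the other sequences in the batch contain.
```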

Pre-LN vs Post-LN

The original transformer placed LayerNorm after adding the residual:

Post-LN: x → [Multi-Head Attention] → +residual → LayerNorm → output

Modern transformers use Pre-LN, which places LayerNorm before the sub-layer:

Pre-LN: x → LayerNorm → [Multi-Head Attention] → +residual → output

Why does the placement matter? In Post-LN, gradients must flow through the addition and then through LayerNorm before reaching the attention weights. For very deep networks (100+ layers), this can cause gradient instability early in training, requiring careful learning rate warm-up schedules.

In Pre-LN, each residual block starts with normalized inputs. Gradients can flow directly through the residual connection without passing through normalization. Training is more stable from the start — no warm-up required.
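The two placements can be sketched as follows, where `sublayer` is a hypothetical stand-in for attention or the MLP and `ln` is a parameter-free LayerNorm over the last axis:

```python
import numpy as np

def ln(x, eps=1e-5):
    # Parameter-free LayerNorm over the last (feature) axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Original transformer: normalize AFTER the residual addition,
    # so gradients must pass through ln on their way back.
    return ln(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Modern transformers: normalize the sub-layer's input; the
    # residual path (x + ...) is untouched, so gradients flow
    # through it directly.
    return x + sublayer(ln(x))
```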

BatchNorm vs LayerNorm: When to Use Which

Criterion                             BatchNorm      LayerNorm
Works with batch size 1               ✗              ✓
Works for variable-length sequences   ✗              ✓
Separate train/inference behavior     ✓ (complex)    ✗ (same)
Best for CNNs with large batches      ✓
Best for transformers/RNNs                           ✓
Regularization effect                 Strong         Mild

The fundamental rule: BatchNorm when you have large fixed-size batches and no sequence structure (image CNNs). LayerNorm when you have sequences, variable lengths, or small batches (transformers, language models, on-device inference).
