Neural Networks
Lesson 7 ⏱ 12 min

Layer normalization


Layer Normalization - Normalizing Within a Single Example

The limitations of BatchNorm for sequences and small batches, how LayerNorm shifts the normalization axis from 'across the batch' to 'across the features of one example', and its place in every major transformer architecture.


Quick refresher

Batch normalization

BatchNorm normalizes each feature across the batch dimension: for feature j, it subtracts the batch mean and divides by the batch standard deviation. This requires computing statistics over multiple examples simultaneously. The learnable γ and β rescale after normalization.

Example

With batch size 4 and feature dimension 3, BatchNorm computes one mean and one std per feature (3 means, 3 stds total) by averaging across the 4 examples.

Each of the 4 examples' feature j values are normalized using the same μⱼ and σⱼ.
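In code, these per-feature batch statistics look like this (a pure-Python sketch with toy values, not the PyTorch implementation):

```python
# Batch of 4 examples, 3 features each (toy values).
batch = [
    [1.0, 10.0, 100.0],
    [2.0, 20.0, 200.0],
    [3.0, 30.0, 300.0],
    [4.0, 40.0, 400.0],
]
n, d = len(batch), len(batch[0])

# BatchNorm: one mean and one (biased) variance per feature,
# computed by averaging over the n examples in the batch.
means = [sum(row[j] for row in batch) / n for j in range(d)]
vars_ = [sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(d)]

print(means)  # → [2.5, 25.0, 250.0]  (3 means: one per feature)
print(vars_)  # → [1.25, 125.0, 12500.0]
```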

The Limits of Batch Normalization

BatchNorm has a fundamental dependency: it needs multiple examples in a batch to compute meaningful statistics. In the three scenarios below, this dependency causes real problems:

Small batches: When memory constraints or architectural choices force batch size 1 or 2, batch statistics are noisy or undefined. BatchNorm degrades or breaks entirely.

Variable-length sequences: In NLP, sentences have different lengths. Padding brings them to equal length, but padded positions are not real data. Mixing statistics across real and padded positions corrupts the normalization.

Recurrent networks: In an RNN processing sequences step by step, each time step sees different input patterns. Computing batch statistics that mix different time steps conflates structurally different positions.

Layer Normalization sidesteps all of these problems with a single key change: normalize across the feature dimension, not the batch dimension.
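The axis change is easy to see with NumPy (toy values; `axis=0` is the batch dimension, `axis=1` the feature dimension):

```python
import numpy as np

x = np.array([[6.0, 2.0, 4.0, 8.0],
              [1.0, 3.0, 5.0, 7.0]])  # shape (batch=2, features=4), toy values

batch_mu = x.mean(axis=0)  # BatchNorm axis: one mean per feature  → shape (4,)
layer_mu = x.mean(axis=1)  # LayerNorm axis: one mean per example  → shape (2,)

print(batch_mu.shape, layer_mu.shape)  # → (4,) (2,)
print(layer_mu)  # each example's mean depends only on its own features
```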

The Algorithm

For a single example with a feature vector of dimension $d$:

Step 1: Compute the mean over features for this single example:

$$\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$$

where $\mu$ is the mean of this example's features, $d$ is the number of features, and $x_i$ is the i-th feature of this example.

Step 2: Compute the variance over features:

$$\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$$

where $\sigma^2$ is the variance of this example's features.

Step 3: Normalize:

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \varepsilon}}$$

where $\hat{x}_i$ is the normalized value of feature i and $\varepsilon$ is a small constant for numerical stability (typically 1e-5).

Step 4: Learned rescaling:

$$y = \gamma \odot \hat{x} + \beta$$

where $\gamma$ is a learnable scale vector of dimension $d$, $\beta$ is a learnable shift vector of dimension $d$, $\odot$ denotes elementwise multiplication, and $y$ is the final LayerNorm output.

Note that $\gamma$ and $\beta$ here are vectors (one per feature), not scalars.

Worked Numerical Example

Single example with features $x = [6, 2, 4, 8]$, $\gamma = [1, 1, 1, 1]$, $\beta = [0, 0, 0, 0]$:

$$\mu = (6+2+4+8)/4 = 5$$
$$\sigma^2 = \big[(6-5)^2 + (2-5)^2 + (4-5)^2 + (8-5)^2\big]/4 = [1+9+1+9]/4 = 5$$
$$\sigma = \sqrt{5} \approx 2.236$$
$$\hat{x} = [(6-5)/2.236,\ (2-5)/2.236,\ (4-5)/2.236,\ (8-5)/2.236]$$
$$\hat{x} \approx [0.447,\ -1.342,\ -0.447,\ 1.342]$$

These values have mean 0 and variance 1 within this single example, regardless of what other examples look like or what batch size is used.
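The hand computation can be verified with a short pure-Python sketch of the four steps (with ε = 0 to match the arithmetic above; not the optimized PyTorch kernel):

```python
import math

def layer_norm(x, gamma, beta, eps=0.0):
    d = len(x)
    mu = sum(x) / d                                         # Step 1: mean
    var = sum((xi - mu) ** 2 for xi in x) / d               # Step 2: variance
    x_hat = [(xi - mu) / math.sqrt(var + eps) for xi in x]  # Step 3: normalize
    return [g * xh + b for g, xh, b in zip(gamma, x_hat, beta)]  # Step 4: rescale

y = layer_norm([6.0, 2.0, 4.0, 8.0], gamma=[1.0] * 4, beta=[0.0] * 4)
print([round(v, 3) for v in y])  # → [0.447, -1.342, -0.447, 1.342]
```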

Where Each Is Used

| Architecture type | Norm type | Why |
| --- | --- | --- |
| Vision CNNs (ResNet, EfficientNet) | BatchNorm | Large batches, fixed spatial structure |
| Transformers (BERT, GPT, T5) | LayerNorm | Sequences, variable lengths, small batches |
| RNNs, LSTMs | LayerNorm | Sequential processing; batch statistics unstable |
| Object detection | GroupNorm or BatchNorm | Batch sizes are sometimes forced small |
| Diffusion models | GroupNorm | Operates on spatial feature groups |

GroupNorm is a middle ground: normalize over groups of channels within each example. It avoids batch dependency (like LayerNorm) while respecting spatial structure (like BatchNorm). Used when batch size is forced to be small (e.g., detection with high-resolution images).
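The grouping idea can be sketched in pure Python for a single example (a simplification: the real `nn.GroupNorm` also averages over spatial positions and applies learnable γ and β):

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Normalize the channels of one example in independent groups."""
    size = len(x) // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * size:(g + 1) * size]
        mu = sum(group) / size                           # mean within this group
        var = sum((v - mu) ** 2 for v in group) / size   # variance within this group
        out += [(v - mu) / math.sqrt(var + eps) for v in group]
    return out

# 4 channels, 2 groups: each pair of channels is normalized on its own.
out = group_norm([1.0, 3.0, 10.0, 30.0], num_groups=2)
print([round(v, 3) for v in out])  # → [-1.0, 1.0, -1.0, 1.0]
```

With `num_groups=1` this reduces to LayerNorm over the channels; with one group per channel it approaches InstanceNorm.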

Code: LayerNorm in PyTorch

import torch.nn as nn

d_model = 512  # feature dimension

# LayerNorm over last dimension (features)
ln = nn.LayerNorm(d_model)  # γ and β are d_model-dimensional vectors

# In a transformer block:
class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4*d_model), nn.GELU(), nn.Linear(4*d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Pre-norm style (modern): normalize before each sublayer,
        # then add the residual connection.
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        x = x + self.ff(self.norm2(x))
        return x

nn.LayerNorm(d_model) normalizes over the last dimension, which must have size d_model (pass a tuple of trailing dimensions to normalize over more than one). It keeps no running statistics, so it handles any batch size, including 1, and behaves identically in training and inference.
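A quick check that batch size 1 works (assuming default initialization, where γ is all ones and β all zeros, so the output is just the normalized features):

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(4)                      # γ initialized to ones, β to zeros
x = torch.tensor([[6.0, 2.0, 4.0, 8.0]])  # batch size 1 — fine for LayerNorm

y = ln(x)
print(y.mean().item())               # ≈ 0.0
print(y.var(unbiased=False).item())  # ≈ 1.0
```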

Quiz

1 / 3

For a single feature vector x = [3, 1, 4, 2], what is the LayerNorm output (before γ and β), with ε ≈ 0?