You've seen BatchNorm and LayerNorm. The rest of the normalization family follows a single organizing principle: which dimensions do you average over to compute the statistics? Once you see this, the whole family falls into place.
Instance normalization powers neural style transfer, group normalization is the standard for object detection with small batch sizes, and root mean square normalization is used in LLaMA. Knowing which normalization to reach for — and why — is a practical skill every deep learning practitioner needs.
The 4D CNN Tensor
For convolutional networks, activations have shape $(N, C, H, W)$:
- $N$: batch size
- $C$: channels
- $H$: spatial height
- $W$: spatial width
A typical mid-network activation for image classification might be $(32, 256, 14, 14)$: 32 images, 256 feature maps, 14×14 pixels each.
The four normalization schemes differ only in which subset of $(N, C, H, W)$ they reduce over:
| Method | Normalize over | One statistic per |
|---|---|---|
| BatchNorm | N, H, W | Channel C |
| LayerNorm | C, H, W | Example N |
| Instance Norm | H, W | (N, C) pair |
| Group Norm | (subset of C), H, W | (N, group) pair |
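The table can be checked directly in code: each method is just a mean (and variance) over a different subset of dimensions of the same tensor. A minimal sketch in PyTorch, using an assumed random input with the shapes from the running example:

```python
import torch

# Assumed example tensor: batch of 32 images, 256 channels, 14x14 spatial.
x = torch.randn(32, 256, 14, 14)

# Each scheme reduces over a different subset of (N, C, H, W).
bn_mean = x.mean(dim=(0, 2, 3))   # BatchNorm: one value per channel -> shape (256,)
ln_mean = x.mean(dim=(1, 2, 3))   # LayerNorm: one value per example -> shape (32,)
in_mean = x.mean(dim=(2, 3))      # Instance Norm: one per (N, C) pair -> shape (32, 256)

# Group Norm: split 256 channels into G=32 groups of 8, reduce within each group.
G = 32
gn_mean = x.view(32, G, 256 // G, 14, 14).mean(dim=(2, 3, 4))  # shape (32, 32)

print(bn_mean.shape, ln_mean.shape, in_mean.shape, gn_mean.shape)
```

The shape of each result tells you how many statistics the method maintains — exactly the "one statistic per" column of the table.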
Let's understand each one concretely.
BatchNorm (Recap for CNNs)
BatchNorm computes one mean per channel, averaging over all examples and all spatial positions. For 256 channels, you get 256 means and 256 variances — one pair per channel, shared across the whole batch and all spatial locations.
This means every pixel in the same channel, across all examples in the batch, gets normalized by the same statistics. Works great when N is large. Breaks down when N is small.
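A quick way to see the "256 means, 256 variances" claim is to inspect `nn.BatchNorm2d` directly (shapes assumed from the running example; the input is random):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(256)          # 256 channels, as in the running example
x = torch.randn(32, 256, 14, 14)
y = bn(x)                         # training mode: normalizes with batch statistics

# One learned scale/shift and one running mean/var per channel: all shape (256,).
print(bn.weight.shape, bn.bias.shape)
print(bn.running_mean.shape, bn.running_var.shape)

# After normalization, each channel is near zero-mean across (N, H, W).
print(y.mean(dim=(0, 2, 3)).abs().max())
```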
Instance Normalization
Instance Norm goes further: for each of the $N \times C$ feature maps, compute separate statistics from the $H \times W$ spatial positions within that map.
For a feature map at example $n$, channel $c$:

$$\mu_{n,c} = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} x_{n,c,h,w}$$

- $\mu_{n,c}$: mean of feature map at example $n$, channel $c$
- $x_{n,c,h,w}$: activation at spatial position $(h, w)$
Each feature map is normalized against itself, with no mixing between examples or channels.
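This per-map normalization can be reproduced by hand and checked against `nn.InstanceNorm2d`. A sketch with a small assumed tensor, using the module's default settings (no affine parameters, eps of 1e-5, biased variance):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 8, 5, 5)       # small assumed tensor: N=4, C=8

inorm = nn.InstanceNorm2d(8)      # defaults: affine=False, eps=1e-5
y = inorm(x)

# Manual version: normalize each (n, c) feature map against its own H*W statistics.
mu = x.mean(dim=(2, 3), keepdim=True)                   # shape (4, 8, 1, 1)
var = x.var(dim=(2, 3), keepdim=True, unbiased=False)   # biased variance, as in the module
y_manual = (x - mu) / torch.sqrt(var + 1e-5)

print(torch.allclose(y, y_manual, atol=1e-5))
```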
Use case: style transfer. In style transfer (Gatys et al., 2016 and follow-ups), you want to transfer the style of one image to the content of another. Style is captured by per-channel activation statistics. If you use BatchNorm, statistics are computed across the batch — mixing the style of different images. Instance Norm keeps each image's style statistics pure, making it far more effective for style transfer and image generation tasks.
Group Normalization
Group Norm is a middle ground between LayerNorm (all channels together) and Instance Norm (one channel at a time). Divide the $C$ channels into $G$ groups of $C/G$ channels each. For each (example, group) pair, compute statistics over the $C/G$ channels in that group and the $H \times W$ spatial positions.
- $G$: number of groups to divide channels into
- $C/G$: channels per group
With $G = C$, Group Norm is Instance Norm. With $G = 1$, it's LayerNorm (over all channels).
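Both limiting cases can be verified numerically with the corresponding PyTorch modules (a sketch on a small assumed tensor; the comparison relies on the modules' default initialization, where the affine scale is 1 and the shift is 0):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 8, 5, 5)       # small assumed tensor: N=4, C=8, H=W=5

# G = C: one group per channel, i.e. Instance Norm.
gn_as_in = nn.GroupNorm(8, 8)(x)
in_out = nn.InstanceNorm2d(8)(x)
print(torch.allclose(gn_as_in, in_out, atol=1e-5))

# G = 1: all channels in one group, i.e. LayerNorm over (C, H, W).
gn_as_ln = nn.GroupNorm(1, 8)(x)
ln_out = nn.LayerNorm([8, 5, 5])(x)
print(torch.allclose(gn_as_ln, ln_out, atol=1e-5))
```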
Use case: object detection. State-of-the-art object detectors like Faster R-CNN and Mask R-CNN train with 1-2 high-resolution images per GPU. BatchNorm is useless at these batch sizes. Group Norm maintains stable statistics because it doesn't depend on N at all. Wu & He (2018) showed Group Norm with G=32 matches or exceeds BatchNorm performance for batch sizes ≤ 8.
Weight Normalization: A Different Approach
All methods above normalize activations. Weight Normalization (Salimans & Kingma, 2016) normalizes the weights themselves.
For each weight vector $\mathbf{w}$, reparameterize as:

$$\mathbf{w} = \frac{g}{\|\mathbf{v}\|}\,\mathbf{v}$$

- $\mathbf{w}$: weight vector to be reparameterized
- $g$: scalar magnitude parameter — learned
- $\mathbf{v}$: direction vector — learned
- $\|\mathbf{v}\|$: Euclidean norm of $\mathbf{v}$
Instead of learning $\mathbf{w}$ directly, the network learns $g$ (scalar magnitude) and $\mathbf{v}$ (direction vector) separately. The weight is always scaled to magnitude $g$, regardless of how $\mathbf{v}$ changes.
Why this helps: gradient descent can independently scale the magnitude without rotating the direction, and vice versa. In the original parameterization, scaling and rotating are coupled — changing one element of $\mathbf{w}$ affects both its magnitude and direction simultaneously.
Key difference from other norms: weight normalization has no batch or layer statistics. It's entirely determined by the current weight values. This makes it useful for recurrent networks and reinforcement learning, where batch statistics are unreliable or undefined.
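The reparameterization and the decoupling claim can both be checked in a few lines of autograd. A minimal sketch (PyTorch also ships this as `torch.nn.utils.weight_norm`; here it is written out by hand, with an arbitrary toy loss):

```python
import torch

torch.manual_seed(0)
g = torch.tensor(2.0, requires_grad=True)   # scalar magnitude parameter
v = torch.randn(10, requires_grad=True)     # direction vector

w = g * v / v.norm()                        # reparameterized weight
print(w.norm())                             # always equals g, here 2.0

# Arbitrary toy loss; any differentiable function of w works.
loss = (w * torch.arange(10.0)).sum()
loss.backward()

# The gradient w.r.t. v has no radial component: it is orthogonal to v,
# so updates to v rotate the direction without changing the magnitude g controls.
print(torch.dot(v.grad, v).abs())           # ~0
```

The orthogonality `v · ∇_v L = 0` follows directly from differentiating $\mathbf{w} = g\,\mathbf{v}/\|\mathbf{v}\|$, and is exactly the sense in which magnitude and direction are decoupled.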
Decision Guide
```
What architecture are you using?
├── CNN with batch size ≥ 16?           → BatchNorm
├── CNN with batch size < 8?            → Group Norm (G=32)
├── Transformer / language model?       → LayerNorm
├── Style transfer / image generation?  → Instance Norm
├── RNN or online learning?             → Weight Norm or Layer Norm
└── Variable-length, batch size 1?      → LayerNorm
```
In PyTorch:
```python
nn.BatchNorm2d(C)        # norm over (N, H, W), per channel
nn.LayerNorm([C, H, W])  # norm over (C, H, W), per example
nn.InstanceNorm2d(C)     # norm over (H, W), per (N, C) pair
nn.GroupNorm(G, C)       # norm over (C/G, H, W), per (N, group) pair
```
Next we turn to the other side of initialization: not which values the network produces, but where it starts — the initial weight values before any training begins.