Normalization & Initialization
Lesson 1 ⏱ 10 min

The activation distribution problem

Video coming soon

The Activation Distribution Problem

Visual demonstration of activation explosion and vanishing across deep networks, why layer outputs shift as training proceeds, and how normalizing activations fixes both problems.

⏱ ~6 min

🧮

Quick refresher

Variance and standard deviation

Variance measures the spread of a set of numbers around their mean. A variance of 1 means values typically stray about 1 unit from the mean; a variance of 100 means they stray about 10 units.

Example

The set [9, 10, 11] has mean 10 and variance ≈ 1.

The set [1, 10, 19] has mean 10 and variance 81.
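Both values are quick to check with NumPy. The examples above use the sample variance, which is `ddof=1` in NumPy (the default `ddof=0` gives the population variance):

```python
import numpy as np

a = np.array([9, 10, 11])
b = np.array([1, 10, 19])

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
print(a.var(ddof=1))  # 1.0
print(b.var(ddof=1))  # 81.0
```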

You've seen what happens when gradients vanish or explode during backpropagation — training stalls or diverges. But there's a second, subtler way that deep networks break down: the activations themselves develop pathological distributions as training progresses. Understanding this problem is the key to understanding why normalization is non-negotiable in modern deep learning.

Normalization is not optional in deep networks — it is what makes them trainable. Without batch normalization or layer normalization, most networks deeper than ~10 layers either diverge or fail to learn. Understanding the root cause of this instability explains every normalization technique you will encounter.

What "Distribution" Means for a Layer

Every layer in your network produces output activations — vectors of numbers fed to the next layer. Those numbers have a statistical distribution: a mean (where the values cluster), a variance (how spread out they are), and a shape.

A healthy distribution for activations might look like: mean ≈ 0, variance ≈ 1, values mostly between −3 and 3. A pathological distribution might have mean 50 and variance 10,000 — values ranging from −150 to +250.

Why does this matter? The next layer's weights are scaled to work with a certain range of inputs. If the inputs suddenly have wildly different scale, the layer is miscalibrated. But it gets worse.

Activation Explosion: A Numerical Example

Consider a simple network with no nonlinearity — just 10 matrix multiplications in a row, each with weights drawn from $\mathcal{N}(0, 1)$. Each multiplication scales the variance.

If each weight matrix has entries drawn from $\mathcal{N}(0, 1)$ and the input dimension is $n = 100$, then the variance of each output neuron is approximately:

$$\text{Var}(y) = n \cdot \text{Var}(W) \cdot \text{Var}(x) = 100 \cdot 1 \cdot 1 = 100$$

  • $\text{Var}(y)$ — variance of one output neuron
  • $n$ — fan-in: the number of inputs to this neuron
  • $\text{Var}(W)$ — variance of each weight
  • $\text{Var}(x)$ — variance of each input

This formula says: the spread of values coming out of a layer equals the number of inputs, times how spread out the weights are, times how spread out the inputs were. Those factors multiply together, so if any one of them is too big, the product blows up.
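The formula is easy to verify numerically. A minimal NumPy sketch, using the same fan-in of 100 as the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                       # fan-in from the example
W = rng.normal(0, 1, (n, n))  # weights ~ N(0, 1)
x = rng.normal(0, 1, n)       # inputs  ~ N(0, 1)

y = W @ x
# Var(y) comes out close to n * Var(W) * Var(x) = 100 * 1 * 1 = 100
print(y.var())
```

The measured variance fluctuates around 100 from run to run; the formula describes the expectation, not an exact per-sample value.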

After just one layer, variance is 100. After 10 layers:

$$\text{Var}_{10} = 100^{10} = 10^{20}$$

  • $\text{Var}_L$ — variance after $L$ layers

That's one followed by twenty zeros. Any value fed into the next layer will be astronomically large. Activations explode.

Now try the opposite: weights from $\mathcal{N}(0, 0.001)$. Per-layer variance is $100 \times 0.001 \times 1 = 0.1$, so after 10 layers:

$$\text{Var}_{10} = 0.1^{10} = 10^{-10}$$

Effectively zero. Activations vanish. (Note that $\mathcal{N}(0, 0.01)$ sits exactly in between: per-layer variance is $100 \times 0.01 \times 1 = 1$, which stays stable at any depth — a preview of principled weight initialization.)

Both extremes prevent any learning. Gradients over exploded or vanished activations are useless.
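The whole 10-layer experiment fits in a few lines. A sketch, with layer count and fan-in matching the example above:

```python
import numpy as np

def variance_after(layers, w_var, n=100, seed=0):
    """Push a standard-normal input through `layers` linear layers whose
    weights have variance `w_var`; return the final activation variance."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0, 1, n)
    for _ in range(layers):
        W = rng.normal(0, np.sqrt(w_var), (n, n))
        x = W @ x
    return x.var()

print(variance_after(10, 1.0))    # around 1e20:  explodes
print(variance_after(10, 0.001))  # around 1e-10: vanishes
print(variance_after(10, 0.01))   # around 1:     stable
```

Each layer multiplies the variance by roughly `n * w_var`, so the outcome is decided entirely by whether that factor is above, below, or equal to 1.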

Internal Covariate Shift

There's a second, dynamic version of this problem that emerges during training. Suppose layer 5 has learned good weights for the distribution it receives from layer 4. Now layer 1 takes a gradient step — its weights change. Layer 1's output distribution shifts. That shifts layer 2's output. That shifts layer 3's. By the time you reach layer 5, its inputs have a completely different mean and variance than before.

Layer 5 is now miscalibrated. It must re-learn its weights to suit the new distribution. But while it's adapting, layer 6 becomes miscalibrated, because layer 5's outputs also changed. This cascading readaptation — each layer chasing a moving distribution — is called internal covariate shift.

The deeper the network, the worse this gets. A 100-layer network may spend most of its training budget just re-calibrating rather than actually learning.

The Fix: Normalize at Every Layer

The conceptual solution is straightforward: after each layer's computation, normalize the activations so their distribution is well-behaved — roughly zero mean, unit variance. If we do this consistently, no layer ever faces a runaway input distribution, and internal covariate shift is suppressed.

But naive normalization kills expressivity. If you force every layer's output to be exactly zero-mean and unit-variance, the network can only represent a very restricted class of functions. The solution is to normalize and then apply a learned scale ($\gamma$) and shift ($\beta$), giving the network the ability to "undo" the normalization if needed.
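A minimal sketch of normalize-then-rescale over a single activation vector. Here `gamma` and `beta` are passed in as fixed constants for illustration; in a real network they are learned parameters:

```python
import numpy as np

def normalize(x, gamma, beta, eps=1e-5):
    # Standardize to roughly zero mean and unit variance...
    x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)
    # ...then let the learned parameters rescale and shift the result.
    return gamma * x_hat + beta

x = np.array([35.0, 60.0, 55.0, 50.0])  # pathological scale: mean 50
y = normalize(x, gamma=1.0, beta=0.0)
print(y.mean(), y.var())                # roughly 0 and 1
```

With `gamma=1, beta=0` the output is fully standardized; setting `gamma` to the original standard deviation and `beta` to the original mean would reproduce the input, which is exactly the "undo" escape hatch described above.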

This is the key insight that makes normalization practical, and it underlies all the variants we'll cover:

  • Batch Normalization — normalizes over the batch dimension (next lesson)
  • Layer Normalization — normalizes over the feature dimension (per example)
  • Group Normalization — a compromise for small batches
  • Instance Normalization — per-example, per-channel

Each variant differs in which dimensions it normalizes over, making different tradeoffs between batch size requirements, sequence compatibility, and training stability.
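The axis differences can be made concrete with a small helper that normalizes over a chosen set of axes. A sketch assuming image-style activations of shape (N, C, H, W):

```python
import numpy as np

def normalize_over(x, axes, eps=1e-5):
    # Zero mean, unit variance over `axes`; the remaining axes each get
    # their own separate statistics.
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Activations shaped (batch N, channels C, height H, width W).
x = np.random.default_rng(0).normal(2.0, 3.0, (8, 4, 5, 5))

bn = normalize_over(x, (0, 2, 3))  # BatchNorm: per channel, across the batch
ln = normalize_over(x, (1, 2, 3))  # LayerNorm: per example, over all features
inst = normalize_over(x, (2, 3))   # InstanceNorm: per example, per channel
# GroupNorm reshapes C into groups first, then normalizes like InstanceNorm.
```

Reading the axis tuples makes the tradeoffs visible: only BatchNorm includes axis 0, which is why it needs a reasonably large batch, while LayerNorm and InstanceNorm work one example at a time.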

Interactive example

Watch activations explode or vanish across layers — adjust weight variance and layer count

Coming soon

Quiz

1 / 3

You stack 20 layers, each multiplying variance by 2. What is the variance after all 20 layers if you start with variance 1?