Normalization & Initialization
Lesson 1 ⏱ 10 min

The activation distribution problem

Video coming soon

The Activation Distribution Problem

Visual demonstration of activation explosion and vanishing across deep networks, why layer outputs shift as training proceeds, and how normalizing activations fixes both problems.

⏱ ~6 min

🧮

Quick refresher

Variance and standard deviation

Variance measures the spread of a set of numbers around their mean. A variance of 1 means values typically stray about 1 unit from the mean; a variance of 100 means they stray about 10 units.

Example

The set [9, 10, 11] has mean 10 and variance ≈ 1.

The set [1, 10, 19] has mean 10 and variance 81.
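Both values are quick to check with NumPy. The examples above use the sample variance, which is `ddof=1` in NumPy (the default `ddof=0` gives the population variance):

```python
import numpy as np

a = np.array([9, 10, 11])
b = np.array([1, 10, 19])

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
print(a.var(ddof=1))  # 1.0
print(b.var(ddof=1))  # 81.0
```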

You've seen what happens when gradients vanish or explode during backpropagation — training stalls or diverges. But there's a second, subtler way that deep networks break down: the activations themselves develop pathological distributions as training progresses. Understanding this problem is the key to understanding why normalization is non-negotiable in modern deep learning.

Normalization is not optional in deep networks — it is what makes them trainable. Without batch normalization or layer normalization, most networks deeper than ~10 layers either diverge or fail to learn. Understanding the root cause of this instability explains every normalization technique you will encounter.

What "Distribution" Means for a Layer

Every layer in your network produces output activations — vectors of numbers fed to the next layer. Those numbers have a statistical distribution: a mean (where the values cluster), a variance (how spread out they are), and a shape.

A healthy distribution for activations might look like: mean ≈ 0, variance ≈ 1, values mostly between −3 and 3. A pathological distribution might have mean 50 and variance 10,000 — values ranging from −150 to +250.

Why does this matter? The next layer's weights are scaled to work with a certain range of inputs. If the inputs suddenly have wildly different scale, the layer is miscalibrated. But it gets worse.

Activation Explosion: A Numerical Example

Consider a simple network with no nonlinearity — just 10 matrix multiplications in a row, each with weights drawn from $\mathcal{N}(0, 1)$. Each multiplication scales the variance.

If each weight matrix has entries drawn from $\mathcal{N}(0, 1)$ and the input dimension is $n = 100$, then the variance of each output neuron is approximately:

$$\text{Var}(y) = n \cdot \text{Var}(W) \cdot \text{Var}(x) = 100 \cdot 1 \cdot 1 = 100$$

  • $\text{Var}(y)$ — variance of one output neuron
  • $n$ — fan-in: the number of inputs to this neuron
  • $\text{Var}(W)$ — variance of each weight
  • $\text{Var}(x)$ — variance of each input

This formula says: the spread of values coming out of a layer equals the number of inputs, times how spread out the weights are, times how spread out the inputs were. Those factors multiply together, so if any one of them is too big, the product blows up.
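The formula is easy to verify numerically. A minimal NumPy sketch, using the same fan-in of 100 as the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                       # fan-in from the example
W = rng.normal(0, 1, (n, n))  # weights ~ N(0, 1)
x = rng.normal(0, 1, n)       # inputs  ~ N(0, 1)

y = W @ x
# Var(y) comes out close to n * Var(W) * Var(x) = 100 * 1 * 1 = 100
print(y.var())
```

The measured variance fluctuates around 100 from run to run; the formula describes the expectation, not an exact per-sample value.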

After just one layer, variance is 100. After 10 layers:

$$\text{Var}_{10} = 100^{10} = 10^{20}$$

  • $\text{Var}_L$ — variance after $L$ layers

That's one followed by twenty zeros. Any value fed into the next layer will be astronomically large. Activations explode.

Now try the opposite: weights from $\mathcal{N}(0, 0.001)$. Per-layer variance is $100 \times 0.001 \times 1 = 0.1$, so after 10 layers:

$$\text{Var}_{10} = 0.1^{10} = 10^{-10}$$

Effectively zero. Activations vanish. (Note that $\mathcal{N}(0, 0.01)$ sits exactly in between: per-layer variance is $100 \times 0.01 \times 1 = 1$, which stays stable at any depth — a preview of principled weight initialization.)

Both extremes prevent any learning. Gradients over exploded or vanished activations are useless.
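The whole 10-layer experiment fits in a few lines. A sketch, with layer count and fan-in matching the example above:

```python
import numpy as np

def variance_after(layers, w_var, n=100, seed=0):
    """Push a standard-normal input through `layers` linear layers whose
    weights have variance `w_var`; return the final activation variance."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0, 1, n)
    for _ in range(layers):
        W = rng.normal(0, np.sqrt(w_var), (n, n))
        x = W @ x
    return x.var()

print(variance_after(10, 1.0))    # around 1e20:  explodes
print(variance_after(10, 0.001))  # around 1e-10: vanishes
print(variance_after(10, 0.01))   # around 1:     stable
```

Each layer multiplies the variance by roughly `n * w_var`, so the outcome is decided entirely by whether that factor is above, below, or equal to 1.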

Internal Covariate Shift

There's a second, dynamic version of this problem that emerges during training. Suppose layer 5 has learned good weights for the distribution it receives from layer 4. Now layer 1 takes a gradient step — its weights change. Layer 1's output distribution shifts. That shifts layer 2's output. That shifts layer 3's. By the time you reach layer 5, its inputs have a completely different mean and variance than before.

Layer 5 is now miscalibrated. It must re-learn its weights to suit the new distribution. But while it's adapting, layer 6 becomes miscalibrated, because layer 5's outputs also changed. This cascading readaptation — each layer chasing a moving distribution — is called internal covariate shift.

The deeper the network, the worse this gets. A 100-layer network may spend most of its training budget just re-calibrating rather than actually learning.

The Fix: Normalize at Every Layer

The conceptual solution is straightforward: after each layer's computation, normalize the activations so their distribution is well-behaved — roughly zero mean, unit variance. If we do this consistently, no layer ever faces a runaway input distribution, and internal covariate shift is suppressed.

But naive normalization kills expressivity. If you force every layer's output to be exactly zero-mean and unit-variance, the network can only represent a very restricted class of functions. The solution is to normalize and then apply a learned scale ($\gamma$) and shift ($\beta$), giving the network the ability to "undo" the normalization if needed.
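A minimal sketch of normalize-then-rescale over a single activation vector. Here `gamma` and `beta` are passed in as fixed constants for illustration; in a real network they are learned parameters:

```python
import numpy as np

def normalize(x, gamma, beta, eps=1e-5):
    # Standardize to roughly zero mean and unit variance...
    x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)
    # ...then let the learned parameters rescale and shift the result.
    return gamma * x_hat + beta

x = np.array([35.0, 60.0, 55.0, 50.0])  # pathological scale: mean 50
y = normalize(x, gamma=1.0, beta=0.0)
print(y.mean(), y.var())                # roughly 0 and 1
```

With `gamma=1, beta=0` the output is fully standardized; setting `gamma` to the original standard deviation and `beta` to the original mean would reproduce the input, which is exactly the "undo" escape hatch described above.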

This is the key insight that makes normalization practical, and it underlies all the variants we'll cover:

  • Batch Normalization — normalizes over the batch dimension (next lesson)
  • Layer Normalization — normalizes over the feature dimension (per example)
  • Group Normalization — a compromise for small batches
  • Instance Normalization — per-example, per-channel

Each variant differs in which dimensions it normalizes over, making different tradeoffs between batch size requirements, sequence compatibility, and training stability.
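The axis differences can be made concrete with a small helper that normalizes over a chosen set of axes. A sketch assuming image-style activations of shape (N, C, H, W):

```python
import numpy as np

def normalize_over(x, axes, eps=1e-5):
    # Zero mean, unit variance over `axes`; the remaining axes each get
    # their own separate statistics.
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# Activations shaped (batch N, channels C, height H, width W).
x = np.random.default_rng(0).normal(2.0, 3.0, (8, 4, 5, 5))

bn = normalize_over(x, (0, 2, 3))  # BatchNorm: per channel, across the batch
ln = normalize_over(x, (1, 2, 3))  # LayerNorm: per example, over all features
inst = normalize_over(x, (2, 3))   # InstanceNorm: per example, per channel
# GroupNorm reshapes C into groups first, then normalizes like InstanceNorm.
```

Reading the axis tuples makes the tradeoffs visible: only BatchNorm includes axis 0, which is why it needs a reasonably large batch, while LayerNorm and InstanceNorm work one example at a time.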

Interactive example

Watch activations explode or vanish across layers — adjust weight variance and layer count

Coming soon

Quiz

1 / 3

You stack 20 layers, each multiplying variance by 2. What is the variance after all 20 layers if you start with variance 1?