Normalization & Initialization
Lesson 2 ⏱ 14 min

Batch normalization: the algorithm

Video coming soon

Batch Normalization Step by Step

Walks through every step of the BatchNorm algorithm on a concrete batch of 4 values, derives the γ and β parameters, and shows why they're necessary for expressivity.

⏱ ~7 min

🧮 Quick refresher

Mean and variance

The mean of a set of numbers is their sum divided by count. The variance is the average squared deviation from the mean. These two statistics describe where a distribution is centered and how spread out it is.

Example

For [2, 4, 4, 8]: mean = (2+4+4+8)/4 = 4.5, variance = ((2-4.5)²+(4-4.5)²+(4-4.5)²+(8-4.5)²)/4 = (6.25+0.25+0.25+12.25)/4 = 4.75.
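If you want to check the arithmetic yourself, a couple of lines of NumPy reproduce it (np.var uses the same population definition, dividing by the count):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 8.0])
print(x.mean())  # 4.5
print(x.var())   # 4.75 -- average squared deviation from the mean
```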

The previous lesson established why we need normalization — activation distributions explode, vanish, and drift during training. Now let's build the BatchNorm algorithm from scratch, step by step, so every piece makes sense before the next one appears.

Batch normalization was the breakthrough that made training very deep networks practical. Before it, networks with 20+ layers were almost impossible to train reliably. After it, researchers started training 100+ layer networks. It is still a required component in most CNN architectures today.

Setup: What We're Normalizing

Consider a single layer producing output activations. We have a mini-batch of examples. For each example, the layer produces a vector of features. Let's focus on one feature across all m examples, giving us values x_1, x_2, …, x_m.

BatchNorm normalizes this feature using statistics computed from the current batch. The full algorithm has four steps.

The Four Steps

Step 1 — Compute batch mean:

\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i

μ_B: batch mean for this feature
m: number of examples in the mini-batch
x_i: activation value for example i

Step 2 — Compute batch variance:

\sigma^2_B = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2

σ²_B: batch variance for this feature

Step 3 — Normalize:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \varepsilon}}

x̂_i: normalized activation for example i
ε: small constant for numerical stability, typically 1e-5

After step 3, the normalized values x̂_i have mean 0 and variance 1 across the batch. The ε prevents division by zero when all values in the batch are identical.
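Here is a quick sketch of steps 1–3 on a small batch in NumPy (the variable names are just illustrative):

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0, 7.0])     # one feature across a batch of 4 examples
eps = 1e-5                             # small constant for numerical stability

mu = x.mean()                          # step 1: batch mean
var = ((x - mu) ** 2).mean()           # step 2: batch variance
x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalize

print(x_hat.mean())  # ~0
print(x_hat.var())   # ~1 (slightly below 1 because of eps)
```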

Step 4 — Scale and shift:

y_i = \gamma \cdot \hat{x}_i + \beta

y_i: final BatchNorm output for example i
γ: learned scale parameter, initialized to 1
β: learned shift parameter, initialized to 0

The parameters γ and β are initialized to 1 and 0 respectively, which leaves the normalized values unchanged at the start of training.
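Putting the four steps together, a minimal single-feature forward pass might look like this (a sketch with illustrative names, not a framework implementation):

```python
import numpy as np

def batchnorm_forward_1d(x, gamma=1.0, beta=0.0, eps=1e-5):
    """BatchNorm forward pass for one feature across a mini-batch x of shape (m,)."""
    mu = x.mean()                          # step 1: batch mean
    var = ((x - mu) ** 2).mean()           # step 2: batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalize
    return gamma * x_hat + beta            # step 4: scale and shift

x = np.array([1.0, 3.0, 5.0, 7.0])
# With the default gamma=1, beta=0 the output is just the normalized values:
print(batchnorm_forward_1d(x))
```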

Why γ and β Are Essential

You might think: "Doesn't adding γ and β undo the normalization?" Yes — that's exactly the point.

Without γ and β, every layer is forced to produce outputs with mean 0 and variance 1. This severely limits what functions the network can represent. Some tasks genuinely benefit from a layer outputting values with mean 5 and variance 0.1. Forcing zero mean and unit variance everywhere removes that flexibility.

With γ and β, the network can choose how much normalization to apply. If the optimal behavior for some layer is to have no normalization at all, gradient descent will push γ → σ_B and β → μ_B, effectively recovering the original distribution. If some amount of normalization helps, γ and β settle at intermediate values.
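You can see this recovery property numerically: choosing γ equal to the batch standard deviation and β equal to the batch mean gives back the original values (up to the tiny ε). A short sketch:

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0, 7.0])
mu = x.mean()
var = ((x - mu) ** 2).mean()
x_hat = (x - mu) / np.sqrt(var + 1e-5)

# Choosing gamma = sigma_B and beta = mu_B undoes the normalization:
y = np.sqrt(var) * x_hat + mu
print(y)  # ~[1, 3, 5, 7] -- the original distribution is recovered (up to eps)
```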

Worked Example: Batch of 4

Say a single feature produces values x = [1, 3, 5, 7] for 4 training examples.

Mean:

\mu_B = \frac{1+3+5+7}{4} = 4

Variance:

\sigma^2_B = \frac{(1-4)^2 + (3-4)^2 + (5-4)^2 + (7-4)^2}{4} = \frac{9+1+1+9}{4} = 5

Normalize (using ε = 0 for clarity):

\hat{x} = \left[\frac{1-4}{\sqrt{5}},\ \frac{3-4}{\sqrt{5}},\ \frac{5-4}{\sqrt{5}},\ \frac{7-4}{\sqrt{5}}\right] = \left[-1.342,\ -0.447,\ 0.447,\ 1.342\right]

Verify: mean of x̂ = 0 ✓, variance = 1 ✓.

Scale and shift (say γ = 2, β = 1):

y = \left[2(-1.342)+1,\ 2(-0.447)+1,\ 2(0.447)+1,\ 2(1.342)+1\right] = \left[-1.684,\ 0.106,\ 1.894,\ 3.684\right]

Now the output has mean 1 and variance 4 — controlled by γ and β.
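The same worked example in a few lines of NumPy, so the final mean and variance can be verified directly:

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0, 7.0])
gamma, beta = 2.0, 1.0

mu = x.mean()                    # 4.0
var = ((x - mu) ** 2).mean()     # 5.0
x_hat = (x - mu) / np.sqrt(var)  # eps = 0 for clarity, as above
y = gamma * x_hat + beta

print(y)                   # ~[-1.683, 0.106, 1.894, 3.683], matching the hand calculation up to rounding
print(y.mean(), y.var())   # ~1.0 and ~4.0
```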

Applying BatchNorm to Multiple Features

In practice, a layer produces an activation vector, not a single scalar. Say each example has C features. BatchNorm applies the four-step algorithm to each of the C features independently. Feature 1 gets its own μ and σ² from the batch; feature 2 gets its own; and so on.

This means BatchNorm adds exactly 2C parameters (one γ and one β per feature). For a layer with 512 features, that's 1,024 extra parameters — tiny compared to the weight matrix.
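A vectorized sketch for a full activation matrix of shape (m, C): each feature column is normalized with its own batch statistics, and γ and β each hold one entry per feature (the function and variable names here are illustrative):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """X: activations of shape (m, C). gamma, beta: shape (C,). Returns shape (m, C)."""
    mu = X.mean(axis=0)                    # per-feature batch mean, shape (C,)
    var = X.var(axis=0)                    # per-feature batch variance, shape (C,)
    X_hat = (X - mu) / np.sqrt(var + eps)  # normalize each feature column independently
    return gamma * X_hat + beta            # scale and shift, broadcast over examples

m, C = 32, 512
X = np.random.randn(m, C) * 3.0 + 2.0           # a fake batch of activations
gamma, beta = np.ones(C), np.zeros(C)           # initialization: gamma = 1, beta = 0
Y = batchnorm_forward(X, gamma, beta)
print(Y.mean(axis=0)[:3], Y.var(axis=0)[:3])    # each feature now has mean ~0, variance ~1
print(gamma.size + beta.size)                   # 1024 extra parameters for C = 512
```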

Summary of the Algorithm

The full BatchNorm operation for feature j across mini-batch examples i = 1 … m:

  1. Compute batch mean: \mu_{B,j} = \frac{1}{m}\sum_i x_{i,j}
  2. Compute batch variance: \sigma^2_{B,j} = \frac{1}{m}\sum_i (x_{i,j} - \mu_{B,j})^2
  3. Normalize: \hat{x}_{i,j} = \frac{x_{i,j} - \mu_{B,j}}{\sqrt{\sigma^2_{B,j} + \varepsilon}}
  4. Scale and shift: y_{i,j} = \gamma_j \cdot \hat{x}_{i,j} + \beta_j

The learned parameters γ_j and β_j are updated by backprop just like any other weight. The normalization in step 3 is a differentiable function of the batch, so gradients flow through it cleanly.
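Because γ_j and β_j are ordinary parameters, automatic differentiation treats them like any other weight. A small sketch using PyTorch, assuming it is installed (torch.nn.BatchNorm1d in training mode applies the same four steps using the current batch's statistics):

```python
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm1d(num_features=4)   # gamma = bn.weight, beta = bn.bias
x = torch.randn(8, 4, requires_grad=True)   # batch of 8 examples, 4 features

y = bn(x)                                   # steps 1-4 with batch statistics (training mode)
loss = (y ** 2).sum()
loss.backward()                             # gradients flow through the normalization

print(bn.weight.grad.shape, bn.bias.grad.shape)  # torch.Size([4]) torch.Size([4])
print(x.grad.shape)                              # torch.Size([8, 4]) -- gradients reach the inputs too
```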

Next lesson: what happens to this algorithm at inference time — because there's no "batch" anymore, and this requires a careful fix.

Quiz


In BatchNorm, γ and β are...