Normalization & Initialization
Lesson 2 ⏱ 14 min

Batch normalization: the algorithm

Video coming soon

Batch Normalization Step by Step

Walks through every step of the BatchNorm algorithm on a concrete batch of 4 values, derives the γ and β parameters, and shows why they're necessary for expressivity.

⏱ ~7 min

🧮 Quick refresher

Mean and variance

The mean of a set of numbers is their sum divided by count. The variance is the average squared deviation from the mean. These two statistics describe where a distribution is centered and how spread out it is.

Example

For [2, 4, 4, 8]: mean = (2+4+4+8)/4 = 4.5, variance = ((2-4.5)²+(4-4.5)²+(4-4.5)²+(8-4.5)²)/4 = (6.25+0.25+0.25+12.25)/4 = 4.75.
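If you want to check the arithmetic yourself, a couple of lines of NumPy reproduce it (np.var uses the same population definition, dividing by the count):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 8.0])
print(x.mean())  # 4.5
print(x.var())   # 4.75 -- average squared deviation from the mean
```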

The previous lesson established why we need normalization — activation distributions explode, vanish, and drift during training. Now let's build the BatchNorm algorithm from scratch, step by step, so every piece makes sense before the next one appears.

Batch normalization was the breakthrough that made training very deep networks practical. Before it, networks with 20+ layers were almost impossible to train reliably. After it, researchers started training 100+ layer networks. It is still a required component in most CNN architectures today.

Setup: What We're Normalizing

Consider a single layer producing output activations. We have a mini-batch of examples. For each example, the layer produces a vector of features. Let's focus on one feature across all m examples, giving us values x_1, x_2, …, x_m.

BatchNorm normalizes this feature using statistics computed from the current batch. The full algorithm has four steps.

The Four Steps

Step 1 — Compute batch mean:

\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i

μ_B: batch mean for this feature
m: number of examples in the mini-batch
x_i: activation value for example i

Step 2 — Compute batch variance:

\sigma^2_B = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2

σ²_B: batch variance for this feature

Step 3 — Normalize:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \varepsilon}}

x̂_i: normalized activation for example i
ε: small constant for numerical stability, typically 1e-5

After step 3, the normalized values x̂_i have mean 0 and variance 1 across the batch. The ε prevents division by zero when all values in the batch are identical.
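Here is a quick sketch of steps 1–3 on a small batch in NumPy (the variable names are just illustrative):

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0, 7.0])     # one feature across a batch of 4 examples
eps = 1e-5                             # small constant for numerical stability

mu = x.mean()                          # step 1: batch mean
var = ((x - mu) ** 2).mean()           # step 2: batch variance
x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalize

print(x_hat.mean())  # ~0
print(x_hat.var())   # ~1 (slightly below 1 because of eps)
```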

Step 4 — Scale and shift:

y_i = \gamma \cdot \hat{x}_i + \beta

y_i: final BatchNorm output for example i
γ: learned scale parameter, initialized to 1
β: learned shift parameter, initialized to 0

The parameters γ and β are initialized to 1 and 0 respectively, which leaves the normalized values unchanged at the start of training.
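Putting the four steps together, a minimal single-feature forward pass might look like this (a sketch with illustrative names, not a framework implementation):

```python
import numpy as np

def batchnorm_forward_1d(x, gamma=1.0, beta=0.0, eps=1e-5):
    """BatchNorm forward pass for one feature across a mini-batch x of shape (m,)."""
    mu = x.mean()                          # step 1: batch mean
    var = ((x - mu) ** 2).mean()           # step 2: batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # step 3: normalize
    return gamma * x_hat + beta            # step 4: scale and shift

x = np.array([1.0, 3.0, 5.0, 7.0])
# With the default gamma=1, beta=0 the output is just the normalized values:
print(batchnorm_forward_1d(x))
```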

Why γ and β Are Essential

You might think: "Doesn't adding γ and β undo the normalization?" Yes — that's exactly the point.

Without γ and β, every layer is forced to produce outputs with mean 0 and variance 1. This severely limits what functions the network can represent. Some tasks genuinely benefit from a layer outputting values with mean 5 and variance 0.1. Forcing zero mean and unit variance everywhere removes that flexibility.

With γ and β, the network can choose how much normalization to apply. If the optimal behavior for some layer is to have no normalization at all, gradient descent will push γ → σ_B and β → μ_B, effectively recovering the original distribution. If some amount of normalization helps, γ and β settle at intermediate values.
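You can see this recovery property numerically: choosing γ equal to the batch standard deviation and β equal to the batch mean gives back the original values (up to the tiny ε). A short sketch:

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0, 7.0])
mu = x.mean()
var = ((x - mu) ** 2).mean()
x_hat = (x - mu) / np.sqrt(var + 1e-5)

# Choosing gamma = sigma_B and beta = mu_B undoes the normalization:
y = np.sqrt(var) * x_hat + mu
print(y)  # ~[1, 3, 5, 7] -- the original distribution is recovered (up to eps)
```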

Worked Example: Batch of 4

Say a single feature produces values x = [1, 3, 5, 7] for 4 training examples.

Mean:

\mu_B = \frac{1+3+5+7}{4} = 4

Variance:

\sigma^2_B = \frac{(1-4)^2 + (3-4)^2 + (5-4)^2 + (7-4)^2}{4} = \frac{9+1+1+9}{4} = 5

Normalize (using ε = 0 for clarity):

\hat{x} = \left[\frac{1-4}{\sqrt{5}},\ \frac{3-4}{\sqrt{5}},\ \frac{5-4}{\sqrt{5}},\ \frac{7-4}{\sqrt{5}}\right] = \left[-1.342,\ -0.447,\ 0.447,\ 1.342\right]

Verify: mean of x̂ = 0 ✓, variance = 1 ✓.

Scale and shift (say γ = 2, β = 1):

y = \left[2(-1.342)+1,\ 2(-0.447)+1,\ 2(0.447)+1,\ 2(1.342)+1\right] = \left[-1.684,\ 0.106,\ 1.894,\ 3.684\right]

Now the output has mean 1 and variance 4 — controlled by γ and β.
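The same worked example in a few lines of NumPy, so the final mean and variance can be verified directly:

```python
import numpy as np

x = np.array([1.0, 3.0, 5.0, 7.0])
gamma, beta = 2.0, 1.0

mu = x.mean()                    # 4.0
var = ((x - mu) ** 2).mean()     # 5.0
x_hat = (x - mu) / np.sqrt(var)  # eps = 0 for clarity, as above
y = gamma * x_hat + beta

print(y)                   # ~[-1.683, 0.106, 1.894, 3.683], matching the hand calculation up to rounding
print(y.mean(), y.var())   # ~1.0 and ~4.0
```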

Applying BatchNorm to Multiple Features

In practice, a layer produces an activation vector, not a single scalar. Say each example has C features. BatchNorm applies the four-step algorithm to each of the C features independently. Feature 1 gets its own μ and σ² from the batch; feature 2 gets its own; and so on.

This means BatchNorm adds exactly 2C parameters (one γ and one β per feature). For a layer with 512 features, that's 1,024 extra parameters — tiny compared to the weight matrix.
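A vectorized sketch for a full activation matrix of shape (m, C): each feature column is normalized with its own batch statistics, and γ and β each hold one entry per feature (the function and variable names here are illustrative):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    """X: activations of shape (m, C). gamma, beta: shape (C,). Returns shape (m, C)."""
    mu = X.mean(axis=0)                    # per-feature batch mean, shape (C,)
    var = X.var(axis=0)                    # per-feature batch variance, shape (C,)
    X_hat = (X - mu) / np.sqrt(var + eps)  # normalize each feature column independently
    return gamma * X_hat + beta            # scale and shift, broadcast over examples

m, C = 32, 512
X = np.random.randn(m, C) * 3.0 + 2.0           # a fake batch of activations
gamma, beta = np.ones(C), np.zeros(C)           # initialization: gamma = 1, beta = 0
Y = batchnorm_forward(X, gamma, beta)
print(Y.mean(axis=0)[:3], Y.var(axis=0)[:3])    # each feature now has mean ~0, variance ~1
print(gamma.size + beta.size)                   # 1024 extra parameters for C = 512
```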

Summary of the Algorithm

The full BatchNorm operation for feature j across mini-batch examples i = 1 … m:

  1. Compute batch mean: \mu_{B,j} = \frac{1}{m}\sum_i x_{i,j}
  2. Compute batch variance: \sigma^2_{B,j} = \frac{1}{m}\sum_i (x_{i,j} - \mu_{B,j})^2
  3. Normalize: \hat{x}_{i,j} = \frac{x_{i,j} - \mu_{B,j}}{\sqrt{\sigma^2_{B,j} + \varepsilon}}
  4. Scale and shift: y_{i,j} = \gamma_j \cdot \hat{x}_{i,j} + \beta_j

The learned parameters γ_j and β_j are updated by backprop just like any other weight. The normalization in step 3 is a differentiable function of the batch, so gradients flow through it cleanly.
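Because γ_j and β_j are ordinary parameters, automatic differentiation treats them like any other weight. A small sketch using PyTorch, assuming it is installed (torch.nn.BatchNorm1d in training mode applies the same four steps using the current batch's statistics):

```python
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm1d(num_features=4)   # gamma = bn.weight, beta = bn.bias
x = torch.randn(8, 4, requires_grad=True)   # batch of 8 examples, 4 features

y = bn(x)                                   # steps 1-4 with batch statistics (training mode)
loss = (y ** 2).sum()
loss.backward()                             # gradients flow through the normalization

print(bn.weight.grad.shape, bn.bias.grad.shape)  # torch.Size([4]) torch.Size([4])
print(x.grad.shape)                              # torch.Size([8, 4]) -- gradients reach the inputs too
```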

Next lesson: what happens to this algorithm at inference time — because there's no "batch" anymore, and this requires a careful fix.

Quiz


In BatchNorm, γ and β are...