The Scaling Problem
Pure gradient descent computes the gradient using all training examples at every step:
- $\nabla L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla L_i(\theta)$ - full-batch gradient: exact average over all $n$ examples
- $\nabla L_i(\theta)$ - gradient for example $i$ alone
- $n$ - total number of training examples
For $n = 1{,}000{,}000$ examples, each gradient step requires: a forward pass for all 1M examples, the loss for all 1M examples, a backward pass for all 1M examples, then the average. One parameter update requires processing a million examples. With a large model, this takes minutes — and you need thousands of updates to converge.
If you had to find the lowest valley in a mountain range, would you survey every inch of terrain before taking a single step — or would you sample the ground under your feet, take a step in the right direction, and repeat? Full-batch gradient descent tries to survey everything first, and for datasets with millions of examples that approach is completely impractical. Stochastic gradient descent makes the obvious trade-off: use a small sample to estimate the direction, move fast, and repeat.
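To make the cost structure concrete, here is a minimal sketch of full-batch gradient descent on a toy least-squares problem. The function name, learning rate, and toy data are illustrative assumptions, not from the text:

```python
import numpy as np

def full_batch_gd(X, y, lr=0.1, steps=500):
    """Full-batch gradient descent for least squares: every single
    update touches ALL n examples."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        residuals = X @ theta - y      # "forward pass" over all n examples
        grad = X.T @ residuals / n     # exact average: (1/n) * sum_i grad_i
        theta -= lr * grad             # one update per full pass over the data
    return theta

# Toy data with an exact solution theta* = [2, -3]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0])
theta = full_batch_gd(X, y)
```

With 200 examples this is instant; the point is that the per-update cost scales linearly with $n$, so at $n = 10^6$ the same loop does a million gradient evaluations per step.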
The Other Extreme: Pure SGD
Stochastic gradient descent (SGD) uses just one randomly chosen training example per update:
- $i$ - randomly chosen example index; changes every step
- $\nabla L_i(\theta)$ - gradient computed on example $i$ alone; a noisy estimate of the true gradient $\nabla L(\theta)$
This is extremely fast — one forward/backward pass per update. But the gradient estimate is highly noisy: one example might be unusual or unrepresentative. Updates zigzag toward the minimum rather than walking a smooth path.
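The zigzag behavior is easy to reproduce. Here is a sketch of pure SGD on the same kind of toy least-squares problem (names and hyperparameters are illustrative assumptions):

```python
import numpy as np

def pure_sgd(X, y, lr=0.01, steps=5000, seed=0):
    """Pure SGD: each update uses ONE randomly chosen example, so each
    step is cheap but the gradient estimate is noisy."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)                    # random example index, new every step
        grad_i = (X[i] @ theta - y[i]) * X[i]  # gradient on example i alone
        theta -= lr * grad_i                   # noisy update: the path zigzags
    return theta

# Noiseless toy problem: every per-example gradient vanishes at the optimum,
# so even pure SGD settles near theta* = [2, -3]
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0])
theta = pure_sgd(X, y)
```

Each step here costs one gradient evaluation instead of $n$, which is exactly the speed/noise trade described above.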
Mini-Batch: The Sweet Spot
The standard approach in modern ML is mini-batch gradient descent:
- $B$ - batch size: number of examples sampled each step. Typically 32, 64, or 128
- $\nabla L_{\mathcal{B}}(\theta) = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla L_i(\theta)$ - mini-batch gradient estimate: average over the $B$ selected examples in the sampled batch $\mathcal{B}$
Why batch sizes of 32–128?
- Statistical quality: averaging 32–128 examples gives a gradient pointing in roughly the right direction
- Speed: far faster than computing the gradient over all $n$ examples each step
- Hardware utilization: GPUs process a batch of 64 almost as fast as a single example, because the hardware parallelizes across the batch — matrix operations on batches are extremely efficient
- Memory: only $B$ examples need to fit in GPU memory at once, not the full dataset
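As a sketch, mini-batch gradient descent changes only one line relative to the two extremes: average the per-example gradients over a random batch of $B$ examples. The data and hyperparameters below are illustrative assumptions:

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, lr=0.1, steps=500, seed=0):
    """Mini-batch gradient descent: each update averages per-example
    gradients over a random batch of B examples."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        batch = rng.choice(n, size=batch_size, replace=False)  # sample B indices
        residuals = X[batch] @ theta - y[batch]
        grad = X[batch].T @ residuals / batch_size  # average over the B examples
        theta -= lr * grad    # low-noise estimate of the full-batch direction
    return theta

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0])
theta = minibatch_gd(X, y)
```

Note that the per-step cost is $B$ gradient evaluations rather than $n$, while the averaging keeps the update direction far less noisy than pure SGD's.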
Epoch vs. Iteration
Two terms you'll see constantly:
- Iteration: one gradient update, processing one mini-batch of $B$ examples.
- Epoch: one complete pass through the entire training dataset.
Example: 10,000 training examples, batch size 100:
- Iterations per epoch: $10{,}000 / 100 = 100$
- Training for 50 epochs: $50 \times 100 = 5{,}000$ total gradient updates
- Data is shuffled each epoch before batching, so each example appears in a different batch each time
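The bookkeeping above can be checked with a short sketch of the standard shuffle-then-batch training loop. The loop skeleton is an illustrative assumption; the actual gradient update is omitted:

```python
import numpy as np

n, batch_size, num_epochs = 10_000, 100, 50
iters_per_epoch = n // batch_size  # 10,000 / 100 = 100 iterations per epoch

rng = np.random.default_rng(0)
indices = np.arange(n)

total_updates = 0
for epoch in range(num_epochs):
    rng.shuffle(indices)           # reshuffle: each example lands in a new batch
    for start in range(0, n, batch_size):
        batch = indices[start:start + batch_size]  # one mini-batch = one iteration
        # ... forward pass, loss, backward pass, parameter update go here ...
        total_updates += 1         # one gradient update per iteration

print(total_updates)               # 50 epochs x 100 iterations = 5000 updates
```

Reshuffling at the top of each epoch is what makes every example appear in a different batch from one epoch to the next.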
Batch Size as a Hyperparameter
Batch size affects not just speed but generalization:
- Large batches (256–1024): lower-noise gradient estimates, stable training, better GPU utilization — but weaker regularization effect and sometimes worse final generalization
- Small batches (8–32): noisier gradients act as implicit regularization, sometimes better generalization — but less efficient GPU utilization and noisier loss curves
The noise in small-batch training prevents the model from overfitting specific data subsets — it acts like a regularizer. Start with batch size 32 or 64; tune if needed.
Interactive example
Compare full-batch vs mini-batch convergence paths on a 2D loss surface - see the noisy zig-zag vs the smooth descent
Coming soon