Gradient Descent
Lesson 5 ⏱ 12 min

Stochastic & mini-batch GD


SGD and Mini-Batches - Fast, Noisy, and Surprisingly Good

Why full-batch gradient descent doesn't scale, how using a random subset of data gives noisy but usable gradients, and why noise can actually help training.


Quick refresher

The gradient update w ← w - α·∇L

The gradient ∇L is typically computed as an average over all training examples. Stochastic and mini-batch methods approximate this average using a subset of examples.
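As a rough sketch of that refresher in code (a toy linear model with squared-error loss; the names `per_example_grads` and `update` are illustrative, not from this lesson), the same update rule works whether the gradient is averaged over all examples or over a subset:

```python
import numpy as np

def per_example_grads(w, X, y):
    # Per-example gradients of the squared error L_i = (w @ x_i - y_i)^2,
    # stacked into an (n, d) array.
    residuals = X @ w - y                  # shape (n,)
    return 2 * residuals[:, None] * X      # shape (n, d)

def update(w, X, y, alpha, idx=None):
    # Average the per-example gradients over the subset `idx`
    # (or over all examples if idx is None), then step: w <- w - alpha * grad.
    if idx is not None:
        X, y = X[idx], y[idx]
    grad = per_example_grads(w, X, y).mean(axis=0)
    return w - alpha * grad
```

Full-batch, pure SGD, and mini-batch gradient descent differ only in which `idx` gets passed in.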

Example

Full dataset gradient (n=10,000 examples): exact but slow.

Single example gradient: fast but noisy.

Mini-batch of 64: the practical middle ground.

The Scaling Problem

Pure gradient descent computes the gradient using all training examples at every step:

\nabla L = \frac{1}{n} \sum_{i=1}^{n} \nabla L_i

  • \nabla L - full-batch gradient: the exact average over all n examples
  • \nabla L_i - gradient for example i alone
  • n - total number of training examples

For n = 1,000,000 examples, each gradient step requires: forward pass for all 1M examples, loss for all 1M examples, backward pass for all 1M examples, then average. One parameter update requires processing a million examples. With a large model, this takes minutes — and you need thousands of updates to converge.
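A back-of-the-envelope comparison (illustrative numbers; the update count is made up and real runs differ): the total work scales with how many examples each update touches.

```python
n = 1_000_000        # training examples
updates = 5_000      # illustrative number of gradient updates

# Full-batch: every update processes every example.
print(f"{n * updates:,} example passes")    # 5,000,000,000

# Mini-batch of 64: each update processes only 64 examples.
print(f"{64 * updates:,} example passes")   # 320,000
```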

If you had to find the lowest valley in a mountain range, would you survey every inch of terrain before taking a single step — or would you sample the ground under your feet, take a step in the right direction, and repeat? Full-batch gradient descent tries to survey everything first. That approach becomes prohibitively slow once your dataset has millions of examples. Stochastic gradient descent makes the obvious trade-off: use a small sample to estimate the direction, move fast, and repeat.

This is called full-batch gradient descent, and for large datasets it's completely impractical.

The Other Extreme: Pure SGD

Stochastic gradient descent (SGD) uses just one randomly chosen training example per update:

\nabla L \approx \nabla L_i \quad \text{(one random example)}

  • i - randomly chosen example index; changes every step
  • \nabla L_i - gradient computed on example i alone: a noisy estimate of the true gradient

This is extremely fast — one forward/backward pass per update. But the gradient estimate is highly noisy: one example might be unusual or unrepresentative. Updates zigzag toward the minimum rather than walking a smooth path.
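A minimal pure-SGD loop, under the same kind of toy squared-error setup (illustrative data and learning rate, not from this lesson); the key line is that each update sees exactly one randomly chosen example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
alpha = 0.01
for step in range(20_000):
    i = rng.integers(n)                        # pick ONE random example
    grad_i = 2 * (X[i] @ w - y[i]) * X[i]      # noisy estimate of the full gradient
    w -= alpha * grad_i                        # cheap update; the path zig-zags
```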

Mini-Batch: The Sweet Spot

The standard approach in modern ML is mini-batch gradient descent:

\nabla L \approx \nabla L_{\text{batch}} = \frac{1}{B} \sum_{i \in \text{batch}} \nabla L_i

  • B - batch size: the number of examples sampled each step, typically 32, 64, or 128
  • \nabla L_{\text{batch}} - mini-batch gradient estimate: the average over the B selected examples

Why batch sizes of 32–128?

  • Statistical quality: averaging 32–128 examples gives a gradient pointing in roughly the right direction
  • Speed: far faster than computing over all n examples per step
  • Hardware utilization: GPUs process a batch of 64 almost as fast as a single example, because the hardware parallelizes across the batch — matrix operations on batches are extremely efficient
  • Memory: only B examples need to fit in GPU memory at once, not the full dataset
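For comparison, a sketch of the mini-batch version (again a toy setup with made-up sizes); the per-batch gradient is a single vectorized computation, which is exactly what makes it hardware-friendly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 10_000, 5, 64
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
alpha = 0.05
for step in range(2_000):
    idx = rng.choice(n, size=B, replace=False)   # sample a mini-batch of B examples
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / B          # average gradient over the batch
    w -= alpha * grad
```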

Epoch vs. Iteration

Two terms you'll see constantly:

  • Iteration: one gradient update, processing one mini-batch of B examples.
  • Epoch: one complete pass through the entire training dataset.

Example: 10,000 training examples, batch size 100:

  • Iterations per epoch: 10,000 / 100 = 100
  • Training for 50 epochs: 50 × 100 = 5,000 total gradient updates
  • Data is shuffled each epoch before batching, so each example appears in a different batch each time (see the loop sketch after this list)
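The same bookkeeping as a sketch (the `train_step` call is a hypothetical placeholder; shuffling with a permutation and slicing into consecutive batches is the standard pattern):

```python
import numpy as np

n, B, epochs = 10_000, 100, 50
rng = np.random.default_rng(0)

iterations_per_epoch = n // B                # 100 updates per pass over the data
total_updates = epochs * iterations_per_epoch
print(iterations_per_epoch, total_updates)   # 100, 5000

for epoch in range(epochs):
    order = rng.permutation(n)               # reshuffle the example order every epoch
    for start in range(0, n, B):
        batch_idx = order[start:start + B]   # one mini-batch = one iteration
        # train_step(X[batch_idx], y[batch_idx])   # hypothetical update call
```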

Batch Size as a Hyperparameter

Batch size affects not just speed but generalization:

  • Large batches (256–1024): lower-noise gradient estimates, stable training, better GPU utilization — but weaker regularization effect and sometimes worse final generalization
  • Small batches (8–32): noisier gradients act as implicit regularization, sometimes better generalization — but less efficient GPU utilization and noisier loss curves

The noise in small-batch training prevents the model from overfitting specific data subsets — it acts like a regularizer. Start with batch size 32 or 64; tune if needed.
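One way to see the statistical side of this trade-off is to measure how far mini-batch gradient estimates fall from the exact full-batch gradient as B grows. An illustrative toy measurement (not from this lesson) looks like this, with the spread shrinking roughly like 1/√B:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50_000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)
w = np.zeros(d)

full_grad = 2 * X.T @ (X @ w - y) / n             # exact full-batch gradient

for B in (8, 32, 128, 512):
    errors = []
    for _ in range(200):                           # many random mini-batches of size B
        idx = rng.choice(n, size=B, replace=False)
        g = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / B
        errors.append(np.linalg.norm(g - full_grad))
    print(B, round(float(np.mean(errors)), 3))     # error shrinks roughly like 1/sqrt(B)
```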

Interactive example

Compare full-batch vs mini-batch convergence paths on a 2D loss surface - see the noisy zig-zag vs the smooth descent

Coming soon
