The Scaling Problem
Pure gradient descent computes the gradient using all training examples at every step:
- $\nabla L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla L_i(\theta)$ - full-batch gradient: exact average over all $n$ examples
- $\nabla L_i(\theta)$ - gradient for example $i$ alone
- $n$ - total number of training examples
For $n = 1{,}000{,}000$ examples, each gradient step requires: a forward pass for all 1M examples, the loss for all 1M examples, a backward pass for all 1M examples, then the average. One parameter update requires processing a million examples. With a large model, this takes minutes — and you need thousands of updates to converge.
If you had to find the lowest valley in a mountain range, would you survey every inch of terrain before taking a single step — or would you sample the ground under your feet, take a step in the right direction, and repeat? Full-batch gradient descent tries to survey everything first, and for datasets with millions of examples that approach is completely impractical. Stochastic gradient descent makes the obvious trade-off: use a small sample to estimate the direction, move fast, and repeat.
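To make the cost structure concrete, here is a minimal sketch of full-batch gradient descent on a toy least-squares problem. The function name, learning rate, and toy data are illustrative assumptions, not from the text:

```python
import numpy as np

def full_batch_gd(X, y, lr=0.1, steps=500):
    """Full-batch gradient descent for least squares: every single
    update touches ALL n examples."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        residuals = X @ theta - y      # "forward pass" over all n examples
        grad = X.T @ residuals / n     # exact average: (1/n) * sum_i grad_i
        theta -= lr * grad             # one update per full pass over the data
    return theta

# Toy data with an exact solution theta* = [2, -3]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0])
theta = full_batch_gd(X, y)
```

With 200 examples this is instant; the point is that the per-update cost scales linearly with $n$, so at $n = 10^6$ the same loop does a million gradient evaluations per step.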
The Other Extreme: Pure SGD
Stochastic gradient descent (SGD) uses just one randomly chosen training example per update:
- $i$ - randomly chosen example index; changes every step
- $\nabla L_i(\theta)$ - gradient computed on example $i$ alone; a noisy estimate of the true gradient $\nabla L(\theta)$
This is extremely fast — one forward/backward pass per update. But the gradient estimate is highly noisy: one example might be unusual or unrepresentative. Updates zigzag toward the minimum rather than walking a smooth path.
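The zigzag behavior is easy to reproduce. Here is a sketch of pure SGD on the same kind of toy least-squares problem (names and hyperparameters are illustrative assumptions):

```python
import numpy as np

def pure_sgd(X, y, lr=0.01, steps=5000, seed=0):
    """Pure SGD: each update uses ONE randomly chosen example, so each
    step is cheap but the gradient estimate is noisy."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)                    # random example index, new every step
        grad_i = (X[i] @ theta - y[i]) * X[i]  # gradient on example i alone
        theta -= lr * grad_i                   # noisy update: the path zigzags
    return theta

# Noiseless toy problem: every per-example gradient vanishes at the optimum,
# so even pure SGD settles near theta* = [2, -3]
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0])
theta = pure_sgd(X, y)
```

Each step here costs one gradient evaluation instead of $n$, which is exactly the speed/noise trade described above.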
Mini-Batch: The Sweet Spot
The standard approach in modern ML is mini-batch gradient descent:
- $B$ - batch size: number of examples sampled each step. Typically 32, 64, or 128
- $\nabla L_{\mathcal{B}}(\theta) = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla L_i(\theta)$ - mini-batch gradient estimate: average over the $B$ selected examples in the sampled batch $\mathcal{B}$
Why batch sizes of 32–128?
- Statistical quality: averaging 32–128 examples gives a gradient pointing in roughly the right direction
- Speed: far faster than computing the gradient over all $n$ examples each step
- Hardware utilization: GPUs process a batch of 64 almost as fast as a single example, because the hardware parallelizes across the batch — matrix operations on batches are extremely efficient
- Memory: only $B$ examples need to fit in GPU memory at once, not the full dataset
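As a sketch, mini-batch gradient descent changes only one line relative to the two extremes: average the per-example gradients over a random batch of $B$ examples. The data and hyperparameters below are illustrative assumptions:

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, lr=0.1, steps=500, seed=0):
    """Mini-batch gradient descent: each update averages per-example
    gradients over a random batch of B examples."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        batch = rng.choice(n, size=batch_size, replace=False)  # sample B indices
        residuals = X[batch] @ theta - y[batch]
        grad = X[batch].T @ residuals / batch_size  # average over the B examples
        theta -= lr * grad    # low-noise estimate of the full-batch direction
    return theta

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.0])
theta = minibatch_gd(X, y)
```

Note that the per-step cost is $B$ gradient evaluations rather than $n$, while the averaging keeps the update direction far less noisy than pure SGD's.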
Epoch vs. Iteration
Two terms you'll see constantly:
- Iteration: one gradient update, processing one mini-batch of $B$ examples.
- Epoch: one complete pass through the entire training dataset.
Example: 10,000 training examples, batch size 100:
- Iterations per epoch: $10{,}000 / 100 = 100$
- Training for 50 epochs: $50 \times 100 = 5{,}000$ total gradient updates
- Data is shuffled each epoch before batching, so each example appears in a different batch each time
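The bookkeeping above can be checked with a short sketch of the standard shuffle-then-batch training loop. The loop skeleton is an illustrative assumption; the actual gradient update is omitted:

```python
import numpy as np

n, batch_size, num_epochs = 10_000, 100, 50
iters_per_epoch = n // batch_size  # 10,000 / 100 = 100 iterations per epoch

rng = np.random.default_rng(0)
indices = np.arange(n)

total_updates = 0
for epoch in range(num_epochs):
    rng.shuffle(indices)           # reshuffle: each example lands in a new batch
    for start in range(0, n, batch_size):
        batch = indices[start:start + batch_size]  # one mini-batch = one iteration
        # ... forward pass, loss, backward pass, parameter update go here ...
        total_updates += 1         # one gradient update per iteration

print(total_updates)               # 50 epochs x 100 iterations = 5000 updates
```

Reshuffling at the top of each epoch is what makes every example appear in a different batch from one epoch to the next.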
Batch Size as a Hyperparameter
Batch size affects not just speed but generalization:
- Large batches (256–1024): lower-noise gradient estimates, stable training, better GPU utilization — but weaker regularization effect and sometimes worse final generalization
- Small batches (8–32): noisier gradients act as implicit regularization, sometimes better generalization — but less efficient GPU utilization and noisier loss curves
The noise in small-batch training prevents the model from overfitting specific data subsets — it acts like a regularizer. Start with batch size 32 or 64; tune if needed.
Interactive example
Compare full-batch vs mini-batch convergence paths on a 2D loss surface - see the noisy zig-zag vs the smooth descent
Coming soon