What Are Hyperparameters?
Model parameters (like weights) are learned from data during training. Hyperparameters are the configuration choices made before training starts that control how learning happens.
They cannot be learned by gradient descent because they are not differentiable components of the loss. You have to set them manually or search over them.
Hyperparameter tuning is often the difference between a model that trains and one that doesn't. The learning rate alone can make training converge in hours or diverge completely — and in large training runs, a bad hyperparameter choice can waste days of compute before you notice.
Common hyperparameters:
- Learning rate: controls step size in gradient descent
- Batch size: how many examples per gradient update
- Number of layers / width: model architecture choices
- Regularization strength: for L1/L2 regularization
- Dropout rate: fraction of neurons randomly zeroed
- Number of epochs / early stopping patience: how long to train
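As a concrete sketch, these choices are often gathered in one place before training starts. The names and values below are illustrative, not recommendations:

```python
# A typical hyperparameter configuration, collected in a single dict.
# All values here are illustrative starting points, nothing more.
config = {
    "learning_rate": 3e-4,   # step size for gradient descent
    "batch_size": 64,        # examples per gradient update
    "num_layers": 4,         # architecture depth
    "hidden_width": 256,     # architecture width
    "weight_decay": 1e-4,    # L2 regularization strength
    "dropout": 0.1,          # fraction of neurons randomly zeroed
    "max_epochs": 50,        # upper bound on training length
    "patience": 5,           # early-stopping patience, in epochs
}
```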
The Learning Rate Is King
No other hyperparameter has as large an impact on training success as the learning rate.
The gradient descent update is θ_{t+1} = θ_t − η ∇_θ L(θ_t), where:
- θ_t: parameters at step t
- η: learning rate
- ∇_θ L(θ_t): gradient of loss with respect to parameters
- Too large: parameters overshoot the minimum. Loss oscillates or diverges. Training fails.
- Too small: training is extremely slow. Parameters barely move per step.
- Just right: loss decreases smoothly and quickly.
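All three regimes can be seen on a toy problem. A minimal sketch, minimizing f(θ) = θ² (gradient 2θ) with made-up learning rates for each regime:

```python
def gd(lr, steps=50, theta0=1.0):
    """Gradient descent on f(theta) = theta**2, whose gradient is 2*theta."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * 2 * theta   # theta_{t+1} = theta_t - lr * grad
    return theta

too_small  = gd(lr=0.01)  # |theta| shrinks only 2% per step: still far from 0
just_right = gd(lr=0.45)  # |theta| shrinks 10x per step: effectively 0
too_large  = gd(lr=1.1)   # |theta| grows 1.2x per step: diverges
```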
Learning rate schedules help: start larger and decay over time. Common schedules include step decay (drop the lr by a constant factor every few epochs), exponential decay, cosine annealing, and linear warmup followed by decay.
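As an illustration, cosine annealing fits in a few lines. A generic sketch, not tied to any framework:

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing: decay smoothly from lr_max at step 0 to lr_min."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```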
Finding the learning rate: do a "learning rate range test." Run training for one epoch with the lr increasing exponentially from a very small value (e.g., 10⁻⁷) to a large one (e.g., 10). Plot loss vs. lr. Pick the lr where the loss decreases most steeply, before it starts diverging. This is typically one order of magnitude below the divergence point.
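A sketch of the exponential lr ramp used in such a range test, with illustrative endpoints of 10⁻⁷ and 10 (the training loop itself is omitted):

```python
def range_test_lr(step, total_steps, lr_start=1e-7, lr_end=10.0):
    """Exponential ramp from lr_start to lr_end over total_steps updates."""
    return lr_start * (lr_end / lr_start) ** (step / total_steps)
```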
Batch Size Trade-offs
The batch size is how many examples you use for each gradient update: the mini-batch gradient averages B per-example gradients, where:
- B: batch size
- N: total training set size
Larger batches:
- More accurate gradient estimates (less noise)
- Higher throughput (better GPU utilization)
- Often require larger learning rates to compensate for reduced noise
- Can generalize slightly worse (sharp minima, not flat minima)
Smaller batches:
- Noisier gradients (but sometimes noise helps escape local minima)
- More parameter updates per epoch
- Lower memory requirement
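The noise point above can be checked with a self-contained simulation using synthetic gradients (all numbers here are made up): averaging B noisy per-example gradients shrinks the standard deviation of the batch gradient by roughly √B.

```python
import random

def batch_grad_std(batch_size, true_grad=1.0, noise=1.0, trials=10_000, seed=0):
    """Empirical std of a mini-batch gradient estimate: each per-example
    gradient is true_grad plus Gaussian noise; averaging over the batch
    shrinks the noise by roughly sqrt(batch_size)."""
    rng = random.Random(seed)
    estimates = [
        sum(true_grad + rng.gauss(0.0, noise) for _ in range(batch_size)) / batch_size
        for _ in range(trials)
    ]
    mean = sum(estimates) / trials
    return (sum((g - mean) ** 2 for g in estimates) / trials) ** 0.5
```

With batch size 1 the std is about 1.0; with batch size 64 it drops to roughly 1/8, i.e. a factor of √64.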
Typical starting points: batch size 32-256 for images, 16-64 for text. Batch sizes that are powers of 2 are the convention, for hardware efficiency.
Important: if you multiply the batch size by a factor k, scale the learning rate by √k (or linearly, by k, for large batches). The linear variant is the linear scaling rule.
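A minimal sketch of both scaling rules (the function name and signature are my own):

```python
def scale_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Rescale the learning rate when the batch size changes by a factor k."""
    k = new_batch / base_batch
    if rule == "linear":   # linear scaling rule: lr *= k
        return base_lr * k
    if rule == "sqrt":     # square-root scaling: lr *= sqrt(k)
        return base_lr * k ** 0.5
    raise ValueError(f"unknown rule: {rule!r}")
```

For example, going from batch 32 at lr 0.1 to batch 128 gives lr 0.4 under linear scaling and 0.2 under square-root scaling.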
Search Strategies
Grid Search
Try all combinations of a discrete set of values for each hyperparameter.
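A minimal grid search sketch. The search space and the stand-in evaluator below are hypothetical; a real evaluator would train and validate a model:

```python
from itertools import product

# Hypothetical discrete search space.
grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "weight_decay": [0.0, 1e-4],
    "dropout": [0.0, 0.1],
}

def grid_search(grid, evaluate):
    """Evaluate every combination (here 3 * 2 * 2 = 12) and keep the best."""
    best_score, best_config = float("-inf"), None
    for values in product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        score = evaluate(config)          # would train + validate a model
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config

# Stand-in evaluator for the demo only.
best_score, best_config = grid_search(
    grid, lambda c: 1.0 if c["learning_rate"] == 1e-3 else 0.0
)
```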
Random Search
Sample each hyperparameter independently from its distribution. Run N random trials and keep the best:
- Continuous hyperparameters (learning rate, regularization): sample log-uniformly on appropriate scale
- Integer hyperparameters (layers, width): sample uniformly over a range
- Categorical (activation function, optimizer): sample uniformly
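A sketch of this sampling scheme; the search space below is hypothetical:

```python
import random

def sample_config(rng):
    """One random trial: log-uniform for continuous scales,
    uniform for integers, uniform for categoricals."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),         # log-uniform on [1e-5, 1e-1]
        "weight_decay": 10 ** rng.uniform(-6, -2),          # log-uniform on [1e-6, 1e-2]
        "num_layers": rng.randint(2, 8),                    # uniform integer
        "optimizer": rng.choice(["sgd", "adam", "adamw"]),  # uniform categorical
    }

rng = random.Random(0)
trials = [sample_config(rng) for _ in range(100)]
```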
Random search finds good configurations much faster than grid search when the number of hyperparameters is more than 2-3. The intuition: important hyperparameters get explored at many different values regardless of what the other (irrelevant) hyperparameters are doing.
Bayesian Optimization
Builds a probabilistic model of the hyperparameter-to-performance landscape, uses it to decide which configuration to try next. Balances exploration (trying uncertain regions) and exploitation (trying regions near known good configurations).
Tools: Optuna, Weights & Biases sweeps, Ray Tune, HyperOpt.
Bayesian optimization can often find in 20-50 trials a configuration that random search might need 100-200 trials to match. Worth the setup cost for expensive experiments.
Practical Tuning Priority
Given a limited compute budget, tune in this order:
1. Learning rate - tune this first, always. Try 3-5 values spanning two orders of magnitude.
2. Batch size - has both performance and compute implications.
3. Regularization (weight decay, dropout) - tune after the architecture is mostly fixed.
4. Architecture (layers, width) - changing architecture invalidates all previous tuning, so do it last.
5. Other (optimizer choice, activation function, normalization) - usually matters less than the above.
The key discipline: one change at a time. If you change three things simultaneously and performance improves, you do not know which change caused it.