Putting It Together
Lesson 3 ⏱ 12 min

Hyperparameter tuning


Hyperparameter Tuning: What to Tune and How

Learning rate, batch size, architecture choices. Grid search, random search, and Bayesian optimization. Practical priorities for a limited compute budget.

⏱ ~6 min


Quick refresher

Training from scratch vs. transfer learning

Pretrained models transfer learned representations to new tasks. Fine-tuning uses small learning rates to update pretrained weights without destroying them. Early stopping prevents overfitting by saving the best checkpoint.

Example

An ImageNet-pretrained ResNet fine-tuned on 2,000 medical images: freeze the backbone first, then unfreeze with lr=1e-5, monitoring validation AUC for early stopping.

What Are Hyperparameters?

Model parameters (like weights $\mathbf{W}$) are learned from data during training. Hyperparameters are the configuration choices made before training starts that control how learning happens.

They cannot be learned by gradient descent because they are not differentiable components of the loss. You have to set them manually or search over them.

Hyperparameter tuning is often the difference between a model that trains and one that doesn't. The learning rate alone can make training converge in hours or diverge completely — and in large training runs, a bad hyperparameter choice can waste days of compute before you notice.

Common hyperparameters:

  • Learning rate : controls step size in gradient descent
  • Batch size: how many examples per gradient update
  • Number of layers / width: model architecture choices
  • Regularization strength : for L1/L2 regularization
  • Dropout rate: fraction of neurons randomly zeroed
  • Number of epochs / early stopping patience: how long to train
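In practice these choices are often collected in one configuration object so every run is reproducible and comparable. A minimal sketch (all names and values here are illustrative, not prescriptions):

```python
# Illustrative hyperparameter configuration for one training run.
config = {
    "learning_rate": 3e-4,     # step size in gradient descent
    "batch_size": 64,          # examples per gradient update
    "num_layers": 4,           # architecture depth
    "hidden_width": 256,       # architecture width
    "weight_decay": 1e-4,      # L2 regularization strength
    "dropout": 0.1,            # fraction of neurons randomly zeroed
    "max_epochs": 50,          # upper bound on training length
    "early_stop_patience": 5,  # epochs without improvement before stopping
}
```

Logging this object alongside each run's metrics is what makes the search strategies below possible.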

The Learning Rate Is King

No other hyperparameter has as large an impact on training success as the learning rate.

$$\theta_{t+1} = \theta_t - \alpha \cdot \nabla_{\theta} L(\theta_t)$$

where:

  • $\theta_t$: parameters at step $t$
  • $\alpha$: learning rate
  • $\nabla_{\theta} L$: gradient of the loss with respect to the parameters

  • Too large $\alpha$: parameters overshoot the minimum. Loss oscillates or diverges. Training fails.
  • Too small $\alpha$: training is extremely slow. Parameters barely move per step.
  • Just right: loss decreases smoothly and quickly.
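The three regimes are easy to see on a toy quadratic loss $L(\theta) = \theta^2$, whose gradient is $2\theta$. A sketch of the update rule above, with no framework assumed:

```python
def sgd(theta, lr, steps=50):
    """Gradient descent on L(theta) = theta**2, with gradient 2 * theta."""
    for _ in range(steps):
        theta = theta - lr * 2 * theta
    return theta

start = 1.0
too_large = sgd(start, lr=1.1)   # update factor |1 - 2.2| > 1: diverges
too_small = sgd(start, lr=1e-4)  # barely moves from the starting point
just_right = sgd(start, lr=0.1)  # shrinks smoothly toward the minimum at 0
```

With lr=1.1 the iterate grows by 20% in magnitude every step; with lr=1e-4 it is still near 1.0 after 50 steps; with lr=0.1 it is within rounding of 0.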

Learning rate schedules help: start larger and decay over time. Common schedules include step decay, cosine annealing, and linear warmup followed by decay.
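Two widely used schedules, step decay and cosine annealing, take only a few lines each. A sketch assuming the schedule is evaluated once per epoch (parameter names are illustrative):

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Multiply the learning rate by `drop` every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def cosine_anneal(lr0, epoch, total_epochs, lr_min=0.0):
    """Decay from lr0 to lr_min along a half cosine over training."""
    t = epoch / total_epochs
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * t))
```

Step decay gives abrupt drops; cosine annealing decays smoothly and spends proportionally more time at small learning rates near the end of training.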

Finding the learning rate: do a "learning rate range test." Run training for one epoch with the lr increasing from $10^{-7}$ to $10^{0}$. Plot loss vs. lr. Pick the lr where the loss decreases most steeply, before it starts diverging. This is typically one order of magnitude below the divergence point.
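The range test loop can be sketched as follows, assuming a `train_step(lr)` function (hypothetical here) that runs one mini-batch at the given learning rate and returns the loss:

```python
import math

def lr_range_test(train_step, lr_start=1e-7, lr_end=1.0, num_steps=100):
    """Increase lr exponentially each step and record (lr, loss) pairs.
    Stops early once the loss blows up past 4x the best loss seen."""
    ratio = (lr_end / lr_start) ** (1 / (num_steps - 1))
    history, best = [], math.inf
    lr = lr_start
    for _ in range(num_steps):
        loss = train_step(lr)
        history.append((lr, loss))
        best = min(best, loss)
        if loss > 4 * best:  # divergence point reached
            break
        lr *= ratio
    return history  # plot loss vs. lr; pick ~10x below the divergence point
```

The 4x blow-up threshold is a common heuristic, not a fixed rule; any cutoff that stops the sweep shortly after divergence works.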

Batch Size Trade-offs

The batch size is how many examples you use for each gradient update:

$$\text{steps per epoch} = \left\lceil \frac{m}{B} \right\rceil$$

where $B$ is the batch size and $m$ is the total training set size.
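For example, with $m = 50{,}000$ training examples and batch size $B = 64$:

```python
import math

m, B = 50_000, 64
steps_per_epoch = math.ceil(m / B)  # 50000 / 64 = 781.25, rounded up
print(steps_per_epoch)  # 782 (the last batch holds the remaining 16 examples)
```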

Larger batches:

  • More accurate gradient estimates (less noise)
  • Higher throughput (better GPU utilization)
  • Often require larger learning rates to compensate for reduced noise
  • Can generalize slightly worse (sharp minima, not flat minima)

Smaller batches:

  • Noisier gradients (but sometimes noise helps escape local minima)
  • More parameter updates per epoch
  • Lower memory requirement

Typical starting points: batch size 32-256 for images, 16-64 for text. Prefer powers of 2 for the batch size, which tend to map more efficiently onto GPU hardware.

Important: if you change the batch size by a factor of $k$, scale the learning rate by $\sqrt{k}$ (square-root scaling), or linearly by $k$ for large batches (the linear scaling rule).
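Both scaling heuristics amount to one multiplication; a small sketch (the helper name is illustrative):

```python
def scale_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Rescale a tuned learning rate after a batch-size change.
    'linear' multiplies by k = new_batch / base_batch; 'sqrt' by sqrt(k)."""
    k = new_batch / base_batch
    return base_lr * (k if rule == "linear" else k ** 0.5)

# Tuned at batch 32 with lr 1e-3, moving to batch 256 (k = 8):
linear = scale_lr(1e-3, 32, 256)                      # 8e-3
conservative = scale_lr(1e-3, 32, 256, rule="sqrt")   # ~2.83e-3
```

Either result is a starting point for re-tuning, not a final answer; the scaled value usually lands close enough that a short sweep around it suffices.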

Search Strategies

Grid search: try all combinations of a discrete set of values for each hyperparameter.

Random search: sample each hyperparameter independently from its distribution. Run $n$ random trials and take the best:

  • Continuous hyperparameters (learning rate, regularization): sample log-uniformly on appropriate scale
  • Integer hyperparameters (layers, width): sample uniformly over a range
  • Categorical (activation function, optimizer): sample uniformly

Random search finds good configurations much faster than grid search when the number of hyperparameters is more than 2-3. The intuition: important hyperparameters get explored at many different values regardless of what the other (irrelevant) hyperparameters are doing.
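The three sampling rules above fit in a short standard-library sketch; `evaluate` stands in for a full training run and is purely illustrative:

```python
import random

def sample_config(rng):
    return {
        "lr": 10 ** rng.uniform(-5, -1),        # log-uniform over [1e-5, 1e-1]
        "num_layers": rng.randint(2, 8),        # uniform over an integer range
        "optimizer": rng.choice(["sgd", "adam", "rmsprop"]),  # uniform categorical
    }

def random_search(evaluate, n_trials=50, seed=0):
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = sample_config(rng)
        score = evaluate(cfg)  # in practice: train a model, return validation metric
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

Note the log-uniform sampling for the learning rate: sampling uniformly over [1e-5, 1e-1] would spend almost all trials above 1e-2 and barely explore the small-lr region.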

Bayesian Optimization

Bayesian optimization builds a probabilistic model of the hyperparameter-to-performance landscape and uses it to decide which configuration to try next, balancing exploration (trying uncertain regions) with exploitation (trying regions near known good configurations).

Tools: Optuna, Weights & Biases sweeps, Ray Tune, HyperOpt.

Bayesian optimization finds good configurations in 20-50 trials that random search might need 100-200 to find. Worth the setup for expensive experiments.
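Real tools use Gaussian processes or tree-structured estimators as the surrogate model. The control flow, though, can be sketched with a deliberately crude nearest-neighbor surrogate over a single hyperparameter (a toy to show the explore/exploit loop, not a usable optimizer):

```python
import random

def toy_bayes_opt(evaluate, lo, hi, n_trials=20, kappa=0.5, seed=0):
    """Toy Bayesian optimization over one hyperparameter in [lo, hi].
    Surrogate: score of the nearest observed point.
    Uncertainty: distance to the nearest observed point.
    Acquisition: surrogate + kappa * uncertainty (exploit + explore)."""
    rng = random.Random(seed)
    first = rng.uniform(lo, hi)
    observed = [(first, evaluate(first))]
    for _ in range(n_trials - 1):
        def acquisition(x):
            nearest_x, nearest_score = min(observed, key=lambda o: abs(o[0] - x))
            return nearest_score + kappa * abs(x - nearest_x)
        # Pick the most promising of 100 random candidate configurations.
        candidates = [rng.uniform(lo, hi) for _ in range(100)]
        x_next = max(candidates, key=acquisition)
        observed.append((x_next, evaluate(x_next)))
    return max(observed, key=lambda o: o[1])
```

The key difference from random search is that each trial is chosen using everything learned so far, which is what lets the real tools listed above get away with far fewer trials.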


Practical Tuning Priority

Given a limited compute budget, tune in this order:

  1. Learning rate - tune this first, always. Try 3-5 values spanning two orders of magnitude.
  2. Batch size - has both performance and compute implications.
  3. Regularization (weight decay, dropout) - tune after the architecture is mostly fixed.
  4. Architecture (layers, width) - changing architecture invalidates all previous tuning, so do it last.
  5. Other (optimizer choice, activation function, normalization) - usually matters less than the above.

The key discipline: one change at a time. If you change three things simultaneously and performance improves, you do not know which change caused it.

Quiz


Which hyperparameter typically has the highest impact on training success?