What Are Hyperparameters?
Model parameters (like weights) are learned from data during training. Hyperparameters are the configuration choices made before training starts that control how learning happens.
They cannot be learned by gradient descent because they are not differentiable components of the loss. You have to set them manually or search over them.
Hyperparameter tuning is often the difference between a model that trains and one that doesn't. The learning rate alone can make training converge in hours or diverge completely — and in large training runs, a bad hyperparameter choice can waste days of compute before you notice.
Common hyperparameters:
- Learning rate: controls step size in gradient descent
- Batch size: how many examples per gradient update
- Number of layers / width: model architecture choices
- Regularization strength: for L1/L2 regularization
- Dropout rate: fraction of neurons randomly zeroed
- Number of epochs / early stopping patience: how long to train
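As a concrete sketch, these choices are often gathered in one place before training starts. The names and values below are illustrative, not recommendations:

```python
# A typical hyperparameter configuration, collected in a single dict.
# All values here are illustrative starting points, nothing more.
config = {
    "learning_rate": 3e-4,   # step size for gradient descent
    "batch_size": 64,        # examples per gradient update
    "num_layers": 4,         # architecture depth
    "hidden_width": 256,     # architecture width
    "weight_decay": 1e-4,    # L2 regularization strength
    "dropout": 0.1,          # fraction of neurons randomly zeroed
    "max_epochs": 50,        # upper bound on training length
    "patience": 5,           # early-stopping patience, in epochs
}
```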
The Learning Rate Is King
No other hyperparameter has as large an impact on training success as the learning rate.
The gradient descent update is θ_{t+1} = θ_t − η ∇_θ L(θ_t), where:
- θ_t: parameters at step t
- η: learning rate
- ∇_θ L(θ_t): gradient of loss with respect to parameters
- Too large: parameters overshoot the minimum. Loss oscillates or diverges. Training fails.
- Too small: training is extremely slow. Parameters barely move per step.
- Just right: loss decreases smoothly and quickly.
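All three regimes can be seen on a toy problem. A minimal sketch, minimizing f(θ) = θ² (gradient 2θ) with made-up learning rates for each regime:

```python
def gd(lr, steps=50, theta0=1.0):
    """Gradient descent on f(theta) = theta**2, whose gradient is 2*theta."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * 2 * theta   # theta_{t+1} = theta_t - lr * grad
    return theta

too_small  = gd(lr=0.01)  # |theta| shrinks only 2% per step: still far from 0
just_right = gd(lr=0.45)  # |theta| shrinks 10x per step: effectively 0
too_large  = gd(lr=1.1)   # |theta| grows 1.2x per step: diverges
```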
Learning rate schedules help: start larger and decay over time. Common schedules include step decay (drop the lr by a constant factor every few epochs), exponential decay, cosine annealing, and linear warmup followed by decay.
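As an illustration, cosine annealing fits in a few lines. A generic sketch, not tied to any framework:

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing: decay smoothly from lr_max at step 0 to lr_min."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```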
Finding the learning rate: do a "learning rate range test." Run training for one epoch with the lr increasing exponentially from a very small value (e.g., 10⁻⁷) to a large one (e.g., 10). Plot loss vs. lr. Pick the lr where the loss decreases most steeply, before it starts diverging. This is typically one order of magnitude below the divergence point.
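A sketch of the exponential lr ramp used in such a range test, with illustrative endpoints of 10⁻⁷ and 10 (the training loop itself is omitted):

```python
def range_test_lr(step, total_steps, lr_start=1e-7, lr_end=10.0):
    """Exponential ramp from lr_start to lr_end over total_steps updates."""
    return lr_start * (lr_end / lr_start) ** (step / total_steps)
```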
Batch Size Trade-offs
The batch size is how many examples you use for each gradient update: the mini-batch gradient averages B per-example gradients, where:
- B: batch size
- N: total training set size
Larger batches:
- More accurate gradient estimates (less noise)
- Higher throughput (better GPU utilization)
- Often require larger learning rates to compensate for reduced noise
- Can generalize slightly worse (sharp minima, not flat minima)
Smaller batches:
- Noisier gradients (but sometimes noise helps escape local minima)
- More parameter updates per epoch
- Lower memory requirement
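The noise point above can be checked with a self-contained simulation using synthetic gradients (all numbers here are made up): averaging B noisy per-example gradients shrinks the standard deviation of the batch gradient by roughly √B.

```python
import random

def batch_grad_std(batch_size, true_grad=1.0, noise=1.0, trials=10_000, seed=0):
    """Empirical std of a mini-batch gradient estimate: each per-example
    gradient is true_grad plus Gaussian noise; averaging over the batch
    shrinks the noise by roughly sqrt(batch_size)."""
    rng = random.Random(seed)
    estimates = [
        sum(true_grad + rng.gauss(0.0, noise) for _ in range(batch_size)) / batch_size
        for _ in range(trials)
    ]
    mean = sum(estimates) / trials
    return (sum((g - mean) ** 2 for g in estimates) / trials) ** 0.5
```

With batch size 1 the std is about 1.0; with batch size 64 it drops to roughly 1/8, i.e. a factor of √64.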
Typical starting points: batch size 32-256 for images, 16-64 for text. Batch sizes that are powers of 2 are the convention, for hardware efficiency.
Important: if you multiply the batch size by a factor k, scale the learning rate by √k (or linearly, by k, for large batches). The linear variant is the linear scaling rule.
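A minimal sketch of both scaling rules (the function name and signature are my own):

```python
def scale_lr(base_lr, base_batch, new_batch, rule="linear"):
    """Rescale the learning rate when the batch size changes by a factor k."""
    k = new_batch / base_batch
    if rule == "linear":   # linear scaling rule: lr *= k
        return base_lr * k
    if rule == "sqrt":     # square-root scaling: lr *= sqrt(k)
        return base_lr * k ** 0.5
    raise ValueError(f"unknown rule: {rule!r}")
```

For example, going from batch 32 at lr 0.1 to batch 128 gives lr 0.4 under linear scaling and 0.2 under square-root scaling.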
Search Strategies
Grid Search
Try all combinations of a discrete set of values for each hyperparameter.
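A minimal grid search sketch. The search space and the stand-in evaluator below are hypothetical; a real evaluator would train and validate a model:

```python
from itertools import product

# Hypothetical discrete search space.
grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "weight_decay": [0.0, 1e-4],
    "dropout": [0.0, 0.1],
}

def grid_search(grid, evaluate):
    """Evaluate every combination (here 3 * 2 * 2 = 12) and keep the best."""
    best_score, best_config = float("-inf"), None
    for values in product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        score = evaluate(config)          # would train + validate a model
        if score > best_score:
            best_score, best_config = score, config
    return best_score, best_config

# Stand-in evaluator for the demo only.
best_score, best_config = grid_search(
    grid, lambda c: 1.0 if c["learning_rate"] == 1e-3 else 0.0
)
```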
Random Search
Sample each hyperparameter independently from its distribution. Run N random trials and keep the best:
- Continuous hyperparameters (learning rate, regularization): sample log-uniformly on appropriate scale
- Integer hyperparameters (layers, width): sample uniformly over a range
- Categorical (activation function, optimizer): sample uniformly
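A sketch of this sampling scheme; the search space below is hypothetical:

```python
import random

def sample_config(rng):
    """One random trial: log-uniform for continuous scales,
    uniform for integers, uniform for categoricals."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),         # log-uniform on [1e-5, 1e-1]
        "weight_decay": 10 ** rng.uniform(-6, -2),          # log-uniform on [1e-6, 1e-2]
        "num_layers": rng.randint(2, 8),                    # uniform integer
        "optimizer": rng.choice(["sgd", "adam", "adamw"]),  # uniform categorical
    }

rng = random.Random(0)
trials = [sample_config(rng) for _ in range(100)]
```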
Random search finds good configurations much faster than grid search when the number of hyperparameters is more than 2-3. The intuition: important hyperparameters get explored at many different values regardless of what the other (irrelevant) hyperparameters are doing.
Bayesian Optimization
Builds a probabilistic model of the hyperparameter-to-performance landscape, uses it to decide which configuration to try next. Balances exploration (trying uncertain regions) and exploitation (trying regions near known good configurations).
Tools: Optuna, Weights & Biases sweeps, Ray Tune, HyperOpt.
Bayesian optimization can often find in 20-50 trials a configuration that random search might need 100-200 trials to match. Worth the setup cost for expensive experiments.
Practical Tuning Priority
Given a limited compute budget, tune in this order:
1. Learning rate - tune this first, always. Try 3-5 values spanning two orders of magnitude.
2. Batch size - has both performance and compute implications.
3. Regularization (weight decay, dropout) - tune after the architecture is mostly fixed.
4. Architecture (layers, width) - changing architecture invalidates all previous tuning, so do it last.
5. Other (optimizer choice, activation function, normalization) - usually matters less than the above.
The key discipline: one change at a time. If you change three things simultaneously and performance improves, you do not know which change caused it.