When a model makes mistakes, there are really only three sources of error. Two of them you can control. One you cannot. Understanding which is which changes how you debug and improve your models.
Bias: Systematic Wrongness
Imagine trying to fit a curved line through curved data using only a straight line. No matter how much data you collect or how long you train, the line will always be wrong in the same predictable way. That is bias.
Bias is the error that comes from wrong assumptions baked into your model's structure. A linear model on data shaped like a parabola has high bias. The model is not flexible enough to represent the truth.
Low-capacity models have high bias. They pre-decide too much about the shape of the relationship before even looking at data.
Variance: Sensitivity to Your Training Set
Now imagine a degree-15 polynomial fit to only 5 data points. It passes perfectly through all 5 - training error is zero. But change two points slightly and you get a completely different wiggly curve. The model memorized your specific training data rather than the underlying pattern.
That is variance. A high-variance model would change dramatically if trained on a different sample from the same distribution.
High-capacity models have high variance. They are flexible enough to memorize noise, not just signal.
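You can see this sensitivity directly by refitting a model on many fresh samples from the same distribution. A minimal sketch with synthetic data (the parabola, noise level, and sample size are my assumptions, not from any particular dataset): fit a line and a degree-15 polynomial to repeated noisy samples and measure how much each one's prediction at a single test point jumps around.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_and_predict(degree, x_test):
    """Fit a polynomial to a fresh noisy sample of a parabola, predict at x_test."""
    x = rng.uniform(-1, 1, 20)
    y = x ** 2 + rng.normal(0, 0.3, x.size)  # truth is x^2 plus noise
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x_test)

# Predictions at x = 0.5 across 200 independent training samples
preds_line = [fit_and_predict(1, 0.5) for _ in range(200)]
preds_wiggly = [fit_and_predict(15, 0.5) for _ in range(200)]

print("degree-1  spread (std):", np.std(preds_line))   # small: low variance
print("degree-15 spread (std):", np.std(preds_wiggly))  # large: high variance
```

The line's prediction barely moves between samples (it is biased, but stable); the degree-15 fit swings wildly, because it chased the noise in each particular sample.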
The Archery Analogy
Picture two archers shooting at a bullseye:
- High-bias archer: every arrow lands 2 feet to the left. Consistently wrong in the same direction - tight grouping, off target.
- High-variance archer: arrows scatter all over - sometimes left, sometimes right. Unpredictable.
Both are bad. What you want is tight grouping in the center: low bias AND low variance. The problem is that fixing one tends to worsen the other.
Interactive example
Bias vs. variance target diagram - drag complexity slider to see arrow scatter change
Coming soon
The Decomposition
The total expected prediction error decomposes into three parts:

Total Error = Bias² + Variance + σ²

- Bias²: how far off your average prediction is from the truth - the error from wrong model assumptions.
- Variance: how spread out predictions are around that average across different training sets - sensitivity to the training sample.
- σ²: the irreducible noise inherent in the data. Even a perfect model cannot predict the exact flip of a coin.
You cannot eliminate irreducible noise. It is your floor.
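The decomposition can be checked numerically. A sketch under assumed conditions (synthetic parabola data, noise σ = 0.3, a single test point at x = 0.9): estimate bias² and variance for several polynomial degrees by averaging over many resampled training sets.

```python
import numpy as np

rng = np.random.default_rng(1)
noise_sd = 0.3
x_test = 0.9
f_true = x_test ** 2  # noiseless truth at the test point

def predict_once(degree):
    """Fit a polynomial to one fresh noisy sample, predict at x_test."""
    x = rng.uniform(-1, 1, 30)
    y = x ** 2 + rng.normal(0, noise_sd, x.size)
    return np.polyval(np.polyfit(x, y, degree), x_test)

results = {}
for degree in (1, 3, 9):
    preds = np.array([predict_once(degree) for _ in range(500)])
    bias_sq = (preds.mean() - f_true) ** 2   # squared gap: average prediction vs truth
    variance = preds.var()                   # spread around that average
    total = bias_sq + variance + noise_sd ** 2  # expected error on a noisy test label
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias²={bias_sq:.4f} var={variance:.4f} total≈{total:.4f}")
```

Degree 1 shows large bias² and tiny variance; degree 9 flips that. The σ² = 0.09 floor is the same in every row - no choice of degree touches it.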
The Complexity Spectrum
Think about fitting polynomial curves to data:
| Complexity | Bias | Variance | Behavior |
|---|---|---|---|
| Degree 1 (line) | High | Low | Misses curves, looks similar across datasets |
| Degree 3-5 | Balanced | Balanced | Usually the sweet spot |
| Degree 15 | Low | High | Memorizes training points, wildly different across datasets |
This pattern generalizes to all model families. Shallow decision trees have high bias; deep unconstrained trees have high variance. Small networks underfit; large networks without regularization overfit.
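The table's pattern shows up as the classic training-vs-test error gap. A minimal sketch (synthetic parabola data again; the sample sizes and noise level are illustrative assumptions): sweep the degree and compare mean squared error on the data the model saw versus held-out data.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):
    """Draw n noisy points from the same underlying parabola."""
    x = rng.uniform(-1, 1, n)
    return x, x ** 2 + rng.normal(0, 0.3, n)

x_train, y_train = sample(20)   # one fixed training set
x_test, y_test = sample(200)    # one fixed held-out test set

train_err, test_err = {}, {}
for degree in (1, 3, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err[degree] = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err[degree] = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err[degree]:.4f}  "
          f"test MSE {test_err[degree]:.4f}")
```

Training error only goes down as complexity grows; test error is U-shaped, bottoming out near the sweet spot in the middle of the table.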
Interactive example
Polynomial degree slider - watch training error vs. test error as complexity grows
Coming soon
Where Regularization Comes In
Regularization techniques deliberately add a small amount of bias in exchange for reducing variance significantly. You constrain the model, pushing it toward simpler solutions so it does not memorize noise.
Think of it as a penalty: "Yes, you could memorize every training example with a wildly complex function, but complexity costs you." The model settles for a slightly worse training fit in exchange for better generalization.
- λ: the regularization strength - controls how much complexity is penalized
- the penalty term (e.g. λ‖w‖² in L2/ridge regularization): a complexity penalty on the model's weights
Adding bias is usually a good trade when your model is overfitting. The bias you add is small and controlled; the variance you reduce can be enormous.
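The trade is easy to demonstrate with ridge regression, which adds an L2 penalty λ‖w‖² to a least-squares fit. A minimal sketch, not a production implementation (the closed-form solve below penalizes the intercept too, which a careful implementation would exempt): fit an overparameterized degree-15 polynomial across many resampled training sets, with and without the penalty, and compare prediction spread.

```python
import numpy as np

rng = np.random.default_rng(3)

def poly_features(x, degree):
    """Vandermonde features: columns 1, x, x², ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(X, y, lam):
    """Closed-form ridge: w = (XᵀX + λI)⁻¹ Xᵀy."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def predict_once(lam, degree=15, x_test=0.5):
    """Fit on one fresh noisy sample of a parabola, predict at x_test."""
    x = rng.uniform(-1, 1, 20)
    y = x ** 2 + rng.normal(0, 0.3, x.size)
    w = ridge_fit(poly_features(x, degree), y, lam)
    return (poly_features(np.array([x_test]), degree) @ w)[0]

# Nearly unregularized (tiny λ for numerical stability) vs. regularized
preds_unreg = [predict_once(1e-6) for _ in range(200)]
preds_ridge = [predict_once(1.0) for _ in range(200)]

print("λ≈0   prediction std:", np.std(preds_unreg))  # high variance
print("λ=1.0 prediction std:", np.std(preds_ridge))  # variance collapses
```

The penalized model's predictions cluster tightly between training sets - the small bias the penalty introduces buys a large drop in variance, which is exactly the trade described above.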
What Is More Dangerous in Practice
For complex models, overfitting (high variance) is usually more dangerous than underfitting (high bias). High-bias failures are obvious - the model just does not work well. High-variance failures are subtle - the model looks amazing on training data and only fails in deployment.
Regularization exists specifically to make high-capacity models trustworthy, not just impressive on training benchmarks.