## MLE's Blind Spot
MLE is powerful, but it has a problem: it ignores everything you know before seeing the data.
Suppose you flip a coin 3 times and see 3 heads. MLE gives $\hat{\theta}_{\text{MLE}} = 3/3 = 1$ — it concludes the coin always lands heads. But you know from experience that a coin landing heads 100% of the time is extremely unlikely. You had prior knowledge. MLE throws it away.
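A quick sketch of the failure: the likelihood of 3 heads in 3 flips is $\theta^3$, which is maximized at the boundary. A grid search over candidate biases (the grid resolution here is arbitrary) makes that concrete:

```python
import numpy as np

# Likelihood of observing 3 heads in 3 flips, as a function of theta
theta = np.linspace(0, 1, 1001)
likelihood = theta ** 3

# MLE picks the theta that maximizes the likelihood
theta_mle = theta[np.argmax(likelihood)]
print(theta_mle)  # 1.0: the coin "always lands heads"
```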
Maximum a posteriori (MAP) estimation fixes this by including a prior on the parameters.
## MAP via Bayes' Theorem
By Bayes' theorem:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$

- $P(\theta \mid D)$ — posterior: probability of parameters θ given data D
- $P(D \mid \theta)$ — likelihood: probability of data D given parameters θ
- $P(\theta)$ — prior: initial belief about the parameters θ
- $P(D)$ — evidence: normalizing constant, does not affect the argmax
MAP finds the peak of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta \mid D)$$

- $\hat{\theta}_{\text{MAP}}$ — MAP estimate: the parameter value that maximizes the posterior
Since $P(D)$ is constant with respect to $\theta$, it drops out. In log form:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[ \log P(D \mid \theta) + \log P(\theta) \right]$$

- $\log P(D \mid \theta)$ — log-likelihood
- $\log P(\theta)$ — log-prior: log probability of parameters under the prior
MAP = MLE with an extra term that rewards parameter values the prior considers plausible.
## Worked Example: Coin Flip with Prior
You flip a coin 5 times and observe 4 heads and 1 tail. MLE says $\hat{\theta} = 4/5 = 0.8$.
You have a prior belief that coins are probably fair: let's use a simple discrete prior that places 70% probability on $\theta = 0.5$ and 30% probability on $\theta = 0.8$.
Compute the unnormalized posterior (prior × likelihood) for each candidate:

| $\theta$ | Prior $P(\theta)$ | Likelihood $P(D \mid \theta)$ | Product |
|---|---|---|---|
| 0.5 | 0.70 | $0.5^5 = 0.03125$ | 0.022 |
| 0.8 | 0.30 | $0.8^4 \cdot 0.2 = 0.08192$ | 0.025 |
Normalizing (sum = 0.047): $P(\theta{=}0.5 \mid D) \approx 0.47$, $P(\theta{=}0.8 \mid D) \approx 0.53$.
MAP estimate: $\hat{\theta}_{\text{MAP}} = 0.8$ (barely). But the strong prior toward 0.5 has pulled the posterior from MLE's overconfident 0.8 estimate toward a nearly even split. With only 5 flips, the prior matters enormously.
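The table above can be reproduced in a few lines of Python (a sketch using the same two-point prior):

```python
# Discrete prior over two candidate biases; data = 4 heads, 1 tail
prior = {0.5: 0.70, 0.8: 0.30}

# Unnormalized posterior: likelihood * prior for each candidate theta
unnorm = {t: (t ** 4) * (1 - t) * p for t, p in prior.items()}
evidence = sum(unnorm.values())                      # ~0.047
posterior = {t: v / evidence for t, v in unnorm.items()}

print(posterior)                          # {0.5: ~0.47, 0.8: ~0.53}
print(max(posterior, key=posterior.get))  # 0.8, the MAP estimate
```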
## Gaussian Prior → L2 Regularization
Here is the connection that explains why L2 regularization works.
Suppose you assume each weight comes from a Gaussian prior:

$$p(\theta_i) = \mathcal{N}\!\left(\theta_i \mid 0, \tfrac{1}{\lambda}\right) = \sqrt{\frac{\lambda}{2\pi}}\, \exp\!\left(-\frac{\lambda \theta_i^2}{2}\right)$$

- $\lambda$ — regularization strength: controls how tightly θ is pulled toward 0
- $\mathcal{N}(0, 1/\lambda)$ — Gaussian prior: zero mean, variance $1/\lambda$
The log of this prior is:

$$\log p(\theta) = -\frac{\lambda}{2} \|\theta\|^2 + \text{const}$$

- $-\frac{\lambda}{2}\|\theta\|^2$ — log-prior: the term MAP adds to the log-likelihood
The MAP objective becomes:

$$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[ -\log P(D \mid \theta) + \frac{\lambda}{2} \|\theta\|^2 \right]$$

- $-\log P(D \mid \theta)$ — negative log-likelihood: the MLE training loss
L2 regularization (weight decay) is MAP estimation with a Gaussian prior. The hyperparameter $\lambda$ controls the prior's tightness — high $\lambda$ means a strong belief that weights should be near zero.
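The equivalence can be checked numerically: one SGD step with `weight_decay=lam` matches one step on the explicit MAP objective with the $\frac{\lambda}{2}\|\theta\|^2$ penalty. This is a sketch; the initial weights, learning rate, and stand-in data loss are arbitrary choices:

```python
import torch

lam, lr = 0.1, 0.5
w1 = torch.tensor([1.0, -2.0], requires_grad=True)
w2 = w1.detach().clone().requires_grad_(True)

# Path A: built-in weight decay on a stand-in data loss
opt = torch.optim.SGD([w1], lr=lr, weight_decay=lam)
(w1 ** 2).sum().backward()
opt.step()

# Path B: explicit MAP objective with the Gaussian-prior penalty
loss = (w2 ** 2).sum() + (lam / 2) * (w2 ** 2).sum()
loss.backward()
with torch.no_grad():
    w2 -= lr * w2.grad

print(torch.allclose(w1, w2))  # True: the two updates are identical
```

(With Adam the built-in `weight_decay` couples the penalty to the adaptive scaling; the decoupled variant is AdamW. Plain SGD keeps the correspondence exact.)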
## Laplace Prior → L1 Regularization
A Laplace prior has the form:

$$p(\theta_i) = \frac{1}{2b} \exp\!\left(-\frac{|\theta_i|}{b}\right)$$

- $b$ — scale parameter of the Laplace distribution
- $\lambda$ — regularization strength ($= 1/b$)
Its log is $-\frac{|\theta_i|}{b} + \text{const}$. So MAP gives:

$$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[ -\log P(D \mid \theta) + \lambda \sum_i |\theta_i| \right]$$

- $\lambda \sum_i |\theta_i|$ — L1 regularization: sum of absolute values of weights (lasso)
This is L1 regularization (lasso). The Laplace prior has a sharp peak at zero, which pushes weights to be exactly zero — producing sparse solutions.
## Summary: MLE vs MAP vs Bayesian Inference
| Method | Formula | Result |
|---|---|---|
| MLE | $\arg\max_\theta P(D \mid \theta)$ | Point estimate, no prior |
| MAP | $\arg\max_\theta P(D \mid \theta)\, P(\theta)$ | Point estimate with prior |
| Bayesian | $P(\theta \mid D)$ | Full posterior distribution |
MAP is the practical middle ground: it incorporates prior knowledge without the computational cost of maintaining a full posterior distribution. In deep learning, MAP is essentially what you're doing any time you add weight decay.
```python
import torch
import torch.nn as nn

# MAP with Gaussian prior = L2 regularization ("weight decay")
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# weight_decay adds λ·w to each gradient (the gradient of a (λ/2)·||w||²
# penalty), so training with it IS MAP under a Gaussian prior

# Verify: MAP with Laplace prior = L1 regularization
def map_loss_l1(loss, model, lam=1e-4):
    """L1 regularization = MAP with Laplace prior."""
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    return loss + lam * l1_penalty

x = torch.randn(32, 10)
y = torch.randn(32, 1)
preds = model(x)
base_loss = nn.MSELoss()(preds, y)
total_loss = map_loss_l1(base_loss, model)
print(f"MSE loss: {base_loss.item():.4f}, MAP loss: {total_loss.item():.4f}")
```