
MAP estimation and priors


MAP Estimation: Adding a Prior Belief to MLE

MLE with a prior. Bayes' theorem connects prior × likelihood to posterior. Gaussian prior → L2 regularization. Laplace prior → L1. Why regularization is Bayesian reasoning.


Quick refresher

Bayes' theorem

P(H|E) = P(E|H)·P(H)/P(E). The posterior belief equals the likelihood of evidence times the prior belief, normalized. The posterior is proportional to likelihood × prior.

Example

Prior P(fair coin) = 0.9.

After seeing 3 heads in 3 flips, the likelihood ratio updates your belief toward the biased coin, but the strong prior keeps P(fair|data) relatively high.
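To make the refresher concrete, here is a minimal sketch in plain Python. The biased coin's heads probability (0.8 here) is an assumption for illustration, since the example above doesn't pin it down:

# Prior beliefs about the two hypotheses
p_fair_prior   = 0.9          # P(fair coin), as in the example
p_biased_prior = 0.1          # P(biased coin)
p_heads_biased = 0.8          # assumed heads probability for the biased coin

# Likelihood of seeing 3 heads in 3 flips under each hypothesis
lik_fair   = 0.5 ** 3
lik_biased = p_heads_biased ** 3

# Bayes' theorem: posterior is proportional to likelihood * prior, then normalize
evidence = lik_fair * p_fair_prior + lik_biased * p_biased_prior
p_fair_post = lik_fair * p_fair_prior / evidence

print(f"P(fair | 3 heads) = {p_fair_post:.2f}")   # about 0.69, still fairly high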

MLE's Blind Spot

MLE is powerful, but it has a problem: it ignores everything you know before seeing the data.

Suppose you flip a coin 3 times and see 3 heads. MLE gives $\hat{p} = 1.0$ — it concludes the coin always lands heads. But you know from experience that a coin landing heads 100% of the time is extremely unlikely. You had prior knowledge. MLE throws it away.

Maximum a posteriori (MAP) estimation fixes this by including a prior on the parameters.

MAP via Bayes' Theorem

By Bayes' theorem:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\cdot P(\theta)}{P(D)}$$

$P(\theta \mid D)$ — posterior: probability of parameters $\theta$ given data $D$
$P(D \mid \theta)$ — likelihood: probability of data $D$ given parameters $\theta$
$P(\theta)$ — prior: initial belief about parameter $\theta$
$P(D)$ — evidence: normalizing constant, does not affect the argmax

MAP finds the peak of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta)\cdot P(\theta)$$

$\hat{\theta}_{\text{MAP}}$ — MAP estimate: the parameter value that maximizes the posterior

Since $P(D)$ is constant with respect to $\theta$, it drops out. In log form:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\left[\ell(\theta) + \log P(\theta)\right]$$

$\ell(\theta)$ — log-likelihood
$\log P(\theta)$ — log-prior: log probability of the parameters under the prior

MAP = MLE with an extra term that rewards parameter values the prior considers plausible.

Worked Example: Coin Flip with Prior

You flip a coin 5 times and observe $h = 4$ heads, $t = 1$ tail. MLE says $\hat{p} = 4/5 = 0.8$.

You have a prior belief that coins are probably fair: let's use a simple discrete prior that places 70% probability on $p = 0.5$ and 30% probability on $p = 0.8$.

Compute the posterior for each candidate:

| $p$ | Prior $P(p)$ | Likelihood $p^4(1-p)^1$ | Product |
|---|---|---|---|
| 0.5 | 0.70 | $0.5^4 \cdot 0.5 = 0.031$ | 0.022 |
| 0.8 | 0.30 | $0.8^4 \cdot 0.2 = 0.082$ | 0.025 |

Normalizing (sum = 0.047): $P(p=0.5 \mid D) \approx 0.47$, $P(p=0.8 \mid D) \approx 0.53$.

MAP estimate: $\hat{p}_{\text{MAP}} = 0.8$, but only barely. MLE was fully confident in 0.8, whereas the posterior is nearly an even split between 0.5 and 0.8: the strong prior toward fairness has pulled the conclusion toward a much more uncertain picture. With only 5 flips, the prior matters enormously.
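The same arithmetic in a few lines of plain Python, a minimal sketch using exactly the numbers from the table above:

candidates = {0.5: 0.70, 0.8: 0.30}   # candidate p -> prior P(p)
h, t = 4, 1                           # observed heads and tails

# Unnormalized posterior: prior * likelihood for each candidate
unnorm = {p: prior * p**h * (1 - p)**t for p, prior in candidates.items()}
evidence = sum(unnorm.values())       # P(D), the normalizing constant

posterior = {p: w / evidence for p, w in unnorm.items()}
p_map = max(posterior, key=posterior.get)

print(posterior)                      # roughly {0.5: 0.47, 0.8: 0.53}
print("MAP estimate:", p_map)         # 0.8, but only just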

Gaussian Prior → L2 Regularization

Here is the connection that explains why L2 regularization works.

Suppose you assume each weight comes from a Gaussian prior:

$$\theta_j \sim \mathcal{N}\!\left(0,\ \tfrac{1}{\lambda}\right)$$

$\lambda$ — regularization strength: controls how tightly $\theta$ is pulled toward 0
$\mathcal{N}(0, 1/\lambda)$ — Gaussian prior: zero mean, variance $1/\lambda$

The log of this prior is:

$$\log P(\theta) = -\frac{\lambda}{2}\sum_j \theta_j^2 + \text{const}$$

$\log P(\theta)$ — log-prior: the term MAP adds to the log-likelihood

Maximizing the log-posterior is the same as minimizing its negative, so the MAP objective becomes:

$$\hat{\theta}_{\text{MAP}} = \arg\min_{\theta}\left[\underbrace{\text{NLL}(\theta)}_{\text{fit data}} + \underbrace{\frac{\lambda}{2}\sum_j \theta_j^2}_{\text{L2 regularization}}\right]$$

$\text{NLL}$ — negative log-likelihood: the MLE training loss

L2 regularization (weight decay) is MAP estimation with a Gaussian prior. The hyperparameter λ\lambda controls the prior's tightness — high λ\lambda means a strong belief that weights should be near zero.
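A quick numerical check of this correspondence; the value of λ and the sample values of θ below are arbitrary choices for illustration:

import math

lam = 4.0                  # regularization strength (arbitrary for this check)
var = 1.0 / lam            # Gaussian prior variance 1/λ

def neg_log_gaussian(theta, var):
    """Negative log density of N(0, var) at theta."""
    return 0.5 * math.log(2 * math.pi * var) + theta**2 / (2 * var)

# -log P(θ) and the L2 penalty (λ/2)·θ² should differ by the same additive
# constant for every θ; i.e. the Gaussian log-prior is the L2 term up to a constant.
for theta in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    penalty = 0.5 * lam * theta**2
    print(theta, neg_log_gaussian(theta, var) - penalty)   # same value each time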

Laplace Prior → L1 Regularization

A Laplace prior has the form:

$$P(\theta_j) = \frac{1}{2b}\exp\!\left(-\frac{|\theta_j|}{b}\right)$$

$b$ — scale parameter of the Laplace distribution
$\lambda$ — regularization strength ($= 1/b$)

Its log is $\log P(\theta_j) = -|\theta_j|/b + \text{const}$. So MAP gives:

$$\hat{\theta}_{\text{MAP}} = \arg\min_{\theta}\left[\text{NLL}(\theta) + \lambda\sum_j |\theta_j|\right]$$

$\lambda\sum_j |\theta_j|$ — L1 regularization: sum of absolute values of the weights (lasso)

This is L1 regularization (lasso). The Laplace prior has a sharp peak at zero, which pushes weights to exactly zero rather than merely shrinking them, producing sparse solutions. That sharp peak at the origin is why L1 regularization induces sparsity, as the sketch below illustrates.
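A one-dimensional sketch of that sparsity claim, minimizing a quadratic data-fit term plus each penalty by brute-force grid search (λ and the data values x are arbitrary choices for illustration):

import torch

lam = 1.0
thetas = torch.linspace(-3, 3, 60001)   # dense grid of candidate parameters

# x plays the role of the unregularized (MLE) estimate that the prior pulls
# toward zero. L1 sends small estimates to exactly 0; L2 only shrinks them.
for x in [0.3, 0.8, 2.0]:
    fit = 0.5 * (thetas - x) ** 2
    theta_l1 = thetas[torch.argmin(fit + lam * thetas.abs())].item()
    theta_l2 = thetas[torch.argmin(fit + 0.5 * lam * thetas ** 2)].item()
    print(f"x={x}: L1 -> {theta_l1:.2f}, L2 -> {theta_l2:.2f}")
# x=0.3: L1 -> 0.00, L2 -> 0.15
# x=0.8: L1 -> 0.00, L2 -> 0.40
# x=2.0: L1 -> 1.00, L2 -> 1.00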

Summary: MLE vs MAP vs Bayesian Inference

| Method | Formula | Result |
|---|---|---|
| MLE | $\arg\max P(D\mid\theta)$ | Point estimate, no prior |
| MAP | $\arg\max P(D\mid\theta) \cdot P(\theta)$ | Point estimate with prior |
| Bayesian | $\int \theta\, P(\theta\mid D)\, d\theta$ | Full posterior distribution |

MAP is the practical middle ground: it incorporates prior knowledge without the computational cost of maintaining a full posterior distribution. In deep learning, MAP is essentially what you're doing any time you add weight decay.

import torch
import torch.nn as nn

# MAP with Gaussian prior = L2 regularization ("weight decay")
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# weight_decay adds λ·w to each gradient, i.e. a (λ/2)·||w||² penalty on the
# loss; this is exactly the Gaussian-prior term derived above

# MAP with Laplace prior = L1 regularization, added to the loss by hand
def map_loss_l1(loss, model, lam=1e-4):
    """L1 regularization = MAP with Laplace prior."""
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    return loss + lam * l1_penalty

x = torch.randn(32, 10)
y = torch.randn(32, 1)
preds = model(x)
base_loss = nn.MSELoss()(preds, y)
total_loss = map_loss_l1(base_loss, model)
print(f"MSE loss: {base_loss.item():.4f}, MAP loss: {total_loss.item():.4f}")

Quiz


MAP estimation finds: