## MLE's Blind Spot
MLE is powerful, but it has a problem: it ignores everything you know before seeing the data.
Suppose you flip a coin 3 times and see 3 heads. MLE gives $\hat{\theta}_{\text{MLE}} = 3/3 = 1$ — it concludes the coin always lands heads. But you know from experience that a coin landing heads 100% of the time is extremely unlikely. You had prior knowledge. MLE throws it away.
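A quick sketch of the failure: the likelihood of 3 heads in 3 flips is $\theta^3$, which is maximized at the boundary. A grid search over candidate biases (the grid resolution here is arbitrary) makes that concrete:

```python
import numpy as np

# Likelihood of observing 3 heads in 3 flips, as a function of theta
theta = np.linspace(0, 1, 1001)
likelihood = theta ** 3

# MLE picks the theta that maximizes the likelihood
theta_mle = theta[np.argmax(likelihood)]
print(theta_mle)  # 1.0: the coin "always lands heads"
```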
Maximum a posteriori (MAP) estimation fixes this by including a prior on the parameters.
## MAP via Bayes' Theorem
By Bayes' theorem:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$

- $P(\theta \mid D)$ — posterior: probability of parameters θ given data D
- $P(D \mid \theta)$ — likelihood: probability of data D given parameters θ
- $P(\theta)$ — prior: initial belief about the parameters θ
- $P(D)$ — evidence: normalizing constant, does not affect the argmax
MAP finds the peak of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta P(\theta \mid D)$$

- $\hat{\theta}_{\text{MAP}}$ — MAP estimate: the parameter value that maximizes the posterior
Since $P(D)$ is constant with respect to $\theta$, it drops out. In log form:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \left[ \log P(D \mid \theta) + \log P(\theta) \right]$$

- $\log P(D \mid \theta)$ — log-likelihood
- $\log P(\theta)$ — log-prior: log probability of parameters under the prior
MAP = MLE with an extra term that rewards parameter values the prior considers plausible.
## Worked Example: Coin Flip with Prior
You flip a coin 5 times and observe 4 heads and 1 tail. MLE says $\hat{\theta} = 4/5 = 0.8$.
You have a prior belief that coins are probably fair: let's use a simple discrete prior that places 70% probability on $\theta = 0.5$ and 30% probability on $\theta = 0.8$.
Compute the unnormalized posterior (prior × likelihood) for each candidate:

| $\theta$ | Prior $P(\theta)$ | Likelihood $P(D \mid \theta)$ | Product |
|---|---|---|---|
| 0.5 | 0.70 | $0.5^5 = 0.03125$ | 0.022 |
| 0.8 | 0.30 | $0.8^4 \cdot 0.2 = 0.08192$ | 0.025 |
Normalizing (sum = 0.047): $P(\theta{=}0.5 \mid D) \approx 0.47$, $P(\theta{=}0.8 \mid D) \approx 0.53$.
MAP estimate: $\hat{\theta}_{\text{MAP}} = 0.8$ (barely). But the strong prior toward 0.5 has pulled the posterior from MLE's overconfident 0.8 estimate toward a nearly even split. With only 5 flips, the prior matters enormously.
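The table above can be reproduced in a few lines of Python (a sketch using the same two-point prior):

```python
# Discrete prior over two candidate biases; data = 4 heads, 1 tail
prior = {0.5: 0.70, 0.8: 0.30}

# Unnormalized posterior: likelihood * prior for each candidate theta
unnorm = {t: (t ** 4) * (1 - t) * p for t, p in prior.items()}
evidence = sum(unnorm.values())                      # ~0.047
posterior = {t: v / evidence for t, v in unnorm.items()}

print(posterior)                          # {0.5: ~0.47, 0.8: ~0.53}
print(max(posterior, key=posterior.get))  # 0.8, the MAP estimate
```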
## Gaussian Prior → L2 Regularization
Here is the connection that explains why L2 regularization works.
Suppose you assume each weight comes from a Gaussian prior:

$$p(\theta_i) = \mathcal{N}\!\left(\theta_i \mid 0, \tfrac{1}{\lambda}\right) = \sqrt{\frac{\lambda}{2\pi}}\, \exp\!\left(-\frac{\lambda \theta_i^2}{2}\right)$$

- $\lambda$ — regularization strength: controls how tightly θ is pulled toward 0
- $\mathcal{N}(0, 1/\lambda)$ — Gaussian prior: zero mean, variance $1/\lambda$
The log of this prior is:

$$\log p(\theta) = -\frac{\lambda}{2} \|\theta\|^2 + \text{const}$$

- $-\frac{\lambda}{2}\|\theta\|^2$ — log-prior: the term MAP adds to the log-likelihood
The MAP objective becomes:

$$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[ -\log P(D \mid \theta) + \frac{\lambda}{2} \|\theta\|^2 \right]$$

- $-\log P(D \mid \theta)$ — negative log-likelihood: the MLE training loss
L2 regularization (weight decay) is MAP estimation with a Gaussian prior. The hyperparameter $\lambda$ controls the prior's tightness — high $\lambda$ means a strong belief that weights should be near zero.
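The equivalence can be checked numerically: one SGD step with `weight_decay=lam` matches one step on the explicit MAP objective with the $\frac{\lambda}{2}\|\theta\|^2$ penalty. This is a sketch; the initial weights, learning rate, and stand-in data loss are arbitrary choices:

```python
import torch

lam, lr = 0.1, 0.5
w1 = torch.tensor([1.0, -2.0], requires_grad=True)
w2 = w1.detach().clone().requires_grad_(True)

# Path A: built-in weight decay on a stand-in data loss
opt = torch.optim.SGD([w1], lr=lr, weight_decay=lam)
(w1 ** 2).sum().backward()
opt.step()

# Path B: explicit MAP objective with the Gaussian-prior penalty
loss = (w2 ** 2).sum() + (lam / 2) * (w2 ** 2).sum()
loss.backward()
with torch.no_grad():
    w2 -= lr * w2.grad

print(torch.allclose(w1, w2))  # True: the two updates are identical
```

(With Adam the built-in `weight_decay` couples the penalty to the adaptive scaling; the decoupled variant is AdamW. Plain SGD keeps the correspondence exact.)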
## Laplace Prior → L1 Regularization
A Laplace prior has the form:

$$p(\theta_i) = \frac{1}{2b} \exp\!\left(-\frac{|\theta_i|}{b}\right)$$

- $b$ — scale parameter of the Laplace distribution
- $\lambda$ — regularization strength ($= 1/b$)
Its log is $-\frac{|\theta_i|}{b} + \text{const}$. So MAP gives:

$$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[ -\log P(D \mid \theta) + \lambda \sum_i |\theta_i| \right]$$

- $\lambda \sum_i |\theta_i|$ — L1 regularization: sum of absolute values of weights (lasso)
This is L1 regularization (lasso). The Laplace prior has a sharp peak at zero, which pushes weights to be exactly zero — producing sparse solutions.
## Summary: MLE vs MAP vs Bayesian Inference
| Method | Formula | Result |
|---|---|---|
| MLE | $\arg\max_\theta P(D \mid \theta)$ | Point estimate, no prior |
| MAP | $\arg\max_\theta P(D \mid \theta)\, P(\theta)$ | Point estimate with prior |
| Bayesian | $P(\theta \mid D)$ | Full posterior distribution |
MAP is the practical middle ground: it incorporates prior knowledge without the computational cost of maintaining a full posterior distribution. In deep learning, MAP is essentially what you're doing any time you add weight decay.
```python
import torch
import torch.nn as nn

# MAP with Gaussian prior = L2 regularization ("weight decay")
model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# weight_decay adds λ·w to each gradient (the gradient of a (λ/2)·||w||²
# penalty), so training with it IS MAP under a Gaussian prior

# Verify: MAP with Laplace prior = L1 regularization
def map_loss_l1(loss, model, lam=1e-4):
    """L1 regularization = MAP with Laplace prior."""
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    return loss + lam * l1_penalty

x = torch.randn(32, 10)
y = torch.randn(32, 1)
preds = model(x)
base_loss = nn.MSELoss()(preds, y)
total_loss = map_loss_l1(base_loss, model)
print(f"MSE loss: {base_loss.item():.4f}, MAP loss: {total_loss.item():.4f}")
```