

Bayes' Theorem: Updating Beliefs with Evidence

Prior, likelihood, posterior. The medical test example. Why MAP estimation is L2 regularization. How training a model is Bayesian inference.


Quick refresher

Conditional probability

P(A|B) = P(A∩B)/P(B) - the probability of A restricted to the world where B occurred. P(A|B) ≠ P(B|A) in general.

Example

P(rolling 4 | rolling even) = P(4 and even)/P(even) = (1/6)/(3/6) = 1/3.
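A quick sanity check of that arithmetic by enumerating the six die outcomes (a minimal sketch; Fraction just keeps the result exact):

from fractions import Fraction

outcomes = range(1, 7)
p_even = Fraction(sum(1 for o in outcomes if o % 2 == 0), 6)  # 3/6
p_four_and_even = Fraction(1, 6)   # {4} is the only outcome in both events
print(p_four_and_even / p_even)    # 1/3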

The Question Bayes Answers

You have a belief about something. New evidence arrives. How should you update your belief?

This is the question Bayes' theorem answers. It turns out humans are notoriously bad at answering it intuitively - we systematically overweight new evidence and underweight prior knowledge. Bayes gives the mathematically correct update.

More importantly for ML: this theorem is a lens for understanding what machine learning is doing. Training a model IS updating beliefs (about parameters) based on evidence (training data). Regularization has a Bayesian interpretation. Cross-entropy loss derives from Bayes. Understanding this theorem unifies a huge swath of ML theory.

The Theorem

P(H \mid E) = \frac{P(E \mid H)\cdot P(H)}{P(E)}

Each piece has a name that carries the whole intuition:

  • P(H): the prior - what you believed about H before seeing E
  • P(E|H): the likelihood - if H is true, how probable is evidence E?
  • P(E): the evidence - total probability of observing E. Just a normalizing constant.
  • P(H|E): the posterior - your updated belief after observing E

The short version: posterior ∝ likelihood × prior. The posterior is proportional to how well H explains the evidence multiplied by how plausible H was to begin with.
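A tiny sketch of that proportionality over two competing hypotheses (the numbers here are made up for illustration):

# posterior ∝ likelihood × prior, then normalize so the posteriors sum to 1
priors      = {"H1": 0.3, "H2": 0.7}   # plausibility before the evidence
likelihoods = {"H1": 0.8, "H2": 0.2}   # P(E | hypothesis)

unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
evidence = sum(unnormalized.values())   # P(E), the normalizing constant
posterior = {h: round(v / evidence, 4) for h, v in unnormalized.items()}
print(posterior)  # {'H1': 0.6316, 'H2': 0.3684}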

The Classic Medical Test Example

A disease affects 1% of the population. A test is 99% accurate in both directions: 99% sensitivity (correct positive if sick) and 99% specificity (correct negative if healthy). You test positive. What is the probability you have the disease?

Most people's intuition says "99%." Let's apply Bayes' theorem.

Let H = "has disease" and E = "tests positive". First, compute the evidence P(E) via the law of total probability:

P(E) = P(E \mid H)\cdot P(H) + P(E \mid \overline{H})\cdot P(\overline{H}) = 0.99 \cdot 0.01 + 0.01 \cdot 0.99 = 0.0198

Here P(H) = 0.01 is the prior (1% of the population has the disease), P(E|H) = 0.99 is the likelihood (positive given disease), and P(E|H̄) = 0.01 is the false positive rate (positive despite being healthy). The posterior - the probability we actually want - is then:

P(H \mid E) = \frac{0.99 \times 0.01}{0.0198} = \frac{0.0099}{0.0198} = 0.50

Only 50%. Despite a 99% accurate test, there is only a coin-flip chance you have the disease. The prior is so low that true positives (sick and flagged) and false positives (healthy but flagged) occur in equal numbers.
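A two-line check of that arithmetic (a reusable version of this calculation appears in the code section below):

prior, sensitivity, false_positive_rate = 0.01, 0.99, 0.01
posterior = sensitivity * prior / (sensitivity * prior + false_positive_rate * (1 - prior))
print(posterior)  # 0.5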


Updating Beliefs Iteratively

Bayes' theorem is designed to be applied repeatedly. Today's posterior becomes tomorrow's prior.

You start with P(coin fair) = 0.5. You flip and get heads. Bayes updates your belief slightly toward "heads-biased." Flip again - heads again. Another update. After 10 consecutive heads, your posterior P(coin fair | 10 heads) is very small.

With enough evidence, the influence of the prior diminishes - the data overwhelms your initial beliefs. With weak evidence, the prior matters a lot. This is the formal mechanism by which prior knowledge fades as data accumulates.
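A minimal sketch of that loop. It assumes the only alternative to "fair" is a coin with P(heads) = 0.9 - an illustrative choice, not something the argument depends on:

# Iterative Bayesian updating: today's posterior is tomorrow's prior.
# Hypotheses: "fair" (P(heads)=0.5) vs. a hypothetical biased coin (P(heads)=0.9).
p_heads_fair, p_heads_biased = 0.5, 0.9
p_fair = 0.5  # prior: both hypotheses equally likely

for flip in range(1, 11):                               # observe 10 heads in a row
    numerator = p_heads_fair * p_fair                   # P(heads | fair) * P(fair)
    evidence = numerator + p_heads_biased * (1 - p_fair)  # total P(heads)
    p_fair = numerator / evidence                       # posterior becomes the next prior
print(f"P(fair | 10 heads) = {p_fair:.4f}")             # → ≈ 0.0028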

The Bayesian Interpretation of ML Training

Training a neural network is, from a Bayesian perspective, inference over the space of possible models.

MLE vs. MAP:

\text{MLE: } \hat{\theta} = \arg\max_\theta P(\mathcal{D} \mid \theta)

\text{MAP: } \hat{\theta} = \arg\max_\theta P(\mathcal{D} \mid \theta)\cdot P(\theta)

Here θ denotes the model parameters (weights and biases) and D the training data - the evidence. P(D|θ) is the likelihood of the data given the parameters; P(θ) is the prior - what we believe about the parameters before seeing any data.

Maximum likelihood estimation (MLE) - what vanilla gradient descent does - finds parameters that maximize how probable the data is. It ignores the prior.

Maximum a posteriori (MAP) estimation also accounts for the prior. With a Gaussian prior N(0, 1/(2λ)) on the weights, MAP is exactly equivalent to L2 regularization: you penalize weights far from zero.
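To see why, take negative logs, which turns maximizing the posterior into minimizing a loss:

\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[ -\log P(\mathcal{D} \mid \theta) - \log P(\theta) \right]

With each weight drawn from \mathcal{N}(0, 1/(2\lambda)), the prior term becomes

-\log P(\theta) = \sum_i \frac{\theta_i^2}{2 \cdot \frac{1}{2\lambda}} + \text{const} = \lambda \lVert \theta \rVert_2^2 + \text{const},

which is exactly the L2 penalty added to the usual negative log-likelihood loss.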

The Naive Bayes classifier directly applies the theorem to classification:

P(c \mid \mathbf{x}) \propto P(\mathbf{x} \mid c)\cdot P(c)

where c is the class label and x is the feature vector (the input).

The "naive" part: it assumes each feature is conditionally independent given the class. This is almost certainly wrong in practice, but the classifier still works surprisingly well for text classification (spam detection, sentiment analysis) because the independence assumption, while wrong, does not hurt the prediction task much.

# Bayes' theorem in Python — medical test example
def bayes_update(prior, likelihood_given_true, likelihood_given_false):
    """P(H|E) = P(E|H) * P(H) / P(E)"""
    p_evidence = likelihood_given_true * prior + likelihood_given_false * (1 - prior)
    return (likelihood_given_true * prior) / p_evidence

# Disease prevalence: 1% of population is sick
# Test sensitivity: 95% of sick people test positive
# False positive rate: 5% of healthy people also test positive
p_sick_given_positive = bayes_update(
    prior=0.01,
    likelihood_given_true=0.95,
    likelihood_given_false=0.05
)
print(f"P(sick | positive test) = {p_sick_given_positive:.4f}")  # → ≈ 0.16

# Naïve Bayes classifier (spam detection sketch)
# P(spam | "money", "free") ∝ P("money"|spam) * P("free"|spam) * P(spam)
p_spam = 0.3
p_money_spam, p_free_spam   = 0.6, 0.7   # word probabilities in spam
p_money_ham,  p_free_ham    = 0.1, 0.1   # word probabilities in ham

score_spam = p_money_spam * p_free_spam * p_spam
score_ham  = p_money_ham  * p_free_ham  * (1 - p_spam)
p_spam_given_words = score_spam / (score_spam + score_ham)
print(f"P(spam | 'money free') = {p_spam_given_words:.4f}")  # → ≈ 0.84

Quiz


In Bayes' theorem P(H|E) = P(E|H)·P(H) / P(E), what is P(H) called?