The Question Bayes Answers
You have a belief about something. New evidence arrives. How should you update your belief?
This is the question Bayes' theorem answers. It turns out humans are notoriously bad at answering it intuitively - we systematically overweight new evidence and underweight prior knowledge. Bayes' theorem gives the mathematically correct update.
More importantly for ML: this theorem is a lens for understanding what machine learning is doing. Training a model IS updating beliefs (about parameters) based on evidence (training data). Regularization has a Bayesian interpretation. Cross-entropy loss derives from Bayes. Understanding this theorem unifies a huge swath of ML theory.
The Theorem
P(H | E) = P(E | H) · P(H) / P(E)

Each piece has a name that carries the whole intuition:
- P(H): the prior - what you believed about H before seeing the evidence
- P(E | H): the likelihood - if H is true, how probable is the evidence E?
- P(E): the evidence - the total probability of observing E under any hypothesis. Just a normalizing constant.
- P(H | E): the posterior - your updated belief in H after observing E
The short version: posterior ∝ likelihood × prior. The posterior is proportional to how well H explains the evidence multiplied by how plausible H was to begin with.
The Classic Medical Test Example
A disease affects 1% of the population. A test is 99% accurate in both directions: 99% sensitivity (correct positive if sick) and 99% specificity (correct negative if healthy). You test positive. What is the probability you have the disease?
Most people's intuition says "99%." Let's apply Bayes' theorem.
Let H = "has disease" and E = "tests positive":
- P(H) = 0.01 - the prior: 1% of the population has the disease
- P(E | H) = 0.99 - the likelihood: 99% chance of a positive given disease
- P(E | ¬H) = 0.01 - the false positive rate: 1% chance of a positive if healthy
- P(H | E) - the probability we actually want:
P(H | E) = (0.99 × 0.01) / (0.99 × 0.01 + 0.01 × 0.99) = 0.0099 / 0.0198 = 0.5
Only 50%. Despite a 99% accurate test, there is only a coin-flip chance you have the disease.
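The arithmetic above can be checked in a few lines of Python (the variable names are just for illustration):

```python
# Posterior for a 99%-accurate test on a 1%-prevalence disease.
prior = 0.01           # P(disease)
sensitivity = 0.99     # P(positive | disease)
false_positive = 0.01  # P(positive | healthy) = 1 - specificity

# P(positive) by the law of total probability
p_positive = sensitivity * prior + false_positive * (1 - prior)

# Bayes' theorem
posterior = sensitivity * prior / p_positive
print(f"P(disease | positive) = {posterior:.2f}")  # → 0.50
```

The two terms in the denominator are equal (0.0099 each): true positives from the 1% who are sick exactly balance false positives from the 99% who are healthy, which is why the posterior lands at 50%.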
Interactive example
Bayes medical test demo - adjust disease prevalence and test accuracy to see how the posterior changes
Coming soon
Updating Beliefs Iteratively
Bayes' theorem is designed to be applied repeatedly. Today's posterior becomes tomorrow's prior.
Say you suspect a coin might be heads-biased, and you start with a 50/50 prior: P(fair) = 0.5. You flip and get heads. Bayes updates your belief slightly toward "heads-biased." Flip again - heads again. Another update. After 10 consecutive heads, your posterior P(fair) is very small.
With enough evidence, the influence of the prior diminishes - the data overwhelms your initial beliefs. With weak evidence, the prior matters a lot. This is the formal mechanism by which prior knowledge fades as data accumulates.
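Here is a minimal sketch of that sequential update, under the (arbitrary, illustrative) assumption that the alternative hypothesis is a coin landing heads 90% of the time:

```python
# Two hypotheses: the coin is fair, or it is heads-biased.
# The 0.9 bias is an arbitrary choice for illustration.
p_heads = {"fair": 0.5, "biased": 0.9}
posterior = {"fair": 0.5, "biased": 0.5}  # start with a 50/50 prior

for flip in range(10):  # observe 10 heads in a row
    # Multiply each hypothesis by its likelihood of producing heads...
    scores = {h: posterior[h] * p_heads[h] for h in posterior}
    # ...then renormalize: today's posterior is tomorrow's prior.
    total = sum(scores.values())
    posterior = {h: s / total for h, s in scores.items()}

print(f"P(fair | 10 heads) = {posterior['fair']:.4f}")  # → 0.0028
```

Note that the loop never touches the original prior again after the first pass - each iteration only sees the previous posterior, which is exactly the "posterior becomes prior" mechanism described above.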
The Bayesian Interpretation of ML Training
Training a neural network is, from a Bayesian perspective, inference over the space of possible models.
MLE vs. MAP:
- θ - the model parameters - weights and biases
- D - the training data - the evidence
- P(θ) - the prior - what we believe about parameters before seeing data
- P(D | θ) - the likelihood - how probable the data is given the parameters
Maximum likelihood estimation (MLE) - what vanilla gradient descent does - finds parameters that maximize how probable the data is. It ignores the prior.
Maximum a posteriori (MAP) also accounts for the prior. Since P(θ | D) ∝ P(D | θ) · P(θ), MAP maximizes log P(D | θ) + log P(θ). With a zero-mean Gaussian prior on the weights, log P(θ) = −λ‖θ‖² + const, so MAP is exactly equivalent to L2 regularization: you penalize weights far from zero.
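A tiny numeric illustration of that equivalence, using a one-parameter linear model y ≈ w·x (the data points and λ below are made up): the closed-form MAP solution is the ridge estimate, which shrinks the MLE toward zero.

```python
# One-parameter least squares: y ≈ w * x.
# MLE:                          w = Σxy / Σx²
# MAP with Gaussian prior on w: w = Σxy / (Σx² + λ)
xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]  # roughly y = 2x (made-up data)
lam = 5.0             # prior strength = L2 penalty weight

sxy = sum(x * y for x, y in zip(xs, ys))
sxx = sum(x * x for x in xs)

w_mle = sxy / sxx          # ignores the prior
w_map = sxy / (sxx + lam)  # Gaussian prior shrinks w toward 0
print(f"MLE: w = {w_mle:.3f}")  # → 2.036
print(f"MAP: w = {w_map:.3f}")  # → 1.500
```

Setting λ = 0 recovers the MLE exactly; the larger λ (the tighter the Gaussian prior around zero), the more the estimate is pulled toward zero.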
The Naive Bayes classifier directly applies the theorem to classification, P(C | x) ∝ P(x | C) · P(C), where:
- C - the class label
- x - the feature vector - the input
The "naive" part: it assumes each feature is conditionally independent given the class. This is almost certainly wrong in practice, but the classifier still works surprisingly well for text classification (spam detection, sentiment analysis) because the independence assumption, while wrong, does not hurt the prediction task much.
# Bayes' theorem in Python — medical test example
def bayes_update(prior, likelihood_given_true, likelihood_given_false):
    """P(H|E) = P(E|H) * P(H) / P(E)"""
    p_evidence = likelihood_given_true * prior + likelihood_given_false * (1 - prior)
    return (likelihood_given_true * prior) / p_evidence

# Disease prevalence: 1% of population is sick
# Test sensitivity: 95% of sick people test positive
# False positive rate: 5% of healthy people also test positive
p_sick_given_positive = bayes_update(
    prior=0.01,
    likelihood_given_true=0.95,
    likelihood_given_false=0.05,
)
print(f"P(sick | positive test) = {p_sick_given_positive:.4f}")  # → ≈ 0.16
# Naïve Bayes classifier (spam detection sketch)
# P(spam | "money", "free") ∝ P("money"|spam) * P("free"|spam) * P(spam)
p_spam = 0.3
p_money_spam, p_free_spam = 0.6, 0.7  # word probabilities in spam
p_money_ham, p_free_ham = 0.1, 0.1    # word probabilities in ham

score_spam = p_money_spam * p_free_spam * p_spam
score_ham = p_money_ham * p_free_ham * (1 - p_spam)
p_spam_given_words = score_spam / (score_spam + score_ham)
print(f"P(spam | 'money free') = {p_spam_given_words:.4f}")  # → ≈ 0.95