

Bayes' Theorem: Updating Beliefs with Evidence

Prior, likelihood, posterior. The medical test example. Why MAP estimation is L2 regularization. How training a model is Bayesian inference.


Quick refresher

Conditional probability

P(A|B) = P(A∩B)/P(B) - the probability of A restricted to the world where B occurred. P(A|B) ≠ P(B|A) in general.

Example

P(rolling 4 | rolling even) = P(4 and even)/P(even) = (1/6)/(3/6) = 1/3.
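A quick sanity check of that arithmetic by enumerating the six die outcomes (a minimal sketch; Fraction just keeps the result exact):

from fractions import Fraction

outcomes = range(1, 7)
p_even = Fraction(sum(1 for o in outcomes if o % 2 == 0), 6)  # 3/6
p_four_and_even = Fraction(1, 6)   # {4} is the only outcome in both events
print(p_four_and_even / p_even)    # 1/3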

The Question Bayes Answers

You have a belief about something. New evidence arrives. How should you update your belief?

This is the question Bayes' theorem answers. It turns out humans are notoriously bad at answering it intuitively - we systematically overweight new evidence and underweight prior knowledge. Bayes gives the mathematically correct update.

More importantly for ML: this theorem is a lens for understanding what machine learning is doing. Training a model IS updating beliefs (about parameters) based on evidence (training data). Regularization has a Bayesian interpretation. Cross-entropy loss derives from Bayes. Understanding this theorem unifies a huge swath of ML theory.

The Theorem

P(H \mid E) = \frac{P(E \mid H)\cdot P(H)}{P(E)}

Each piece has a name that carries the whole intuition:

  • P(H): the prior - what you believed about H before seeing E
  • P(E|H): the likelihood - if H is true, how probable is evidence E?
  • P(E): the evidence - total probability of observing E. Just a normalizing constant.
  • P(H|E): the posterior - your updated belief after observing E

The short version: posterior ∝ likelihood × prior. The posterior is proportional to how well H explains the evidence multiplied by how plausible H was to begin with.
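A tiny sketch of that proportionality over two competing hypotheses (the numbers here are made up for illustration):

# posterior ∝ likelihood × prior, then normalize so the posteriors sum to 1
priors      = {"H1": 0.3, "H2": 0.7}   # plausibility before the evidence
likelihoods = {"H1": 0.8, "H2": 0.2}   # P(E | hypothesis)

unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
evidence = sum(unnormalized.values())   # P(E), the normalizing constant
posterior = {h: round(v / evidence, 4) for h, v in unnormalized.items()}
print(posterior)  # {'H1': 0.6316, 'H2': 0.3684}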

The Classic Medical Test Example

A disease affects 1% of the population. A test is 99% accurate in both directions: 99% sensitivity (correct positive if sick) and 99% specificity (correct negative if healthy). You test positive. What is the probability you have the disease?

Most people's intuition says "99%." Let's apply Bayes' theorem.

Let H = "has disease" and E = "tests positive". First, compute the evidence P(E) via the law of total probability:

P(E) = P(E \mid H)\cdot P(H) + P(E \mid \overline{H})\cdot P(\overline{H}) = 0.99 \cdot 0.01 + 0.01 \cdot 0.99 = 0.0198

Here P(H) = 0.01 is the prior (1% of the population has the disease), P(E|H) = 0.99 is the likelihood (positive given disease), and P(E|H̄) = 0.01 is the false positive rate (positive despite being healthy). The posterior - the probability we actually want - is then:

P(H \mid E) = \frac{0.99 \times 0.01}{0.0198} = \frac{0.0099}{0.0198} = 0.50

Only 50%. Despite a 99% accurate test, there is only a coin-flip chance you have the disease. The prior is so low that true positives (sick and flagged) and false positives (healthy but flagged) occur in equal numbers.
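A two-line check of that arithmetic (a reusable version of this calculation appears in the code section below):

prior, sensitivity, false_positive_rate = 0.01, 0.99, 0.01
posterior = sensitivity * prior / (sensitivity * prior + false_positive_rate * (1 - prior))
print(posterior)  # 0.5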


Updating Beliefs Iteratively

Bayes' theorem is designed to be applied repeatedly. Today's posterior becomes tomorrow's prior.

You start with P(coin fair) = 0.5. You flip and get heads. Bayes updates your belief slightly toward "heads-biased." Flip again - heads again. Another update. After 10 consecutive heads, your posterior P(coin fair | 10 heads) is very small.

With enough evidence, the influence of the prior diminishes - the data overwhelms your initial beliefs. With weak evidence, the prior matters a lot. This is the formal mechanism by which prior knowledge fades as data accumulates.
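A minimal sketch of that loop. It assumes the only alternative to "fair" is a coin with P(heads) = 0.9 - an illustrative choice, not something the argument depends on:

# Iterative Bayesian updating: today's posterior is tomorrow's prior.
# Hypotheses: "fair" (P(heads)=0.5) vs. a hypothetical biased coin (P(heads)=0.9).
p_heads_fair, p_heads_biased = 0.5, 0.9
p_fair = 0.5  # prior: both hypotheses equally likely

for flip in range(1, 11):                               # observe 10 heads in a row
    numerator = p_heads_fair * p_fair                   # P(heads | fair) * P(fair)
    evidence = numerator + p_heads_biased * (1 - p_fair)  # total P(heads)
    p_fair = numerator / evidence                       # posterior becomes the next prior
print(f"P(fair | 10 heads) = {p_fair:.4f}")             # → ≈ 0.0028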

The Bayesian Interpretation of ML Training

Training a neural network is, from a Bayesian perspective, inference over the space of possible models.

MLE vs. MAP:

\text{MLE: } \hat{\theta} = \arg\max_\theta P(\mathcal{D} \mid \theta)

\text{MAP: } \hat{\theta} = \arg\max_\theta P(\mathcal{D} \mid \theta)\cdot P(\theta)

Here θ denotes the model parameters (weights and biases) and D the training data - the evidence. P(D|θ) is the likelihood of the data given the parameters; P(θ) is the prior - what we believe about the parameters before seeing any data.

Maximum likelihood estimation (MLE) - what vanilla gradient descent does - finds parameters that maximize how probable the data is. It ignores the prior.

Maximum a posteriori (MAP) estimation also accounts for the prior. With a Gaussian prior N(0, 1/(2λ)) on the weights, MAP is exactly equivalent to L2 regularization: you penalize weights far from zero.
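To see why, take negative logs, which turns maximizing the posterior into minimizing a loss:

\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[ -\log P(\mathcal{D} \mid \theta) - \log P(\theta) \right]

With each weight drawn from \mathcal{N}(0, 1/(2\lambda)), the prior term becomes

-\log P(\theta) = \sum_i \frac{\theta_i^2}{2 \cdot \frac{1}{2\lambda}} + \text{const} = \lambda \lVert \theta \rVert_2^2 + \text{const},

which is exactly the L2 penalty added to the usual negative log-likelihood loss.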

The Naive Bayes classifier directly applies the theorem to classification:

P(c \mid \mathbf{x}) \propto P(\mathbf{x} \mid c)\cdot P(c)

where c is the class label and x is the feature vector (the input).

The "naive" part: it assumes each feature is conditionally independent given the class. This is almost certainly wrong in practice, but the classifier still works surprisingly well for text classification (spam detection, sentiment analysis) because the independence assumption, while wrong, does not hurt the prediction task much.

# Bayes' theorem in Python — medical test example
def bayes_update(prior, likelihood_given_true, likelihood_given_false):
    """P(H|E) = P(E|H) * P(H) / P(E)"""
    p_evidence = likelihood_given_true * prior + likelihood_given_false * (1 - prior)
    return (likelihood_given_true * prior) / p_evidence

# Disease prevalence: 1% of population is sick
# Test sensitivity: 95% of sick people test positive
# False positive rate: 5% of healthy people also test positive
p_sick_given_positive = bayes_update(
    prior=0.01,
    likelihood_given_true=0.95,
    likelihood_given_false=0.05
)
print(f"P(sick | positive test) = {p_sick_given_positive:.4f}")  # → ≈ 0.16

# Naïve Bayes classifier (spam detection sketch)
# P(spam | "money", "free") ∝ P("money"|spam) * P("free"|spam) * P(spam)
p_spam = 0.3
p_money_spam, p_free_spam   = 0.6, 0.7   # word probabilities in spam
p_money_ham,  p_free_ham    = 0.1, 0.1   # word probabilities in ham

score_spam = p_money_spam * p_free_spam * p_spam
score_ham  = p_money_ham  * p_free_ham  * (1 - p_spam)
p_spam_given_words = score_spam / (score_spam + score_ham)
print(f"P(spam | 'money free') = {p_spam_given_words:.4f}")  # → ≈ 0.84

Quiz


In Bayes' theorem P(H|E) = P(E|H)·P(H) / P(E), what is P(H) called?