What Is Probability?
A probability P(A) is a number between 0 and 1 that measures how likely an event A is to happen.
Every ML model is, at its core, a probability machine. A classifier does not say "this is a cat" — it says "there is an 87% chance this is a cat." A language model assigns probabilities to every possible next word. Before you can interpret, train, or evaluate any of these models, you need a firm grasp of what probability actually means.
- When P(A) = 0: the event is impossible.
- When P(A) = 1: the event is certain.
- Everything else is in between.
Think of probability as a fraction: how many of the possible outcomes are the favorable one? For a fair coin, 1 of 2 outcomes is heads, so P(heads) = 1/2 = 0.5.
- P(A) - probability of event A - always between 0 and 1
The sample space Ω is the complete set of everything that could happen. Its probability is always 1: P(Ω) = 1.
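To make the counting view concrete, here is a minimal Python sketch (the coin setup and variable names are illustrative, not from the lesson):
# Probability as a count: favorable outcomes / all equally likely outcomes
sample_space = ['H', 'T']                        # a fair coin has two outcomes
favorable = [o for o in sample_space if o == 'H']
p_heads = len(favorable) / len(sample_space)
print(p_heads)  # 0.5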
The Complement Rule
P(not A) = 1 - P(A)
- A - event of interest
- P(not A) - probability that A does NOT happen
Whatever P(A) is, P(not A) = 1 - P(A). The two must sum to 1 - something either happens or it does not.
This is more useful than it sounds. Sometimes P(not A) is far easier to compute. "Probability of at least one head in 10 flips" sounds complicated until you flip it: P(at least one head) = 1 - P(no heads) = 1 - (1/2)^10 ≈ 0.999.
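The arithmetic is a couple of lines of Python (a quick sketch assuming a fair coin):
# Complement trick: P(at least one head) = 1 - P(no heads in 10 flips)
p_no_heads = 0.5 ** 10                 # every one of the 10 flips is tails
p_at_least_one_head = 1 - p_no_heads
print(f"{p_at_least_one_head:.4f}")    # 0.9990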
Interactive example (coming soon): complement rule visualizer - adjust P(A) and see P(not A) update.
Joint Probability: AND
Here, P(A ∩ B) is the probability that both A and B happen simultaneously.
When A and B are independent (knowing one tells you nothing about the other):
P(A ∩ B) = P(A) × P(B)
- P(A) - probability of A
- P(B) - probability of B
Example: two fair coin flips. P(heads on the first ∩ heads on the second) = 0.5 × 0.5 = 0.25.
When events are not independent - consecutive rainy days are correlated - this formula breaks down. For dependent events (sketched in code below):
P(A ∩ B) = P(A) × P(B | A)
- P(B | A) - probability of B given A has occurred
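A small numeric sketch of the difference, using invented rain probabilities purely for illustration:
# Joint probability: independent vs. dependent events (assumed toy numbers)
p_rain_today = 0.3
p_rain_tomorrow = 0.3
# If the two days were independent, the probabilities would just multiply:
p_both_if_independent = p_rain_today * p_rain_tomorrow            # 0.09
# But rainy days cluster; suppose P(rain tomorrow | rain today) = 0.6:
p_rain_tomorrow_given_today = 0.6
p_both_dependent = p_rain_today * p_rain_tomorrow_given_today     # 0.18
print(p_both_if_independent, p_both_dependent)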
This brings us to the most important concept in probability for ML.
Conditional Probability: GIVEN
The notation P(A | B) - read "probability of A given B" - means: IF B has already happened, what is the probability of A?
P(A | B) = P(A ∩ B) / P(B)
- P(A ∩ B) - probability that both A and B occur
- P(B) - probability that B occurs
You restrict to the world where B happened (divide by P(B)), then look at what fraction of that world also has A.
Example: roll a fair die. What is P(roll is 4 | roll is even)?
- We know it is even, so only {2, 4, 6} are possible (the sample space shrinks).
- Of those three, only one outcome gives us 4.
- Formally: P(4 | even) = P(4 ∩ even) / P(even) = (1/6) / (1/2) = 1/3.
Compare to the unconditional P(4) = 1/6. Learning that the roll is even doubled the probability, because it cut the sample space in half.
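You can verify this by brute-force enumeration (a minimal sketch):
# P(4 | even) by enumerating the outcomes of a fair die
die = [1, 2, 3, 4, 5, 6]
evens = [x for x in die if x % 2 == 0]     # restricted sample space: [2, 4, 6]
p_4_given_even = evens.count(4) / len(evens)
print(p_4_given_even)  # 0.3333...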
The Addition Rule
For events that can overlap:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
- P(A ∪ B) - probability that A or B or both occur
Subtract the intersection to avoid counting it twice. If A and B are mutually exclusive (they cannot both happen), P(A ∩ B) = 0 and the formula simplifies to P(A ∪ B) = P(A) + P(B).
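A quick check on a single die roll (an illustrative sketch; the choice of events here is hypothetical):
# Addition rule: A = "roll is even", B = "roll is 4 or more"
die = [1, 2, 3, 4, 5, 6]
p_even = len([x for x in die if x % 2 == 0]) / len(die)                      # 3/6
p_ge4 = len([x for x in die if x >= 4]) / len(die)                           # 3/6
p_even_and_ge4 = len([x for x in die if x % 2 == 0 and x >= 4]) / len(die)   # {4, 6} -> 2/6
p_even_or_ge4 = p_even + p_ge4 - p_even_and_ge4                              # 4/6 ≈ 0.667
print(p_even_or_ge4)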
Independence
Events A and B are independent when knowing B gives you no information about A:
P(A | B) = P(A)
- P(A | B) - probability of A given B
- P(A) - unconditional probability of A
Coin flips are independent - each flip starts fresh. But tomorrow's weather is not independent of today's weather. In ML, the i.i.d. (independent and identically distributed) assumption says that training examples are independent of each other. This assumption drives much of ML theory, and it is often violated in practice (think: time series, correlated samples, users in a recommendation system). Knowing when it is violated helps you anticipate failure modes.
# Computing probabilities from data
outcomes = ['H', 'T', 'H', 'H', 'T', 'H', 'T', 'T', 'H', 'H']
# Empirical probability (count / total)
p_heads = outcomes.count('H') / len(outcomes)
print(f"P(H) = {p_heads:.2f}") # → 0.60
# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
# Example: given a patient tests positive, what's the probability they're sick?
p_positive_given_sick = 0.95
p_sick = 0.01
p_positive_given_healthy = 0.05
# Overall positive rate, via the law of total probability
p_positive = p_positive_given_sick * p_sick + p_positive_given_healthy * (1 - p_sick)
p_sick_given_positive = (p_positive_given_sick * p_sick) / p_positive
print(f"P(sick | positive) = {p_sick_given_positive:.4f}") # → ~0.16 (Bayes!)
# Independence check: is P(A ∩ B) == P(A) * P(B)?
p_a, p_b = 0.3, 0.5
p_a_and_b_observed = 0.20                     # hypothetical measured joint probability
p_a_and_b_if_independent = p_a * p_b          # 0.15
print(p_a_and_b_observed == p_a_and_b_if_independent)  # False -> A and B are dependent
Interactive example (coming soon): conditional probability explorer - see how restricting the sample space changes probabilities.