What Is Probability?
A probability P(A) is a number between 0 and 1 that measures how likely an event A is to happen.
Every ML model is, at its core, a probability machine. A classifier does not say "this is a cat" — it says "there is an 87% chance this is a cat." A language model assigns probabilities to every possible next word. Before you can interpret, train, or evaluate any of these models, you need a firm grasp of what probability actually means.
- When P(A) = 0: the event is impossible.
- When P(A) = 1: the event is certain.
- Everything else is in between.
Think of probability as a fraction: how many of the possible outcomes are the favorable one? For a fair coin, 1 of 2 outcomes is heads, so P(heads) = 1/2 = 0.5.
- P(A) - probability of event A - always between 0 and 1
The sample space Ω is the complete set of everything that could happen. Its probability is always 1: P(Ω) = 1.
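To make the counting view concrete, here is a minimal Python sketch (the coin setup and variable names are illustrative, not from the lesson):
# Probability as a count: favorable outcomes / all equally likely outcomes
sample_space = ['H', 'T']                        # a fair coin has two outcomes
favorable = [o for o in sample_space if o == 'H']
p_heads = len(favorable) / len(sample_space)
print(p_heads)  # 0.5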
The Complement Rule
P(not A) = 1 - P(A)
- A - event of interest
- P(not A) - probability that A does NOT happen
Whatever P(A) is, P(not A) = 1 - P(A). The two must sum to 1 - something either happens or it does not.
This is more useful than it sounds. Sometimes P(not A) is far easier to compute. "Probability of at least one head in 10 flips" sounds complicated until you flip it: P(at least one head) = 1 - P(no heads) = 1 - (1/2)^10 ≈ 0.999.
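The arithmetic is a couple of lines of Python (a quick sketch assuming a fair coin):
# Complement trick: P(at least one head) = 1 - P(no heads in 10 flips)
p_no_heads = 0.5 ** 10                 # every one of the 10 flips is tails
p_at_least_one_head = 1 - p_no_heads
print(f"{p_at_least_one_head:.4f}")    # 0.9990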
Interactive example (coming soon): complement rule visualizer - adjust P(A) and see P(not A) update.
Joint Probability: AND
Here, P(A ∩ B) is the probability that both A and B happen simultaneously.
When A and B are independent (knowing one tells you nothing about the other):
P(A ∩ B) = P(A) × P(B)
- P(A) - probability of A
- P(B) - probability of B
Example: two fair coin flips. P(heads on the first ∩ heads on the second) = 0.5 × 0.5 = 0.25.
When events are not independent - consecutive rainy days are correlated - this formula breaks down. For dependent events (sketched in code below):
P(A ∩ B) = P(A) × P(B | A)
- P(B | A) - probability of B given A has occurred
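A small numeric sketch of the difference, using invented rain probabilities purely for illustration:
# Joint probability: independent vs. dependent events (assumed toy numbers)
p_rain_today = 0.3
p_rain_tomorrow = 0.3
# If the two days were independent, the probabilities would just multiply:
p_both_if_independent = p_rain_today * p_rain_tomorrow            # 0.09
# But rainy days cluster; suppose P(rain tomorrow | rain today) = 0.6:
p_rain_tomorrow_given_today = 0.6
p_both_dependent = p_rain_today * p_rain_tomorrow_given_today     # 0.18
print(p_both_if_independent, p_both_dependent)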
This brings us to the most important concept in probability for ML.
Conditional Probability: GIVEN
The notation P(A | B) - read "probability of A given B" - means: IF B has already happened, what is the probability of A?
P(A | B) = P(A ∩ B) / P(B)
- P(A ∩ B) - probability that both A and B occur
- P(B) - probability that B occurs
You restrict to the world where B happened (divide by P(B)), then look at what fraction of that world also has A.
Example: roll a fair die. What is P(roll is 4 | roll is even)?
- We know it is even, so only {2, 4, 6} are possible (the sample space shrinks).
- Of those three, only one outcome gives us 4.
- Formally: P(4 | even) = P(4 ∩ even) / P(even) = (1/6) / (1/2) = 1/3.
Compare to the unconditional P(4) = 1/6. Learning that the roll is even doubled the probability, because it cut the sample space in half.
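You can verify this by brute-force enumeration (a minimal sketch):
# P(4 | even) by enumerating the outcomes of a fair die
die = [1, 2, 3, 4, 5, 6]
evens = [x for x in die if x % 2 == 0]     # restricted sample space: [2, 4, 6]
p_4_given_even = evens.count(4) / len(evens)
print(p_4_given_even)  # 0.3333...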
The Addition Rule
For events that can overlap:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B)
- P(A ∪ B) - probability that A or B or both occur
Subtract the intersection to avoid counting it twice. If A and B are mutually exclusive (they cannot both happen), P(A ∩ B) = 0 and the formula simplifies to P(A ∪ B) = P(A) + P(B).
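A quick check on a single die roll (an illustrative sketch; the choice of events here is hypothetical):
# Addition rule: A = "roll is even", B = "roll is 4 or more"
die = [1, 2, 3, 4, 5, 6]
p_even = len([x for x in die if x % 2 == 0]) / len(die)                      # 3/6
p_ge4 = len([x for x in die if x >= 4]) / len(die)                           # 3/6
p_even_and_ge4 = len([x for x in die if x % 2 == 0 and x >= 4]) / len(die)   # {4, 6} -> 2/6
p_even_or_ge4 = p_even + p_ge4 - p_even_and_ge4                              # 4/6 ≈ 0.667
print(p_even_or_ge4)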
Independence
Events A and B are independent when knowing B gives you no information about A:
P(A | B) = P(A)
- P(A | B) - probability of A given B
- P(A) - unconditional probability of A
Coin flips are independent - each flip starts fresh. But tomorrow's weather is not independent of today's weather. In ML, the i.i.d. (independent and identically distributed) assumption says that training examples are independent of each other. This assumption drives much of ML theory, and it is often violated in practice (think: time series, correlated samples, users in a recommendation system). Knowing when it is violated helps you anticipate failure modes.
# Computing probabilities from data
outcomes = ['H', 'T', 'H', 'H', 'T', 'H', 'T', 'T', 'H', 'H']
# Empirical probability (count / total)
p_heads = outcomes.count('H') / len(outcomes)
print(f"P(H) = {p_heads:.2f}") # → 0.60
# Conditional probability: P(A|B) = P(A ∩ B) / P(B)
# Example: given a patient tests positive, what's the probability they're sick?
p_positive_given_sick = 0.95
p_sick = 0.01
p_positive_given_healthy = 0.05
# Overall positive rate, via the law of total probability
p_positive = p_positive_given_sick * p_sick + p_positive_given_healthy * (1 - p_sick)
p_sick_given_positive = (p_positive_given_sick * p_sick) / p_positive
print(f"P(sick | positive) = {p_sick_given_positive:.4f}") # → ~0.16 (Bayes!)
# Independence check: is P(A ∩ B) == P(A) * P(B)?
p_a, p_b = 0.3, 0.5
p_a_and_b_observed = 0.20                     # hypothetical measured joint probability
p_a_and_b_if_independent = p_a * p_b          # 0.15
print(p_a_and_b_observed == p_a_and_b_if_independent)  # False -> A and B are dependent
Interactive example (coming soon): conditional probability explorer - see how restricting the sample space changes probabilities.