The Long-Run Average
Imagine playing a game where you flip a coin: heads gives you $10, tails gives you $0. How much should you expect to win per flip?
Not $10 - that is only if you get heads every time. Not $0 - that is only if you always get tails. Since it is 50/50, over many, many flips you would average $5 per flip. That is the expected value.
The expected value is the long-run average you would observe if you repeated the experiment infinitely. It is not the most likely outcome. It is not what you will get on your next trial. It is the average over infinitely many trials.
Expected value is the foundation of every loss function in machine learning. When you minimize mean squared error, you are minimizing the expected squared error over your training distribution. When you reason about what a model will do on average — across thousands of predictions — you are computing expected values.
The Formula
For a discrete random variable X taking values x₁, x₂, …, xₙ:

E[X] = Σᵢ xᵢ · P(X = xᵢ)

- xᵢ - the i-th possible outcome
- P(X = xᵢ) - probability of that outcome
Multiply each possible outcome by its probability, then sum. It is a probability-weighted average.
Coin flip game:

E[X] = $10 · 0.5 + $0 · 0.5 = $5

- 0.5 - probability of getting heads (and likewise tails)

A fair 6-sided die:

E[X] = (1 + 2 + 3 + 4 + 5 + 6) · (1/6) = 3.5

- 1 through 6 - each face value, each with probability 1/6
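A quick simulation makes the "long-run average" concrete (a sketch using NumPy, as in the code later in this section):

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible

# Simulate the game: payout is $10 for heads, $0 for tails, each with prob 0.5
payouts = rng.choice([10, 0], size=100_000)
print(f"Average payout per flip: ${payouts.mean():.2f}")  # close to $5
```

No single flip ever pays $5; only the average over many flips does.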
You will never roll 3.5. "Expected" is a misleading name - it means the long-run average, not what you expect on the next roll.
Linearity of Expectation
Expected value obeys a beautifully simple rule:

E[aX + b] = a · E[X] + b

- a - scalar multiplier
- b - constant offset

And for sums of variables:

E[X + Y] = E[X] + E[Y]

- X - first random variable
- Y - second random variable
This holds even when X and Y are not independent - which is surprising and extremely useful. You can always split an expected sum into a sum of expectations.
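Here is a sketch of that surprise, using a deliberately dependent pair: Y is a deterministic function of X, so the two are as far from independent as possible, yet linearity still holds.

```python
import numpy as np

rng = np.random.default_rng(42)

# X is a fair die roll; Y = 7 - X is fully determined by X (not independent!)
x = rng.integers(1, 7, size=100_000)
y = 7 - x

# Linearity still holds: E[X + Y] = E[X] + E[Y]
print(f"E[X] + E[Y] = {x.mean() + y.mean():.3f}")  # 7.000
print(f"E[X + Y]    = {(x + y).mean():.3f}")       # 7.000 (X + Y is always 7)
```

Note that nothing like Var[X + Y] = Var[X] + Var[Y] holds here; that identity does require independence. Linearity of expectation is special.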
Example: Component cost is X with E[X] = $50. You buy 3 and pay $10 shipping:

E[3X + 10] = 3 · E[X] + 10 = 3 · $50 + $10 = $160

- 3X + 10 - total cost formula

Simple as algebra, even though X is random.
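To check this numerically, here is a simulation where X is (hypothetically) uniform between $40 and $60 - an assumed distribution, chosen only so that E[X] = $50:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assume, for illustration, component cost X ~ Uniform(40, 60), so E[X] = 50
x = rng.uniform(40, 60, size=100_000)
total = 3 * x + 10  # buy 3 components, pay $10 shipping

print(f"E[3X + 10] ≈ {total.mean():.1f}")        # close to 160
print(f"3·E[X] + 10 ≈ {3 * x.mean() + 10:.1f}")  # same number
```

The answer does not depend on the distribution's shape - only on E[X] = $50 - which is exactly what linearity promises.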
Variance: How Spread Out Are Outcomes?
Expected value gives you the center. Variance tells you how spread out the outcomes are:

Var[X] = E[(X − μ)²]

- μ - the mean E[X]
- E[(X − μ)²] - expected squared deviation from the mean

The standard deviation σ = √Var[X] is in the same units as X, making it easier to interpret.
Two games, both with E[X] = $5:
- Game A: always pays exactly $5. Var[X] = 0. Boring but predictable.
- Game B: pays $0 or $10 equally. Var[X] = (0 − 5)² · 0.5 + (10 − 5)² · 0.5 = 25. Same average, wildly different risk.
Same expected value, very different experience. This is why average loss alone does not tell you everything about a model's behavior.
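The two games can be checked with a small probability-weighted helper (a sketch, not library code):

```python
import numpy as np

def variance(values, probs):
    """Var[X] = E[(X - E[X])^2], computed as a probability-weighted sum."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mu = np.sum(values * probs)               # E[X]
    return np.sum((values - mu) ** 2 * probs)

print(variance([5], [1.0]))           # Game A → 0.0
print(variance([0, 10], [0.5, 0.5]))  # Game B → 25.0
```

Same E[X] = $5 in both calls; only the spread around that center differs.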
Expected Loss in ML
Your training dataset is a sample drawn from the true data distribution - all possible (input, label) pairs weighted by how often they occur. The training loss is:

L_train = (1/n) · Σᵢ ℓ(f(xᵢ), yᵢ)

- n - number of training examples
- ℓ - per-example loss function

This is the empirical expected loss - a sample average approximating the true expected loss:

L_true = E₍ₓ,ᵧ₎~D[ℓ(f(x), y)]

- D - data distribution - all possible (x, y) pairs
- f - model function

Here, L_true is what you actually care about - performance on new, unseen data. With large training sets, L_train ≈ L_true by the law of large numbers. With small training sets, the approximation is noisy. A model that memorizes training examples and gets L_train ≈ 0 while L_true remains large is overfitting.
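A toy sketch of the sample average converging to the expected loss, under assumed data (labels are 2x plus unit-variance noise) and an assumed model f(x) = 2x that has learned the true slope, so the true expected squared error is exactly 1:

```python
import numpy as np

rng = np.random.default_rng(7)

def model(x):
    return 2 * x  # assume the model already matches the true signal

for n in [10, 1_000, 100_000]:
    x = rng.normal(size=n)
    y = 2 * x + rng.normal(size=n)  # labels: true signal + N(0, 1) noise
    empirical_loss = np.mean((model(x) - y) ** 2)  # sample-average squared error
    print(f"n = {n:>6}: empirical squared-error loss = {empirical_loss:.3f}")
```

With small n the sample average bounces around; by n = 100,000 it sits close to the true expected loss of 1 - the law of large numbers at work on a loss function.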
import numpy as np
# Expected value: E[X] = sum(x * P(X=x))
dice_values = np.array([1, 2, 3, 4, 5, 6])
probabilities = np.ones(6) / 6 # uniform: each face prob 1/6
expected_value = np.sum(dice_values * probabilities)
print(f"E[die] = {expected_value:.1f}") # → 3.5
# Variance: E[(X - E[X])²]
variance = np.sum((dice_values - expected_value)**2 * probabilities)
std_dev = np.sqrt(variance)
print(f"Var[die] = {variance:.4f}") # → 2.9167
print(f"SD[die] = {std_dev:.4f}") # → 1.7078
# Sample mean converges to E[X] (Law of Large Numbers)
samples = np.random.randint(1, 7, size=10000)
print(f"Sample mean (n=10000): {samples.mean():.4f}") # ≈ 3.5
Interactive example
Expected value simulator - run many trials and watch the sample average converge to E[X]
Coming soon