Math Foundation Probability & Statistics
Lesson 2 ⏱ 10 min

Expected value


Expected Value: The Long-Run Average

From coin flip games to variance, linearity of expectation, and why the training loss is an empirical expected loss.


Quick refresher

Probability basics

P(A) is between 0 and 1. P(not A) = 1 - P(A). Joint probability for independent events: P(A and B) = P(A)·P(B). Conditional: P(A|B) = P(A∩B)/P(B).

Example

P(heads) = 0.5.

P(heads AND heads) = 0.25.

P(rolling 4 | rolling even) = 1/3.
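
The refresher rules above can be checked by brute-force enumeration. A minimal sketch in Python of the conditional-probability example, using exact fractions (the variable names are just for illustration):

```python
from fractions import Fraction

# Enumerate the six equally likely faces of a fair die
faces = range(1, 7)
even = {f for f in faces if f % 2 == 0}

# P(rolling 4 | rolling even) = P(4 and even) / P(even)
p_even = Fraction(len(even), 6)
p_4_and_even = Fraction(len({4} & even), 6)
p_4_given_even = p_4_and_even / p_even
print(p_4_given_even)  # → 1/3
```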

The Long-Run Average

Imagine playing a game where you flip a coin: heads gives you $10, tails gives you $0. How much should you expect to win per flip?

Not $10 - that is only if you get heads every time. Not $0 - that is only if you always get tails. Since it is 50/50, over many, many flips you would average $5 per flip. That is the expected value.

The expected value is the long-run average you would observe if you repeated the experiment infinitely. It is not the most likely outcome. It is not what you will get on your next trial. It is the average over infinitely many trials.

Expected value is the foundation of every loss function in machine learning. When you minimize mean squared error, you are minimizing the expected squared error over your training distribution. When you reason about what a model will do on average — across thousands of predictions — you are computing expected values.

The Formula

For a discrete random variable X taking values x_1, x_2, \ldots, x_k:

E[X] = \sum_{i} x_i \cdot P(X = x_i)

where x_i is the i-th possible outcome and P(X = x_i) is the probability of that outcome.

Multiply each possible outcome by its probability, then sum. It is a probability-weighted average.

Coin flip game:

E[\text{winnings}] = 10 \cdot 0.5 + 0 \cdot 0.5 = \$5

where P(\text{heads}) = 0.5.

A fair 6-sided die:

E[\text{roll}] = 1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{1}{6} + \cdots + 6 \cdot \tfrac{1}{6} = \tfrac{21}{6} = 3.5

where each face value from 1 to 6 has probability 1/6.

You will never roll 3.5. "Expected" is a misleading name - it means the long-run average, not what you expect on the next roll.

Linearity of Expectation

Expected value obeys a beautifully simple rule:

E[aX + b] = a \, E[X] + b

where a is a scalar multiplier and b is a constant offset.

And for sums of variables:

E[X + Y] = E[X] + E[Y]

where X and Y are any two random variables.

This holds even when X and Y are not independent - which is surprising and extremely useful. You can always split an expected sum into a sum of expectations.

Example: Component cost is X with E[X] = $50. You buy 3 and pay $10 shipping:

E[3X + 10] = 3 \cdot E[X] + 10 = 3 \cdot 50 + 10 = \$160

where 3X + 10 is the total cost.

It works out as simple algebra, even though X is random.
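
Both rules can be checked numerically. A quick sketch, assuming a hypothetical cost distribution uniform over {40, 50, 60} so that E[X] = 50; setting Y = -X gives a deliberately dependent pair:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical component cost: uniform over {40, 50, 60}, so E[X] = 50
X = rng.choice([40, 50, 60], size=100_000)

# Linearity with a scalar and offset: E[3X + 10] = 3 * E[X] + 10 = 160
print((3 * X + 10).mean())   # ≈ 160

# Y = -X is perfectly dependent on X, yet E[X + Y] = E[X] + E[Y] still holds
# (both sides equal 0 here)
Y = -X
print((X + Y).mean())        # exactly 0
print(X.mean() + Y.mean())   # exactly 0
```

Note that the dependence between X and Y is extreme, yet the sum rule holds with no caveats: that is the whole point of linearity.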

Variance: How Spread Out Are Outcomes?

Expected value gives you the center. Variance tells you how spread out the outcomes are:

\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - \mu^2

where \mu = E[X] is the mean; the variance is the expected squared deviation from the mean.

The standard deviation \sqrt{\text{Var}(X)} is in the same units as X, making it easier to interpret.

Two games, both with E[X] = $5:

  • Game A: always pays exactly $5. Var(A) = 0. Boring but predictable.
  • Game B: pays $0 or $10 with equal probability. Var(B) = E[(B - 5)^2] = 25 \cdot 0.5 + 25 \cdot 0.5 = 25. Same average, wildly different risk.

Same expected value, very different experience. This is why average loss alone does not tell you everything about a model's behavior.
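
The two games can be computed straight from the definition. A small sketch, using the payoff tables from the example above:

```python
import numpy as np

# Game A: always pays $5.  Game B: pays $0 or $10 with equal probability.
payoffs_A = np.array([5.0]);        probs_A = np.array([1.0])
payoffs_B = np.array([0.0, 10.0]);  probs_B = np.array([0.5, 0.5])

def mean_and_var(x, p):
    mu = np.sum(x * p)               # E[X]: probability-weighted average
    var = np.sum((x - mu) ** 2 * p)  # E[(X - mu)^2]
    return mu, var

mu_A, var_A = mean_and_var(payoffs_A, probs_A)
mu_B, var_B = mean_and_var(payoffs_B, probs_B)
print(f"Game A: E = {mu_A:.1f}, Var = {var_A:.1f}")  # → Game A: E = 5.0, Var = 0.0
print(f"Game B: E = {mu_B:.1f}, Var = {var_B:.1f}")  # → Game B: E = 5.0, Var = 25.0
```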

Expected Loss in ML

Your training dataset is a sample drawn from the true data distribution - all possible (input, label) pairs weighted by how often they occur. The training loss is:

L_{\text{train}} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i)

where n is the number of training examples and L is the per-example loss function.

This is the empirical expected loss - a sample average approximating the true expected loss:

L_{\text{true}} = E_{(x,y) \sim \mathcal{D}}[L(y, f(x))]

where \mathcal{D} is the data distribution over all possible (x, y) pairs and f is the model function.

Here, L_{\text{true}} is what you actually care about - performance on new, unseen data. With large training sets, L_{\text{train}} \approx L_{\text{true}} by the law of large numbers. With small training sets, the approximation is noisy. A model that memorizes training examples and drives L_{\text{train}} \to 0 while L_{\text{true}} remains large is overfitting.
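
The gap between empirical and true expected loss can be sketched with a toy setup. Everything here is a made-up illustration: x ~ N(0, 1), y = 2x plus unit-variance noise, and a fixed model f(x) = 2x, so the true expected squared loss is exactly the noise variance, 1:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: y = 2x + noise, noise ~ N(0, 1), fixed model f(x) = 2x.
# The true expected squared loss E[(y - f(x))^2] = E[noise^2] = 1.
def empirical_loss(n):
    x = rng.normal(size=n)
    y = 2 * x + rng.normal(size=n)    # labels with unit-variance noise
    return np.mean((y - 2 * x) ** 2)  # L_train: sample average of the loss

for n in [10, 100, 10_000]:
    print(f"n={n:>6}: L_train = {empirical_loss(n):.3f}")  # noisy for small n, ≈ 1 for large n
```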

import numpy as np

# Expected value: E[X] = sum(x * P(X=x))
dice_values = np.array([1, 2, 3, 4, 5, 6])
probabilities = np.ones(6) / 6           # uniform: each face prob 1/6
expected_value = np.sum(dice_values * probabilities)
print(f"E[die] = {expected_value:.1f}")  # → 3.5

# Variance: E[(X - E[X])²]
variance = np.sum((dice_values - expected_value)**2 * probabilities)
std_dev  = np.sqrt(variance)
print(f"Var[die] = {variance:.4f}")      # → 2.9167
print(f"SD[die]  = {std_dev:.4f}")       # → 1.7078

# Sample mean converges to E[X] (Law of Large Numbers)
samples = np.random.randint(1, 7, size=10000)
print(f"Sample mean (n=10000): {samples.mean():.4f}")  # ≈ 3.5

Interactive example

Expected value simulator - run many trials and watch the sample average converge to E[X]

Coming soon

Quiz

1 / 3

For a fair 6-sided die, what is E[roll]?