Math Foundation Probability & Statistics
Lesson 2 ⏱ 10 min

Expected value


Expected Value: The Long-Run Average

From coin flip games to variance, linearity of expectation, and why the training loss is an empirical expected loss.


Quick refresher

Probability basics

P(A) is between 0 and 1. P(not A) = 1 - P(A). Joint probability for independent events: P(A and B) = P(A)·P(B). Conditional: P(A|B) = P(A∩B)/P(B).

Example

P(heads) = 0.5.

P(heads AND heads) = 0.25.

P(rolling 4 | rolling even) = 1/3.
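
The refresher rules above can be checked by brute-force enumeration. A minimal sketch in Python of the conditional-probability example, using exact fractions (the variable names are just for illustration):

```python
from fractions import Fraction

# Enumerate the six equally likely faces of a fair die
faces = range(1, 7)
even = {f for f in faces if f % 2 == 0}

# P(rolling 4 | rolling even) = P(4 and even) / P(even)
p_even = Fraction(len(even), 6)
p_4_and_even = Fraction(len({4} & even), 6)
p_4_given_even = p_4_and_even / p_even
print(p_4_given_even)  # → 1/3
```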

The Long-Run Average

Imagine playing a game where you flip a coin: heads gives you $10, tails gives you $0. How much should you expect to win per flip?

Not $10 - that is only if you get heads every time. Not $0 - that is only if you always get tails. Since it is 50/50, over many, many flips you would average $5 per flip. That is the expected value.

The expected value is the long-run average you would observe if you repeated the experiment infinitely. It is not the most likely outcome. It is not what you will get on your next trial. It is the average over infinitely many trials.

Expected value is the foundation of every loss function in machine learning. When you minimize mean squared error, you are minimizing the expected squared error over your training distribution. When you reason about what a model will do on average — across thousands of predictions — you are computing expected values.

The Formula

For a discrete random variable X taking values x_1, x_2, \ldots, x_k:

E[X] = \sum_{i} x_i \cdot P(X = x_i)

where x_i is the i-th possible outcome and P(X = x_i) is the probability of that outcome.

Multiply each possible outcome by its probability, then sum. It is a probability-weighted average.

Coin flip game:

E[\text{winnings}] = 10 \cdot 0.5 + 0 \cdot 0.5 = \$5

where P(\text{heads}) = 0.5.

A fair 6-sided die:

E[\text{roll}] = 1 \cdot \tfrac{1}{6} + 2 \cdot \tfrac{1}{6} + \cdots + 6 \cdot \tfrac{1}{6} = \tfrac{21}{6} = 3.5

where each face value from 1 to 6 has probability 1/6.

You will never roll 3.5. "Expected" is a misleading name - it means the long-run average, not what you expect on the next roll.

Linearity of Expectation

Expected value obeys a beautifully simple rule:

E[aX + b] = a \, E[X] + b

where a is a scalar multiplier and b is a constant offset.

And for sums of variables:

E[X + Y] = E[X] + E[Y]

where X and Y are any two random variables.

This holds even when X and Y are not independent - which is surprising and extremely useful. You can always split an expected sum into a sum of expectations.

Example: Component cost is X with E[X] = $50. You buy 3 and pay $10 shipping:

E[3X + 10] = 3 \cdot E[X] + 10 = 3 \cdot 50 + 10 = \$160

where 3X + 10 is the total cost.

It works out as simple algebra, even though X is random.
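
Both rules can be checked numerically. A quick sketch, assuming a hypothetical cost distribution uniform over {40, 50, 60} so that E[X] = 50; setting Y = -X gives a deliberately dependent pair:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical component cost: uniform over {40, 50, 60}, so E[X] = 50
X = rng.choice([40, 50, 60], size=100_000)

# Linearity with a scalar and offset: E[3X + 10] = 3 * E[X] + 10 = 160
print((3 * X + 10).mean())   # ≈ 160

# Y = -X is perfectly dependent on X, yet E[X + Y] = E[X] + E[Y] still holds
# (both sides equal 0 here)
Y = -X
print((X + Y).mean())        # exactly 0
print(X.mean() + Y.mean())   # exactly 0
```

Note that the dependence between X and Y is extreme, yet the sum rule holds with no caveats: that is the whole point of linearity.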

Variance: How Spread Out Are Outcomes?

Expected value gives you the center. Variance tells you how spread out the outcomes are:

\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - \mu^2

where \mu = E[X] is the mean; the variance is the expected squared deviation from the mean.

The standard deviation \sqrt{\text{Var}(X)} is in the same units as X, making it easier to interpret.

Two games, both with E[X] = $5:

  • Game A: always pays exactly $5. Var(A) = 0. Boring but predictable.
  • Game B: pays $0 or $10 with equal probability. Var(B) = E[(B - 5)^2] = 25 \cdot 0.5 + 25 \cdot 0.5 = 25. Same average, wildly different risk.

Same expected value, very different experience. This is why average loss alone does not tell you everything about a model's behavior.
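
The two games can be computed straight from the definition. A small sketch, using the payoff tables from the example above:

```python
import numpy as np

# Game A: always pays $5.  Game B: pays $0 or $10 with equal probability.
payoffs_A = np.array([5.0]);        probs_A = np.array([1.0])
payoffs_B = np.array([0.0, 10.0]);  probs_B = np.array([0.5, 0.5])

def mean_and_var(x, p):
    mu = np.sum(x * p)               # E[X]: probability-weighted average
    var = np.sum((x - mu) ** 2 * p)  # E[(X - mu)^2]
    return mu, var

mu_A, var_A = mean_and_var(payoffs_A, probs_A)
mu_B, var_B = mean_and_var(payoffs_B, probs_B)
print(f"Game A: E = {mu_A:.1f}, Var = {var_A:.1f}")  # → Game A: E = 5.0, Var = 0.0
print(f"Game B: E = {mu_B:.1f}, Var = {var_B:.1f}")  # → Game B: E = 5.0, Var = 25.0
```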

Expected Loss in ML

Your training dataset is a sample drawn from the true data distribution - all possible (input, label) pairs weighted by how often they occur. The training loss is:

L_{\text{train}} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i)

where n is the number of training examples and L is the per-example loss function.

This is the empirical expected loss - a sample average approximating the true expected loss:

L_{\text{true}} = E_{(x,y) \sim \mathcal{D}}[L(y, f(x))]

where \mathcal{D} is the data distribution over all possible (x, y) pairs and f is the model function.

Here, L_{\text{true}} is what you actually care about - performance on new, unseen data. With large training sets, L_{\text{train}} \approx L_{\text{true}} by the law of large numbers. With small training sets, the approximation is noisy. A model that memorizes training examples and drives L_{\text{train}} \to 0 while L_{\text{true}} remains large is overfitting.
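
The gap between empirical and true expected loss can be sketched with a toy setup. Everything here is a made-up illustration: x ~ N(0, 1), y = 2x plus unit-variance noise, and a fixed model f(x) = 2x, so the true expected squared loss is exactly the noise variance, 1:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: y = 2x + noise, noise ~ N(0, 1), fixed model f(x) = 2x.
# The true expected squared loss E[(y - f(x))^2] = E[noise^2] = 1.
def empirical_loss(n):
    x = rng.normal(size=n)
    y = 2 * x + rng.normal(size=n)    # labels with unit-variance noise
    return np.mean((y - 2 * x) ** 2)  # L_train: sample average of the loss

for n in [10, 100, 10_000]:
    print(f"n={n:>6}: L_train = {empirical_loss(n):.3f}")  # noisy for small n, ≈ 1 for large n
```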

import numpy as np

# Expected value: E[X] = sum(x * P(X=x))
dice_values = np.array([1, 2, 3, 4, 5, 6])
probabilities = np.ones(6) / 6           # uniform: each face prob 1/6
expected_value = np.sum(dice_values * probabilities)
print(f"E[die] = {expected_value:.1f}")  # → 3.5

# Variance: E[(X - E[X])²]
variance = np.sum((dice_values - expected_value)**2 * probabilities)
std_dev  = np.sqrt(variance)
print(f"Var[die] = {variance:.4f}")      # → 2.9167
print(f"SD[die]  = {std_dev:.4f}")       # → 1.7078

# Sample mean converges to E[X] (Law of Large Numbers)
samples = np.random.randint(1, 7, size=10000)
print(f"Sample mean (n=10000): {samples.mean():.4f}")  # ≈ 3.5

Interactive example

Expected value simulator - run many trials and watch the sample average converge to E[X]

Coming soon

Quiz

1 / 3

For a fair 6-sided die, what is E[roll]?