The Question
How do you measure surprise?
If the weather forecast says 99% chance of sun and it rains, you are surprised. If it says 50% chance of rain and it rains, you are not very surprised. If it says 1% chance of rain and it rains, you are extremely surprised.
The mathematical theory of information quantifies this intuition precisely. Claude Shannon developed it in 1948 — and in doing so, laid the theoretical foundation for everything from ZIP files to language models.
Self-Information
Define the self-information I(x) of an event x with probability P(x) as:

I(x) = -log₂ P(x)

- I(x) — self-information (surprisal) of event x — measured in bits or nats
- log₂ — base-2 logarithm — gives information in bits
- P(x) — probability of event x
Why this specific formula? Three properties pin it down uniquely:
- Certainty → no information: if P(x) = 1, then I(x) = -log₂ 1 = 0. Hearing something you knew for certain tells you nothing.
- Rare events → high information: as P(x) → 0, I(x) → ∞. An extremely unlikely event is extremely informative.
- Independent events are additive: when x and y are independent, I(x, y) = I(x) + I(y), because log P(x)P(y) = log P(x) + log P(y).
Only the log function satisfies all three.
Concrete examples:
- P(x) = 1: I(x) = 0 bits. No surprise.
- P(x) = 1/2: I(x) = 1 bit. One coin flip.
- P(x) = 1/4: I(x) = 2 bits. Two coin flips.
- P(x) = 1/8: I(x) = 3 bits. Three coin flips.
- P(x) = 0.01: I(x) ≈ 6.64 bits. Very surprising.
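To check these numbers yourself, here is a minimal sketch in Python (using numpy; the helper name self_information is mine, not a library function):

import numpy as np

def self_information(p, base=2):
    """Self-information I(x) = -log P(x), in bits (base 2) or nats (base e)."""
    log = np.log2 if base == 2 else np.log
    return -log(p) + 0.0   # + 0.0 turns -0.0 into 0.0 when p == 1

for p in [1.0, 0.5, 0.25, 0.125, 0.01]:
    print(f"P(x) = {p:<5} → I(x) = {self_information(p):.2f} bits")

# Additivity: two independent fair coin flips (joint probability 0.25) carry 1 + 1 = 2 bits
print(self_information(0.5) + self_information(0.5), "==", self_information(0.25))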
Entropy: Expected Surprise
A single event's self-information depends on which event occurred. Entropy H(X) is the expected self-information over all outcomes:

H(X) = E[I(X)] = -Σₓ P(x) log₂ P(x)

- H(X) — entropy of random variable X — measured in bits or nats
- P(x) — probability of outcome x
- Σₓ — sum over all possible outcomes x of X

Convention: 0 · log 0 = 0 (since p log p → 0 as p → 0).
Entropy measures the average uncertainty of the random variable before you observe it. High entropy = hard to predict. Low entropy = easy to predict.
Worked Example: Fair vs Biased Coin
Fair coin (P(heads) = P(tails) = 0.5):
- H(X) = -(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1 bit — entropy of a fair coin flip
Biased coin (P(heads) = 0.9, P(tails) = 0.1):
- H(X) = -(0.9 log₂ 0.9 + 0.1 log₂ 0.1) ≈ 0.469 bits — entropy of a biased coin flip
The biased coin has only 0.469 bits of entropy — it is much more predictable than the fair coin. You can usually guess "heads" and be right.
Deterministic variable (P(x) = 1 for a single outcome):
Here, H(X) = -1 · log₂ 1 = 0 bits. Zero uncertainty.
Maximum Entropy: The Uniform Distribution
Among all distributions over n outcomes, entropy is maximized by the uniform distribution P(x) = 1/n:

H(X) ≤ log₂ n, with equality exactly when P is uniform

- n — number of possible outcomes
- log₂ n — maximum possible entropy for n outcomes
Intuition: the uniform distribution is the state of maximum ignorance — you have no reason to prefer any outcome over any other.
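A quick numerical check of this claim, sketched with numpy (entropy_bits is a local helper, not a library call): a few random distributions over n = 6 outcomes all fall below the uniform distribution's log₂ 6 bits.

import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

n = 6
rng = np.random.default_rng(0)
for _ in range(3):
    p = rng.dirichlet(np.ones(n))            # a random distribution over n outcomes
    print(f"random distribution:  {entropy_bits(p):.3f} bits")
print(f"uniform distribution: {entropy_bits(np.ones(n) / n):.3f} bits (= log2(6) ≈ {np.log2(n):.3f})")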
Units: Bits vs Nats
The choice of logarithm base determines the unit:
| Base | Unit | Common in |
|---|---|---|
| 2 | bits (shannon) | Information theory, data compression |
| e (natural log) | nats | Mathematics, ML (PyTorch/TensorFlow use nats by default) |
Conversion: 1 nat = 1/ln 2 ≈ 1.443 bits, so H_bits(X) = H_nats(X) / ln 2.
In ML code, you'll see torch.log (natural log) everywhere — ML frameworks measure entropy in nats. The theory is the same; only the units change.
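A minimal sketch of the conversion, written with numpy so it runs standalone (torch.log would give the same nats value on tensors):

import numpy as np

p = np.array([0.9, 0.1])
h_nats = -np.sum(p * np.log(p))    # natural log (what torch.log uses) → nats
h_bits = -np.sum(p * np.log2(p))   # base-2 log → bits
print(f"{h_nats:.4f} nats = {h_nats / np.log(2):.4f} bits (computed directly: {h_bits:.4f} bits)")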
Why Entropy Matters for ML
Entropy is not a curiosity — it is the foundation of the most important loss function in classification:
Cross-entropy loss (the standard classification loss) is built directly from entropy. When you write:
loss = F.cross_entropy(logits, labels)
You are computing the cross-entropy H(p, q) = -Σₓ p(x) log q(x) between the true label distribution p and the model's predicted distribution q — a function of entropy. The next lesson derives this connection exactly.
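As a quick sanity check (a sketch assuming PyTorch is available), F.cross_entropy on a single example matches the negative log of the model's predicted probability for the true class, measured in nats:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # one example, three classes
labels = torch.tensor([0])                  # the true class is index 0

loss = F.cross_entropy(logits, labels)           # the standard classification loss
manual = -F.log_softmax(logits, dim=1)[0, 0]     # -log q(true class), in nats
print(f"{loss.item():.6f} == {manual.item():.6f}")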
Entropy also appears in:
- Decision trees: information gain = entropy before split - entropy after split
- Variational autoencoders: the ELBO objective contains entropy terms
- Reinforcement learning: maximum entropy RL adds an entropy bonus to encourage exploration
- Data compression: Shannon's source coding theorem says you need at least H(X) bits on average to encode X
import numpy as np
def entropy(probs, base=2):
"""Shannon entropy H(X) in bits (base=2) or nats (base=e)."""
probs = np.array(probs)
# Convention: 0·log(0) = 0 (zero terms contribute nothing)
nonzero = probs[probs > 0]
if base == 2:
return -np.sum(nonzero * np.log2(nonzero))
return -np.sum(nonzero * np.log(nonzero))
# Fair coin: maximum uncertainty
print(f"Fair coin: {entropy([0.5, 0.5]):.4f} bits") # → 1.0
# Biased coin: less uncertainty
print(f"90/10 coin: {entropy([0.9, 0.1]):.4f} bits") # → 0.469
# Certain outcome: zero entropy
print(f"Certain: {entropy([1.0, 0.0]):.4f} bits") # → 0.0
# Fair die: log₂(6) ≈ 2.585 bits
print(f"Fair die: {entropy([1/6]*6):.4f} bits") # → 2.585
# Class imbalance in ML: unbalanced dataset has lower entropy
p_class1 = 0.95 # 95% of examples are class 1
print(f"Imbalanced labels: {entropy([p_class1, 1-p_class1]):.4f} bits") # → 0.286