
What is information? Entropy


Entropy: Measuring Surprise and Uncertainty

What 'information' means mathematically. Self-information of an event. Entropy as expected surprise. Fair coin vs biased coin. Maximum entropy at uniform distribution. Bits vs nats.


Quick refresher

Expected value

The expected value of a random variable is the probability-weighted average of its possible values: E[X] = Σ P(xᵢ)·xᵢ. It represents the long-run average outcome over many repetitions.

Example

A die: E[X] = (1/6)·1 + (1/6)·2 + … + (1/6)·6 = 3.5.

You never roll 3.5, but the average of many rolls converges there.
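
A minimal numerical check of this (a sketch; the sample size and seed are arbitrary choices):

import numpy as np

faces = np.arange(1, 7)                        # die outcomes 1..6
exact = np.sum((1/6) * faces)                  # probability-weighted average
rolls = np.random.default_rng(0).integers(1, 7, size=100_000)
print(exact, rolls.mean())                     # → 3.5 and ≈ 3.5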

The Question

How do you measure surprise?

If the weather forecast says 99% chance of sun and it rains, you are surprised. If it says 50% chance of rain and it rains, you are not very surprised. If it says 1% chance of rain and it rains, you are extremely surprised.

The mathematical theory of information quantifies this intuition precisely. Claude Shannon developed it in 1948 — and in doing so, laid the theoretical foundation for everything from ZIP files to language models.

Self-Information

Define the self-information of an event x with probability P(x) as:

I(x) = -log₂ P(x)

where:

  • I(x): self-information (surprisal) of event x, measured in bits or nats
  • log₂: base-2 logarithm, which gives information in bits
  • P(x): probability of event x

Why this specific formula? Three properties pin it down uniquely:

  1. Certainty → no information: if P(x) = 1, then I(x) = -log₂(1) = 0. Hearing something you knew for certain tells you nothing.
  2. Rare events → high information: as P(x) → 0, I(x) → ∞. An extremely unlikely event is extremely informative.
  3. Independent events are additive: I(x ∩ y) = I(x) + I(y) when x and y are independent, because -log(P(x)·P(y)) = -log P(x) - log P(y).

Only the log function satisfies all three.

Concrete examples:

  • P(x) = 1: I(x) = 0 bits. No surprise.
  • P(x) = 1/2: I(x) = 1 bit. One coin flip.
  • P(x) = 1/4: I(x) = 2 bits. Two coin flips.
  • P(x) = 1/8: I(x) = 3 bits. Three coin flips.
  • P(x) = 0.01: I(x) ≈ 6.64 bits. Very surprising.
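
These values come straight from the formula. A minimal sketch to reproduce them (the helper name self_information is illustrative, not a library function):

import numpy as np

def self_information(p):
    """Self-information I(x) = log2(1/p) = -log2(p), in bits."""
    return np.log2(1.0 / p)

for p in [1, 1/2, 1/4, 1/8, 0.01]:
    print(f"P(x) = {p}: I(x) = {self_information(p):.2f} bits")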

Entropy: Expected Surprise

A single event's self-information depends on which event occurred. Entropy is the expected self-information over all outcomes:

H(X) = E[I(X)] = -Σₓ P(x) log₂ P(x)

where:

  • H(X): entropy of random variable X, measured in bits or nats
  • P(x): probability of outcome x
  • Σₓ: sum over all possible outcomes x of X

Convention: 0 log 0 = 0 (since p log p → 0 as p → 0).

Entropy measures the average uncertainty of the random variable before you observe it. High entropy = hard to predict. Low entropy = easy to predict.
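
In code, this is literally a probability-weighted average of surprisals (a sketch assuming all probabilities are nonzero, so the 0 log 0 convention never triggers):

import numpy as np

probs = np.array([0.9, 0.1])          # a biased coin, revisited below
surprisal = -np.log2(probs)           # I(x) for each outcome
H = np.sum(probs * surprisal)         # E[I(X)]
print(H)                              # → ≈ 0.469 bits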

Worked Example: Fair vs Biased Coin

Fair coin (P(H) = P(T) = 0.5):

H(fair) = -(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = -(0.5·(-1) + 0.5·(-1)) = 1 bit

Biased coin (P(H) = 0.9, P(T) = 0.1):

H(biased) = -(0.9 log₂ 0.9 + 0.1 log₂ 0.1) = -(0.9·(-0.152) + 0.1·(-3.322)) ≈ 0.469 bits

The biased coin has only 0.469 bits of entropy — it is much more predictable than the fair coin. You can usually guess "heads" and be right.

Deterministic variable (P(H) = 1, P(T) = 0):

Here, H = -(1·log₂ 1 + 0·log₂ 0) = -(0 + 0) = 0 bits. Zero uncertainty.

Maximum Entropy: The Uniform Distribution

Among all distributions over n outcomes, entropy is maximized by the uniform distribution P(x) = 1/n:

H_max = log₂ n

where:

  • n: number of possible outcomes
  • H_max: maximum possible entropy for n outcomes

Intuition: the uniform distribution is the state of maximum ignorance — you have no reason to prefer any outcome over any other.
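
A quick numerical check (a sketch; the non-uniform distribution is an arbitrary example):

import numpy as np

def H(p):
    """Shannon entropy in bits."""
    p = np.asarray(p)
    p = p[p > 0]                         # apply the 0·log(0) = 0 convention
    return -np.sum(p * np.log2(p))

print(H([0.25] * 4))                     # → 2.0 = log₂(4), the maximum
print(H([0.4, 0.3, 0.2, 0.1]))           # → ≈ 1.846, strictly less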

Units: Bits vs Nats

The choice of logarithm base determines the unit:

Base | Unit | Common in
--- | --- | ---
log₂ | bits (shannons) | Information theory, data compression
ln (base e) | nats | Mathematics, ML (PyTorch/TensorFlow use nats by default)

Conversion: 1 bit = ln(2) ≈ 0.693 nats.

In ML code, you'll see torch.log (natural log) everywhere — ML frameworks measure entropy in nats. The theory is the same; only the units change.
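
Switching units is a single multiplication by ln 2, as a minimal sketch shows:

import numpy as np

p = np.array([0.9, 0.1])
H_bits = -np.sum(p * np.log2(p))         # base-2 log → bits
H_nats = -np.sum(p * np.log(p))          # natural log → nats
print(H_nats, H_bits * np.log(2))        # equal: H_nats = H_bits · ln(2)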

Why Entropy Matters for ML

Entropy is not a curiosity — it is the foundation of the most important loss function in classification:

Cross-entropy loss (the standard classification loss) is built directly from entropy. When you write:

loss = F.cross_entropy(logits, labels)

You are computing -Σ P_true(y) log P_model(y), the cross-entropy between the true label distribution and the model's predicted distribution. The next lesson derives this connection exactly.
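
To see the correspondence concretely, here is a sketch comparing the library call with the manual computation (the logits and label are made-up values; F.cross_entropy averages -log P_model(true class), in nats):

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # one example, three classes
labels = torch.tensor([0])                  # true class is index 0

loss = F.cross_entropy(logits, labels)      # library version

log_probs = F.log_softmax(logits, dim=1)    # log P_model(y) for each class
manual = -log_probs[0, labels[0]]           # -log P_model(true class)
print(loss.item(), manual.item())           # identical values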

Entropy also appears in:

  • Decision trees: information gain = entropy before split - entropy after split
  • Variational autoencoders: the ELBO objective contains entropy terms
  • Reinforcement learning: maximum entropy RL adds an entropy bonus to encourage exploration
  • Data compression: Shannon's source coding theorem says you need at least H(X) bits on average to encode X
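
The snippet below ties the lesson together: it implements the entropy formula and reproduces each of the worked examples above.
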
import numpy as np

def entropy(probs, base=2):
    """Shannon entropy H(X) in bits (base=2) or nats (base=e)."""
    probs = np.array(probs)
    # Convention: 0·log(0) = 0  (zero terms contribute nothing)
    nonzero = probs[probs > 0]
    if base == 2:
        return -np.sum(nonzero * np.log2(nonzero))
    return -np.sum(nonzero * np.log(nonzero))

# Fair coin: maximum uncertainty
print(f"Fair coin: {entropy([0.5, 0.5]):.4f} bits")   # → 1.0

# Biased coin: less uncertainty
print(f"90/10 coin: {entropy([0.9, 0.1]):.4f} bits")  # → 0.469

# Certain outcome: zero entropy
print(f"Certain: {entropy([1.0, 0.0]):.4f} bits")     # → 0.0

# Fair die: log₂(6) ≈ 2.585 bits
print(f"Fair die: {entropy([1/6]*6):.4f} bits")        # → 2.585

# Class imbalance in ML: unbalanced dataset has lower entropy
p_class1 = 0.95   # 95% of examples are class 1
print(f"Imbalanced labels: {entropy([p_class1, 1-p_class1]):.4f} bits")  # → 0.286

Quiz

Question 1 of 3

An event has probability 1/8. What is its self-information in bits?