The Question
How do you measure surprise?
If the weather forecast says 99% chance of sun and it rains, you are surprised. If it says 50% chance of rain and it rains, you are not very surprised. If it says 1% chance of rain and it rains, you are extremely surprised.
The mathematical theory of information quantifies this intuition precisely. Claude Shannon developed it in 1948 — and in doing so, laid the theoretical foundation for everything from ZIP files to language models.
Self-Information
Define the self-information I(x) of an event x with probability P(x) as:

I(x) = -log₂ P(x)

- I(x) — self-information (surprisal) of event x — measured in bits or nats
- log₂ — base-2 logarithm — gives information in bits
- P(x) — probability of event x
Why this specific formula? Three properties pin it down uniquely:
- Certainty → no information: if P(x) = 1, then I(x) = -log₂ 1 = 0. Hearing something you knew for certain tells you nothing.
- Rare events → high information: as P(x) → 0, I(x) → ∞. An extremely unlikely event is extremely informative.
- Independent events are additive: when x and y are independent, I(x, y) = I(x) + I(y), because log P(x)P(y) = log P(x) + log P(y).
Only the log function satisfies all three.
Concrete examples:
- P(x) = 1: I(x) = 0 bits. No surprise.
- P(x) = 1/2: I(x) = 1 bit. One coin flip.
- P(x) = 1/4: I(x) = 2 bits. Two coin flips.
- P(x) = 1/8: I(x) = 3 bits. Three coin flips.
- P(x) = 0.01: I(x) ≈ 6.64 bits. Very surprising.
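To check these numbers yourself, here is a minimal sketch in Python (using numpy; the helper name self_information is mine, not a library function):

import numpy as np

def self_information(p, base=2):
    """Self-information I(x) = -log P(x), in bits (base 2) or nats (base e)."""
    log = np.log2 if base == 2 else np.log
    return -log(p) + 0.0   # + 0.0 turns -0.0 into 0.0 when p == 1

for p in [1.0, 0.5, 0.25, 0.125, 0.01]:
    print(f"P(x) = {p:<5} → I(x) = {self_information(p):.2f} bits")

# Additivity: two independent fair coin flips (joint probability 0.25) carry 1 + 1 = 2 bits
print(self_information(0.5) + self_information(0.5), "==", self_information(0.25))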
Entropy: Expected Surprise
A single event's self-information depends on which event occurred. Entropy H(X) is the expected self-information over all outcomes:

H(X) = E[I(X)] = -Σₓ P(x) log₂ P(x)

- H(X) — entropy of random variable X — measured in bits or nats
- P(x) — probability of outcome x
- Σₓ — sum over all possible outcomes x of X

Convention: 0 · log 0 = 0 (since p log p → 0 as p → 0).
Entropy measures the average uncertainty of the random variable before you observe it. High entropy = hard to predict. Low entropy = easy to predict.
Worked Example: Fair vs Biased Coin
Fair coin (P(heads) = P(tails) = 0.5):
- H(X) = -(0.5 log₂ 0.5 + 0.5 log₂ 0.5) = 1 bit — entropy of a fair coin flip
Biased coin (P(heads) = 0.9, P(tails) = 0.1):
- H(X) = -(0.9 log₂ 0.9 + 0.1 log₂ 0.1) ≈ 0.469 bits — entropy of a biased coin flip
The biased coin has only 0.469 bits of entropy — it is much more predictable than the fair coin. You can usually guess "heads" and be right.
Deterministic variable (P(x) = 1 for a single outcome):
Here, H(X) = -1 · log₂ 1 = 0 bits. Zero uncertainty.
Maximum Entropy: The Uniform Distribution
Among all distributions over n outcomes, entropy is maximized by the uniform distribution P(x) = 1/n:

H(X) ≤ log₂ n, with equality exactly when P is uniform

- n — number of possible outcomes
- log₂ n — maximum possible entropy for n outcomes
Intuition: the uniform distribution is the state of maximum ignorance — you have no reason to prefer any outcome over any other.
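A quick numerical check of this claim, sketched with numpy (entropy_bits is a local helper, not a library call): a few random distributions over n = 6 outcomes all fall below the uniform distribution's log₂ 6 bits.

import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

n = 6
rng = np.random.default_rng(0)
for _ in range(3):
    p = rng.dirichlet(np.ones(n))            # a random distribution over n outcomes
    print(f"random distribution:  {entropy_bits(p):.3f} bits")
print(f"uniform distribution: {entropy_bits(np.ones(n) / n):.3f} bits (= log2(6) ≈ {np.log2(n):.3f})")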
Units: Bits vs Nats
The choice of logarithm base determines the unit:
| Base | Unit | Common in |
|---|---|---|
| 2 | bits (shannon) | Information theory, data compression |
| e (natural log) | nats | Mathematics, ML (PyTorch/TensorFlow use nats by default) |
Conversion: 1 nat = 1/ln 2 ≈ 1.443 bits, so H_bits(X) = H_nats(X) / ln 2.
In ML code, you'll see torch.log (natural log) everywhere — ML frameworks measure entropy in nats. The theory is the same; only the units change.
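A minimal sketch of the conversion, written with numpy so it runs standalone (torch.log would give the same nats value on tensors):

import numpy as np

p = np.array([0.9, 0.1])
h_nats = -np.sum(p * np.log(p))    # natural log (what torch.log uses) → nats
h_bits = -np.sum(p * np.log2(p))   # base-2 log → bits
print(f"{h_nats:.4f} nats = {h_nats / np.log(2):.4f} bits (computed directly: {h_bits:.4f} bits)")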
Why Entropy Matters for ML
Entropy is not a curiosity — it is the foundation of the most important loss function in classification:
Cross-entropy loss (the standard classification loss) is built directly from entropy. When you write:
loss = F.cross_entropy(logits, labels)
You are computing the cross-entropy H(p, q) = -Σₓ p(x) log q(x) between the true label distribution p and the model's predicted distribution q — a function of entropy. The next lesson derives this connection exactly.
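As a quick sanity check (a sketch assuming PyTorch is available), F.cross_entropy on a single example matches the negative log of the model's predicted probability for the true class, measured in nats:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # one example, three classes
labels = torch.tensor([0])                  # the true class is index 0

loss = F.cross_entropy(logits, labels)           # the standard classification loss
manual = -F.log_softmax(logits, dim=1)[0, 0]     # -log q(true class), in nats
print(f"{loss.item():.6f} == {manual.item():.6f}")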
Entropy also appears in:
- Decision trees: information gain = entropy before split - entropy after split
- Variational autoencoders: the ELBO objective contains entropy terms
- Reinforcement learning: maximum entropy RL adds an entropy bonus to encourage exploration
- Data compression: Shannon's source coding theorem says you need at least H(X) bits on average to encode X
import numpy as np
def entropy(probs, base=2):
"""Shannon entropy H(X) in bits (base=2) or nats (base=e)."""
probs = np.array(probs)
# Convention: 0·log(0) = 0 (zero terms contribute nothing)
nonzero = probs[probs > 0]
if base == 2:
return -np.sum(nonzero * np.log2(nonzero))
return -np.sum(nonzero * np.log(nonzero))
# Fair coin: maximum uncertainty
print(f"Fair coin: {entropy([0.5, 0.5]):.4f} bits") # → 1.0
# Biased coin: less uncertainty
print(f"90/10 coin: {entropy([0.9, 0.1]):.4f} bits") # → 0.469
# Certain outcome: zero entropy
print(f"Certain: {entropy([1.0, 0.0]):.4f} bits") # → 0.0
# Fair die: log₂(6) ≈ 2.585 bits
print(f"Fair die: {entropy([1/6]*6):.4f} bits") # → 2.585
# Class imbalance in ML: unbalanced dataset has lower entropy
p_class1 = 0.95 # 95% of examples are class 1
print(f"Imbalanced labels: {entropy([p_class1, 1-p_class1]):.4f} bits") # → 0.286