

Mutual Information: How Much Does X Tell You About Y?

Mutual information as reduction in uncertainty. Symmetric definition. Connection to KL divergence and entropy. Noisy channel example. Why MI matters for feature selection and understanding representations.

Quick refresher

Conditional entropy

H(X|Y) = -Σ P(x,y)·log P(x|y) is the expected entropy of X after observing Y. If Y perfectly predicts X, then H(X|Y) = 0. If Y is irrelevant, H(X|Y) = H(X).

Example

X = tomorrow's weather, Y = today's weather.

H(X) = 1.5 bits (uncertain).

H(X|Y) = 0.8 bits (still uncertain but less so).

The reduction 1.5 - 0.8 = 0.7 bits is the mutual information.
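
To make the refresher concrete, here is a minimal NumPy sketch that computes H(X) and H(X|Y) from a joint table. The 3×3 weather joint below is made up for illustration; it is not the distribution behind the 1.5- and 0.8-bit figures above.

import numpy as np

# Hypothetical joint P(today, tomorrow) over three weather states
# (sun, cloud, rain). Rows = today (Y), columns = tomorrow (X).
joint = np.array([[0.30, 0.06, 0.04],   # today = sun
                  [0.05, 0.15, 0.05],   # today = cloud
                  [0.03, 0.07, 0.25]])  # today = rain

p_y = joint.sum(axis=1)                        # marginal P(today)
p_x = joint.sum(axis=0)                        # marginal P(tomorrow)

H_X = -np.sum(p_x * np.log2(p_x))              # uncertainty before observing today

# H(X|Y) = -sum_{x,y} P(x,y) log2 P(x|y), with P(x|y) = P(x,y) / P(y)
cond = joint / p_y[:, None]                    # each row is P(tomorrow | today)
H_X_given_Y = -np.sum(joint * np.log2(cond))   # uncertainty after observing today

print(f"H(X)   = {H_X:.3f} bits")
print(f"H(X|Y) = {H_X_given_Y:.3f} bits")
print(f"drop   = {H_X - H_X_given_Y:.3f} bits")  # this reduction is the mutual information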

The Question

You have two random variables X and Y. How much does knowing the value of Y tell you about X?

This is the question mutual information answers. It is the most general measure of statistical dependence — it captures not just linear correlation, but any kind of relationship.

Mutual information underpins feature selection (which inputs actually tell you something about the output?), the information bottleneck theory of what deep networks learn to compress, and diagnostic tools for analyzing how much two layers of a model share. It is the right tool whenever you need to measure statistical dependence without assuming a linear relationship.

Definition via Entropy

The conditional entropy measures how uncertain X remains after you learn Y:

H(X \mid Y) = -\sum_{x,y} P(x,y) \log P(x \mid y)

H(X|Y): conditional entropy — expected uncertainty in X given Y
P(x,y): joint probability of X=x and Y=y
P(x|y): conditional probability of X=x given Y=y

Mutual information is the reduction in uncertainty about X caused by observing Y:

I(X;Y) = H(X) - H(X \mid Y)

I(X;Y): mutual information between X and Y — measured in bits or nats
H(X): marginal entropy of X — uncertainty before observing Y
H(X|Y): conditional entropy — uncertainty after observing Y

By symmetry of the joint distribution, this is also equal to:

I(X;Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y)

H(Y): entropy of Y
H(Y|X): conditional entropy of Y given X

where H(X,Y) = -\sum_{x,y} P(x,y)\log P(x,y) is the joint entropy.

Mutual information is symmetric: I(X;Y) = I(Y;X). "How much does X tell you about Y?" equals "How much does Y tell you about X?" — even though the individual conditional entropies H(X|Y) and H(Y|X) need not be equal.

I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))

There is a beautiful alternative definition connecting mutual information to KL divergence:

I(X;Y) = D_{\text{KL}}\left(P(X,Y) \,\big\|\, P(X)\,P(Y)\right) = \sum_{x,y} P(x,y)\log\frac{P(x,y)}{P(x)\,P(y)}

P(X,Y): joint distribution of X and Y
P(X)·P(Y): product of marginals — the distribution if X and Y were independent

KL divergence measures how far the joint distribution is from the "independent" distribution. If X and Y are independent, then P(X,Y) = P(X)P(Y) and the KL divergence is zero.

This immediately gives us two key properties:

  • I(X;Y) ≥ 0 always (since KL ≥ 0).
  • I(X;Y) = 0 if and only if X and Y are independent.
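
As a quick numerical sanity check (with a made-up 2×2 joint distribution), the sketch below confirms that the entropy form H(X) + H(Y) - H(X,Y) and the KL form give the same number, and that writing MI this way makes the symmetry I(X;Y) = I(Y;X) obvious.

import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability array (joint or marginal)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint P(X, Y) with some dependence between X and Y
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])
px = joint.sum(axis=1)   # marginal of X
py = joint.sum(axis=0)   # marginal of Y

# Entropy form: I(X;Y) = H(X) + H(Y) - H(X,Y)  (manifestly symmetric in X and Y)
mi_entropy = entropy_bits(px) + entropy_bits(py) - entropy_bits(joint)

# KL form: I(X;Y) = sum_{x,y} P(x,y) * log2[ P(x,y) / (P(x) P(y)) ]
mi_kl = np.sum(joint * np.log2(joint / np.outer(px, py)))

print(f"entropy form: {mi_entropy:.6f} bits")
print(f"KL form:      {mi_kl:.6f} bits")   # same value up to floating-point error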

Worked Example: Noisy Channel

Let X = a fair coin flip (P(H) = P(T) = 0.5). Let Y = a noisy copy of X: with probability ε = 0.1, Y is flipped.

So: P(Y=H | X=H) = 0.9, P(Y=T | X=H) = 0.1, and symmetrically.

Compute the joint distribution:

        Y=H     Y=T
X=H     0.45    0.05
X=T     0.05    0.45

Marginals: P(Y=H) = 0.5, P(Y=T) = 0.5 (Y is also a fair coin).

H(X) = 1 bit (fair coin).

H(X|Y): Given Y=H, P(X=H | Y=H) = 0.9 and P(X=T | Y=H) = 0.1. So H(X | Y=H) = -(0.9·log₂ 0.9 + 0.1·log₂ 0.1) ≈ 0.469 bits. By symmetry, H(X | Y=T) ≈ 0.469 bits.

Here, H(X|Y) = 0.5·0.469 + 0.5·0.469 = 0.469 bits.

I(X;Y) = H(X) - H(X|Y) = 1 - 0.469 = 0.531 bits.

The noisy channel transmits 0.531 of the original 1 bit. With no noise (ε=0), MI = 1 bit. With 50% noise (ε=0.5, completely random), MI = 0 bits.
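
The worked example generalizes to any flip probability ε (a binary symmetric channel). A short sketch that sweeps ε reproduces the numbers above: 1 bit at ε = 0, 0.531 bits at ε = 0.1, and 0 bits at ε = 0.5.

import numpy as np

def bsc_mi_bits(eps):
    """I(X;Y) in bits for a fair coin X passed through a channel
    that flips the bit with probability eps."""
    joint = np.array([[(1 - eps) / 2, eps / 2],
                      [eps / 2, (1 - eps) / 2]])
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0                     # skip zero entries (0 * log 0 = 0)
    return np.sum(joint[mask] * np.log2(joint[mask] / np.outer(px, py)[mask]))

for eps in [0.0, 0.1, 0.25, 0.5]:
    print(f"eps = {eps:.2f}  ->  I(X;Y) = {bsc_mi_bits(eps):.3f} bits")
# eps = 0.00 -> 1.000, eps = 0.10 -> 0.531, eps = 0.25 -> 0.189, eps = 0.50 -> 0.000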

The Information Bottleneck

One powerful application is the information bottleneck principle for understanding neural networks:

A neural network encodes input X into a representation Z, which is then used to predict output Y. The goal is to find the optimal tradeoff:

\min_{Z}\; I(X;Z) - \beta\, I(Z;Y)

β: tradeoff parameter — how much to weight compression vs prediction
I(X;Z): mutual information between input X and representation Z — measures how much of the input is retained
I(Z;Y): mutual information between representation Z and output Y — measures how task-relevant the representation is

Compress X into Z as much as possible (minimize I(X;Z)) while preserving what Z knows about Y (maximize I(Z;Y)). Good representations keep only task-relevant information and discard the rest.
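
Here is a toy illustration of that tradeoff (the variables are made up for the example): X is uniform over {0, 1, 2, 3}, the label Y is the parity of X, and we compare the representation Z = X (keep everything) with Z = parity(X) (compress to one bit). The compressed Z halves I(X;Z) while giving up nothing in I(Z;Y).

import numpy as np

def mi_bits(joint):
    """Mutual information (bits) between the row and column variables of a joint table."""
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / np.outer(px, py)[mask]))

# X uniform on {0,1,2,3}; label Y = X mod 2 (parity)

# Representation A: Z = X (no compression)
joint_XZ_full = np.diag([0.25] * 4)            # P(X=i, Z=i) = 0.25
joint_ZY_full = np.array([[0.25, 0.00],        # Z=0 -> Y=0
                          [0.00, 0.25],        # Z=1 -> Y=1
                          [0.25, 0.00],        # Z=2 -> Y=0
                          [0.00, 0.25]])       # Z=3 -> Y=1

# Representation B: Z = X mod 2 (compressed to one bit)
joint_XZ_comp = np.array([[0.25, 0.00],
                          [0.00, 0.25],
                          [0.25, 0.00],
                          [0.00, 0.25]])       # P(X=i, Z = i mod 2)
joint_ZY_comp = np.array([[0.50, 0.00],
                          [0.00, 0.50]])       # Z equals Y exactly

print(f"Z = X      : I(X;Z) = {mi_bits(joint_XZ_full):.1f} bits, I(Z;Y) = {mi_bits(joint_ZY_full):.1f} bits")
print(f"Z = parity : I(X;Z) = {mi_bits(joint_XZ_comp):.1f} bits, I(Z;Y) = {mi_bits(joint_ZY_comp):.1f} bits")
# Dropping from 2 bits of I(X;Z) to 1 bit costs nothing in I(Z;Y): that is the bottleneck ideal.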

Why MI Matters for ML

Feature selection: compute I(feature; label) for each feature. Keep features with high MI — they are the most informative for the task.
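
In practice this scoring is usually done with a library estimator rather than by hand; for example, scikit-learn's mutual_info_classif estimates I(feature; label) per column. The toy data below (one informative binary feature, one pure-noise feature, 10% label noise) is made up for illustration.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
informative = rng.integers(0, 2, size=n)                 # binary feature that drives the label
noise = rng.normal(size=n)                               # continuous feature unrelated to the label
y = (informative ^ (rng.random(n) < 0.1)).astype(int)    # label = informative feature with 10% flips

X = np.column_stack([informative, noise])
scores = mutual_info_classif(X, y, discrete_features=[True, False], random_state=0)
print(scores)   # expected: first score well above zero, second close to zero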

Understanding what a model learned: probe a learned representation Z by computing I(Z; Y_property) for various properties. Does the representation encode syntax? Semantics? World knowledge?

Maximum entropy RL: the reward is augmented by α·H(π) (entropy of the policy), encouraging exploration. This is closely related to maximizing MI between actions and outcomes.

In code, mutual information is rarely computed directly (it requires estimating joint distributions). In practice, it is approximated via variational bounds — that is what the VAE's ELBO is doing.

import numpy as np

def mi_discrete(joint_probs):
    """Mutual information I(X;Y) for discrete distributions from joint probability table."""
    px = joint_probs.sum(axis=1)   # marginal of X
    py = joint_probs.sum(axis=0)   # marginal of Y
    mi = 0.0
    for i in range(len(px)):
        for j in range(len(py)):
            if joint_probs[i, j] > 0:
                mi += joint_probs[i, j] * np.log(joint_probs[i, j] / (px[i] * py[j]))
    return mi

# Example: X = coin flip quality, Y = landing result (biased coin)
# joint[0,0] = P(fair, heads), joint[0,1] = P(fair, tails), etc.
joint = np.array([[0.25, 0.25],   # fair coin: 50/50
                  [0.40, 0.10]])  # biased coin: 80/20 heads
print(f"I(coin_type; result) = {mi_discrete(joint):.4f} nats")  # → > 0 (correlated)

# Independent variables: MI should be ≈ 0
joint_indep = np.array([[0.25, 0.25],
                        [0.25, 0.25]])
print(f"I(independent) = {mi_discrete(joint_indep):.4f} nats")  # → 0.0

Quiz

1 / 3

X is a fair coin flip. Y = X (a perfect copy of X). What is I(X;Y)?