The Question
You have two random variables X and Y. How much does knowing the value of Y tell you about X?
This is the question mutual information I(X;Y) answers. It is the most general measure of statistical dependence — it captures not just linear correlation, but any kind of relationship.
Mutual information underpins feature selection (which inputs actually tell you something about the output?), the information bottleneck theory of what deep networks learn to compress, and diagnostic tools for analyzing how much two layers of a model share. It is the right tool whenever you need to measure statistical dependence without assuming a linear relationship.
Definition via Entropy
The conditional entropy H(X|Y) measures how uncertain X remains after you learn Y:

H(X|Y) = -Σ_{x,y} p(x,y) log p(x|y)

- H(X|Y) — conditional entropy — expected uncertainty in X given Y
- p(x,y) — joint probability of X=x and Y=y
- p(x|y) — conditional probability of X=x given Y=y
Mutual information is the reduction in uncertainty about X caused by observing Y:

I(X;Y) = H(X) - H(X|Y)

- I(X;Y) — mutual information between X and Y — measured in bits or nats
- H(X) — marginal entropy of X — uncertainty before observing Y
- H(X|Y) — conditional entropy — uncertainty after observing Y
By symmetry of the joint distribution, this is also equal to:

I(X;Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)

- H(Y) — entropy of Y
- H(Y|X) — conditional entropy of Y given X

where H(X,Y) is the joint entropy.
Mutual information is symmetric: I(X;Y) = I(Y;X). "How much does X tell you about Y?" equals "How much does Y tell you about X?" — even though the individual conditional entropies H(X|Y) and H(Y|X) need not be equal.
I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))
There is a beautiful alternative definition connecting mutual information to KL divergence:

I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))

- P(X,Y) — joint distribution of X and Y
- P(X)P(Y) — product of marginals — the distribution if X and Y were independent

KL divergence measures how far the joint distribution is from the "independent" distribution. If X and Y are independent, then P(X,Y) = P(X)P(Y) and the KL divergence is zero.
This immediately gives us two key properties:
- I(X;Y) ≥ 0 always (since KL ≥ 0).
- I(X;Y) = 0 if and only if X and Y are independent.
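As a quick sanity check, the two definitions can be verified to agree numerically. The joint table below is a made-up illustrative distribution (natural logs, so the result is in nats):

```python
import numpy as np

# Illustrative 2x2 joint distribution (assumed example values).
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
px = joint.sum(axis=1)  # marginal of X
py = joint.sum(axis=0)  # marginal of Y

# Definition 1: I(X;Y) = H(X) - H(X|Y)
hx = -np.sum(px * np.log(px))
hx_given_y = -np.sum(joint * np.log(joint / py))  # -sum p(x,y) log p(x|y)
mi_entropy = hx - hx_given_y

# Definition 2: I(X;Y) = KL(P(X,Y) || P(X)P(Y))
mi_kl = np.sum(joint * np.log(joint / np.outer(px, py)))

print(mi_entropy, mi_kl)  # both ≈ 0.1927 nats
```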
Worked Example: Noisy Channel
Let X = a fair coin flip (P(X=H) = P(X=T) = 0.5). Let Y = a noisy copy of X: with probability ε = 0.1, Y is flipped.

So: P(Y=H|X=H) = 0.9, P(Y=T|X=H) = 0.1, and symmetrically.
Compute the joint distribution:
| | Y=H | Y=T |
|---|---|---|
| X=H | 0.45 | 0.05 |
| X=T | 0.05 | 0.45 |
Marginals: P(Y=H) = 0.5, P(Y=T) = 0.5 (Y is also a fair coin).
H(X) = 1 bit (fair coin).
H(X|Y): Given Y=H, P(X=H|Y=H) = 0.9, P(X=T|Y=H) = 0.1. So H(X|Y=H) = -0.9 log2(0.9) - 0.1 log2(0.1) ≈ 0.469 bits. By symmetry, H(X|Y=T) ≈ 0.469 bits.

Here, H(X|Y) = 0.5 · 0.469 + 0.5 · 0.469 = 0.469 bits.

I(X;Y) = H(X) - H(X|Y) = 1 - 0.469 = 0.531 bits.
The noisy channel transmits 0.531 bits of the original 1 bit. With no noise (ε = 0), MI = 1 bit. With 50% noise (ε = 0.5, the output completely random), MI = 0 bits.
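More generally, for a fair input this channel satisfies I(X;Y) = 1 - H2(ε), where H2 is the binary entropy function. A small sketch sweeping ε (the helper name `h2` is just illustrative):

```python
import numpy as np

def h2(p):
    """Binary entropy in bits; 0·log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = (p > 0) & (p < 1)
    pm = p[mask]
    out[mask] = -pm * np.log2(pm) - (1 - pm) * np.log2(1 - pm)
    return out

for eps in [0.0, 0.1, 0.5]:
    print(f"eps={eps}: I(X;Y) = {1 - h2(np.array([eps]))[0]:.3f} bits")
# eps=0.0 → 1.000, eps=0.1 → 0.531, eps=0.5 → 0.000
```

This recovers the three cases above: a noiseless channel carries the full bit, ε = 0.1 carries 0.531 bits, and ε = 0.5 carries nothing.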
The Information Bottleneck
One powerful application is the information bottleneck principle for understanding neural networks:
A neural network encodes input X into a representation Z, which is then used to predict output Y. The goal is to find the encoder that optimizes the tradeoff:

min I(X;Z) - β·I(Z;Y)

- β — tradeoff parameter — how much to weight compression vs prediction
- I(X;Z) — mutual information between input X and representation Z — measures how much of the input is retained
- I(Z;Y) — mutual information between representation Z and output Y — measures how task-relevant the representation is

Compress X into Z as much as possible (minimize I(X;Z)) while preserving what Z knows about Y (maximize I(Z;Y)). Good representations keep only task-relevant information and discard the rest.
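To make the objective concrete, here is a minimal sketch evaluating I(X;Z) - β·I(Z;Y) for a toy discrete problem. The joint p(x,y), the stochastic encoder p(z|x), and β are made-up illustrative values, not learned:

```python
import numpy as np

def mi(joint):
    """Mutual information (nats) from a joint probability table."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask]))

p_xy = np.array([[0.30, 0.10],   # p(x, y), illustrative values
                 [0.10, 0.50]])
p_x = p_xy.sum(axis=1)

enc = np.array([[0.9, 0.1],      # p(z | x): rows = x, cols = z
                [0.2, 0.8]])

p_xz = p_x[:, None] * enc        # p(x, z) = p(x) p(z|x)
p_zy = enc.T @ p_xy              # p(z, y) = sum_x p(z|x) p(x, y)

beta = 2.0
objective = mi(p_xz) - beta * mi(p_zy)
print(f"I(X;Z)={mi(p_xz):.3f}, I(Z;Y)={mi(p_zy):.3f}, objective={objective:.3f}")
```

Note that I(Z;Y) can never exceed I(X;Y): Z is computed from X alone, so it can only lose information about Y (the data processing inequality).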
Why MI Matters for ML
Feature selection: compute I(X_i; Y) for each feature X_i and label Y. Keep features with high MI — they are the most informative for the task.
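A minimal sketch of this ranking, using a plug-in (histogram-based) MI estimate; the two features and the data are synthetic illustrations:

```python
import numpy as np

def mi_plugin(x, y, bins=8):
    """Histogram-based plug-in estimate of I(x; y) in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log(joint[mask] / (px * py)[mask]))

rng = np.random.default_rng(0)
label = rng.integers(0, 2, size=5000).astype(float)
informative = label + 0.3 * rng.standard_normal(5000)  # tracks the label
noise = rng.standard_normal(5000)                      # independent of it

print(f"I(informative; label) ≈ {mi_plugin(informative, label):.3f} nats")
print(f"I(noise; label)       ≈ {mi_plugin(noise, label):.3f} nats")
```

The informative feature scores far higher than the noise feature, so an MI-based filter would keep it. (Plug-in estimates are slightly biased upward, which is why the noise feature's score is near zero rather than exactly zero.)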
Understanding what a model learned: probe a learned representation Z by computing I(Z; property) for various properties. Does the representation encode syntax? Semantics? World knowledge?
Maximum entropy RL: the reward is augmented by α·H(π(·|s)), the entropy of the policy, encouraging exploration. This is equivalent to maximizing MI between actions and outcomes.
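The entropy bonus itself is easy to compute; a sketch for a softmax policy at a single state (the logits, α, and reward are made-up illustrative values):

```python
import numpy as np

# Entropy bonus as used in maximum-entropy RL: augment the reward at a
# state with alpha * H(pi(.|s)). All numbers here are illustrative.
logits = np.array([2.0, 0.5, -1.0])          # action preferences at one state
pi = np.exp(logits) / np.exp(logits).sum()   # softmax policy pi(.|s)
entropy = -np.sum(pi * np.log(pi))           # H(pi(.|s)) in nats

alpha, reward = 0.1, 1.0
augmented = reward + alpha * entropy
print(f"H(pi) = {entropy:.3f} nats, augmented reward = {augmented:.3f}")
```

A near-deterministic policy gets almost no bonus, while a uniform policy over 3 actions would get the maximum bonus of α·log(3) nats.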
In code, mutual information is rarely computed directly (it requires estimating joint distributions). In practice, it is approximated via variational bounds — that is what the VAE's ELBO is doing.
```python
import numpy as np

def mi_discrete(joint_probs):
    """Mutual information I(X;Y) in nats for a discrete joint probability table."""
    px = joint_probs.sum(axis=1)  # marginal of X
    py = joint_probs.sum(axis=0)  # marginal of Y
    mi = 0.0
    for i in range(len(px)):
        for j in range(len(py)):
            if joint_probs[i, j] > 0:
                mi += joint_probs[i, j] * np.log(joint_probs[i, j] / (px[i] * py[j]))
    return mi

# Example: X = coin type, Y = landing result
# joint[0,0] = P(fair, heads), joint[0,1] = P(fair, tails), etc.
joint = np.array([[0.25, 0.25],   # fair coin: 50/50
                  [0.40, 0.10]])  # biased coin: 80/20 heads
print(f"I(coin_type; result) = {mi_discrete(joint):.4f} nats")  # → > 0 (correlated)

# Independent variables: MI should be ≈ 0
joint_indep = np.array([[0.25, 0.25],
                        [0.25, 0.25]])
print(f"I(independent) = {mi_discrete(joint_indep):.4f} nats")  # → 0.0
```