

Mutual Information: How Much Does X Tell You About Y?

Mutual information as reduction in uncertainty. Symmetric definition. Connection to KL divergence and entropy. Noisy channel example. Why MI matters for feature selection and understanding representations.

Quick refresher

Conditional entropy

H(X|Y) = -Σ P(x,y)·log P(x|y) is the expected entropy of X after observing Y. If Y perfectly predicts X, then H(X|Y) = 0. If Y is irrelevant, H(X|Y) = H(X).

Example

X = tomorrow's weather, Y = today's weather.

H(X) = 1.5 bits (uncertain).

H(X|Y) = 0.8 bits (still uncertain but less so).

The reduction 1.5 - 0.8 = 0.7 bits is the mutual information.
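
To make the refresher concrete, here is a minimal NumPy sketch that computes H(X) and H(X|Y) from a joint table. The 3×3 weather joint below is made up for illustration; it is not the distribution behind the 1.5- and 0.8-bit figures above.

import numpy as np

# Hypothetical joint P(today, tomorrow) over three weather states
# (sun, cloud, rain). Rows = today (Y), columns = tomorrow (X).
joint = np.array([[0.30, 0.06, 0.04],   # today = sun
                  [0.05, 0.15, 0.05],   # today = cloud
                  [0.03, 0.07, 0.25]])  # today = rain

p_y = joint.sum(axis=1)                        # marginal P(today)
p_x = joint.sum(axis=0)                        # marginal P(tomorrow)

H_X = -np.sum(p_x * np.log2(p_x))              # uncertainty before observing today

# H(X|Y) = -sum_{x,y} P(x,y) log2 P(x|y), with P(x|y) = P(x,y) / P(y)
cond = joint / p_y[:, None]                    # each row is P(tomorrow | today)
H_X_given_Y = -np.sum(joint * np.log2(cond))   # uncertainty after observing today

print(f"H(X)   = {H_X:.3f} bits")
print(f"H(X|Y) = {H_X_given_Y:.3f} bits")
print(f"drop   = {H_X - H_X_given_Y:.3f} bits")  # this reduction is the mutual information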

The Question

You have two random variables X and Y. How much does knowing the value of Y tell you about X?

This is the question mutual information answers. It is the most general measure of statistical dependence — it captures not just linear correlation, but any kind of relationship.

Mutual information underpins feature selection (which inputs actually tell you something about the output?), the information bottleneck theory of what deep networks learn to compress, and diagnostic tools for analyzing how much two layers of a model share. It is the right tool whenever you need to measure statistical dependence without assuming a linear relationship.

Definition via Entropy

The conditional entropy measures how uncertain X remains after you learn Y:

H(X \mid Y) = -\sum_{x,y} P(x,y) \log P(x \mid y)

H(X|Y): conditional entropy — expected uncertainty in X given Y
P(x,y): joint probability of X=x and Y=y
P(x|y): conditional probability of X=x given Y=y

Mutual information is the reduction in uncertainty about X caused by observing Y:

I(X;Y) = H(X) - H(X \mid Y)

I(X;Y): mutual information between X and Y — measured in bits or nats
H(X): marginal entropy of X — uncertainty before observing Y
H(X|Y): conditional entropy — uncertainty after observing Y

By symmetry of the joint distribution, this is also equal to:

I(X;Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X,Y)

H(Y): entropy of Y
H(Y|X): conditional entropy of Y given X

where H(X,Y) = -\sum_{x,y} P(x,y)\log P(x,y) is the joint entropy.

Mutual information is symmetric: I(X;Y) = I(Y;X). "How much does X tell you about Y?" equals "How much does Y tell you about X?" — even though the individual conditional entropies H(X|Y) and H(Y|X) need not be equal.

I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))

There is a beautiful alternative definition connecting mutual information to KL divergence:

I(X;Y) = D_{\text{KL}}\left(P(X,Y) \,\big\|\, P(X)\,P(Y)\right) = \sum_{x,y} P(x,y)\log\frac{P(x,y)}{P(x)\,P(y)}

P(X,Y): joint distribution of X and Y
P(X)·P(Y): product of marginals — the distribution if X and Y were independent

KL divergence measures how far the joint distribution is from the "independent" distribution. If X and Y are independent, then P(X,Y) = P(X)P(Y) and the KL divergence is zero.

This immediately gives us two key properties:

  • I(X;Y) ≥ 0 always (since KL ≥ 0).
  • I(X;Y) = 0 if and only if X and Y are independent.
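
As a quick numerical sanity check (with a made-up 2×2 joint distribution), the sketch below confirms that the entropy form H(X) + H(Y) - H(X,Y) and the KL form give the same number, and that writing MI this way makes the symmetry I(X;Y) = I(Y;X) obvious.

import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a probability array (joint or marginal)."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Hypothetical joint P(X, Y) with some dependence between X and Y
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])
px = joint.sum(axis=1)   # marginal of X
py = joint.sum(axis=0)   # marginal of Y

# Entropy form: I(X;Y) = H(X) + H(Y) - H(X,Y)  (manifestly symmetric in X and Y)
mi_entropy = entropy_bits(px) + entropy_bits(py) - entropy_bits(joint)

# KL form: I(X;Y) = sum_{x,y} P(x,y) * log2[ P(x,y) / (P(x) P(y)) ]
mi_kl = np.sum(joint * np.log2(joint / np.outer(px, py)))

print(f"entropy form: {mi_entropy:.6f} bits")
print(f"KL form:      {mi_kl:.6f} bits")   # same value up to floating-point error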

Worked Example: Noisy Channel

Let X = a fair coin flip (P(H) = P(T) = 0.5). Let Y = a noisy copy of X: with probability ε = 0.1, Y is flipped.

So: P(Y=H | X=H) = 0.9, P(Y=T | X=H) = 0.1, and symmetrically.

Compute the joint distribution:

        Y=H     Y=T
X=H     0.45    0.05
X=T     0.05    0.45

Marginals: P(Y=H) = 0.5, P(Y=T) = 0.5 (Y is also a fair coin).

H(X) = 1 bit (fair coin).

H(X|Y): Given Y=H, P(X=H | Y=H) = 0.9 and P(X=T | Y=H) = 0.1. So H(X | Y=H) = -(0.9·log₂ 0.9 + 0.1·log₂ 0.1) ≈ 0.469 bits. By symmetry, H(X | Y=T) ≈ 0.469 bits.

Here, H(X|Y) = 0.5·0.469 + 0.5·0.469 = 0.469 bits.

I(X;Y) = H(X) - H(X|Y) = 1 - 0.469 = 0.531 bits.

The noisy channel transmits 0.531 of the original 1 bit. With no noise (ε=0), MI = 1 bit. With 50% noise (ε=0.5, completely random), MI = 0 bits.
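
The worked example generalizes to any flip probability ε (a binary symmetric channel). A short sketch that sweeps ε reproduces the numbers above: 1 bit at ε = 0, 0.531 bits at ε = 0.1, and 0 bits at ε = 0.5.

import numpy as np

def bsc_mi_bits(eps):
    """I(X;Y) in bits for a fair coin X passed through a channel
    that flips the bit with probability eps."""
    joint = np.array([[(1 - eps) / 2, eps / 2],
                      [eps / 2, (1 - eps) / 2]])
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0                     # skip zero entries (0 * log 0 = 0)
    return np.sum(joint[mask] * np.log2(joint[mask] / np.outer(px, py)[mask]))

for eps in [0.0, 0.1, 0.25, 0.5]:
    print(f"eps = {eps:.2f}  ->  I(X;Y) = {bsc_mi_bits(eps):.3f} bits")
# eps = 0.00 -> 1.000, eps = 0.10 -> 0.531, eps = 0.25 -> 0.189, eps = 0.50 -> 0.000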

The Information Bottleneck

One powerful application is the information bottleneck principle for understanding neural networks:

A neural network encodes input X into a representation Z, which is then used to predict output Y. The goal is to find the optimal tradeoff:

\min_{Z}\; I(X;Z) - \beta\, I(Z;Y)

β: tradeoff parameter — how much to weight compression vs prediction
I(X;Z): mutual information between input X and representation Z — measures how much of the input is retained
I(Z;Y): mutual information between representation Z and output Y — measures how task-relevant the representation is

Compress X into Z as much as possible (minimize I(X;Z)) while preserving what Z knows about Y (maximize I(Z;Y)). Good representations keep only task-relevant information and discard the rest.
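
Here is a toy illustration of that tradeoff (the variables are made up for the example): X is uniform over {0, 1, 2, 3}, the label Y is the parity of X, and we compare the representation Z = X (keep everything) with Z = parity(X) (compress to one bit). The compressed Z halves I(X;Z) while giving up nothing in I(Z;Y).

import numpy as np

def mi_bits(joint):
    """Mutual information (bits) between the row and column variables of a joint table."""
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    return np.sum(joint[mask] * np.log2(joint[mask] / np.outer(px, py)[mask]))

# X uniform on {0,1,2,3}; label Y = X mod 2 (parity)

# Representation A: Z = X (no compression)
joint_XZ_full = np.diag([0.25] * 4)            # P(X=i, Z=i) = 0.25
joint_ZY_full = np.array([[0.25, 0.00],        # Z=0 -> Y=0
                          [0.00, 0.25],        # Z=1 -> Y=1
                          [0.25, 0.00],        # Z=2 -> Y=0
                          [0.00, 0.25]])       # Z=3 -> Y=1

# Representation B: Z = X mod 2 (compressed to one bit)
joint_XZ_comp = np.array([[0.25, 0.00],
                          [0.00, 0.25],
                          [0.25, 0.00],
                          [0.00, 0.25]])       # P(X=i, Z = i mod 2)
joint_ZY_comp = np.array([[0.50, 0.00],
                          [0.00, 0.50]])       # Z equals Y exactly

print(f"Z = X      : I(X;Z) = {mi_bits(joint_XZ_full):.1f} bits, I(Z;Y) = {mi_bits(joint_ZY_full):.1f} bits")
print(f"Z = parity : I(X;Z) = {mi_bits(joint_XZ_comp):.1f} bits, I(Z;Y) = {mi_bits(joint_ZY_comp):.1f} bits")
# Dropping from 2 bits of I(X;Z) to 1 bit costs nothing in I(Z;Y): that is the bottleneck ideal.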

Why MI Matters for ML

Feature selection: compute I(feature; label) for each feature. Keep features with high MI — they are the most informative for the task.
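
In practice this scoring is usually done with a library estimator rather than by hand; for example, scikit-learn's mutual_info_classif estimates I(feature; label) per column. The toy data below (one informative binary feature, one pure-noise feature, 10% label noise) is made up for illustration.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 2000
informative = rng.integers(0, 2, size=n)                 # binary feature that drives the label
noise = rng.normal(size=n)                               # continuous feature unrelated to the label
y = (informative ^ (rng.random(n) < 0.1)).astype(int)    # label = informative feature with 10% flips

X = np.column_stack([informative, noise])
scores = mutual_info_classif(X, y, discrete_features=[True, False], random_state=0)
print(scores)   # expected: first score well above zero, second close to zero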

Understanding what a model learned: probe a learned representation Z by computing I(Z; Y_property) for various properties. Does the representation encode syntax? Semantics? World knowledge?

Maximum entropy RL: the reward is augmented by α·H(π) (entropy of the policy), encouraging exploration. This is closely related to maximizing MI between actions and outcomes.

In code, mutual information is rarely computed directly (it requires estimating joint distributions). In practice, it is approximated via variational bounds — that is what the VAE's ELBO is doing.

import numpy as np

def mi_discrete(joint_probs):
    """Mutual information I(X;Y) for discrete distributions from joint probability table."""
    px = joint_probs.sum(axis=1)   # marginal of X
    py = joint_probs.sum(axis=0)   # marginal of Y
    mi = 0.0
    for i in range(len(px)):
        for j in range(len(py)):
            if joint_probs[i, j] > 0:
                mi += joint_probs[i, j] * np.log(joint_probs[i, j] / (px[i] * py[j]))
    return mi

# Example: X = coin flip quality, Y = landing result (biased coin)
# joint[0,0] = P(fair, heads), joint[0,1] = P(fair, tails), etc.
joint = np.array([[0.25, 0.25],   # fair coin: 50/50
                  [0.40, 0.10]])  # biased coin: 80/20 heads
print(f"I(coin_type; result) = {mi_discrete(joint):.4f} nats")  # → > 0 (correlated)

# Independent variables: MI should be ≈ 0
joint_indep = np.array([[0.25, 0.25],
                        [0.25, 0.25]])
print(f"I(independent) = {mi_discrete(joint_indep):.4f} nats")  # → 0.0

Quiz

1 / 3

X is a fair coin flip. Y = X (a perfect copy of X). What is I(X;Y)?