Beyond Two Classes
Binary classification is useful, but the real world has many more categories. Handwritten digit recognition has 10 classes (0–9). ImageNet has 1,000 categories. Language models predict which of 50,000+ vocabulary tokens comes next.
For more than two classes, the single-output sigmoid trick doesn't generalize. We need K separate probability scores, one per class, that form a valid probability distribution: each score in (0, 1) and all scores summing to exactly 1.
Softmax is the function that does this.
The Model: Logits and Softmax
The network produces K real-valued scores, one per class. These are called logits:
- z_k - logit for class k - raw score, can be any real number
- K - total number of classes
If that word feels new, do not let it throw you. A logit is just the raw score before we turn it into a probability, the same role that the pre-sigmoid score z played in logistic regression.
Softmax converts these logits into probabilities:

\hat{y}_k = e^{z_k} / \sum_{j=1}^{K} e^{z_j}

- \hat{y}_k - predicted probability for class k - always in (0, 1)
- e^{z_k} - exponentiated logit for class k
- \sum_{j=1}^{K} e^{z_j} - normalization constant - sum of all exponentiated logits
Exponentiate each logit, then divide by the sum of all exponentiated logits. The denominator normalizes so the outputs sum to 1.
Three Key Properties
1. All outputs are positive. e^{z_k} > 0 for any real z_k, so \hat{y}_k > 0 for every class.
2. All outputs sum to 1. The denominator is the sum of all numerators: \sum_k \hat{y}_k = (\sum_k e^{z_k}) / (\sum_j e^{z_j}) = 1. A valid probability distribution.
3. Ordering is preserved. If z_1 > z_2 then e^{z_1} > e^{z_2} so \hat{y}_1 > \hat{y}_2. The class with the highest logit gets the highest probability. The exponentiation actually sharpens the differences — a slightly higher logit gets disproportionately more probability.
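All three properties are easy to verify numerically. A minimal sketch (the softmax helper mirrors the stable implementation given in the code section below):

import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))  # stable: shift logits before exponentiating
    return e / e.sum()

z = np.array([3.0, -1.0, 0.5, 2.9])  # arbitrary real-valued logits
p = softmax(z)

print(np.all(p > 0))                                 # True: property 1, all positive
print(np.isclose(p.sum(), 1.0))                      # True: property 2, sums to 1
print(np.array_equal(np.argsort(z), np.argsort(p)))  # True: property 3, ordering preserved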
Worked Example
Here, K = 3 classes: cat, dog, bird, with logits z = (2.0, 1.0, 0.1).
- Exponentiate: e^{2.0} ≈ 7.389, e^{1.0} ≈ 2.718, e^{0.1} ≈ 1.105
- Sum: 7.389 + 2.718 + 1.105 = 11.212
- Probabilities: 7.389 / 11.212 ≈ 0.659, 2.718 / 11.212 ≈ 0.242, 1.105 / 11.212 ≈ 0.099
Check: 0.659 + 0.242 + 0.099 = 1.000 ✓
The winning class (cat, 66%) doesn't dominate completely; the runner-up (dog, 24%) still gets meaningful probability. This is why it's called softmax: it's a smooth, differentiable version of the argmax operation. The "sharpness" is controlled by the scale of the logits.
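A minimal sketch of that sharpening effect: multiplying all logits by a constant is the flip side of the temperature scaling demonstrated in the code section below.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # cat, dog, bird logits from the worked example
for scale in [1, 2, 10]:
    print(scale, np.round(softmax(scale * z), 3))
# 1  [0.659 0.242 0.099]  original logits
# 2  [0.864 0.117 0.019]  doubled logits: the winner pulls away
# 10 [1.    0.    0.   ]  10x logits: effectively a hard argmax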
One-Hot Labels and Multi-Class Cross-Entropy
In multi-class problems, the true label is a one-hot vector: all zeros except a single 1 at the true class. For 5 classes, class 2 is encoded as y = (0, 1, 0, 0, 0).
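Building one-hot vectors in code is a one-liner. A minimal sketch with NumPy (one_hot is a hypothetical helper; the 1-indexed class convention matches the encoding above):

import numpy as np

def one_hot(class_index, num_classes):
    # 1-indexed convention, assumed here to match the text above
    y = np.zeros(num_classes)
    y[class_index - 1] = 1.0
    return y

print(one_hot(2, 5))  # [0. 1. 0. 0. 0.]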
The multi-class cross-entropy loss is:

L = -\sum_{k=1}^{K} y_k \log(\hat{y}_k)

- y_k - true label for class k - 1 if this is the correct class, 0 otherwise
- \hat{y}_k - predicted probability for class k
- K - number of classes
Since y is one-hot, every term is zero except the true class c:

L = -\log(\hat{y}_c)

- \hat{y}_c - predicted probability for the true class - what we want to be close to 1
The loss is simply the negative log probability assigned to the true class. We want \hat{y}_c as large as possible (ideally 1.0), and -\log(1.0) = 0. If \hat{y}_c = 0.01 (nearly all probability on wrong classes): -\log(0.01) ≈ 4.6. Large penalty. Regardless of how many classes there are, the loss cares only about how much probability the model assigned to the right answer.
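A minimal sketch of the loss computation, reusing the cat/dog/bird probabilities from the worked example:

import numpy as np

probs = np.array([0.659, 0.242, 0.099])  # softmax output for cat, dog, bird
y_true = np.array([1.0, 0.0, 0.0])       # one-hot label: the true class is cat

loss_full = -np.sum(y_true * np.log(probs))  # full sum over all classes
loss_simple = -np.log(probs[0])              # shortcut: only the true class matters

print(loss_full, loss_simple)  # both ≈ 0.417
print(-np.log(0.01))           # ≈ 4.61: the "large penalty" case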
The Gradient (Same Clean Formula)
The gradient of multi-class cross-entropy with respect to the logit vector is:

\partial L / \partial z = \hat{y} - y

- \hat{y} - predicted probability vector from softmax
- y - true label one-hot vector
Vector subtraction: predicted probabilities minus true one-hot label. For the correct class: \hat{y}_c - 1 < 0 (negative, push that logit up). For all other classes: \hat{y}_k - 0 = \hat{y}_k > 0 (positive, push those logits down). Exactly what you want: increase the score for the right answer, decrease scores for all wrong answers.
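A finite-difference check makes this concrete. The sketch below compares the analytic gradient \hat{y} - y against a numerical gradient of the loss (eps is an arbitrary small step):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])  # logits for cat, dog, bird
y = np.array([1.0, 0.0, 0.0])  # true class: cat

analytic = softmax(z) - y      # the clean formula

# Numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    dz = np.zeros_like(z)
    dz[k] = eps
    numeric[k] = (loss(z + dz, y) - loss(z - dz, y)) / (2 * eps)

print(np.round(analytic, 4))  # [-0.341   0.2424  0.0986]
print(np.round(numeric, 4))   # same values: the formula checks out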
Sigmoid Is a Special Case
For K = 2 classes with logits z_1 and z_2, softmax gives \hat{y}_1 = e^{z_1} / (e^{z_1} + e^{z_2}). Divide top and bottom by e^{z_1}:

\hat{y}_1 = 1 / (1 + e^{-(z_1 - z_2)}) = \sigma(z_1 - z_2)

Binary logistic regression is exactly the special case of softmax with K = 2, where the single logit z is the difference z_1 - z_2 of the two class logits.
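A minimal numerical check of the equivalence:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

z1, z2 = 1.5, -0.5
print(softmax(np.array([z1, z2]))[0])  # 0.8808
print(sigmoid(z1 - z2))                # 0.8808, identical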
Code: Softmax in Python
import numpy as np

# Naive softmax: np.exp(z) / np.sum(np.exp(z))
# (can overflow for very large logits; see the stable version below)
def softmax(z):
    # Subtract the max for numerical stability (same result, no overflow)
    z = np.array(z, dtype=float)
    z -= np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

# Three-class example: logits for cat, dog, bird
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print("Logits:  ", logits)
print("Softmax: ", np.round(probs, 3))  # [0.659 0.242 0.099]
print("Sum:     ", probs.sum())         # 1.0 ✓

# Temperature scaling: T < 1 sharpens, T > 1 softens
print("\nTemperature scaling:")
for T in [0.5, 1.0, 2.0, 10.0]:
    p = softmax(logits / T)
    print(f"  T={T:4.1f}: {np.round(p, 3)} (winner gets {p.max():.1%})")

# T= 0.5: [0.864 0.117 0.019] ← sharper, winner dominates
# T= 1.0: [0.659 0.242 0.099] ← standard softmax
# T= 2.0: [0.502 0.304 0.194] ← softer, probabilities spread out
# T=10.0: [0.366 0.331 0.303] ← nearly uniform, maximum uncertainty