Beyond Two Classes
Binary classification is useful, but the real world has many more categories. Handwritten digit recognition has 10 classes (0–9). ImageNet has 1,000 categories. Language models predict which of 50,000+ vocabulary tokens comes next.
For more than two classes, the single-output sigmoid trick doesn't generalize. We need K separate probability scores, one per class, that form a valid probability distribution: each score in (0, 1) and all scores summing to exactly 1.
Softmax is the function that does this.
The Model: Logits and Softmax
The network produces K real-valued scores, one per class. These are called logits:
- z_k - logit for class k - raw score, can be any real number
- K - total number of classes
If that word feels new, do not let it throw you. A logit is just the raw score before we turn it into a probability, the same role that the pre-sigmoid score z played in logistic regression.
Softmax converts these logits into probabilities:

\hat{y}_k = e^{z_k} / \sum_{j=1}^{K} e^{z_j}

- \hat{y}_k - predicted probability for class k - always in (0, 1)
- e^{z_k} - exponentiated logit for class k
- \sum_{j=1}^{K} e^{z_j} - normalization constant - sum of all exponentiated logits
Exponentiate each logit, then divide by the sum of all exponentiated logits. The denominator normalizes so the outputs sum to 1.
Three Key Properties
1. All outputs are positive. e^{z_k} > 0 for any real z_k, so \hat{y}_k > 0 for every class.
2. All outputs sum to 1. The denominator is the sum of all numerators: \sum_k \hat{y}_k = (\sum_k e^{z_k}) / (\sum_j e^{z_j}) = 1. A valid probability distribution.
3. Ordering is preserved. If z_1 > z_2 then e^{z_1} > e^{z_2} so \hat{y}_1 > \hat{y}_2. The class with the highest logit gets the highest probability. The exponentiation actually sharpens the differences — a slightly higher logit gets disproportionately more probability.
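All three properties are easy to verify numerically. A minimal sketch (the softmax helper mirrors the stable implementation given in the code section below):

import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - np.max(z))  # stable: shift logits before exponentiating
    return e / e.sum()

z = np.array([3.0, -1.0, 0.5, 2.9])  # arbitrary real-valued logits
p = softmax(z)

print(np.all(p > 0))                                 # True: property 1, all positive
print(np.isclose(p.sum(), 1.0))                      # True: property 2, sums to 1
print(np.array_equal(np.argsort(z), np.argsort(p)))  # True: property 3, ordering preserved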
Worked Example
Here, K = 3 classes: cat, dog, bird, with logits z = (2.0, 1.0, 0.1).
- Exponentiate: e^{2.0} ≈ 7.389, e^{1.0} ≈ 2.718, e^{0.1} ≈ 1.105
- Sum: 7.389 + 2.718 + 1.105 = 11.212
- Probabilities: 7.389 / 11.212 ≈ 0.659, 2.718 / 11.212 ≈ 0.242, 1.105 / 11.212 ≈ 0.099
Check: 0.659 + 0.242 + 0.099 = 1.000 ✓
The winning class (cat, 66%) doesn't dominate completely; the runner-up (dog, 24%) still gets meaningful probability. This is why it's called softmax: it's a smooth, differentiable version of the argmax operation. The "sharpness" is controlled by the scale of the logits.
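A minimal sketch of that sharpening effect: multiplying all logits by a constant is the flip side of the temperature scaling demonstrated in the code section below.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])  # cat, dog, bird logits from the worked example
for scale in [1, 2, 10]:
    print(scale, np.round(softmax(scale * z), 3))
# 1  [0.659 0.242 0.099]  original logits
# 2  [0.864 0.117 0.019]  doubled logits: the winner pulls away
# 10 [1.    0.    0.   ]  10x logits: effectively a hard argmax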
One-Hot Labels and Multi-Class Cross-Entropy
In multi-class problems, the true label is a one-hot vector: all zeros except a single 1 at the true class. For 5 classes, class 2 is encoded as y = (0, 1, 0, 0, 0).
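Building one-hot vectors in code is a one-liner. A minimal sketch with NumPy (one_hot is a hypothetical helper; the 1-indexed class convention matches the encoding above):

import numpy as np

def one_hot(class_index, num_classes):
    # 1-indexed convention, assumed here to match the text above
    y = np.zeros(num_classes)
    y[class_index - 1] = 1.0
    return y

print(one_hot(2, 5))  # [0. 1. 0. 0. 0.]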
The multi-class cross-entropy loss is:

L = -\sum_{k=1}^{K} y_k \log(\hat{y}_k)

- y_k - true label for class k - 1 if this is the correct class, 0 otherwise
- \hat{y}_k - predicted probability for class k
- K - number of classes
Since y is one-hot, every term is zero except the true class c:

L = -\log(\hat{y}_c)

- \hat{y}_c - predicted probability for the true class - what we want to be close to 1
The loss is simply the negative log probability assigned to the true class. We want \hat{y}_c as large as possible (ideally 1.0), and -\log(1.0) = 0. If \hat{y}_c = 0.01 (nearly all probability on wrong classes): -\log(0.01) ≈ 4.6. Large penalty. Regardless of how many classes there are, the loss cares only about how much probability the model assigned to the right answer.
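A minimal sketch of the loss computation, reusing the cat/dog/bird probabilities from the worked example:

import numpy as np

probs = np.array([0.659, 0.242, 0.099])  # softmax output for cat, dog, bird
y_true = np.array([1.0, 0.0, 0.0])       # one-hot label: the true class is cat

loss_full = -np.sum(y_true * np.log(probs))  # full sum over all classes
loss_simple = -np.log(probs[0])              # shortcut: only the true class matters

print(loss_full, loss_simple)  # both ≈ 0.417
print(-np.log(0.01))           # ≈ 4.61: the "large penalty" case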
The Gradient (Same Clean Formula)
The gradient of multi-class cross-entropy with respect to the logit vector is:

\partial L / \partial z = \hat{y} - y

- \hat{y} - predicted probability vector from softmax
- y - true label one-hot vector
Vector subtraction: predicted probabilities minus true one-hot label. For the correct class: \hat{y}_c - 1 < 0 (negative, push that logit up). For all other classes: \hat{y}_k - 0 = \hat{y}_k > 0 (positive, push those logits down). Exactly what you want: increase the score for the right answer, decrease scores for all wrong answers.
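A finite-difference check makes this concrete. The sketch below compares the analytic gradient \hat{y} - y against a numerical gradient of the loss (eps is an arbitrary small step):

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])  # logits for cat, dog, bird
y = np.array([1.0, 0.0, 0.0])  # true class: cat

analytic = softmax(z) - y      # the clean formula

# Numerical gradient via central differences
eps = 1e-6
numeric = np.zeros_like(z)
for k in range(len(z)):
    dz = np.zeros_like(z)
    dz[k] = eps
    numeric[k] = (loss(z + dz, y) - loss(z - dz, y)) / (2 * eps)

print(np.round(analytic, 4))  # [-0.341   0.2424  0.0986]
print(np.round(numeric, 4))   # same values: the formula checks out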
Sigmoid Is a Special Case
For K = 2 classes with logits z_1 and z_2, softmax gives \hat{y}_1 = e^{z_1} / (e^{z_1} + e^{z_2}). Divide top and bottom by e^{z_1}:

\hat{y}_1 = 1 / (1 + e^{-(z_1 - z_2)}) = \sigma(z_1 - z_2)

Binary logistic regression is exactly the special case of softmax with K = 2, where the single logit z is the difference z_1 - z_2 of the two class logits.
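A minimal numerical check of the equivalence:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

z1, z2 = 1.5, -0.5
print(softmax(np.array([z1, z2]))[0])  # 0.8808
print(sigmoid(z1 - z2))                # 0.8808, identical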
Code: Softmax in Python
import numpy as np

# Naive softmax: np.exp(z) / np.sum(np.exp(z))
# (can overflow for very large logits; see the stable version below)
def softmax(z):
    # Subtract the max for numerical stability (same result, no overflow)
    z = np.array(z, dtype=float)
    z -= np.max(z)
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

# Three-class example: logits for cat, dog, bird
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print("Logits:  ", logits)
print("Softmax: ", np.round(probs, 3))  # [0.659 0.242 0.099]
print("Sum:     ", probs.sum())         # 1.0 ✓

# Temperature scaling: T < 1 sharpens, T > 1 softens
print("\nTemperature scaling:")
for T in [0.5, 1.0, 2.0, 10.0]:
    p = softmax(logits / T)
    print(f"  T={T:4.1f}: {np.round(p, 3)} (winner gets {p.max():.1%})")

# T= 0.5: [0.864 0.117 0.019] ← sharper, winner dominates
# T= 1.0: [0.659 0.242 0.099] ← standard softmax
# T= 2.0: [0.502 0.304 0.194] ← softer, probabilities spread out
# T=10.0: [0.366 0.331 0.303] ← nearly uniform, maximum uncertainty