The Overconfidence Problem
Train a classifier to convergence with standard cross-entropy loss, then inspect its output probabilities on held-out examples. Typically, you'll find predictions like:
- "99.8% cat, 0.1% dog, 0.1% bird" — for an ambiguous image
- "99.99% positive sentiment" — for a mildly positive sentence
These probabilities are overconfident. The model assigns near-certainty to its predictions even for borderline cases. This matters in practice: if you need calibrated confidence scores (for thresholding, for downstream decision-making, for flagging uncertain cases), an overconfident model is unreliable.
The root cause is the loss function itself.
Why Hard Labels Drive Logits to Infinity
A logit is the raw, unnormalized score a neural network produces for each class before any probability conversion. For a 3-class problem, the final layer might output [2.1, 0.3, -0.8]; those three numbers are logits. Softmax converts them to probabilities by exponentiating and normalizing: roughly [0.82, 0.14, 0.05]. The larger the logit gap between the correct class and the others, the more confident the final probability.
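To make the mapping concrete, here is the same conversion in PyTorch (a minimal sketch; the logit values are just the illustrative ones above):

```python
import torch

logits = torch.tensor([2.1, 0.3, -0.8])   # raw, unnormalized class scores
probs = torch.softmax(logits, dim=-1)     # exponentiate and normalize
print(probs)  # approximately tensor([0.8195, 0.1355, 0.0450]); sums to 1
```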
Standard training targets are one-hot labels: probability 1 for the correct class, 0 for all others. The cross-entropy loss is:

$$\mathcal{L}_{\text{CE}} = -\sum_{k=1}^{K} y_k \log p_k = -\log p_c$$

- $K$: number of classes
- $y_k$: target probability for class k (1 for correct, 0 for others)
- $p_k$: predicted probability for class k from softmax
- $\log$: natural logarithm

where $c$ is the correct class. To minimize this loss, $p_c$ must reach 1. But softmax can only approach 1 asymptotically, as the correct-class logit grows to infinity relative to the others.
Gradient descent therefore continuously applies pressure to make logits larger and larger throughout training. The model "tries" to achieve the impossible target of probability 1.
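A quick numerical illustration of that pressure (a sketch with one correct-class logit swept against nine zero logits): the hard-label loss keeps shrinking as the logit gap grows, so the gradient never fully vanishes.

```python
import torch

K = 10
for gap in [2.0, 5.0, 10.0, 20.0]:
    logits = torch.zeros(K)
    logits[0] = gap  # class 0 is the correct class; all other logits stay at 0
    hard_ce = -torch.log_softmax(logits, dim=-1)[0]
    print(f"logit gap {gap:5.1f} -> hard-label CE {hard_ce.item():.6f}")
# The loss approaches 0 but never reaches it; a larger logit gap always helps a little
```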
Label Smoothing: The Fix
Label smoothing replaces the hard one-hot targets with soft targets that spread a small amount of probability uniformly across all classes:

$$y_k^{\text{smooth}} = (1 - \varepsilon)\, y_k + \frac{\varepsilon}{K}$$

- $y_k^{\text{smooth}}$: soft target for class k
- $y_k$: original one-hot target for class k
- $\varepsilon$: smoothing parameter, typically 0.1
- $K$: number of classes

For the correct class: $y_c^{\text{smooth}} = 1 - \varepsilon + \varepsilon/K$
For each incorrect class: $y_k^{\text{smooth}} = \varepsilon/K$
Example: K=10, ε=0.1, correct class index c=3:
| Class | Hard label | Smooth label |
|---|---|---|
| 0 | 0.000 | 0.010 |
| 1 | 0.000 | 0.010 |
| 2 | 0.000 | 0.010 |
| 3 (correct) | 1.000 | 0.910 |
| 4 | 0.000 | 0.010 |
| … | 0.000 | 0.010 |
Total probability: 0.910 + 9 × 0.010 = 1.000 ✓
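The same construction in code (a small sketch; `smooth_labels` is a hypothetical helper, not a library function):

```python
import torch
import torch.nn.functional as F

def smooth_labels(targets, num_classes, eps=0.1):
    # Convert integer class indices into smoothed target distributions
    one_hot = F.one_hot(targets, num_classes).float()
    return (1 - eps) * one_hot + eps / num_classes

print(smooth_labels(torch.tensor([3]), num_classes=10))
# each row: 0.01 everywhere except 0.91 at the correct index (3)
```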
The Modified Loss
The loss with smooth labels:

$$\mathcal{L}_{\text{LS}} = -\sum_{k=1}^{K} y_k^{\text{smooth}} \log p_k = -(1 - \varepsilon)\log p_c - \frac{\varepsilon}{K}\sum_{k=1}^{K} \log p_k$$

- $y_k^{\text{smooth}}$: smooth target for class k
- $p_k$: model's predicted probability for class k
The first term is the familiar cross-entropy on the correct class. The second term is the cross-entropy between a uniform distribution and the model's prediction; it encourages the model to spread probability across all classes rather than concentrate it.
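A quick numerical check of that decomposition (a sketch with arbitrary logits for K = 4 classes): the two-term form matches the plain cross-entropy against the smoothed target vector.

```python
import torch
import torch.nn.functional as F

eps, K, c = 0.1, 4, 2
logits = torch.tensor([1.0, -0.5, 2.0, 0.3])  # arbitrary example logits
log_probs = F.log_softmax(logits, dim=-1)

# Cross-entropy against the full smoothed target vector
target = torch.full((K,), eps / K)
target[c] += 1 - eps
full = -(target * log_probs).sum()

# Two-term decomposition: correct-class term plus uniform term
decomposed = -(1 - eps) * log_probs[c] - (eps / K) * log_probs.sum()

print(torch.allclose(full, decomposed))  # True
```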
What Changes at the Optimum
With hard labels, the optimal prediction is $p_c = 1$, which softmax can only reach in the limit of an infinite logit gap. With smooth labels (ε=0.1, K=10), the optimal prediction is:

$$p_c^* = 1 - \varepsilon + \frac{\varepsilon}{K} = 0.91, \qquad p_{k \neq c}^* = \frac{\varepsilon}{K} = 0.01$$

The model can achieve the minimum loss at finite logit values. The corresponding logit gap between the correct and incorrect class is:

$$z_c - z_k = \log\frac{p_c^*}{p_k^*} = \log\frac{0.91}{0.01} = \log 91 \approx 4.5$$

This is a concrete, finite value: there is no infinite pressure. Gradient descent finds this minimum and stops pushing the logits larger.
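To see the finite minimum numerically (a sketch with the correct-class logit swept against nine zero logits), the smoothed loss bottoms out near a gap of log 91 ≈ 4.5 and then rises again:

```python
import torch

K, eps = 10, 0.1

def smoothed_ce(gap):
    logits = torch.zeros(K)
    logits[0] = gap  # class 0 is the correct class; all other logits stay at 0
    log_probs = torch.log_softmax(logits, dim=-1)
    target = torch.full((K,), eps / K)
    target[0] += 1 - eps
    return -(target * log_probs).sum()

for gap in [2.0, 4.0, 4.51, 6.0, 10.0]:
    print(f"gap {gap:5.2f} -> smoothed CE {smoothed_ce(gap).item():.4f}")
# The minimum sits near gap = log(0.91 / 0.01) ≈ 4.51; pushing the gap further raises the loss
```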
Empirical Evidence
Label smoothing with ε=0.1 has been shown to improve:
- ImageNet classification: consistent ~0.3–0.5% top-1 accuracy improvement across architectures
- Machine translation: BLEU score improvements, particularly at domain boundaries
- Speech recognition: word error rate reduction
- Calibration: Expected Calibration Error (ECE) typically decreases by 2–5×
The improvement is most pronounced for tasks where overconfidence causes errors (close decision boundaries, ambiguous examples) and less impactful for easy tasks with clear class boundaries.
Code: Label Smoothing in PyTorch
```python
import torch.nn as nn

# Built-in label smoothing (PyTorch >= 1.10)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Equivalent manual implementation
class LabelSmoothingLoss(nn.Module):
    def __init__(self, num_classes, smoothing=0.1):
        super().__init__()
        self.smoothing = smoothing
        self.num_classes = num_classes

    def forward(self, logits, targets):
        log_probs = nn.functional.log_softmax(logits, dim=-1)
        # Hard label contribution: -log p_c for each example
        nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(1)).squeeze(1)
        # Smoothing contribution: -(1/K) * sum_k log p_k for each example
        smooth_loss = -log_probs.mean(dim=-1)
        # Mean over the batch to match nn.CrossEntropyLoss's default reduction
        return ((1 - self.smoothing) * nll + self.smoothing * smooth_loss).mean()

# Usage
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
loss = criterion(logits, targets)  # targets are integer class indices
```
The PyTorch label_smoothing parameter takes ε directly (0.1 is the standard value). It's a one-line change that typically improves both accuracy and calibration on multi-class classification.
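As a sanity check (a sketch assuming the imports and the LabelSmoothingLoss class from the block above), the manual implementation should agree with the built-in criterion to numerical precision:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(8, 10)            # batch of 8 examples, 10 classes
targets = torch.randint(0, 10, (8,))   # integer class indices

builtin = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)
manual = LabelSmoothingLoss(num_classes=10, smoothing=0.1)(logits, targets)
print(torch.allclose(builtin, manual))  # expected: True
```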