Regularization
Lesson 7 ⏱ 10 min

Label smoothing


Label Smoothing - Preventing Overconfident Predictions

Why hard one-hot labels drive logits toward infinity, the soft-label construction with ε and K, the modified loss function derived step-by-step, and empirical evidence from ImageNet and machine translation.

⏱ ~6 min


Quick refresher

Cross-entropy loss and softmax

For classification, the model outputs a logit vector; softmax converts logits to probabilities. Cross-entropy loss = -log(p_correct_class). To minimize this loss perfectly, the model would need p_correct = 1, which requires the correct class logit to be infinitely large compared to all others.

Example

For a 3-class problem with logits [3, 1, 0], softmax gives [0.844, 0.114, 0.042].

Cross-entropy loss = -log(0.844) ≈ 0.169.

To get loss = 0 exactly, we'd need logit_1 → +∞ while logits 2,3 stay finite — impossible in practice, so gradient descent pushes logits larger and larger.
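
A quick way to reproduce these numbers is with PyTorch (a minimal sketch; the lesson's own code appears later):

import torch
import torch.nn.functional as F

# Logits from the 3-class example above
logits = torch.tensor([3.0, 1.0, 0.0])

probs = F.softmax(logits, dim=-1)
print(probs)                 # tensor([0.8438, 0.1142, 0.0420])

# Cross-entropy on the correct class (index 0)
print(-torch.log(probs[0]))  # tensor(0.1698)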

The Overconfidence Problem

Train a classifier to convergence with standard cross-entropy loss, then inspect its output probabilities on held-out examples. Typically, you'll find predictions like:

  • "99.8% cat, 0.1% dog, 0.1% bird" — for an ambiguous image
  • "99.99% positive sentiment" — for a mildly positive sentence

These probabilities are overconfident. The model assigns near-certainty to its predictions even for borderline cases. This matters in practice: if you need calibrated confidence scores (for thresholding, for downstream decision-making, for flagging uncertain cases), an overconfident model is unreliable.

The root cause is the loss function itself.

Why Hard Labels Drive Logits to Infinity

A logit is the raw, unnormalized score a neural network produces for each class before any probability conversion. For a 3-class problem, the final layer might output [2.1, 0.3, -0.8]; those three numbers are logits. Softmax converts them to probabilities by exponentiating and normalizing: [0.819, 0.135, 0.045]. The larger the logit gap between the correct class and the others, the more confident the final probability.

Standard training targets are one-hot labels: probability 1 for the correct class, 0 for all others. The cross-entropy loss is:

L = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_c

where:

  • K: number of classes
  • y_k: target probability for class k (1 for the correct class, 0 for all others)
  • ŷ_k: predicted probability for class k from softmax
  • log: natural logarithm
  • c: the correct class

To minimize this loss, ŷ_c must approach 1. But softmax can only approach 1 asymptotically, as the correct-class logit grows to infinity relative to the others.

Gradient descent therefore continuously applies pressure to make logits larger and larger throughout training. The model "tries" to achieve the impossible target of probability 1.
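
To see this pressure concretely, here is a toy sketch (not part of the lesson's code) that optimizes a single logit vector directly against a hard one-hot target:

import torch
import torch.nn.functional as F

# Toy setup: treat the logits themselves as the trainable parameters
logits = torch.tensor([[3.0, 1.0, 0.0]], requires_grad=True)
target = torch.tensor([0])  # class 0 is "correct"
opt = torch.optim.SGD([logits], lr=1.0)

for step in range(2000):
    loss = F.cross_entropy(logits, target)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(logits)  # the correct-class logit keeps growing; the gap never stops widening
print(loss)    # the loss approaches 0 but never reaches it

After many steps the correct-class logit is still climbing; only the rate slows as the softmax saturates.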

Label Smoothing: The Fix

Label smoothing replaces the hard one-hot targets with soft targets that spread a small amount of probability uniformly across all classes:

\tilde{y}_k = (1 - \varepsilon) \cdot y_k + \frac{\varepsilon}{K}

where:

  • ỹ_k: soft target for class k
  • y_k: original one-hot target for class k
  • ε: smoothing parameter, typically 0.1
  • K: number of classes

For the correct class: ỹ_c = (1 - ε) + ε/K

For each incorrect class: ỹ_k = ε/K

Example: K=10, ε=0.1, correct class index c=3:

Class         Hard label   Smooth label
0             0.000        0.010
1             0.000        0.010
2             0.000        0.010
3 (correct)   1.000        0.910
4             0.000        0.010
…             0.000        0.010

Total probability: 0.910 + 9 × 0.010 = 1.000 ✓
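
The same construction in a few lines of PyTorch (a sketch using the example's values, ε=0.1 and correct class 3):

import torch

K, eps = 10, 0.1
c = 3  # correct class index

hard = torch.zeros(K)
hard[c] = 1.0

# Soft targets: (1 - eps) * one_hot + eps / K
smooth = (1 - eps) * hard + eps / K
print(smooth)        # 0.010 everywhere except 0.910 at index 3
print(smooth.sum())  # tensor(1.)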

The Modified Loss

The loss with smooth labels:

L_{\text{smooth}} = -\sum_{k=1}^{K} \tilde{y}_k \log \hat{y}_k = -(1-\varepsilon) \log \hat{y}_c - \frac{\varepsilon}{K} \sum_{k=1}^{K} \log \hat{y}_k

where:

  • ỹ_k: smooth target for class k
  • ŷ_k: model's predicted probability for class k

The first term is the familiar cross-entropy on the correct class. The second term is the cross-entropy between a uniform distribution and the model's predictions, scaled by ε; it penalizes concentrating all the probability mass on one class and encourages the model to spread probability.
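
Expanding the soft targets makes the split into these two terms explicit (a short derivation from the definitions above):

L_{\text{smooth}} = -\sum_{k=1}^{K} \left[ (1-\varepsilon)\, y_k + \tfrac{\varepsilon}{K} \right] \log \hat{y}_k = -(1-\varepsilon) \sum_{k=1}^{K} y_k \log \hat{y}_k - \frac{\varepsilon}{K} \sum_{k=1}^{K} \log \hat{y}_k

Since y_k is 1 only for the correct class c and 0 elsewhere, the first sum collapses to log ŷ_c, giving the expression above.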

What Changes at the Optimum

With hard labels, the optimal prediction is ŷ_c = 1. With smooth labels (ε=0.1, K=10), the optimal prediction is:

\hat{y}_c = 1 - \varepsilon + \varepsilon/K = 0.91 \quad \text{and} \quad \hat{y}_k = \varepsilon/K = 0.01 \text{ for } k \neq c

The model can achieve the minimum loss at finite logit values. The corresponding logit gap between the correct and incorrect class is:

\log(0.91) - \log(0.01) = -0.094 - (-4.605) \approx 4.5

This is a concrete finite value — no infinite pressure. Gradient descent finds this minimum and stops pushing logits larger.
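
A quick numerical check (a sketch, with the correct class at index 3 sitting about 4.5 above the other nine logits, which are held at zero):

import torch
import torch.nn.functional as F

logits = torch.zeros(10)
logits[3] = 4.51
probs = F.softmax(logits, dim=-1)
print(probs[3])  # ≈ 0.91, the smoothed optimum for the correct class
print(probs[0])  # ≈ 0.01, the smoothed optimum for each incorrect class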

Empirical Evidence

Label smoothing with ε=0.1 has been shown to improve:

  • ImageNet classification: consistent ~0.3–0.5% top-1 accuracy improvement across architectures
  • Machine translation: BLEU score improvements, particularly at domain boundaries
  • Speech recognition: word error rate reduction
  • Calibration: Expected Calibration Error (ECE) typically decreases by 2–5×

The improvement is most pronounced for tasks where overconfidence causes errors (close decision boundaries, ambiguous examples) and less impactful for easy tasks with clear class boundaries.

Code: Label Smoothing in PyTorch

import torch.nn as nn

# Built-in label smoothing (PyTorch >= 1.10)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Equivalent manual implementation
class LabelSmoothingLoss(nn.Module):
    def __init__(self, num_classes, smoothing=0.1):
        super().__init__()
        self.smoothing = smoothing
        self.num_classes = num_classes

    def forward(self, logits, targets):
        log_probs = nn.functional.log_softmax(logits, dim=-1)
        # Hard-label term: negative log-probability of the correct class
        nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(1)).squeeze(1)
        # Smoothing term: uniform average of -log-probabilities over all classes
        smooth_loss = -log_probs.mean(dim=-1)
        # Average over the batch to match CrossEntropyLoss's default 'mean' reduction
        return ((1 - self.smoothing) * nll + self.smoothing * smooth_loss).mean()

# Usage
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
loss = criterion(logits, targets)  # targets are integer class indices

The PyTorch label_smoothing parameter takes ε directly (0.1 is the standard value). It's a one-line change that consistently improves performance on multi-class classification.
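
As a sanity check, continuing from the block above (and assuming the manual loss returns the batch mean, as written there), the built-in and manual versions should agree up to floating-point error on random inputs:

import torch

torch.manual_seed(0)
logits = torch.randn(8, 10)            # batch of 8, 10 classes (made-up shapes)
targets = torch.randint(0, 10, (8,))

builtin = nn.CrossEntropyLoss(label_smoothing=0.1)
manual = LabelSmoothingLoss(num_classes=10, smoothing=0.1)

print(builtin(logits, targets), manual(logits, targets))  # the two values should match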

Quiz

1 / 3

With K=4 classes and ε=0.1, label smoothing replaces the hard label y=[1,0,0,0] with what soft label ỹ?