Regularization
Lesson 7 ⏱ 10 min

Label smoothing


Label Smoothing - Preventing Overconfident Predictions

Why hard one-hot labels drive logits toward infinity, the soft-label construction with ε and K, the modified loss function derived step-by-step, and empirical evidence from ImageNet and machine translation.

⏱ ~6 min


Quick refresher

Cross-entropy loss and softmax

For classification, the model outputs a logit vector; softmax converts logits to probabilities. Cross-entropy loss = -log(p_correct_class). To minimize this loss perfectly, the model would need p_correct = 1, which requires the correct class logit to be infinitely large compared to all others.

Example

For a 3-class problem with logits [3, 1, 0], softmax gives [0.844, 0.114, 0.042].

Cross-entropy loss = -log(0.844) ≈ 0.169.

To get loss = 0 exactly, we'd need logit_1 → +∞ while logits 2,3 stay finite — impossible in practice, so gradient descent pushes logits larger and larger.
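
A quick way to reproduce these numbers is with PyTorch (a minimal sketch; the lesson's own code appears later):

import torch
import torch.nn.functional as F

# Logits from the 3-class example above
logits = torch.tensor([3.0, 1.0, 0.0])

probs = F.softmax(logits, dim=-1)
print(probs)                 # tensor([0.8438, 0.1142, 0.0420])

# Cross-entropy on the correct class (index 0)
print(-torch.log(probs[0]))  # tensor(0.1698)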

The Overconfidence Problem

Train a classifier to convergence with standard cross-entropy loss, then inspect its output probabilities on held-out examples. Typically, you'll find predictions like:

  • "99.8% cat, 0.1% dog, 0.1% bird" — for an ambiguous image
  • "99.99% positive sentiment" — for a mildly positive sentence

These probabilities are overconfident. The model assigns near-certainty to its predictions even for borderline cases. This matters in practice: if you need calibrated confidence scores (for thresholding, for downstream decision-making, for flagging uncertain cases), an overconfident model is unreliable.

The root cause is the loss function itself.

Why Hard Labels Drive Logits to Infinity

A logit is the raw, unnormalized score a neural network produces for each class before any probability conversion. For a 3-class problem, the final layer might output [2.1, 0.3, -0.8]; those three numbers are logits. Softmax converts them to probabilities by exponentiating and normalizing: [0.819, 0.135, 0.045]. The larger the logit gap between the correct class and the others, the more confident the final probability.

Standard training targets are one-hot labels: probability 1 for the correct class, 0 for all others. The cross-entropy loss is:

L = -\sum_{k=1}^{K} y_k \log \hat{y}_k = -\log \hat{y}_c

where:

  • K: number of classes
  • y_k: target probability for class k (1 for the correct class, 0 for all others)
  • ŷ_k: predicted probability for class k from softmax
  • log: natural logarithm
  • c: the correct class

To minimize this loss, ŷ_c must approach 1. But softmax can only approach 1 asymptotically, as the correct-class logit grows to infinity relative to the others.

Gradient descent therefore continuously applies pressure to make logits larger and larger throughout training. The model "tries" to achieve the impossible target of probability 1.
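
To see this pressure concretely, here is a toy sketch (not part of the lesson's code) that optimizes a single logit vector directly against a hard one-hot target:

import torch
import torch.nn.functional as F

# Toy setup: treat the logits themselves as the trainable parameters
logits = torch.tensor([[3.0, 1.0, 0.0]], requires_grad=True)
target = torch.tensor([0])  # class 0 is "correct"
opt = torch.optim.SGD([logits], lr=1.0)

for step in range(2000):
    loss = F.cross_entropy(logits, target)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(logits)  # the correct-class logit keeps growing; the gap never stops widening
print(loss)    # the loss approaches 0 but never reaches it

After many steps the correct-class logit is still climbing; only the rate slows as the softmax saturates.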

Label Smoothing: The Fix

Label smoothing replaces the hard one-hot targets with soft targets that spread a small amount of probability uniformly across all classes:

\tilde{y}_k = (1 - \varepsilon) \cdot y_k + \frac{\varepsilon}{K}

where:

  • ỹ_k: soft target for class k
  • y_k: original one-hot target for class k
  • ε: smoothing parameter, typically 0.1
  • K: number of classes

For the correct class: ỹ_c = (1 - ε) + ε/K

For each incorrect class: ỹ_k = ε/K

Example: K=10, ε=0.1, correct class index c=3:

Class         Hard label   Smooth label
0             0.000        0.010
1             0.000        0.010
2             0.000        0.010
3 (correct)   1.000        0.910
4             0.000        0.010
…             0.000        0.010

Total probability: 0.910 + 9 × 0.010 = 1.000 ✓
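
The same construction in a few lines of PyTorch (a sketch using the example's values, ε=0.1 and correct class 3):

import torch

K, eps = 10, 0.1
c = 3  # correct class index

hard = torch.zeros(K)
hard[c] = 1.0

# Soft targets: (1 - eps) * one_hot + eps / K
smooth = (1 - eps) * hard + eps / K
print(smooth)        # 0.010 everywhere except 0.910 at index 3
print(smooth.sum())  # tensor(1.)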

The Modified Loss

The loss with smooth labels:

L_{\text{smooth}} = -\sum_{k=1}^{K} \tilde{y}_k \log \hat{y}_k = -(1-\varepsilon) \log \hat{y}_c - \frac{\varepsilon}{K} \sum_{k=1}^{K} \log \hat{y}_k

where:

  • ỹ_k: smooth target for class k
  • ŷ_k: model's predicted probability for class k

The first term is the familiar cross-entropy on the correct class. The second term is the cross-entropy between a uniform distribution and the model's predictions, scaled by ε; it penalizes concentrating all the probability mass on one class and encourages the model to spread probability.
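
Expanding the soft targets makes the split into these two terms explicit (a short derivation from the definitions above):

L_{\text{smooth}} = -\sum_{k=1}^{K} \left[ (1-\varepsilon)\, y_k + \tfrac{\varepsilon}{K} \right] \log \hat{y}_k = -(1-\varepsilon) \sum_{k=1}^{K} y_k \log \hat{y}_k - \frac{\varepsilon}{K} \sum_{k=1}^{K} \log \hat{y}_k

Since y_k is 1 only for the correct class c and 0 elsewhere, the first sum collapses to log ŷ_c, giving the expression above.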

What Changes at the Optimum

With hard labels, the optimal prediction is ŷ_c = 1. With smooth labels (ε=0.1, K=10), the optimal prediction is:

\hat{y}_c = 1 - \varepsilon + \varepsilon/K = 0.91 \quad \text{and} \quad \hat{y}_k = \varepsilon/K = 0.01 \text{ for } k \neq c

The model can achieve the minimum loss at finite logit values. The corresponding logit gap between the correct and incorrect class is:

\log(0.91) - \log(0.01) = -0.094 - (-4.605) \approx 4.5

This is a concrete finite value — no infinite pressure. Gradient descent finds this minimum and stops pushing logits larger.
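
A quick numerical check (a sketch, with the correct class at index 3 sitting about 4.5 above the other nine logits, which are held at zero):

import torch
import torch.nn.functional as F

logits = torch.zeros(10)
logits[3] = 4.51
probs = F.softmax(logits, dim=-1)
print(probs[3])  # ≈ 0.91, the smoothed optimum for the correct class
print(probs[0])  # ≈ 0.01, the smoothed optimum for each incorrect class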

Empirical Evidence

Label smoothing with ε=0.1 has been shown to improve:

  • ImageNet classification: consistent ~0.3–0.5% top-1 accuracy improvement across architectures
  • Machine translation: BLEU score improvements, particularly at domain boundaries
  • Speech recognition: word error rate reduction
  • Calibration: Expected Calibration Error (ECE) typically decreases by 2–5×

The improvement is most pronounced for tasks where overconfidence causes errors (close decision boundaries, ambiguous examples) and less impactful for easy tasks with clear class boundaries.

Code: Label Smoothing in PyTorch

import torch.nn as nn

# Built-in label smoothing (PyTorch >= 1.10)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Equivalent manual implementation
class LabelSmoothingLoss(nn.Module):
    def __init__(self, num_classes, smoothing=0.1):
        super().__init__()
        self.smoothing = smoothing
        self.num_classes = num_classes

    def forward(self, logits, targets):
        log_probs = nn.functional.log_softmax(logits, dim=-1)
        # Hard-label term: negative log-probability of the correct class
        nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(1)).squeeze(1)
        # Smoothing term: uniform average of -log-probabilities over all classes
        smooth_loss = -log_probs.mean(dim=-1)
        # Average over the batch to match CrossEntropyLoss's default 'mean' reduction
        return ((1 - self.smoothing) * nll + self.smoothing * smooth_loss).mean()

# Usage
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
loss = criterion(logits, targets)  # targets are integer class indices

The PyTorch label_smoothing parameter takes ε directly (0.1 is the standard value). It's a one-line change that consistently improves performance on multi-class classification.
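
As a sanity check, continuing from the block above (and assuming the manual loss returns the batch mean, as written there), the built-in and manual versions should agree up to floating-point error on random inputs:

import torch

torch.manual_seed(0)
logits = torch.randn(8, 10)            # batch of 8, 10 classes (made-up shapes)
targets = torch.randint(0, 10, (8,))

builtin = nn.CrossEntropyLoss(label_smoothing=0.1)
manual = LabelSmoothingLoss(num_classes=10, smoothing=0.1)

print(builtin(logits, targets), manual(logits, targets))  # the two values should match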

Quiz

1 / 3

With K=4 classes and ε=0.1, label smoothing replaces the hard label y=[1,0,0,0] with what soft label ỹ?