Regularization
Lesson 6 ⏱ 12 min

Data augmentation


Data Augmentation - Getting More Training Signal from What You Have

Why label-preserving transformations effectively multiply dataset size, a catalog of standard image and text augmentations with examples, the Mixup and CutMix algorithms derived from scratch, and on-the-fly vs. pre-computed augmentation.

⏱ ~7 min

🧮 Quick refresher

Overfitting and the bias-variance tradeoff

A model overfits when it memorizes training examples rather than learning generalizable patterns. Overfitting is driven by high variance: the model is too sensitive to the specific training examples it saw. More diverse training data reduces variance — the model can't memorize patterns that weren't consistently present.

Example

A model trained only on upright dog photos might learn the shortcut "dogs are upright." If augmentation rotates and flips most of the training dogs, that shortcut stops working and the model must learn appearance features instead of orientation.

Augmentation manufactures that diversity.

The Cheapest Form of Regularization

More data is almost always better for machine learning. But data collection is expensive, labeling is slow, and some domains (medical imaging, rare events) have inherent scarcity.

Data augmentation asks: what if we generated new training examples from the ones we already have?

The key constraint is that the transformation must be label-preserving: a horizontally flipped photo of a cat is still a cat. A sentence with synonyms substituted still has the same sentiment. As long as we can maintain the correct label, every valid transformation gives us a free training example.

Standard Image Augmentations

These are the workhorse augmentations, applied as random transformations at training time:

Geometric:

  • Random horizontal flip (50% probability): trivially label-preserving for most natural images
  • Random crop: take a random sub-region (e.g., 75–100% of the image area, then resize back). Forces the model to recognize objects from partial views.
  • Random rotation (±15–30°): teaches orientation invariance

Color/appearance:

  • Color jitter: randomly perturb brightness (±0.4), contrast (±0.4), saturation (±0.4), and hue (±0.1)
  • Grayscale conversion (with small probability)
  • Gaussian blur

Erasing:

  • Cutout / Random Erasing: zero out a random rectangular region. Forces the model to not rely on any single region being present. Works surprisingly well.
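
To show how simple this is, here is a minimal from-scratch Cutout sketch on a (C, H, W) image tensor; the function name and the 16-pixel patch size are illustrative choices, not from a library (torchvision ships a built-in variant as T.RandomErasing):

import torch

def cutout(img: torch.Tensor, size: int = 16) -> torch.Tensor:
    """Zero out a random (size x size) square in a (C, H, W) image tensor."""
    _, h, w = img.shape
    # Pick a random center, then clip the square to the image bounds
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = img.clone()
    out[:, y0:y1, x0:x1] = 0.0
    return out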

Text Augmentation

Text doesn't have a direct analog to pixel flipping, but several effective techniques exist:

  • Synonym replacement: randomly swap words with synonyms from a thesaurus (e.g., "happy" → "joyful"). Preserves meaning.
  • Random insertion: insert a random synonym of a random word at a random position
  • Random deletion: delete words with small probability (e.g., 10%)
  • Back-translation: translate to another language and back. Creates paraphrases: "The cat sat on the mat" → [Spanish] → "The cat was sitting on the rug."
  • Token masking (used in BERT pre-training): mask random tokens; train to predict the original
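
To make the first two concrete, here is a minimal sketch of synonym replacement and random deletion in plain Python; the tiny hand-written SYNONYMS dictionary stands in for a real thesaurus such as WordNet:

import random

# Toy thesaurus for illustration -- a real pipeline would query WordNet or similar
SYNONYMS = {"happy": ["joyful", "glad"], "quick": ["fast", "speedy"]}

def synonym_replace(words, n=1):
    """Swap up to n words that have an entry in the thesaurus."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_delete(words, p=0.1):
    """Drop each word with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

print(synonym_replace("the quick dog looks happy".split()))
print(random_delete("the quick dog looks happy".split()))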

Advanced: Mixup

Mixup (Zhang et al., 2018) creates entirely synthetic training examples by linearly blending two real examples:

$$\tilde{x} = \lambda x_i + (1 - \lambda)\,x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda)\,y_j$$

where:

  • $\tilde{x}$: the blended input example
  • $x_i, x_j$: two randomly selected training examples
  • $\lambda$: mixing coefficient, drawn from $\text{Beta}(\alpha, \alpha)$
  • $\tilde{y}$: the blended label

The mixing coefficient $\lambda$ is sampled from a Beta distribution with parameter $\alpha$, typically $\alpha = 0.2$.

The $\text{Beta}(\alpha, \alpha)$ distribution controls how "extreme" the mixing is. Think of it as a dial between two modes:

  • Low α (e.g. 0.1): most samples land near 0 or 1, so most blended examples are nearly 100% one image or the other — mild mixing.
  • High α (e.g. 1.0): samples are spread uniformly between 0 and 1, so heavily mixed examples (near 50/50) are as likely as barely mixed ones: aggressive mixing.
  • α = 0.2 (standard default): a middle ground — most blends are 80/20 or 90/10, but occasionally you get a genuine 50/50 ghostly overlay.
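
You can see the dial empirically by sampling λ at each setting; the thresholds below (λ within 0.4–0.6 counting as a "genuine blend", λ beyond 0.1/0.9 counting as "mild") are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.1, 0.2, 1.0):
    lam = rng.beta(alpha, alpha, size=100_000)
    mixed = np.mean((lam > 0.4) & (lam < 0.6))   # near-50/50 blends
    mild = np.mean((lam < 0.1) | (lam > 0.9))    # nearly one image or the other
    print(f"alpha={alpha}: {mixed:.0%} genuine blends, {mild:.0%} mild blends")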

Worked example: $\lambda = 0.7$, image A (cat, one-hot label [1, 0]) and image B (dog, [0, 1]):

  • Blended input: 70% cat pixels + 30% dog pixels (a ghostly overlay)
  • Blended label: [0.7, 0.3] — the model should be 70% confident it's a cat

Mixup trains the model to interpolate predictions smoothly between training examples, significantly improving calibration.
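
Since this lesson derives Mixup from scratch, here is a minimal batch-level sketch; blending a batch with a shuffled copy of itself is the standard trick from the paper, and mixup_batch is our own helper name:

import torch

def mixup_batch(x, y, alpha=0.2):
    """Mixup on a batch: blend it with a shuffled copy of itself.

    x: (B, ...) inputs; y: (B, num_classes) one-hot labels.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))   # random partner for each example
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix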

Advanced: CutMix

CutMix (Yun et al., 2019) takes a more spatially coherent approach. Instead of blending every pixel, it pastes a rectangular crop from one image onto another:

  1. Sample a bounding box covering an area fraction $1 - \lambda$ of the image
  2. Replace that region in image A with the corresponding region from image B
  3. Set the label $\tilde{y} = \lambda y_A + (1 - \lambda)\,y_B$, where $\lambda$ is the fraction of pixels that came from image A

Why it works better than Mixup for vision: CutMix preserves local image statistics — each pixel belongs to exactly one real image, so texture and edge features are realistic. The model must learn to identify objects from partial views, which strongly regularizes spatial feature detectors.
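
Here is a from-scratch sketch of the box sampling (cutmix_batch is our own helper name). Following the paper, the box side lengths scale with sqrt(1 − λ) so its area is the right fraction, and λ is recomputed from the actual box area after clipping it to the image:

import torch

def cutmix_batch(x, y, alpha=1.0):
    """CutMix on a batch of (B, C, H, W) images with one-hot labels y."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    _, _, H, W = x.shape
    # Box area fraction (1 - lam) => side lengths scale with sqrt(1 - lam)
    r = (1 - lam) ** 0.5
    bh, bw = int(H * r), int(W * r)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y0, y1 = max(0, cy - bh // 2), min(H, cy + bh // 2)
    x0, x1 = max(0, cx - bw // 2), min(W, cx + bw // 2)
    x_mix = x.clone()
    x_mix[:, :, y0:y1, x0:x1] = x[perm, :, y0:y1, x0:x1]
    # Recompute lam from the actual (clipped) box area
    lam = 1 - ((y1 - y0) * (x1 - x0)) / (H * W)
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix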

Code: Augmentation in PyTorch

import torch
import torchvision.transforms.v2 as T

# Standard augmentation pipeline for ImageNet-scale training
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.75, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.ToImage(),                            # v2 replacement for the deprecated ToTensor
    T.ToDtype(torch.float32, scale=True),   # uint8 [0, 255] -> float32 [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Mixup and CutMix (torchvision >= 0.16)
mixup = T.MixUp(alpha=0.2, num_classes=1000)
cutmix = T.CutMix(alpha=1.0, num_classes=1000)

for images, labels in dataloader:
    # Randomly choose Mixup or CutMix each batch
    if torch.rand(1) < 0.5:
        images, labels = mixup(images, labels)
    else:
        images, labels = cutmix(images, labels)
    # labels are now soft (float) probabilities -- nn.CrossEntropyLoss accepts
    # class-probability targets since PyTorch 1.10
    loss = criterion(model(images), labels)

Augmentation is applied to training data only. Validation and test transforms should use only deterministic resizing and normalization — no random augmentation.
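
For contrast, a typical deterministic evaluation pipeline under the same torchvision v2 API might look like this:

import torch
import torchvision.transforms.v2 as T

# Deterministic: resize, center-crop, convert, normalize -- no randomness
eval_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])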

Quiz

Question 1 of 3

Why is data augmentation typically applied on-the-fly during training rather than pre-computed and saved?