Regularization
Lesson 6 ⏱ 12 min

Data augmentation


Data Augmentation - Getting More Training Signal from What You Have

Why label-preserving transformations effectively multiply dataset size, a catalog of standard image and text augmentations with examples, the Mixup and CutMix algorithms derived from scratch, and on-the-fly vs. pre-computed augmentation.

⏱ ~7 min

🧮 Quick refresher

Overfitting and the bias-variance tradeoff

A model overfits when it memorizes training examples rather than learning generalizable patterns. Overfitting is driven by high variance: the model is too sensitive to the specific training examples it saw. More diverse training data reduces variance — the model can't memorize patterns that weren't consistently present.

Example

A model trained only on upright dog photos might learn the shortcut "dogs are upright." If augmentation rotates and flips most of the training dogs, that shortcut stops working and the model must learn appearance features instead of orientation.

Augmentation manufactures that diversity.

The Cheapest Form of Regularization

More data is almost always better for machine learning. But data collection is expensive, labeling is slow, and some domains (medical imaging, rare events) have inherent scarcity.

Data augmentation asks: what if we generated new training examples from the ones we already have?

The key constraint is that the transformation must be label-preserving: a horizontally flipped photo of a cat is still a cat. A sentence with synonyms substituted still has the same sentiment. As long as we can maintain the correct label, every valid transformation gives us a free training example.

Standard Image Augmentations

These are the workhorse augmentations, applied as random transformations at training time:

Geometric:

  • Random horizontal flip (50% probability): trivially label-preserving for most natural images
  • Random crop: take a random sub-region (e.g., 75–100% of the image area, then resize back). Forces the model to recognize objects from partial views.
  • Random rotation (±15–30°): teaches orientation invariance

Color/appearance:

  • Color jitter: randomly perturb brightness (±0.4), contrast (±0.4), saturation (±0.4), and hue (±0.1)
  • Grayscale conversion (with small probability)
  • Gaussian blur

Erasing:

  • Cutout / Random Erasing: zero out a random rectangular region. Forces the model to not rely on any single region being present. Works surprisingly well.
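
To show how simple this is, here is a minimal from-scratch Cutout sketch on a (C, H, W) image tensor; the function name and the 16-pixel patch size are illustrative choices, not from a library (torchvision ships a built-in variant as T.RandomErasing):

import torch

def cutout(img: torch.Tensor, size: int = 16) -> torch.Tensor:
    """Zero out a random (size x size) square in a (C, H, W) image tensor."""
    _, h, w = img.shape
    # Pick a random center, then clip the square to the image bounds
    cy, cx = torch.randint(h, (1,)).item(), torch.randint(w, (1,)).item()
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = img.clone()
    out[:, y0:y1, x0:x1] = 0.0
    return out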

Text Augmentation

Text doesn't have a direct analog to pixel flipping, but several effective techniques exist:

  • Synonym replacement: randomly swap words with synonyms from a thesaurus (e.g., "happy" → "joyful"). Preserves meaning.
  • Random insertion: insert a random synonym of a random word at a random position
  • Random deletion: delete words with small probability (e.g., 10%)
  • Back-translation: translate to another language and back. Creates paraphrases: "The cat sat on the mat" → [Spanish] → "The cat was sitting on the rug."
  • Token masking (used in BERT pre-training): mask random tokens; train to predict the original
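
To make the first two concrete, here is a minimal sketch of synonym replacement and random deletion in plain Python; the tiny hand-written SYNONYMS dictionary stands in for a real thesaurus such as WordNet:

import random

# Toy thesaurus for illustration -- a real pipeline would query WordNet or similar
SYNONYMS = {"happy": ["joyful", "glad"], "quick": ["fast", "speedy"]}

def synonym_replace(words, n=1):
    """Swap up to n words that have an entry in the thesaurus."""
    out = list(words)
    candidates = [i for i, w in enumerate(out) if w in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        out[i] = random.choice(SYNONYMS[out[i]])
    return out

def random_delete(words, p=0.1):
    """Drop each word with probability p, keeping at least one word."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

print(synonym_replace("the quick dog looks happy".split()))
print(random_delete("the quick dog looks happy".split()))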

Advanced: Mixup

Mixup (Zhang et al., 2018) creates entirely synthetic training examples by linearly blending two real examples:

$$\tilde{x} = \lambda x_i + (1 - \lambda)\,x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda)\,y_j$$

where:

  • $\tilde{x}$: the blended input example
  • $x_i, x_j$: two randomly selected training examples
  • $\lambda$: mixing coefficient, drawn from $\text{Beta}(\alpha, \alpha)$
  • $\tilde{y}$: the blended label

The mixing coefficient $\lambda$ is sampled from a Beta distribution with parameter $\alpha$, typically $\alpha = 0.2$.

The $\text{Beta}(\alpha, \alpha)$ distribution controls how "extreme" the mixing is. Think of it as a dial between two modes:

  • Low α (e.g. 0.1): most samples land near 0 or 1, so most blended examples are nearly 100% one image or the other — mild mixing.
  • High α (e.g. 1.0): samples are spread uniformly between 0 and 1, so heavily mixed examples (near 50/50) are as likely as barely mixed ones: aggressive mixing.
  • α = 0.2 (standard default): a middle ground — most blends are 80/20 or 90/10, but occasionally you get a genuine 50/50 ghostly overlay.
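
You can see the dial empirically by sampling λ at each setting; the thresholds below (λ within 0.4–0.6 counting as a "genuine blend", λ beyond 0.1/0.9 counting as "mild") are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)
for alpha in (0.1, 0.2, 1.0):
    lam = rng.beta(alpha, alpha, size=100_000)
    mixed = np.mean((lam > 0.4) & (lam < 0.6))   # near-50/50 blends
    mild = np.mean((lam < 0.1) | (lam > 0.9))    # nearly one image or the other
    print(f"alpha={alpha}: {mixed:.0%} genuine blends, {mild:.0%} mild blends")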

Worked example: $\lambda = 0.7$, image A (cat, one-hot label [1, 0]) and image B (dog, [0, 1]):

  • Blended input: 70% cat pixels + 30% dog pixels (a ghostly overlay)
  • Blended label: [0.7, 0.3] — the model should be 70% confident it's a cat

Mixup trains the model to interpolate predictions smoothly between training examples, significantly improving calibration.
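
Since this lesson derives Mixup from scratch, here is a minimal batch-level sketch; blending a batch with a shuffled copy of itself is the standard trick from the paper, and mixup_batch is our own helper name:

import torch

def mixup_batch(x, y, alpha=0.2):
    """Mixup on a batch: blend it with a shuffled copy of itself.

    x: (B, ...) inputs; y: (B, num_classes) one-hot labels.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))   # random partner for each example
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix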

Advanced: CutMix

CutMix (Yun et al., 2019) takes a more spatially coherent approach. Instead of blending every pixel, it pastes a rectangular crop from one image onto another:

  1. Sample a bounding box covering an area fraction $1 - \lambda$ of the image
  2. Replace that region in image A with the corresponding region from image B
  3. Set the label $\tilde{y} = \lambda y_A + (1 - \lambda)\,y_B$, where $\lambda$ is the fraction of pixels that came from image A

Why it works better than Mixup for vision: CutMix preserves local image statistics — each pixel belongs to exactly one real image, so texture and edge features are realistic. The model must learn to identify objects from partial views, which strongly regularizes spatial feature detectors.
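
Here is a from-scratch sketch of the box sampling (cutmix_batch is our own helper name). Following the paper, the box side lengths scale with sqrt(1 − λ) so its area is the right fraction, and λ is recomputed from the actual box area after clipping it to the image:

import torch

def cutmix_batch(x, y, alpha=1.0):
    """CutMix on a batch of (B, C, H, W) images with one-hot labels y."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    _, _, H, W = x.shape
    # Box area fraction (1 - lam) => side lengths scale with sqrt(1 - lam)
    r = (1 - lam) ** 0.5
    bh, bw = int(H * r), int(W * r)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y0, y1 = max(0, cy - bh // 2), min(H, cy + bh // 2)
    x0, x1 = max(0, cx - bw // 2), min(W, cx + bw // 2)
    x_mix = x.clone()
    x_mix[:, :, y0:y1, x0:x1] = x[perm, :, y0:y1, x0:x1]
    # Recompute lam from the actual (clipped) box area
    lam = 1 - ((y1 - y0) * (x1 - x0)) / (H * W)
    y_mix = lam * y + (1 - lam) * y[perm]
    return x_mix, y_mix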

Code: Augmentation in PyTorch

import torch
import torchvision.transforms.v2 as T

# Standard augmentation pipeline for ImageNet-scale training
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.75, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.ToImage(),                            # v2 replacement for the deprecated ToTensor
    T.ToDtype(torch.float32, scale=True),   # uint8 [0, 255] -> float32 [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Mixup and CutMix (torchvision >= 0.16)
mixup = T.MixUp(alpha=0.2, num_classes=1000)
cutmix = T.CutMix(alpha=1.0, num_classes=1000)

for images, labels in dataloader:
    # Randomly choose Mixup or CutMix each batch
    if torch.rand(1) < 0.5:
        images, labels = mixup(images, labels)
    else:
        images, labels = cutmix(images, labels)
    # labels are now soft (float) probabilities -- nn.CrossEntropyLoss accepts
    # class-probability targets since PyTorch 1.10
    loss = criterion(model(images), labels)

Augmentation is applied to training data only. Validation and test transforms should use only deterministic resizing and normalization — no random augmentation.
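
For contrast, a typical deterministic evaluation pipeline under the same torchvision v2 API might look like this:

import torch
import torchvision.transforms.v2 as T

# Deterministic: resize, center-crop, convert, normalize -- no randomness
eval_transform = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToImage(),
    T.ToDtype(torch.float32, scale=True),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])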

Quiz

Question 1 of 3

Why is data augmentation typically applied on-the-fly during training rather than pre-computed and saved?