The Cheapest Form of Regularization
More data is almost always better for machine learning. But data collection is expensive, labeling is slow, and some domains (medical imaging, rare events) have inherent scarcity.
Data augmentation asks: what if we generated new training examples from the ones we already have?
The key constraint is that the transformation must be label-preserving: a horizontally flipped photo of a cat is still a cat. A sentence with synonyms substituted still has the same sentiment. As long as we can maintain the correct label, every valid transformation gives us a free training example.
Standard Image Augmentations
These are the workhorse augmentations, applied as random transformations at training time:
Geometric:
- Random horizontal flip (50% probability): trivially label-preserving for most natural images
- Random crop: take a random sub-region (e.g., 75–100% of the image area) and resize it back to the original size. Forces the model to recognize objects from partial views.
- Random rotation (±15–30°): teaches orientation invariance
Color/appearance:
- Color jitter: randomly perturb brightness (±0.4), contrast (±0.4), saturation (±0.4), and hue (±0.1)
- Grayscale conversion (with small probability)
- Gaussian blur
Erasing:
- Cutout / Random Erasing: zero out a random rectangular region. Forces the model to not rely on any single region being present. Works surprisingly well.
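The core of Cutout fits in a few lines of PyTorch. This is a sketch of the idea rather than the paper's exact recipe; the square size and placement logic here are illustrative:

```python
import torch

def cutout(img: torch.Tensor, size: int = 8) -> torch.Tensor:
    """Zero out a random size x size square in a CHW image tensor."""
    img = img.clone()
    _, h, w = img.shape
    # Pick a top-left corner so the square fits inside the image
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    img[:, top:top + size, left:left + size] = 0.0
    return img

x = torch.ones(3, 32, 32)
x_aug = cutout(x, size=8)  # exactly one 8x8 region is now zero in every channel
```

torchvision also ships a built-in variant, `transforms.RandomErasing`, which additionally randomizes the region's aspect ratio and fill value.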
Text Augmentation
Text doesn't have a direct analog to pixel flipping, but several effective techniques exist:
- Synonym replacement: randomly swap words with synonyms from a thesaurus (e.g., "happy" → "joyful"). Preserves meaning.
- Random insertion: insert a random synonym of a random word at a random position
- Random deletion: delete words with small probability (e.g., 10%)
- Back-translation: translate to another language and back. Creates paraphrases: "The cat sat on the mat" → [Spanish] → "The cat was sitting on the rug."
- Token masking (used in BERT pre-training): mask random tokens; train to predict the original
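Two of these techniques can be sketched in plain Python. The tiny synonym table below is a stand-in for a real thesaurus (e.g., WordNet), and the function names are illustrative:

```python
import random

SYNONYMS = {"happy": ["joyful", "glad"], "fast": ["quick", "rapid"]}  # toy thesaurus

def synonym_replacement(words, p=0.3, rng=random):
    # Swap each word for a random synonym with probability p
    return [rng.choice(SYNONYMS[w]) if w in SYNONYMS and rng.random() < p else w
            for w in words]

def random_deletion(words, p=0.1, rng=random):
    # Drop each word with probability p, but never return an empty sentence
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)
print(synonym_replacement("the happy dog ran fast".split(), p=1.0, rng=rng))
```

Note that `random_deletion` keeps at least one word: dropping everything would leave no input, and the label would no longer be meaningful.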
Advanced: Mixup
Mixup (Zhang et al., 2018) creates entirely synthetic training examples by linearly blending two real examples:
x̃ = λ·x_i + (1−λ)·x_j
ỹ = λ·y_i + (1−λ)·y_j

where:
- x̃, ỹ: the blended input example and its blended label
- x_i, x_j (with labels y_i, y_j): two randomly selected training examples
- λ: mixing coefficient, drawn from Beta(α, α)

The coefficient λ is sampled from a Beta distribution with parameter α, typically 0.2.
The Beta(α, α) distribution controls how "extreme" the mixing is. Think of it as a dial between two modes:
- Low α (e.g., 0.1): most samples of λ land near 0 or 1, so most blended examples are nearly 100% one image or the other (mild mixing).
- High α (e.g., 1.0): λ is uniform on [0, 1], so heavily mixed examples are common; pushing α higher still concentrates λ near 0.5 (aggressive mixing).
- α = 0.2 (standard default): a middle ground; most blends are 80/20 or 90/10, but occasionally you get a genuine 50/50 ghostly overlay.
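The effect of α can be checked empirically. This NumPy sketch measures how often λ falls outside the "mild" range [0.1, 0.9] for each setting (the helper name and cutoffs are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def frac_extreme(alpha, n=100_000):
    """Fraction of Beta(alpha, alpha) samples landing outside [0.1, 0.9]."""
    lam = rng.beta(alpha, alpha, size=n)
    return np.mean((lam < 0.1) | (lam > 0.9))

for alpha in (0.1, 0.2, 1.0):
    print(f"alpha={alpha}: {frac_extreme(alpha):.2f} of samples are near 0 or 1")
```

For α = 1.0 the distribution is uniform, so about 20% of samples fall in those tails; lowering α pushes that fraction toward 1.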
Worked example: λ = 0.7, image A (cat, one-hot label [1,0]) and image B (dog, [0,1]):
- Blended input: 70% cat pixels + 30% dog pixels (a ghostly overlay)
- Blended label: [0.7, 0.3] — the model should be 70% confident it's a cat
Mixup trains the model to interpolate predictions smoothly between training examples, significantly improving calibration.
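The blending step itself is two lines of arithmetic. This sketch (names are illustrative) reproduces the worked example above with λ fixed at 0.7; in training, λ would be drawn from Beta(α, α) per batch:

```python
import torch

def mixup_pair(x1, y1, x2, y2, lam):
    """Blend two examples (or whole batches) with mixing coefficient lam."""
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y

cat_img, cat_label = torch.zeros(3, 4, 4), torch.tensor([1.0, 0.0])
dog_img, dog_label = torch.ones(3, 4, 4), torch.tensor([0.0, 1.0])
x, y = mixup_pair(cat_img, cat_label, dog_img, dog_label, lam=0.7)
# y is the soft label [0.7, 0.3]; x is 70% cat pixels + 30% dog pixels
```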
Advanced: CutMix
CutMix (Yun et al., 2019) takes a more spatially coherent approach. Instead of blending every pixel, it pastes a rectangular crop from one image onto another:
- Sample a bounding box covering an area fraction 1−λ of the image
- Replace that region in image A with the corresponding region from image B
- Set label: ỹ = λ·y_A + (1−λ)·y_B, where λ = fraction of pixels remaining from A
Why it works better than Mixup for vision: CutMix preserves local image statistics — each pixel belongs to exactly one real image, so texture and edge features are realistic. The model must learn to identify objects from partial views, which strongly regularizes spatial feature detectors.
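The paste-and-relabel step can be sketched as follows. The box placement is fixed here for clarity, whereas the real method samples it randomly; the function name is illustrative:

```python
import torch

def cutmix_pair(img_a, y_a, img_b, y_b, top, left, h, w):
    """Paste an h x w region of img_b into img_a; mix labels by pixel fraction."""
    img = img_a.clone()
    img[:, top:top + h, left:left + w] = img_b[:, top:top + h, left:left + w]
    _, H, W = img_a.shape
    lam = 1 - (h * w) / (H * W)   # fraction of pixels still from A
    y = lam * y_a + (1 - lam) * y_b
    return img, y

a, b = torch.zeros(3, 8, 8), torch.ones(3, 8, 8)
img, y = cutmix_pair(a, torch.tensor([1.0, 0.0]), b, torch.tensor([0.0, 1.0]),
                     top=0, left=0, h=4, w=4)
# The pasted 4x4 box covers 1/4 of the 8x8 image, so y = [0.75, 0.25]
```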
Code: Augmentation in PyTorch
import torch
import torchvision.transforms.v2 as T

# Standard augmentation pipeline for ImageNet-scale training
train_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.75, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Mixup and CutMix (torchvision >= 0.16); these operate on whole batches
mixup = T.MixUp(alpha=0.2, num_classes=1000)
cutmix = T.CutMix(alpha=1.0, num_classes=1000)

for images, labels in dataloader:
    # Randomly choose Mixup or CutMix each batch
    if torch.rand(1) < 0.5:
        images, labels = mixup(images, labels)
    else:
        images, labels = cutmix(images, labels)
    # labels are now soft (float) — CrossEntropyLoss accepts soft labels
    loss = criterion(model(images), labels)
Augmentation is applied to training data only. Validation and test transforms should use only deterministic resizing and normalization — no random augmentation.