The Transfer Learning Revolution
For most of the history of machine learning, models were trained from scratch for each specific task. You had a task, you had labeled data, you trained a model, done.
The 2010s changed everything. It turned out that models trained on massive general datasets learn intermediate features that are useful for many downstream tasks, not just the original training task.
A model trained on ImageNet (1.2 million images, 1000 classes) learns filters that detect edges, textures, shapes, and object parts. These same features are useful for identifying chest X-ray pathologies, classifying satellite imagery, or spotting defective manufacturing parts - even though the source and target tasks look very different. You do not need to relearn "what is an edge"; you can reuse the pretrained knowledge.
This single insight transformed applied ML: in most practical applications, training from scratch is no longer necessary or even advisable.
When to Use Transfer Learning
Transfer learning is appropriate when:
- You have fewer than ~50,000 examples (for images) or ~100,000 tokens (for text)
- Your task domain is reasonably similar to what the pretrained model saw
- You want fast iteration time
- You have limited compute budget
Training from scratch is appropriate when:
- You have millions of examples
- Your data distribution is fundamentally different from anything previously modeled (e.g., novel sensor types, rare medical modalities)
- You need full control over what features the model learns
- You are building a foundation model for others to use
In practice: almost always start with transfer learning. The overhead of downloading pretrained weights is trivial. The benefit is often dramatic.
Fine-Tuning Strategies
When you take a pretrained model and adapt it to your task, you have several options:
Strategy 1: Feature extraction (frozen backbone)
Keep all pretrained weights frozen. Add a new output head (one or two layers) for your task and train only that head. This is the fastest option and works well when your task is similar to the pretraining task and you have limited data (a minimal code sketch follows the list below).
- frozen pretrained weights - not updated during training
- new task head weights - initialized randomly and trained
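As a rough illustration, here is a minimal PyTorch sketch of this strategy using a torchvision ResNet-50; the 10-class head and the learning rate are placeholder choices, not recommendations:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone (ResNet-50 as a placeholder choice).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze every pretrained parameter: they are not updated during training.
for param in backbone.parameters():
    param.requires_grad = False

# Replace the ImageNet head with a randomly initialized head for your task.
num_classes = 10  # placeholder: set to your number of classes
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)

# Only the new head's parameters (requires_grad=True) reach the optimizer.
optimizer = torch.optim.Adam(
    (p for p in backbone.parameters() if p.requires_grad), lr=1e-3
)
```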
Strategy 2: Full fine-tuning
Start from the pretrained weights, but allow all weights to update during training. Use a very small learning rate (typically 1/10 to 1/100 of what you would use from scratch) to avoid destroying the pretrained features (see the sketch below).
- fine-tuning learning rate, much smaller than from-scratch lr
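A minimal sketch of the same setup with everything left trainable, again assuming a torchvision ResNet-50; the exact learning rate is illustrative:

```python
import torch
from torchvision import models

# Start from pretrained weights and leave every layer trainable.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# A from-scratch run might use a much larger lr; fine-tuning uses far less
# so the pretrained features are nudged rather than overwritten.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
```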
Strategy 3: Layerwise learning rate decay
Different layers get different learning rates. Earlier layers (which learn general features) get smaller learning rates; later layers (which learn task-specific features) get larger learning rates. This is a good default for deep models.
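One common way to implement this is with per-layer optimizer parameter groups. The sketch below assumes a torchvision ResNet-50; the grouping and the decay multiplier are illustrative assumptions, not a fixed recipe:

```python
import torch
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

base_lr = 1e-3  # largest lr, for the task-specific head
decay = 0.5     # multiplier applied per group as we move toward the input

# Later, task-specific stages get larger learning rates; earlier, general
# stages get smaller ones. The stem (conv1/bn1) is left out of the optimizer
# here, which effectively freezes it - an illustrative simplification.
param_groups = [
    {"params": model.fc.parameters(),     "lr": base_lr},
    {"params": model.layer4.parameters(), "lr": base_lr * decay},
    {"params": model.layer3.parameters(), "lr": base_lr * decay ** 2},
    {"params": model.layer2.parameters(), "lr": base_lr * decay ** 3},
    {"params": model.layer1.parameters(), "lr": base_lr * decay ** 4},
]
optimizer = torch.optim.AdamW(param_groups)
```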
Early Stopping
Even with transfer learning, you can overfit on your specific training set. Early stopping is a simple and very effective regularization technique (sketched in code after this list):
- Monitor validation loss after each epoch
- Save a checkpoint whenever validation loss improves
- Stop training when validation loss has not improved for a set number of consecutive epochs (the "patience" parameter)
- Return the saved checkpoint (best validation performance, not final weights)
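A minimal sketch of that loop, assuming hypothetical train_one_epoch and evaluate helpers that run one training pass and return the validation loss:

```python
import copy

def fit(model, train_loader, val_loader, max_epochs=100, patience=5):
    best_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader)    # assumed helper: one training pass
        val_loss = evaluate(model, val_loader)  # assumed helper: returns validation loss

        if val_loss < best_loss:
            # Validation loss improved: checkpoint the weights and reset patience.
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # patience exhausted: stop training

    # Return the best checkpoint, not the final weights.
    model.load_state_dict(best_state)
    return model
```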
Interactive example
Training monitor - watch training and validation loss curves with early stopping trigger visualization
Coming soon
Early stopping prevents you from training past the generalization optimum. It also gives you a "free" regularization mechanism - you are implicitly limiting the number of gradient updates, which limits how much the model can overfit.
Data Augmentation
When you have limited data, a powerful complement to transfer learning is data augmentation - synthetically creating new training examples by applying random transformations to existing ones:
- Images: random crops, flips, rotations, color jitter, brightness changes
- Text: synonym replacement, back-translation, random deletion/insertion
- Audio: time stretching, pitch shifting, adding noise, mixing clips
The key: augmentations must preserve the label. A horizontally flipped cat is still a cat. An image with adjusted brightness is still whatever class it was before.
Augmentation can effectively multiply your training set size by 5-50x at minimal cost. Combined with transfer learning, it is often sufficient to build high-quality models on surprisingly small datasets.
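For images, a typical way to express such a pipeline is with torchvision transforms; the specific operations and parameter values below are illustrative choices, not tuned settings:

```python
from torchvision import transforms

# Training pipeline: random, label-preserving transformations.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                     # random crops
    transforms.RandomHorizontalFlip(),                     # a flipped cat is still a cat
    transforms.RandomRotation(degrees=10),                 # small rotations
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # color / brightness changes
    transforms.ToTensor(),
])

# Validation pipeline: deterministic preprocessing only, no augmentation.
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])
```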
The Practical Workflow
- Find the best pretrained model for your modality (check Papers With Code and model hubs like HuggingFace)
- Establish a baseline with frozen backbone + new head, minimal augmentation
- Diagnose whether you are underfitting or overfitting - if underfitting, unfreeze more layers or increase model size; if overfitting, add augmentation, add regularization, or reduce model size
- Once the baseline looks promising, run a full fine-tune with an appropriate learning rate and early stopping
- Iterate on augmentation - it is often the highest-ROI improvement for small datasets
Interactive example
Fine-tuning lab - experiment with frozen vs. unfrozen layers and different learning rates
Coming soon