Design Patterns, Not Architecture Memorization
Understanding why successful CNN architectures are designed the way they are is more valuable than memorizing their exact configurations. Five recurring patterns appear across nearly every state-of-the-art CNN — from ResNet (2015) to EfficientNet (2019) to ConvNeXt (2022). Master these and you can both read existing architectures and design new ones.
Modern CNN architectures like ResNet and EfficientNet are the backbone of production image AI — from self-driving car perception stacks to the vision models inside medical imaging software. Knowing the design patterns that appear in all of them means you can adapt any architecture for a new task, not just copy-paste code.
Pattern 1: The Stem
The stem handles the raw input (e.g., 224×224 RGB) before the main stages begin.
Standard stem (ResNet-style):
Input: 224×224×3 → 7×7 conv, 64 filters, stride 2 → 112×112×64 → 3×3 max pool, stride 2 → 56×56×64
Why large kernels here? At 224×224, spatial resolution is expensive. A 7×7 conv with stride 2 enlarges the receptive field and halves resolution in a single step, cheaper than stacking several small convolutions at full resolution to cover the same receptive field. The input is simple (RGB pixels), so a slightly coarser spatial operation is fine.
Why max pool? Further fast downsampling to reach 56×56 (a practical starting size for stages). Total: 224×224 → 56×56 in two operations.
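As a sanity check, the stem can be sketched in PyTorch (a minimal version; the printed shape matches the trace above):

```python
import torch
import torch.nn as nn

# ResNet-style stem: 7x7 stride-2 conv, then 3x3 stride-2 max pool
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

x = torch.randn(1, 3, 224, 224)  # dummy RGB image
out = stem(x)
print(out.shape)                 # torch.Size([1, 64, 56, 56])
```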
Pattern 2: Stage Structure
The main body is divided into stages, with consistent characteristics:
- Within a stage: fixed number of channels, fixed spatial size
- Between stages: double channels, halve spatial resolution (stride-2 transition)
Why double channels? When spatial resolution halves (4× fewer positions), doubling channels keeps each convolution's cost roughly constant: conv compute scales with H × W × C_in × C_out, so the 4× drop in positions is offset by the 4× growth in the channel product. Capacity shifts from spatial detail to richer per-position features.
Typical stage progression:
| Stage | Spatial size | Channels | Blocks |
|---|---|---|---|
| Stem | 56×56 | 64 | 1 |
| Stage 1 | 56×56 | 64 | 3 |
| Stage 2 | 28×28 | 128 | 4 |
| Stage 3 | 14×14 | 256 | 6 |
| Stage 4 | 7×7 | 512 | 3 |
(ResNet-50 structure)
Most state-of-the-art architectures use 4–5 stages. Block counts per stage typically rise toward the middle stages and taper at the end (ResNet-50: 3, 4, 6, 3), concentrating depth where feature maps are compact enough for blocks to be cheap.
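A quick arithmetic check of the halve-resolution/double-channels rule, using the stage sizes from the table: a 3×3 conv's multiply-add count scales with H × W × C_in × C_out, so each stage transition leaves the per-conv cost unchanged.

```python
# (spatial size, channels) per stage, taken from the ResNet-50 table
stages = [(56, 64), (28, 128), (14, 256), (7, 512)]

# Multiply-adds for one 3x3 conv: H * W * C_in * C_out * 9
macs = [size * size * ch * ch * 9 for size, ch in stages]
for (size, ch), m in zip(stages, macs):
    print(f"{size:>2}x{size:<2}, {ch:>3} ch -> {m:,} multiply-adds per 3x3 conv")
# All four stages give the identical count: halving H and W (÷4)
# is exactly offset by doubling C_in and C_out (×4).
```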
Pattern 3: Bottleneck Blocks
In stages where channel counts are large (256+), replace simple 3×3 blocks with bottleneck blocks to reduce computation:
channels=256 → 1×1 conv, 64 channels (compress) → 3×3 conv, 64 channels (spatial filtering on cheap 64-ch map) → 1×1 conv, 256 channels (expand back) + skip connection
The expensive 3×3 convolution runs on 64 channels instead of 256, making it 16× cheaper (conv cost scales with C_in × C_out). The 1×1 convolutions are cheap (no spatial extent). Compared with a plain block of two 3×3 convs at 256 channels, total compute drops by roughly an order of magnitude with minimal accuracy loss.
Bottleneck blocks are used in ResNet-50, ResNet-101, ResNet-152. Shallower models (ResNet-18, ResNet-34) use simple two-3×3 blocks since their channels are narrower.
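The savings can be checked with back-of-envelope arithmetic (a sketch; 14×14 is assumed as a typical stage resolution for 256 channels):

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-adds for a k x k conv on an h x w map (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k

H = W = 14
C = 256

# Plain block: two 3x3 convs at full width
plain = 2 * conv_macs(H, W, C, C, 3)

# Bottleneck: 1x1 compress to 64, 3x3 at 64, 1x1 expand back to 256
bottleneck = (conv_macs(H, W, C, 64, 1)
              + conv_macs(H, W, 64, 64, 3)
              + conv_macs(H, W, 64, C, 1))

print(f"plain:      {plain:,} MACs")       # 231,211,008
print(f"bottleneck: {bottleneck:,} MACs")  # 13,647,872
print(f"savings:    {plain / bottleneck:.1f}x")
```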
Pattern 4: Global Average Pooling
At the end of the final stage, the feature map is typically 7×7 × C (after five halvings of resolution from 224×224). Traditional approach: flatten to a long vector, then fully-connected layers.
Better approach: Global Average Pooling (GAP)
GAP_c = (1 / (H × W)) × Σ_{h=1..H} Σ_{w=1..W} x_{h,w,c}

where:
- GAP_c: global average pool output for channel c
- H, W: spatial dimensions of the final feature map
- x_{h,w,c}: feature map value at position (h, w), channel c
GAP averages all spatial positions for each channel, producing a vector of length C. Then one FC layer maps C → num_classes.
Parameter comparison (ResNet final layer, C=512, 1000 classes):
- Flatten 7×7×512 + FC: (7×7×512) × 1000 = 25,088,000 params
- GAP + FC: 512 × 1000 = 512,000 params — 49× fewer
Additional benefit: GAP works for any input resolution. A network trained on 224×224 with GAP can be applied directly to 320×320 inputs — the GAP averages over a larger spatial map but still outputs C numbers.
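A minimal PyTorch demonstration of this resolution independence (the 7×7 and 10×10 maps stand in for 224×224 and 320×320 inputs after 32× downsampling):

```python
import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d(1)  # global average pool: any HxW map -> 1x1

for size in (7, 10):
    fmap = torch.randn(1, 512, size, size)
    pooled = gap(fmap).flatten(1)
    print(tuple(fmap.shape), "->", tuple(pooled.shape))  # always (1, 512)
```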
Pattern 5: Receptive Field Arithmetic
Before finalizing an architecture, verify the network's receptive field is large enough to see the relevant context.
For each layer in sequence, the cumulative receptive field grows:
RF_l = RF_{l−1} + (k_l − 1) × (s_1 × s_2 × … × s_{l−1}),   with RF_0 = 1

where:
- RF_l: receptive field after l layers
- k_l: kernel size of layer l
- s_l: stride of layer l
- RF_0 = 1: initial receptive field (one pixel)
The product of all previous strides is the current input stride — how many input pixels correspond to one step in the current feature map.
Example trace for ResNet-50:
| Layer | Kernel | Stride | Input stride | RF |
|---|---|---|---|---|
| Input | — | — | 1 | 1 |
| Stem 7×7 | 7 | 2 | 1 | 7 |
| MaxPool 3×3 | 3 | 2 | 2 | 11 |
| Stage 1 conv1 3×3 | 3 | 1 | 4 | 19 |
| Stage 1 conv2 3×3 | 3 | 1 | 4 | 27 |
| Stage 2 conv1 3×3 | 3 | 2 | 4 | 35 |
| … (continuing) | … | … | … | … |
| After all stages | — | — | 32 | ~483 |
Final receptive field of ~483 for a 224×224 input — sufficient to see the full image with margin.
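The recurrence is easy to automate. This small helper (illustrative, not from any library) reproduces the first rows of the trace:

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs. Returns the RF after each layer."""
    rf, jump = 1, 1           # jump = product of strides so far (input stride)
    trace = []
    for k, s in layers:
        rf += (k - 1) * jump  # each extra tap widens RF by jump input pixels
        jump *= s
        trace.append(rf)
    return trace

# Stem 7x7/2, maxpool 3x3/2, two 3x3/1 convs, one 3x3/2 conv
print(receptive_field([(7, 2), (3, 2), (3, 1), (3, 1), (3, 2)]))
# [7, 11, 19, 27, 35], matching the table rows
```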
A Design Checklist
When designing a new CNN:
- Input resolution: what is H_in × W_in? How many times can you halve (each halving = one stage)?
- Starting channels: 64 is standard. Scale up or down for your compute budget.
- Stages: 4–5 stages. Double channels, halve resolution at each transition.
- Block type: bottleneck if channels ≥ 256; simple two-conv if narrower.
- Skip connections: always use residual blocks for ≥ 10 layers.
- Head: GAP → FC → softmax (or sigmoid for multi-label).
- Receptive field check: verify RF at final layer covers ≥ 70% of input.
- Parameter budget: count params at each stage; most should be in stages 3–4.
Code: Building a Small ResNet-style CNN
import torch.nn as nn

# The two block types the model below references (minimal versions):
class ResidualBlock(nn.Module):
    """Identity-skip block: two 3x3 convs, channels and resolution unchanged."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch),
            nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch))
    def forward(self, x):
        return nn.functional.relu(x + self.body(x))

class ProjectionResidualBlock(nn.Module):
    """Transition block: strided 3x3 downsamples; 1x1 conv projects the skip."""
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch))
        self.proj = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
    def forward(self, x):
        return nn.functional.relu(self.proj(x) + self.body(x))
class SmallResNet(nn.Module):
def __init__(self, num_classes=10):
super().__init__()
# Stem
self.stem = nn.Sequential(
nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
nn.BatchNorm2d(32), nn.ReLU(),
)
# Stages: each doubles channels, halves resolution
self.stage1 = self._make_stage(32, 64, stride=2, n_blocks=2)
self.stage2 = self._make_stage(64, 128, stride=2, n_blocks=2)
self.stage3 = self._make_stage(128, 256, stride=2, n_blocks=2)
# Head
self.head = nn.Sequential(
nn.AdaptiveAvgPool2d(1), # global average pool
nn.Flatten(),
nn.Linear(256, num_classes)
)
def _make_stage(self, in_ch, out_ch, stride, n_blocks):
layers = [ProjectionResidualBlock(in_ch, out_ch, stride)]
for _ in range(n_blocks - 1):
layers.append(ResidualBlock(out_ch))
return nn.Sequential(*layers)
def forward(self, x):
x = self.stem(x)
x = self.stage1(x); x = self.stage2(x); x = self.stage3(x)
return self.head(x)
nn.AdaptiveAvgPool2d(1) is PyTorch's global average pool — it outputs a 1×1 spatial map regardless of input size, implementing the GAP pattern in one line.