Convolutional Networks
Lesson 9 ⏱ 12 min

CNN architecture design patterns


CNN Architecture Patterns - A Design Checklist

Walking through the five canonical CNN design patterns with concrete examples from ResNet, EfficientNet, and MobileNet, deriving the receptive field arithmetic formula, and assembling a complete small CNN from scratch using the checklist.

⏱ ~8 min


Quick refresher

Receptive field and spatial resolution

The receptive field is the region of the input that influences each output neuron. Strided convolutions and pooling reduce spatial resolution, which increases the effective receptive field per output position. Doubling the number of channels while halving spatial resolution keeps the per-layer convolution compute roughly constant across stages, trading spatial detail for feature richness.

Example

A 224×224 input with 3 stages, each halving spatial size: after stage 1, 112×112; stage 2, 56×56; stage 3, 28×28.

If channels go 64→128→256, the activation count per stage actually halves: 112×112×64 = 802,816, then 56×56×128 = 401,408, then 28×28×256 = 200,704. What stays roughly constant is each stage's per-layer 3×3 conv compute: halving H and W gives 4× fewer positions, while doubling both input and output channels makes each position 4× more expensive.
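Checking that arithmetic in plain Python (stdlib only): the activation count halves each stage, while the per-layer conv compute, which scales with H·W·C_in·C_out, is what stays constant:

```python
# Spatial size halves and channels double at each stage transition
stages = [(112, 64), (56, 128), (28, 256)]   # (spatial size, channels)

print([h * h * c for h, c in stages])        # [802816, 401408, 200704]: halves

# 3x3-conv compute within each stage scales with H * W * Cin * Cout
costs = [h * h * c * c for h, c in stages]
print(costs[0] == costs[1] == costs[2])      # True: per-layer compute is constant
```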

Design Patterns, Not Architecture Memorization

Understanding why successful CNN architectures are designed the way they are is more valuable than memorizing their exact configurations. Five recurring patterns appear across nearly every state-of-the-art CNN — from ResNet (2015) to EfficientNet (2019) to ConvNeXt (2022). Master these and you can both read existing architectures and design new ones.

Modern CNN architectures like ResNet and EfficientNet are the backbone of production image AI — from self-driving car perception stacks to the vision models inside medical imaging software. Knowing the design patterns that appear in all of them means you can adapt any architecture for a new task, not just copy-paste code.

Pattern 1: The Stem

The stem handles the raw input (e.g., 224×224 RGB) before the main stages begin.

Standard stem (ResNet-style):

Input: 224×224×3
→ 7×7 conv, 64 filters, stride 2 → 112×112×64
→ 3×3 max pool, stride 2         → 56×56×64

Why large kernels here? At 224×224, spatial resolution is expensive. A 7×7 conv with stride 2 simultaneously enlarges the receptive field and halves resolution in one step, covering the same receptive field as three stacked 3×3 convolutions. The input is simple (raw RGB pixels), so a slightly coarser spatial operation is fine.

Why max pool? Further fast downsampling to reach 56×56 (a practical starting size for stages). Total: 224×224 → 56×56 in two operations.
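With standard padding choices (p = 3 for the 7×7 conv, p = 1 for the pool), the stem's downsampling follows the usual conv output-size formula; a quick sketch:

```python
def conv_out(size, k, s, p):
    """Output spatial size of a conv/pool layer: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * p - k) // s + 1

h = conv_out(224, k=7, s=2, p=3)   # 7x7 conv, stride 2: 224 -> 112
h = conv_out(h, k=3, s=2, p=1)     # 3x3 max pool, stride 2: 112 -> 56
print(h)  # 56
```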

Pattern 2: Stage Structure

The main body is divided into stages, with consistent characteristics:

  • Within a stage: fixed number of channels, fixed spatial size
  • Between stages: double channels, halve spatial resolution (stride-2 transition)

Why double channels? When spatial resolution halves (4× fewer positions), doubling both input and output channels keeps per-layer convolution compute roughly constant (cost scales with H·W·C_in·C_out: the 4× loss in positions is offset by the 4× gain from wider channels) while shifting capacity from spatial detail to feature richness.

Typical stage progression:

Stage     Spatial size   Channels   Blocks
Stem      56×56          64         1
Stage 1   56×56          64         3
Stage 2   28×28          128        4
Stage 3   14×14          256        6
Stage 4   7×7            512        3

(ResNet-50 structure)

Most state-of-the-art architectures use 4–5 stages. The number of blocks per stage typically rises toward the middle stages and tapers at the end (3, 4, 6, 3 for ResNet-50); the middle stages get the most depth because their smaller feature maps make each block relatively cheap.

Pattern 3: Bottleneck Blocks

In stages where channel counts are large (256+), replace simple 3×3 blocks with bottleneck blocks to reduce computation:

channels=256
→ 1×1 conv, 64 channels   (compress)
→ 3×3 conv, 64 channels   (spatial filtering on cheap 64-ch map)
→ 1×1 conv, 256 channels  (expand back)
+ skip connection

The expensive 3×3 convolution runs on 64 channels instead of 256 — 16× cheaper spatially. The 1×1 convolutions are cheap (no spatial extent). Total savings: ~4× per block with minimal accuracy loss.

Bottleneck blocks are used in ResNet-50, ResNet-101, and ResNet-152. Shallower models (ResNet-18, ResNet-34) use simple blocks of two 3×3 convolutions, since their channel counts are narrower.
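Counting per-position multiply-adds (k²·C_in·C_out) makes the savings concrete; note that the exact whole-block ratio depends on what you compare the bottleneck against:

```python
# Per-position multiply-adds for a kxk conv: k*k * Cin * Cout
conv = lambda k, cin, cout: k * k * cin * cout

direct = conv(3, 256, 256)                                     # 3x3 on 256 channels
trio = conv(1, 256, 64) + conv(3, 64, 64) + conv(1, 64, 256)   # bottleneck trio
print(direct // conv(3, 64, 64))   # 16: the 3x3 itself runs 16x cheaper
print(direct / trio)               # ~8.5: whole trio vs one full-width 3x3
```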

Pattern 4: Global Average Pooling

At the end of the final stage, the feature map is typically 7×7 × C (after five halvings from 224×224). Traditional approach: flatten to a long vector, then fully-connected layers.

Better approach: Global Average Pooling (GAP)

\hat{x}_c = \frac{1}{H W} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{h,w,c}

where \hat{x}_c is the global average pool output for channel c, H and W are the spatial dimensions of the final feature map, and x_{h,w,c} is the feature map value at position (h, w), channel c.

GAP averages all spatial positions for each channel, producing a vector of length C. Then one FC layer maps C → num_classes.

Parameter comparison (ResNet-18/34-style final layer, C=512, 1000 classes):

  • Flatten 7×7×512 + FC: (7×7×512) × 1000 = 25,088,000 params
  • GAP + FC: 512 × 1000 = 512,000 params — 49× fewer

Additional benefit: GAP works for any input resolution. A network trained on 224×224 with GAP can be applied directly to 320×320 inputs — the GAP averages over a larger spatial map but still outputs C numbers.
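GAP is simple enough to write by hand; a stdlib sketch that also checks the parameter arithmetic above:

```python
def gap(fmap):
    """Global average pool by hand: fmap[c][h][w] -> list of C channel means."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]

fmap = [[[float(c)] * 7 for _ in range(7)] for c in range(512)]  # 512 x 7 x 7
print(len(gap(fmap)))      # 512: one value per channel, for any H and W

# Head weight counts for 1000 classes
print(7 * 7 * 512 * 1000)  # 25088000 (flatten + FC)
print(512 * 1000)          # 512000   (GAP + FC)
```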

Pattern 5: Receptive Field Arithmetic

Before finalizing an architecture, verify the network's receptive field is large enough to see the relevant context.

For each layer in sequence, the cumulative receptive field grows:

R_L = R_{L-1} + (k_L - 1) \cdot \prod_{i=1}^{L-1} s_i

where R_L is the receptive field after L layers, k_L is the kernel size of layer L, s_i is the stride of layer i, and R_0 = 1 (a single input pixel).

The product of all previous strides is the current input stride — how many input pixels correspond to one step in the current feature map.

Example trace for ResNet-50:

Layer               Kernel   Stride   Input stride   RF
Input               –        –        1              1
Stem 7×7            7        2        1              7
MaxPool 3×3         3        2        2              11
Stage 1 conv1 3×3   3        1        4              19
Stage 1 conv2 3×3   3        1        4              27
Stage 2 conv1 3×3   3        2        4              35
… (continuing)
After all stages    –        –        32             ~483

Final receptive field of ~483 for a 224×224 input — sufficient to see the full image with margin.
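The recurrence translates directly into a few lines of Python; this sketch reproduces the RF column of the trace above:

```python
def rf_trace(layers):
    """layers: (kernel, stride) pairs in input order.
    Returns the cumulative receptive field after each layer."""
    rf, jump, trace = 1, 1, []
    for k, s in layers:
        rf += (k - 1) * jump   # jump = product of earlier strides (input stride)
        jump *= s
        trace.append(rf)
    return trace

# ResNet-50 stem plus the first few 3x3 convs, as in the trace above
print(rf_trace([(7, 2), (3, 2), (3, 1), (3, 1), (3, 2)]))  # [7, 11, 19, 27, 35]
```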

A Design Checklist

When designing a new CNN:

  1. Input resolution: what is H_in × W_in? How many times can you halve (each halving = one stage)?
  2. Starting channels: 64 is standard. Scale up or down for your compute budget.
  3. Stages: 4–5 stages. Double channels, halve resolution at each transition.
  4. Block type: bottleneck if channels ≥ 256; simple two-conv if narrower.
  5. Skip connections: always use residual blocks for ≥ 10 layers.
  6. Head: GAP → FC → softmax (or sigmoid for multi-label).
  7. Receptive field check: verify RF at final layer covers ≥ 70% of input.
  8. Parameter budget: count params at each stage; most should be in stages 3–4.

Code: Building a Small ResNet-style CNN

import torch.nn as nn

class ResidualBlock(nn.Module):
    # Identity-skip block: two 3x3 convs; channels and resolution unchanged
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)

class ProjectionResidualBlock(nn.Module):
    # Transition block: strided 3x3 downsamples and widens;
    # a strided 1x1 conv projects the skip path to match
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.proj = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + self.proj(x))

class SmallResNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Stem
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32), nn.ReLU(),
        )
        # Stages: each doubles channels, halves resolution
        self.stage1 = self._make_stage(32, 64, stride=2, n_blocks=2)
        self.stage2 = self._make_stage(64, 128, stride=2, n_blocks=2)
        self.stage3 = self._make_stage(128, 256, stride=2, n_blocks=2)
        # Head
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # global average pool
            nn.Flatten(),
            nn.Linear(256, num_classes)
        )

    def _make_stage(self, in_ch, out_ch, stride, n_blocks):
        # First block downsamples and widens; the rest keep the shape
        layers = [ProjectionResidualBlock(in_ch, out_ch, stride)]
        for _ in range(n_blocks - 1):
            layers.append(ResidualBlock(out_ch))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)
        x = self.stage1(x); x = self.stage2(x); x = self.stage3(x)
        return self.head(x)

nn.AdaptiveAvgPool2d(1) is PyTorch's global average pool — it outputs a 1×1 spatial map regardless of input size, implementing the GAP pattern in one line.

Quiz

1 / 3

What is the purpose of large-kernel (7×7 or 5×5) convolutions in the network stem?