Convolutional Networks
Lesson 9 ⏱ 12 min

CNN architecture design patterns


CNN Architecture Patterns - A Design Checklist

Walking through the five canonical CNN design patterns with concrete examples from ResNet, EfficientNet, and MobileNet, deriving the receptive field arithmetic formula, and assembling a complete small CNN from scratch using the checklist.

⏱ ~8 min


Quick refresher

Receptive field and spatial resolution

The receptive field is the region of the input that influences each output neuron. Strided convolutions and pooling reduce spatial resolution, which increases the effective receptive field per output position. Doubling the number of channels while halving spatial resolution keeps the per-layer convolution compute roughly constant across stages, trading spatial detail for feature richness.

Example

A 224×224 input with 3 stages, each halving spatial size: after stage 1, 112×112; stage 2, 56×56; stage 3, 28×28.

If channels go 64→128→256, the activation count per stage actually halves: 112×112×64 = 802,816, then 56×56×128 = 401,408, then 28×28×256 = 200,704. What stays roughly constant is each stage's per-layer 3×3 conv compute: halving H and W gives 4× fewer positions, while doubling both input and output channels makes each position 4× more expensive.
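Checking that arithmetic in plain Python (stdlib only): the activation count halves each stage, while the per-layer conv compute, which scales with H·W·C_in·C_out, is what stays constant:

```python
# Spatial size halves and channels double at each stage transition
stages = [(112, 64), (56, 128), (28, 256)]   # (spatial size, channels)

print([h * h * c for h, c in stages])        # [802816, 401408, 200704]: halves

# 3x3-conv compute within each stage scales with H * W * Cin * Cout
costs = [h * h * c * c for h, c in stages]
print(costs[0] == costs[1] == costs[2])      # True: per-layer compute is constant
```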

Design Patterns, Not Architecture Memorization

Understanding why successful CNN architectures are designed the way they are is more valuable than memorizing their exact configurations. Five recurring patterns appear across nearly every state-of-the-art CNN — from ResNet (2015) to EfficientNet (2019) to ConvNeXt (2022). Master these and you can both read existing architectures and design new ones.

Modern CNN architectures like ResNet and EfficientNet are the backbone of production image AI — from self-driving car perception stacks to the vision models inside medical imaging software. Knowing the design patterns that appear in all of them means you can adapt any architecture for a new task, not just copy-paste code.

Pattern 1: The Stem

The stem handles the raw input (e.g., 224×224 RGB) before the main stages begin.

Standard stem (ResNet-style):

Input: 224×224×3
→ 7×7 conv, 64 filters, stride 2 → 112×112×64
→ 3×3 max pool, stride 2         → 56×56×64

Why large kernels here? At 224×224, spatial resolution is expensive. A 7×7 conv with stride 2 simultaneously enlarges the receptive field and halves resolution in one step, covering the same receptive field as three stacked 3×3 convolutions. The input is simple (raw RGB pixels), so a slightly coarser spatial operation is fine.

Why max pool? Further fast downsampling to reach 56×56 (a practical starting size for stages). Total: 224×224 → 56×56 in two operations.
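With standard padding choices (p = 3 for the 7×7 conv, p = 1 for the pool), the stem's downsampling follows the usual conv output-size formula; a quick sketch:

```python
def conv_out(size, k, s, p):
    """Output spatial size of a conv/pool layer: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * p - k) // s + 1

h = conv_out(224, k=7, s=2, p=3)   # 7x7 conv, stride 2: 224 -> 112
h = conv_out(h, k=3, s=2, p=1)     # 3x3 max pool, stride 2: 112 -> 56
print(h)  # 56
```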

Pattern 2: Stage Structure

The main body is divided into stages, with consistent characteristics:

  • Within a stage: fixed number of channels, fixed spatial size
  • Between stages: double channels, halve spatial resolution (stride-2 transition)

Why double channels? When spatial resolution halves (4× fewer positions), doubling both input and output channels keeps per-layer convolution compute roughly constant (cost scales with H·W·C_in·C_out: the 4× loss in positions is offset by the 4× gain from wider channels) while shifting capacity from spatial detail to feature richness.

Typical stage progression:

Stage     Spatial size   Channels   Blocks
Stem      56×56          64         1
Stage 1   56×56          64         3
Stage 2   28×28          128        4
Stage 3   14×14          256        6
Stage 4   7×7            512        3

(ResNet-50 structure)

Most state-of-the-art architectures use 4–5 stages. The number of blocks per stage typically rises toward the middle stages and tapers at the end (3, 4, 6, 3 for ResNet-50); the middle stages get the most depth because their smaller feature maps make each block relatively cheap.

Pattern 3: Bottleneck Blocks

In stages where channel counts are large (256+), replace simple 3×3 blocks with bottleneck blocks to reduce computation:

channels=256
→ 1×1 conv, 64 channels   (compress)
→ 3×3 conv, 64 channels   (spatial filtering on cheap 64-ch map)
→ 1×1 conv, 256 channels  (expand back)
+ skip connection

The expensive 3×3 convolution runs on 64 channels instead of 256 — 16× cheaper spatially. The 1×1 convolutions are cheap (no spatial extent). Total savings: ~4× per block with minimal accuracy loss.

Bottleneck blocks are used in ResNet-50, ResNet-101, and ResNet-152. Shallower models (ResNet-18, ResNet-34) use simple blocks of two 3×3 convolutions, since their channel counts are narrower.
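Counting per-position multiply-adds (k²·C_in·C_out) makes the savings concrete; note that the exact whole-block ratio depends on what you compare the bottleneck against:

```python
# Per-position multiply-adds for a kxk conv: k*k * Cin * Cout
conv = lambda k, cin, cout: k * k * cin * cout

direct = conv(3, 256, 256)                                     # 3x3 on 256 channels
trio = conv(1, 256, 64) + conv(3, 64, 64) + conv(1, 64, 256)   # bottleneck trio
print(direct // conv(3, 64, 64))   # 16: the 3x3 itself runs 16x cheaper
print(direct / trio)               # ~8.5: whole trio vs one full-width 3x3
```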

Pattern 4: Global Average Pooling

At the end of the final stage, the feature map is typically 7×7 × C (after five halvings from 224×224). Traditional approach: flatten to a long vector, then fully-connected layers.

Better approach: Global Average Pooling (GAP)

\hat{x}_c = \frac{1}{H W} \sum_{h=1}^{H} \sum_{w=1}^{W} x_{h,w,c}

where \hat{x}_c is the global average pool output for channel c, H and W are the spatial dimensions of the final feature map, and x_{h,w,c} is the feature map value at position (h, w), channel c.

GAP averages all spatial positions for each channel, producing a vector of length C. Then one FC layer maps C → num_classes.

Parameter comparison (ResNet-18/34-style final layer, C=512, 1000 classes):

  • Flatten 7×7×512 + FC: (7×7×512) × 1000 = 25,088,000 params
  • GAP + FC: 512 × 1000 = 512,000 params — 49× fewer

Additional benefit: GAP works for any input resolution. A network trained on 224×224 with GAP can be applied directly to 320×320 inputs — the GAP averages over a larger spatial map but still outputs C numbers.
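GAP is simple enough to write by hand; a stdlib sketch that also checks the parameter arithmetic above:

```python
def gap(fmap):
    """Global average pool by hand: fmap[c][h][w] -> list of C channel means."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]

fmap = [[[float(c)] * 7 for _ in range(7)] for c in range(512)]  # 512 x 7 x 7
print(len(gap(fmap)))      # 512: one value per channel, for any H and W

# Head weight counts for 1000 classes
print(7 * 7 * 512 * 1000)  # 25088000 (flatten + FC)
print(512 * 1000)          # 512000   (GAP + FC)
```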

Pattern 5: Receptive Field Arithmetic

Before finalizing an architecture, verify the network's receptive field is large enough to see the relevant context.

For each layer in sequence, the cumulative receptive field grows:

R_L = R_{L-1} + (k_L - 1) \cdot \prod_{i=1}^{L-1} s_i

where R_L is the receptive field after L layers, k_L is the kernel size of layer L, s_i is the stride of layer i, and R_0 = 1 (a single input pixel).

The product of all previous strides is the current input stride — how many input pixels correspond to one step in the current feature map.

Example trace for ResNet-50:

Layer               Kernel   Stride   Input stride   RF
Input               –        –        1              1
Stem 7×7            7        2        1              7
MaxPool 3×3         3        2        2              11
Stage 1 conv1 3×3   3        1        4              19
Stage 1 conv2 3×3   3        1        4              27
Stage 2 conv1 3×3   3        2        4              35
… (continuing)
After all stages    –        –        32             ~483

Final receptive field of ~483 for a 224×224 input — sufficient to see the full image with margin.
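The recurrence translates directly into a few lines of Python; this sketch reproduces the RF column of the trace above:

```python
def rf_trace(layers):
    """layers: (kernel, stride) pairs in input order.
    Returns the cumulative receptive field after each layer."""
    rf, jump, trace = 1, 1, []
    for k, s in layers:
        rf += (k - 1) * jump   # jump = product of earlier strides (input stride)
        jump *= s
        trace.append(rf)
    return trace

# ResNet-50 stem plus the first few 3x3 convs, as in the trace above
print(rf_trace([(7, 2), (3, 2), (3, 1), (3, 1), (3, 2)]))  # [7, 11, 19, 27, 35]
```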

A Design Checklist

When designing a new CNN:

  1. Input resolution: what is H_in × W_in? How many times can you halve (each halving = one stage)?
  2. Starting channels: 64 is standard. Scale up or down for your compute budget.
  3. Stages: 4–5 stages. Double channels, halve resolution at each transition.
  4. Block type: bottleneck if channels ≥ 256; simple two-conv if narrower.
  5. Skip connections: always use residual blocks for ≥ 10 layers.
  6. Head: GAP → FC → softmax (or sigmoid for multi-label).
  7. Receptive field check: verify RF at final layer covers ≥ 70% of input.
  8. Parameter budget: count params at each stage; most should be in stages 3–4.

Code: Building a Small ResNet-style CNN

import torch.nn as nn

class ResidualBlock(nn.Module):
    # Identity-skip block: two 3x3 convs; channels and resolution unchanged
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(ch),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)

class ProjectionResidualBlock(nn.Module):
    # Transition block: strided 3x3 downsamples and widens;
    # a strided 1x1 conv projects the skip path to match
    def __init__(self, in_ch, out_ch, stride):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.proj = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + self.proj(x))

class SmallResNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Stem
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(32), nn.ReLU(),
        )
        # Stages: each doubles channels, halves resolution
        self.stage1 = self._make_stage(32, 64, stride=2, n_blocks=2)
        self.stage2 = self._make_stage(64, 128, stride=2, n_blocks=2)
        self.stage3 = self._make_stage(128, 256, stride=2, n_blocks=2)
        # Head
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # global average pool
            nn.Flatten(),
            nn.Linear(256, num_classes)
        )

    def _make_stage(self, in_ch, out_ch, stride, n_blocks):
        # First block downsamples and widens; the rest keep the shape
        layers = [ProjectionResidualBlock(in_ch, out_ch, stride)]
        for _ in range(n_blocks - 1):
            layers.append(ResidualBlock(out_ch))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.stem(x)
        x = self.stage1(x); x = self.stage2(x); x = self.stage3(x)
        return self.head(x)

nn.AdaptiveAvgPool2d(1) is PyTorch's global average pool — it outputs a 1×1 spatial map regardless of input size, implementing the GAP pattern in one line.

Quiz

1 / 3

What is the purpose of large-kernel (7×7 or 5×5) convolutions in the network stem?