Convolutional Networks
Lesson 6 ⏱ 12 min

Dilated convolutions and receptive fields


Dilated Convolutions - Seeing Farther Without Stacking Layers

The receptive field concept, why standard convolutions need many layers for global context, how dilated convolutions space out their kernel samples, the exponentially growing receptive field with dilation rates [1, 2, 4, 8], and applications in WaveNet and DeepLab.

⏱ ~7 min

🧮

Quick refresher

Convolutional receptive field

The receptive field of a neuron is the region of the input that it depends on. For a single 3×3 convolution, each output neuron sees a 3×3 region. After stacking two 3×3 convolutions, each output neuron indirectly depends on a 5×5 region. Each additional 3×3 layer adds 2 to the receptive field size.

Example

Three 3×3 convolution layers: after layer 1, each output sees a 3×3 input region.

After layer 2, each sees a 5×5 region.

After layer 3, each sees a 7×7 region.

To see a 100×100 region, you'd need (100-1)/2 = 49.5 → 50 layers of 3×3 convolutions.
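A quick sanity check of this arithmetic, as a small sketch (the helper names here are ours, not from the lesson):

# Sketch: receptive field of L stacked 3x3 convolutions (R = 1 + 2L),
# and the number of layers needed to reach a target receptive field.
import math

def receptive_field(num_layers, kernel_size=3):
    # Each stride-1 layer adds (kernel_size - 1) to the receptive field.
    return 1 + num_layers * (kernel_size - 1)

def layers_needed(target, kernel_size=3):
    return math.ceil((target - 1) / (kernel_size - 1))

print(receptive_field(3))   # 7, matching the three-layer example above
print(layers_needed(100))   # 50 layers to see a 100x100 region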

The Receptive Field Problem

Every output neuron in a CNN can only "see" a limited portion of the input — its receptive field. For image classification, a large receptive field is ideal — the model should consider the whole image before deciding "cat." For dense prediction tasks (semantic segmentation, depth estimation), every output pixel needs a large receptive field to incorporate global context.

Dilated convolutions are used in DeepLab for image segmentation, WaveNet for audio generation, and many real-time perception systems. They solve the fundamental tension between large receptive fields and spatial resolution — letting a network see a wide context without losing fine-grained detail or adding more parameters.

With standard 3×3 convolutions, the receptive field grows linearly: each layer adds 2 to the receptive field size (1 on each side). After L layers:

R = 1 + 2L

where R is the receptive field size and L is the number of 3×3 conv layers.

To achieve a 65×65 receptive field, you need 32 layers of 3×3 convolutions. Going deep is fine for classification (we can use pooling to reduce spatial size), but for dense prediction (per-pixel tasks), pooling destroys the fine spatial detail needed to label individual pixels.

Dilated convolutions solve this: large receptive fields without downsampling.

What is a Dilated Convolution?

A dilated convolution spaces out its kernel elements, inserting gaps between the sampled positions (equivalent to interleaving zeros between the weights), creating an expanded ("dilated") effective kernel without adding any parameters.

Visualizing a 3×3 kernel at different dilation rates:

Dilation d=1 (standard): kernel samples at positions 0, 1, 2 → covers 3×3 area

Dilation d=2: kernel samples at positions 0, 2, 4 → covers 5×5 area with 9 weights

Dilation d=4: kernel samples at positions 0, 4, 8 → covers 9×9 area with 9 weights
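The spacing pattern is easy to verify with a few lines of Python (a small sketch of our own, not part of any library API):

# Sketch: positions sampled along one axis by a 3-tap kernel at dilation d.
def sample_positions(kernel_size=3, dilation=1):
    return [i * dilation for i in range(kernel_size)]

for d in [1, 2, 4]:
    pos = sample_positions(dilation=d)
    extent = pos[-1] - pos[0] + 1
    print(f"d={d}: samples at {pos}, covers a {extent}x{extent} area with 9 weights")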

The effective receptive field size for a single dilated 3×3 convolution with dilation rate d:

R_{\text{eff}} = (k - 1) \cdot d + 1

where k is the kernel size (e.g., 3), d is the dilation rate, and R_eff is the effective receptive field of this layer.

For k=3, d=4: R_eff = 2×4 + 1 = 9 (covers 9×9 area with only 9 parameters — same as a 3×3 standard conv).
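You can also see the footprint empirically by pushing an impulse through a dilated convolution and checking which outputs it reaches. A minimal sketch, assuming a single-channel conv with an all-ones kernel:

import torch
import torch.nn as nn

# Sketch: footprint of one 3x3 conv with dilation=4 (expected R_eff = 9).
conv = nn.Conv2d(1, 1, kernel_size=3, padding=4, dilation=4, bias=False)
nn.init.constant_(conv.weight, 1.0)    # all-ones kernel makes every tap visible

x = torch.zeros(1, 1, 17, 17)
x[0, 0, 8, 8] = 1.0                    # single impulse at the center

with torch.no_grad():
    y = conv(x)[0, 0]

rows, cols = torch.nonzero(y, as_tuple=True)
print(int(rows.max() - rows.min() + 1))  # 9: the footprint spans a 9x9 extent
print(len(rows))                         # 9: only 9 taps fall inside that extent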

Stacking Dilated Convolutions: Exponential Growth

The power of dilated convolutions emerges when stacked with increasing rates. With rates [1, 2, 4, 8], each using 3×3 kernels:

Layer | Dilation | Adds to RF | Cumulative RF
1     | 1        | 2×1 = 2    | 3
2     | 2        | 2×2 = 4    | 7
3     | 4        | 2×4 = 8    | 15
4     | 8        | 2×8 = 16   | 31

After 4 layers: a receptive field of 31×31 using only 4×9 = 36 weights per channel pair, whereas 4 standard 3×3 layers reach a receptive field of only 9.

Extending to rates [1, 2, 4, 8, 16]:

R = 1 + \sum_{i=1}^{L} (k - 1) \cdot d_i = 1 + 2 \sum_{i=1}^{L} d_i

where d_i is the dilation rate of layer i and k is the kernel size.

With rates [1, 2, 4, 8, 16]: sum = 31, R = 1 + 2×31 = 63 from only 5 layers, with full spatial resolution maintained throughout.
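The formula is easy to check numerically (a short sketch; the function name is ours):

# Sketch: cumulative receptive field of stacked 3x3 dilated convolutions.
def stacked_rf(dilations, kernel_size=3):
    return 1 + (kernel_size - 1) * sum(dilations)

print(stacked_rf([1, 2, 4, 8]))        # 31
print(stacked_rf([1, 2, 4, 8, 16]))    # 63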

Applications

WaveNet (audio generation, DeepMind 2016): Generates audio sample by sample. Uses dilated 1D causal convolutions with rates [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] per block, achieving a receptive field of 1023 samples — over 60ms of audio context — with manageable depth.
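The real WaveNet also uses gated activations plus residual and skip connections; the sketch below only illustrates its dilated causal convolution pattern (kernel size 2, doubling dilation rates, left-padding to stay causal). The class name and channel count here are our own choices:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    """One dilated causal conv: the output at time t depends only on inputs <= t."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.dilation = dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):
        # Left-pad by (kernel_size - 1) * dilation so the conv never looks ahead
        # and the output length matches the input length.
        x = F.pad(x, (self.dilation, 0))
        return self.conv(x)

# One WaveNet-style block: dilation doubles each layer, 1 through 512.
block = nn.Sequential(*[
    DilatedCausalConv1d(channels=32, dilation=d)
    for d in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
])

x = torch.randn(1, 32, 2000)
print(block(x).shape)   # torch.Size([1, 32, 2000]): sequence length preserved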

DeepLab (semantic segmentation, Google 2015–2018): Semantic segmentation requires per-pixel class labels. DeepLab uses "atrous" (dilated) convolutions to maintain full resolution feature maps while achieving large receptive fields. The final version (DeepLabV3+) uses atrous spatial pyramid pooling (ASPP): multiple dilated convolutions in parallel at rates [6, 12, 18] to capture objects at different scales.
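A stripped-down sketch of the parallel-dilation idea behind ASPP (the real DeepLabV3+ module also includes a 1×1 branch, image-level pooling, and batch normalization; the module name and channel sizes below are our own):

import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Parallel dilated 3x3 convs at several rates, concatenated and projected."""
    def __init__(self, in_channels, out_channels, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_channels * len(rates), out_channels, kernel_size=1)

    def forward(self, x):
        # Each branch sees context at a different scale; concatenation mixes them.
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))

aspp = SimpleASPP(256, 256)
print(aspp(torch.randn(1, 256, 33, 33)).shape)   # torch.Size([1, 256, 33, 33])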

Code: Dilated Convolution in PyTorch

import torch.nn as nn

# Standard 3×3 convolution (dilation=1)
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)

# Dilated 3×3, dilation=2: kernel samples spaced 2 apart
# padding must equal dilation to maintain spatial size
dilated2 = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

# Dilated 3×3, dilation=4
dilated4 = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)

# Stack: receptive field grows from 3 → 7 → 15 → 31
class DilatedStack(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=4, dilation=4),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=8, dilation=8),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)

Key rule: for a stride-1 dilated 3×3 conv with dilation d, set padding=d to maintain the same spatial output size as the input. This ensures all spatial resolution is preserved.
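A quick way to confirm the rule, assuming stride 1:

import torch
import torch.nn as nn

# Check: with stride 1 and padding=d, a dilated 3x3 conv preserves spatial size.
x = torch.randn(1, 64, 56, 56)
for d in [1, 2, 4, 8]:
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=d, dilation=d)
    print(d, conv(x).shape)   # each: torch.Size([1, 64, 56, 56])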

Quiz

1 / 3

A dilated convolution with dilation rate d=3 and kernel size k=3 covers what spatial extent in the input?