Convolutional Networks
Lesson 6 ⏱ 12 min

Dilated convolutions and receptive fields


Dilated Convolutions - Seeing Farther Without Stacking Layers

The receptive field concept, why standard convolutions need many layers for global context, how dilated convolutions space out their kernel samples, the exponentially growing receptive field with dilation rates [1, 2, 4, 8], and applications in WaveNet and DeepLab.

⏱ ~7 min

🧮

Quick refresher

Convolutional receptive field

The receptive field of a neuron is the region of the input that it depends on. For a single 3×3 convolution, each output neuron sees a 3×3 region. After stacking two 3×3 convolutions, each output neuron indirectly depends on a 5×5 region. Each additional 3×3 layer adds 2 to the receptive field size.

Example

Three 3×3 convolution layers: after layer 1, each output sees a 3×3 input region.

After layer 2, each sees a 5×5 region.

After layer 3, each sees a 7×7 region.

To see a 100×100 region, you'd need (100-1)/2 = 49.5 → 50 layers of 3×3 convolutions.
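A quick sanity check of this arithmetic, as a small sketch (the helper names here are ours, not from the lesson):

# Sketch: receptive field of L stacked 3x3 convolutions (R = 1 + 2L),
# and the number of layers needed to reach a target receptive field.
import math

def receptive_field(num_layers, kernel_size=3):
    # Each stride-1 layer adds (kernel_size - 1) to the receptive field.
    return 1 + num_layers * (kernel_size - 1)

def layers_needed(target, kernel_size=3):
    return math.ceil((target - 1) / (kernel_size - 1))

print(receptive_field(3))   # 7, matching the three-layer example above
print(layers_needed(100))   # 50 layers to see a 100x100 region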

The Receptive Field Problem

Every output neuron in a CNN can only "see" a limited portion of the input — its receptive field. For image classification, a large receptive field is ideal — the model should consider the whole image before deciding "cat." For dense prediction tasks (semantic segmentation, depth estimation), every output pixel needs a large receptive field to incorporate global context.

Dilated convolutions are used in DeepLab for image segmentation, WaveNet for audio generation, and many real-time perception systems. They solve the fundamental tension between large receptive fields and spatial resolution — letting a network see a wide context without losing fine-grained detail or adding more parameters.

With standard 3×3 convolutions, the receptive field grows linearly: each layer adds 2 to the receptive field size (1 on each side). After L layers:

R = 1 + 2L

where R is the receptive field size and L is the number of 3×3 conv layers.

To achieve a 65×65 receptive field, you need 32 layers of 3×3 convolutions. Going deep is fine for classification (we can use pooling to reduce spatial size), but for dense prediction (per-pixel tasks), pooling destroys the fine spatial detail needed to label individual pixels.

Dilated convolutions solve this: large receptive fields without downsampling.

What is a Dilated Convolution?

A dilated convolution spaces out its kernel elements, inserting gaps between the sampled positions (equivalent to interleaving zeros between the weights), creating an expanded ("dilated") effective kernel without adding any parameters.

Visualizing a 3×3 kernel at different dilation rates:

Dilation d=1 (standard): kernel samples at positions 0, 1, 2 → covers 3×3 area

Dilation d=2: kernel samples at positions 0, 2, 4 → covers 5×5 area with 9 weights

Dilation d=4: kernel samples at positions 0, 4, 8 → covers 9×9 area with 9 weights
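The spacing pattern is easy to verify with a few lines of Python (a small sketch of our own, not part of any library API):

# Sketch: positions sampled along one axis by a 3-tap kernel at dilation d.
def sample_positions(kernel_size=3, dilation=1):
    return [i * dilation for i in range(kernel_size)]

for d in [1, 2, 4]:
    pos = sample_positions(dilation=d)
    extent = pos[-1] - pos[0] + 1
    print(f"d={d}: samples at {pos}, covers a {extent}x{extent} area with 9 weights")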

The effective receptive field size for a single dilated 3×3 convolution with dilation rate d:

R_{\text{eff}} = (k - 1) \cdot d + 1

where k is the kernel size (e.g., 3), d is the dilation rate, and R_eff is the effective receptive field of this layer.

For k=3, d=4: R_eff = 2×4 + 1 = 9 (covers 9×9 area with only 9 parameters — same as a 3×3 standard conv).
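You can also see the footprint empirically by pushing an impulse through a dilated convolution and checking which outputs it reaches. A minimal sketch, assuming a single-channel conv with an all-ones kernel:

import torch
import torch.nn as nn

# Sketch: footprint of one 3x3 conv with dilation=4 (expected R_eff = 9).
conv = nn.Conv2d(1, 1, kernel_size=3, padding=4, dilation=4, bias=False)
nn.init.constant_(conv.weight, 1.0)    # all-ones kernel makes every tap visible

x = torch.zeros(1, 1, 17, 17)
x[0, 0, 8, 8] = 1.0                    # single impulse at the center

with torch.no_grad():
    y = conv(x)[0, 0]

rows, cols = torch.nonzero(y, as_tuple=True)
print(int(rows.max() - rows.min() + 1))  # 9: the footprint spans a 9x9 extent
print(len(rows))                         # 9: only 9 taps fall inside that extent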

Stacking Dilated Convolutions: Exponential Growth

The power of dilated convolutions emerges when stacked with increasing rates. With rates [1, 2, 4, 8], each using 3×3 kernels:

Layer | Dilation | Adds to RF | Cumulative RF
1     | 1        | 2×1 = 2    | 3
2     | 2        | 2×2 = 4    | 7
3     | 4        | 2×4 = 8    | 15
4     | 8        | 2×8 = 16   | 31

After 4 layers: a receptive field of 31×31 using only 4×9 = 36 weights per channel pair, whereas 4 standard 3×3 layers reach a receptive field of only 9.

Extending to rates [1, 2, 4, 8, 16]:

R = 1 + \sum_{i=1}^{L} (k - 1) \cdot d_i = 1 + 2 \sum_{i=1}^{L} d_i

where d_i is the dilation rate of layer i and k is the kernel size.

With rates [1, 2, 4, 8, 16]: sum = 31, R = 1 + 2×31 = 63 from only 5 layers, with full spatial resolution maintained throughout.
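The formula is easy to check numerically (a short sketch; the function name is ours):

# Sketch: cumulative receptive field of stacked 3x3 dilated convolutions.
def stacked_rf(dilations, kernel_size=3):
    return 1 + (kernel_size - 1) * sum(dilations)

print(stacked_rf([1, 2, 4, 8]))        # 31
print(stacked_rf([1, 2, 4, 8, 16]))    # 63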

Applications

WaveNet (audio generation, DeepMind 2016): Generates audio sample by sample. Uses dilated 1D causal convolutions with rates [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] per block, achieving a receptive field of 1023 samples — over 60ms of audio context — with manageable depth.
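The real WaveNet also uses gated activations plus residual and skip connections; the sketch below only illustrates its dilated causal convolution pattern (kernel size 2, doubling dilation rates, left-padding to stay causal). The class name and channel count here are our own choices:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConv1d(nn.Module):
    """One dilated causal conv: the output at time t depends only on inputs <= t."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.dilation = dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):
        # Left-pad by (kernel_size - 1) * dilation so the conv never looks ahead
        # and the output length matches the input length.
        x = F.pad(x, (self.dilation, 0))
        return self.conv(x)

# One WaveNet-style block: dilation doubles each layer, 1 through 512.
block = nn.Sequential(*[
    DilatedCausalConv1d(channels=32, dilation=d)
    for d in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512]
])

x = torch.randn(1, 32, 2000)
print(block(x).shape)   # torch.Size([1, 32, 2000]): sequence length preserved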

DeepLab (semantic segmentation, Google 2015–2018): Semantic segmentation requires per-pixel class labels. DeepLab uses "atrous" (dilated) convolutions to maintain full resolution feature maps while achieving large receptive fields. The final version (DeepLabV3+) uses atrous spatial pyramid pooling (ASPP): multiple dilated convolutions in parallel at rates [6, 12, 18] to capture objects at different scales.
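A stripped-down sketch of the parallel-dilation idea behind ASPP (the real DeepLabV3+ module also includes a 1×1 branch, image-level pooling, and batch normalization; the module name and channel sizes below are our own):

import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Parallel dilated 3x3 convs at several rates, concatenated and projected."""
    def __init__(self, in_channels, out_channels, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        self.project = nn.Conv2d(out_channels * len(rates), out_channels, kernel_size=1)

    def forward(self, x):
        # Each branch sees context at a different scale; concatenation mixes them.
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))

aspp = SimpleASPP(256, 256)
print(aspp(torch.randn(1, 256, 33, 33)).shape)   # torch.Size([1, 256, 33, 33])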

Code: Dilated Convolution in PyTorch

import torch.nn as nn

# Standard 3×3 convolution (dilation=1)
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)

# Dilated 3×3, dilation=2: kernel samples spaced 2 apart
# padding must equal dilation to maintain spatial size
dilated2 = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

# Dilated 3×3, dilation=4
dilated4 = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)

# Stack: receptive field grows from 3 → 7 → 15 → 31
class DilatedStack(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=4, dilation=4),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=8, dilation=8),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)

Key rule: for a stride-1 dilated 3×3 conv with dilation d, set padding=d to maintain the same spatial output size as the input. This ensures all spatial resolution is preserved.
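A quick way to confirm the rule, assuming stride 1:

import torch
import torch.nn as nn

# Check: with stride 1 and padding=d, a dilated 3x3 conv preserves spatial size.
x = torch.randn(1, 64, 56, 56)
for d in [1, 2, 4, 8]:
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=d, dilation=d)
    print(d, conv(x).shape)   # each: torch.Size([1, 64, 56, 56])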

Quiz

1 / 3

A dilated convolution with dilation rate d=3 and kernel size k=3 covers what spatial extent in the input?