The Receptive Field Problem
Every output neuron in a CNN can only "see" a limited portion of the input, called its receptive field. For image classification, a large receptive field is ideal: the model should consider the whole image before deciding "cat." For dense prediction tasks (semantic segmentation, depth estimation), every output pixel likewise needs a large receptive field to incorporate global context.
Dilated convolutions are used in DeepLab for image segmentation, WaveNet for audio generation, and many real-time perception systems. They solve the fundamental tension between large receptive fields and spatial resolution — letting a network see a wide context without losing fine-grained detail or adding more parameters.
With standard 3×3 convolutions, the receptive field grows linearly: each layer adds 1 to each side, i.e. 2 in total. After n layers:

R = 2n + 1

- R: receptive field size
- n: number of 3×3 conv layers

To achieve a 65×65 receptive field: 32 layers of 3×3 convolutions. That depth is fine for classification (pooling can reduce spatial size along the way), but for dense prediction (per-pixel tasks), pooling destroys the fine spatial detail needed to label individual pixels.
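The linear growth is easy to check numerically. A minimal sketch in plain Python (the function name is ours, not from any library) that accumulates the receptive field of a stack of stride-1 conv layers:

```python
def receptive_field(num_layers, kernel_size=3, dilation=1):
    """Receptive field of a stack of stride-1 (dilated) conv layers."""
    rf = 1
    for _ in range(num_layers):
        rf += dilation * (kernel_size - 1)  # each layer adds d*(k-1)
    return rf

print(receptive_field(1))   # 3
print(receptive_field(32))  # 65: 32 standard 3×3 layers for a 65×65 field
```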
Dilated convolutions solve this: large receptive fields without downsampling.
What is a Dilated Convolution?
A dilated convolution inserts gaps of zeros between kernel elements, creating an expanded ("dilated") effective kernel.
Visualizing a 3×3 kernel at different dilation rates:
Dilation d=1 (standard): kernel samples at positions 0, 1, 2 → covers 3×3 area
Dilation d=2: kernel samples at positions 0, 2, 4 → covers 5×5 area with 9 weights
Dilation d=4: kernel samples at positions 0, 4, 8 → covers 9×9 area with 9 weights
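To make those coverage numbers concrete, a tiny sketch (plain Python, helper name is ours) listing the input offsets a dilated kernel of size 3 samples along one axis:

```python
def kernel_positions(kernel_size, dilation):
    """Input offsets sampled by a dilated kernel along one axis."""
    return [i * dilation for i in range(kernel_size)]

for d in (1, 2, 4):
    pos = kernel_positions(3, d)
    span = pos[-1] - pos[0] + 1
    print(f"d={d}: positions {pos}, covers {span}×{span} area")
# d=1: [0, 1, 2] → 3×3; d=2: [0, 2, 4] → 5×5; d=4: [0, 4, 8] → 9×9
```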
The effective receptive field size for a single dilated 3×3 conv with dilation d:

R_eff = (k − 1) × d + 1

- k: kernel size (e.g., 3)
- d: dilation rate
- R_eff: effective receptive field of this layer

For k=3, d=4: R_eff = 2×4 + 1 = 9 (covers a 9×9 area with only 9 parameters, the same as a standard 3×3 conv).
Stacking Dilated Convolutions: Exponential Growth
The power of dilated convolutions emerges when stacked with increasing rates. With rates [1, 2, 4, 8], each using 3×3 kernels:
| Layer | Dilation | Adds to RF | Cumulative RF |
|---|---|---|---|
| 1 | 1 | 2×1 = 2 | 3 |
| 2 | 2 | 2×2 = 4 | 7 |
| 3 | 4 | 2×4 = 8 | 15 |
| 4 | 8 | 2×8 = 16 | 31 |
After 4 layers: a receptive field of 31×31 from just 4×9 = 36 weights per channel pair (vs. only 9×9 for four standard 3×3 layers with the same parameter count).
Extending this to an arbitrary stack, the general formula is:

R = 1 + (k − 1) × (d_1 + d_2 + … + d_n)

- d_i: dilation rate of layer i
- k: kernel size

With rates [1, 2, 4, 8, 16]: the dilations sum to 31, so R = 1 + 2×31 = 63 from only 5 layers, with full spatial resolution maintained throughout.
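The table and the closed-form sum can be cross-checked in a few lines of plain Python (helper name is ours):

```python
def stacked_rf(dilations, kernel_size=3):
    """Receptive field of stacked stride-1 dilated convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

print(stacked_rf([1, 2, 4, 8]))      # 31
print(stacked_rf([1, 2, 4, 8, 16]))  # 63
```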
Applications
WaveNet (audio generation, DeepMind 2016): Generates audio sample by sample. Uses dilated 1D causal convolutions with rates [1, 2, 4, 8, 16, 32, 64, 128, 256, 512] per block, achieving a receptive field of 1023 samples — over 60ms of audio context — with manageable depth.
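A minimal sketch of the causal dilated idea in PyTorch — not the actual WaveNet architecture, which adds gated activations, residual and skip connections; the class name and dilation list here are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedStack(nn.Module):
    """Toy stack of causal dilated 1-D convs with doubling dilation rates."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.dilations = dilations
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
            for d in dilations
        )

    def forward(self, x):
        for conv, d in zip(self.convs, self.dilations):
            # pad left only: each output depends on past samples, never future ones
            x = torch.relu(conv(F.pad(x, (d, 0))))
        return x

x = torch.randn(1, 16, 100)          # (batch, channels, time)
y = CausalDilatedStack(16)(x)
print(y.shape)                        # sequence length preserved: (1, 16, 100)
```

Left-only padding of size d keeps the sequence length unchanged while enforcing causality, which is what lets the model generate audio one sample at a time.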
DeepLab (semantic segmentation, Google 2015–2018): Semantic segmentation requires per-pixel class labels. DeepLab uses "atrous" (dilated) convolutions to maintain full resolution feature maps while achieving large receptive fields. The final version (DeepLabV3+) uses atrous spatial pyramid pooling (ASPP): multiple dilated convolutions in parallel at rates [6, 12, 18] to capture objects at different scales.
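A hedged sketch of the ASPP idea in PyTorch — channel counts and the 1×1 branch here are illustrative, not DeepLabV3+'s exact configuration (which also includes global average pooling and batch norm in each branch):

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel dilated convs at several rates, concatenated and fused."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +  # 1×1 branch for local features
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        # fuse the concatenated branch outputs back to out_ch channels
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

x = torch.randn(1, 256, 33, 33)
y = ASPP(256, 64)(x)
print(y.shape)  # spatial size preserved: (1, 64, 33, 33)
```

Because each 3×3 branch uses padding equal to its dilation rate, all branches produce the same spatial size and can be concatenated directly.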
Code: Dilated Convolution in PyTorch
```python
import torch.nn as nn

# Standard 3×3 convolution (dilation=1)
standard = nn.Conv2d(64, 64, kernel_size=3, padding=1, dilation=1)

# Dilated 3×3, dilation=2: kernel samples spaced 2 apart
# padding must equal dilation to maintain spatial size
dilated2 = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

# Dilated 3×3, dilation=4
dilated4 = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)

# Stack with rates [1, 2, 4, 8]: receptive field grows 3 → 7 → 15 → 31
class DilatedStack(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=4, dilation=4),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=8, dilation=8),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)
```
Key rule: for a stride-1 dilated 3×3 conv with dilation d, set padding=d to keep the output's height and width equal to the input's. This preserves full spatial resolution through the entire stack.
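The rule is easy to verify directly by checking output shapes (a quick self-contained sanity check):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
for d in (1, 2, 4, 8):
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=d, dilation=d)
    assert conv(x).shape == x.shape  # padding=d keeps H and W unchanged
print("spatial size preserved for all dilation rates")
```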