The Cost of a Standard Convolution
Depthwise separable convolutions are the reason mobile AI exists. MobileNet, EfficientNet, and the vision encoders in on-device ML frameworks all rely on this factorization to run on phones and embedded hardware, reducing computation by 8–9× with minimal accuracy loss.
A standard 3×3 convolution simultaneously does two things:
- Spatial filtering: detect local patterns (edges, textures) by combining nearby pixels
- Channel mixing: combine information from all input channels into new representations
This fusion is powerful but expensive. For a feature map of size H × W with C_in input channels and C_out output channels, define:
- H × W: spatial dimensions of the feature map
- k: kernel size (k=3 for 3×3)
- C_in: number of input channels
- C_out: number of output channels (filters)
- P_standard = k² × C_in × C_out: parameter count for the standard convolution
For a practical example, 256→256 channels with 3×3 kernels: P_standard = 9 × 256 × 256 = 589,824 parameters per layer. For a mobile device, this is prohibitive.
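To make the number concrete, here is a quick sanity check in PyTorch (the 256-channel sizes match the example above; bias is omitted so the count is pure weights):

```python
import torch.nn as nn

# Standard 3x3 convolution, 256 -> 256 channels, bias omitted
conv = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
print(conv.weight.numel())  # 3 * 3 * 256 * 256 = 589,824
```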
The key question: do spatial filtering and channel mixing need to happen together?
Depthwise Separable Convolution: Two Steps
The answer is no. Depthwise separable convolution decomposes the operation:
Step 1: Depthwise Convolution
Apply one 3×3 filter per input channel, independently. Each filter processes one channel and produces one output channel. There is no cross-channel interaction.
- C_in: input channels (= output channels for the depthwise step)
- k: kernel size
- P_depthwise = k² × C_in: parameter count for the depthwise step
For C_in=256, k=3: P_depthwise = 9 × 256 = 2,304 parameters. The output has the same number of channels as the input (C_in), each independently filtered.
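The same check for the depthwise step; the 32×32 input size is an arbitrary choice for illustration:

```python
import torch
import torch.nn as nn

# Depthwise convolution: groups=C_in, so each filter sees exactly one channel
dw = nn.Conv2d(256, 256, kernel_size=3, padding=1, groups=256, bias=False)
print(dw.weight.numel())  # 256 * 1 * 3 * 3 = 2,304

x = torch.randn(1, 256, 32, 32)
print(dw(x).shape)        # torch.Size([1, 256, 32, 32]): channels preserved
```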
Step 2: Pointwise Convolution
Now apply a 1×1 convolution to mix channels across the C_in filtered feature maps into C_out output channels.
- P_pointwise = C_in × C_out: parameter count for the pointwise (1×1) step
For 256→256: P_pointwise = 256 × 256 = 65,536 parameters.
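And the corresponding check for the pointwise step; note that a 1×1 convolution only mixes channels and leaves the spatial dimensions untouched:

```python
import torch
import torch.nn as nn

# Pointwise (1x1) convolution: channel mixing only, no spatial filtering
pw = nn.Conv2d(256, 256, kernel_size=1, bias=False)
print(pw.weight.numel())  # 256 * 256 * 1 * 1 = 65,536

x = torch.randn(1, 256, 32, 32)
print(pw(x).shape)        # torch.Size([1, 256, 32, 32]): spatial dims unchanged
```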
The Reduction Factor
Total parameters for the depthwise separable convolution:
- P_dsc = k² × C_in + C_in × C_out: total parameter count for depthwise separable
Reduction compared to standard:
- P_dsc / P_standard = (k² × C_in + C_in × C_out) / (k² × C_in × C_out) = 1/C_out + 1/k²
For C_in=C_out=256, k=3: P_dsc = 2,304 + 65,536 = 67,840 parameters, versus 589,824 for the standard convolution, a ratio of 1/256 + 1/9 ≈ 0.115.
The DSC uses only 11.5% of the parameters, roughly 8.7× fewer.
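The arithmetic is simple enough to script. A small helper (the function name is ours, purely illustrative) reproduces these numbers for any layer size:

```python
def dsc_reduction(C_in, C_out, k=3):
    """Parameter counts: standard conv vs. depthwise separable conv."""
    standard = k * k * C_in * C_out
    dsc = k * k * C_in + C_in * C_out  # depthwise + pointwise
    return standard, dsc, dsc / standard

std, dsc, ratio = dsc_reduction(256, 256)
print(std, dsc, round(ratio, 3))  # 589824 67840 0.115 (about 8.7x fewer)
```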
Worked Numerical Comparison
| Layer | Parameters | Relative cost |
|---|---|---|
| Standard 3×3 (64→64) | 36,864 | 1.0× |
| Depthwise (64 channels) | 576 | 0.016× |
| Pointwise (64→64) | 4,096 | 0.111× |
| DSC total | 4,672 | 0.127× |
Accuracy tradeoff: MobileNetV1 (Howard et al., 2017) replaces all standard 3×3 convolutions with depthwise separable variants. On ImageNet:
- Standard ResNet-50: 76.1% top-1, 25M parameters
- MobileNetV1-1.0: 70.6% top-1, 4.2M parameters
A 5.5% accuracy drop for roughly 6× fewer parameters: an excellent tradeoff for mobile deployment.
MobileNet and Efficiency Architecture
MobileNet's key insight: for edge deployment (phones, embedded devices), a 5–10% accuracy tradeoff for 8× faster inference is the right engineering decision. The entire MobileNet family (V1, V2, V3) builds on depthwise separable convolutions:
- MobileNetV2 adds inverted residuals: expand channels with 1×1, apply a depthwise 3×3, then compress with 1×1, the reverse of a ResNet bottleneck (see the sketch after this list)
- EfficientNet scales width, depth, and resolution uniformly, and uses depthwise separable convolutions as its primary building block
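To make the inverted-residual idea concrete, here is a minimal sketch of a MobileNetV2-style block. It is a simplification, not the paper's exact block: the class name and the default expansion factor of 6 are illustrative, and details such as width multipliers are omitted. Note that the final projection is linear (no activation), as in MobileNetV2:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of a MobileNetV2-style inverted residual:
    expand (1x1) -> depthwise (3x3) -> project (1x1, linear)."""
    def __init__(self, C_in, C_out, stride=1, expand=6):
        super().__init__()
        C_mid = C_in * expand
        self.use_residual = stride == 1 and C_in == C_out
        self.block = nn.Sequential(
            nn.Conv2d(C_in, C_mid, kernel_size=1, bias=False),   # expand
            nn.BatchNorm2d(C_mid),
            nn.ReLU6(inplace=True),
            nn.Conv2d(C_mid, C_mid, kernel_size=3, stride=stride,
                      padding=1, groups=C_mid, bias=False),      # depthwise
            nn.BatchNorm2d(C_mid),
            nn.ReLU6(inplace=True),
            nn.Conv2d(C_mid, C_out, kernel_size=1, bias=False),  # project, no activation
            nn.BatchNorm2d(C_out),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```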
Code: Depthwise Separable Convolution in PyTorch
```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, C_in, C_out, stride=1):
        super().__init__()
        # Step 1: spatial filtering, one 3x3 filter per input channel
        self.depthwise = nn.Conv2d(
            C_in, C_in,
            kernel_size=3, stride=stride, padding=1,
            groups=C_in,  # groups=C_in means one filter per channel
            bias=False
        )
        self.bn1 = nn.BatchNorm2d(C_in)
        # Step 2: channel mixing with a 1x1 convolution
        self.pointwise = nn.Conv2d(C_in, C_out, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(C_out)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x
```
The critical implementation detail: groups=C_in in nn.Conv2d is how PyTorch implements depthwise convolution. It divides the C_in input channels into C_in groups of one channel each and applies one filter per group; when groups = in_channels = out_channels, each channel is filtered independently.
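A quick usage check (the 64-channel sizes match the comparison table above; we count only the conv weights, since the BatchNorm layers add a few extra parameters):

```python
import torch

block = DepthwiseSeparableConv(64, 64)
x = torch.randn(1, 64, 56, 56)
print(block(x).shape)  # torch.Size([1, 64, 56, 56])

# Conv weights only: 576 (depthwise) + 4,096 (pointwise) = 4,672, matching the table
print(block.depthwise.weight.numel() + block.pointwise.weight.numel())
```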