Convolutional Networks
Lesson 7 ⏱ 12 min

Depthwise separable convolutions

Depthwise Separable Convolutions - MobileNet's Core Idea

This lesson covers the cost breakdown of a standard convolution, how it splits into a depthwise step followed by a pointwise step, the exact parameter and FLOP savings, and why MobileNet came within a few points of ResNet accuracy with roughly 8× fewer operations.

⏱ ~7 min

Quick refresher

1×1 convolutions

A 1×1 convolution has kernel size 1 and applies a learned linear transformation over channels at each spatial location independently. It mixes channel information without blending spatial locations. With C_in input channels and K filters: C_in×K parameters.

Example

A 1×1 conv with C_in=256, K=64: 256×64 = 16,384 parameters.

At each position of a 14×14 feature map, it computes 64 outputs, each a weighted sum of all 256 input channels at that single position.
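To make the "same linear map at every position" idea concrete, here is a minimal NumPy sketch. It assumes no bias term and uses random data; the variable names are illustrative and not part of the lesson.

```python
import numpy as np

# Sizes from the example above: 256 input channels, 64 filters, 14×14 feature map.
C_in, K, H, W = 256, 64, 14, 14

rng = np.random.default_rng(0)
x = rng.standard_normal((C_in, H, W))    # input feature map
weight = rng.standard_normal((K, C_in))  # one weight per (filter, input channel) pair

# A 1×1 convolution applies the same K×C_in matrix at every spatial position.
y = np.einsum("kc,chw->khw", weight, x)

print(weight.size)  # 16384 parameters (bias omitted)
print(y.shape)      # (64, 14, 14)

# The output at one position depends only on the input channels at that position.
i, j = 3, 5
assert np.allclose(y[:, i, j], weight @ x[:, i, j])
```

The final assertion is the whole point: slicing out a single pixel and multiplying by the weight matrix gives the same result as the full operation at that pixel.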

The Cost of a Standard Convolution

A standard 3×3 convolution simultaneously does two things:

  1. Spatial filtering: detect local patterns (edges, textures) by combining nearby pixels
  2. Channel mixing: combine information from all input channels into new representations

This fusion is powerful but expensive. For a feature map of size $H \times W$ with $C_{\text{in}}$ input channels and $C_{\text{out}}$ output channels, the parameter count is:

$$\text{Params}_{\text{std}} = k^2 \cdot C_{\text{in}} \cdot C_{\text{out}}$$

  • $H, W$: spatial dimensions of the feature map (these scale the computation, not the parameter count)
  • $k$: kernel size (k=3 for 3×3)
  • $C_{\text{in}}$: number of input channels
  • $C_{\text{out}}$: number of output channels (filters)
  • $\text{Params}_{\text{std}}$: parameter count for the standard convolution

For a practical example: 256→256 channels with 3×3 kernels gives 9 × 256 × 256 = 589,824 parameters in a single layer. Repeated across dozens of layers, this becomes prohibitive on a mobile device.

Depthwise separable convolutions are a major reason on-device vision models are practical. MobileNet, EfficientNet, and the vision encoders in on-device ML frameworks all rely on this factorization to run on phones and embedded hardware, reducing computation by 8–9× with only a modest accuracy loss.
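The standard-convolution cost can be checked with a few lines of plain Python. This is a sketch under stated assumptions: bias terms are ignored, and the multiply-accumulate (MAC) count assumes stride 1 with "same" padding so the output is also H×W. The function names and the 14×14 feature-map size are illustrative choices, not taken from the lesson.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k×k standard convolution (bias terms ignored)."""
    return k * k * c_in * c_out


def standard_conv_macs(h: int, w: int, k: int, c_in: int, c_out: int) -> int:
    """Multiply-accumulates for one forward pass.

    Assumes stride 1 and 'same' padding, so there are h*w output positions
    and each one uses every weight exactly once.
    """
    return h * w * standard_conv_params(k, c_in, c_out)


print(standard_conv_params(3, 256, 256))        # 589824
print(standard_conv_macs(14, 14, 3, 256, 256))  # 115605504
```

Note that the spatial size only enters the MAC count; the parameter count is the same for any H and W.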

The key question: do spatial filtering and channel mixing need to happen together?

Depthwise Separable Convolution: Two Steps

The answer is no. Depthwise separable convolution decomposes the operation:

Step 1: Depthwise Convolution

Apply one 3×3 filter per input channel, independently. Each filter processes one channel and produces one output channel. There is no cross-channel interaction.

$$\text{Params}_{\text{dw}} = k^2 \cdot C_{\text{in}}$$

  • $C_{\text{in}}$: input channels (= output channels for the depthwise step)
  • $k$: kernel size
  • $\text{Params}_{\text{dw}}$: parameter count for the depthwise step

For C_in=256, k=3: 9 × 256 = 2,304 parameters. The output has the same number of channels as the input (C_in), each independently filtered.
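The "no cross-channel interaction" property can be seen directly in a naive NumPy implementation. The sketch below assumes stride 1, zero padding, and an odd kernel size, and uses deliberately slow loops for readability; `depthwise_conv` is a hypothetical helper, not a library function.

```python
import numpy as np


def depthwise_conv(x: np.ndarray, filters: np.ndarray) -> np.ndarray:
    """Naive depthwise convolution (stride 1, zero padding, odd k).

    x:       (C, H, W) input feature map
    filters: (C, k, k) one k×k filter per channel
    """
    C, H, W = x.shape
    k = filters.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for c in range(C):  # channel c only ever sees filter c
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + k, j:j + k] * filters[c])
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 6))
f = rng.standard_normal((4, 3, 3))
y = depthwise_conv(x, f)

# Perturb channel 2 only and see which output channels change.
x2 = x.copy()
x2[2] += 1.0
y2 = depthwise_conv(x2, f)
changed = [c for c in range(4) if not np.allclose(y[c], y2[c])]

print(f.size, y.shape, changed)  # 36 (4, 6, 6) [2]
```

Only output channel 2 changes, which is exactly why a second step is needed to mix information across channels.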

Step 2: Pointwise Convolution

Now apply a 1×1 convolution to mix channels across the C_in filtered feature maps into C_out output channels.

$$\text{Params}_{\text{pw}} = C_{\text{in}} \cdot C_{\text{out}}$$

  • $\text{Params}_{\text{pw}}$: parameter count for the pointwise (1×1) step

For 256→256: 256 × 256 = 65,536 parameters.

The Reduction Factor

Total parameters for the depthwise separable convolution:

$$\text{Params}_{\text{DSC}} = k^2 C_{\text{in}} + C_{\text{in}} C_{\text{out}} = C_{\text{in}} (k^2 + C_{\text{out}})$$

  • $\text{Params}_{\text{DSC}}$: total parameter count for the depthwise separable convolution

Reduction compared to standard:

$$\frac{\text{Params}_{\text{DSC}}}{\text{Params}_{\text{std}}} = \frac{k^2 C_{\text{in}} + C_{\text{in}} C_{\text{out}}}{k^2 C_{\text{in}} C_{\text{out}}} = \frac{1}{C_{\text{out}}} + \frac{1}{k^2}$$

For C_in=C_out=256, k=3:

$$\frac{1}{256} + \frac{1}{9} \approx 0.004 + 0.111 = 0.115$$

The DSC uses only 11.5% of the parameters — roughly 8.7× fewer.
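The same ratio carries over from parameters to computation. As a worked sketch, assume stride 1 and "same" padding so that every layer produces $H \times W$ output positions, and count multiply-accumulates (the symbol $\text{MACs}$ is introduced here for illustration and does not appear elsewhere in the lesson):

```latex
\begin{aligned}
\text{MACs}_{\text{std}} &= H W \cdot k^2 C_{\text{in}} C_{\text{out}} \\
\text{MACs}_{\text{DSC}} &= \underbrace{H W \cdot k^2 C_{\text{in}}}_{\text{depthwise}}
                          + \underbrace{H W \cdot C_{\text{in}} C_{\text{out}}}_{\text{pointwise}} \\
\frac{\text{MACs}_{\text{DSC}}}{\text{MACs}_{\text{std}}}
  &= \frac{H W \, C_{\text{in}} (k^2 + C_{\text{out}})}{H W \, k^2 C_{\text{in}} C_{\text{out}}}
   = \frac{1}{C_{\text{out}}} + \frac{1}{k^2}
\end{aligned}
```

The factor $H W$ cancels, so under these assumptions the compute saving matches the parameter saving: about 8.7× for C_in=C_out=256, k=3.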

Worked Numerical Comparison

| Layer type | Parameters per layer | Relative cost |
|---|---|---|
| Standard 3×3 (64→64) | 36,864 | 1.0× |
| Depthwise (64 channels) | 576 | 0.016× |
| Pointwise (64→64) | 4,096 | 0.111× |
| DSC total | 4,672 | 0.127× |
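The figures in this comparison can be reproduced with a short script. It is a minimal sketch that ignores bias and batch-norm parameters; `dsc_breakdown` is a hypothetical helper name.

```python
def dsc_breakdown(k: int, c_in: int, c_out: int) -> dict:
    """Parameter counts for a standard conv and its depthwise separable split."""
    dw = k * k * c_in   # one k×k filter per input channel
    pw = c_in * c_out   # 1×1 conv that mixes channels
    return {
        "standard": k * k * c_in * c_out,
        "depthwise": dw,
        "pointwise": pw,
        "dsc_total": dw + pw,
    }


rows = dsc_breakdown(3, 64, 64)
for name, params in rows.items():
    print(f"{name:<10} {params:>7,}  {params / rows['standard']:.3f}x")

# The measured ratio agrees with the closed form 1/C_out + 1/k^2.
ratio = rows["dsc_total"] / rows["standard"]
assert abs(ratio - (1 / 64 + 1 / 9)) < 1e-12
```

With only 64 output channels the 1/C_out term is no longer negligible, which is why the relative cost here (0.127) is slightly higher than in the 256-channel case (0.115).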

Accuracy tradeoff: MobileNetV1 (Howard et al., 2017) replaces every standard 3×3 convolution except the first layer with a depthwise separable variant. On ImageNet:

  • Standard ResNet-50: 76.1% top-1, 25M parameters
  • MobileNetV1-1.0: 70.6% top-1, 4.2M parameters

That is a drop of 5.5 percentage points in top-1 accuracy for about 6× fewer parameters, an excellent tradeoff for mobile deployment.

MobileNet and Efficient Architectures

MobileNet's key insight: for edge deployment (phones, embedded devices), trading 5–10 points of accuracy for roughly 8× less computation is the right engineering decision. The entire MobileNet family (V1, V2, V3) builds on depthwise separable convolutions, and later efficient architectures follow the same pattern:

  • MobileNetV2 adds inverted residuals: expand channels with 1×1, do depthwise 3×3, compress with 1×1 (the opposite of a ResNet bottleneck, which compresses first and expands last)
  • EfficientNet scales width, depth, and resolution together with a single compound coefficient, and uses depthwise separable convolutions (inside MBConv blocks) as its primary building block
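The expand, depthwise, compress pattern described for MobileNetV2 can be sketched in PyTorch as follows. This is an illustrative block, not the reference implementation: the class name, the expansion ratio of 6, the use of ReLU6, and the rule for when to add the skip connection are assumptions drawn from the MobileNetV2 paper's description rather than from this lesson.

```python
import torch
import torch.nn as nn


class InvertedResidual(nn.Module):
    """Sketch of a MobileNetV2-style block: 1×1 expand -> 3×3 depthwise -> 1×1 project."""

    def __init__(self, c_in: int, c_out: int, stride: int = 1, expand_ratio: int = 6):
        super().__init__()
        hidden = c_in * expand_ratio  # assumed expansion factor
        # The skip connection only makes sense when input and output shapes match.
        self.use_residual = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, kernel_size=1, bias=False),  # expand channels
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=stride, padding=1,
                      groups=hidden, bias=False),                # depthwise 3×3
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, kernel_size=1, bias=False), # compress, no activation
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        return x + out if self.use_residual else out


block = InvertedResidual(32, 32)
y = block(torch.randn(1, 32, 56, 56))
print(y.shape)  # torch.Size([1, 32, 56, 56])
```

The expensive 3×3 spatial filtering happens in the wide hidden space, but because it is depthwise its cost grows only linearly with the number of hidden channels.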

Code: Depthwise Separable Convolution in PyTorch

import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, C_in, C_out, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(
            C_in, C_in,
            kernel_size=3, stride=stride, padding=1,
            groups=C_in,   # groups=C_in means one filter per channel
            bias=False
        )
        self.bn1 = nn.BatchNorm2d(C_in)
        self.pointwise = nn.Conv2d(C_in, C_out, kernel_size=1, bias=False)
        self.bn2 = nn.BatchNorm2d(C_out)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        x = self.relu(self.bn2(self.pointwise(x)))
        return x

The critical implementation detail: groups=C_in in nn.Conv2d is how PyTorch implements depthwise convolution. Setting groups=C_in divides the C_in input channels into C_in groups of 1, applying one filter per group. When groups = in_channels = out_channels, it's a depthwise convolution.
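A quick way to confirm what `groups=C_in` does is to inspect the weight shapes and parameter counts directly. The sketch below builds the three layers for the 256→256 example; the helper `count` and the 14×14 input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

C_in, C_out = 256, 256

standard = nn.Conv2d(C_in, C_out, kernel_size=3, padding=1, bias=False)
depthwise = nn.Conv2d(C_in, C_in, kernel_size=3, padding=1, groups=C_in, bias=False)
pointwise = nn.Conv2d(C_in, C_out, kernel_size=1, bias=False)


def count(module: nn.Module) -> int:
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())


# With groups=C_in, each of the 256 filters sees exactly 1 input channel.
print(depthwise.weight.shape)                # torch.Size([256, 1, 3, 3])
print(count(standard))                       # 589824
print(count(depthwise) + count(pointwise))   # 67840

# Both paths map the same input to the same output shape.
x = torch.randn(1, C_in, 14, 14)
assert pointwise(depthwise(x)).shape == standard(x).shape
```

The second dimension of the depthwise weight tensor is 1 rather than 256, which is where the 2,304 versus 589,824 difference comes from.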

Quiz

A depthwise convolution with C_in=32 input channels applies how many individual 3×3 filters?