Convolutional Networks
Lesson 5 ⏱ 10 min

1×1 convolutions and channel mixing


1×1 Convolutions - The Pointwise Fully-Connected Layer

What happens at each spatial location when the kernel is 1×1, why this is equivalent to a fully-connected layer over channels, the parameter count comparison with 3×3 convolutions, and the role 1×1 convs play in bottleneck blocks and dimensionality reduction.

⏱ ~6 min

🧮

Quick refresher

Convolutional layer output

A 2D convolution with K filters of size H×W×C_in produces an output of shape H_out × W_out × K. Each filter slides over the spatial dimensions, computing a dot product with the H×W×C_in input patch at each location. The C_in dimension is always fully connected within the filter.

Example

A single 3×3×64 filter applied to a 28×28×64 feature map produces one 26×26 output channel.

Each output value is a dot product of the filter (3×3×64 = 576 numbers) with the corresponding 3×3×64 input patch.
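A quick PyTorch check of this arithmetic (a minimal sketch, not part of the lesson code):

import torch
import torch.nn as nn

# One 3×3 filter over 64 input channels, no padding: 28×28 input → 26×26 output
conv = nn.Conv2d(in_channels=64, out_channels=1, kernel_size=3, bias=False)

x = torch.randn(1, 64, 28, 28)    # a single 28×28×64 feature map in NCHW layout
print(conv(x).shape)              # torch.Size([1, 1, 26, 26])
print(conv.weight.numel())        # 576  (= 3 × 3 × 64 weights in the filter)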

When a Spatial Window of 1 Makes Sense

Standard convolutions use 3×3 or 5×5 kernels because we want to capture spatial relationships — edges, textures, shapes that span multiple pixels. But there's a situation where spatial context is irrelevant: mixing information across channels at a single location.

Think of a feature map as a grid of H × W locations, each with a C-dimensional descriptor vector. At each location, those C values encode different learned features: edges in various orientations, textures, local patterns. A 1×1 convolution asks: "how should these C features be recombined at this location?"

Despite sounding trivial, 1×1 convolutions appear in nearly every modern CNN — from GoogLeNet's inception modules to ResNet's bottleneck blocks. They are the standard tool for changing channel depth cheaply, and understanding them is essential for reading any recent architecture paper.

A 1×1 convolution with C_in input channels and K output filters computes, at each spatial position (h, w):

\text{out}[h, w, k] = \sum_{c=1}^{C_{\text{in}}} W[k, c] \cdot \text{in}[h, w, c]

where

  • out[h, w, k]: the output at position (h, w) for filter k
  • W[k, c]: the weight connecting input channel c to output channel k
  • in[h, w, c]: the input value at position (h, w), channel c

This is exactly a fully-connected layer over the channel dimension, applied independently and identically at every spatial position. No spatial blending; just channel mixing.
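To make the equivalence concrete, here is a small sketch (sizes chosen arbitrarily) showing that a 1×1 convolution and a linear layer sharing the same weights produce identical outputs:

import torch
import torch.nn as nn

C_in, K = 256, 64
x = torch.randn(1, C_in, 28, 28)                         # NCHW feature map

conv1x1 = nn.Conv2d(C_in, K, kernel_size=1, bias=False)

# The same weights viewed as a fully-connected layer over the channel dimension
fc = nn.Linear(C_in, K, bias=False)
with torch.no_grad():
    fc.weight.copy_(conv1x1.weight.view(K, C_in))

out_conv = conv1x1(x)                                     # (1, K, 28, 28)
out_fc = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)    # apply the FC layer at every (h, w)

print(torch.allclose(out_conv, out_fc, atol=1e-6))        # True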

Parameter Count: The Efficiency Argument

For a feature map with C_in=256 input channels and K=256 output channels:

Operation          Parameters
3×3 convolution    3 × 3 × 256 × 256 = 589,824
1×1 convolution    1 × 1 × 256 × 256 = 65,536

The 1×1 convolution has exactly 1/9 the parameters. Used strategically, these savings are enormous.
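The numbers in the table are easy to verify by counting weights directly (a quick check, not lesson code):

import torch.nn as nn

conv3x3 = nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False)
conv1x1 = nn.Conv2d(256, 256, kernel_size=1, bias=False)

print(conv3x3.weight.numel())   # 589824
print(conv1x1.weight.numel())   # 65536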

Use Case 1: Dimensionality Reduction

Before an expensive 3×3 convolution on many channels, use a 1×1 convolution to reduce the channel count:

Input: 28×28×256
→ 1×1 conv, K=64:  28×28×64    (reduce channels by 4×)
→ 3×3 conv, K=64:  28×28×64    (3×3 on 64 channels, not 256)
→ 1×1 conv, K=256: 28×28×256   (restore channel count)

The 3×3 conv now operates on 64 channels instead of 256: with both its input and output widths reduced by 4×, that layer has 16× fewer parameters. This is the bottleneck block pattern used in ResNet-50, ResNet-101, and virtually all large CNNs.

Parameter comparison:

  • One 3×3 conv: 256→256 channels: 3×3×256×256 = 589,824 params
  • Bottleneck (1×1→3×3→1×1): 256→64→64→256: 256×64 + 3×3×64×64 + 64×256 = 16,384 + 36,864 + 16,384 = 69,632 params — 8.5× fewer

Use Case 2: Channel Count Adjustment

1×1 convolutions can increase or decrease channel count freely, at negligible spatial cost:

  • 512 channels → 128 channels: reduce for efficiency before spatial operations
  • 128 channels → 512 channels: expand for representation capacity
  • This is also how residual skip connections are projected to matching dimensions when input and output channels differ (see the sketch below)
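The last bullet refers to the projection shortcut used in ResNet-style blocks. A minimal sketch of the idea, with illustrative channel counts (not taken from any particular network):

import torch
import torch.nn as nn

class ProjectedResidual(nn.Module):
    # Residual block whose skip path uses a 1×1 conv to match the new channel depth
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        self.project = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        # Project the input so it can be added to the block output
        return nn.functional.relu(self.body(x) + self.project(x))

block = ProjectedResidual(128, 512)
print(block(torch.randn(1, 128, 56, 56)).shape)   # torch.Size([1, 512, 56, 56])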

Use Case 3: Non-Linear Channel Mixing

With a non-linearity (e.g., ReLU) after the 1×1 conv, the channel transformation becomes non-linear. This increases expressive power without any spatial overhead: in effect, a small fully-connected network applied to each location's channel vector.
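As a minimal illustration (channel widths chosen arbitrarily), two stacked 1×1 convolutions with a ReLU in between act as a small per-pixel MLP over the channel vector:

import torch
import torch.nn as nn

# Non-linear channel mixing: a tiny MLP applied independently at every spatial location
channel_mlp = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 256, kernel_size=1),
)

x = torch.randn(1, 256, 28, 28)
print(channel_mlp(x).shape)   # torch.Size([1, 256, 28, 28]); spatial size is untouched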

The Inception Connection

The Inception module (used in GoogLeNet and Inception-v3) uses 1×1 convolutions extensively:

Input (256ch)
├── 1×1 conv (64ch output)           # direct channel reduction
├── 1×1 conv → 3×3 conv             # reduce, then spatial
├── 1×1 conv → 5×5 conv             # reduce, then spatial  
└── 3×3 max-pool → 1×1 conv         # pool, then reduce
→ Concatenate along channel dim

Every branch starts with a 1×1 conv to reduce channel depth before (or, in the pooling branch, after) the expensive operation. A 3×3 or 5×5 convolution costs 9× or 25× as much per input-output channel pair as a 1×1, so without the reduction those branches would run on the full 256 input channels and dominate the module's parameter and compute budget.
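A simplified sketch of such a module (branch widths chosen for illustration, not GoogLeNet's actual configuration):

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    # Every expensive branch is preceded (or, for pooling, followed) by a 1×1 reduction
    def __init__(self, in_ch=256):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=5, padding=2),
        )
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),
        )

    def forward(self, x):
        # Concatenate the four branch outputs along the channel dimension
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

block = InceptionBlock()
print(block(torch.randn(1, 256, 28, 28)).shape)   # torch.Size([1, 256, 28, 28])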

Code: 1×1 Convolution in PyTorch

import torch.nn as nn

# 1×1 convolution: kernel_size=1
pointwise = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1, bias=False)
# Parameters: 256 × 64 = 16,384

# Standard 3×3 for comparison
spatial = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=3, padding=1, bias=False)
# Parameters: 9 × 256 × 64 = 147,456

# Bottleneck block
class Bottleneck(nn.Module):
    def __init__(self, channels, bottleneck_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, bottleneck_channels, 1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, bottleneck_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return nn.functional.relu(x + self.block(x))

Any nn.Conv2d with kernel_size=1 is a 1×1 convolution. Setting bias=False is standard when BatchNorm follows, since BatchNorm's β parameter absorbs the bias.
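A quick usage check of the Bottleneck block above, assuming the definitions in the previous snippet have been run. The convolutional weight count matches the 69,632 figure from the bottleneck arithmetic (the BatchNorm layers add a further 768 trainable parameters, which that arithmetic ignores):

import torch

block = Bottleneck(channels=256, bottleneck_channels=64)

x = torch.randn(1, 256, 28, 28)
print(block(x).shape)            # torch.Size([1, 256, 28, 28]); channel count is restored

conv_weights = sum(m.weight.numel() for m in block.modules() if isinstance(m, nn.Conv2d))
print(conv_weights)              # 69632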

Quiz

1 / 3

A 1×1 convolution with C_in=256 input channels and K=64 output filters has how many weight parameters?