Convolution layers detect features. Pooling layers compress and summarize them. Together they form the core of a CNN's feature extraction pipeline. Pooling carries significant responsibilities: reducing computation, building translation invariance, and forcing the network toward higher-level representations.
Pooling makes CNNs robust to small shifts in an image — without it, moving a cat photo a few pixels to the right would produce a completely different feature map, making recognition fragile. It is also what lets CNNs process images of varying sizes and scale to real hardware without memory overflow.
What Pooling Does
A pooling layer takes a feature map and reduces its spatial dimensions by summarizing small regions. No learning happens - it is a fixed operation with zero parameters.
The most common variant is max pooling. Here is how 2x2 max pooling with stride 2 works:
- Divide the feature map into non-overlapping 2x2 patches.
- For each patch, take the maximum value.
- The result is a feature map half the size in each dimension.
For a 26x26 feature map: 2x2 max pooling with stride 2 produces a 13x13 feature map. From 676 values to 169 - a 4x compression.
The general rule:

H_out = floor((H_in - P) / S) + 1

- H_out: output height after pooling
- H_in: input height before pooling
- P: pool size (2 for 2x2 pooling)
- S: pool stride
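The steps above can be sketched in a few lines of NumPy (a minimal sketch; assumes even spatial dimensions, as in the 26x26 example):

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 via a reshape trick.

    Groups pixels into non-overlapping 2x2 patches, then takes
    each patch's maximum. Assumes height and width are even."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.random.rand(26, 26)
pooled = max_pool_2x2(fmap)
print(pooled.shape)  # (13, 13): 676 values compressed to 169
```

Note there are no weights anywhere in this function - pooling is a fixed operation, exactly as described above.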
Why Maximum (Not Average)?
Max pooling makes a specific choice: report whether a feature appeared anywhere in a region, not how strongly on average.
Imagine a filter that detects a vertical edge, outputting 8.5 where it found an edge and near-zero everywhere else. In a 2x2 patch:
0.1  8.5
0.2  0.1
- Max pooling: 8.5 - "yes, there was a vertical edge somewhere in this patch"
- Average pooling: 2.225 - "there was some edge-ness on average"
Max pooling preserves the existence of the feature. Average pooling dilutes it. For detecting whether specific patterns are present (the goal in classification), max pooling is usually more informative.
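The dilution is easy to verify on the patch from the edge-detector example:

```python
import numpy as np

# The 2x2 patch where the vertical-edge filter fired strongly at one position.
patch = np.array([[0.1, 8.5],
                  [0.2, 0.1]])

print(patch.max())   # 8.5 - the feature's presence survives pooling
print(patch.mean())  # ~2.225 - the strong response is diluted by the near-zeros
```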
Translation Invariance
Consider a feature detector that fires at pixel position (3, 4) in an image, producing a value of 9.0. Now shift the image 1 pixel to the right - the same feature fires at position (3, 5).
Without pooling, these two cases produce activations at different spatial positions - they look completely different to subsequent layers.
With 2x2 max pooling, positions (3, 4) and (3, 5) may fall within the same 2x2 pool region. Both produce 9.0 as the max output from that region. The pooling layer sees the same thing in both cases.
Max pooling asks: "did this feature appear anywhere in this 2x2 neighborhood?" A small shift in the feature's position does not change the answer. This is translation invariance.
Stacking multiple conv+pool layers builds up larger invariance. By the final pooling layers, you have invariance over regions many pixels wide. A face detector does not care if the face moved 10 pixels to the left.
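The (3, 4) vs. (3, 5) scenario can be checked directly (a minimal sketch using the same 2x2 pooling as above):

```python
import numpy as np

def max_pool_2x2(fmap):
    # 2x2 max pooling with stride 2 (even dimensions assumed).
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# The feature fires at (3, 4); after a 1-pixel shift it fires at (3, 5).
a = np.zeros((8, 8)); a[3, 4] = 9.0
b = np.zeros((8, 8)); b[3, 5] = 9.0

# Columns 4 and 5 fall in the same 2x2 pool window, so both maps
# pool to an identical result - the shift is invisible downstream.
print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))  # True
```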
Benefits of Pooling
Computational efficiency: halving spatial dimensions reduces the number of activations by 4x. Every subsequent layer processes 4x fewer values. For deep networks, this adds up dramatically.
Increasing receptive field: after pooling, each activation in the next conv layer effectively "sees" a larger region of the original image. Two pooling layers mean each neuron effectively sees a region 4x wider in each dimension. This is how deep CNNs build global understanding from local operations.
Regularization: by discarding precise spatial information, pooling forces the network to represent features abstractly rather than position-specifically. "There was an eye somewhere in the upper-left region" generalizes better than "there was an eye at exactly pixel (47, 23)."
Global Average Pooling
Modern architectures (ResNet, EfficientNet, MobileNet) end with global average pooling (GAP) instead of fully connected layers.
GAP takes an entire feature map - say 7x7 - and reduces it to a single number: the average of all 49 values. With 512 feature maps, GAP produces a 512-dimensional vector.
- Handles any input size: GAP always produces a fixed-size output regardless of input image dimensions. A 224x224 and a 256x256 image both produce a 512-dim vector.
- Massive parameter reduction: a fully connected layer from a 7x7x512 feature map (25,088 values) to 1,000 classes needs about 25 million parameters. GAP reduces this to 512,000 parameters (the 512-dim vector into 1,000 classes) - roughly 50x fewer.
- Regularization: averaging across the entire feature map is a strong anti-overfitting mechanism.
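GAP is a one-line operation. A sketch with the 7x7, 512-map example (channels-first layout assumed here):

```python
import numpy as np

# Final conv output: 512 feature maps, each 7x7.
features = np.random.rand(512, 7, 7)

# Global average pooling: average each entire map down to one number.
gap = features.mean(axis=(1, 2))
print(gap.shape)  # (512,)

# A larger input produces larger feature maps, but GAP still yields
# the same fixed-size 512-dim vector.
gap_larger = np.random.rand(512, 8, 8).mean(axis=(1, 2))
print(gap_larger.shape)  # (512,)
```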
Strided Convolutions vs. Pooling
Modern architectures sometimes skip max pooling and use strided convolutions (stride 2) for downsampling instead. A strided convolution halves spatial dimensions like max pooling, but uses learnable weights to decide how to downsample.
You will see both in practice. Understanding pooling conceptually prepares you to read any CNN architecture.