We have seen what convolutions do mechanically. Now let us go deeper: why do they work so well for images? The answer involves three interlocking ideas - parameter sharing, local connectivity, and hierarchical feature learning - that together explain why CNNs are so dramatically better than fully connected networks for visual tasks.
A fully connected network treating each pixel as an independent input needs tens of millions of parameters in its very first layer on a standard photo - and would still fail to generalize to new images. Convolutions are the reason visual AI is tractable at all: the same edge detector works anywhere in the image, so you need far fewer parameters, and the model automatically learns to combine low-level features into high-level ones.
Parameter Sharing: The Core Efficiency
In a fully connected network, every input pixel has a unique weight connecting it to every output neuron. For a 224x224 RGB image feeding into a hidden layer of 512 neurons:
$224 \times 224 \times 3 \times 512 = 77{,}070{,}336$ weights for that one layer alone.
In a CNN, a single 3x3 filter has 9 parameters. Those same 9 parameters are applied at every position in the image. For 64 filters:
- $N = 64$ - number of filters
- $k \times k = 3 \times 3$ - filter size
- $N_{\text{params}} = N \cdot k^2 = 64 \cdot 9 = 576$ - total parameters in one conv layer
That is a parameter reduction of over 100,000x compared to the fully connected approach ($77{,}070{,}336 / 576 \approx 133{,}800$).
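To double-check the arithmetic, here is a minimal Python sketch (my own, ignoring bias terms just as the figures above do):

```python
# Fully connected: every pixel gets its own weight per output neuron.
fc_params = 224 * 224 * 3 * 512   # 77,070,336

# Convolutional: one 3x3 weight set shared across every position,
# repeated for 64 filters (input channels ignored, matching the text).
conv_params = 64 * 3 * 3          # 576

print(f"FC layer:   {fc_params:,} parameters")
print(f"Conv layer: {conv_params:,} parameters")
print(f"Reduction:  {fc_params / conv_params:,.0f}x")
```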
Why is this valid? It rests on a fundamental assumption about images: their statistics are translation invariant - the same local patterns can appear anywhere in the frame. The same filter that detects a vertical edge in the top-left should detect one in the bottom-right. By sharing one set of weights across all positions, the network explicitly encodes this prior knowledge.
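To make the sharing concrete, here is a minimal NumPy sketch of a single-channel convolution (strictly, cross-correlation, which is what deep learning libraries compute). The point to notice is that the same nine weights in `kernel` are reused at every output position:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation: slide one shared kernel over the image."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The same 9 weights are applied to every 3x3 patch.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

image = np.random.rand(8, 8)
vertical_edge = np.array([[-1., 0., 1.],
                          [-1., 0., 1.],
                          [-1., 0., 1.]])  # one filter: 9 parameters total
print(conv2d(image, vertical_edge).shape)  # (6, 6)
```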
Interactive example
Parameter count comparison - FC vs. CNN for same image size
Coming soon
Local Connectivity: Respecting Spatial Structure
In a fully connected network, each output neuron is connected to every single input pixel. Pixel (0,0) is connected to the same neuron as pixel (223,223), even though they are in opposite corners and probably have nothing to do with each other.
In a CNN, each output position (i,j) in a feature map only connects to a small local region of the input - say the 3x3 patch centered at (i,j). The neuron literally cannot see anything outside that region.
Local connectivity encodes another prior: nearby pixels are more related than distant pixels. Pixels belonging to the same edge, texture, or object tend to be spatially adjacent. A 3x3 neighborhood is enough to detect basic features. For larger patterns, deeper layers stack these small detectors to build up larger effective receptive fields.
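The growth is easy to compute with the standard receptive-field recurrence; a minimal sketch (my own illustration):

```python
def receptive_field(num_layers, kernel=3, stride=1):
    """Effective receptive field of a stack of identical conv layers."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel - 1) * jump  # each layer widens the field
        jump *= stride             # stride > 1 compounds the widening
    return rf

for n in range(1, 5):
    print(f"{n} stacked 3x3 conv layers -> {receptive_field(n)}x{receptive_field(n)} pixels")
```

This prints 3, 5, 7, 9 - matching the per-layer estimates below. Interleaved pooling multiplies the jump term, so real networks widen their receptive fields even faster.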
Hierarchical Feature Learning: The Deep Magic
When you stack multiple conv+pool layers, each layer learns features composed from the layer below. This is where CNNs become genuinely powerful.
Layer 1 - 3x3 receptive fields. After training, layer 1 filters learn:
- Horizontal edges (dark-to-bright transitions downward)
- Vertical edges (dark-to-bright transitions rightward)
- Diagonal edges
- Color blobs
Layer 2 - ~5x5 effective receptive field. Combines layer 1 features:
- Corners (horizontal edge meeting vertical edge)
- L-shapes, T-shapes
- Simple textures made of edge combinations
Layer 3 - ~7x7 to 9x9 effective receptive field:
- Circles, arcs, ovals
- Object parts: eyes, wheels, leaves
- Complex textures: fur, fabric, wood grain
Layer 4 and beyond - increasingly abstract:
- Animal faces
- Text characters
- Vehicle bodies
This is how CNNs build from raw pixels to semantic understanding. The network discovers the hierarchy from data via gradient descent - you do not program any of these stages manually.
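As a sketch of what such a stack looks like in code, here is a minimal PyTorch feature extractor (a hypothetical skeleton for illustration; real architectures differ in depth and widths):

```python
import torch
import torch.nn as nn

# Each block: detect features at the current scale, then pool so the
# next block sees a wider effective receptive field.
def block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),
    )

model = nn.Sequential(
    block(3, 64),     # layer 1: edges, color blobs
    block(64, 128),   # layer 2: corners, simple textures
    block(128, 256),  # layer 3: shapes, object parts
    block(256, 512),  # layer 4: abstract, semantic features
)

x = torch.randn(1, 3, 224, 224)
print(model(x).shape)  # torch.Size([1, 512, 14, 14])
```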
Interactive example
Feature hierarchy - click a layer in the network to see what features it responds to
Coming soon
Researchers have verified this empirically by visualizing what maximally activates each filter. Early layers really do respond to edges and colors; middle layers to textures and shapes; late layers to object parts and whole objects. The hierarchy is real.
Translation Equivariance vs. Invariance
There is an important distinction worth making precisely:
Equivariance: convolution layers are equivariant. If a feature shifts by $\delta$ pixels in the input, its response in the feature map shifts by $\delta$ pixels too.
Invariance: you get this from pooling, not from convolution. Pooling discards precise position within a pool region.
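A quick PyTorch illustration of that discarding (my own toy example): a bright pixel at any of the four positions inside one 2x2 pool window yields the identical pooled output.

```python
import torch
import torch.nn.functional as F

# Move a single bright pixel around within the top-left 2x2 pool window:
# the pooled output never changes, so the exact position is discarded.
for r, c in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = torch.zeros(1, 1, 4, 4)
    x[0, 0, r, c] = 1.0
    print((r, c), F.max_pool2d(x, kernel_size=2).flatten().tolist())
# Every line prints [1.0, 0.0, 0.0, 0.0]
```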
The combination gives you what you want:
- Convolution: wherever a feature is, it gets detected (equivariance).
- Pooling: small shifts within a region do not matter (local invariance).
- Deep stacking: local invariances accumulate into broad invariance across the image.
In symbols, equivariance says $(T_\delta\, x) * w = T_\delta\,(x * w)$, where:
- $w$ - a convolutional filter
- $T_\delta$ - translation operator that shifts the input by $\delta$ pixels
- $*$ - the convolution operation
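The identity can be checked numerically. A minimal PyTorch sketch (my own; circular padding is assumed so that the identity holds exactly at the borders under wrap-around shifts):

```python
import torch
import torch.nn as nn

# A single random 3x3 filter; circular padding makes the convolution
# exactly equivariant to circular (wrap-around) translations.
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1,
                 padding_mode='circular', bias=False)

x = torch.randn(1, 1, 8, 8)  # a toy 8x8 "image"
delta = 3                    # shift amount in pixels

shift_then_conv = conv(torch.roll(x, shifts=delta, dims=-1))
conv_then_shift = torch.roll(conv(x), shifts=delta, dims=-1)

# Equivariance: both orders of operation agree (up to float error).
print(torch.allclose(shift_then_conv, conv_then_shift, atol=1e-6))  # True
```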
A cat on the left is the same cat on the right, and the stacked conv+pool architecture handles this naturally.
Why This Matters Beyond Images
The same principles - parameter sharing, local connectivity, hierarchical features - apply wherever data has spatial or sequential structure: 1D convolutions for audio and text, 2D for images, 3D for video and volumetric scans.
The inductive biases of CNNs are correct for images. When your inductive biases match the true structure of the problem, you learn faster, need less data, and generalize better. That is the entire story.