Images have spatial structure: nearby pixels are related, and the same kind of pattern can appear in many different places. A regular fully connected layer ignores that structure. Convolution exploits it directly.
The convolution operation is the core of everything in this unit. Once you understand it intuitively - a small learnable detector sliding across an image, asking "is my pattern here?" at every position - the entire CNN architecture falls into place.
The convolution operation is how a CNN detects edges, textures, and shapes — and eventually recognizes objects, faces, and diseases in medical scans. Every filter you've heard about — edge detector, Gabor filter, learned feature — is a convolution kernel sliding across an image.
What a Filter Is
A filter (also called a kernel) is a small matrix of numbers. In modern CNNs, 3x3 filters are most common. The filter contains learnable weights - they start random and are updated via backpropagation, just like the weights of a fully connected layer.
After training, different filters learn to detect different things: horizontal edges, vertical edges, diagonal lines, color blobs, textures. The specific patterns are not designed by humans - they emerge from gradient descent on the training data.
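In code, a filter really is just a small array. A minimal sketch (NumPy assumed; the variable name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# A 3x3 filter: 9 learnable weights, initialized randomly.
# Training (backpropagation) nudges these values until the
# filter responds strongly to some useful pattern in the data.
filt = rng.standard_normal((3, 3))
print(filt.shape)  # (3, 3)
```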
How Convolution Works
The process is a sliding dot product:
- Place the filter at the top-left corner of the image, aligned with a 3x3 patch of pixels.
- Compute the dot product between the filter and that patch (multiply corresponding entries, sum them all up).
- Write this single number to the output at the corresponding position.
- Slide the filter one pixel to the right, repeat.
- After reaching the right edge, move down one row and start from the left.
- Continue until the filter has visited every valid position in the image.
The grid of all these dot product results is called the feature map, or activation map.
Convolution slides the kernel across the input, computing a dot product at each position. The kernel's values are learned during training — the network discovers which patterns to detect.
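Here is a minimal NumPy sketch of that sliding dot product (no padding, stride 1). The function name `conv2d_valid` is mine, not from any library:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D convolution (cross-correlation): slide the kernel
    over every valid position and take a dot product at each one."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]   # the patch under the kernel
            out[i, j] = np.sum(patch * kernel)  # multiply elementwise, sum
    return out

image = np.random.default_rng(0).random((28, 28))
kernel = np.ones((3, 3)) / 9.0  # a simple blur filter, just for illustration
print(conv2d_valid(image, kernel).shape)  # (26, 26)
```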
A Concrete Example: Edge Detection
Let us look at a real filter:
[[-1, -1, -1],
 [ 0,  0,  0],
 [ 1,  1,  1]]
This is a horizontal edge detector. When placed over a region where the top row is dark (pixels ≈ 0) and the bottom row is bright (pixels ≈ 255):
- Top row: (-1)(0) + (-1)(0) + (-1)(0) = 0
- Middle row: the weights are all 0, so it contributes 0 regardless of the pixels
- Bottom row: (1)(255) + (1)(255) + (1)(255) = 765
Dot product = 765. The filter fired strongly. When placed over a uniform region (all pixels ≈ 128), positive and negative terms cancel and the output is near 0. The filter only fires at horizontal edges.
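You can verify this arithmetic directly with two hand-built patches (a sketch in NumPy):

```python
import numpy as np

kernel = np.array([[-1, -1, -1],
                   [ 0,  0,  0],
                   [ 1,  1,  1]])

# Dark on top, bright on bottom: a horizontal edge.
edge_patch = np.array([[  0,   0,   0],
                       [128, 128, 128],
                       [255, 255, 255]])

# Uniform gray: no edge anywhere.
flat_patch = np.full((3, 3), 128)

print(np.sum(kernel * edge_patch))  # 765 -> the filter fires strongly
print(np.sum(kernel * flat_patch))  # 0   -> positive and negative terms cancel
```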
The Math
For a 2D convolution (technically cross-correlation, but universally called convolution in deep learning):

$$\text{out}(i, j) = \sum_{m} \sum_{n} K(m, n)\, I(i + m,\, j + n)$$

where:
- $K$ - the filter (kernel) - small matrix of learnable weights
- $I$ - the input image or feature map
- $(i, j)$ - output position
- $(m, n)$ - indices over the filter's spatial extent
You sum over the filter's spatial extent, indexing into the image at offset $(i + m, j + n)$. This is a local dot product between the filter and an image patch.
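A quick sanity check on the "technically cross-correlation" point: deep-learning convolution matches SciPy's cross-correlation, while true mathematical convolution flips the kernel first. A sketch, assuming SciPy is installed:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

rng = np.random.default_rng(0)
image = rng.random((5, 5))
kernel = rng.random((3, 3))

# What deep learning calls "convolution":
dl_conv = correlate2d(image, kernel, mode="valid")

# True mathematical convolution flips the kernel in both axes first:
math_conv = convolve2d(image, kernel, mode="valid")

print(np.allclose(dl_conv, math_conv))  # False in general
print(np.allclose(dl_conv,
                  convolve2d(image, kernel[::-1, ::-1], mode="valid")))  # True
```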
Multiple Filters: Multiple Feature Maps
One filter detects one pattern. Images contain many patterns - horizontal edges, vertical edges, corners, textures. You apply k filters in parallel to detect k different patterns simultaneously.
Each filter produces one feature map. With k filters on a 28x28 image using 3x3 filters (no padding): each filter produces a 26x26 feature map, so the output is 26x26xk.
The k feature maps are the k "channels" of the next layer, analogous to the 3 RGB channels of the input.
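A sketch of applying k filters in parallel and stacking the results as channels (SciPy assumed; the channels-last stacking here is just one convention, chosen for illustration):

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
image = rng.random((28, 28))               # single-channel 28x28 input
k = 16
filters = rng.standard_normal((k, 3, 3))   # k independent 3x3 filters

# Each filter produces its own 26x26 feature map; stack them as channels.
feature_maps = np.stack(
    [correlate2d(image, f, mode="valid") for f in filters],
    axis=-1,
)
print(feature_maps.shape)  # (26, 26, 16)
```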
Output Size Formula
$$O = \frac{N - F + 2P}{S} + 1$$

where:
- $N$ - input spatial size (height or width)
- $F$ - filter size
- $P$ - padding (pixels added to each side)
- $S$ - stride
For a 28x28 image, 3x3 filter, no padding, stride 1: $O = (28 - 3 + 0)/1 + 1 = 26$.
Stride controls how far the filter jumps between positions:
- Stride 1: moves 1 pixel at a time (maximum overlap).
- Stride 2: jumps every other position, output is roughly half the spatial size.
Padding adds a border of zeros around the image:
- valid: no padding. Output is smaller than the input.
- same: pad so the output has the same spatial size as the input (for stride 1).
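The formula and both padding modes as a small helper (the function name is mine):

```python
def conv_output_size(n, f, p=0, s=1):
    """O = (N - F + 2P) / S + 1, applied per spatial dimension."""
    return (n - f + 2 * p) // s + 1

print(conv_output_size(28, 3))        # 26 -> "valid": no padding
print(conv_output_size(28, 3, p=1))   # 28 -> "same": pad 1 for a 3x3 filter
print(conv_output_size(28, 3, s=2))   # 13 -> stride 2 roughly halves the size
```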
Parameter Efficiency
A 3x3 filter has 9 learnable parameters. Those same 9 weights are reused at every position in the image. 64 filters of size 3x3 on the first layer = 64 × 9 = 576 weights, or 640 parameters total with one bias per filter.
Compare to a fully connected layer connecting all 784 MNIST pixels to 784 outputs: 784 × 784 = 614,656 parameters. Same scale of computation, roughly 1000x fewer parameters in the conv layer.
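The counting, in plain arithmetic:

```python
# Conv layer: 64 filters, each 3x3 on a single-channel input, plus one bias each.
conv_params = 64 * (3 * 3) + 64
print(conv_params)               # 640

# Fully connected: every one of 784 pixels wired to each of 784 outputs.
fc_params = 784 * 784
print(fc_params)                 # 614656
print(fc_params // conv_params)  # ~960 -> roughly 1000x fewer parameters
```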
Interactive example
Filter visualization - see what 16 learned filters look like after training on image data
Coming soon