Images have spatial structure: nearby pixels are related, and the same kind of pattern can appear in many different places. A regular fully connected layer ignores that structure. Convolution exploits it directly.
The convolution operation is the core of everything in this unit. Once you understand it intuitively - a small learnable detector sliding across an image, asking "is my pattern here?" at every position - the entire CNN architecture falls into place.
The convolution operation is how a CNN detects edges, textures, and shapes — and eventually recognizes objects, faces, and diseases in medical scans. Every filter you've heard about — edge detector, Gabor filter, learned feature — is a convolution kernel sliding across an image.
What a Filter Is
A filter (also called a kernel) is a small matrix of numbers. In modern CNNs, 3x3 filters are most common. The filter contains learnable weights - they start random and are updated via backpropagation, just like the weights of a fully connected layer.
After training, different filters learn to detect different things: horizontal edges, vertical edges, diagonal lines, color blobs, textures. The specific patterns are not designed by humans - they emerge from gradient descent on the training data.
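In code, a filter really is just a small array. A minimal sketch (NumPy assumed; the variable name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

# A 3x3 filter: 9 learnable weights, initialized randomly.
# Training (backpropagation) nudges these values until the
# filter responds strongly to some useful pattern in the data.
filt = rng.standard_normal((3, 3))
print(filt.shape)  # (3, 3)
```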
How Convolution Works
The process is a sliding dot product:
- Place the filter at the top-left corner of the image, aligned with a 3x3 patch of pixels.
- Compute the dot product between the filter and that patch (multiply corresponding entries, sum them all up).
- Write this single number to the output at the corresponding position.
- Slide the filter one pixel to the right, repeat.
- After reaching the right edge, move down one row and start from the left.
- Continue until the filter has visited every valid position in the image.
The grid of all these dot product results is called the feature map, or activation map.
Convolution slides the kernel across the input, computing a dot product at each position. The kernel's values are learned during training — the network discovers which patterns to detect.
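Here is a minimal NumPy sketch of that sliding dot product (no padding, stride 1). The function name `conv2d_valid` is mine, not from any library:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2D convolution (cross-correlation): slide the kernel
    over every valid position and take a dot product at each one."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]   # the patch under the kernel
            out[i, j] = np.sum(patch * kernel)  # multiply elementwise, sum
    return out

image = np.random.default_rng(0).random((28, 28))
kernel = np.ones((3, 3)) / 9.0  # a simple blur filter, just for illustration
print(conv2d_valid(image, kernel).shape)  # (26, 26)
```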
A Concrete Example: Edge Detection
Let us look at a real filter:
[[-1, -1, -1],
 [ 0,  0,  0],
 [ 1,  1,  1]]
This is a horizontal edge detector. When placed over a region where the top row is dark (pixels ≈ 0) and the bottom row is bright (pixels ≈ 255):
- Top row: (-1)(0) + (-1)(0) + (-1)(0) = 0
- Middle row: the weights are all 0, so it contributes 0 regardless of the pixels
- Bottom row: (1)(255) + (1)(255) + (1)(255) = 765
Dot product = 765. The filter fired strongly. When placed over a uniform region (all pixels ≈ 128), positive and negative terms cancel and the output is near 0. The filter only fires at horizontal edges.
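You can verify this arithmetic directly with two hand-built patches (a sketch in NumPy):

```python
import numpy as np

kernel = np.array([[-1, -1, -1],
                   [ 0,  0,  0],
                   [ 1,  1,  1]])

# Dark on top, bright on bottom: a horizontal edge.
edge_patch = np.array([[  0,   0,   0],
                       [128, 128, 128],
                       [255, 255, 255]])

# Uniform gray: no edge anywhere.
flat_patch = np.full((3, 3), 128)

print(np.sum(kernel * edge_patch))  # 765 -> the filter fires strongly
print(np.sum(kernel * flat_patch))  # 0   -> positive and negative terms cancel
```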
The Math
For a 2D convolution (technically cross-correlation, but universally called convolution in deep learning):

$$\text{out}(i, j) = \sum_{m} \sum_{n} K(m, n)\, I(i + m,\, j + n)$$

where:
- $K$ - the filter (kernel) - small matrix of learnable weights
- $I$ - the input image or feature map
- $(i, j)$ - output position
- $(m, n)$ - indices over the filter's spatial extent
You sum over the filter's spatial extent, indexing into the image at offset $(i + m, j + n)$. This is a local dot product between the filter and an image patch.
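A quick sanity check on the "technically cross-correlation" point: deep-learning convolution matches SciPy's cross-correlation, while true mathematical convolution flips the kernel first. A sketch, assuming SciPy is installed:

```python
import numpy as np
from scipy.signal import correlate2d, convolve2d

rng = np.random.default_rng(0)
image = rng.random((5, 5))
kernel = rng.random((3, 3))

# What deep learning calls "convolution":
dl_conv = correlate2d(image, kernel, mode="valid")

# True mathematical convolution flips the kernel in both axes first:
math_conv = convolve2d(image, kernel, mode="valid")

print(np.allclose(dl_conv, math_conv))  # False in general
print(np.allclose(dl_conv,
                  convolve2d(image, kernel[::-1, ::-1], mode="valid")))  # True
```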
Multiple Filters: Multiple Feature Maps
One filter detects one pattern. Images contain many patterns - horizontal edges, vertical edges, corners, textures. You apply k filters in parallel to detect k different patterns simultaneously.
Each filter produces one feature map. With k filters on a 28x28 image using 3x3 filters (no padding): each filter produces a 26x26 feature map, so the output is 26x26xk.
The k feature maps are the k "channels" of the next layer, analogous to the 3 RGB channels of the input.
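A sketch of applying k filters in parallel and stacking the results as channels (SciPy assumed; the channels-last stacking here is just one convention, chosen for illustration):

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
image = rng.random((28, 28))               # single-channel 28x28 input
k = 16
filters = rng.standard_normal((k, 3, 3))   # k independent 3x3 filters

# Each filter produces its own 26x26 feature map; stack them as channels.
feature_maps = np.stack(
    [correlate2d(image, f, mode="valid") for f in filters],
    axis=-1,
)
print(feature_maps.shape)  # (26, 26, 16)
```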
Output Size Formula
$$O = \frac{N - F + 2P}{S} + 1$$

where:
- $N$ - input spatial size (height or width)
- $F$ - filter size
- $P$ - padding (pixels added to each side)
- $S$ - stride
For a 28x28 image, 3x3 filter, no padding, stride 1: $O = (28 - 3 + 0)/1 + 1 = 26$.
Stride controls how far the filter jumps between positions:
- Stride 1: moves 1 pixel at a time (maximum overlap).
- Stride 2: jumps every other position, output is roughly half the spatial size.
Padding adds a border of zeros around the image:
- valid: no padding. Output is smaller than the input.
- same: pad so the output has the same spatial size as the input (for stride 1).
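The formula and both padding modes as a small helper (the function name is mine):

```python
def conv_output_size(n, f, p=0, s=1):
    """O = (N - F + 2P) / S + 1, applied per spatial dimension."""
    return (n - f + 2 * p) // s + 1

print(conv_output_size(28, 3))        # 26 -> "valid": no padding
print(conv_output_size(28, 3, p=1))   # 28 -> "same": pad 1 for a 3x3 filter
print(conv_output_size(28, 3, s=2))   # 13 -> stride 2 roughly halves the size
```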
Parameter Efficiency
A 3x3 filter has 9 learnable parameters. Those same 9 weights are reused at every position in the image. 64 filters of size 3x3 on the first layer = 64 × 9 = 576 weights, or 640 parameters total with one bias per filter.
Compare to a fully connected layer connecting all 784 MNIST pixels to 784 outputs: 784 × 784 = 614,656 parameters. Same scale of computation, roughly 1000x fewer parameters in the conv layer.
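The counting, in plain arithmetic:

```python
# Conv layer: 64 filters, each 3x3 on a single-channel input, plus one bias each.
conv_params = 64 * (3 * 3) + 64
print(conv_params)               # 640

# Fully connected: every one of 784 pixels wired to each of 784 outputs.
fc_params = 784 * 784
print(fc_params)                 # 614656
print(fc_params // conv_params)  # ~960 -> roughly 1000x fewer parameters
```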
Interactive example
Filter visualization - see what 16 learned filters look like after training on image data
Coming soon