Before you can understand convolutional neural networks, you need to see images the way computers do: as grids of numbers. Once you see that, both the power and the limitations of naive approaches become obvious.
Convolutional neural networks are behind every image recognition system — from phone cameras that identify faces to medical AI that detects tumors in X-rays. Before you can understand why they work, you need to understand what an image actually looks like to a computer.
Pixels as Numbers
Every digital image is a grid of pixels. Each pixel is just a number.
A grayscale image is a 2D matrix where each entry is an integer from 0 (pure black) to 255 (pure white). The MNIST dataset of handwritten digits uses 28x28 grayscale images. Each image is literally a 28x28 matrix: 784 numbers arranged in a grid.
A color image has three separate channels: Red, Green, and Blue (RGB). A 224x224 color image is a 224x224x3 tensor - three stacked matrices, one per color channel.
- Height: image height in pixels
- Width: image width in pixels
- Channels: number of channels - 1 for grayscale, 3 for RGB
Total numbers for a 224x224 RGB image: 224 x 224 x 3 = 150,528. A rich orange pixel is R=255, G=165, B=0. A soft teal is R=0, G=128, B=128. The image is nothing but 150,528 carefully arranged numbers.
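A quick NumPy sketch makes these shapes concrete (the array names and the all-zeros starting values are just for illustration):

```python
import numpy as np

# A grayscale MNIST-sized image: one 28x28 matrix of values in [0, 255].
gray = np.zeros((28, 28), dtype=np.uint8)
print(gray.shape, gray.size)   # (28, 28) 784

# A 224x224 RGB image: three stacked channels, shape (224, 224, 3).
rgb = np.zeros((224, 224, 3), dtype=np.uint8)
rgb[0, 0] = [255, 165, 0]      # a rich orange pixel (R=255, G=165, B=0)
rgb[0, 1] = [0, 128, 128]      # a soft teal pixel (R=0, G=128, B=128)
print(rgb.size)                # 150528 numbers in total
```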
The Naive Approach: Flatten and Feed
The simplest thing you could do is flatten the image into a single long vector and feed it into a standard fully connected network. For a 28x28 MNIST image: flatten to a 784-element vector, add a hidden layer, done. And it works to a degree - you can achieve around 98% accuracy on MNIST this way.
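A minimal forward pass of this flatten-and-feed approach might look like the sketch below. The hidden width of 128 and the random weights are illustrative stand-ins, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((28, 28))                  # stand-in for an MNIST digit

x = image.reshape(784)                        # flatten: 28x28 -> 784-vector
W1 = rng.standard_normal((128, 784)) * 0.01   # hidden layer, 128 units (illustrative)
b1 = np.zeros(128)
h = np.maximum(0, W1 @ x + b1)                # ReLU hidden activations
W2 = rng.standard_normal((10, 128)) * 0.01    # output layer: 10 digit classes
b2 = np.zeros(10)
logits = W2 @ h + b2
print(logits.shape)                           # (10,) - one score per digit
```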
But consider what happens with larger images. For a 224x224 RGB image feeding a hidden layer of 512 units:

150,528 inputs x 512 hidden units = 77,070,336 weights

Over 77 million parameters for a single hidden layer. That is expensive, slow, and extremely prone to overfitting. But the parameter count is not even the worst problem.
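The arithmetic behind that figure, assuming a single hidden layer of 512 units (a size consistent with the 77-million total):

```python
# Parameter count for one fully connected layer on a flattened 224x224 RGB image.
inputs = 224 * 224 * 3        # 150,528 input values after flattening
hidden = 512                  # hypothetical hidden-layer width
weights = inputs * hidden     # one weight per input-to-unit connection
biases = hidden               # one bias per hidden unit
print(weights + biases)       # 77,070,848 parameters - over 77 million
```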
What Flattening Destroys
When you flatten a 2D image into a 1D vector, you lose something critical: spatial structure.
Consider pixels at positions (10, 10) and (10, 11) in the original image - they are side-by-side neighbors, very likely to be similar or related. After row-major flattening they happen to land at adjacent indices, but a fully connected layer treats every input index identically, so it has no way of knowing they were neighbors in 2D space. Vertical neighbors fare even worse: (10, 10) and (11, 10) end up a full row apart. To the network, the pixel at (10, 10) is equally "related" to a pixel in the opposite corner of the image - they are all just numbers at different indices.
All neighborhood information is thrown away the moment you flatten. Recognizing edges, textures, shapes, and objects all depend on local relationships between nearby pixels. A flat vector forces the network to re-learn "these pixels are near each other" from scratch.
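A small sketch of row-major flattening shows the problem for a 28x28 image (the helper function is just for illustration):

```python
WIDTH = 28                              # MNIST image width

def flat_index(row, col, width=WIDTH):
    # Row-major flattening: 2D position (row, col) -> 1D index row * width + col
    return row * width + col

# Horizontal neighbors stay adjacent after flattening...
print(flat_index(10, 10), flat_index(10, 11))   # 290 291
# ...but vertical neighbors end up a full row (28 positions) apart.
print(flat_index(10, 10), flat_index(11, 10))   # 290 318
```

And even when two neighbors do stay adjacent, a dense layer attaches no special meaning to adjacent indices - every input is connected to every unit with its own independent weight.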
Interactive example
Flattening visualization - see pixel neighborhoods before and after flattening
Coming soon
The Translation Problem
There is a third issue that makes flattening even worse: the lack of translation invariance.
Imagine you train a network to recognize cats. During training, most cats are near the center of the image. Your model learns weights that respond to cat features at roughly those pixel positions. Now at test time, someone submits a photo where the cat is in the lower-right corner. Completely different pixel positions are active. The trained weights are looking for the cat in all the wrong places.
To the fully connected network, a cat in the upper-left and a cat in the lower-right are as different as a cat and an airplane. You would need to show the model cats in every possible position - an enormous amount of redundant data.
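A toy example makes this vivid. Here a bright 2x2 blob is placed at the center of one tiny image and in the lower-right of another; the flattened vectors share no active positions at all, so a model keyed to the first pattern sees nothing familiar in the second:

```python
import numpy as np

# A tiny 8x8 "image" with a bright 2x2 blob, and the same blob shifted.
img_center = np.zeros((8, 8))
img_center[3:5, 3:5] = 1.0
img_shifted = np.zeros((8, 8))
img_shifted[5:7, 5:7] = 1.0

a, b = img_center.ravel(), img_shifted.ravel()
# The dot product measures overlap between the flattened vectors:
print(np.dot(a, b))             # 0.0 - zero overlap despite identical content
```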
What We Actually Need
What we want is an architecture that:
- Exploits locality: processes nearby pixels together because they are more related than distant pixels.
- Shares knowledge across positions: a feature detector that spots an edge should work anywhere in the image, not just at specific pixel coordinates.
- Is computationally efficient: does not require 77 million parameters per layer.
These three requirements point directly to convolutional neural networks. The convolution operation is specifically designed to satisfy all three simultaneously.
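To preview how convolution meets all three requirements, here is a minimal sketch of a single 2D convolution (implemented, as in most deep learning libraries, as cross-correlation). The same 9 weights slide over every position - locality, weight sharing, and tiny parameter count in one operation. The edge-detector kernel and test image are illustrative:

```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" 2D convolution: slide the kernel over every position, no padding.
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same kernel weights are applied at every position:
            # locality (3x3 window) + weight sharing (one set of 9 parameters).
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

edge_kernel = np.array([[-1., 0., 1.],
                        [-1., 0., 1.],
                        [-1., 0., 1.]])   # vertical-edge detector: 9 parameters total
image = np.zeros((8, 8))
image[:, 4:] = 1.0                        # left half dark, right half bright
response = conv2d(image, edge_kernel)
print(response.shape)                     # (6, 6) output map
```

Nine parameters instead of 77 million, and the detector fires wherever the edge appears - exactly the properties the list above asks for.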
Interactive example
Pixel neighborhood - click any pixel to highlight its local neighbors and see how locality works
Coming soon