From One Neuron to Many
A single neuron takes a vector of inputs, computes a weighted sum, and applies an activation. That is useful, but it only produces one output value. Real data has rich structure that cannot be captured by a single number.
Think of recognizing a hand-written digit. One neuron might ask "is there ink in the top-left corner?", while another asks "is there a curve here?", and a third "is there a horizontal stroke here?". You need many detectors running in parallel, all looking at the same image, each contributing its own answer. That's a layer.
A layer is the natural generalization: a collection of neurons that all see the same input and each independently produce their own output. If you have m neurons and n inputs, the layer maps a vector of n numbers to a vector of m numbers. Each neuron has its own weight vector and bias, but they are all computed in parallel.
The Matrix Formulation
Say you have n inputs and 4 neurons. Each neuron j has a weight vector w_j and a scalar bias b_j, computing:

z_j = w_j · x + b_j

- w_j: weight vector for neuron j
- x: input vector
- b_j: bias for neuron j
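Written out one neuron at a time, this is nothing more than a loop of dot products. A minimal sketch in plain Python (the specific weights and inputs are made up for illustration):

```python
# Per-neuron computation: z_j = w_j . x + b_j, one dot product at a time.
x = [1.0, 2.0, 3.0]                # n = 3 inputs
ws = [[1, 0, 0],                   # w_1
      [0, 1, 0],                   # w_2
      [1, 1, 1],                   # w_3
      [0.5, 0.5, 0.5]]             # w_4
bs = [0.0, 1.0, -1.0, 0.5]         # b_1 .. b_4

z = []
for w_j, b_j in zip(ws, bs):       # one pass per neuron
    z_j = sum(w * xi for w, xi in zip(w_j, x)) + b_j
    z.append(z_j)
# z is [1.0, 3.0, 5.0, 3.5]
```

Each iteration of the loop is one neuron; nothing couples them, which is exactly why the whole thing can be batched into a single matrix operation.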
Four dot products. Four additions. But why compute them one by one? Stack the four weight vectors as rows of a matrix W, and the whole computation collapses to:

z = Wx + b

- W: weight matrix with one row per neuron (here 4 × n)
- x: input vector
- b: bias vector
Then apply the activation elementwise:

a = f(z)

- a: activation output vector
- f: elementwise activation function
- z: pre-activation vector
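The two equations fit in a few lines of NumPy. A sketch of the layer (ReLU is chosen here as the activation; the `dense` name and the numbers are illustrative, not from any library):

```python
import numpy as np

def dense(W, b, x, f=lambda z: np.maximum(z, 0.0)):
    """Fully connected layer: a = f(Wx + b), ReLU by default."""
    z = W @ x + b          # one matrix-vector product replaces all the dot products
    return f(z)

W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.5, 0.5, 0.5]])        # shape (4, 3): 4 neurons, 3 inputs
b = np.array([0.0, 1.0, -1.0, 0.5])   # one bias per neuron
x = np.array([1.0, 2.0, 3.0])

a = dense(W, b, x)                    # shape (4,): one activation per neuron
```

The `@` operator hands the loop over neurons to optimized BLAS code, which is the same operation a GPU accelerates.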
This is the complete layer computation. Modern GPUs are specifically designed to perform matrix multiplications extremely fast — training deep networks is fundamentally a matrix multiplication problem.
Notation for Multi-Layer Networks
With multiple layers, superscripts index the layer. Layer l has its own weights W^(l) and biases b^(l). For a 3-layer network:
- Compute: a^(0) = x (the input is layer 0's activation)
- Compute: a^(1) = f(W^(1) a^(0) + b^(1))
- Compute: a^(2) = f(W^(2) a^(1) + b^(2))
- Compute: a^(3) = f(W^(3) a^(2) + b^(3))
Each layer's output becomes the next layer's input. The weight matrix W^(l) has shape n_l × n_{l-1}: n_l rows (one per neuron in layer l) and n_{l-1} columns (one per neuron in the previous layer).
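Chaining layers is just repeating the same computation in a loop. A sketch of the full forward pass (random weights, ReLU everywhere; a real classifier would typically end with softmax, omitted here to keep the sketch short):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Layer sizes: input, two hidden layers, output.
sizes = [784, 128, 64, 10]

# W^(l) has shape (n_l, n_{l-1}); b^(l) has shape (n_l,).
params = [(rng.normal(scale=0.01, size=(n_l, n_prev)), np.zeros(n_l))
          for n_prev, n_l in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    a = x                        # a^(0) is the input
    for W, b in params:
        a = relu(W @ a + b)      # a^(l) = f(W^(l) a^(l-1) + b^(l))
    return a

out = forward(rng.normal(size=784), params)   # shape (10,)
```

Note how the shapes chain: each W^(l) has exactly as many columns as the previous layer has neurons, so the matrix-vector products compose without any reshaping.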
Recall how matrix multiplication works: each output cell C[i][j] is the dot product of row i from A with column j from B, which is why the inner dimensions must match. In a neural network, this is how all inputs combine with all weights in one operation.
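That rule translates directly into a triple loop. A naive plain-Python version, just to make the dot-product structure explicit (NumPy's `@` does the same thing, much faster):

```python
def matmul(A, B):
    """C[i][j] = dot product of row i of A with column j of B."""
    n, k = len(A), len(A[0])
    assert len(B) == k, "inner dimensions must match"
    m = len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

C = matmul([[1, 2],
            [3, 4]],
           [[5, 6],
            [7, 8]])
# C is [[19, 22], [43, 50]]
```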
Counting Parameters
Every weight and bias is a learnable parameter. Layer l connects n_{l-1} inputs to n_l neurons. Each neuron needs n_{l-1} weights plus 1 bias:

params(l) = n_l × n_{l-1} + n_l = n_l (n_{l-1} + 1)

- n_l: neurons in layer l
- n_{l-1}: neurons in layer l-1
Concrete example: MNIST classifier with architecture 784 → 128 → 64 → 10.
- Layer 1 (784 → 128): 784 × 128 + 128 = 100,480 parameters
- Layer 2 (128 → 64): 128 × 64 + 64 = 8,256 parameters
- Layer 3 (64 → 10): 64 × 10 + 10 = 650 parameters
Total: 109,386 parameters to recognize handwritten digits.
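The per-layer formula makes the count a one-liner. A small sketch that verifies the arithmetic above:

```python
def count_params(sizes):
    """Total learnable parameters: n_l * n_{l-1} weights + n_l biases per layer."""
    return sum(n_l * (n_prev + 1)
               for n_prev, n_l in zip(sizes[:-1], sizes[1:]))

total = count_params([784, 128, 64, 10])
# total is 100480 + 8256 + 650 = 109386
```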
For comparison, modern vision models have hundreds of millions of parameters, and GPT-3 has 175 billion. Parameter count is a rough proxy for capacity: how complex a function the network can represent.
The Fully Connected Name
A layer where every neuron connects to every neuron in the previous layer is called fully connected (or dense). The weight matrix has n_l × n_{l-1} entries: every possible connection exists.
Not all layers need to be fully connected. Convolutional layers (for images) and attention layers (for sequences) impose different connectivity patterns that exploit structure in the data. But fully connected layers are the foundation — every deep learning architecture contains them in some form, especially at the output.