Neural Networks
Lesson 3 ⏱ 12 min

Layers: h = σ(Wx + b)


Layers: Matrix Multiplication as Parallel Neurons

How stacking weight vectors into a matrix lets you compute all neurons in a layer simultaneously, and how dimensions flow through a multi-layer network.


Quick refresher

Matrix multiplication

Multiplying an m×n matrix by an n×1 vector gives an m×1 vector. Each row of the matrix dots with the vector.

Example

W is 4×3 (4 neurons, 3 inputs).

x is 3×1.

Wx+b gives a 4×1 output — one value per neuron.

From One Neuron to Many

A single neuron takes a vector of inputs, computes a weighted sum, and applies an activation. That is useful, but it only produces one output value. Real data has rich structure that cannot be captured by a single number.

Think of recognizing a hand-written digit. One neuron might ask "is there ink in the top-left corner?", while a different neuron might ask "is there a curve here?", and another "is there a horizontal stroke here?". You need many detectors running in parallel, all looking at the same image, each contributing its own answer. That's a layer.

A layer is the natural generalization: a collection of neurons that all see the same input and each independently produce their own output. If you have n_{out} neurons, the layer maps a vector of n_{in} numbers to a vector of n_{out} numbers. Each neuron has its own weight vector and bias, but they are all computed in parallel.

The Matrix Formulation

Say you have n_{in} = 3 inputs and n_{out} = 4 neurons. Each neuron j has a weight vector w_j \in \mathbb{R}^3 and a scalar bias b_j, computing:

z_j = w_j \cdot x + b_j

w_j: weight vector for neuron j
x: input vector
b_j: bias for neuron j

Four dot products. Four additions. But why compute them one by one? Stack the four weight vectors as rows of a 4 \times 3 matrix W:

z = Wx + b

W: weight matrix with one row per neuron
x: input vector
b: bias vector

Then apply the activation function elementwise (written here for a general layer l):

a^{(l)} = \sigma\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)

a^{(l)}: activation output vector of layer l
\sigma: elementwise activation function
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}: pre-activation vector

This is the complete layer computation. Modern GPUs are specifically designed to perform matrix multiplications extremely fast — training deep networks is fundamentally a matrix multiplication problem.
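To see the stacking in action, here is a minimal NumPy sketch. The 4×3 shapes match the example above; the random weight values and the choice of ReLU are illustrative assumptions, not part of the lesson's data.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # 4 neurons, 3 inputs: one weight row per neuron
b = rng.standard_normal(4)        # one bias per neuron
x = rng.standard_normal(3)        # input vector

# One neuron at a time: four dot products, four additions
z_loop = np.array([W[j] @ x + b[j] for j in range(4)])

# All neurons at once: a single matrix-vector product
z_matrix = W @ x + b
assert np.allclose(z_loop, z_matrix)

# Elementwise activation turns pre-activations z into the layer output a
a = np.maximum(z_matrix, 0)       # ReLU
print(a.shape)                    # (4,) -- one value per neuron
```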

Notation for Multi-Layer Networks

With multiple layers, superscripts index the layer. Layer l has its own weights W^{(l)} and biases b^{(l)}. For a 3-layer network:

  1. Set: a^{(0)} = x (the input is layer 0's activation)
  2. Compute: z^{(1)} = W^{(1)} a^{(0)} + b^{(1)}, \quad a^{(1)} = \text{ReLU}(z^{(1)})
  3. Compute: z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}, \quad a^{(2)} = \text{ReLU}(z^{(2)})
  4. Compute: z^{(3)} = W^{(3)} a^{(2)} + b^{(3)}, \quad \hat{y} = \text{softmax}(z^{(3)})

Each layer's output becomes the next layer's input. The weight matrix W^{(l)} has shape (n_l \times n_{l-1}): n_l rows (one per neuron in layer l) and n_{l-1} columns (one per neuron in the previous layer).
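Here is a small NumPy sketch of that 3-layer forward pass. The layer sizes 3 → 4 → 4 → 2, the random weights, and the relu/softmax helpers are illustrative assumptions chosen to keep the shapes easy to trace.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    e = np.exp(z - z.max())                 # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]                        # n_0 (input), n_1, n_2, n_3 -- illustrative
Ws = [rng.standard_normal((sizes[l], sizes[l - 1])) for l in range(1, 4)]  # W^(l): (n_l, n_{l-1})
bs = [rng.standard_normal(sizes[l]) for l in range(1, 4)]                  # b^(l): (n_l,)

a = rng.standard_normal(sizes[0])           # a^(0) = x
for l, (W, b) in enumerate(zip(Ws, bs), start=1):
    z = W @ a + b                           # z^(l) = W^(l) a^(l-1) + b^(l)
    a = softmax(z) if l == 3 else relu(z)   # ReLU for hidden layers, softmax at the output
    print(f"layer {l}: shape {a.shape}")    # (n_l,) -- next layer's input

y_hat = a                                   # predicted class probabilities
```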

[Interactive demo: multiplying two 2×2 matrices A and B, hovering each cell of C = AB to see how it is computed.]

Each output cell C[i][j] is the dot product of row i from A with column j from B — which is why the inner dimensions must match. In a neural network, this is how all inputs combine with all weights in one operation.
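As a quick sanity check of that rule, here is a tiny sketch with an arbitrary pair of 2×2 matrices (the specific numbers are purely illustrative):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Each output cell C[i][j] is the dot product of row i of A with column j of B
C_manual = np.array([[A[i] @ B[:, j] for j in range(2)] for i in range(2)])

assert np.array_equal(C_manual, A @ B)
print(C_manual)   # [[19 22]
                  #  [43 50]]
```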

Counting Parameters

Every weight and bias is a learnable parameter. Layer l connects n_{l-1} inputs to n_l neurons. Each neuron needs n_{l-1} weights plus 1 bias:

\text{params per layer} = n_l \times (n_{l-1} + 1)

n_l: neurons in layer l
n_{l-1}: neurons in layer l-1

Concrete example: MNIST classifier with architecture 784 → 128 → 64 → 10.

  1. Layer 1 (784 → 128): 128 \times 785 = 100{,}480 parameters
  2. Layer 2 (128 → 64): 64 \times 129 = 8{,}256 parameters
  3. Layer 3 (64 → 10): 10 \times 65 = 650 parameters

Total: 109,386 parameters to recognize handwritten digits.
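A short sketch that reproduces this count for the 784 → 128 → 64 → 10 architecture (the helper function name is just for illustration):

```python
def count_params(layer_sizes):
    """Total learnable parameters in a fully connected network.

    Each layer contributes n_l * (n_{l-1} + 1): a weight matrix plus a bias vector.
    """
    return sum(n_out * (n_in + 1)
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(count_params([784, 128, 64, 10]))   # 100480 + 8256 + 650 = 109386
```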

For scale, modern vision models have hundreds of millions of parameters, and GPT-3 has 175 billion. Parameter count is a rough proxy for capacity: how complex a function the network can represent.

The Fully Connected Name

A layer where every neuron connects to every neuron in the previous layer is called fully connected (or dense). The weight matrix has n_l \times n_{l-1} entries: every possible connection exists.

Not all layers need to be fully connected. Convolutional layers (for images) and attention layers (for sequences) impose different connectivity patterns that exploit structure in the data. But fully connected layers are the foundation — every deep learning architecture contains them in some form, especially at the output.

Quiz

Question 1 of 3

A fully connected layer with n_in=10 inputs and n_out=5 neurons has weight matrix W with shape...