Neural Networks
Lesson 3 ⏱ 12 min

Layers: h = σ(Wx + b)


Layers: Matrix Multiplication as Parallel Neurons

How stacking weight vectors into a matrix lets you compute all neurons in a layer simultaneously, and how dimensions flow through a multi-layer network.


Quick refresher

Matrix multiplication

Multiplying an m×n matrix by an n×1 vector gives an m×1 vector. Each row of the matrix dots with the vector.

Example

W is 4×3 (4 neurons, 3 inputs).

x is 3×1.

Wx+b gives a 4×1 output — one value per neuron.

From One Neuron to Many

A single neuron takes a vector of inputs, computes a weighted sum, and applies an activation. That is useful, but it only produces one output value. Real data has rich structure that cannot be captured by a single number.

Think of recognizing a hand-written digit. One neuron might ask "is there ink in the top-left corner?", while a different neuron might ask "is there a curve here?", and another "is there a horizontal stroke here?". You need many detectors running in parallel, all looking at the same image, each contributing its own answer. That's a layer.

A layer is the natural generalization: a collection of neurons that all see the same input and each independently produce their own output. If you have n_{out} neurons, the layer maps a vector of n_{in} numbers to a vector of n_{out} numbers. Each neuron has its own weight vector and bias, but they are all computed in parallel.

The Matrix Formulation

Say you have n_{in} = 3 inputs and n_{out} = 4 neurons. Each neuron j has a weight vector w_j \in \mathbb{R}^3 and a scalar bias b_j, computing:

z_j = w_j \cdot x + b_j

w_j: weight vector for neuron j
x: input vector
b_j: bias for neuron j

Four dot products. Four additions. But why compute them one by one? Stack the four weight vectors as rows of a 4 \times 3 matrix W:

z = Wx + b

W: weight matrix with one row per neuron
x: input vector
b: bias vector

Then apply the activation function elementwise (written here for a general layer l):

a^{(l)} = \sigma\left(W^{(l)} a^{(l-1)} + b^{(l)}\right)

a^{(l)}: activation output vector of layer l
\sigma: elementwise activation function
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}: pre-activation vector

This is the complete layer computation. Modern GPUs are specifically designed to perform matrix multiplications extremely fast — training deep networks is fundamentally a matrix multiplication problem.
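To see the stacking in action, here is a minimal NumPy sketch. The 4×3 shapes match the example above; the random weight values and the choice of ReLU are illustrative assumptions, not part of the lesson's data.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # 4 neurons, 3 inputs: one weight row per neuron
b = rng.standard_normal(4)        # one bias per neuron
x = rng.standard_normal(3)        # input vector

# One neuron at a time: four dot products, four additions
z_loop = np.array([W[j] @ x + b[j] for j in range(4)])

# All neurons at once: a single matrix-vector product
z_matrix = W @ x + b
assert np.allclose(z_loop, z_matrix)

# Elementwise activation turns pre-activations z into the layer output a
a = np.maximum(z_matrix, 0)       # ReLU
print(a.shape)                    # (4,) -- one value per neuron
```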

Notation for Multi-Layer Networks

With multiple layers, superscripts index the layer. Layer l has its own weights W^{(l)} and biases b^{(l)}. For a 3-layer network:

  1. Set: a^{(0)} = x (the input is layer 0's activation)
  2. Compute: z^{(1)} = W^{(1)} a^{(0)} + b^{(1)}, \quad a^{(1)} = \text{ReLU}(z^{(1)})
  3. Compute: z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}, \quad a^{(2)} = \text{ReLU}(z^{(2)})
  4. Compute: z^{(3)} = W^{(3)} a^{(2)} + b^{(3)}, \quad \hat{y} = \text{softmax}(z^{(3)})

Each layer's output becomes the next layer's input. The weight matrix W^{(l)} has shape (n_l \times n_{l-1}): n_l rows (one per neuron in layer l) and n_{l-1} columns (one per neuron in the previous layer).
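Here is a small NumPy sketch of that 3-layer forward pass. The layer sizes 3 → 4 → 4 → 2, the random weights, and the relu/softmax helpers are illustrative assumptions chosen to keep the shapes easy to trace.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def softmax(z):
    e = np.exp(z - z.max())                 # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]                        # n_0 (input), n_1, n_2, n_3 -- illustrative
Ws = [rng.standard_normal((sizes[l], sizes[l - 1])) for l in range(1, 4)]  # W^(l): (n_l, n_{l-1})
bs = [rng.standard_normal(sizes[l]) for l in range(1, 4)]                  # b^(l): (n_l,)

a = rng.standard_normal(sizes[0])           # a^(0) = x
for l, (W, b) in enumerate(zip(Ws, bs), start=1):
    z = W @ a + b                           # z^(l) = W^(l) a^(l-1) + b^(l)
    a = softmax(z) if l == 3 else relu(z)   # ReLU for hidden layers, softmax at the output
    print(f"layer {l}: shape {a.shape}")    # (n_l,) -- next layer's input

y_hat = a                                   # predicted class probabilities
```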

[Interactive demo: multiplying two 2×2 matrices A and B, hovering each cell of C = AB to see how it is computed.]

Each output cell C[i][j] is the dot product of row i from A with column j from B — which is why the inner dimensions must match. In a neural network, this is how all inputs combine with all weights in one operation.
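As a quick sanity check of that rule, here is a tiny sketch with an arbitrary pair of 2×2 matrices (the specific numbers are purely illustrative):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

# Each output cell C[i][j] is the dot product of row i of A with column j of B
C_manual = np.array([[A[i] @ B[:, j] for j in range(2)] for i in range(2)])

assert np.array_equal(C_manual, A @ B)
print(C_manual)   # [[19 22]
                  #  [43 50]]
```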

Counting Parameters

Every weight and bias is a learnable parameter. Layer l connects n_{l-1} inputs to n_l neurons. Each neuron needs n_{l-1} weights plus 1 bias:

\text{params per layer} = n_l \times (n_{l-1} + 1)

n_l: neurons in layer l
n_{l-1}: neurons in layer l-1

Concrete example: MNIST classifier with architecture 784 → 128 → 64 → 10.

  1. Layer 1 (784 → 128): 128 \times 785 = 100{,}480 parameters
  2. Layer 2 (128 → 64): 64 \times 129 = 8{,}256 parameters
  3. Layer 3 (64 → 10): 10 \times 65 = 650 parameters

Total: 109,386 parameters to recognize handwritten digits.
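A short sketch that reproduces this count for the 784 → 128 → 64 → 10 architecture (the helper function name is just for illustration):

```python
def count_params(layer_sizes):
    """Total learnable parameters in a fully connected network.

    Each layer contributes n_l * (n_{l-1} + 1): a weight matrix plus a bias vector.
    """
    return sum(n_out * (n_in + 1)
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(count_params([784, 128, 64, 10]))   # 100480 + 8256 + 650 = 109386
```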

For scale, modern vision models have hundreds of millions of parameters, and GPT-3 has 175 billion. Parameter count is a rough proxy for capacity: how complex a function the network can represent.

The Fully Connected Name

A layer where every neuron connects to every neuron in the previous layer is called fully connected (or dense). The weight matrix has n_l \times n_{l-1} entries: every possible connection exists.

Not all layers need to be fully connected. Convolutional layers (for images) and attention layers (for sequences) impose different connectivity patterns that exploit structure in the data. But fully connected layers are the foundation — every deep learning architecture contains them in some form, especially at the output.

Quiz

Question 1 of 3

A fully connected layer with n_in=10 inputs and n_out=5 neurons has weight matrix W with shape...