The Forward Pass: From Input to Prediction

A complete walkthrough of a forward pass with real numbers, showing how data flows layer by layer and why activations must be cached for backpropagation.


Quick refresher

Matrix-vector multiplication

Multiplying an m×n matrix by an n×1 vector gives an m×1 vector. Each row of the matrix dots with the vector. This computes all neurons in a layer simultaneously.

Example

Layer with W (4×3) and input x (3×1): z = Wx gives z (4×1) — one output per neuron.
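As a quick sanity check, here is a minimal NumPy sketch of the same idea (the specific numbers are made up for illustration):

```python
import numpy as np

# 4x3 weight matrix: 4 neurons, each with 3 input weights
W = np.array([[1., 0., 2.],
              [0., 1., 0.],
              [3., 0., 1.],
              [1., 1., 1.]])
x = np.array([1., 2., 3.])  # 3-dimensional input vector

z = W @ x        # each row of W dots with x
print(z.shape)   # (4,) -- one pre-activation per neuron
print(z)         # [7. 2. 6. 6.]
```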

The Direction of Data Flow

During inference (making a prediction), data flows in one direction: from input to output. You feed in a vector of features; the network transforms it through a series of layers; a prediction emerges. This is called the forward pass.

The forward pass is how every neural network makes every prediction — from image classifiers recognizing faces to language models generating the next token. Understanding it precisely is required before you can understand how gradients flow backward to train the network.

The forward pass also sets up training: you run data forward, observe how wrong the prediction is (the loss), then run gradients backward (backpropagation) to improve. But first, the forward pass.

A Complete Worked Example

Let's trace a specific input through a small network:

  • Input: x \in \mathbb{R}^2 (2 features)
  • Layer 1: 3 neurons, ReLU
  • Layer 2: 2 neurons, ReLU
  • Output: 1 neuron, sigmoid (binary classification)

Specific weights (chosen to make arithmetic clean):

W^{(1)} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ -1 & 1 \end{bmatrix}, \quad b^{(1)} = \mathbf{0}

W^{(2)} = \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix}, \quad b^{(2)} = \mathbf{0}

W^{(3)} = \begin{bmatrix} 2 & -1 \end{bmatrix}, \quad b^{(3)} = 0

Throughout, W^{(l)} and b^{(l)} are the weight matrix and bias vector of layer l, z^{(l)} is that layer's pre-activation, and a^{(l)} is its activated output.

Input: x = [3, 1]^\top

Step 1 - Layer 1 pre-activation:

z^{(1)} = W^{(1)} x = \begin{bmatrix} 3 \\ 1 \\ -2 \end{bmatrix}

Row by row: [1, 0] \cdot [3, 1] = 3, \quad [0, 1] \cdot [3, 1] = 1, \quad [-1, 1] \cdot [3, 1] = -2.

Step 2 - Apply ReLU:

a^{(1)} = \text{ReLU}([3, 1, -2]) = [3, 1, 0]

The -2 is clamped to 0. Neuron 3 is "off" for this input.

Step 3 - Layer 2:

z^{(2)} = W^{(2)} a^{(1)} = [3 - 1 + 0,\; 0 + 1 - 0] = [2, 1]

a^{(2)} = \text{ReLU}([2, 1]) = [2, 1]

(both positive, no change)

Step 4 - Output:

z^{(3)} = [2, -1] \cdot [2, 1] = 3

\hat{y} = \sigma(3) = \frac{1}{1 + e^{-3}} \approx 0.95

where \sigma is the sigmoid activation and \hat{y} is the predicted probability of class 1.

The network predicts a 95% probability of class 1 for input [3, 1].
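The entire walkthrough fits in a few lines of NumPy. This sketch reproduces the numbers above exactly (the biases are omitted since they are all zero in this example):

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

# The weights from the worked example (all biases are zero)
W1 = np.array([[1., 0.], [0., 1.], [-1., 1.]])  # layer 1: 3x2
W2 = np.array([[1., -1., 0.], [0., 1., -1.]])   # layer 2: 2x3
W3 = np.array([[2., -1.]])                      # output:  1x2

x = np.array([3., 1.])
z1 = W1 @ x          # [ 3.  1. -2.]
a1 = relu(z1)        # [3. 1. 0.] -- neuron 3 is off
z2 = W2 @ a1         # [2. 1.]
a2 = relu(z2)        # [2. 1.]
z3 = W3 @ a2         # [3.]
y_hat = sigmoid(z3)  # [0.9526...] -- ~95% probability of class 1
print(y_hat)
```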


The Network as a Composition

The entire forward pass is one big nested function:

\hat{y} = \sigma\Bigl( W^{(3)} \, \text{ReLU}\bigl( W^{(2)} \, \text{ReLU}( W^{(1)} x + b^{(1)} ) + b^{(2)} \bigr) + b^{(3)} \Bigr)

This composition structure is why neural networks are so expressive. Each layer applies a transformation; the output of one becomes the input to the next; the final composition can represent extremely complex mappings.

Think of it like a pipeline of transformations in data engineering: you pipe raw data through a series of operations, each one building on the previous result. Or think of it like nesting functions in mathematics: f(g(h(x))). Here h runs first, then g on its result, then f on that. Every layer in a neural network is one more nesting level.
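To make the nesting concrete, here is the same three-layer network written literally as f(g(h(x))) (a sketch reusing the example's weights; the function names are just for illustration):

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

W1 = np.array([[1., 0.], [0., 1.], [-1., 1.]])
W2 = np.array([[1., -1., 0.], [0., 1., -1.]])
W3 = np.array([[2., -1.]])

h = lambda x: relu(W1 @ x)     # layer 1 runs first
g = lambda a: relu(W2 @ a)     # then layer 2 on its result
f = lambda a: sigmoid(W3 @ a)  # then the output layer on that

print(f(g(h(np.array([3., 1.])))))  # ~[0.9526], same prediction as before
```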

Composition gives us depth; depth gives us expressive power.

What Gets Cached and Why

During a forward pass for training (not just inference), you must save every intermediate value: z^{(1)}, a^{(1)}, z^{(2)}, a^{(2)}, z^{(3)}. You will need all of them in the backward pass.

In our example: cache [3, 1, -2], [3, 1, 0], [2, 1], [2, 1], and [3]. These five vectors represent the memory of the forward pass that backpropagation will use.
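A sketch of what a training-mode forward pass might look like with explicit caching (the cache layout is illustrative; real frameworks do this bookkeeping automatically):

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

W1 = np.array([[1., 0.], [0., 1.], [-1., 1.]])
W2 = np.array([[1., -1., 0.], [0., 1., -1.]])
W3 = np.array([[2., -1.]])

def forward_with_cache(x):
    cache = {}
    cache["z1"] = W1 @ x
    cache["a1"] = relu(cache["z1"])
    cache["z2"] = W2 @ cache["a1"]
    cache["a2"] = relu(cache["z2"])
    cache["z3"] = W3 @ cache["a2"]
    y_hat = sigmoid(cache["z3"])
    return y_hat, cache  # the backward pass will read every cached value

y_hat, cache = forward_with_cache(np.array([3., 1.]))
print(cache["z1"], cache["a1"])  # [ 3.  1. -2.] [3. 1. 0.]
```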

The memory cost is O(L \times n \times B), where L is the depth, n is the average layer width, and B is the batch size. For very deep or wide networks with large batches, this can be tens of gigabytes.
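A back-of-the-envelope check of that formula; the depth, width, and batch size below are hypothetical but realistic for a mid-sized model:

```python
L, n, B = 100, 4096, 1024   # hypothetical depth, average width, batch size
bytes_per_float = 4         # float32
tensors_per_layer = 2       # caching both z and a at every layer

total_bytes = L * n * B * bytes_per_float * tensors_per_layer
print(f"{total_bytes / 1e9:.1f} GB")  # ~3.4 GB; wider layers or bigger batches reach tens of GB
```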

The Universal Approximation Theorem

Here is a remarkable fact: a neural network with one hidden layer and enough neurons can approximate any continuous function on a bounded domain to arbitrary precision.

This does not mean one hidden layer is always best. In practice, depth helps enormously — a shallow but very wide network might need exponentially more neurons than a deeper network to represent the same function. But theoretically, depth is not required for approximation ability.
