The Forward Pass: From Input to Prediction

A complete walkthrough of a forward pass with real numbers, showing how data flows layer by layer and why activations must be cached for backpropagation.


Quick refresher

Matrix-vector multiplication

Multiplying an m×n matrix by an n×1 vector gives an m×1 vector. Each row of the matrix dots with the vector. This computes all neurons in a layer simultaneously.

Example

Layer with W (4×3) and input x (3×1): z = Wx gives z (4×1) — one output per neuron.
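As a quick sanity check, here is a minimal NumPy sketch of the same idea (the specific numbers are made up for illustration):

```python
import numpy as np

# 4x3 weight matrix: 4 neurons, each with 3 input weights
W = np.array([[1., 0., 2.],
              [0., 1., 0.],
              [3., 0., 1.],
              [1., 1., 1.]])
x = np.array([1., 2., 3.])  # 3-dimensional input vector

z = W @ x        # each row of W dots with x
print(z.shape)   # (4,) -- one pre-activation per neuron
print(z)         # [7. 2. 6. 6.]
```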

The Direction of Data Flow

During inference (making a prediction), data flows in one direction: from input to output. You feed in a vector of features; the network transforms it through a series of layers; a prediction emerges. This is called the forward pass.

The forward pass is how every neural network makes every prediction — from image classifiers recognizing faces to language models generating the next token. Understanding it precisely is required before you can understand how gradients flow backward to train the network.

The forward pass also sets up training: you run data forward, observe how wrong the prediction is (the loss), then run gradients backward (backpropagation) to improve. But first, the forward pass.

A Complete Worked Example

Let's trace a specific input through a small network:

  • Input: x \in \mathbb{R}^2 (2 features)
  • Layer 1: 3 neurons, ReLU
  • Layer 2: 2 neurons, ReLU
  • Output: 1 neuron, sigmoid (binary classification)

Specific weights (chosen to make arithmetic clean):

W^{(1)} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ -1 & 1 \end{bmatrix}, \quad b^{(1)} = \mathbf{0}

W^{(2)} = \begin{bmatrix} 1 & -1 & 0 \\ 0 & 1 & -1 \end{bmatrix}, \quad b^{(2)} = \mathbf{0}

W^{(3)} = \begin{bmatrix} 2 & -1 \end{bmatrix}, \quad b^{(3)} = 0

Throughout, W^{(l)} and b^{(l)} are the weight matrix and bias vector of layer l, z^{(l)} is that layer's pre-activation, and a^{(l)} is its activated output.

Input: x = [3, 1]^\top

Step 1 - Layer 1 pre-activation:

z^{(1)} = W^{(1)} x = \begin{bmatrix} 3 \\ 1 \\ -2 \end{bmatrix}

Row by row: [1, 0] \cdot [3, 1] = 3, \quad [0, 1] \cdot [3, 1] = 1, \quad [-1, 1] \cdot [3, 1] = -2.

Step 2 - Apply ReLU:

a^{(1)} = \text{ReLU}([3, 1, -2]) = [3, 1, 0]

The -2 is clamped to 0. Neuron 3 is "off" for this input.

Step 3 - Layer 2:

z^{(2)} = W^{(2)} a^{(1)} = [3 - 1 + 0,\; 0 + 1 - 0] = [2, 1]

a^{(2)} = \text{ReLU}([2, 1]) = [2, 1]

(both positive, no change)

Step 4 - Output:

z^{(3)} = [2, -1] \cdot [2, 1] = 3

\hat{y} = \sigma(3) = \frac{1}{1 + e^{-3}} \approx 0.95

where \sigma is the sigmoid activation and \hat{y} is the predicted probability of class 1.

The network predicts a 95% probability of class 1 for input [3, 1].
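The entire walkthrough fits in a few lines of NumPy. This sketch reproduces the numbers above exactly (the biases are omitted since they are all zero in this example):

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

# The weights from the worked example (all biases are zero)
W1 = np.array([[1., 0.], [0., 1.], [-1., 1.]])  # layer 1: 3x2
W2 = np.array([[1., -1., 0.], [0., 1., -1.]])   # layer 2: 2x3
W3 = np.array([[2., -1.]])                      # output:  1x2

x = np.array([3., 1.])
z1 = W1 @ x          # [ 3.  1. -2.]
a1 = relu(z1)        # [3. 1. 0.] -- neuron 3 is off
z2 = W2 @ a1         # [2. 1.]
a2 = relu(z2)        # [2. 1.]
z3 = W3 @ a2         # [3.]
y_hat = sigmoid(z3)  # [0.9526...] -- ~95% probability of class 1
print(y_hat)
```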


The Network as a Composition

The entire forward pass is one big nested function:

\hat{y} = \sigma\Bigl( W^{(3)} \, \text{ReLU}\bigl( W^{(2)} \, \text{ReLU}( W^{(1)} x + b^{(1)} ) + b^{(2)} \bigr) + b^{(3)} \Bigr)

This composition structure is why neural networks are so expressive. Each layer applies a transformation; the output of one becomes the input to the next; the final composition can represent extremely complex mappings.

Think of it like a pipeline of transformations in data engineering: you pipe raw data through a series of operations, each one building on the previous result. Or think of it like nesting functions in mathematics: f(g(h(x))). Here h runs first, then g on its result, then f on that. Every layer in a neural network is one more nesting level.
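To make the nesting concrete, here is the same three-layer network written literally as f(g(h(x))) (a sketch reusing the example's weights; the function names are just for illustration):

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

W1 = np.array([[1., 0.], [0., 1.], [-1., 1.]])
W2 = np.array([[1., -1., 0.], [0., 1., -1.]])
W3 = np.array([[2., -1.]])

h = lambda x: relu(W1 @ x)     # layer 1 runs first
g = lambda a: relu(W2 @ a)     # then layer 2 on its result
f = lambda a: sigmoid(W3 @ a)  # then the output layer on that

print(f(g(h(np.array([3., 1.])))))  # ~[0.9526], same prediction as before
```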

Composition gives us depth; depth gives us expressive power.

What Gets Cached and Why

During a forward pass for training (not just inference), you must save every intermediate value: z^{(1)}, a^{(1)}, z^{(2)}, a^{(2)}, z^{(3)}. You will need all of them in the backward pass.

In our example: cache [3, 1, -2], [3, 1, 0], [2, 1], [2, 1], and [3]. These five vectors represent the memory of the forward pass that backpropagation will use.
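A sketch of what a training-mode forward pass might look like with explicit caching (the cache layout is illustrative; real frameworks do this bookkeeping automatically):

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

def sigmoid(v):
    return 1 / (1 + np.exp(-v))

W1 = np.array([[1., 0.], [0., 1.], [-1., 1.]])
W2 = np.array([[1., -1., 0.], [0., 1., -1.]])
W3 = np.array([[2., -1.]])

def forward_with_cache(x):
    cache = {}
    cache["z1"] = W1 @ x
    cache["a1"] = relu(cache["z1"])
    cache["z2"] = W2 @ cache["a1"]
    cache["a2"] = relu(cache["z2"])
    cache["z3"] = W3 @ cache["a2"]
    y_hat = sigmoid(cache["z3"])
    return y_hat, cache  # the backward pass will read every cached value

y_hat, cache = forward_with_cache(np.array([3., 1.]))
print(cache["z1"], cache["a1"])  # [ 3.  1. -2.] [3. 1. 0.]
```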

The memory cost is O(L \times n \times B), where L is the depth, n is the average layer width, and B is the batch size. For very deep or wide networks with large batches, this can be tens of gigabytes.
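A back-of-the-envelope check of that formula; the depth, width, and batch size below are hypothetical but realistic for a mid-sized model:

```python
L, n, B = 100, 4096, 1024   # hypothetical depth, average width, batch size
bytes_per_float = 4         # float32
tensors_per_layer = 2       # caching both z and a at every layer

total_bytes = L * n * B * bytes_per_float * tensors_per_layer
print(f"{total_bytes / 1e9:.1f} GB")  # ~3.4 GB; wider layers or bigger batches reach tens of GB
```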

The Universal Approximation Theorem

Here is a remarkable fact: a neural network with one hidden layer and enough neurons can approximate any continuous function on a bounded domain to arbitrary precision.

This does not mean one hidden layer is always best. In practice, depth helps enormously — a shallow but very wide network might need exponentially more neurons than a deeper network to represent the same function. But theoretically, depth is not required for approximation ability.
