Math Foundation Vectors & Matrices
Lesson 3 ⏱ 14 min

Matrices and multiplication


Matrices: Grids That Transform Data

Matrix-vector and matrix-matrix multiplication from scratch. How a neural network layer is a matrix multiply, and why GPUs exist to do this fast.


Quick refresher

Dot product

The dot product of two vectors a and b is Σᵢ aᵢbᵢ - multiply corresponding elements and sum. It equals ||a||·||b||·cos(θ). Perpendicular vectors have dot product 0.

Example

[1, 2, 3]·[4, 5, 6] = 1·4 + 2·5 + 3·6 = 4+10+18 = 32.
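The same arithmetic is a one-liner in NumPy (the library used in the code later in this lesson); a quick sketch checking the example above:

```python
import numpy as np

# Dot product: multiply corresponding elements and sum.
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

manual = sum(ai * bi for ai, bi in zip(a, b))   # 1*4 + 2*5 + 3*6
print(manual)        # → 32
print(np.dot(a, b))  # → 32
print(a @ b)         # → 32 (@ also acts as the dot product on 1-D arrays)
```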

A Matrix Is a Grid of Numbers

Think of a spreadsheet. Rows and columns of numbers. That is a matrix.

Every neural network layer is a matrix multiplication: the layer takes in a vector and multiplies it by a weight matrix to produce the next layer's input. Every dataset is a matrix where rows are examples and columns are features. If vectors are ML's individual data points, matrices are how you process entire batches at once.

A matrix with m rows and n columns is called an m×n matrix (read "m by n"). Example of a 2×3 matrix:

\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}
A_{ij}
entry in row i, column j - row index first, column index second

Entry A_{ij} is in row i, column j. So A_{12} = 2 (row 1, column 2) and A_{23} = 6 (row 2, column 3). Row first, column second - always.
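The same indexing in code, with one caveat: NumPy counts from 0, so math's A₁₂ becomes A[0, 1]. A minimal sketch using the 2×3 matrix above:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # the 2×3 example matrix

# NumPy is 0-indexed: "row 1, column 2" in math notation is A[0, 1].
# Row index first, column index second - same convention as the math.
print(A[0, 1])   # → 2  (A₁₂ in 1-based math notation)
print(A[1, 2])   # → 6  (A₂₃)
print(A.shape)   # → (2, 3), i.e. 2 rows, 3 columns
```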

Matrices in ML

Your training dataset is a matrix. With 100 examples each having 10 features, you stack them into a 100×10 matrix where each row is one example:

\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}
x_{ij}
feature j of example i
n
number of examples
p
number of features

Row ii is the feature vector for example ii. Column jj contains feature jj across all examples. This is the universal data format in ML.
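A short sketch of this convention in NumPy, using random data (the 100×10 shape matches the example above; the data itself is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))   # 100 examples, 10 features each

row = X[4]       # row 4 (0-indexed): the feature vector of one example
col = X[:, 2]    # column 2: one feature across all 100 examples

print(X.shape)    # → (100, 10)
print(row.shape)  # → (10,)
print(col.shape)  # → (100,)
```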

Matrix-Vector Multiplication

Multiplying an m×n matrix by an n-dimensional column vector produces an m-dimensional column vector. The rule: each row of the matrix dots with the vector to produce one output number.

\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 1x + 2y \\ 3x + 4y \end{bmatrix}
W
weight matrix - m rows, n columns
x
input vector - n elements
y
output vector - m elements

Concrete example:

\begin{bmatrix} 1 & 2 \\ 5 & 6 \end{bmatrix} \begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 \cdot 3 + 2 \cdot 4 \\ 5 \cdot 3 + 6 \cdot 4 \end{bmatrix} = \begin{bmatrix} 11 \\ 39 \end{bmatrix}
A
2×2 matrix
v
2-element vector

Shape rule: an (m × n) matrix times an (n × 1) vector gives an (m × 1) vector. The n must match - the number of columns in the matrix must equal the number of rows in the vector.
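The rule can be sketched directly in code: one loop over rows, one dot product per row. A hand-rolled version checked against NumPy's built-in (the function name matvec is just illustrative):

```python
import numpy as np

def matvec(W, x):
    """Matrix-vector product: each row of W dots with x to give one number."""
    m, n = W.shape
    assert n == len(x), "columns of W must equal length of x"
    y = np.zeros(m)
    for i in range(m):
        y[i] = W[i] @ x   # dot product of row i with the vector
    return y

A = np.array([[1, 2],
              [5, 6]])
v = np.array([3, 4])
print(matvec(A, v))   # → [11. 39.]
print(A @ v)          # → [11 39] (NumPy's built-in, same numbers)
```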

Matrix-Matrix Multiplication

You can multiply two matrices together when the shapes are compatible. For A (m×k) times B (k×n), the result is C (m×n).

The entry-by-entry rule: C_{ij} = (row i of A) · (column j of B).

\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 1 \cdot 5 + 2 \cdot 7 & 1 \cdot 6 + 2 \cdot 8 \\ 3 \cdot 5 + 4 \cdot 7 & 3 \cdot 6 + 4 \cdot 8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}
C_{ij}
entry in row i, column j of the result

Check: (2 × 2) × (2 × 2) → (2 × 2).

Memory trick: (m × k) × (k × n) → (m × n). The two k's cancel; what remains are the outer dimensions.
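The entry-by-entry rule translates to a double loop, one iteration per output cell. A sketch that reproduces the worked example and checks it against NumPy's @ (the function name matmul here is illustrative, not NumPy's own):

```python
import numpy as np

def matmul(A, B):
    """C[i][j] = (row i of A) · (column j of B)."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            C[i, j] = A[i, :] @ B[:, j]   # one dot product per output cell
    return C

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = matmul(A, B)
print(C)
# → [[19. 22.]
#    [43. 50.]]
print(np.array_equal(C, A @ B))   # → True
```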

[Interactive: Matrix Multiplication — hover output cells to see how each entry of C = AB (19, 22, 43, 50) is computed]

Each output cell C[i][j] is the dot product of row i from A with column j from B — which is why the inner dimensions must match. In a neural network, this is how all inputs combine with all weights in one operation.

Why Matrix Multiplication Is Everywhere in ML

A fully connected neural network layer is matrix multiplication:

\mathbf{a} = \mathbf{W}\mathbf{x} + \mathbf{b}
W
weight matrix - shape (neurons_out × neurons_in)
x
input vector
b
bias vector

Every unit in the output receives a weighted sum of all inputs - which is exactly what this matrix multiplication computes. The bias b\mathbf{b} is a vector added afterward.

Processing a full batch:

If you have a batch of 32 inputs, each 128-dimensional, stack them as a matrix B of shape 32×128. With a weight matrix W of shape 64×128, compute B Wᵀ to get a 32×64 result - one 64-dimensional output per example. All 32 predictions compute in parallel on the GPU.

The full forward pass of a 3-layer network:

\mathbf{z}_1 = \mathbf{W}_1 \mathbf{x} + \mathbf{b}_1, \quad \mathbf{a}_1 = \text{ReLU}(\mathbf{z}_1)
W_1
weight matrix of layer 1
a_1
activations after layer 1
\mathbf{z}_2 = \mathbf{W}_2 \mathbf{a}_1 + \mathbf{b}_2, \quad \hat{\mathbf{y}} = \mathbf{W}_3\,\text{ReLU}(\mathbf{z}_2) + \mathbf{b}_3
W_2
weight matrix of layer 2
W_3
weight matrix of output layer
ŷ
final prediction

Each layer is one matrix multiplication plus a bias addition. The entire forward pass is a chain of matrix ops - which is why PyTorch and TensorFlow are fundamentally matrix computation libraries with automatic differentiation built on top.

import numpy as np

# Matrix creation
A = np.array([[1, 2], [3, 4], [5, 6]])   # 3×2 matrix
print(A.shape)   # → (3, 2)

# Matrix-vector multiplication
W = np.array([[0.1, -0.2, 0.3],
              [0.4,  0.5, 0.6]])   # 2×3 weight matrix
x = np.array([1.0, 2.0, 3.0])     # 3-dim input
z = W @ x                          # → 2-dim output (W·x)
print(z)

# Batch: process 4 examples at once (shape 4×3)
X_batch = np.random.randn(4, 3)
Z_batch = X_batch @ W.T            # → shape (4, 2)
print(Z_batch.shape)
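The 3-layer forward pass from the equations above can also be sketched directly. The layer sizes (4 → 8 → 8 → 2) and random weights here are illustrative assumptions, not values from the lesson:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

rng = np.random.default_rng(42)
# Hypothetical layer sizes: 4 inputs → 8 hidden → 8 hidden → 2 outputs
W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = rng.standard_normal((8, 8)), np.zeros(8)
W3, b3 = rng.standard_normal((2, 8)), np.zeros(2)

def forward(x):
    a1 = relu(W1 @ x + b1)        # layer 1: matrix multiply + bias + ReLU
    z2 = W2 @ a1 + b2             # layer 2
    y_hat = W3 @ relu(z2) + b3    # output layer
    return y_hat

x = rng.standard_normal(4)
print(forward(x).shape)   # → (2,)
```

Note how every line of forward is a matrix-vector multiply plus a bias vector, exactly matching the equations.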

Quiz

1 / 3

A 2×3 matrix multiplied by a 3×1 vector produces a...