Transpose: Flipping a Matrix
The transpose Aᵀ of a matrix A swaps rows and columns. If A is m×n, then Aᵀ is n×m.
Transpose and inverse are the two matrix operations that appear most often in ML outside of multiplication itself. The transpose shows up in every gradient formula for a linear layer; the inverse appears in the normal equation for linear regression and in probabilistic models built on covariance matrices.
Rule: entry (Aᵀ)ᵢⱼ = Aⱼᵢ. Row i of Aᵀ is column i of A.
- A - original 2×3 matrix
- Aᵀ - transposed 3×2 matrix
Here, A is 2×3 and Aᵀ is 3×2. Row 1 of A became column 1 of Aᵀ.
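As a quick check of the rule (a minimal sketch, using the same 2×3 matrix as the code at the end of this section), NumPy's .T attribute lets you confirm that every entry of Aᵀ is the mirrored entry of A:
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])            # 2×3 matrix

# (Aᵀ)ᵢⱼ = Aⱼᵢ: each entry of A.T is the mirrored entry of A
for i in range(A.T.shape[0]):        # rows of Aᵀ
    for j in range(A.T.shape[1]):    # columns of Aᵀ
        assert A.T[i, j] == A[j, i]

print(A.T)  # rows of A have become columns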
For vectors: transposing a column vector gives a row vector. If x is an n×1 column vector, then xᵀ is a 1×n row vector. The dot product x·y can be written as the matrix product xᵀy (row vector times column vector = scalar).
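A minimal sketch of that row-vector-times-column-vector view of the dot product (the specific numbers here are just for illustration):
import numpy as np

x = np.array([[1.0], [2.0], [3.0]])   # column vector, shape (3, 1)
y = np.array([[4.0], [5.0], [6.0]])   # column vector, shape (3, 1)

# xᵀy: a (1, 3) row vector times a (3, 1) column vector gives a 1×1 result
print(x.T @ y)                        # [[32.]]

# The same value as a plain scalar dot product
print(np.dot(x.ravel(), y.ravel()))   # 32.0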
Key property: (AB)ᵀ = BᵀAᵀ
- A - first matrix
- B - second matrix
Note the order reverses. Think of putting on shoes and socks: you put socks on then shoes (AB), but to undo it you take shoes off first, then socks (BᵀAᵀ). Forgetting the reversal is a common source of dimension errors.
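A short numerical check of the reversal (a sketch with arbitrary random matrices; the shapes are chosen so that only the reversed order is even defined):
import numpy as np

A = np.random.randn(2, 3)   # 2×3
B = np.random.randn(3, 4)   # 3×4, so AB is 2×4 and (AB)ᵀ is 4×2

# (AB)ᵀ = BᵀAᵀ - note the reversed order
print(np.allclose((A @ B).T, B.T @ A.T))   # True

# The non-reversed product would not even have compatible shapes here:
# Aᵀ is 3×2 and Bᵀ is 4×3, so A.T @ B.T raises a ValueError.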
The Identity Matrix
Before getting to inverses, you need the identity matrix I. It is the matrix version of the number 1.
The n×n identity matrix I has 1s on the main diagonal and 0s everywhere else:
Iᵢⱼ = δᵢⱼ
- δᵢⱼ - Kronecker delta - 1 if i=j, 0 otherwise
Key property: AI = IA = A for any matrix A with compatible shapes. Multiplying by I leaves a matrix completely unchanged - exactly like multiplying a number by 1.
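In NumPy the n×n identity is np.eye(n); a minimal sketch of the key property:
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
I = np.eye(2)                  # 2×2 identity matrix

print(np.allclose(A @ I, A))   # True - multiplying by I changes nothing
print(np.allclose(I @ A, A))   # True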
The Matrix Inverse
For a scalar a, the inverse is 1/a: multiplying a by 1/a gives 1. For a square matrix A, the inverse A⁻¹ satisfies:
AA⁻¹ = A⁻¹A = I
- A - square matrix
- A⁻¹ - inverse of A
- I - identity matrix
Restrictions:
- Must be square. A 2×3 matrix has no inverse.
- Must be non-singular. The determinant must be non-zero. A singular matrix "collapses" some dimensions, which cannot be undone.
Example: for A = [[2, 1], [1, 1]], the determinant is 1 and the inverse is A⁻¹ = [[1, -1], [-1, 2]]. Check: AA⁻¹ = [[1, 0], [0, 1]] = I ✓
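The same check in NumPy (a sketch using the small 2×2 example above):
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])                 # det(A) = 1, so A is invertible
A_inv = np.linalg.inv(A)

print(A_inv)                               # [[ 1. -1.], [-1.  2.]]
print(np.allclose(A @ A_inv, np.eye(2)))   # True - AA⁻¹ = I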
In practice, use np.linalg.solve(A, b) to solve Ax = b rather than computing A⁻¹ explicitly - it is more numerically stable.
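A minimal sketch of the difference, using a small system Ax = b with made-up values:
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
b = np.array([3.0, 2.0])

x_via_inv = np.linalg.inv(A) @ b     # works, but forms A⁻¹ explicitly
x_via_solve = np.linalg.solve(A, b)  # preferred: solves Ax = b directly

print(np.allclose(x_via_inv, x_via_solve))  # True
print(x_via_solve)                          # [1. 1.]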
The Normal Equation
Transpose and inverse come together in one of the most elegant results in ML: the Normal Equation for linear regression.
Setting the gradient of MSE to zero and solving analytically gives the optimal weights in one formula:
w* = (XᵀX)⁻¹Xᵀy
- X - data matrix - n examples by p features
- y - target vector - n labels
- w* - optimal weight vector
Reading each piece:
- The term Xᵀy (shape p×1): how much each feature correlates with the target
- The term XᵀX (shape p×p): the feature covariance matrix - how features relate to each other
- The term (XᵀX)⁻¹: normalizes out feature-feature correlations
The result is the exact optimal weights in a single computation - no gradient descent, no iteration. The catch: inverting XᵀX costs O(p³) operations. For p = 10,000 features that is on the order of 10¹² operations - far too expensive to be practical. That is why we use gradient descent for large models.
import numpy as np
# Transpose
A = np.array([[1, 2, 3],
              [4, 5, 6]])  # shape (2, 3)
print(A.T.shape) # → (3, 2)
# Normal equation: w* = (XᵀX)⁻¹ Xᵀy
X = np.array([[1, 1.0],
              [1, 2.0],
              [1, 3.0],
              [1, 4.0]])  # design matrix (intercept + one feature)
y = np.array([2.1, 3.9, 6.2, 7.8])
# Option 1: explicit inverse (only for small p)
w_star = np.linalg.inv(X.T @ X) @ X.T @ y
print("weights:", w_star) # → intercept ≈ 0.3, slope ≈ 1.93
# Option 2: lstsq — numerically stable, preferred in practice
w_star2, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print("weights (lstsq):", w_star2) # same answer, more stable
Interactive example
Normal equation demo - adjust a 2D dataset and watch the closed-form optimal weights update
Coming soon