Transpose and Inverse: Matrix Operations That Appear in Every Gradient

Flipping matrices with transpose, the identity matrix, inverting square matrices, and the normal equation as a complete ML example.


Quick refresher

Matrix multiplication

For A (m×k) times B (k×n), the result is m×n. Entry C_ij = row i of A dotted with column j of B. Inner dimensions must match. AB ≠ BA in general.

Example

[[1,2],[3,4]] × [[5,6],[7,8]] = [[19,22],[43,50]].

(2×2)(2×2) → (2×2).
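To make the refresher concrete, here is a quick NumPy check of the same product (the array values come from the example above):

import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A @ B)                         # → [[19 22] [43 50]]
print((A @ B).shape)                 # → (2, 2): inner dimensions matched
print(np.array_equal(A @ B, B @ A))  # → False: AB ≠ BA in general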

Transpose: Flipping a Matrix

The transpose $\mathbf{A}^\top$ of a matrix swaps rows and columns. If $\mathbf{A}$ is m×n, then $\mathbf{A}^\top$ is n×m.

Transpose and inverse are the two matrix operations that appear most often in ML outside of multiplication itself. The transpose shows up in every gradient formula for a linear layer; the inverse appears in the normal equation for linear regression and in probabilistic models built on covariance matrices.

Rule: entry $(\mathbf{A}^\top)_{ij} = A_{ji}$. Row $i$ of $\mathbf{A}^\top$ is column $i$ of $\mathbf{A}$.

$$\mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \quad \Rightarrow \quad \mathbf{A}^\top = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix}$$

  • $\mathbf{A}$: original 2×3 matrix
  • $\mathbf{A}^\top$: transposed 3×2 matrix

Here, $\mathbf{A}$ is 2×3 and $\mathbf{A}^\top$ is 3×2. Row 1 of $\mathbf{A}$ ($[1, 2, 3]$) became column 1 of $\mathbf{A}^\top$.

For vectors: transposing a column vector gives a row vector. If $\mathbf{x}$ is a column vector $[x_1, x_2, x_3]^\top$, then $\mathbf{x}^\top = [x_1, x_2, x_3]$ is a row vector. The dot product $\mathbf{x} \cdot \mathbf{y}$ can be written as the matrix product $\mathbf{x}^\top \mathbf{y}$ (row vector times column vector = scalar).
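A quick NumPy sketch of the same idea. Note that a 1-D NumPy array has no row/column orientation, so the column vectors below are written explicitly as (3, 1) arrays:

import numpy as np

x = np.array([[1.0], [2.0], [3.0]])  # column vector, shape (3, 1)
y = np.array([[4.0], [5.0], [6.0]])  # column vector, shape (3, 1)

print(x.T.shape)                     # → (1, 3): row vector
print(x.T @ y)                       # → [[32.]]  (1×3)(3×1) = 1×1 "scalar"
print(np.dot(x.ravel(), y.ravel()))  # → 32.0, same value as a plain dot product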

Key property:

$$(\mathbf{A}\mathbf{B})^\top = \mathbf{B}^\top \mathbf{A}^\top$$

  • $\mathbf{A}$: first matrix
  • $\mathbf{B}$: second matrix

Note the order reverses. Think of putting on shoes and socks: you put socks on then shoes ($\mathbf{AB}$), but to undo it you take shoes off first, then socks ($\mathbf{B}^\top \mathbf{A}^\top$). Forgetting the reversal is a common source of dimension errors.
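You can verify the reversal numerically; a minimal sketch with random matrices:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))

print(np.allclose((A @ B).T, B.T @ A.T))  # → True: the order reverses
# A.T @ B.T would be (3×2)(4×3): inner dimensions don't match, so it raises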

The Identity Matrix

Before getting to inverses, you need the identity matrix $\mathbf{I}$. It is the matrix version of the number 1.

The n×n identity matrix has 1s on the main diagonal and 0s everywhere else:

$$\mathbf{I}_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

  • $\delta_{ij}$: Kronecker delta; entry $(i, j)$ of $\mathbf{I}$ is 1 if $i = j$, 0 otherwise

Key property: $\mathbf{AI} = \mathbf{IA} = \mathbf{A}$ for any matrix $\mathbf{A}$ with compatible shapes. Multiplying by $\mathbf{I}$ leaves a matrix completely unchanged - exactly like multiplying a number by 1.
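In NumPy the identity matrix is np.eye(n); a quick check of the property:

import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
I = np.eye(2)  # 2×2 identity

print(np.array_equal(A @ I, A))  # → True
print(np.array_equal(I @ A, A))  # → True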

The Matrix Inverse

For a scalar $x$, the inverse is $1/x$: multiplying gives $x \cdot (1/x) = 1$. For a square matrix $\mathbf{A}$, the inverse $\mathbf{A}^{-1}$ satisfies:

$$\mathbf{A}\,\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$$

  • $\mathbf{A}$: square matrix
  • $\mathbf{A}^{-1}$: inverse of $\mathbf{A}$
  • $\mathbf{I}$: identity matrix

Restrictions:

  1. Must be square. A 2×3 matrix has no inverse.
  2. Must be non-singular. The determinant must be non-zero. A singular matrix "collapses" some dimensions, which cannot be undone.

Example: For $\mathbf{A} = \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}$, the inverse is $\mathbf{A}^{-1} = \begin{bmatrix} 0.5 & 0 \\ 0 & 1/3 \end{bmatrix}$. Check: $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$.

In practice, use np.linalg.solve(A, b) to solve $\mathbf{Ax} = \mathbf{b}$ rather than computing $\mathbf{A}^{-1}$ explicitly - it is more numerically stable.
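A minimal sketch of both approaches on the diagonal example above; the solve call never forms $\mathbf{A}^{-1}$ at all:

import numpy as np

A = np.array([[2.0, 0.0], [0.0, 3.0]])
b = np.array([4.0, 9.0])

A_inv = np.linalg.inv(A)
print(A_inv)                              # → [[0.5 0.] [0. 0.333...]]
print(np.allclose(A @ A_inv, np.eye(2)))  # → True: AA⁻¹ = I

print(A_inv @ b)                          # → [2. 3.], via explicit inverse
print(np.linalg.solve(A, b))              # → [2. 3.], preferred: no explicit inverse

# A singular matrix has no inverse; np.linalg.inv raises LinAlgError
S = np.array([[1.0, 2.0], [2.0, 4.0]])    # second row = 2 × first row
print(np.linalg.det(S))                   # → 0.0 (up to floating point)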

The Normal Equation

Transpose and inverse come together in one of the most elegant results in ML: the Normal Equation for linear regression.

Setting the gradient of MSE to zero and solving analytically gives the optimal weights in one formula:

$$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$

  • $\mathbf{X}$: data matrix - n examples by p features
  • $\mathbf{y}$: target vector - n labels
  • $\mathbf{w}^*$: optimal weight vector

Reading each piece:

  • The term $\mathbf{X}^\top \mathbf{y}$ (shape p×1): how much each feature correlates with the target
  • The term $\mathbf{X}^\top \mathbf{X}$ (shape p×p): the feature Gram matrix (the covariance matrix, up to centering) - how features relate to each other
  • The term $(\mathbf{X}^\top \mathbf{X})^{-1}$: normalizes out feature-feature correlations

The result is the exact optimal weights in a single computation - no gradient descent, no iteration. The catch: computing $(\mathbf{X}^\top \mathbf{X})^{-1}$ costs $O(p^3)$ operations. For p = 10,000 features that is $10^{12}$ operations - prohibitively expensive in practice. That is why we use gradient descent for large models.

import numpy as np

# Transpose
A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)
print(A.T.shape)            # → (3, 2)

# Normal equation: w* = (XᵀX)⁻¹ Xᵀy
X = np.array([[1, 1.0],
              [1, 2.0],
              [1, 3.0],
              [1, 4.0]])   # design matrix (intercept + one feature)
y = np.array([2.1, 3.9, 6.2, 7.8])

# Option 1: explicit inverse (only for small p)
w_star = np.linalg.inv(X.T @ X) @ X.T @ y
print("weights:", w_star)   # → intercept ≈ 0.3, slope ≈ 1.93

# Option 2: lstsq — numerically stable, preferred in practice
w_star2, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print("weights (lstsq):", w_star2)  # same answer, more stable

