Transpose: Flipping a Matrix
The transpose Aᵀ of a matrix A swaps rows and columns. If A is m×n, then Aᵀ is n×m.
Transpose and inverse are the two matrix operations that appear most often in ML outside of multiplication itself. The transpose shows up in every gradient formula for a linear layer; the inverse appears in the normal equation for linear regression and in probabilistic models built on covariance matrices.
Rule: entry (Aᵀ)ᵢⱼ = Aⱼᵢ. Row i of Aᵀ is column i of A.
- A - original 2×3 matrix
- Aᵀ - transposed 3×2 matrix
Here, A is 2×3 and Aᵀ is 3×2. Row 1 of A became column 1 of Aᵀ.
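As a quick check of the rule (a minimal sketch, using the same 2×3 matrix as the code at the end of this section), NumPy's .T attribute lets you confirm that every entry of Aᵀ is the mirrored entry of A:
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])            # 2×3 matrix

# (Aᵀ)ᵢⱼ = Aⱼᵢ: each entry of A.T is the mirrored entry of A
for i in range(A.T.shape[0]):        # rows of Aᵀ
    for j in range(A.T.shape[1]):    # columns of Aᵀ
        assert A.T[i, j] == A[j, i]

print(A.T)  # rows of A have become columns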
For vectors: transposing a column vector gives a row vector. If x is an n×1 column vector, then xᵀ is a 1×n row vector. The dot product x·y can be written as the matrix product xᵀy (row vector times column vector = scalar).
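A minimal sketch of that row-vector-times-column-vector view of the dot product (the specific numbers here are just for illustration):
import numpy as np

x = np.array([[1.0], [2.0], [3.0]])   # column vector, shape (3, 1)
y = np.array([[4.0], [5.0], [6.0]])   # column vector, shape (3, 1)

# xᵀy: a (1, 3) row vector times a (3, 1) column vector gives a 1×1 result
print(x.T @ y)                        # [[32.]]

# The same value as a plain scalar dot product
print(np.dot(x.ravel(), y.ravel()))   # 32.0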
Key property: (AB)ᵀ = BᵀAᵀ
- A - first matrix
- B - second matrix
Note the order reverses. Think of putting on shoes and socks: you put socks on then shoes (AB), but to undo it you take shoes off first, then socks (BᵀAᵀ). Forgetting the reversal is a common source of dimension errors.
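A short numerical check of the reversal (a sketch with arbitrary random matrices; the shapes are chosen so that only the reversed order is even defined):
import numpy as np

A = np.random.randn(2, 3)   # 2×3
B = np.random.randn(3, 4)   # 3×4, so AB is 2×4 and (AB)ᵀ is 4×2

# (AB)ᵀ = BᵀAᵀ - note the reversed order
print(np.allclose((A @ B).T, B.T @ A.T))   # True

# The non-reversed product would not even have compatible shapes here:
# Aᵀ is 3×2 and Bᵀ is 4×3, so A.T @ B.T raises a ValueError.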
The Identity Matrix
Before getting to inverses, you need the identity matrix I. It is the matrix version of the number 1.
The n×n identity matrix I has 1s on the main diagonal and 0s everywhere else:
Iᵢⱼ = δᵢⱼ
- δᵢⱼ - Kronecker delta - 1 if i=j, 0 otherwise
Key property: AI = IA = A for any matrix A with compatible shapes. Multiplying by I leaves a matrix completely unchanged - exactly like multiplying a number by 1.
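In NumPy the n×n identity is np.eye(n); a minimal sketch of the key property:
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
I = np.eye(2)                  # 2×2 identity matrix

print(np.allclose(A @ I, A))   # True - multiplying by I changes nothing
print(np.allclose(I @ A, A))   # True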
The Matrix Inverse
For a scalar a, the inverse is 1/a: multiplying a by 1/a gives 1. For a square matrix A, the inverse A⁻¹ satisfies:
AA⁻¹ = A⁻¹A = I
- A - square matrix
- A⁻¹ - inverse of A
- I - identity matrix
Restrictions:
- Must be square. A 2×3 matrix has no inverse.
- Must be non-singular. The determinant must be non-zero. A singular matrix "collapses" some dimensions, which cannot be undone.
Example: for A = [[2, 1], [1, 1]], the determinant is 1 and the inverse is A⁻¹ = [[1, -1], [-1, 2]]. Check: AA⁻¹ = [[1, 0], [0, 1]] = I ✓
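The same check in NumPy (a sketch using the small 2×2 example above):
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])                 # det(A) = 1, so A is invertible
A_inv = np.linalg.inv(A)

print(A_inv)                               # [[ 1. -1.], [-1.  2.]]
print(np.allclose(A @ A_inv, np.eye(2)))   # True - AA⁻¹ = I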
In practice, use np.linalg.solve(A, b) to solve Ax = b rather than computing A⁻¹ explicitly - it is more numerically stable.
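A minimal sketch of the difference, using a small system Ax = b with made-up values:
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 1.0]])
b = np.array([3.0, 2.0])

x_via_inv = np.linalg.inv(A) @ b     # works, but forms A⁻¹ explicitly
x_via_solve = np.linalg.solve(A, b)  # preferred: solves Ax = b directly

print(np.allclose(x_via_inv, x_via_solve))  # True
print(x_via_solve)                          # [1. 1.]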
The Normal Equation
Transpose and inverse come together in one of the most elegant results in ML: the Normal Equation for linear regression.
Setting the gradient of MSE to zero and solving analytically gives the optimal weights in one formula:
w* = (XᵀX)⁻¹Xᵀy
- X - data matrix - n examples by p features
- y - target vector - n labels
- w* - optimal weight vector
Reading each piece:
- The term Xᵀy (shape p×1): how much each feature correlates with the target
- The term XᵀX (shape p×p): the feature covariance matrix - how features relate to each other
- The term (XᵀX)⁻¹: normalizes out feature-feature correlations
The result is the exact optimal weights in a single computation - no gradient descent, no iteration. The catch: inverting XᵀX costs O(p³) operations. For p = 10,000 features that is on the order of 10¹² operations - far too expensive to be practical. That is why we use gradient descent for large models.
import numpy as np
# Transpose
A = np.array([[1, 2, 3],
              [4, 5, 6]])  # shape (2, 3)
print(A.T.shape) # → (3, 2)
# Normal equation: w* = (XᵀX)⁻¹ Xᵀy
X = np.array([[1, 1.0],
              [1, 2.0],
              [1, 3.0],
              [1, 4.0]])  # design matrix (intercept + one feature)
y = np.array([2.1, 3.9, 6.2, 7.8])
# Option 1: explicit inverse (only for small p)
w_star = np.linalg.inv(X.T @ X) @ X.T @ y
print("weights:", w_star) # → intercept ≈ 0.3, slope ≈ 1.93
# Option 2: lstsq — numerically stable, preferred in practice
w_star2, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
print("weights (lstsq):", w_star2) # same answer, more stable
Interactive example
Normal equation demo - adjust a 2D dataset and watch the closed-form optimal weights update
Coming soon