Linear Algebra for ML — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

ML is linear algebra at scale plus calculus. Senior interviews probe whether you understand the operations you're doing — not just the syntax — and whether you can reason about properties (rank, conditioning, definiteness) that determine whether a method works or fails.

1. Matrices as linear maps

A matrix $A \in R^{m \times n}$ is a linear map $R^{n} \to R^{m}$ . It has four fundamental subspaces:

Column space $Col (A) \subseteq R^{m}$ : outputs $A$ can produce.
Null space $Null (A) \subseteq R^{n}$ : $\{x : A x = 0\}$ .
Row space $Row (A) = Col (A^{⊤})$ .
Left null space $Null (A^{⊤}) \subseteq R^{m}$ .

Rank-nullity: $rank (A) + dim (Null (A)) = n$ .

Rank facts:

$rank (A) = rank (A^{⊤})$ (row rank = column rank).
$rank (A B) \leq min (rank (A), rank (B))$ .
For $A \in R^{m \times n}$ : full rank means $rank = min (m, n)$ .
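The rank facts above can be checked numerically. The following is a minimal NumPy sketch; the matrices `A` and `B` are made-up examples (not from any particular problem), and the null-space basis is read off the SVD, which is one common way to get it.

```python
import numpy as np

# Illustrative 3x4 matrix: row 3 = row 1 + row 2, so the rank is 2.
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 2.0],
              [1.0, 3.0, 1.0, 3.0]])
m, n = A.shape

rank_A = np.linalg.matrix_rank(A)

# Null-space basis: right singular vectors beyond the first rank_A.
_, _, Vt = np.linalg.svd(A)           # Vt has shape (n, n)
null_basis = Vt[rank_A:].T            # shape (n, n - rank_A)
nullity = null_basis.shape[1]

# Hypothetical full-rank 4x5 matrix to illustrate rank(AB) <= min(rank A, rank B).
B = np.hstack([np.eye(4), np.ones((4, 1))])
rank_B = np.linalg.matrix_rank(B)
rank_AB = np.linalg.matrix_rank(A @ B)

print("rank(A) =", rank_A, " nullity =", nullity, " n =", n)
print("A @ null_basis ~ 0:", np.allclose(A @ null_basis, 0.0))
print("rank(A^T) == rank(A):", np.linalg.matrix_rank(A.T) == rank_A)
print("rank(AB) =", rank_AB, "<= min(rank A, rank B) =", min(rank_A, rank_B))
```

Note that `matrix_rank` decides rank with a floating-point tolerance, so on noisy data the "numerical rank" depends on that threshold.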

2. Eigendecomposition

For square $A \in R^{n \times n}$ :

$A v = λ v$

$λ$ is an eigenvalue, $v \neq 0$ a (right) eigenvector. The eigenvalues are the roots of the characteristic equation $det (A - λ I) = 0$ .

Diagonalization: if $A$ has $n$ linearly independent eigenvectors, then $A = V Λ V^{- 1}$ where $Λ$ is diagonal of eigenvalues.

Symmetric matrices — special

If $A = A^{⊤}$ :

All eigenvalues are real.
Eigenvectors of distinct eigenvalues are orthogonal.
$A$ is diagonalizable: $A = Q Λ Q^{⊤}$ with $Q$ orthogonal.

This is the spectral theorem. PCA (covariance is symmetric), kernel methods, and much of the rest of ML rely on it.
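A quick numerical illustration of the three properties, as a sketch on a randomly generated symmetric matrix (the seed and size are arbitrary choices). `np.linalg.eigh` is NumPy's routine for symmetric/Hermitian input.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                      # symmetrize an arbitrary matrix

# eigh assumes symmetric input; eigenvalues come back real and ascending.
eigvals, Q = np.linalg.eigh(A)

print("eigenvalues real:", np.isrealobj(eigvals))
print("Q orthogonal:", np.allclose(Q.T @ Q, np.eye(4)))
print("A = Q diag(eigvals) Q^T:", np.allclose(Q @ np.diag(eigvals) @ Q.T, A))
```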

Powers and functions of matrices

$A^{k} = V Λ^{k} V^{- 1}$ , and $Λ^{k}$ simply raises each eigenvalue to the $k$ -th power. This is why repeated multiplication by $A$ decays to zero when the spectral radius (the largest $∣ λ ∣$ ) is below 1 and can explode when it is above 1.

For symmetric $A$ : $f (A) = Q f (Λ) Q^{⊤}$ for any analytic $f$ .
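Both identities are easy to sanity-check. The sketch below uses two small hand-picked matrices (assumptions for illustration): a non-symmetric `A` with spectral radius below 1, and a symmetric PD `M` whose square root is built as $Q \sqrt{Λ} Q^{⊤}$.

```python
import numpy as np

# Non-symmetric matrix with real eigenvalues of magnitude < 1.
A = np.array([[0.5, 0.4],
              [0.1, 0.3]])
eigvals, V = np.linalg.eig(A)
spectral_radius = np.max(np.abs(eigvals))

k = 20
# V * eigvals**k scales column j of V by eigvals[j]**k, i.e. V @ diag(eigvals**k).
A_k_eig = (V * eigvals**k) @ np.linalg.inv(V)
A_k_direct = np.linalg.matrix_power(A, k)

# Function of a symmetric matrix: square root via f(M) = Q f(Lambda) Q^T.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])             # eigenvalues 1 and 3
lam, Q = np.linalg.eigh(M)
sqrt_M = Q @ np.diag(np.sqrt(lam)) @ Q.T

print("spectral radius:", spectral_radius)
print("A^k via eigendecomposition matches:", np.allclose(A_k_eig, A_k_direct))
print("norm of A^20:", np.linalg.norm(A_k_direct))
print("sqrt_M @ sqrt_M == M:", np.allclose(sqrt_M @ sqrt_M, M))
```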

3. SVD — the universal factorization

For any $A \in R^{m \times n}$ :

$A = U Σ V^{⊤}$

$U \in R^{m \times m}$ , orthogonal. Columns are left singular vectors.
$Σ \in R^{m \times n}$ , "diagonal" with non-negative singular values $σ_{1} \geq σ_{2} \geq \dots \geq 0$ .
$V \in R^{n \times n}$ , orthogonal. Columns are right singular vectors.

Geometric intuition: $A$ rotates or reflects ( $V^{⊤}$ ), scales axes ( $Σ$ ), then rotates or reflects again ( $U$ ). Any linear map decomposes this way.

Connections to other things

Rank: number of nonzero singular values.
$∥ A ∥_{2}$ (operator norm): largest singular value $σ_{1}$ .
$∥ A ∥_{F}$ (Frobenius): $\sqrt{\sum_{i} σ_{i}^{2}}$ .
Condition number: $κ (A) = σ_{1} / σ_{r}$ , where $σ_{r}$ is the smallest nonzero singular value ( $r = rank (A)$ ).
Pseudoinverse: $A^{+} = V Σ^{+} U^{⊤}$ where $Σ^{+}$ inverts nonzero singular values.
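These connections can be verified against NumPy's built-ins. A minimal sketch on a random tall matrix (assumed full column rank, which holds almost surely for Gaussian entries, so every singular value can be inverted):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Thin SVD: U is 5x3, s has 3 entries (descending), Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

rank = int(np.sum(s > 1e-12))             # count of nonzero singular values
op_norm = s[0]                            # largest singular value
fro_norm = np.sqrt(np.sum(s**2))          # sqrt of sum of squared singular values
cond = s[0] / s[rank - 1]                 # sigma_1 / sigma_r
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T    # V Sigma^+ U^T (all s > 0 here)

print(rank == np.linalg.matrix_rank(A))
print(np.isclose(op_norm, np.linalg.norm(A, 2)))
print(np.isclose(fro_norm, np.linalg.norm(A, "fro")))
print(np.isclose(cond, np.linalg.cond(A)))
print(np.allclose(A_pinv, np.linalg.pinv(A)))
```

For a rank-deficient matrix, only the singular values above a tolerance should be inverted; `np.linalg.pinv` handles that via its cutoff parameter.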

Eckart-Young theorem

The truncated SVD $A_{k} = U_{k} Σ_{k} V_{k}^{⊤}$ (top- $k$ singular components) is the best rank- $k$ approximation to $A$ in both operator and Frobenius norm. Foundation of PCA, low-rank matrix completion, model compression.
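A sketch of the theorem in action, assuming a random $6 \times 5$ matrix and $k = 2$ (both arbitrary). The approximation errors should equal $σ_{k+1}$ in operator norm and $\sqrt{\sum_{i > k} σ_{i}^{2}}$ in Frobenius norm, and an arbitrary rank- $k$ competitor should do no better.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # truncated SVD, rank k

err_op = np.linalg.norm(A - A_k, 2)
err_fro = np.linalg.norm(A - A_k, "fro")

# A hypothetical competitor: some other rank-k matrix, chosen at random.
C = rng.standard_normal((6, k)) @ rng.standard_normal((k, 5))
err_fro_competitor = np.linalg.norm(A - C, "fro")

print("operator error equals s[k]:", np.isclose(err_op, s[k]))
print("Frobenius error equals tail energy:",
      np.isclose(err_fro, np.sqrt(np.sum(s[k:]**2))))
print("random rank-k matrix is no better:", err_fro_competitor >= err_fro)
```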

Connection to eigendecomposition

For symmetric PSD $A$ : SVD = eigendecomposition (singular values = eigenvalues, left = right singular vectors = eigenvectors).

For general $A$ :

$A^{⊤} A = V Σ^{⊤} Σ V^{⊤}$ — eigendecomposition of $A^{⊤} A$ has eigenvalues $σ_{i}^{2}$ and eigenvectors $V$ .
$A A^{⊤} = U Σ Σ^{⊤} U^{⊤}$ — eigendecomp gives eigenvectors $U$ .

Conceptually, this is how the SVD can be computed. In practice, numerical libraries use more stable bidiagonalization-based algorithms rather than forming $A^{⊤} A$ explicitly.
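The relationship, and one reason forming $A^{⊤} A$ is avoided numerically, can be seen in a short sketch (random matrix, arbitrary seed): the eigenvalues of $A^{⊤} A$ are the squared singular values, and its condition number is the square of that of $A$ .

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

s = np.linalg.svd(A, compute_uv=False)          # descending singular values
gram_eigvals = np.linalg.eigvalsh(A.T @ A)      # ascending eigenvalues
sigma_from_eig = np.sqrt(gram_eigvals[::-1])    # reorder to descending

cond_A = np.linalg.cond(A)
cond_gram = np.linalg.cond(A.T @ A)

print("sqrt(eig(A^T A)) == singular values:", np.allclose(sigma_from_eig, s))
print("cond(A) =", cond_A, " cond(A^T A) =", cond_gram)
```

Squaring the condition number roughly doubles the number of digits lost to rounding, which is why the explicit Gram-matrix route is only a conceptual device.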

4. Positive (semi)definiteness

A symmetric matrix $A$ is:

Positive definite (PD) if $x^{⊤} A x > 0$ for all $x \neq 0$ . Equivalent: all eigenvalues $> 0$ .
Positive semidefinite (PSD) if $x^{⊤} A x \geq 0$ for all $x$ . Equivalent: all eigenvalues $\geq 0$ .

Why PD/PSD matters in ML

Covariance matrices are PSD.
Hessian at a local minimum is PSD; PD at a strict local min.
The quadratic $\frac{1}{2} x^{⊤} A x + b^{⊤} x$ (with symmetric $A$ ) is convex iff $A$ is PSD.
Kernel matrices (Gram matrices) must be PSD (Mercer's condition).
PD allows Cholesky: $A = L L^{⊤}$ with $L$ lower-triangular. Numerically efficient for solving.

Quick PSD check

$A = B^{⊤} B$ for any $B$ → PSD.
All principal minors $\geq 0$ → PSD (Sylvester's criterion: leading principal minors $> 0$ for PD).
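In code, the usual numerical checks are eigenvalues (for PSD, with a tolerance) and an attempted Cholesky factorization (for PD). The helpers `is_psd` and `is_pd` below are hypothetical names for this sketch, and the tolerance is an arbitrary choice.

```python
import numpy as np

def is_psd(A, tol=1e-10):
    """Symmetric and all eigenvalues >= -tol (tolerance absorbs rounding)."""
    A = np.asarray(A, dtype=float)
    if not np.allclose(A, A.T):
        return False
    return bool(np.linalg.eigvalsh(A).min() >= -tol)

def is_pd(A):
    """Symmetric and Cholesky succeeds."""
    A = np.asarray(A, dtype=float)
    # cholesky does not verify symmetry itself, so check it first.
    if not np.allclose(A, A.T):
        return False
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

rng = np.random.default_rng(0)
B = rng.standard_normal((3, 4))
gram = B.T @ B                                   # 4x4, rank <= 3: PSD but singular
indefinite = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues 3 and -1
pd_matrix = np.array([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 1 and 3

print(is_psd(gram), is_psd(indefinite), is_pd(pd_matrix), is_pd(indefinite))
```

For matrices that are PSD but singular, whether Cholesky fails or barely succeeds can depend on rounding, so the eigenvalue check with a tolerance is the safer test there.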

5. Matrix calculus — the four core formulas

These come up constantly in derivations.

Scalar-by-vector (gradient):

$\nabla_{x} (b^{⊤} x) = b, \quad \nabla_{x} (x^{⊤} A x) = (A + A^{⊤}) x$

For symmetric $A$ : $\nabla_{x} (x^{⊤} A x) = 2 A x$ .

Vector-by-vector (Jacobian): for $f : R^{n} \to R^{m}$ , the Jacobian $J \in R^{m \times n}$ has entries $J_{ij} = \partial f_{i} / \partial x_{j}$ .

Scalar-by-matrix: $\nabla_{W} tr (W^{⊤} A) = A$ , $\nabla_{W} tr (A W^{⊤} B) = B A$ .

Chain rule for Jacobians: $J_{f \circ g} (x) = J_{f} (g (x)) \cdot J_{g} (x)$ .
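A useful habit is to check any hand-derived gradient against finite differences. The sketch below does this for $\nabla_{x} (x^{⊤} A x)$ and $\nabla_{W} tr (W^{⊤} M)$ ; `numerical_grad` is a hypothetical helper written for this example, and the matrix called `M` plays the role of the constant matrix in the trace formula.

```python
import numpy as np

def numerical_grad(f, X, eps=1e-6):
    """Central-difference gradient of a scalar function f with respect to array X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)

# Quadratic form with a deliberately non-symmetric A.
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)
grad_quad = (A + A.T) @ x
num_quad = numerical_grad(lambda v: v @ A @ v, x)

# Trace form: gradient of tr(W^T M) with respect to W is M.
M = rng.standard_normal((4, 3))
W = rng.standard_normal((4, 3))
grad_trace = M
num_trace = numerical_grad(lambda V: np.trace(V.T @ M), W)

print(np.allclose(grad_quad, num_quad, atol=1e-6))
print(np.allclose(grad_trace, num_trace, atol=1e-6))
```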

OLS gradient — derive it once

$L (w) = \frac{1}{2} ∥ y - Xw ∥^{2} = \frac{1}{2} (y - Xw)^{⊤} (y - Xw)$ .

$\nabla_{w} L = - X^{⊤} (y - Xw) = X^{⊤} Xw - X^{⊤} y$ .

Setting to zero: $\hat{w} = (X^{⊤} X)^{- 1} X^{⊤} y$ (when $X^{⊤} X$ is invertible).

Hessian: $\nabla^{2} L = X^{⊤} X$ — PSD always; PD if $X$ has full column rank.
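The derivation can be confirmed on synthetic data. This is a sketch under assumed settings (50 samples, 3 features, small Gaussian noise, a made-up `w_true`); it solves the normal equations, checks that the gradient vanishes there, and compares against `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.0, -2.0, 0.5])              # hypothetical ground truth
y = X @ w_true + 0.1 * rng.standard_normal(n)

# Closed form via the normal equations (solve, rather than an explicit inverse).
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient of L(w) = 0.5 * ||y - Xw||^2 evaluated at w_hat.
grad_at_w_hat = X.T @ (X @ w_hat - y)

# Reference solution from NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Hessian X^T X: eigenvalues are all positive when X has full column rank.
hessian_eigs = np.linalg.eigvalsh(X.T @ X)

print("w_hat:", w_hat)
print("gradient ~ 0:", np.allclose(grad_at_w_hat, 0.0, atol=1e-8))
print("matches lstsq:", np.allclose(w_hat, w_lstsq))
print("Hessian PD:", hessian_eigs.min() > 0)
```

For ill-conditioned `X`, `lstsq` (SVD-based) is generally the safer choice than forming $X^{⊤} X$ .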

6. Matrix norms

Norm	Formula	Property
Frobenius	$∥ A ∥_{F} = \sqrt{\sum_{ij} a_{ij}^{2}}$	Square root of the sum of squared entries
Operator (spectral)	$∥ A ∥_{2} = σ_{\max}$	Largest stretch
Nuclear	$∥ A ∥_{*} = \sum σ_{i}$	Convex relaxation of rank
1-norm	$∥ A ∥_{1} = max_{j} \sum_{i} ∣ a_{ij} ∣$	Max column abs-sum
$\infty$ -norm	$∥ A ∥_{\infty} = max_{i} \sum_{j} ∣ a_{ij} ∣$	Max row abs-sum

Frobenius is the default in ML (it's just $ℓ_{2}$ on the vectorized matrix). Nuclear norm is used as a convex relaxation of rank — the workhorse of low-rank matrix completion.
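All five norms are available through `np.linalg.norm` via its `ord` argument. A small sketch on a hand-picked $2 \times 2$ matrix, with each value also computed from its definition:

```python
import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])
s = np.linalg.svd(A, compute_uv=False)

fro = np.linalg.norm(A, "fro")       # sqrt(1 + 4 + 9 + 16) = sqrt(30)
spec = np.linalg.norm(A, 2)          # largest singular value
nuc = np.linalg.norm(A, "nuc")       # sum of singular values
one = np.linalg.norm(A, 1)           # max column abs-sum: max(4, 6) = 6
inf = np.linalg.norm(A, np.inf)      # max row abs-sum: max(3, 7) = 7

print(fro, spec, nuc, one, inf)
```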

7. Condition number — why training breaks

For a square invertible $A$ :

$κ (A) = ∥ A ∥_{2} ∥ A^{- 1} ∥_{2} = σ_{1} / σ_{n}$

When solving $A x = b$ , a relative perturbation in $b$ can be amplified by up to a factor of $κ$ in the relative error of $x$ . Large condition number = ill-conditioned = numerically sensitive.

Why ML cares

Hessian conditioning controls gradient descent convergence rate. For a convex quadratic with PD Hessian $H$ , let $L$ and $μ$ be the largest and smallest eigenvalues of $H$ , so $κ = L / μ$ . GD with optimal step $η = 2/ (L + μ)$ contracts the error at rate $((κ - 1) / (κ + 1))^{k}$ ; with the simpler step $1/ L$ , it contracts at $(1 - μ / L)^{k}$ . Bad conditioning → slow.
Adaptive optimizers (Adam, RMSprop) approximate per-parameter rescaling, which partially compensates for bad conditioning when curvature varies axis-by-axis.
Normalization (BN, LN) reduces internal-layer condition number, which is one explanation for why it speeds up training.
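The two gradient descent rates can be observed directly on a toy quadratic $\frac{1}{2} x^{⊤} H x$ . The sketch below assumes a diagonal Hessian with eigenvalues 1 and 10 (so $κ = 10$ ) and an arbitrary starting point; `run_gd` is a hypothetical helper for this example.

```python
import numpy as np

def run_gd(H, x0, step, iters):
    """Plain gradient descent on f(x) = 0.5 * x^T H x, whose minimizer is 0."""
    x = x0.copy()
    for _ in range(iters):
        x = x - step * (H @ x)          # gradient of f is H x
    return x

mu, L = 1.0, 10.0                       # smallest / largest Hessian eigenvalue
kappa = L / mu
H = np.diag([mu, L])
x0 = np.array([1.0, 1.0])
k = 50

x_opt = run_gd(H, x0, 2.0 / (L + mu), k)
x_simple = run_gd(H, x0, 1.0 / L, k)

bound_opt = ((kappa - 1) / (kappa + 1)) ** k * np.linalg.norm(x0)
bound_simple = (1 - mu / L) ** k * np.linalg.norm(x0)

print("optimal step: error", np.linalg.norm(x_opt), " bound", bound_opt)
print("step 1/L:     error", np.linalg.norm(x_simple), " bound", bound_simple)
```

Raising `L` while keeping `mu` fixed pushes both contraction factors toward 1, which is the "bad conditioning → slow" effect in miniature.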

Improving conditioning

Standardize features (subtract mean, divide by SD).
Whiten data.
Add diagonal: $A + λ I$ — ridge regression bumps small eigenvalues, lowers $κ$ .
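The effect of adding a diagonal term is easy to demonstrate. In this sketch the two features are made nearly collinear on purpose, and the ridge strength `lam = 1.0` is an arbitrary choice; for a symmetric PSD matrix the new condition number should be $(λ_{\max} + λ) / (λ_{\min} + λ)$ .

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.standard_normal(100)
x2 = x1 + 1e-3 * rng.standard_normal(100)   # nearly a copy of x1
X = np.column_stack([x1, x2])

A = X.T @ X                                  # ill-conditioned Gram matrix
lam = 1.0
A_ridge = A + lam * np.eye(2)

cond_plain = np.linalg.cond(A)
cond_ridge = np.linalg.cond(A_ridge)

eigs = np.linalg.eigvalsh(A)                 # ascending
cond_predicted = (eigs[-1] + lam) / (eigs[0] + lam)

print("cond(A)          =", cond_plain)
print("cond(A + lam*I)  =", cond_ridge)
print("predicted        =", cond_predicted)
```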

8. Projections and least squares

A projection $P$ satisfies $P^{2} = P$ . Orthogonal if also $P = P^{⊤}$ .

For a matrix $X$ with linearly independent columns:

$P = X (X^{⊤} X)^{- 1} X^{⊤}$

projects onto $Col (X)$ . The OLS solution $\hat{w} = (X^{⊤} X)^{- 1} X^{⊤} y$ gives $\hat{y} = P y$ : the fitted values are the projection of $y$ onto the column space.

Geometric view of OLS: find the closest point in $Col (X)$ to $y$ . The residual $y - \hat{y}$ is orthogonal to $Col (X)$ , which is exactly the normal equations: $X^{⊤} (y - X \hat{w}) = 0$ .
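The projection properties can be checked in a few lines. A minimal sketch with a random $6 \times 2$ design matrix (assumed to have independent columns, which holds almost surely for Gaussian entries):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
y = rng.standard_normal(6)

# P = X (X^T X)^{-1} X^T, computed with solve instead of an explicit inverse.
P = X @ np.linalg.solve(X.T @ X, X.T)

y_hat = P @ y                    # fitted values: projection of y onto Col(X)
residual = y - y_hat

print("idempotent (P^2 = P):", np.allclose(P @ P, P))
print("symmetric (P = P^T):", np.allclose(P, P.T))
print("residual orthogonal to Col(X):", np.allclose(X.T @ residual, 0.0, atol=1e-10))
print("trace(P) = rank(X):", np.isclose(np.trace(P), 2.0))
```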

9. Common interview gotchas

Question	Common wrong answer	Right answer
Is rank always $min (m, n)$ ?	Yes	Only if full rank — rank can be lower
Is $X^{⊤} X$ always invertible?	Yes	Only if $X$ has full column rank
Are eigenvectors of a symmetric matrix unique?	Yes	Only up to sign and degenerate-eigenvalue rotation
What's the difference between rank and dimension?	Same thing	Dimension is for spaces; rank is for matrices (= dim of column/row space)
Largest eigenvalue = operator norm?	Yes	For symmetric matrices yes; in general operator norm is largest singular value
Does Adam fix bad conditioning?	Yes	Approximately — it rescales per-coordinate, which helps when curvature varies axis-by-axis
PSD + PSD = PSD?	Maybe	Yes, sum of PSD is PSD
PSD × PSD = PSD?	Yes	Not in general — only if they commute
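Two of these gotchas have tiny counterexamples worth memorizing. The matrices below are hand-picked for illustration: `A` and `B` are both PSD but their product is not symmetric and has a negative quadratic form, and `N` has all eigenvalues zero yet operator norm 1.

```python
import numpy as np

# PSD x PSD need not be PSD.
A = np.array([[1.0, 1.0],
              [1.0, 1.0]])            # PSD: eigenvalues 0 and 2
B = np.array([[1.0, 0.0],
              [0.0, 0.0]])            # PSD: eigenvalues 0 and 1
AB = A @ B                            # [[1, 0], [1, 0]], not symmetric
x = np.array([1.0, -2.0])
quad_form = x @ AB @ x                # x^T (AB) x

# PSD + PSD stays PSD.
sum_min_eig = np.linalg.eigvalsh(A + B).min()

# Largest |eigenvalue| vs operator norm for a non-symmetric matrix.
N = np.array([[0.0, 1.0],
              [0.0, 0.0]])
max_abs_eig = np.max(np.abs(np.linalg.eigvals(N)))
op_norm = np.linalg.norm(N, 2)

print("AB symmetric:", np.allclose(AB, AB.T), " x^T AB x =", quad_form)
print("min eigenvalue of A + B:", sum_min_eig)
print("max |eig(N)| =", max_abs_eig, " operator norm =", op_norm)
```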

10. Eight most-asked interview questions

Derive OLS gradient and prove the Hessian is PSD. (Vectorized chain rule + $X^{⊤} X ⪰ 0$ .)
What's the SVD of a matrix, and in what sense is it unique? (Singular values are unique; singular vectors are unique up to sign when the SVs are distinct, and up to a rotation within each subspace when degenerate.)
Why does PCA work? Connect to SVD. (Eigendecomposition of covariance = SVD of centered data; top- $k$ approx via Eckart-Young.)
What's a condition number and when does it matter? (Sensitivity of solution; affects GD convergence; normalization helps.)
What does it mean for a matrix to be PSD? List 3 equivalent characterizations. (All eigenvalues $\geq 0$ ; $x^{⊤} A x \geq 0$ ; $A = B^{⊤} B$ .)
Compute the gradient of $∥ A x - b ∥^{2}$ w.r.t. $x$ . (Should take 30 seconds: $2 A^{⊤} (A x - b)$ .)
Why is $X^{⊤} X$ used instead of $X X^{⊤}$ in OLS? (Solves for $w \in R^{d}$ , dim of features. Use $X X^{⊤}$ when $n < d$ — kernel trick.)
What's the geometric meaning of the rank of a matrix? (Dim of column space = "number of independent output directions"; if $A$ is a linear map, $rank =$ dim of image.)
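For the PCA question, it helps to have seen the equivalence numerically at least once. The sketch below uses synthetic data with per-feature scales 3, 1, and 0.3 (an arbitrary choice that keeps the eigenvalues well separated) and compares the covariance eigendecomposition with the SVD of the centered data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 3)) * np.array([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)                        # center each feature

# Route 1: eigendecomposition of the sample covariance.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)         # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # reorder to descending

# Route 2: SVD of the centered data matrix.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_from_svd = s**2 / (n - 1)

# Principal directions agree up to sign: |cosine| between matching columns is 1.
alignment = np.abs(np.sum(eigvecs * Vt.T, axis=0))

print("variances match:", np.allclose(eigvals, var_from_svd))
print("directions match up to sign:", np.allclose(alignment, 1.0))
```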

11. Drill plan

Derive OLS gradient + Hessian + closed form on paper. Repeat until 2 minutes.
Recite SVD definition, properties, connection to eigendecomp.
For a $3 \times 3$ symmetric matrix, compute eigenvalues and eigenvectors by hand.
For each ML method (PCA, ridge, OLS, kernel ridge), state the relevant linear algebra fact it relies on.
Recite three equivalent definitions of PSD; derive Cholesky for a $2 \times 2$ PD.

12. Further reading

Strang, Introduction to Linear Algebra — the canonical undergrad text.
Trefethen & Bau, Numerical Linear Algebra — focused on what actually breaks numerically.
Petersen & Pedersen, The Matrix Cookbook — quick reference for matrix calculus.
Boyd & Vandenberghe, Convex Optimization, Appendix A — concise linear algebra refresher.
