Linear Algebra for ML — Deep Dive
Frontier-lab interview prep. Pair with
INTERVIEW_GRILL.md.
ML is linear algebra at scale plus calculus. Senior interviews probe whether you understand the operations you're doing — not just the syntax — and whether you can reason about properties (rank, conditioning, definiteness) that determine whether a method works or fails.
1. Matrices as linear maps
A matrix is a linear map . Five fundamental subspaces (the four classical, plus null space of ):
- Column space : outputs can produce.
- Null space : .
- Row space .
- Left null space .
Rank-nullity: .
Rank facts:
- (row rank = column rank).
- .
- For : full rank means .
2. Eigendecomposition
For square :
is an eigenvalue, a (right) eigenvector. The characteristic polynomial gives eigenvalues.
Diagonalization: if has linearly independent eigenvectors, then where is diagonal of eigenvalues.
Symmetric matrices — special
If :
- All eigenvalues are real.
- Eigenvectors of distinct eigenvalues are orthogonal.
- is diagonalizable: with orthogonal.
This is the spectral theorem. It's why PCA (covariance is symmetric), kernel methods, and tons of ML rely on it.
Powers and functions of matrices
. So raises eigenvalues to the -th power. This is why repeated multiplication by converges (or explodes) based on the largest — the spectral radius.
For symmetric : for any analytic .
3. SVD — the universal factorization
For any :
- , orthogonal. Columns are left singular vectors.
- , "diagonal" with non-negative singular values .
- , orthogonal. Columns are right singular vectors.
Geometric intuition: rotates (), scales axes (), then rotates again (). Any linear map decomposes this way.
Connections to other things
- Rank: number of nonzero singular values.
- (operator norm): largest singular value .
- (Frobenius): .
- Condition number: .
- Pseudoinverse: where inverts nonzero singular values.
Eckart-Young theorem
The truncated SVD (top- singular components) is the best rank- approximation to in both operator and Frobenius norm. Foundation of PCA, low-rank matrix completion, model compression.
Connection to eigendecomposition
For symmetric PSD : SVD = eigendecomposition (singular values = eigenvalues, left = right singular vectors = eigenvectors).
For general :
- — eigendecomposition of has eigenvalues and eigenvectors .
- — eigendecomp gives eigenvectors .
This is how SVD is computed numerically (in practice via more stable bidiagonalization, but conceptually).
4. Positive (semi)definiteness
A symmetric matrix is:
- Positive definite (PD) if for all . Equivalent: all eigenvalues .
- Positive semidefinite (PSD) if for all . Equivalent: all eigenvalues .
Why PD/PSD matters in ML
- Covariance matrices are PSD.
- Hessian at a local minimum is PSD; PD at a strict local min.
- Convex quadratic is convex iff is PSD.
- Kernel matrices (Gram matrices) must be PSD (Mercer's condition).
- PD allows Cholesky: with lower-triangular. Numerically efficient for solving.
Quick PSD check
- for any → PSD.
- All principal minors → PSD (Sylvester's criterion: leading principal minors for PD).
5. Matrix calculus — the four core formulas
These come up constantly in derivations.
Scalar-by-vector (gradient):
For symmetric : .
Vector-by-vector (Jacobian): for , from , .
Scalar-by-matrix: , .
Chain rule for Jacobians: .
OLS gradient — derive it once
.
.
Setting to zero: (when invertible).
Hessian: — PSD always; PD if has full column rank.
6. Matrix norms
| Norm | Formula | Property |
|---|---|---|
| Frobenius | Sum of squared entries | |
| Operator (spectral) | Largest stretch | |
| Nuclear | Convex relaxation of rank | |
| 1-norm | Max column abs-sum | |
| -norm | Max row abs-sum |
Frobenius is the default in ML (it's just on the vectorized matrix). Nuclear norm is used as a convex relaxation of rank — the workhorse of low-rank matrix completion.
7. Condition number — why training breaks
For a square invertible :
When solving , perturbations in are amplified by . Large condition number = ill-conditioned = numerically unstable.
Why ML cares
- Hessian conditioning controls gradient descent convergence rate. Convex quadratic with Hessian : GD with optimal step contracts at rate ; with simpler step , contracts at . Bad conditioning → slow.
- Adaptive optimizers (Adam, RMSprop) approximate per-parameter rescaling — implicitly handle bad conditioning.
- Normalization (BN, LN) reduces internal-layer condition number, which is one explanation for why it speeds up training.
Improving conditioning
- Standardize features (subtract mean, divide by SD).
- Whiten data.
- Add diagonal: — ridge regression bumps small eigenvalues, lowers .
8. Projections and least squares
A projection satisfies . Orthogonal if also .
For a matrix with linearly independent columns:
projects onto . The OLS solution gives — fitted values are the projection of onto column space.
Geometric view of OLS: find the closest point in to . The residual is orthogonal to — the normal equations: .
9. Common interview gotchas
| Question | Common wrong answer | Right answer |
|---|---|---|
| Is rank always ? | Yes | Only if full rank — rank can be lower |
| Is always invertible? | Yes | Only if has full column rank |
| Are eigenvectors of a symmetric matrix unique? | Yes | Only up to sign and degenerate-eigenvalue rotation |
| What's the difference between rank and dimension? | Same thing | Dimension is for spaces; rank is for matrices (= dim of column/row space) |
| Largest eigenvalue = operator norm? | Yes | For symmetric matrices yes; in general operator norm is largest singular value |
| Does Adam fix bad conditioning? | Yes | Approximately — it rescales per-coordinate, which helps when curvature varies axis-by-axis |
| PSD + PSD = PSD? | Maybe | Yes, sum of PSD is PSD |
| PSD × PSD = PSD? | Yes | Not in general — only if they commute |
10. Eight most-asked interview questions
- Derive OLS gradient and prove the Hessian is PSD. (Vectorized chain rule + .)
- What's the SVD of a matrix and why is it unique? (Up to sign of singular vectors when SVs are distinct; up to a rotation when degenerate.)
- Why does PCA work? Connect to SVD. (Eigendecomposition of covariance = SVD of centered data; top- approx via Eckart-Young.)
- What's a condition number and when does it matter? (Sensitivity of solution; affects GD convergence; normalization helps.)
- What does it mean for a matrix to be PSD? List 3 equivalent characterizations. (All eigenvalues ; ; .)
- Compute the gradient of w.r.t. . (Should take 30 seconds: .)
- Why is used instead of in OLS? (Solves for , dim of features. Use when — kernel trick.)
- What's the geometric meaning of the rank of a matrix? (Dim of column space = "number of independent output directions"; if is a linear map, dim of image.)
11. Drill plan
- Derive OLS gradient + Hessian + closed form on paper. Repeat until 2 minutes.
- Recite SVD definition, properties, connection to eigendecomp.
- For a symmetric matrix, compute eigenvalues and eigenvectors by hand.
- For each ML method (PCA, ridge, OLS, kernel ridge), state the relevant linear algebra fact it relies on.
- Recite three equivalent definitions of PSD; derive Cholesky for a PD.
12. Further reading
- Strang, Introduction to Linear Algebra — the canonical undergrad text.
- Trefethen & Bau, Numerical Linear Algebra — focused on what actually breaks numerically.
- Petersen & Pedersen, The Matrix Cookbook — quick reference for matrix calculus.
- Boyd & Vandenberghe, Convex Optimization, Appendix A — concise linear algebra refresher.