Linear Algebra for ML — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

ML is linear algebra at scale plus calculus. Senior interviews probe whether you understand the operations you're doing — not just the syntax — and whether you can reason about properties (rank, conditioning, definiteness) that determine whether a method works or fails.


1. Matrices as linear maps

A matrix is a linear map . Five fundamental subspaces (the four classical, plus null space of ):

  • Column space : outputs can produce.
  • Null space : .
  • Row space .
  • Left null space .

Rank-nullity: .

Rank facts:

  • (row rank = column rank).
  • .
  • For : full rank means .

2. Eigendecomposition

For square :

is an eigenvalue, a (right) eigenvector. The characteristic polynomial gives eigenvalues.

Diagonalization: if has linearly independent eigenvectors, then where is diagonal of eigenvalues.

Symmetric matrices — special

If :

  • All eigenvalues are real.
  • Eigenvectors of distinct eigenvalues are orthogonal.
  • is diagonalizable: with orthogonal.

This is the spectral theorem. It's why PCA (covariance is symmetric), kernel methods, and tons of ML rely on it.

Powers and functions of matrices

. So raises eigenvalues to the -th power. This is why repeated multiplication by converges (or explodes) based on the largest — the spectral radius.

For symmetric : for any analytic .


3. SVD — the universal factorization

For any :

  • , orthogonal. Columns are left singular vectors.
  • , "diagonal" with non-negative singular values .
  • , orthogonal. Columns are right singular vectors.

Geometric intuition: rotates (), scales axes (), then rotates again (). Any linear map decomposes this way.

Connections to other things

  • Rank: number of nonzero singular values.
  • (operator norm): largest singular value .
  • (Frobenius): .
  • Condition number: .
  • Pseudoinverse: where inverts nonzero singular values.

Eckart-Young theorem

The truncated SVD (top- singular components) is the best rank- approximation to in both operator and Frobenius norm. Foundation of PCA, low-rank matrix completion, model compression.

Connection to eigendecomposition

For symmetric PSD : SVD = eigendecomposition (singular values = eigenvalues, left = right singular vectors = eigenvectors).

For general :

  • — eigendecomposition of has eigenvalues and eigenvectors .
  • — eigendecomp gives eigenvectors .

This is how SVD is computed numerically (in practice via more stable bidiagonalization, but conceptually).


4. Positive (semi)definiteness

A symmetric matrix is:

  • Positive definite (PD) if for all . Equivalent: all eigenvalues .
  • Positive semidefinite (PSD) if for all . Equivalent: all eigenvalues .

Why PD/PSD matters in ML

  • Covariance matrices are PSD.
  • Hessian at a local minimum is PSD; PD at a strict local min.
  • Convex quadratic is convex iff is PSD.
  • Kernel matrices (Gram matrices) must be PSD (Mercer's condition).
  • PD allows Cholesky: with lower-triangular. Numerically efficient for solving.

Quick PSD check

  • for any → PSD.
  • All principal minors → PSD (Sylvester's criterion: leading principal minors for PD).

5. Matrix calculus — the four core formulas

These come up constantly in derivations.

Scalar-by-vector (gradient):

For symmetric : .

Vector-by-vector (Jacobian): for , from , .

Scalar-by-matrix: , .

Chain rule for Jacobians: .

OLS gradient — derive it once

.

.

Setting to zero: (when invertible).

Hessian: — PSD always; PD if has full column rank.


6. Matrix norms

NormFormulaProperty
FrobeniusSum of squared entries
Operator (spectral)Largest stretch
NuclearConvex relaxation of rank
1-normMax column abs-sum
-normMax row abs-sum

Frobenius is the default in ML (it's just on the vectorized matrix). Nuclear norm is used as a convex relaxation of rank — the workhorse of low-rank matrix completion.


7. Condition number — why training breaks

For a square invertible :

When solving , perturbations in are amplified by . Large condition number = ill-conditioned = numerically unstable.

Why ML cares

  • Hessian conditioning controls gradient descent convergence rate. Convex quadratic with Hessian : GD with optimal step contracts at rate ; with simpler step , contracts at . Bad conditioning → slow.
  • Adaptive optimizers (Adam, RMSprop) approximate per-parameter rescaling — implicitly handle bad conditioning.
  • Normalization (BN, LN) reduces internal-layer condition number, which is one explanation for why it speeds up training.

Improving conditioning

  • Standardize features (subtract mean, divide by SD).
  • Whiten data.
  • Add diagonal: — ridge regression bumps small eigenvalues, lowers .

8. Projections and least squares

A projection satisfies . Orthogonal if also .

For a matrix with linearly independent columns:

projects onto . The OLS solution gives — fitted values are the projection of onto column space.

Geometric view of OLS: find the closest point in to . The residual is orthogonal to — the normal equations: .


9. Common interview gotchas

QuestionCommon wrong answerRight answer
Is rank always ?YesOnly if full rank — rank can be lower
Is always invertible?YesOnly if has full column rank
Are eigenvectors of a symmetric matrix unique?YesOnly up to sign and degenerate-eigenvalue rotation
What's the difference between rank and dimension?Same thingDimension is for spaces; rank is for matrices (= dim of column/row space)
Largest eigenvalue = operator norm?YesFor symmetric matrices yes; in general operator norm is largest singular value
Does Adam fix bad conditioning?YesApproximately — it rescales per-coordinate, which helps when curvature varies axis-by-axis
PSD + PSD = PSD?MaybeYes, sum of PSD is PSD
PSD × PSD = PSD?YesNot in general — only if they commute

10. Eight most-asked interview questions

  1. Derive OLS gradient and prove the Hessian is PSD. (Vectorized chain rule + .)
  2. What's the SVD of a matrix and why is it unique? (Up to sign of singular vectors when SVs are distinct; up to a rotation when degenerate.)
  3. Why does PCA work? Connect to SVD. (Eigendecomposition of covariance = SVD of centered data; top- approx via Eckart-Young.)
  4. What's a condition number and when does it matter? (Sensitivity of solution; affects GD convergence; normalization helps.)
  5. What does it mean for a matrix to be PSD? List 3 equivalent characterizations. (All eigenvalues ; ; .)
  6. Compute the gradient of w.r.t. . (Should take 30 seconds: .)
  7. Why is used instead of in OLS? (Solves for , dim of features. Use when — kernel trick.)
  8. What's the geometric meaning of the rank of a matrix? (Dim of column space = "number of independent output directions"; if is a linear map, dim of image.)

11. Drill plan

  • Derive OLS gradient + Hessian + closed form on paper. Repeat until 2 minutes.
  • Recite SVD definition, properties, connection to eigendecomp.
  • For a symmetric matrix, compute eigenvalues and eigenvectors by hand.
  • For each ML method (PCA, ridge, OLS, kernel ridge), state the relevant linear algebra fact it relies on.
  • Recite three equivalent definitions of PSD; derive Cholesky for a PD.

12. Further reading

  • Strang, Introduction to Linear Algebra — the canonical undergrad text.
  • Trefethen & Bau, Numerical Linear Algebra — focused on what actually breaks numerically.
  • Petersen & Pedersen, The Matrix Cookbook — quick reference for matrix calculus.
  • Boyd & Vandenberghe, Convex Optimization, Appendix A — concise linear algebra refresher.