Linear Algebra for ML — Interview Grill

50 questions on rank, eigendecomp, SVD, PSD, matrix calculus, conditioning, projections, plus 10 quick-fire items. Drill until you can answer 35+ of the main 50 cold.

A. Rank and subspaces

1. Define rank of a matrix. Dimension of the column space (= dimension of the row space). Equivalently, number of linearly independent rows or columns.

2. State the rank-nullity theorem. For $A \in \mathbb{R}^{m \times n}$: $rank (A) + dim (Null (A)) = n$.

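3. What does row rank = column rank mean intuitively? The maximum number of linearly independent rows always equals the maximum number of linearly independent columns, even for non-square matrices. This is not obvious from the definitions; it is a theorem, usually proved by counting pivots in the RREF or via the SVD.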

4. Inequality for $rank (A B)$ ? $rank (A B) \leq min (rank (A), rank (B))$ .

5. When is $X^{⊤} X$ invertible? When $X$ has full column rank (columns linearly independent).

6. What if $X^{⊤} X$ is singular in OLS? Use pseudoinverse, or add ridge ( $X^{⊤} X + λ I$ ), or remove redundant columns.

7. What are the four fundamental subspaces? $Col (A)$, $Null (A)$, $Row (A) = Col (A^{⊤})$, $Null (A^{⊤})$. They pair up as orthogonal complements: $Col (A) ⊥ Null (A^{⊤})$ in $\mathbb{R}^{m}$ and $Row (A) ⊥ Null (A)$ in $\mathbb{R}^{n}$.
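Questions 2, 5, and 6 can be checked numerically. The following is a minimal NumPy sketch, not a reference implementation: the matrix `X`, the target `y`, the tolerance `1e-10`, and the ridge strength `lam` are all made-up illustration values. The third column of `X` is the sum of the first two, so $X^{⊤} X$ is singular.

```python
import numpy as np

# Hypothetical design matrix: column 3 = column 1 + column 2, so rank is 2, not 3.
X = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])

# Rank is computed from singular values above an explicit tolerance.
rank = np.linalg.matrix_rank(X, tol=1e-10)
nullity = X.shape[1] - rank              # rank-nullity: rank + dim Null(X) = n

# A vector in Null(X): the redundant column cancels the other two.
null_vec = np.array([1.0, 1.0, -1.0])

# Fix 1: pseudoinverse gives the minimum-norm least-squares solution.
w_pinv = np.linalg.pinv(X, rcond=1e-10) @ y

# Fix 2: ridge makes X^T X + lam * I invertible for any lam > 0.
lam = 1e-6
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```

As `lam` shrinks toward zero, the ridge solution approaches the pseudoinverse solution, which is one way to see why both fixes are mentioned together.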

B. Eigendecomposition

8. Define eigenvalue and eigenvector. $A v = λ v$ with $v \neq 0$. $λ$ is the eigenvalue, $v$ the eigenvector.

9. How do you find eigenvalues? Roots of characteristic polynomial: $det (A - λ I) = 0$ .

10. State the spectral theorem. Real symmetric matrix $A$ has $n$ real eigenvalues and an orthonormal basis of eigenvectors. $A = Q Λ Q^{⊤}$ with $Q$ orthogonal.

11. Why are eigenvectors of distinct eigenvalues orthogonal (for symmetric $A$)? $λ_{1} v_{1}^{⊤} v_{2} = (A v_{1})^{⊤} v_{2} = v_{1}^{⊤} A v_{2} = λ_{2} v_{1}^{⊤} v_{2}$. If $λ_{1} \neq λ_{2}$, we must have $v_{1}^{⊤} v_{2} = 0$.

12. Which matrices are NOT diagonalizable? Defective matrices: those without a full set of linearly independent eigenvectors. E.g., $\begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}$ has only one eigenvector (up to scaling).

13. Eigenvalues of $A^{k}$ ? $λ^{k}$ for each eigenvalue $λ$ of $A$ .

14. Eigenvalues of $A^{-1}$? $1 / λ$ for each eigenvalue $λ$ of $A$ (all are nonzero when $A$ is invertible).

15. What's the spectral radius? $ρ (A) = \max_{i} |λ_{i}|$, the largest eigenvalue magnitude. It governs the behavior of $A^{k}$: $A^{k} \to 0$ iff $ρ (A) < 1$, and $∥ A^{k} ∥$ diverges if $ρ (A) > 1$.
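The eigenvalue facts in questions 10 to 15 can be sanity-checked with a short NumPy sketch. The matrices `A`, `J`, and `M` below are arbitrary illustration values, not taken from any particular source; `J` is a lower-triangular defective matrix of the kind question 12 describes.

```python
import numpy as np

# Symmetric matrix: eigh returns real eigenvalues (ascending) and orthonormal eigenvectors.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
evals, Q = np.linalg.eigh(A)
reconstructed = Q @ np.diag(evals) @ Q.T          # spectral theorem: A = Q Lambda Q^T

# Eigenvalues of A^3 are the cubes; eigenvalues of A^{-1} are the reciprocals.
evals_cubed = np.linalg.eigvalsh(np.linalg.matrix_power(A, 3))
evals_inv = np.linalg.eigvalsh(np.linalg.inv(A))

# Defective matrix: eigenvalue 1 is repeated, but Null(J - I) is one-dimensional,
# so there is only one independent eigenvector.
J = np.array([[1.0, 0.0],
              [1.0, 1.0]])
eigenspace_dim = 2 - np.linalg.matrix_rank(J - np.eye(2))

# Spectral radius below 1 means the powers M^k decay to zero.
M = np.array([[0.5, 0.4],
              [0.0, 0.9]])
rho = np.max(np.abs(np.linalg.eigvals(M)))
M_pow = np.linalg.matrix_power(M, 200)
```

Note that `eigh` is used for the symmetric case because it guarantees real output and orthonormal eigenvectors, while the general `eigvals` is used for the non-symmetric `M`.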

C. SVD

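16. State the SVD theorem. Any $A \in \mathbb{R}^{m \times n}$ factors as $A = U Σ V^{⊤}$ with $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ orthogonal and $Σ \in \mathbb{R}^{m \times n}$ (rectangular) diagonal with non-negative singular values $σ_{1} \geq σ_{2} \geq \dots \geq 0$.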

17. Geometric interpretation of SVD? Rotation or reflection ($V^{⊤}$) → axis-aligned scaling ($Σ$) → rotation or reflection ($U$). Any linear map decomposes this way.

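18. SVD vs eigendecomposition? SVD exists for any matrix; eigendecomposition only for diagonalizable square matrices. For symmetric PSD matrices the two coincide. In general, the right singular vectors $V$ are eigenvectors of $A^{⊤} A$, the left singular vectors $U$ are eigenvectors of $A A^{⊤}$, and the $σ_{i}^{2}$ are their shared nonzero eigenvalues.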

19. What's the operator norm of $A$ in terms of SVD? Largest singular value: $∥ A ∥_{2} = σ_{1}$ .

20. Frobenius norm in terms of SVD? $∥ A ∥_{F} = \sqrt{\sum_{i} σ_{i}^{2}}$, i.e. $∥ A ∥_{F}^{2} = \sum_{i} σ_{i}^{2}$.

21. How do you compute rank from SVD? Number of nonzero singular values (in practice, number greater than some tolerance).

22. State Eckart-Young. The truncated SVD $A_{k} = U_{k} Σ_{k} V_{k}^{⊤}$ is the best rank- $k$ approximation in operator and Frobenius norms.

23. Why does PCA reduce to SVD? Take centered data $X = U Σ V^{⊤}$. The covariance is $Σ_{X} = X^{⊤} X / n$, so the eigenvectors of $Σ_{X}$ are the right singular vectors $V$ of $X$, with eigenvalues $σ_{i}^{2} / n$. PCA scores = $X V = U Σ$.

24. SVD of a low-rank matrix? Rank- $r$ matrix has only $r$ nonzero singular values. Truncated SVD with $k = r$ recovers exactly.

25. What's the pseudoinverse via SVD? $A^{+} = V Σ^{+} U^{⊤}$ where $Σ^{+}$ transposes $Σ$ and inverts its nonzero singular values. For any $A$ and $b$, $x = A^{+} b$ is the minimum-norm least-squares solution of $A x ≈ b$.
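The SVD identities in this section (norms, Eckart-Young errors, PCA, pseudoinverse) can be verified in a few lines. This is a hedged sketch on a random matrix: the shape `(8, 5)`, the seed, and the truncation rank `k = 2` are arbitrary choices for illustration, and the covariance uses the $1 / n$ convention from question 23.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))

# Thin SVD: U is 8x5, s holds singular values in descending order, Vt is 5x5.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

op_norm = s[0]                          # operator (spectral) norm
fro_norm = np.sqrt(np.sum(s ** 2))      # Frobenius norm

# Eckart-Young: the rank-k truncation error is set by the discarded singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
err_op = np.linalg.norm(A - A_k, 2)         # equals sigma_{k+1}
err_fro = np.linalg.norm(A - A_k, "fro")    # equals sqrt(sum of sigma_i^2 for i > k)

# PCA: center the data, then right singular vectors are the principal directions.
Xc = A - A.mean(axis=0)
Uc, sc, Vct = np.linalg.svd(Xc, full_matrices=False)
scores = Uc * sc                            # U Sigma, same as Xc @ V
cov_evals = np.linalg.eigvalsh(Xc.T @ Xc / Xc.shape[0])[::-1]   # descending

# Pseudoinverse from the SVD (A has full column rank here, so every sigma_i > 0).
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
```

For a rank-deficient matrix the last line would divide by zero; in that case only the singular values above a tolerance should be inverted, which is what `np.linalg.pinv` does.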

D. PSD / definiteness

26. Define positive semidefinite. Symmetric and $x^{⊤} A x \geq 0$ for all $x$ . Equivalently, all eigenvalues $\geq 0$ .

27. Define positive definite. Symmetric and $x^{⊤} A x > 0$ for all $x \neq 0$. Equivalently, all eigenvalues $> 0$.

28. Three equivalent characterizations of PSD? For symmetric $A$: (1) $x^{⊤} A x \geq 0$ for all $x$. (2) All eigenvalues $\geq 0$. (3) $A = B^{⊤} B$ for some $B$.

29. Why is the Hessian PSD at a local minimum? Necessary second-order condition: at a local min, the function curves upward (or flat) in every direction.

30. Why is covariance always PSD? $Cov (X) = E [(X - μ) (X - μ)^{⊤}]$ . For any $w$ : $w^{⊤} Cov (X) w = Var (w^{⊤} X) \geq 0$ .

31. Why must kernel matrices be PSD? Mercer's theorem: a kernel function corresponds to an inner product in some Hilbert space iff its Gram matrix is PSD for any data.

32. Sum of two PSD matrices? PSD: $x^{⊤} (A + B) x = x^{⊤} A x + x^{⊤} B x \geq 0$ .

33. Product of two PSD matrices: always PSD? No (in general). $A B$ may not even be symmetric. $A B$ is symmetric iff $A$ and $B$ commute, and in that case it is PSD.

34. Cholesky decomposition — when does it exist? For PD matrices: $A = L L^{⊤}$ with $L$ lower triangular and positive diagonal. For PSD, need to allow zeros (semi-Cholesky).
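A few of the definiteness facts above lend themselves to a quick numerical check. The sketch below is illustrative only: the helper name `is_pd` is hypothetical, and the matrices `B`, `P`, `D`, and `indefinite` are made-up examples. It uses the fact that `np.linalg.cholesky` raises `LinAlgError` when the input is not positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)

# A = B^T B is PSD by construction; B is tall with full column rank, so A is PD.
B = rng.normal(size=(6, 3))
A = B.T @ B
min_eig = np.linalg.eigvalsh(A).min()

# Cholesky factor: lower triangular with positive diagonal.
L = np.linalg.cholesky(A)

def is_pd(M):
    """Hypothetical helper: True if symmetric M is positive definite (Cholesky test)."""
    try:
        np.linalg.cholesky(M)
        return True
    except np.linalg.LinAlgError:
        return False

# Symmetric but indefinite (eigenvalues 3 and -1).
indefinite = np.array([[1.0, 2.0],
                       [2.0, 1.0]])

# Two PD matrices that do not commute: their product is not symmetric.
P = np.array([[2.0, 1.0],
              [1.0, 1.0]])
D = np.array([[1.0, 0.0],
              [0.0, 3.0]])
product = P @ D
```

The Cholesky test is usually cheaper than computing all eigenvalues, which is why it is a common way to check positive definiteness in practice.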

E. Matrix calculus

35. $\nabla_{x} (b^{⊤} x) = ?$ $b$ .

36. $\nabla_{x} (x^{⊤} A x) = ?$ $(A + A^{⊤}) x$ . For symmetric $A$ : $2 A x$ .

37. $\nabla_{x} ∥ y - A x ∥^{2} = ?$ $- 2 A^{⊤} (y - A x) = 2 A^{⊤} A x - 2 A^{⊤} y$ .

38. Hessian of $∥ y - A x ∥^{2}$ ? $2 A^{⊤} A$ . PSD always; PD iff $A$ has full column rank.

39. Closed-form OLS? $\hat{x} = (A^{⊤} A)^{-1} A^{⊤} y$, valid when $A$ has full column rank.

40. What's the chain rule for matrix functions? $d (f \circ g) / d x = (df / d g) (d g / d x)$ — Jacobian product. Backprop is exactly this.

41. Derivative of $\log \det A$ w.r.t. $A$? $A^{-⊤} = (A^{-1})^{⊤}$. Used in VAEs, normalizing flows, GMM.
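A standard way to check identities like those in questions 36, 37, and 41 is to compare them against finite differences. The following is a rough sketch: `numerical_grad` is a hypothetical helper written for this example, the step size `eps` is an arbitrary choice, and the least-squares matrix is named `M` here to keep it separate from the quadratic-form matrix `A`.

```python
import numpy as np

def numerical_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function f at x (vector or matrix)."""
    g = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = eps
        g[idx] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))          # deliberately non-symmetric
x = rng.normal(size=4)
M = rng.normal(size=(6, 4))
y = rng.normal(size=6)

# Gradient of x^T A x should match (A + A^T) x.
g_quad = numerical_grad(lambda v: v @ A @ v, x)

# Gradient of ||y - M x||^2 should match -2 M^T (y - M x).
g_ls = numerical_grad(lambda v: np.sum((y - M @ v) ** 2), x)

# Gradient of log det S should match S^{-T}; S is built to be PD so det S > 0.
S = A @ A.T + 4.0 * np.eye(4)
g_logdet = numerical_grad(lambda Z: np.linalg.slogdet(Z)[1], S)
```

Using a non-symmetric `A` matters for the first check: with a symmetric matrix the shortcut $2 A x$ would also pass, hiding the general formula.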

F. Conditioning

42. Definition of condition number? $κ (A) = σ_{\max} / σ_{\min}$ for invertible $A$. Measures sensitivity to perturbations.

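43. Why does it matter for gradient descent? For GD on a quadratic with Hessian $H$ and the best fixed step size $2 / (λ_{\max} + λ_{\min})$, the error contracts by a factor of $(κ - 1) / (κ + 1)$ per step. Large $κ$ pushes this factor toward 1, so convergence is slow.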

44. How does Adam help with bad conditioning? Per-coordinate adaptive learning rates approximate diagonal preconditioning. Effectively rescales axes — not perfect, but helps when curvature varies axis-by-axis.

45. How does normalization (BN/LN) help with conditioning? Renormalizing activations tends to improve the conditioning (lower the condition number) of intermediate Jacobians/Hessians. This is one proposed reason normalization speeds up training.

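46. What does adding $λ I$ (with $λ > 0$) to a matrix do to its condition number? For symmetric PSD $A$ with eigenvalues $μ_{1} \geq \dots \geq μ_{n} \geq 0$, the eigenvalues become $μ_{i} + λ$, so $κ$ drops from $μ_{1} / μ_{n}$ to $(μ_{1} + λ) / (μ_{n} + λ)$. The smallest eigenvalue is lifted away from zero, which is ridge regression's stabilizing effect.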
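The link between $κ$ and gradient descent speed can be seen on a toy quadratic. This sketch assumes $f (x) = \frac{1}{2} x^{⊤} H x$ with a made-up diagonal Hessian (eigenvalues 1 and 100) and the fixed step size $2 / (λ_{\min} + λ_{\max})$; the starting point, iteration count, and shift `lam` are arbitrary.

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T H x with eigenvalues 1 and 100, so kappa = 100.
H = np.diag([1.0, 100.0])
kappa = np.linalg.cond(H)

# With step size 2 / (lambda_min + lambda_max), each step contracts the error
# by at most (kappa - 1) / (kappa + 1).
lr = 2.0 / (1.0 + 100.0)
rate = (kappa - 1.0) / (kappa + 1.0)

x = np.array([1.0, 1.0])
norms = [np.linalg.norm(x)]
for _ in range(50):
    x = x - lr * (H @ x)              # gradient of the quadratic is H x
    norms.append(np.linalg.norm(x))

# Ridge-style shift: eigenvalues become mu_i + lam, which shrinks their ratio.
lam = 1.0
kappa_shifted = np.linalg.cond(H + lam * np.eye(2))
```

With $κ = 100$ the contraction factor is about 0.98, so 50 steps shrink the error by only a factor of roughly 0.37; after the shift the condition number is about half as large.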

G. Projections and OLS

47. Define a projection matrix. $P^{2} = P$ . Orthogonal projection: also $P = P^{⊤}$ .

48. Projection onto column space of $X$? $P = X (X^{⊤} X)^{-1} X^{⊤}$, assuming $X$ has full column rank (otherwise use $P = X X^{+}$).

49. Geometric view of OLS solution? $\hat{y} = P y$, the projection of $y$ onto $Col (X)$. The residual $y - \hat{y}$ is orthogonal to the columns of $X$ (normal equations).

50. Trace of the hat matrix $P$ ? $tr (P) = rank (X)$ = degrees of freedom of the fit.
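The projection properties in questions 47 to 50 are easy to confirm numerically. The sketch below uses a random full-column-rank `X` and random `y` as stand-in data; forming the hat matrix explicitly is done only for illustration, since a least-squares solver is normally preferred.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))          # full column rank with probability 1
y = rng.normal(size=10)

# Hat matrix P = X (X^T X)^{-1} X^T, using solve instead of an explicit inverse.
P = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = P @ y
residual = y - y_hat

# Same fitted values from a least-squares solver.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Here `P @ P` reproduces `P`, `P` is symmetric, `X.T @ residual` is numerically zero, and `np.trace(P)` comes out as 3, the number of columns of `X`.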

Quick fire

51. Operator norm of $A$? $σ_{\max}$.

52. Frobenius norm via SVD? $\sqrt{\sum_{i} σ_{i}^{2}}$.

53. Best rank-$k$ approximation? Truncated SVD.

54. Eigenvalues of $A^{⊤} A$? $σ_{i}^{2}$ of $A$.

55. Hessian of $\frac{1}{2} ∥ X w - y ∥^{2}$? $X^{⊤} X$.

56. Trace of $A B$ vs $B A$? Equal.

57. Determinant of an orthogonal matrix? $\pm 1$.

58. Inverse of an orthogonal matrix? Its transpose.

59. PSD allows what decomposition? Cholesky.

60. Rank of an outer product $u v^{⊤}$? 1 (unless $u$ or $v$ is zero).

Self-grading

If you can't answer 1-15, you don't know basic linear algebra. If you can't answer 16-35, you'll get tripped up on PCA/SVD/optimization questions. If you can't answer 36-50, frontier-lab interviews on matrix calculus / numerical methods will go past you.

Aim for 40+/60 cold.
