Optimization and Matrix Calculus — Deep Dive
Frontier-lab interview prep. Pair with
INTERVIEW_GRILL.md.
Optimization is what training is. Senior interviews go beyond "use Adam" — they probe whether you understand convexity, second-order behavior, conditioning, and the trade-offs between methods. This deep dive complements the linear-algebra deep dive by focusing on what changes during training: gradients, Hessians, step sizes, and convergence behavior.
1. Convex sets and convex functions
A set is convex if for all and , . Lines, balls, half-spaces, polytopes, intersections of these.
A function is convex on a convex domain if for all and :
Equivalent characterizations:
- is twice-differentiable and Hessian everywhere.
- The epigraph is a convex set.
- For all : (first-order condition).
Strict convexity: strict inequality. Strictly convex + smooth unique minimum.
Strong convexity (-strongly convex): . Equivalently, Hessian .
Why convexity matters
For convex :
- Every local minimum is global.
- Gradient descent converges (with decreasing step size).
- Strong convexity gives geometric convergence rate.
Most ML problems are NOT convex (deep nets, GMMs, K-means), but convex theory still informs:
- Convex sub-problems (e.g., per-step Adam updates).
- Loss landscape locally behaves like quadratic near a minimum (Taylor) — convexity intuition transfers.
2. Smoothness and Lipschitz gradients
A function is -smooth if its gradient is -Lipschitz:
Equivalent (for twice-differentiable): .
This bounds how fast the gradient can change. Gives the descent lemma:
Setting (gradient descent with step ):
Each GD step decreases by at least . This is the foundation of GD convergence proofs.
3. Gradient descent convergence rates
For different problem classes:
| Problem | Step size | Convergence rate |
|---|---|---|
| Convex, -smooth | ||
| Strongly convex (), -smooth | — geometric | |
| Non-convex, -smooth | at rate |
For strongly convex + smooth, is the condition number. With step size , GD contracts as per step. With optimal step , the rate becomes . Bad conditioning → slow convergence either way.
Nesterov acceleration
For convex + smooth, Nesterov's accelerated GD achieves — faster than vanilla. Builds momentum from the gradient at a "look-ahead" point.
For strongly convex + smooth, achieves rate — quadratic improvement in conditioning dependence.
4. Second-order methods
Use the Hessian for better step direction.
Newton's method
For a quadratic, converges in one step. For convex + smooth + strongly convex, achieves quadratic convergence near the optimum (number of correct digits doubles per iteration).
Costs: forming and inverting is in dimension. Infeasible for .
Quasi-Newton (BFGS, L-BFGS)
Approximate from successive gradients. L-BFGS stores only history; popular for medium-scale convex optimization (logistic regression, GLMs).
Why not for deep learning?
- Hessian is huge ( for big models).
- Hessian isn't PSD (loss is non-convex).
- Cost of forming/storing/inverting prohibitive.
- Second-order info noisy on stochastic batches.
Approximations (Shampoo, K-FAC, Sophia) try to use cheap diagonal/block-diagonal approximations of while keeping memory manageable.
Gauss-Newton
For least-squares with residual : approximate Hessian by (Jacobian product). Always PSD. Lev-Marq adds regularization .
5. Stochastic gradient methods
For empirical risk where is huge:
SGD: for random index . Unbiased estimate of gradient; high variance.
SGD convergence
- Convex + smooth: with diminishing step.
- Strongly convex + smooth: with step.
- Non-convex + smooth: at .
Slower than full GD, but each iteration is as expensive — usually a win for huge datasets.
Variance reduction
- Mini-batch: average gradients; reduces variance .
- SVRG, SAGA: explicit variance reduction methods using past gradients. Theoretical wins for finite-sum convex problems; rarely used in deep learning.
- Larger batch + LR: linear scaling rule (Goyal et al., 2017): scale LR with batch size up to a critical batch.
Adaptive methods (Adam et al.)
Adapt per-parameter step size based on historical gradients. Not strictly needed for convex problems; often wins for deep learning due to varying gradient magnitudes across parameters.
(See 10_optimizers/ for full optimizer details.)
6. Constrained optimization
Minimize subject to , .
Lagrangian
with .
KKT conditions (necessary at optimum under a constraint qualification like LICQ or Slater's; sufficient for convex)
- Stationarity: .
- Primal feasibility: , .
- Dual feasibility: .
- Complementary slackness: for each .
Complementary slackness says: for each , at least one of and is zero (their product is zero). So an inactive constraint () must have zero multiplier; a non-zero multiplier signals an active constraint.
Examples in ML
- SVM dual: derived via Lagrangian + KKT. Support vectors = points with non-zero .
- Constrained capacity in MoE: capacity factor caps tokens per expert — Lagrangian relaxation.
- PCA: s.t. → Lagrangian gives eigenvalue equation.
Lagrangian duality
Define dual function . Weak duality: always. Strong duality: for convex problems with constraint qualifications (Slater's condition).
7. The loss landscape in deep learning
Deep network losses are non-convex. Theoretical results that matter:
Saddle points dominate
In high dimensions, saddle points (Hessian has both positive and negative eigenvalues) vastly outnumber local minima. Most "stuck" points are saddles, not minima. Negative-curvature directions can be exploited to escape.
Most local minima are good
Empirical and theoretical evidence (Choromanska et al. 2015; Kawaguchi 2016) suggests that for over-parameterized networks, most local minima have similar (low) loss values. Large flat regions of low loss.
Flat vs sharp minima
Flat minima (low Hessian eigenvalues) generalize better than sharp ones empirically (Hochreiter & Schmidhuber 1997; Keskar et al. 2017). SGD's noise drives it toward flat minima.
Edge of stability (Cohen et al. 2021)
When training with full-batch GD, the largest Hessian eigenvalue grows until , then oscillates. Training continues despite violating classical stability. Surprising.
8. Conditioning revisited (in optimization context)
Condition number for strongly convex problems. Affects:
- GD convergence rate: — exponential, but slow when large.
- Number of iterations: for accuracy .
- With Nesterov: .
In ML, ill-conditioning shows up because:
- Different parameters have different scales.
- Different layers have different curvatures.
- Some directions in parameter space are "stiff" (small eigenvalues of Hessian = directions that change loss slowly).
Mitigations
- Standardize features: reduces conditioning of input.
- Normalization layers (BN, LN): renormalize internal activations.
- Adaptive optimizers (Adam): per-parameter step ≈ diagonal preconditioning.
- Second-order methods (Shampoo, K-FAC): explicit preconditioning.
- Architecture design: residual connections, careful initialization.
9. Common interview gotchas
| Question | Common wrong answer | Right answer |
|---|---|---|
| Is deep learning convex? | Some parts | No — non-convex, but most local minima are reasonable |
| Newton's method always works? | Yes | Only locally; may diverge far from optimum or hit non-PSD Hessian |
| Adam = preconditioned SGD? | Sort of | Approximately — diagonal preconditioning + momentum |
| KKT applies only to convex? | Yes | Necessary conditions hold for any smooth optimization; sufficient only if convex |
| Why scale LR with batch size? | Tradition | Linear scaling rule from Goyal et al. — gradients have lower variance with bigger batch |
| Edge of stability — what's stable? | Loss going down | Loss bounces but trends down; classical stability bound violated |
| Saddle points in deep nets? | Bad | Very common; Hessian eigenvalues are mixed; SGD escapes via noise |
10. Eight most-asked interview questions
- What's strong convexity and why does it matter? (-strongly convex; gives geometric GD convergence rate.)
- Derive the gradient descent convergence rate for smooth + strongly convex. (.)
- Why doesn't Newton's method work for deep learning? (Hessian too big to form/invert; not PSD; noisy on batches.)
- What's KKT and when does it apply? (Necessary at optimum; sufficient for convex; complementary slackness.)
- Derive the SVM dual using Lagrangian. (Standard derivation; support vectors emerge from KKT.)
- Why does Adam help in deep learning? (Diagonal preconditioning of varying gradient scales; momentum.)
- What's the edge of stability phenomenon? (Hessian top eigenvalue oscillates around ; classical stability violated.)
- Why are flat minima better for generalization? (Robustness to perturbation; effective Bayesian model averaging in their basin.)
11. Drill plan
- For convex, smooth, strongly convex — recite definitions + GD rate.
- Derive descent lemma from -smoothness in 5 lines.
- For Newton's method, recite: update, convergence rate, why it fails for deep learning.
- For Lagrangian + KKT — recite all four conditions.
- Derive SVM dual on paper.
- Sketch the loss landscape: saddles dominate, flat = good, edge of stability.
12. Further reading
- Boyd & Vandenberghe, Convex Optimization — the canonical text. Chapters 1–5 are essential.
- Nocedal & Wright, Numerical Optimization — second-order methods, quasi-Newton.
- Bubeck, Convex Optimization: Algorithms and Complexity — modern, concise.
- Goodfellow, Bengio, Courville, Deep Learning, ch. 8 — optimization for neural networks.
- Cohen et al. (2021), Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability.
- Choromanska et al. (2015), The Loss Surfaces of Multilayer Networks.