Optimizers — Interview Grill

40 questions focused on optimizer algorithms specifically — different angle from the LR-centric grill in 02_gradient_descent/INTERVIEW_GRILL.md. Use both.

A. Algorithmic foundations

1. What's the relationship between optimizers and Newton's method? Newton uses $H_{t}^{-1} g_{t}$ as the update direction, accounting for second-order curvature. Storing $H$ is $O(d^{2})$ and inverting it is $O(d^{3})$, infeasible at scale. Every modern optimizer is some cheap approximation: SGD = identity preconditioner; Adam/RMSProp = diagonal $1/\sqrt{\hat{v}}$ preconditioner approximating $\mathrm{diag}(H)^{-1/2}$; Shampoo = block-Kronecker; Sophia = stochastic Hutchinson estimate of $\mathrm{diag}(H)$.

2. Walk me through SGD with classical momentum.

$v_{t + 1} = β v_{t} + g_{t}, θ_{t + 1} = θ_{t} - η v_{t + 1}$

Velocity $v_{t}$ is an exponentially-weighted sum of past gradients. $β = 0.9$ is standard. Effective gradient horizon is $1/ (1 - β) \approx 10$ . Helps convergence in ill-conditioned valleys by averaging out perpendicular oscillations and reinforcing the persistent direction along the valley.
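A minimal sketch of the update above in plain Python, run on a toy ill-conditioned quadratic. The quadratic, the step count, and the helper names are illustrative assumptions, not part of any library.

```python
def sgd_momentum(grad_fn, theta, lr=0.01, beta=0.9, steps=100):
    """Classical momentum: v <- beta*v + g, then theta <- theta - lr*v."""
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        theta = [ti - lr * vi for ti, vi in zip(theta, v)]
    return theta


# Toy valley: L(x, y) = 0.5 * (100 * x^2 + y^2), curvatures 100 and 1.
def grad(theta):
    x, y = theta
    return [100.0 * x, 1.0 * y]


def loss(theta):
    x, y = theta
    return 0.5 * (100.0 * x * x + y * y)


start = [1.0, 1.0]
with_momentum = sgd_momentum(grad, start, lr=0.01, beta=0.9, steps=100)
without_momentum = sgd_momentum(grad, start, lr=0.01, beta=0.0, steps=100)
print(loss(with_momentum), loss(without_momentum))
```

With the same LR, the run with $β = 0.9$ makes far more progress along the flat $y$ direction, which is the "reinforcing the persistent direction" effect.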

3. What's Nesterov momentum and why is it different? Computes the gradient at the lookahead position (where momentum will take you anyway):

$v_{t + 1} = β v_{t} + \nabla L (θ_{t} - η β v_{t}), θ_{t + 1} = θ_{t} - η v_{t + 1}$

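Theoretically (with exact gradients) improves the rate on smooth convex problems from $O(1/T)$ to $O(1/T^{2})$; on smooth strongly-convex problems it improves the condition-number dependence of the linear rate from $κ$ to $\sqrt{κ}$. Empirically often slightly better than Polyak momentum.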
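A minimal sketch of one Nesterov step as written above, assuming list-of-floats parameters and a user-supplied `grad_fn` (both hypothetical). The only difference from classical momentum is where the gradient is evaluated.

```python
def nesterov_step(grad_fn, theta, v, lr, beta):
    """One Nesterov update: gradient is taken at the lookahead point."""
    lookahead = [t - lr * beta * vi for t, vi in zip(theta, v)]
    g = grad_fn(lookahead)
    v = [beta * vi + gi for vi, gi in zip(v, g)]
    theta = [t - lr * vi for t, vi in zip(theta, v)]
    return theta, v


# Toy 1-D loss L(x) = x^2, so the gradient is 2x.
def grad(theta):
    return [2.0 * theta[0]]


theta, v = [1.0], [0.0]
theta, v = nesterov_step(grad, theta, v, lr=0.1, beta=0.9)   # plain GD step, v was 0
theta1 = theta[0]
theta, v = nesterov_step(grad, theta, v, lr=0.1, beta=0.9)   # lookahead now matters
theta2 = theta[0]
```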

4. Walk me through RMSProp.

$v_{t} = β v_{t-1} + (1 - β) g_{t}^{2}, \quad θ_{t+1} = θ_{t} - η \cdot \frac{g_{t}}{\sqrt{v_{t}} + ε}$

Per-parameter rescaling by RMS of recent gradients. The second moment $E[g g^{⊤}]$ is the (empirical) Fisher information matrix, not the Hessian directly. For likelihood losses, $F$ matches the expected Hessian only when the model's predictive distribution matches the data, so "diagonal Hessian approximation" is loose; "diagonal Fisher" is more accurate. Removes most of the LR-tuning sensitivity that plain SGD has.
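A minimal sketch of one RMSProp step, dividing by the root of the running second moment (the RMS). The function name and the toy gradients are assumptions for illustration.

```python
import math


def rmsprop_step(theta, g, v, lr=1e-3, beta=0.9, eps=1e-8):
    """One RMSProp update with per-parameter rescaling by sqrt(v)."""
    v = [beta * vi + (1 - beta) * gi * gi for vi, gi in zip(v, g)]
    theta = [t - lr * gi / (math.sqrt(vi) + eps) for t, gi, vi in zip(theta, g, v)]
    return theta, v


# Two parameters whose gradients differ by four orders of magnitude.
theta0 = [0.0, 0.0]
new_theta, v = rmsprop_step(theta0, g=[100.0, 0.01], v=[0.0, 0.0])
step_big = theta0[0] - new_theta[0]
step_small = theta0[1] - new_theta[1]
print(step_big, step_small)
```

Both parameters move by almost the same amount, which is why a single global $η$ becomes much easier to tune than with plain SGD.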

5. Walk me through Adam with bias correction.

$\begin{aligned}
m_{t} &= β_{1} m_{t-1} + (1 - β_{1}) g_{t} && \text{(first moment, momentum)} \\
v_{t} &= β_{2} v_{t-1} + (1 - β_{2}) g_{t}^{2} && \text{(second moment, RMS)} \\
\hat{m}_{t} &= m_{t} / (1 - β_{1}^{t}) && \text{(bias correction)} \\
\hat{v}_{t} &= v_{t} / (1 - β_{2}^{t}) && \text{(bias correction)} \\
θ_{t+1} &= θ_{t} - η \cdot \hat{m}_{t} / (\sqrt{\hat{v}_{t}} + ε)
\end{aligned}$

Defaults: $β_{1} = 0.9$, $β_{2} = 0.999$, $ε = 10^{-8}$. Combines momentum and adaptive per-parameter rescaling.
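A minimal sketch of one Adam step with bias correction in plain Python, with $ε$ added after the square root. The list-based state and the function name are assumptions for illustration; real implementations work on tensors.

```python
import math


def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count used for bias correction."""
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [
        ti - lr * mh / (math.sqrt(vh) + eps)
        for ti, mh, vh in zip(theta, m_hat, v_hat)
    ]
    return theta, m, v


theta, m, v = adam_step(
    theta=[1.0, -2.0], g=[0.5, -3.0], m=[0.0, 0.0], v=[0.0, 0.0], t=1
)
print(theta)
```

On the first step the corrected ratio is $g / |g|$, so each parameter moves by roughly $η$ regardless of gradient scale.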

6. Why is bias correction necessary? $m_{t}$ and $v_{t}$ initialize at zero. Without correction, the moments over the first $\sim 1/(1 - β)$ steps underestimate the true running averages (biased toward zero). For $β_{2} = 0.999$, the uncorrected $v_{t}$ is biased for ~1000 steps. Because $β_{2} > β_{1}$, $\sqrt{v_{t}}$ is shrunk more than $m_{t}$, so the early effective step $η \cdot m_{t} / \sqrt{v_{t}}$ is too large (about 3x at $t = 1$ with default betas) and training often diverges. The bias correction $1/(1 - β^{t})$ exactly inverts the geometric-series discount when the gradient statistics are stationary.
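A worked first-step example, assuming a constant gradient $g = 1$ and the default betas. It compares the uncorrected ratio $m_{1} / \sqrt{v_{1}}$ with the bias-corrected one.

```python
import math

beta1, beta2, g = 0.9, 0.999, 1.0

m1 = (1 - beta1) * g        # first moment after one step from m_0 = 0
v1 = (1 - beta2) * g * g    # second moment after one step from v_0 = 0

# Without correction the step is inflated by (1 - beta1) / sqrt(1 - beta2).
uncorrected_step = m1 / math.sqrt(v1)            # about 3.16

# With correction both moments are rescaled to the true value g.
m1_hat = m1 / (1 - beta1 ** 1)
v1_hat = v1 / (1 - beta2 ** 1)
corrected_step = m1_hat / math.sqrt(v1_hat)      # about 1.0

print(uncorrected_step, corrected_step)
```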

7. What does $ε$ in Adam control? Two roles: (a) numerical floor preventing $1/\sqrt{\hat{v}}$ from blowing up when $\hat{v} \approx 0$, (b) implicit cap on per-parameter LR: when $\sqrt{\hat{v}} ≪ ε$, the update is $(η / ε) \cdot \hat{m}$, so dimensions with very small gradients still get sensible updates. Some recipes set $ε = 10^{-3}$ for embeddings to dampen aggressive updates on rare tokens.

8. What if you set $β_{2} = 0.9999$? The second-moment horizon grows to ~10000 steps. Pros: more robustness to outlier gradients. Cons: very slow to track changes in gradient statistics; when training transitions from warmup to the main phase, $\hat{v}$ lags badly. Empirically, $β_{2} = 0.999$ is a sweet spot. $β_{2} = 0.95$ is sometimes used for very long pretraining for the opposite reason: faster reaction.

B. AdamW vs Adam vs L2

9. What is AdamW? Adam with decoupled weight decay. The update becomes:

$θ_{t+1} = θ_{t} - η \cdot \frac{\hat{m}_{t}}{\sqrt{\hat{v}_{t}} + ε} - η \cdot λ \cdot θ_{t}$

Weight decay applied directly to $θ$ after the Adam update, not added to the gradient.

10. Why isn't Adam-with-L2 equivalent to AdamW? Adam-with-L2 adds $λ θ$ to the gradient: $g_{t} \leftarrow g_{t} + λ θ_{t}$. Then $v_{t}$ accumulates $(g_{t} + λ θ_{t})^{2}$, the regularization term gets divided by $\sqrt{\hat{v}}$, and parameters with large second-moment estimates (large or noisy gradients) see weakened L2. Decay strength becomes non-uniform across parameters in a way nobody intends. AdamW separates decay from preconditioning; every parameter is multiplied by the same factor $(1 - η λ)$ per step regardless of its gradient statistics.
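A rough numerical illustration of the non-uniform decay. It is a simplified steady-state sketch that ignores momentum and assumes $\hat{v}$ has settled at $(g + λ θ)^{2}$; the function names and values are hypothetical.

```python
import math


def l2_decay_term(theta, g, lr, lam, eps=1e-8):
    """Decay part of an Adam+L2 step, assuming v_hat ~= (g + lam*theta)^2."""
    v_hat = (g + lam * theta) ** 2
    return lr * lam * theta / (math.sqrt(v_hat) + eps)


def adamw_decay_term(theta, lr, lam):
    """Decay part of an AdamW step: independent of gradient statistics."""
    return lr * lam * theta


lr, lam, theta = 1e-3, 0.1, 1.0

l2_small_grad = l2_decay_term(theta, g=0.01, lr=lr, lam=lam)   # ~9.1e-4
l2_large_grad = l2_decay_term(theta, g=10.0, lr=lr, lam=lam)   # ~9.9e-6
adamw_any_grad = adamw_decay_term(theta, lr=lr, lam=lam)       # lr * lam * theta

print(l2_small_grad, l2_large_grad, adamw_any_grad)
```

Same weight, same $λ$, but under Adam+L2 the parameter with large gradients is shrunk roughly 90x less in this toy setup.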

11. For SGD, are L2 and weight decay equivalent? Yes. Gradient of $(λ /2) ∥ θ ∥^{2}$ is $λ θ$ , so SGD with explicit decay is identical to SGD with L2. They diverge only when there's preconditioning (Adam, RMSProp, K-FAC).
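The equivalence is a one-line algebra check; a small sketch for a single scalar parameter with plain SGD (no momentum), using hypothetical function names:

```python
def sgd_l2_step(theta, g, lr, lam):
    """SGD on loss + (lam/2) * theta^2: the L2 term enters through the gradient."""
    return theta - lr * (g + lam * theta)


def sgd_decoupled_step(theta, g, lr, lam):
    """SGD with explicit weight decay applied next to the gradient step."""
    return theta - lr * g - lr * lam * theta


a = sgd_l2_step(theta=2.0, g=0.5, lr=0.1, lam=0.01)
b = sgd_decoupled_step(theta=2.0, g=0.5, lr=0.1, lam=0.01)
print(a, b)
```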

12. What's a typical AdamW weight decay value for LLMs? $λ = 0.1$ for pretraining is the modern default. $λ = 0.01$ is more typical for vision and smaller models. SFT and DPO usually use $0.0$ or very small ( $0.001$ ).

13. Why do attention layers and embeddings sometimes have different weight decay? Embedding parameters often see sparse gradient updates (only sampled tokens get gradient). Decay applied uniformly per step over-shrinks rare-token embeddings. Common fixes: zero weight decay on embeddings, layer-norm parameters, and biases; non-zero decay only on weight matrices.
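One way to express that split is as two parameter groups. The sketch below is framework-free and assumes a hypothetical naming scheme (embeddings contain `embed` in their name) and a dict of name to shape; adapt the rule to the real model's names.

```python
def split_decay_groups(param_shapes, weight_decay=0.1):
    """Split parameter names into decay / no-decay groups.

    param_shapes: dict mapping parameter name -> shape tuple (hypothetical layout).
    """
    decay, no_decay = [], []
    for name, shape in param_shapes.items():
        is_matrix = len(shape) >= 2
        is_embedding = "embed" in name
        if is_matrix and not is_embedding:
            decay.append(name)
        else:
            no_decay.append(name)  # biases, norm gains, embeddings
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


shapes = {
    "tok_embed.weight": (50000, 512),
    "attn.q_proj.weight": (512, 512),
    "attn.q_proj.bias": (512,),
    "ln1.weight": (512,),
}
groups = split_decay_groups(shapes)
print(groups)
```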

C. Lion, Sophia, and modern alternatives

14. Walk me through Lion. Sign-based update:

$\begin{aligned}
c_{t} &= β_{1} m_{t-1} + (1 - β_{1}) g_{t} && \text{(interpolation)} \\
θ_{t+1} &= θ_{t} - η \cdot \mathrm{sign}(c_{t}) - η \cdot λ \cdot θ_{t} \\
m_{t} &= β_{2} m_{t-1} + (1 - β_{2}) g_{t} && \text{(momentum)}
\end{aligned}$

Update magnitude per parameter is exactly $η$ (modulo decay). No second moment, no division, no square root. Memory: one state buffer per param vs Adam's two.
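A minimal sketch of one Lion step in plain Python. The default betas used here (0.9, 0.99) and the function names are assumptions for illustration.

```python
def sign(x):
    """Return -1, 0, or 1."""
    return (x > 0) - (x < 0)


def lion_step(theta, g, m, lr=1e-4, beta1=0.9, beta2=0.99, lam=0.0):
    """One Lion update: sign of an interpolated momentum, plus decoupled decay."""
    c = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    theta = [t - lr * sign(ci) - lr * lam * t for t, ci in zip(theta, c)]
    m = [beta2 * mi + (1 - beta2) * gi for mi, gi in zip(m, g)]
    return theta, m


# Gradients of very different size produce steps of identical magnitude lr.
theta, m = lion_step(theta=[1.0, -1.0], g=[0.3, -5.0], m=[0.0, 0.0], lr=0.1)
print(theta, m)
```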

15. Why does Lion sometimes work as well as AdamW? Sign normalization is an extreme form of per-parameter rescaling, like Adam's $1/\sqrt{\hat{v}}$ taken to the limit. When gradient magnitudes are similar across parameters (after normalization layers do their job), the normalization in Adam is doing less work than people assume; sign is "good enough" and saves memory.

16. What's the LR difference between Lion and AdamW? Lion's optimal $η$ is typically 3–10x smaller than AdamW's, because sign updates are "always full magnitude" while Adam's updates can be smaller for low-gradient parameters. Lion's optimal weight decay is typically 3–10x larger, which keeps the effective decay $η \cdot λ$ roughly unchanged.

17. What's Sophia? Adam-like, but uses a stochastic Hessian-diagonal estimate via Hutchinson's estimator instead of $\overset{v}{^}$ :

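$\hat{h}_{t} = \text{EMA of stochastic Hutchinson estimates of } \mathrm{diag}(H)$

$θ_{t+1} = θ_{t} - η \cdot \mathrm{clip}\left( \frac{\hat{m}_{t}}{\max(γ \cdot \hat{h}_{t}, ε)}, ρ \right) - η \cdot λ \cdot θ_{t}$

The clip bounds each coordinate's step to $[-ρ, ρ]$, which guards against tiny or negative curvature estimates.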

Hutchinson uses $\mathrm{diag}(H) \approx E[u ⊙ H u]$ for a random probe vector $u$ (for example with $\pm 1$ entries); $H u$ is computed via a Hessian-vector product (one extra backward pass). Reportedly converges in fewer steps than AdamW on language modeling. Cost: ~25% more compute per step.
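A small sketch of why the Hutchinson identity recovers the diagonal. It uses an explicit 3x3 matrix and averages over all $\pm 1$ probe vectors (written `u` below) so the expectation is exact. In practice $H$ is never formed: a handful of random probes and an autodiff Hessian-vector product are used instead. The matrix values are made up.

```python
from itertools import product


def hutchinson_diag_exact(H):
    """Average u * (H u) over every Rademacher vector u in {-1, +1}^n."""
    n = len(H)
    total = [0.0] * n
    count = 0
    for u in product((-1.0, 1.0), repeat=n):
        Hu = [sum(H[i][j] * u[j] for j in range(n)) for i in range(n)]
        for i in range(n):
            total[i] += u[i] * Hu[i]   # H_ii plus off-diagonal cross terms
        count += 1
    return [s / count for s in total]


H = [
    [4.0, 1.0, 0.5],
    [1.0, 3.0, -2.0],
    [0.5, -2.0, 5.0],
]
diag_estimate = hutchinson_diag_exact(H)
print(diag_estimate)
```

The off-diagonal terms $H_{ij} u_{i} u_{j}$ cancel in expectation because $u_{i}$ and $u_{j}$ are independent with zero mean, leaving only $H_{ii}$.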

18. Why isn't Sophia universally adopted? (a) Per-step compute cost. (b) Implementation complexity (HVP via PyTorch isn't a one-liner). (c) Public benchmarks at 70B+ scale are scarce. (d) AdamW is "good enough" — frontier labs are conservative about changing the optimizer mid-training run.

19. What's Shampoo? Per-layer Kronecker-factored preconditioner. For an $m \times n$ weight matrix, store left factor $L_{t}$ ( $m \times m$ ) and right factor $R_{t}$ ( $n \times n$ ). Update:

$W_{t + 1} = W_{t} - η \cdot L_{t}^{- 1/4} G_{t} R_{t}^{- 1/4}$

Memory $O (m^{2} + n^{2})$ per layer instead of $O (d^{2})$ . Empirically state-of-the-art on some tasks but adopted slowly because of implementation complexity and the cost of computing matrix inverse-roots.
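To make the memory claim concrete, a quick count of preconditioner entries for one layer. The 4096 x 4096 layer size is an illustrative assumption.

```python
def preconditioner_entries(m, n):
    """Entry counts for one m x n weight matrix."""
    full = (m * n) ** 2        # dense preconditioner over all m*n weights
    shampoo = m * m + n * n    # left factor (m x m) plus right factor (n x n)
    return full, shampoo


full, shampoo = preconditioner_entries(4096, 4096)
print(full, shampoo, full // shampoo)
```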

20. When would you actually pick Shampoo or K-FAC over Adam? Specific small-model regimes where the per-step compute overhead is acceptable, generalization is paramount, and you have engineering bandwidth. In standard LLM pretraining at scale, AdamW dominates because the implementation is battle-tested and the gains from second-order are not large enough to justify the complexity.

D. Why optimizers fail and how to debug

21. Adam diverges at step 200. What's going on and how do you fix it? Most likely: warmup is too short or peak $η$ is too high. The $\hat{v}$ estimate is still unreliable this early, so an outlier gradient that hits before the second-moment estimate has stabilized produces an oversized step. Fix: extend warmup to 2000+ steps, lower peak $η$ by 3x. Secondary fixes: gradient clipping at norm 1.0, increase $β_{2}$ to 0.9999 for slower variance updates.
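A minimal sketch of the two standard fixes, linear warmup and global-norm clipping, in plain Python. Function names and the constant-after-warmup schedule are assumptions; real runs usually follow warmup with a decay schedule.

```python
import math


def warmup_lr(step, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then constant (decay omitted for brevity)."""
    return peak_lr * min(1.0, (step + 1) / warmup_steps)


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]


early_lr = warmup_lr(step=0, peak_lr=3e-4, warmup_steps=2000)
late_lr = warmup_lr(step=5000, peak_lr=3e-4, warmup_steps=2000)

clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)     # norm 5 -> ~1
untouched = clip_by_global_norm([0.1, 0.2], max_norm=1.0)   # norm < 1, unchanged
print(early_lr, late_lr, clipped, untouched)
```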

22. Adam works on smaller batch but not on larger. Check the LR scaling rule. For a batch size increase of $k$x, Adam typically needs roughly $\sqrt{k}$ LR scaling (SGD's usual rule is linear in $k$). If you doubled batch size and kept $η$ constant, you may have under-scaled. Also: longer warmup is needed for larger batches because each step now has bigger effective magnitude.
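A small helper for the scaling arithmetic. Square-root scaling for Adam and linear scaling for SGD are common heuristics rather than guarantees; the function name and the example values are hypothetical.

```python
import math


def scale_lr(base_lr, base_batch, new_batch, rule="sqrt"):
    """Rescale a tuned LR when the batch size changes by k = new / base."""
    k = new_batch / base_batch
    if rule == "sqrt":      # common heuristic for Adam-family optimizers
        return base_lr * math.sqrt(k)
    if rule == "linear":    # common heuristic for SGD
        return base_lr * k
    raise ValueError(f"unknown rule: {rule}")


adam_lr = scale_lr(3e-4, base_batch=256, new_batch=1024, rule="sqrt")
sgd_lr = scale_lr(3e-4, base_batch=256, new_batch=1024, rule="linear")
print(adam_lr, sgd_lr)
```

Treat the result as a starting point for a short LR sweep, not a final value.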

23. Adam learns fast then plateaus. Schedule decayed too aggressively. Or $\hat{v}$ accumulated outliers and is now over-suppressing the update direction. Or the LR finder picked a value that's only good for early training. Solutions: warm restart, switch to a less aggressive schedule, or transition to SGD for the final phase.

24. SGD with momentum is unstable on transformers. Expected. Transformers have ill-conditioned gradients across layers — embedding tables and FFN layers have wildly different scales. SGD's single global LR can't accommodate this. Fix: switch to AdamW. SGD+momentum without per-layer scaling is essentially never the right answer for transformers.

25. Loss spikes occasionally with Adam at the right LR. Edge of stability. Common, often benign. Add gradient clipping at norm 1.0 if not already present. Don't reflexively lower LR — that may move you below the optimal operating point.

26. Loss is fine but eval is degrading. Probably overfitting. Optimizer can contribute (Adam's preconditioning tends toward sharper minima), but the first move is to add regularization (weight decay, dropout, more data) rather than change optimizer.

27. After a checkpoint reload, training is unstable. Likely: optimizer state wasn't loaded. Adam without $m_{t}, v_{t}$ state is just Adam-from-scratch with incorrect $t$ . Always serialize and restore optimizer state, including $t$ .

28. Your team's Adam runs work; mine doesn't. What do you check? Optimizer state (loaded?), bias correction (correctly implemented?), $ε$ placement ($\sqrt{\hat{v}} + ε$ or $\sqrt{\hat{v} + ε}$: different!), warmup length (matches reference?), batch size and LR scaling (compatible?), gradient clipping (in place?). The $ε$ placement is a real bug source: PyTorch and TF have differed historically.

29. Why might LARS or LAMB show up? Very large batch training (>16K). Per-layer trust ratios prevent any single layer's update from being too large relative to its parameters. Mostly superseded by muP at frontier labs but appears in some published large-batch ablations.
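A minimal sketch of the per-layer trust ratio idea shared by LARS and LAMB, leaving out weight decay and the clamping details that the real algorithms add. Function names are hypothetical.

```python
import math


def trust_ratio(weights, update, eps=1e-8):
    """||w|| / ||update|| for one layer; falls back to 1.0 if either norm is zero."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    u_norm = math.sqrt(sum(u * u for u in update))
    if w_norm == 0.0 or u_norm == 0.0:
        return 1.0
    return w_norm / (u_norm + eps)


def layerwise_step(weights, update, lr):
    """Scale the layer's update so its size is about lr * ||w||."""
    r = trust_ratio(weights, update)
    return [w - lr * r * u for w, u in zip(weights, update)]


# The raw update is 10x larger than the weights; the trust ratio tames it.
new_weights = layerwise_step(weights=[3.0, 4.0], update=[30.0, 40.0], lr=0.1)
print(new_weights)
```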

30. What's muP and how does it relate to optimizers? muP changes initialization scales and per-layer LR factors so the optimal LR is invariant under model width. Sweep LR cheaply on a small model, scale up. Doesn't replace the optimizer (you still use AdamW under muP) — it changes how parameters and learning rates are scaled across model sizes.

E. Theoretical / advanced

31. Why does Adam achieve lower training loss but worse test loss than SGD on some tasks? Adam's preconditioning biases the optimizer toward sharper minima. Several explanations: (a) per-parameter rescaling reduces SGD-style gradient noise that biases toward flat minima, (b) $1/\sqrt{\hat{v}}$ directs more aggressive updates toward sharper directions, (c) different effective trajectory shape. Mitigations: AdamW (helps), longer training (helps), switching from Adam to SGD for the last epochs (sometimes helps).

32. What's the convergence rate of SGD on convex problems? For smooth convex: $O(1/\sqrt{T})$ with constant LR; $O(1/T)$ with a decaying LR under strong convexity. With Polyak averaging: $O(1/T)$. With Nesterov on smooth strongly-convex (exact gradients): $O(\exp(-c \cdot T / \sqrt{κ}))$. Real deep learning is non-convex, so these rates do not strictly apply, but they motivate why momentum and acceleration matter in theory.

33. What's the implicit regularization perspective on SGD vs. Adam? SGD's mini-batch noise has scale $η / B$ , biasing toward flat minima. Adam's preconditioning rescales per parameter, changing the noise structure: noise in low-gradient parameters is amplified, noise in high-gradient parameters is suppressed. The net effect is a different (and sometimes weaker) implicit regularization than SGD.

34. Why don't we use second-order methods for deep learning? Storage is $O(d^{2})$ and inversion is $O(d^{3})$. For $10^{9}$ parameters, that's $10^{18}$ Hessian entries and $10^{27}$ inversion operations, wildly infeasible. Stochastic-Hessian approximations (Sophia, K-FAC, Shampoo) trade exactness for tractability. Even those are expensive enough that AdamW remains dominant in production.

35. What's the natural gradient and how does it relate to optimizers? Natural gradient is $F^{-1} g$, where $F$ is the Fisher information matrix (the expected Hessian of the negative log-likelihood under the model's own predictive distribution). It's the steepest descent in distribution space rather than parameter space, optimal in an information-geometric sense. K-FAC approximates $F$ block-diagonally; SGD ignores it. The relationship: under specific assumptions, RMSProp's $1/\sqrt{E[g^{2}]} \approx \mathrm{diag}(F)^{-1/2}$, giving Adam an information-geometric interpretation.

36. Why might $ε$ placement matter? $\sqrt{\hat{v}} + ε$ vs $\sqrt{\hat{v} + ε}$? $\sqrt{\hat{v}} + ε$: $ε$ is added after the square root, so it's a floor on the divisor. Standard Adam. $\sqrt{\hat{v} + ε}$: $ε$ is added inside the square root and behaves like a tiny variance prior. Almost equivalent for $\hat{v} ≫ ε$, but different near zero. Different libraries have used different conventions historically; PyTorch uses $\sqrt{\hat{v}} + ε$. Worth knowing if you're translating between codebases.
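A quick numeric comparison of the two placements, with $ε$ added after the square root versus inside it. Function names are hypothetical; $ε = 10^{-8}$ is the usual default.

```python
import math


def denom_outside(v_hat, eps=1e-8):
    """eps added after the square root: sqrt(v_hat) + eps."""
    return math.sqrt(v_hat) + eps


def denom_inside(v_hat, eps=1e-8):
    """eps added inside the square root: sqrt(v_hat + eps)."""
    return math.sqrt(v_hat + eps)


# Near zero the two divisors differ by four orders of magnitude.
near_zero_ratio = denom_inside(0.0) / denom_outside(0.0)    # about 1e4

# For v_hat much larger than eps they are practically identical.
large_v_gap = abs(denom_inside(1.0) - denom_outside(1.0))   # about 5e-9

print(near_zero_ratio, large_v_gap)
```

With the same nominal $ε$, the inside convention caps the per-parameter step at about $η \cdot \hat{m} / 10^{-4}$ instead of $η \cdot \hat{m} / 10^{-8}$, so a ported config can behave very differently on near-dead parameters.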

F. Quick fire

37. Default Adam betas? $0.9, 0.999$.

38. AdamW weight decay for LLM pretrain? $0.1$.

39. Lion LR vs AdamW LR? Lion ~3–10x lower.

40. Sophia per-step compute cost? ~25% more than AdamW (one extra HVP per step).

Self-grading

If you can't answer 1–10, you don't know optimizers. If you can't answer 11–20, you don't know modern LLM training. If you can't answer 21–36, you'll struggle in frontier-lab applied scientist screens. Aim for 30+/40 cold before walking in.
