Whiteboard Derivations — Deep Dive
Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.
This deep dive is the catalog of derivations you should be able to do on a whiteboard cold. Frontier-lab interviews routinely ask "derive X" — backprop, attention, OLS gradient, KL, EM, DPO. Knowing the shape of these derivations beats memorizing the answer.
This is a meta-document that points to the relevant deep dive for each derivation while listing the key steps you need to recite.
1. Backpropagation for a 2-layer MLP
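Setup: $z_1 = W_1 x + b_1$, $h = \sigma(z_1)$, $z_2 = W_2 h + b_2$, $\hat{y} = \mathrm{softmax}(z_2)$, $L = -\sum_k y_k \log \hat{y}_k$ (with $y$ one-hot).
Steps:
- Cross-entropy + softmax simplification (the magic step: derive it, don't just assert):
  - Softmax Jacobian: $\partial \hat{y}_k / \partial z_{2,j} = \hat{y}_k (\delta_{kj} - \hat{y}_j)$.
  - $\partial L / \partial z_{2,j} = -\sum_k \frac{y_k}{\hat{y}_k}\, \hat{y}_k (\delta_{kj} - \hat{y}_j) = -y_j + \hat{y}_j \sum_k y_k$.
  - $= \hat{y}_j - y_j$ (using $\sum_k y_k = 1$).
  - So $\partial L / \partial z_2 = \hat{y} - y$.
- $\partial L / \partial W_2 = (\hat{y} - y)\, h^\top$.
- $\partial L / \partial b_2 = \hat{y} - y$.
- $\partial L / \partial h = W_2^\top (\hat{y} - y)$.
- $\partial L / \partial z_1 = (\partial L / \partial h) \odot \sigma'(z_1)$.
- $\partial L / \partial W_1 = (\partial L / \partial z_1)\, x^\top$ and $\partial L / \partial b_1 = \partial L / \partial z_1$.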
Key insights:
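- Cross-entropy + softmax simplifies dramatically: the gradient with respect to the logits is just $\hat{y} - y$. The mess from softmax's Jacobian and CE's $1/\hat{y}_k$ factor cancel.
- Chain rule: going backward, each layer multiplies by $W^\top$ (transpose) and, elementwise, by $\sigma'(z)$.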
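The backward pass above can be checked numerically. The sketch below is a minimal NumPy version under stated assumptions: `tanh` stands in for the activation $\sigma$, the layer sizes and random seed are arbitrary, and the function names (`forward`, `backward`, `numerical_grad`) are illustrative rather than taken from any library.

```python
import numpy as np

def softmax(z):
    # Subtracting the max is for numerical stability; the output is unchanged.
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(params, x, y):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    h = np.tanh(z1)                 # assumed activation for this sketch
    z2 = W2 @ h + b2
    y_hat = softmax(z2)
    loss = -np.sum(y * np.log(y_hat))
    return loss, (h, y_hat)

def backward(params, x, y):
    W1, b1, W2, b2 = params
    _, (h, y_hat) = forward(params, x, y)
    dz2 = y_hat - y                 # softmax + cross-entropy simplification
    dW2 = np.outer(dz2, h)
    db2 = dz2
    dh = W2.T @ dz2                 # multiply by the transpose
    dz1 = dh * (1.0 - h ** 2)       # tanh'(z1) = 1 - tanh(z1)^2
    dW1 = np.outer(dz1, x)
    db1 = dz1
    return dW1, db1, dW2, db2       # same order as params

def numerical_grad(params, x, y, idx, eps=1e-6):
    # Central finite differences on every entry of params[idx].
    grad = np.zeros_like(params[idx])
    for i in np.ndindex(grad.shape):
        old = params[idx][i]
        params[idx][i] = old + eps
        loss_plus, _ = forward(params, x, y)
        params[idx][i] = old - eps
        loss_minus, _ = forward(params, x, y)
        params[idx][i] = old
        grad[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=4)
y = np.eye(3)[1]                    # one-hot target, class 1
params = [rng.normal(size=(5, 4)), rng.normal(size=5),
          rng.normal(size=(3, 5)), rng.normal(size=3)]

analytic = backward(params, x, y)
max_err = max(np.abs(g - numerical_grad(params, x, y, i)).max()
              for i, g in enumerate(analytic))
```

If the derivation is right, `max_err` should sit near floating-point noise. A gradient check like this is also a reasonable thing to mention at the whiteboard when asked how you would validate a hand-written backward pass.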
See 31_neural_networks/.
2. Scaled dot-product attention
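Setup: $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$.
Steps:
- Scores: $S = QK^\top / \sqrt{d_k}$.
- Weights: $A = \mathrm{softmax}(S)$, applied row-wise.
- Output: $O = AV$.
Why $\sqrt{d_k}$: the variance of the entries of $QK^\top$ scales with $d_k$ if $Q, K$ have independent, zero-mean, unit-variance entries. Divide by $\sqrt{d_k}$ to keep the variance at 1 → softmax doesn't saturate.
Multi-head: project $Q, K, V$ to $h$ heads of dim $d_k = d_{\text{model}} / h$; do attention per head; concatenate; project back.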
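A minimal single-head sketch of the three steps in NumPy, plus an empirical check of the variance argument. The causal-mask option, the shapes, and the names `softmax_rows` and `attention` are assumptions made for illustration.

```python
import numpy as np

def softmax_rows(s):
    # Row-wise softmax; subtracting the row max avoids overflow.
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                  # scaled scores, shape (n, m)
    if causal:
        # Position i may only attend to positions j <= i.
        future = np.triu(np.ones(S.shape, dtype=bool), k=1)
        S = np.where(future, -np.inf, S)
    A = softmax_rows(S)
    return A @ V, A

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 64))
K = rng.normal(size=(6, 64))
V = rng.normal(size=(6, 8))
out, A = attention(Q, K, V, causal=True)

# Variance check on larger random matrices with unit-variance entries.
Qb = rng.normal(size=(512, 64))
Kb = rng.normal(size=(512, 64))
raw_var = (Qb @ Kb.T).var()                     # roughly d_k = 64
scaled_var = (Qb @ Kb.T / np.sqrt(64)).var()    # roughly 1
```

Each row of `A` sums to 1 and, with `causal=True`, has zeros above the diagonal. Multi-head attention would wrap this function with per-head projection matrices, which are omitted here.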
See 04_transformers/, 05_attention_mechanisms/.
3. OLS closed form
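Setup: $L(w) = \tfrac{1}{2}\lVert Xw - y \rVert_2^2$.
Steps:
- Gradient: $\nabla_w L = X^\top (Xw - y)$.
- Set to zero: $X^\top X w = X^\top y$ (the normal equations).
- Solve: $w^* = (X^\top X)^{-1} X^\top y$ (assuming $X^\top X$ invertible).
Hessian: $\nabla_w^2 L = X^\top X$, which is always PSD, and PD if $X$ has full column rank.
Geometric: $\hat{y} = Xw^* = Py$, where $P = X (X^\top X)^{-1} X^\top$ is the projection onto $\mathrm{col}(X)$.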
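The closed form is easy to sanity-check in NumPy. This sketch assumes the $\tfrac{1}{2}\lVert Xw - y\rVert^2$ scaling of the loss and uses synthetic data; `w_true` and the noise level are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])             # hypothetical ground truth
y = X @ w_true + 0.1 * rng.normal(size=50)

# Normal equations: solve X^T X w = X^T y without forming an explicit inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

grad_at_opt = X.T @ (X @ w_hat - y)             # should be ~0 at the optimum
hess_eigs = np.linalg.eigvalsh(X.T @ X)         # all > 0 for full column rank

# Projection matrix onto the column space of X.
P = X @ np.linalg.solve(X.T @ X, X.T)
```

In practice `np.linalg.lstsq` (or a QR factorization) is usually preferred over the normal equations because forming $X^\top X$ squares the condition number; the normal-equation form is shown because it mirrors the derivation.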
See 24_linear_algebra_qa/, 48_optimization_and_matrix_calculus/.
4. Logistic regression gradient
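Setup: $p = \sigma(z)$ with $z = w^\top x$, $L = -[\,y \log p + (1 - y) \log(1 - p)\,]$.
Steps:
- $\partial L / \partial p = -\frac{y}{p} + \frac{1 - y}{1 - p} = \frac{p - y}{p(1 - p)}$ (combine fractions).
- $\partial p / \partial z = p(1 - p)$ (sigmoid derivative).
- Chain rule, the magic cancellation: $\partial L / \partial z = \frac{p - y}{p(1 - p)} \cdot p(1 - p) = p - y$. The $p(1 - p)$ from the sigmoid derivative kills the $p(1 - p)$ in the denominator from CE. That's the GLM canonical-link beauty.
- $\nabla_w L = (p - y)\, x$ (since $z = w^\top x$, $\partial z / \partial w = x$).
Key insight: same gradient form as linear regression (residual times input), which is why these models feel the same. Over a dataset the Hessian is $X^\top \mathrm{diag}\big(p_i (1 - p_i)\big) X$, always PSD → loss convex.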
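The residual-times-input gradient and the PSD Hessian can both be checked numerically. This is a sketch on synthetic data that sums the per-example loss over a small dataset; the names `nll`, `grad`, and `hessian` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, X, y):
    # Negative log-likelihood summed over the dataset.
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, X, y):
    # Residual times input: X^T (p - y).
    return X.T @ (sigmoid(X @ w) - y)

def hessian(w, X):
    # X^T diag(p (1 - p)) X, built without forming the diagonal matrix.
    p = sigmoid(X @ w)
    return X.T @ (X * (p * (1 - p))[:, None])

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = (rng.uniform(size=40) < 0.5).astype(float)
w = rng.normal(size=3)

eps = 1e-6
num_grad = np.array([
    (nll(w + eps * e, X, y) - nll(w - eps * e, X, y)) / (2 * eps)
    for e in np.eye(3)
])
grad_err = np.abs(num_grad - grad(w, X, y)).max()
min_eig = np.linalg.eigvalsh(hessian(w, X)).min()
```

A non-negative `min_eig` at an arbitrary `w` is consistent with convexity, since the weights $p_i(1 - p_i)$ are always in $(0, 1/4]$.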
See 01_classical_ml/, 37_mle_map_estimation/.
5. KL divergence
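Definition: $D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$.
Properties:
- $D_{\mathrm{KL}}(p \,\|\, q) \ge 0$, with equality iff $p = q$ (Gibbs' inequality). Proof via Jensen (memorize this, it is the most-asked):
  $-D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{q(x)}{p(x)} = \mathbb{E}_p\!\left[\log \frac{q}{p}\right]$. Since $\log$ is concave, Jensen's inequality gives $\mathbb{E}_p\!\left[\log \frac{q}{p}\right] \le \log \mathbb{E}_p\!\left[\frac{q}{p}\right] = \log \sum_x q(x) = \log 1 = 0$. So $-D_{\mathrm{KL}}(p \,\|\, q) \le 0$, i.e. $D_{\mathrm{KL}}(p \,\|\, q) \ge 0$. Equality iff $q/p$ is constant, i.e. $p = q$ (since both are distributions).
- Asymmetric: $D_{\mathrm{KL}}(p \,\|\, q) \ne D_{\mathrm{KL}}(q \,\|\, p)$ in general.
- Forward KL ($D_{\mathrm{KL}}(p_{\text{data}} \,\|\, q_\theta)$): mean-seeking. MLE.
- Reverse KL ($D_{\mathrm{KL}}(q_\theta \,\|\, p)$): mode-seeking. Variational inference.
MLE = forward KL minimization: $\arg\min_\theta D_{\mathrm{KL}}(p_{\text{data}} \,\|\, q_\theta) = \arg\min_\theta \big[-H(p_{\text{data}}) - \mathbb{E}_{p_{\text{data}}}[\log q_\theta(x)]\big] = \arg\max_\theta \mathbb{E}_{p_{\text{data}}}[\log q_\theta(x)]$. The entropy term is constant in $\theta$.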
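A small discrete example makes the non-negativity, the asymmetry, and the cross-entropy split concrete. The two distributions below are arbitrary choices for illustration, and `kl` is a hypothetical helper, not a library function.

```python
import numpy as np

def kl(p, q):
    # D_KL(p || q) for discrete distributions; terms with p(x) = 0 contribute 0.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

forward_kl = kl(p, q)
reverse_kl = kl(q, p)

# Cross-entropy splits as H(p) + D_KL(p || q); only the KL part depends on q.
entropy_p = float(-np.sum(p * np.log(p)))
cross_entropy = float(-np.sum(p * np.log(q)))
```

Because `entropy_p` does not depend on `q`, minimizing `cross_entropy` over `q` (which is what MLE does with samples from `p`) is the same as minimizing `forward_kl`.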
See 33_information_theory/, 37_mle_map_estimation/.
6. EM for GMM
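Setup: $p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$.
E-step: posterior responsibilities
$$\gamma_{ik} = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$
M-step: weighted MLE updates
$$N_k = \sum_i \gamma_{ik}, \qquad \pi_k = \frac{N_k}{N}, \qquad \mu_k = \frac{1}{N_k} \sum_i \gamma_{ik}\, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_i \gamma_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^\top$$
Why EM converges (the key identity to memorize):
For any distribution $q(z)$:
$$\log p(x \mid \theta) = \mathcal{L}(q, \theta) + D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid x, \theta)\big), \qquad \mathcal{L}(q, \theta) = \mathbb{E}_q\!\left[\log \frac{p(x, z \mid \theta)}{q(z)}\right]$$
So $\log p(x \mid \theta) \ge \mathcal{L}(q, \theta)$ always, with equality iff $q(z) = p(z \mid x, \theta)$.
- E-step: set $q(z) = p(z \mid x, \theta^{(t)})$ (the posterior responsibilities $\gamma_{ik}$). KL = 0 → bound is tight: $\mathcal{L}(q, \theta^{(t)}) = \log p(x \mid \theta^{(t)})$.
- M-step: maximize $\mathcal{L}(q, \theta)$ over $\theta$ (since $q$ is fixed, this is just weighted MLE). This raises the bound.
- Net: $\log p(x \mid \theta^{(t+1)}) \ge \mathcal{L}(q, \theta^{(t+1)}) \ge \mathcal{L}(q, \theta^{(t)}) = \log p(x \mid \theta^{(t)})$. The likelihood is non-decreasing, so whenever it is bounded above the sequence of likelihood values converges.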
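The monotonicity claim can be observed directly. Below is a minimal 1D sketch (scalar variances instead of covariance matrices) on synthetic two-cluster data. The quantile-based initialization, the iteration count, and the name `em_gmm_1d` are assumptions for this demo, not part of the derivation.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, K=2, iters=50):
    # Assumed initialization: equal weights, means at spread-out quantiles.
    pi = np.full(K, 1.0 / K)
    mu = np.quantile(x, (np.arange(K) + 0.5) / K)
    var = np.full(K, x.var())
    log_liks = []
    for _ in range(iters):
        # E-step: responsibilities gamma[i, k] under the current parameters.
        weighted = pi * normal_pdf(x[:, None], mu, var)      # shape (N, K)
        log_liks.append(np.log(weighted.sum(axis=1)).sum())  # log-likelihood
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: weighted MLE updates.
        Nk = gamma.sum(axis=0)
        pi = Nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var, np.array(log_liks)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, size=200),
                    rng.normal(3.0, 1.0, size=300)])
pi, mu, var, log_liks = em_gmm_1d(x)
```

`log_liks` records the log-likelihood at the start of each iteration, so `np.diff(log_liks)` should be non-negative up to rounding. A production implementation would also work in log-space and guard against variance collapse, both omitted here.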
See 19_advanced_clustering/.
7. PCA via SVD
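Setup: centered $X \in \mathbb{R}^{n \times d}$.
Steps:
- Center the data, compute covariance: $C = \frac{1}{n - 1} X^\top X$.
- SVD of centered $X$: $X = U \Sigma V^\top$ with $U^\top U = I$, $V^\top V = I$.
- Substitute and simplify: $X^\top X = V \Sigma U^\top U \Sigma V^\top = V \Sigma^2 V^\top$ (using $U^\top U = I$, the load-bearing step). So $C = V \frac{\Sigma^2}{n - 1} V^\top$, which is the eigendecomposition of $C$.
- Top-$k$ principal directions: the first $k$ columns of $V$, written $V_k$. Variances along them: $\sigma_i^2 / (n - 1)$.
- Reduced data: $Z = X V_k$ (project data onto top-$k$ directions).
Eckart-Young: the truncated SVD $X_k = U_k \Sigma_k V_k^\top$ minimizes $\lVert X - \hat{X} \rVert_F$ over rank-$k$ $\hat{X}$.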
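The SVD route and the covariance-eigendecomposition route can be compared numerically. This sketch uses synthetic data with an arbitrary mixing matrix; the choice `k = 2` is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X_raw - X_raw.mean(axis=0)                 # center the data
n = X.shape[0]

# Route 1: SVD of the centered data matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
variances_svd = s ** 2 / (n - 1)

# Route 2: eigendecomposition of the covariance matrix.
C = X.T @ X / (n - 1)
eigvals = np.linalg.eigvalsh(C)[::-1]          # sorted descending

k = 2
V_k = Vt[:k].T                                 # top-k principal directions
Z = X @ V_k                                    # reduced data, shape (n, k)
X_k = (U[:, :k] * s[:k]) @ Vt[:k]              # rank-k truncated SVD

# Eckart-Young: the squared Frobenius error equals the discarded sigma_i^2.
err_sq = np.linalg.norm(X - X_k) ** 2
```

Note that `Z` equals `U[:, :k] * s[:k]`, so the projected data can be read off the SVD without a second matrix product.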
See 21_dimensionality_reduction/.
8. SVM dual
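Primal: $\min_{w, b} \tfrac{1}{2} \lVert w \rVert^2$ s.t. $y_i (w^\top x_i + b) \ge 1$ for all $i$.
Lagrangian: $L(w, b, \alpha) = \tfrac{1}{2} \lVert w \rVert^2 - \sum_i \alpha_i \big[ y_i (w^\top x_i + b) - 1 \big]$, with $\alpha_i \ge 0$.
Steps:
- $\partial L / \partial w = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i$.
- $\partial L / \partial b = 0 \Rightarrow \sum_i \alpha_i y_i = 0$ (constraint on $\alpha$).
- Substitute back into $L$ (this is the load-bearing step):
  - $\tfrac{1}{2} \lVert w \rVert^2 = \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j$.
  - $\sum_i \alpha_i y_i\, w^\top x_i = \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j$ (the full quadratic).
  - $b \sum_i \alpha_i y_i = 0$ (using $\sum_i \alpha_i y_i = 0$).
  - $\sum_i \alpha_i$ stays.
  - Combining: $L = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j$.
Dual: $\max_\alpha \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j$ s.t. $\alpha_i \ge 0$, $\sum_i \alpha_i y_i = 0$.
Kernel trick: replace $x_i^\top x_j$ with $k(x_i, x_j)$. In the dual, the data enters only through inner products, which is exactly what kernels need.
KKT, support vectors: complementary slackness gives $\alpha_i > 0$ only for points where $y_i (w^\top x_i + b) = 1$ (on margin); for soft-margin with $0 \le \alpha_i \le C$, $\alpha_i = C$ for margin violators.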
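The substitution step is where sign errors creep in, so it is worth checking numerically. The sketch below does not solve the dual; it picks an arbitrary feasible $\alpha$ (non-negative, with $\sum_i \alpha_i y_i = 0$), sets $w = \sum_i \alpha_i y_i x_i$, and confirms that the Lagrangian collapses to the dual objective for any $b$. Data and names are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 3
X = rng.normal(size=(n, d))
y = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)

# Build alpha >= 0 with sum_i alpha_i y_i = 0 by matching the mass per class.
alpha = rng.uniform(0.1, 1.0, size=n)
alpha[y < 0] *= alpha[y > 0].sum() / alpha[y < 0].sum()

# Stationarity in w: w = sum_i alpha_i y_i x_i.
w = (alpha * y) @ X

def lagrangian(w, b, alpha):
    margins = y * (X @ w + b)
    return 0.5 * w @ w - np.sum(alpha * (margins - 1.0))

# Dual objective with G_ij = y_i y_j x_i^T x_j.
Yx = y[:, None] * X
G = Yx @ Yx.T
dual = alpha.sum() - 0.5 * alpha @ G @ alpha

L_b1 = lagrangian(w, 0.7, alpha)     # arbitrary bias
L_b2 = lagrangian(w, -3.0, alpha)    # a different arbitrary bias
```

`L_b1` and `L_b2` agree with `dual` because the $b$ term is multiplied by $\sum_i \alpha_i y_i = 0$. Swapping `Yx @ Yx.T` for a kernel Gram matrix scaled by $y_i y_j$ gives the kernelized dual objective.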
See 35_kernel_functions/, 48_optimization_and_matrix_calculus/.
9. RoPE rotation
Goal: encode relative position via rotation in 2D subspaces.
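Setup: pair up dimensions; for pair $i$, apply a rotation by angle $m\theta_i$ to the vector at position $m$:
$$R_{m\theta_i} = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}$$
with $\theta_i = 10000^{-2i/d}$.
Property: $\langle R_m q, R_n k \rangle = \langle q, R_{n-m} k \rangle$. Inner product depends only on the relative position $n - m$.
Why this works (the algebra to memorize):
- $\langle R_m q, R_n k \rangle = (R_m q)^\top (R_n k) = q^\top R_m^\top R_n k$.
- Rotations are orthogonal, so $R_m^\top = R_m^{-1} = R_{-m}$.
- Rotations also compose by adding angles: $R_{-m} R_n = R_{n-m}$.
- Therefore $\langle R_m q, R_n k \rangle = q^\top R_{n-m} k$, a function of $n - m$ only.
This is what makes attention position-aware in a relative way without adding position embeddings to the input.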
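The relative-position property is easy to confirm in code. This sketch rotates adjacent dimension pairs $(2i, 2i+1)$, which is one common convention (some implementations pair dimension $i$ with $i + d/2$ instead); the function name `rope` and the base of 10000 as a default argument are assumptions for illustration.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    # Rotate each adjacent pair (x[2i], x[2i+1]) by the angle pos * theta_i.
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=8)
k = rng.normal(size=8)

# Same relative offset (n - m = 4) at two different absolute positions.
score_a = rope(q, 3) @ rope(k, 7)
score_b = rope(q, 10) @ rope(k, 14)

# Equivalent form: leave q unrotated and rotate k by the offset only.
score_rel = q @ rope(k, 4)
```

All three scores should match, and `rope` preserves vector norms because each 2D block is an orthogonal rotation.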
See 14_advanced_positional_embeddings/.
10. DPO (direct preference optimization)
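Starting point: RLHF objective with KL regularization to a reference policy:
$$\max_\pi\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] - \beta\, D_{\mathrm{KL}}\big(\pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big)$$
Step 1: derive the closed-form optimal policy. Set up the Lagrangian on the constrained max (with $\sum_y \pi(y \mid x) = 1$). Setting $\partial / \partial \pi(y \mid x) = 0$ gives $r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - \beta + \lambda = 0$, where $\lambda$ is the normalization Lagrange multiplier. Cleaning up:
$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\big(r(x, y) / \beta\big)$$
with $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp\!\big(r(x, y) / \beta\big)$, which depends only on the prompt $x$, not on $y$.
Step 2: invert for $r$:
$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$
Step 3: substitute into Bradley-Terry: $p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$. Critically, $\beta \log Z(x)$ depends on $x$ only, so it appears identically in both reward terms and cancels in the subtraction.
Step 4: final DPO loss (NLL of preferences):
$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
Key insight: closed-form optimal policy + $Z(x)$ depending only on the prompt = the reward model eliminates itself. No RL loop, no rollouts, just a supervised classification loss on preferences.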
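Two pieces of the argument can be sketched in NumPy: the cancellation of $Z(x)$ on a toy discrete response space, and the final loss as a function of sequence log-probabilities. The five-response setup, the `beta` values, and the function names are assumptions for illustration; a real implementation would take log-probabilities from the policy and reference models.

```python
import numpy as np

def log_sigmoid(z):
    # Stable log(sigmoid(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit reward margin between chosen (w) and rejected (l) responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return float(-np.mean(log_sigmoid(margin)))

# Toy check of Steps 1-3: one prompt, five candidate responses.
rng = np.random.default_rng(0)
pi_ref = rng.dirichlet(np.ones(5))
r = rng.normal(size=5)                        # hypothetical true rewards
beta = 0.5
unnorm = pi_ref * np.exp(r / beta)
pi_star = unnorm / unnorm.sum()               # closed-form optimal policy
implicit = beta * np.log(pi_star / pi_ref)    # equals r - beta * log Z

# When the policy equals the reference, the margin is 0 and the loss is log 2.
same = dpo_loss(np.array([-1.0]), np.array([-2.0]),
                np.array([-1.0]), np.array([-2.0]))
# Raising the chosen log-prob relative to the reference lowers the loss.
better = dpo_loss(np.array([-0.5]), np.array([-2.0]),
                  np.array([-1.0]), np.array([-2.0]))
```

Differences of `implicit` reproduce differences of `r` exactly, which is the numerical counterpart of $\beta \log Z(x)$ cancelling in Bradley-Terry.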
See 08_training_techniques/.
11. Variational lower bound (ELBO)
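Setup: latent-variable model $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$. Want to maximize $\log p_\theta(x) = \log \int p_\theta(x, z)\, dz$.
Trick: introduce variational distribution $q_\phi(z \mid x)$ and use Jensen's:
$$\log p_\theta(x) = \log \mathbb{E}_{q_\phi(z \mid x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$
Jensen's inequality for concave $\log$: $\log \mathbb{E}[X] \ge \mathbb{E}[\log X]$. Apply it:
$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$
This is the ELBO.
Equivalent form (split $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$):
$$\mathrm{ELBO} = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$
The gap to the true log-likelihood: $\log p_\theta(x) - \mathrm{ELBO} = D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$, exactly the KL between approximate and true posterior. The ELBO is tight when $q_\phi$ matches the true posterior.
The equivalent form reads as a reconstruction term minus a KL-to-prior term. That is the VAE objective.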
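With a discrete latent variable every quantity is a finite sum, so the bound, the gap, and the two equivalent forms can be verified exactly. The three-state prior, the likelihood values, and the choice of `q` below are arbitrary numbers for illustration.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])        # p(z), three latent states
lik = np.array([0.9, 0.2, 0.4])          # p(x | z) for one fixed observation x
joint = prior * lik                      # p(x, z)
log_px = float(np.log(joint.sum()))      # exact log-marginal-likelihood
posterior = joint / joint.sum()          # true posterior p(z | x)

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

def elbo(q):
    # E_q[log p(x, z) - log q(z)]
    return float(np.sum(q * (np.log(joint) - np.log(q))))

q = np.array([0.4, 0.4, 0.2])            # an arbitrary variational distribution

gap = log_px - elbo(q)
# Equivalent form: reconstruction term minus KL to the prior.
recon_minus_kl = float(np.sum(q * np.log(lik))) - kl(q, prior)
```

`gap` matches `kl(q, posterior)`, and plugging in `q = posterior` makes the bound tight.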
See 21_dimensionality_reduction/ (autoencoders), 33_information_theory/.
12. Bias-variance decomposition
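Setup: estimate $\hat{f}(x; D)$ from random training set $D$. Evaluate at fixed $x$, with target $y = f(x) + \epsilon$, where $\mathbb{E}[\epsilon] = 0$, $\mathrm{Var}(\epsilon) = \sigma^2$, and $\epsilon$ is independent of $D$.
Steps:
- Let $\bar{f}(x) = \mathbb{E}_D[\hat{f}(x; D)]$.
- Add and subtract: $\hat{f} - y = (\hat{f} - \bar{f}) + (\bar{f} - y)$.
- Cross-term vanishes: take $\mathbb{E}_D$. $\bar{f}$ and $y$ are constants w.r.t. $D$, so $\mathbb{E}_D\big[(\hat{f} - \bar{f})(\bar{f} - y)\big] = (\bar{f} - y)\, \mathbb{E}_D[\hat{f} - \bar{f}] = 0$ (by definition of $\bar{f}$).
- $\mathbb{E}_D\big[(\hat{f} - y)^2\big] = \mathbb{E}_D\big[(\hat{f} - \bar{f})^2\big] + (\bar{f} - y)^2$.
- Now take $\mathbb{E}_\epsilon$ over the noise in $y$: first term becomes $\mathrm{Var}_D[\hat{f}(x)]$ (it does not involve $y$). Second term is $\mathbb{E}_\epsilon\big[(\bar{f} - f - \epsilon)^2\big] = (\bar{f} - f)^2 + \sigma^2$, i.e. $\mathrm{Bias}^2$ plus irreducible noise.
- Result: $\mathbb{E}\big[(\hat{f} - y)^2\big] = \mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2$.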
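A Monte Carlo simulation shows the three terms adding up. Everything specific here is an assumption for the demo: the true function is `np.sin`, the model is a degree-1 polynomial fit with `np.polyfit`, and the noise level, sample sizes, and evaluation point are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                     # assumed true function
sigma = 0.3                    # noise standard deviation
x0 = 1.0                       # fixed evaluation point
n_train, trials, degree = 20, 4000, 1

# Fit the model on many independent training sets and predict at x0.
preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0.0, 3.0, size=n_train)
    y = f(x) + sigma * rng.normal(size=n_train)
    coeffs = np.polyfit(x, y, degree)
    preds[t] = np.polyval(coeffs, x0)

f_bar = preds.mean()                       # estimate of E_D[f_hat(x0)]
bias_sq = (f_bar - f(x0)) ** 2
variance = preds.var()                     # E_D[(f_hat - f_bar)^2]

# Fresh noisy targets at x0, independent of the training sets.
y0 = f(x0) + sigma * rng.normal(size=trials)
mse = np.mean((preds - y0) ** 2)
decomposed = bias_sq + variance + sigma ** 2
```

`mse` and `decomposed` agree up to Monte Carlo error, while the noise-free identity `mean((preds - f(x0))**2) == bias_sq + variance` holds exactly for the empirical moments. Raising `degree` should shift error from the bias term to the variance term.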
See 27_advanced_theory/, 52_statistical_learning_theory/.
13. Information gain (decision tree split)
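Setup: dataset $S$ with class labels.
Entropy: $H(S) = -\sum_c p_c \log_2 p_c$, where $p_c$ is the fraction of $S$ in class $c$.
After split on feature $A$ into subsets $S_v$ (one per value $v$ of $A$): $H(S \mid A) = \sum_v \frac{|S_v|}{|S|} H(S_v)$.
Information gain: $IG(S, A) = H(S) - H(S \mid A)$.
Key identity: $IG(S, A) = H(Y) - H(Y \mid A) = I(Y; A)$. IG is exactly the mutual information between class label $Y$ and feature $A$. That makes it intuitive: pick the feature that's most informative about the label.
Why $IG \ge 0$: conditioning never increases entropy (Jensen on concave $H$, applied to the mixture $p(y) = \sum_v p(A = v)\, p(y \mid A = v)$). Equality iff $Y \perp A$.
Tree picks the split that maximizes IG (or Gini decrease in CART).
Gini: $G(S) = 1 - \sum_c p_c^2$. Computationally cheaper (no log); similar selection.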
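The definitions translate directly into a few lines of NumPy. The tiny label and feature arrays below are hand-built so that one feature determines the label and the other is independent of it; the helper names are illustrative.

```python
import numpy as np

def class_probs(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(labels):
    p = class_probs(labels)
    return float(-np.sum(p * np.log2(p)))      # in bits

def gini(labels):
    p = class_probs(labels)
    return float(1.0 - np.sum(p ** 2))

def information_gain(labels, feature):
    # H(S) minus the size-weighted entropy of each subset S_v.
    cond = 0.0
    for v in np.unique(feature):
        subset = labels[feature == v]
        cond += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - cond

labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
perfect = labels.copy()                          # feature equals the label
useless = np.array([0, 0, 1, 1, 0, 0, 1, 1])     # independent of the label

ig_perfect = information_gain(labels, perfect)
ig_useless = information_gain(labels, useless)
```

The perfectly informative feature recovers the full entropy of the labels, and the independent feature gives zero gain, matching the equality condition $Y \perp A$.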
See 26_tree_based_methods/.
14. Common interview gotchas
| Question | Common wrong answer | Right answer |
|---|---|---|
| What's $\sqrt{d_k}$ in attention? | Tradition | Variance scaling: keeps the entries of $QK^\top / \sqrt{d_k}$ at unit variance |
| Cross-entropy + softmax gradient? | Complicated | $\hat{y} - y$. Beautifully simple. |
| Why does EM converge? | Gradient descent | E-step makes the lower bound tight; M-step raises it; likelihood is monotone non-decreasing |
| What does ELBO bound? | Posterior | Log-marginal-likelihood from below |
| KL forward vs reverse? | Same | Forward is mean-seeking / mode-covering (MLE); reverse is mode-seeking (VI) |
| SVM dual support vectors? | Random points | Points where $\alpha_i > 0$; on/violating margin |
| RoPE relative property? | Magic | $\langle R_m q, R_n k \rangle$ depends only on $n - m$ |
15. Eight derivations to drill cold
- 2-layer MLP backprop with cross-entropy + softmax.
- Scaled dot-product attention with multi-head + masking.
- OLS gradient + closed form with PSD Hessian.
- Logistic regression gradient showing convexity.
- EM for GMM: E-step posterior, M-step updates.
- DPO loss from RLHF + Bradley-Terry.
- ELBO derivation via Jensen's inequality.
- Bias-variance decomposition.
For each: 5 minutes on a whiteboard. Until automatic.
16. Drill plan
- 1 derivation per day for 8 days. Then cycle.
- Time yourself: 5 min per derivation cold; 3 min after a week of practice.
- Practice teaching each: explain to an imaginary interviewer.
- Pair the derivation with the relevant deep dive's "8 most-asked interview questions" to make sure you can recite both proof and intuition.
17. Further reading
This deep dive is a meta-collection. The full derivations live in:
- 31_neural_networks for backprop.
- 04_transformers and 05_attention_mechanisms for attention.
- 01_classical_ml for OLS and logistic.
- 19_advanced_clustering for EM.
- 08_training_techniques for DPO.
- 21_dimensionality_reduction for ELBO/VAE.
- 27_advanced_theory for bias-variance.
Drill the derivations in those locations and you'll be ready for the whiteboard rounds.