Whiteboard Derivations — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

This deep dive is the catalog of derivations you should be able to do on a whiteboard cold. Frontier-lab interviews routinely ask "derive X" — backprop, attention, OLS gradient, KL, EM, DPO. Knowing the shape of these derivations beats memorizing the answer.

This is a meta-document that points to the relevant deep dive for each derivation while listing the key steps you need to recite.


1. Backpropagation for a 2-layer MLP

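Setup: $z_1 = W_1 x + b_1$, $h = \sigma(z_1)$, $z_2 = W_2 h + b_2$, $\hat{y} = \mathrm{softmax}(z_2)$, $L = -\sum_k y_k \log \hat{y}_k$.

Steps:

  1. Cross-entropy + softmax simplification (the magic step: derive it, don't just assert):

    Softmax Jacobian: $\partial \hat{y}_i / \partial z_{2,j} = \hat{y}_i (\delta_{ij} - \hat{y}_j)$.

    $\partial L / \partial z_{2,j} = -\sum_i \frac{y_i}{\hat{y}_i} \, \hat{y}_i (\delta_{ij} - \hat{y}_j) = -y_j + \hat{y}_j \sum_i y_i = \hat{y}_j - y_j$

    (using $\sum_i y_i = 1$).

    So $\delta_2 := \partial L / \partial z_2 = \hat{y} - y$.

  2. $\partial L / \partial W_2 = \delta_2 h^\top$.

  3. $\partial L / \partial b_2 = \delta_2$.

  4. $\partial L / \partial h = W_2^\top \delta_2$.

  5. $\delta_1 := \partial L / \partial z_1 = (W_2^\top \delta_2) \odot \sigma'(z_1)$.

  6. $\partial L / \partial W_1 = \delta_1 x^\top$, $\partial L / \partial b_1 = \delta_1$.

Key insights:

  • Cross-entropy + softmax simplifies dramatically: gradient is just $\hat{y} - y$. The mess from softmax's Jacobian and CE's $1/\hat{y}$ factor cancel.
  • Chain rule: each layer multiplies by $W^\top$ (transpose) and $\sigma'$.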
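
A quick way to convince yourself of the softmax + cross-entropy cancellation is a finite-difference check. The sketch below assumes a one-hot label and a 3-class logit vector chosen purely for illustration; the function names are hypothetical helpers, not part of any library.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    # L = -sum_k y_k log softmax(z)_k
    p = softmax(z)
    return -sum(yk * math.log(pk) for yk, pk in zip(y, p))

def analytic_grad(z, y):
    # The claimed result of the derivation: dL/dz = softmax(z) - y
    return [pk - yk for pk, yk in zip(softmax(z), y)]

def numeric_grad(z, y, eps=1e-6):
    # Central differences, one logit at a time
    grad = []
    for j in range(len(z)):
        zp = list(z)
        zm = list(z)
        zp[j] += eps
        zm[j] -= eps
        grad.append((cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps))
    return grad

z = [1.0, -0.5, 2.0]   # illustrative logits
y = [0.0, 1.0, 0.0]    # one-hot label
ga = analytic_grad(z, y)
gn = numeric_grad(z, y)
max_err = max(abs(a - n) for a, n in zip(ga, gn))
```

Here `max_err` should be tiny, and the analytic gradient sums to zero because both the softmax output and the label sum to 1.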

See 31_neural_networks/.


2. Scaled dot-product attention

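Setup: $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$.

Steps:

  1. $S = QK^\top / \sqrt{d_k}$.
  2. $A = \mathrm{softmax}(S)$ (row-wise).
  3. $O = AV$.

Why $\sqrt{d_k}$: variance of entries of $QK^\top$ scales with $d_k$ if $Q, K$ have independent, zero-mean, unit-variance entries. Divide by $\sqrt{d_k}$ to keep variance at 1 → softmax doesn't saturate.

Multi-head: project to $h$ heads of dim $d_k = d_{\text{model}} / h$; do attention per head; concatenate; project back.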
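
The three steps of single-head scaled dot-product attention can be sketched in plain Python on list-of-lists matrices. This is a minimal illustration with made-up inputs (no masking, no batching, no multi-head projection), not a production implementation.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    # Step 1: scaled scores S = Q K^T / sqrt(d_k)
    S = [[scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K] for q in Q]
    # Step 2: row-wise softmax gives the attention weights A
    A = [softmax(row) for row in S]
    # Step 3: output O = A V (weighted average of value rows)
    d_v = len(V[0])
    O = [[sum(a * v[c] for a, v in zip(row, V)) for c in range(d_v)] for row in A]
    return O, A

Q = [[1.0, 0.0], [0.0, 1.0]]                # 2 queries, d_k = 2
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 keys
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # 3 values, d_v = 2
O, A = attention(Q, K, V)
```

Each row of `A` is a probability distribution over keys, so each row of `O` is a convex combination of the rows of `V`.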

See 04_transformers/, 05_attention_mechanisms/.


3. OLS closed form

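Setup: $L(w) = \|Xw - y\|^2$.

Steps:

  1. $\nabla_w L = 2X^\top (Xw - y)$.
  2. Set to zero: $X^\top X w = X^\top y$ (the normal equations).
  3. Solve: $\hat{w} = (X^\top X)^{-1} X^\top y$ (assuming $X^\top X$ invertible).

Hessian: $\nabla^2_w L = 2X^\top X$, PSD always; PD if $X$ has full column rank.

Geometric: $\hat{y} = X\hat{w} = Py$ where $P = X(X^\top X)^{-1}X^\top$ is the projection onto $\mathrm{col}(X)$.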
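
For a concrete feel, the normal equations can be solved by hand in the simplest case: one feature plus an intercept, so $X^\top X$ is 2×2. The sketch below uses made-up data and a hypothetical helper `ols_fit`, and then checks that the gradient $X^\top(Xw - y)$ vanishes at the solution (the residual is orthogonal to the columns of $X$).

```python
def ols_fit(xs, ys):
    """Solve X^T X w = X^T y for the design matrix X = [1, x]."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular (all x values identical)")
    w0 = (sxx * sy - sx * sxy) / det   # intercept
    w1 = (n * sxy - sx * sy) / det     # slope
    return w0, w1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
w0, w1 = ols_fit(xs, ys)

# Residual Xw - y; its inner product with each column of X is the gradient / 2
residuals = [w0 + w1 * x - y for x, y in zip(xs, ys)]
grad0 = sum(residuals)                               # column of ones
grad1 = sum(r * x for r, x in zip(residuals, xs))    # column of x values
```

Both `grad0` and `grad1` come out at zero up to floating-point error, which is the orthogonality statement behind the projection view.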

See 24_linear_algebra_qa/, 48_optimization_and_matrix_calculus/.


4. Logistic regression gradient

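Setup: $\hat{p} = \sigma(z)$ with $z = w^\top x$, $L = -[\,y \log \hat{p} + (1 - y) \log(1 - \hat{p})\,]$.

Steps:

  1. $\partial L / \partial \hat{p} = -\frac{y}{\hat{p}} + \frac{1 - y}{1 - \hat{p}} = \frac{\hat{p} - y}{\hat{p}(1 - \hat{p})}$ (combine fractions).
  2. $\partial \hat{p} / \partial z = \hat{p}(1 - \hat{p})$ (sigmoid derivative).
  3. Chain rule, the magic cancellation: $\partial L / \partial z = \frac{\hat{p} - y}{\hat{p}(1 - \hat{p})} \cdot \hat{p}(1 - \hat{p}) = \hat{p} - y$. The $\hat{p}(1 - \hat{p})$ from sigmoid derivative kills the $\hat{p}(1 - \hat{p})$ in the denominator from CE. That's the GLM canonical-link beauty.
  4. $\nabla_w L = (\hat{p} - y)\,x$ (since $z = w^\top x$, $\partial z / \partial w = x$).

Key insight: same gradient form as linear regression (residual times input), which is why these models feel the same. Hessian is $\sum_i \hat{p}_i (1 - \hat{p}_i)\, x_i x_i^\top = X^\top D X$ with $D = \mathrm{diag}(\hat{p}_i(1 - \hat{p}_i))$, always PSD → loss convex.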
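
The "residual times input" form of the logistic gradient is easy to verify numerically. This sketch uses an arbitrary weight vector and a single made-up example; `loss` and `grad` are illustrative helpers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    # Binary cross-entropy for one example with p = sigmoid(w . x)
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def grad(w, x, y):
    # Claimed result of the derivation: (p - y) * x
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]

w = [0.3, -0.8, 0.5]
x = [1.0, 2.0, -1.5]
y = 1

eps = 1e-6
numeric = []
for j in range(len(w)):
    wp = list(w)
    wm = list(w)
    wp[j] += eps
    wm[j] -= eps
    numeric.append((loss(wp, x, y) - loss(wm, x, y)) / (2 * eps))

analytic = grad(w, x, y)
max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
```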

See 01_classical_ml/, 37_mle_map_estimation/.


5. KL divergence

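Definition: $D_{\mathrm{KL}}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = E_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right]$.

Properties:

  • $D_{\mathrm{KL}}(p \,\|\, q) \ge 0$, with equality iff $p = q$ (Gibbs' inequality). Proof via Jensen (memorize this: most-asked):

    $-D_{\mathrm{KL}}(p \,\|\, q) = E_p\!\left[\log \frac{q(x)}{p(x)}\right]$. Since $\log$ is concave, Jensen's inequality gives $E_p\!\left[\log \frac{q}{p}\right] \le \log E_p\!\left[\frac{q}{p}\right] = \log \sum_{x:\,p(x)>0} q(x) \le \log 1 = 0$. So $-D_{\mathrm{KL}} \le 0$, i.e. $D_{\mathrm{KL}} \ge 0$. Equality iff $q/p$ is constant, i.e. $p = q$ (since both are distributions).

  • Asymmetric: $D_{\mathrm{KL}}(p \,\|\, q) \ne D_{\mathrm{KL}}(q \,\|\, p)$ in general.

  • Forward KL ($D_{\mathrm{KL}}(p_{\text{data}} \,\|\, q_\theta)$): mean-seeking. MLE.

  • Reverse KL ($D_{\mathrm{KL}}(q_\theta \,\|\, p)$): mode-seeking. Variational inference.

MLE = forward KL minimization: $\arg\min_\theta D_{\mathrm{KL}}(p_{\text{data}} \,\|\, p_\theta) = \arg\min_\theta \left[-H(p_{\text{data}}) - E_{x \sim p_{\text{data}}} \log p_\theta(x)\right] = \arg\max_\theta E_{x \sim p_{\text{data}}}[\log p_\theta(x)]$. The entropy term is constant in $\theta$.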
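
Non-negativity and asymmetry are worth seeing on numbers once. The sketch below computes KL in nats for two small discrete distributions picked only for illustration; it assumes `q` is positive wherever `p` is.

```python
import math

def kl(p, q):
    """D_KL(p || q) in nats for discrete distributions given as lists.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

forward = kl(p, q)   # D_KL(p || q)
reverse = kl(q, p)   # D_KL(q || p)
self_kl = kl(p, p)   # zero, since p = p
```

On this pair the two directions differ noticeably (roughly 0.23 versus 0.32 nats), and `self_kl` is zero.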

See 33_information_theory/, 37_mle_map_estimation/.


6. EM for GMM

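Setup: $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$, with latent assignment $z_i \in \{1, \dots, K\}$ for each data point $x_i$.

E-step: posterior responsibilities

  $$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

M-step: weighted MLE updates

  $$N_k = \sum_i \gamma_{ik}, \quad \pi_k = \frac{N_k}{N}, \quad \mu_k = \frac{1}{N_k} \sum_i \gamma_{ik} x_i, \quad \Sigma_k = \frac{1}{N_k} \sum_i \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top$$

Why EM converges (the key identity to memorize):

For any distribution $q(z)$:

  $$\log p(x \mid \theta) = \underbrace{E_q\!\left[\log \frac{p(x, z \mid \theta)}{q(z)}\right]}_{\mathcal{L}(q, \theta)} + D_{\mathrm{KL}}\big(q(z) \,\|\, p(z \mid x, \theta)\big)$$

So $\log p(x \mid \theta) \ge \mathcal{L}(q, \theta)$ always, with equality iff $q(z) = p(z \mid x, \theta)$.

  • E-step: set $q(z) = p(z \mid x, \theta^{\text{old}})$ (the posterior responsibilities $\gamma_{ik}$). KL = 0 → bound is tight: $\mathcal{L}(q, \theta^{\text{old}}) = \log p(x \mid \theta^{\text{old}})$.
  • M-step: maximize $\mathcal{L}(q, \theta)$ over $\theta$ (since $q$ is fixed, this is just weighted MLE). This raises the bound.
  • Net: $\log p(x \mid \theta^{\text{new}}) \ge \mathcal{L}(q, \theta^{\text{new}}) \ge \mathcal{L}(q, \theta^{\text{old}}) = \log p(x \mid \theta^{\text{old}})$. Likelihood is non-decreasing; when it is also bounded above (for GMMs, when no component collapses onto a single point), the likelihood values converge.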
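
The monotonicity claim can be watched directly on a toy problem. Below is a minimal sketch of EM for a two-component GMM in one dimension, with made-up data and initial parameters; it assumes no component collapses (no variance floor is applied), which holds for this data.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(data, pis, mus, vars_):
    return sum(
        math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(pis, mus, vars_)))
        for x in data
    )

def em_step(data, pis, mus, vars_):
    K = len(pis)
    # E-step: responsibilities gamma[i][k] = posterior of component k for x_i
    gammas = []
    for x in data:
        w = [pis[k] * normal_pdf(x, mus[k], vars_[k]) for k in range(K)]
        s = sum(w)
        gammas.append([wk / s for wk in w])
    # M-step: responsibility-weighted MLE for weights, means, variances
    n = len(data)
    new_pis, new_mus, new_vars = [], [], []
    for k in range(K):
        nk = sum(g[k] for g in gammas)
        mu = sum(g[k] * x for g, x in zip(gammas, data)) / nk
        var = sum(g[k] * (x - mu) ** 2 for g, x in zip(gammas, data)) / nk
        new_pis.append(nk / n)
        new_mus.append(mu)
        new_vars.append(var)
    return new_pis, new_mus, new_vars

data = [-2.1, -1.9, -2.0, -1.5, 1.8, 2.2, 2.0, 2.5, 0.1]  # illustrative
pis, mus, vars_ = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]      # arbitrary init

lls = [log_likelihood(data, pis, mus, vars_)]
for _ in range(20):
    pis, mus, vars_ = em_step(data, pis, mus, vars_)
    lls.append(log_likelihood(data, pis, mus, vars_))

# The log-likelihood should never decrease from one iteration to the next
monotone = all(b >= a - 1e-9 for a, b in zip(lls, lls[1:]))
```

In real code you would typically add a variance floor or a prior on the covariances, since a component that collapses onto one point sends the likelihood to infinity.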

See 19_advanced_clustering/.


7. PCA via SVD

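Setup: centered $X \in \mathbb{R}^{n \times d}$.

Steps:

  1. Center the data, compute covariance: $C = \frac{1}{n-1} X^\top X$.
  2. (Thin) SVD of centered $X$: $X = U \Sigma V^\top$ with $U^\top U = I$, $V^\top V = I$.
  3. Substitute and simplify: $C = \frac{1}{n-1} V \Sigma U^\top U \Sigma V^\top = \frac{1}{n-1} V \Sigma^2 V^\top$ (using $U^\top U = I$; that's the load-bearing step). So $C = V \Lambda V^\top$ with $\Lambda = \Sigma^2 / (n-1)$: this is the eigendecomposition of $C$.
  4. Top-$k$ principal directions: first $k$ columns of $V$. Variances along them: $\sigma_i^2 / (n-1)$.
  5. Reduced data: $Z = X V_k = U_k \Sigma_k$ (project data onto top-$k$ directions).

Eckart-Young: truncated SVD $X_k = U_k \Sigma_k V_k^\top$ minimizes $\|X - B\|_F$ over rank-$k$ $B$.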
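
A tiny worked example makes the SVD-to-eigendecomposition link concrete. The data matrix below is made up so that everything can be read off by inspection (it is already centered and its principal axes are the coordinate axes).

```latex
% Four centered points in R^2, n = 4
X = \begin{pmatrix} 2 & 0 \\ -2 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix},
\qquad
X^\top X = \begin{pmatrix} 8 & 0 \\ 0 & 2 \end{pmatrix},
\qquad
C = \tfrac{1}{n-1} X^\top X = \begin{pmatrix} 8/3 & 0 \\ 0 & 2/3 \end{pmatrix}.

% Thin SVD: V = I, singular values are the column norms
\Sigma = \begin{pmatrix} 2\sqrt{2} & 0 \\ 0 & \sqrt{2} \end{pmatrix},
\qquad
\frac{\sigma_1^2}{n-1} = \frac{8}{3},
\quad
\frac{\sigma_2^2}{n-1} = \frac{2}{3}.

% Rank-1 reduction (k = 1): project onto v_1 = (1, 0)^\top
Z = X v_1 = (2, -2, 0, 0)^\top,
\qquad
\| X - X_1 \|_F = \sigma_2 = \sqrt{2}.
```

The variances $\sigma_i^2/(n-1)$ match the diagonal of $C$, and the Frobenius error of the rank-1 truncation is exactly the discarded singular value.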

See 21_dimensionality_reduction/.


8. SVM dual

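Primal: $\min_{w, b} \tfrac{1}{2}\|w\|^2$ s.t. $y_i (w^\top x_i + b) \ge 1$ for all $i$.

Lagrangian: $L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w^\top x_i + b) - 1 \right]$, with $\alpha_i \ge 0$.

Steps:

  1. $\partial L / \partial w = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i$.
  2. $\partial L / \partial b = 0 \Rightarrow \sum_i \alpha_i y_i = 0$ (constraint on $\alpha$).
  3. Substitute back into $L$ (this is the load-bearing step):
    • $\tfrac{1}{2}\|w\|^2 = \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$.
    • $-\sum_i \alpha_i y_i \, w^\top x_i = -\sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$ (the full quadratic).
    • $-b \sum_i \alpha_i y_i = 0$ (using $\sum_i \alpha_i y_i = 0$).
    • $+\sum_i \alpha_i$ stays.
    • Combining: $L = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$.

Dual: $\max_\alpha \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j$ s.t. $\alpha_i \ge 0$, $\sum_i \alpha_i y_i = 0$.

Kernel trick: replace $x_i^\top x_j$ with $k(x_i, x_j)$. In the dual, the data enters only through inner products, which is exactly what kernels need.

KKT and support vectors: complementary slackness gives $\alpha_i > 0$ only for points where $y_i (w^\top x_i + b) = 1$ (on margin); for soft-margin with penalty $C$ (so $0 \le \alpha_i \le C$), $\alpha_i = C$ for margin violators.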
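
The dual can be solved by brute force on the smallest possible problem: two points, one per class. In the sketch below (made-up data), the equality constraint forces both multipliers to be equal, so a 1D grid search over that shared value is enough; real solvers use quadratic programming or SMO instead.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def dual_objective(alphas, xs, ys):
    # sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>
    linear = sum(alphas)
    quad = sum(
        alphas[i] * alphas[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
        for i in range(len(xs))
        for j in range(len(xs))
    )
    return linear - 0.5 * quad

xs = [(1.0, 0.0), (-1.0, 0.0)]   # one positive and one negative point
ys = [1, -1]

# sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a, so search over a >= 0
grid = [i / 1000 for i in range(2001)]
best_a = max(grid, key=lambda a: dual_objective([a, a], xs, ys))
alphas = [best_a, best_a]

# Recover the primal solution: w = sum_i alpha_i y_i x_i, b from a support vector
w = [sum(a * y * x[d] for a, y, x in zip(alphas, ys, xs)) for d in range(2)]
b = ys[0] - dot(w, xs[0])
margins = [y * (dot(w, x) + b) for x, y in zip(xs, ys)]
```

Both points end up as support vectors with functional margin exactly 1, consistent with complementary slackness.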

See 35_kernel_functions/, 48_optimization_and_matrix_calculus/.


9. RoPE rotation

Goal: encode relative position via rotation in 2D subspaces.

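Setup: pair up dimensions; for pair $i$, apply rotation by $m\theta_i$ to position $m$:

  $$R_m = \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}$$

with $\theta_i = 10000^{-2i/d}$.

Property: $\langle R_m q, R_n k \rangle = \langle q, R_{n-m} k \rangle$. Inner product depends only on the relative position $n - m$.

Why this works (the algebra to memorize):

  • $\langle R_m q, R_n k \rangle = q^\top R_m^\top R_n k$.
  • Rotations are orthogonal, so $R_m^\top = R_m^{-1} = R_{-m}$.
  • Rotations also compose by adding angles: $R_{-m} R_n = R_{n-m}$.
  • Therefore $\langle R_m q, R_n k \rangle = q^\top R_{n-m} k$, a function of $n - m$ only.

This is what makes attention scores depend on relative position without adding any position embeddings to the input.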
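
The relative-position property can be checked numerically on a single 2D pair. The vectors, frequency, and positions below are arbitrary illustrative values; `rotate` is a hypothetical helper that applies the rotation for one dimension pair only.

```python
import math

def rotate(vec, pos, theta):
    """Rotate a 2D vector by the angle pos * theta."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q = (0.3, -1.2)
k = (0.7, 0.5)
theta = 0.1

# Same relative offset (n - m = 4) at two very different absolute positions
score_a = dot(rotate(q, 5, theta), rotate(k, 9, theta))
score_b = dot(rotate(q, 105, theta), rotate(k, 109, theta))

# Equivalent form: leave q alone and rotate k by the relative offset only
score_rel = dot(q, rotate(k, 4, theta))
```

All three scores agree up to floating-point error, even though the absolute positions differ by 100.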

See 14_advanced_positional_embeddings/.


10. DPO (direct preference optimization)

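Starting point: RLHF objective with KL regularization to a reference policy:

  $$\max_\pi \; E_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\left[ r(x, y) \right] - \beta \, D_{\mathrm{KL}}\big(\pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big)$$

Step 1: derive the closed-form optimal policy. Set up Lagrangian on the constrained max (with $\sum_y \pi(y \mid x) = 1$). Setting $\partial / \partial \pi(y \mid x) = 0$ gives $\log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} = \frac{r(x, y)}{\beta} + c(x)$, where $c(x)$ is from the normalization Lagrange multiplier. Cleaning up:

  $$\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \exp\!\left(\frac{r(x, y)}{\beta}\right)$$

with $Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \exp(r(x, y) / \beta)$, which depends only on prompt $x$, not on $y$.

Step 2: invert for $r$:

  $$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Step 3: substitute into Bradley-Terry: $p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$. Critically, $\beta \log Z(x)$ depends on $x$ only, so it appears identically in both reward terms and cancels in the subtraction.

Step 4: final DPO loss (NLL of preferences):

  $$\mathcal{L}_{\text{DPO}}(\theta) = -E_{(x, y_w, y_l)}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$

Key insight: closed-form optimal policy + $Z(x)$ depending only on prompt = reward model eliminates itself. No RL loop, no rollouts, just a supervised classification loss on preferences.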
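
For a single preference pair, the DPO loss reduces to a logistic loss on a difference of log-probability ratios. The sketch below takes sequence log-probabilities as plain floats (the numbers are made up, and `dpo_loss` is an illustrative helper rather than any library's API); real implementations sum per-token log-probs and average over a batch.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(m) = log(1 + exp(-m)), evaluated in a numerically stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

log_two = math.log(2.0)

# Policy identical to the reference: margin is 0, loss is log 2
loss_equal = dpo_loss(-5.0, -6.0, -5.0, -6.0)

# Policy shifts probability toward the preferred response: loss drops
loss_better = dpo_loss(-4.0, -7.0, -5.0, -6.0)

# Policy shifts probability toward the rejected response: loss rises
loss_worse = dpo_loss(-6.0, -5.0, -5.0, -6.0)
```

Note that only log-ratios against the reference enter the loss, so no reward model and no partition function appear anywhere in the computation.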

See 08_training_techniques/.


11. Variational lower bound (ELBO)

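Setup: latent-variable model $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$. Want to maximize $\log p_\theta(x) = \log \int p_\theta(x, z)\, dz$.

Trick: introduce variational distribution $q_\phi(z \mid x)$ and use Jensen's:

  $$\log p_\theta(x) = \log E_{q_\phi(z \mid x)}\!\left[ \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right]$$

Jensen's inequality for concave $\log$: $\log E[X] \ge E[\log X]$. Apply it:

  $$\log p_\theta(x) \ge E_{q_\phi(z \mid x)}\!\left[ \log \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \right] =: \mathrm{ELBO}(\theta, \phi)$$

This is the ELBO.

Equivalent form (split $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$):

  $$\mathrm{ELBO} = E_{q_\phi(z \mid x)}\left[ \log p_\theta(x \mid z) \right] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$

The gap to true log-likelihood: $\log p_\theta(x) - \mathrm{ELBO} = D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)$, exactly the KL between approximate and true posterior. ELBO is tight when $q_\phi$ matches the true posterior.

The equivalent form reads as a reconstruction term plus a KL-to-prior term. This is the VAE objective.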
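
With a binary latent variable, every quantity in the ELBO derivation is a two-term sum, so the bound and its gap can be checked exactly. The prior and likelihood values below are made up for illustration.

```python
import math

prior = [0.6, 0.4]   # p(z) for z in {0, 1}
lik = [0.2, 0.7]     # p(x | z) for the single observed x

joint = [pz * px for pz, px in zip(prior, lik)]    # p(x, z)
log_px = math.log(sum(joint))                      # log marginal likelihood
posterior = [j / sum(joint) for j in joint]        # true p(z | x)

def kl(q, p):
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0)

def elbo(q):
    # E_q[log p(x, z) - log q(z)]
    return sum(qz * math.log(jz / qz) for qz, jz in zip(q, joint) if qz > 0)

def elbo_split(q):
    # Reconstruction term minus KL to the prior
    recon = sum(qz * math.log(lz) for qz, lz in zip(q, lik))
    return recon - kl(q, prior)

q = [0.5, 0.5]                       # an arbitrary approximate posterior
gap = log_px - elbo(q)               # should equal KL(q || true posterior)
gap_kl = kl(q, posterior)
elbo_at_posterior = elbo(posterior)  # bound is tight here
```

The gap equals the KL to the true posterior, both forms of the ELBO agree, and plugging in the true posterior recovers `log_px` exactly.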

See 21_dimensionality_reduction/ (autoencoders), 33_information_theory/.


12. Bias-variance decomposition

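Setup: estimate $\hat{f}(x; D)$ from random training set $D$. Evaluate at fixed $x$, with target $y = f(x) + \varepsilon$, $E[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\varepsilon$ independent of $D$.

Steps:

  1. Let $\bar{f}(x) = E_D[\hat{f}(x; D)]$.
  2. Add and subtract: $(y - \hat{f})^2 = (y - \bar{f})^2 + 2 (y - \bar{f})(\bar{f} - \hat{f}) + (\bar{f} - \hat{f})^2$.
  3. Cross-term vanishes: take $E_D$. $y$ and $\bar{f}$ are constants w.r.t. $D$, so $E_D[(y - \bar{f})(\bar{f} - \hat{f})] = (y - \bar{f})\,(\bar{f} - E_D[\hat{f}]) = 0$ (by definition of $\bar{f}$).
  4. $E_D[(y - \hat{f})^2] = (y - \bar{f})^2 + E_D[(\hat{f} - \bar{f})^2]$.
  5. Now take $E$ over the noise in $y$: first term becomes $(f(x) - \bar{f}(x))^2 + \sigma^2$, i.e. bias$^2$ plus irreducible noise. Second term is $\mathrm{Var}_D[\hat{f}(x)]$, the variance.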
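
The decomposition can be simulated with a deliberately biased estimator. The sketch below assumes a made-up setting: the true value at the query point is 2.0, each training set is 5 noisy observations of it, and the estimator is the sample mean shrunk by a factor of 0.8. All names and constants are illustrative.

```python
import random

def mean(values):
    values = list(values)
    return sum(values) / len(values)

rng = random.Random(0)         # fixed seed so the run is repeatable
f_true, sigma = 2.0, 1.0       # hypothetical f(x) and noise std
n, shrink = 5, 0.8             # training-set size and shrinkage factor
trials = 2000

# One prediction per simulated training set D
preds = []
for _ in range(trials):
    sample = [f_true + rng.gauss(0.0, sigma) for _ in range(n)]
    preds.append(shrink * mean(sample))

f_bar = mean(preds)                                # estimate of E_D[f_hat]
variance = mean((p - f_bar) ** 2 for p in preds)   # estimate of Var_D[f_hat]

# Step 4, for one fixed label y0: the identity holds exactly on the sample
y0 = 2.3
mse_fixed = mean((y0 - p) ** 2 for p in preds)
decomposed_fixed = (y0 - f_bar) ** 2 + variance

# Step 5: averaging over label noise turns (y - f_bar)^2 into bias^2 + sigma^2
bias_sq = (f_true - f_bar) ** 2
expected_error = bias_sq + sigma ** 2 + variance
```

Analytically this setup has $\bar{f} = 1.6$, bias$^2 = 0.16$, and variance $0.8^2 \cdot 1/5 = 0.128$; the simulated `f_bar` and `variance` should land close to those values.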

See 27_advanced_theory/, 52_statistical_learning_theory/.


13. Information gain (decision tree split)

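Setup: dataset $S$ with class labels.

Entropy: $H(S) = -\sum_c p_c \log_2 p_c$.

After split on feature $A$ into $\{S_v\}$: $H(S \mid A) = \sum_v \frac{|S_v|}{|S|} H(S_v)$.

Information gain: $IG(S, A) = H(S) - H(S \mid A)$.

Key identity: $IG(S, A) = H(Y) - H(Y \mid A) = I(Y; A)$. IG is exactly the mutual information between class label $Y$ and feature $A$. That makes it intuitive: pick the feature that's most informative about the label.

Why $IG \ge 0$: conditioning never increases entropy (Jensen on concave $H$, applied to $p(y) = \sum_v p(v)\, p(y \mid v)$). Equality iff $Y \perp A$.

Tree picks the split that maximizes IG (or Gini decrease in CART).

Gini: $G(S) = 1 - \sum_c p_c^2$. Computationally cheaper (no log); similar selection.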
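
Entropy, information gain, and Gini are a few lines each. The sketch below uses a made-up 8-example dataset with one perfectly predictive feature and one uninformative feature, to show the two extremes of IG.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, feature):
    """H(S) minus the size-weighted entropy of the subsets induced by feature."""
    n = len(labels)
    groups = {}
    for y, a in zip(labels, feature):
        groups.setdefault(a, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

labels = [1, 1, 1, 0, 0, 0, 1, 0]
feature_good = ["a", "a", "a", "b", "b", "b", "a", "b"]   # determines the label
feature_bad = ["a", "b", "a", "b", "a", "b", "b", "a"]    # independent of the label

ig_good = information_gain(labels, feature_good)
ig_bad = information_gain(labels, feature_bad)
g = gini(labels)
```

The predictive feature recovers the full 1 bit of label entropy, while the uninformative one gains nothing, matching the mutual-information reading of IG.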

See 26_tree_based_methods/.


14. Common interview gotchas

Question | Common wrong answer | Right answer
What's $\sqrt{d_k}$ in attention? | Tradition | Variance scaling: keeps $QK^\top$ entries at unit variance
Cross-entropy + softmax gradient? | Complicated | $\hat{y} - y$. Beautifully simple.
Why does EM converge? | Gradient descent | Each E-step makes the lower bound tight; M-step maximizes it; likelihood is monotone non-decreasing
What does ELBO bound? | Posterior | Log-marginal-likelihood from below
KL forward vs reverse? | Same | Forward mode-covering (MLE); reverse mode-seeking (VI)
SVM dual support vectors? | Random points | Points where $\alpha_i > 0$; on/violating margin
RoPE relative property? | Magic | $\langle R_m q, R_n k \rangle$ depends only on $n - m$

15. Eight derivations to drill cold

  1. 2-layer MLP backprop with cross-entropy + softmax.
  2. Scaled dot-product attention with multi-head + masking.
  3. OLS gradient + closed form with PSD Hessian.
  4. Logistic regression gradient showing convexity.
  5. EM for GMM: E-step posterior, M-step updates.
  6. DPO loss from RLHF + Bradley-Terry.
  7. ELBO derivation via Jensen's inequality.
  8. Bias-variance decomposition.

For each: 5 minutes on a whiteboard. Until automatic.


16. Drill plan

  • 1 derivation per day for 8 days. Then cycle.
  • Time yourself: 5 min per derivation cold; 3 min after a week of practice.
  • Practice teaching each: explain to an imaginary interviewer.
  • Pair the derivation with the relevant deep dive's "8 most-asked interview questions" to make sure you can recite both proof and intuition.

17. Further reading

This deep dive is a meta-collection. The full derivations live in:

  • 31_neural_networks for backprop.
  • 04_transformers and 05_attention_mechanisms for attention.
  • 01_classical_ml for OLS and logistic.
  • 19_advanced_clustering for EM.
  • 08_training_techniques for DPO.
  • 21_dimensionality_reduction for ELBO/VAE.
  • 27_advanced_theory for bias-variance.

Drill the derivations in those locations and you'll be ready for the whiteboard rounds.