Whiteboard Derivations — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

This deep dive is the catalog of derivations you should be able to do on a whiteboard cold. Frontier-lab interviews routinely ask "derive X" — backprop, attention, OLS gradient, KL, EM, DPO. Knowing the shape of these derivations beats memorizing the answer.

This is a meta-document that points to the relevant deep dive for each derivation while listing the key steps you need to recite.

1. Backpropagation for a 2-layer MLP

Setup: $z_{1} = W_{1} x + b_{1}$ , $h_{1} = σ (z_{1})$ , $z_{2} = W_{2} h_{1} + b_{2}$ , $\hat{y} = \operatorname{softmax} (z_{2})$ , $L = - \sum_{i} y_{i} \log \hat{y}_{i}$ .

Steps:

Cross-entropy + softmax simplification (the magic step — derive it, don't just assert):

Softmax Jacobian: $\partial \hat{y}_{i} / \partial z_{2, j} = \hat{y}_{i} (δ_{ij} - \hat{y}_{j})$ .

$\partial L / \partial \hat{y}_{i} = - y_{i} / \hat{y}_{i}$ .

$\partial L / \partial z_{2, j} = \sum_{i} \frac{\partial L}{\partial \hat{y}_{i}} \frac{\partial \hat{y}_{i}}{\partial z_{2, j}} = - \sum_{i} \frac{y_{i}}{\hat{y}_{i}} \hat{y}_{i} (δ_{ij} - \hat{y}_{j}) = - y_{j} + \hat{y}_{j} \sum_{i} y_{i} = \hat{y}_{j} - y_{j}$ (using $\sum_{i} y_{i} = 1$ ).

So $δ_{2} = \hat{y} - y$ .
$\nabla_{W_{2}} L = δ_{2} h_{1}^{⊤}$ .
$\nabla_{b_{2}} L = δ_{2}$ .
$δ_{1} = W_{2}^{⊤} δ_{2} ⊙ σ^{'} (z_{1})$ .
$\nabla_{W_{1}} L = δ_{1} x^{⊤}$ .
$\nabla_{b_{1}} L = δ_{1}$ .

Key insights:

Cross-entropy + softmax simplifies dramatically: the gradient is just $\hat{y} - y$ . The $\hat{y}_{i}$ factor in softmax's Jacobian cancels the $1 / \hat{y}_{i}$ from CE.
Chain rule: each layer multiplies by $W^{⊤}$ (transpose) and $σ^{'}$ .
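
A finite-difference check is a quick way to confirm the formulas above. This is a minimal NumPy sketch, not a reference implementation: it assumes $σ = \tanh$ and small hypothetical layer sizes, and the helper names (`forward`, `backward`, `numerical_grads`) are made up for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift by the max for numerical stability
    return e / e.sum()

def forward(params, x, y):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    h1 = np.tanh(z1)  # assumption for this sketch: sigma = tanh
    y_hat = softmax(W2 @ h1 + b2)
    loss = -np.sum(y * np.log(y_hat))
    return loss, z1, h1, y_hat

def backward(params, x, y):
    W1, b1, W2, b2 = params
    _, z1, h1, y_hat = forward(params, x, y)
    delta2 = y_hat - y                                   # softmax + CE shortcut
    delta1 = (W2.T @ delta2) * (1.0 - np.tanh(z1) ** 2)  # tanh'(z) = 1 - tanh(z)^2
    # Same order as params: W1, b1, W2, b2
    return [np.outer(delta1, x), delta1, np.outer(delta2, h1), delta2]

def numerical_grads(params, x, y, eps=1e-5):
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            loss_plus = forward(params, x, y)[0]
            p[idx] = old - eps
            loss_minus = forward(params, x, y)[0]
            p[idx] = old  # restore the entry
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
params = [rng.standard_normal((4, 3)), rng.standard_normal(4),
          rng.standard_normal((3, 4)), rng.standard_normal(3)]
x = rng.standard_normal(3)
y = np.array([0.0, 1.0, 0.0])  # one-hot target

analytic = backward(params, x, y)
numeric = numerical_grads(params, x, y)
max_err = max(np.abs(a - n).max() for a, n in zip(analytic, numeric))
print(f"max |analytic - numeric| = {max_err:.2e}")
```

If the analytic and numeric gradients disagree, the usual culprits are a missing transpose on $W_{2}$ or a forgotten $σ^{'}$ factor.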

See 31_neural_networks/.

2. Scaled dot-product attention

Setup: $Q, K, V \in R^{L \times d}$ .

Steps:

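$scores = Q K^{⊤} / \sqrt{d}$ .
$attn = \operatorname{softmax} (scores)$ (row-wise).
$output = attn \cdot V$ .

Why $\sqrt{d}$ : if the entries of $Q, K$ are independent with zero mean and unit variance, each entry of $Q K^{⊤}$ is a sum of $d$ such products and has variance $d$ . Dividing by $\sqrt{d}$ brings the variance back to 1 → softmax doesn't saturate.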
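
The variance argument is easy to check numerically. A minimal single-head NumPy sketch, assuming i.i.d. standard-normal $Q, K, V$ and no masking; the sizes `L = 256`, `d = 64` are arbitrary choices for illustration.

```python
import numpy as np

def softmax_rows(s):
    s = s - s.max(axis=-1, keepdims=True)  # stabilise each row
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # scale by sqrt(d), not d
    attn = softmax_rows(scores)    # each row is a distribution over keys
    return attn @ V, attn

rng = np.random.default_rng(0)
L, d = 256, 64
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

raw_var = (Q @ K.T).var()                  # roughly d
scaled_var = (Q @ K.T / np.sqrt(d)).var()  # roughly 1
out, attn = attention(Q, K, V)
print(f"raw variance ~ {raw_var:.1f}, scaled variance ~ {scaled_var:.2f}")
```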

Multi-head: project to $h$ heads of dim $d / h$ ; do attention per head; concatenate; project back.

See 04_transformers/, 05_attention_mechanisms/.

3. OLS closed form

Setup: $L (w) = \frac{1}{2} ∥ y - Xw ∥^{2}$ .

Steps:

$\nabla_{w} L = - X^{⊤} (y - Xw) = X^{⊤} Xw - X^{⊤} y$ .
Set to zero: $X^{⊤} Xw = X^{⊤} y$ .
Solve: $\hat{w} = (X^{⊤} X)^{- 1} X^{⊤} y$ (assuming $X^{⊤} X$ invertible).

Hessian: $\nabla^{2} L = X^{⊤} X$ — PSD always; PD if $X$ has full column rank.

Geometric: $\hat{y} = P y$ where $P = X (X^{⊤} X)^{- 1} X^{⊤}$ is the orthogonal projection onto $Col (X)$ .
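
A short NumPy sketch to sanity-check the closed form and the projection picture, assuming synthetic data with full column rank. The true weights and noise level are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

# Normal equations: solve X^T X w = X^T y rather than forming the inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

grad_at_w_hat = X.T @ X @ w_hat - X.T @ y  # should vanish at the optimum
P = X @ np.linalg.solve(X.T @ X, X.T)      # projection onto Col(X)
residual = y - P @ y                       # orthogonal to every column of X
```

In practice `np.linalg.lstsq` (or a QR factorisation) is usually preferred over the normal equations when $X^{⊤} X$ is ill-conditioned.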

See 24_linear_algebra_qa/, 48_optimization_and_matrix_calculus/.

4. Logistic regression gradient

Setup: $p = σ (w^{⊤} x)$ , $L = - [y \log p + (1 - y) \log (1 - p)]$ .

Steps:

$\partial L / \partial p = - y / p + (1 - y) / (1 - p) = (p - y) / (p (1 - p))$ (combine fractions).
$\partial p / \partial z = σ (z) (1 - σ (z)) = p (1 - p)$ (sigmoid derivative).
Chain rule — the magic cancellation: $\partial L / \partial z = \frac{p - y}{p ( 1 - p )} \cdot p (1 - p) = p - y$ . The $p (1 - p)$ from sigmoid derivative kills the $p (1 - p)$ in the denominator from CE — that's the GLM canonical-link beauty.
$\nabla_{w} L = (p - y) x$ (since $z = w^{⊤} x$ , $\partial z / \partial w = x$ ).

Key insight: same gradient form as linear regression (residual times input) — that's why these models feel the same. Hessian is $\sum p (1 - p) x x^{⊤}$ , always PSD → loss convex.
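
A minimal NumPy sketch that checks $\nabla_{w} L = (p - y) x$ against central differences and confirms the Hessian is PSD on random data. The function names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(w, x, y):
    p = sigmoid(w @ x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def grad(w, x, y):
    return (sigmoid(w @ x) - y) * x  # residual times input

def hessian(w, X):
    p = sigmoid(X @ w)
    return (X * (p * (1 - p))[:, None]).T @ X  # sum_i p_i (1 - p_i) x_i x_i^T

rng = np.random.default_rng(0)
w, x, y = rng.standard_normal(3), rng.standard_normal(3), 1.0

eps = 1e-6
numeric = np.array([(nll(w + eps * e, x, y) - nll(w - eps * e, x, y)) / (2 * eps)
                    for e in np.eye(3)])
grad_err = np.abs(grad(w, x, y) - numeric).max()

X = rng.standard_normal((20, 3))
min_eig = np.linalg.eigvalsh(hessian(w, X)).min()  # non-negative up to round-off
```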

See 01_classical_ml/, 37_mle_map_estimation/.

5. KL divergence

Definition: $KL (p ∥ q) = \sum_{x} p (x) \log \frac{p (x)}{q (x)}$ .

Properties:

$\geq 0$ , with equality iff $p = q$ (Gibbs' inequality). Proof via Jensen (memorize this — most-asked):

$- KL (p ∥ q) = \sum_{x} p (x) \log \frac{q (x)}{p (x)}$ . Since $\log$ is concave, Jensen's inequality gives $\sum_{x} p (x) \log \frac{q (x)}{p (x)} \leq \log \sum_{x} p (x) \cdot \frac{q (x)}{p (x)} = \log \sum_{x} q (x) = \log 1 = 0$ . So $- KL \leq 0$ , i.e. $KL \geq 0$ . Equality iff $q / p$ is constant, i.e. $p = q$ (since both are distributions).
Asymmetric: $KL (p ∥ q) \neq KL (q ∥ p)$ in general.
Forward KL ( $KL (p^{*} ∥ q)$ ): mean-seeking. MLE.
Reverse KL ( $KL (q ∥ p^{*})$ ): mode-seeking. Variational inference.

MLE = forward KL minimization: $- E_{p^{*}} [\log q_{θ} (x)] = KL (p^{*} ∥ q_{θ}) + H (p^{*})$ , so $\arg \max_{θ} E_{p^{*}} [\log q_{θ} (x)] = \arg \min_{θ} KL (p^{*} ∥ q_{θ})$ . The entropy term $H (p^{*})$ does not depend on $θ$ .
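
A tiny numeric illustration of non-negativity, asymmetry, and the cross-entropy identity used for MLE. The two distributions are arbitrary examples, and `kl` is a hypothetical helper that assumes strictly positive probabilities.

```python
import numpy as np

def kl(p, q):
    # Assumes p and q are strictly positive and each sums to 1.
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

kl_pq = kl(p, q)  # forward direction
kl_qp = kl(q, p)  # reverse direction, a different number
kl_pp = kl(p, p)  # zero when the arguments match

entropy_p = float(-np.sum(p * np.log(p)))
cross_entropy = float(-np.sum(p * np.log(q)))  # equals KL(p || q) + H(p)
```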

See 33_information_theory/, 37_mle_map_estimation/.

6. EM for GMM

Setup: $p (x) = \sum_{k} π_{k} N (x ∣ μ_{k}, Σ_{k})$ .

E-step: posterior responsibilities

$γ_{ik} = \frac{π _{k} N ( x _{i} ∣ μ _{k} , Σ _{k} )}{\sum _{j} π _{j} N ( x _{i} ∣ μ _{j} , Σ _{j} )}$

M-step: weighted MLE updates

$μ_{k} = \frac{\sum _{i} γ _{ik} x _{i}}{\sum _{i} γ _{ik}}$

$Σ_{k} = \frac{\sum _{i} γ _{ik} ( x _{i} - μ _{k} ) ( x _{i} - μ _{k} ) ^{⊤}}{\sum _{i} γ _{ik}}$

$π_{k} = \frac{\sum _{i} γ _{ik}}{N}$

Why EM converges (the key identity to memorize):

For any distribution $q (z)$ : $\log p_{θ} (x) = \underbrace{E_{q} [\log \frac{p_{θ} (x, z)}{q (z)}]}_{L (q, θ) \text{ (ELBO)}} + \underbrace{KL (q (z) ∥ p_{θ} (z ∣ x))}_{\geq 0}$

So $\log p_{θ} (x) \geq L (q, θ)$ always, with equality iff $q = p_{θ} (z ∣ x)$ .

E-step: set $q_{t} = p_{θ_{t}} (z ∣ x)$ (the posterior responsibilities $γ_{ik}$ ). KL = 0 → bound is tight: $\log p_{θ_{t}} (x) = L (q_{t}, θ_{t})$ .
M-step: $θ_{t + 1} = \arg \max_{θ} L (q_{t}, θ)$ (since $q_{t}$ is fixed, this is just weighted MLE). $θ_{t + 1}$ raises the bound.
Net: $\log p_{θ_{t + 1}} (x) \geq L (q_{t}, θ_{t + 1}) \geq L (q_{t}, θ_{t}) = \log p_{θ_{t}} (x)$ . The likelihood is non-decreasing, so if it is bounded above the sequence of likelihood values converges.
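
The monotonicity claim can be watched directly. Below is a minimal 1-D, two-component sketch in NumPy, assuming synthetic data and a naive initialisation (means drawn from the data, shared initial variance). `em_gmm_1d` is a hypothetical name, and the sketch does nothing to guard against variance collapse.

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(x, K=2, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, x.var())
    history = []
    for _ in range(iters):
        dens = pi * normal_pdf(x[:, None], mu, var)     # shape (N, K)
        history.append(np.log(dens.sum(axis=1)).sum())  # log-likelihood at current params
        gamma = dens / dens.sum(axis=1, keepdims=True)  # E-step: responsibilities
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk      # M-step: weighted MLE
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / len(x)
    return pi, mu, var, np.array(history)

data_rng = np.random.default_rng(1)
x = np.concatenate([data_rng.normal(-2.0, 0.5, 200), data_rng.normal(3.0, 1.0, 300)])
pi, mu, var, history = em_gmm_1d(x)
print("log-likelihood never decreases:", bool(np.all(np.diff(history) >= -1e-8)))
```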

See 19_advanced_clustering/.

7. PCA via SVD

Setup: centered $X \in R^{n \times d}$ .

Steps:

Center the data, compute covariance: $Σ = X^{⊤} X / n$ .
SVD of centered $X$ : $X = U S V^{⊤}$ with $U^{⊤} U = I$ , $V^{⊤} V = I$ .
Substitute and simplify: $X^{⊤} X = (U S V^{⊤})^{⊤} (U S V^{⊤}) = V S U^{⊤} U S V^{⊤} = V S^{2} V^{⊤}$ (using $U^{⊤} U = I$ — that's the load-bearing step). So $Σ = V (S^{2} / n) V^{⊤}$ — this is the eigendecomposition of $Σ$ .
Top- $k$ principal directions: the first $k$ columns of $V$ (call them $V_{k}$ ). Variances along them: the corresponding entries of $S^{2} / n$ .
Reduced data: $X V_{k} = U_{k} S_{k}$ (project data onto top- $k$ directions).

Eckart-Young: truncated SVD $X_{k} = U_{k} S_{k} V_{k}^{⊤}$ minimizes $∥ X - \hat{X} ∥_{F}^{2}$ over all $\hat{X}$ of rank at most $k$ .
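
A minimal NumPy check of the three identities above, assuming random correlated data. The shapes `n = 200`, `d = 5`, `k = 2` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 2
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # correlated features
X = X - X.mean(axis=0)                                         # center first

U, S, Vt = np.linalg.svd(X, full_matrices=False)  # singular values in descending order
cov = X.T @ X / n
eigvals = np.linalg.eigvalsh(cov)[::-1]           # eigvalsh is ascending, so reverse

Z = X @ Vt[:k].T                         # projected data X V_k
X_k = (U[:, :k] * S[:k]) @ Vt[:k]        # rank-k reconstruction U_k S_k V_k^T
err = np.linalg.norm(X - X_k, "fro") ** 2  # equals the sum of discarded S^2
```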

See 21_dimensionality_reduction/.

8. SVM dual

Primal: $min_{w} \frac{1}{2} ∥ w ∥^{2}$ s.t. $y_{i} (w^{⊤} x_{i} + b) \geq 1$ .

Lagrangian: $L = \frac{1}{2} ∥ w ∥^{2} - \sum_{i} α_{i} [y_{i} (w^{⊤} x_{i} + b) - 1]$ .

Steps:

$\partial L / \partial w = w - \sum_{i} α_{i} y_{i} x_{i} = 0 ⟹ w^{*} = \sum_{i} α_{i} y_{i} x_{i}$ .
$\partial L / \partial b = - \sum_{i} α_{i} y_{i} = 0 ⟹ \sum_{i} α_{i} y_{i} = 0$ (constraint on $α$ ).
Substitute $w^{*}$ back into $L$ — this is the load-bearing step:
- $\frac{1}{2} ∥ w^{*} ∥^{2} = \frac{1}{2} \sum_{i, j} α_{i} α_{j} y_{i} y_{j} x_{i}^{⊤} x_{j}$ .
- $\sum_{i} α_{i} y_{i} (w^{* ⊤} x_{i}) = \sum_{i} α_{i} y_{i} \sum_{j} α_{j} y_{j} x_{j}^{⊤} x_{i} = \sum_{i, j} α_{i} α_{j} y_{i} y_{j} x_{i}^{⊤} x_{j}$ (the full quadratic).
- $\sum_{i} α_{i} y_{i} b = b \cdot 0 = 0$ (using $\sum α_{i} y_{i} = 0$ ).
- $\sum_{i} α_{i}$ stays.
- Combining: $L (w^{*}, b, α) = \frac{1}{2} \sum_{ij} α_{i} α_{j} y_{i} y_{j} x_{i}^{⊤} x_{j} - \sum_{ij} α_{i} α_{j} y_{i} y_{j} x_{i}^{⊤} x_{j} + \sum_{i} α_{i} = \sum_{i} α_{i} - \frac{1}{2} \sum_{i, j} α_{i} α_{j} y_{i} y_{j} x_{i}^{⊤} x_{j}$ .

Dual: $max_{α} \sum_{i} α_{i} - \frac{1}{2} \sum_{i, j} α_{i} α_{j} y_{i} y_{j} x_{i}^{⊤} x_{j}$ s.t. $α \geq 0, \sum_{i} α_{i} y_{i} = 0$ .

Kernel trick: replace $x_{i}^{⊤} x_{j}$ with $K (x_{i}, x_{j})$ . The dual is the only place data enters as inner products — perfect for kernels.

KKT — support vectors: complementary slackness gives $α_{i} > 0$ only for points where $y_{i} (w^{⊤} x_{i} + b) = 1$ (on margin); for soft-margin with $0 \leq α_{i} \leq C$ , $α_{i} = C$ for margin violators.
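
A toy problem makes the dual concrete. This sketch assumes two 1-D points, $x = \pm 1$ with labels $y = \pm 1$; the equality constraint forces $α_{1} = α_{2}$ , so a plain grid scan over one scalar stands in for a real QP solver.

```python
import numpy as np

X = np.array([[1.0], [-1.0]])   # two 1-D points
y = np.array([1.0, -1.0])
G = np.outer(y, y) * (X @ X.T)  # G_ij = y_i y_j x_i^T x_j

def dual(alpha):
    return alpha.sum() - 0.5 * alpha @ G @ alpha

# sum_i alpha_i y_i = 0 forces alpha_1 = alpha_2 = a here, so scan over a >= 0.
grid = np.linspace(0.0, 2.0, 2001)
values = np.array([dual(np.array([a, a])) for a in grid])
a_star = grid[values.argmax()]

alpha = np.array([a_star, a_star])
w = (alpha * y) @ X        # w* = sum_i alpha_i y_i x_i
b = y[0] - w @ X[0]        # from y_0 (w^T x_0 + b) = 1 on a support vector
margins = y * (X @ w + b)  # both points sit exactly on the margin
```

Here the dual reduces to $2 a - 2 a^{2}$ , maximized at $a = 1/2$ , which gives $w = 1$ , $b = 0$ and both points as support vectors.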

See 35_kernel_functions/, 48_optimization_and_matrix_calculus/.

9. RoPE rotation

Goal: encode relative position via rotation in 2D subspaces.

Setup: pair up dimensions; for pair $(2 i, 2 i + 1)$ , apply rotation by $m θ_{i}$ to position $m$ :

$R_{m} = \begin{pmatrix} \cos m θ_{i} & - \sin m θ_{i} \\ \sin m θ_{i} & \cos m θ_{i} \end{pmatrix}$

with $θ_{i} = 10000^{- 2 i / d}$ .

Property: $⟨ R_{m} q, R_{n} k ⟩ = ⟨ q, R_{n - m} k ⟩$ . Inner product depends only on the relative position $n - m$ .

Why this works (the algebra to memorize):

$⟨ R_{m} q, R_{n} k ⟩ = (R_{m} q)^{⊤} (R_{n} k) = q^{⊤} R_{m}^{⊤} R_{n} k$ .
Rotations are orthogonal, so $R_{m}^{⊤} = R_{m}^{- 1} = R_{- m}$ .
Rotations also compose by adding angles: $R_{- m} R_{n} = R_{n - m}$ .
Therefore $q^{⊤} R_{n - m} k = ⟨ q, R_{n - m} k ⟩$ — a function of $n - m$ only.

This is what makes attention scores depend on relative position without adding any position embeddings to the input.
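
A minimal NumPy sketch of the relative-position property, assuming the standard counter-clockwise 2D rotation on pairs $(2 i, 2 i + 1)$ and base 10000. `rope` is a hypothetical helper for a single vector, not any library's API.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    d = x.shape[-1]
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)  # one frequency per 2D pair
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin  # rotate each (even, odd) pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
m, n = 5, 12

lhs = rope(q, m) @ rope(k, n)                  # <R_m q, R_n k>
rhs = q @ rope(k, n - m)                       # <q, R_{n-m} k>
shifted = rope(q, m + 100) @ rope(k, n + 100)  # same offset, different absolute positions
```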

See 14_advanced_positional_embeddings/.

10. DPO (direct preference optimization)

Starting point: RLHF objective with KL regularization to a reference policy:

$\max_{π} E_{x, y \sim π} [r (x, y)] - β KL (π (\cdot ∣ x) ∥ π_{ref} (\cdot ∣ x))$

Step 1 — derive the closed-form optimal policy. For a fixed prompt $x$ , set up the Lagrangian on the constrained max, with multiplier $λ (x)$ for $\sum_{y} π (y ∣ x) = 1$ . Setting $\partial / \partial π (y ∣ x) = 0$ gives $r (x, y) - β (\log \frac{π (y ∣ x)}{π_{ref} (y ∣ x)} + 1) + λ (x) = 0$ , i.e. $\log π (y ∣ x) = \log π_{ref} (y ∣ x) + r (x, y) / β - \log Z (x)$ , where $\log Z (x) = 1 - λ (x) / β$ collects the terms that do not depend on $y$ and is fixed by normalization. Cleaning up:

$π^{*} (y ∣ x) = \frac{1}{Z ( x )} π_{ref} (y ∣ x) exp (r (x, y) / β)$

with $Z (x) = \sum_{y} π_{ref} (y ∣ x) exp (r (x, y) / β)$ — depends only on prompt $x$ , not on $y$ .

Step 2 — invert for $r$ :

$r (x, y) = β \log \frac{π^{*} (y ∣ x)}{π_{ref} (y ∣ x)} + β \log Z (x)$

Step 3 — substitute into Bradley-Terry: $p (y_{w} ≻ y_{l} ∣ x) = σ (r (x, y_{w}) - r (x, y_{l}))$ . Critically, $β \log Z (x)$ depends on $x$ only; it appears identically in both reward terms and cancels in the subtraction.

Step 4 — final DPO loss (NLL of preferences):

$L_{DPO} = - \log σ (β \log \frac{π_{θ} (y_{w} ∣ x)}{π_{ref} (y_{w} ∣ x)} - β \log \frac{π_{θ} (y_{l} ∣ x)}{π_{ref} (y_{l} ∣ x)})$

Key insight: closed-form optimal policy + $Z (x)$ depending only on prompt = reward model eliminates itself. No RL loop, no rollouts, just a supervised classification loss on preferences.
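
A small NumPy sketch of the per-pair loss, plus a check that $Z (x)$ really cancels. It assumes a toy prompt with four candidate responses and made-up rewards; `dpo_loss` takes sequence log-probabilities as plain floats and is not tied to any training library.

```python
import numpy as np

def log_sigmoid(u):
    return -np.logaddexp(0.0, -u)  # stable log(1 / (1 + exp(-u)))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # beta * (log-ratio of the chosen response minus log-ratio of the rejected one)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -log_sigmoid(margin)

# Toy check of Steps 1-3: four candidate responses for one prompt.
beta = 0.5
pi_ref = np.array([0.4, 0.3, 0.2, 0.1])
r = np.array([1.0, -0.5, 2.0, 0.0])        # hypothetical rewards
unnorm = pi_ref * np.exp(r / beta)
Z = unnorm.sum()
pi_star = unnorm / Z                       # closed-form optimal policy
implied = beta * np.log(pi_star / pi_ref)  # equals r - beta * log Z

w, l = 2, 1                                # indices of chosen / rejected responses
implied_diff = implied[w] - implied[l]     # Z cancels, leaving r[w] - r[l]

log_ref = np.log(pi_ref)
loss_at_ref = dpo_loss(log_ref[w], log_ref[l], log_ref[w], log_ref[l], beta)
```

When the policy equals the reference, the margin is zero and the loss is $\log 2$ , which is a handy value to look for at the start of training.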

See 08_training_techniques/.

11. Variational lower bound (ELBO)

Setup: latent-variable model $p_{θ} (x, z)$ . Want to maximize $\log p_{θ} (x)$ .

Trick: introduce variational distribution $q (z ∣ x)$ and use Jensen's:

$\log p_{θ} (x) = \log \int p_{θ} (x, z) d z = \log E_{q (z ∣ x)} [\frac{p_{θ} (x, z)}{q (z ∣ x)}]$

Jensen's inequality for concave $\log$ : $\log E [X] \geq E [\log X]$ . Apply it:

$\log p_{θ} (x) = \log E_{q} [\frac{p_{θ} (x, z)}{q (z ∣ x)}] \geq E_{q} [\log \frac{p_{θ} (x, z)}{q (z ∣ x)}] = E_{q} [\log p_{θ} (x, z)] + H (q)$

This is the ELBO.

Equivalent form (split $\log p_{θ} (x, z) = \log p_{θ} (x ∣ z) + \log p (z)$ ):

$ELBO = E_{q} [\log p_{θ} (x ∣ z)] + E_{q} [\log p (z)] - E_{q} [\log q (z ∣ x)] = E_{q} [\log p_{θ} (x ∣ z)] - KL (q (z ∣ x) ∥ p (z))$

The gap to true log-likelihood: $\log p_{θ} (x) - ELBO = KL (q (z ∣ x) ∥ p_{θ} (z ∣ x))$ , exactly the KL between approximate and true posterior. ELBO is tight when $q$ matches the true posterior.

Reconstruction term + KL-to-prior term. The VAE objective.
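
With a discrete latent, every quantity above is a finite sum, so the identities can be checked exactly. A minimal NumPy sketch, assuming a three-state latent $z$ and made-up numbers for the prior, the likelihood of one fixed $x$ , and $q$ .

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])   # p(z)
lik = np.array([0.10, 0.40, 0.05])  # p(x | z) for one fixed observation x
q = np.array([0.2, 0.7, 0.1])       # an arbitrary q(z | x)

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

joint = prior * lik                 # p(x, z)
log_px = float(np.log(joint.sum()))
posterior = joint / joint.sum()     # p(z | x)

elbo = float(np.sum(q * np.log(joint / q)))
elbo_vae_form = float(np.sum(q * np.log(lik))) - kl(q, prior)  # reconstruction - KL-to-prior
gap = log_px - elbo                                            # KL(q || posterior)
elbo_at_posterior = float(np.sum(posterior * np.log(joint / posterior)))
```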

See 21_dimensionality_reduction/ (autoencoders), 33_information_theory/.

12. Bias-variance decomposition

Setup: estimate $f^{*} (x)$ from random training set $D$ . Evaluate at fixed $x$ .

Steps:

Let $\bar{f} (x) = E_{D} [\hat{f}_{D} (x)]$ .
Add and subtract: $(y - \hat{f}_{D})^{2} = (y - \bar{f} + \bar{f} - \hat{f}_{D})^{2} = (y - \bar{f})^{2} + 2 (y - \bar{f}) (\bar{f} - \hat{f}_{D}) + (\bar{f} - \hat{f}_{D})^{2}$ .
Cross-term vanishes: take $E_{D}$ . $y$ and $\bar{f}$ are constants w.r.t. $D$ , so $E_{D} [2 (y - \bar{f}) (\bar{f} - \hat{f}_{D})] = 2 (y - \bar{f}) E_{D} [\bar{f} - \hat{f}_{D}] = 2 (y - \bar{f}) \cdot 0 = 0$ (by definition of $\bar{f}$ ).
$E_{D} [(y - \hat{f}_{D})^{2}] = (y - \bar{f})^{2} + E_{D} [(\bar{f} - \hat{f}_{D})^{2}]$ .
Now take $E$ over the noise in $y = f^{*} (x) + ϵ$ (with $E [ϵ] = 0$ , $Var (ϵ) = σ^{2}$ ): first term becomes $(\bar{f} - f^{*})^{2} + σ^{2} = Bias^{2} + σ^{2}$ . Second term is $Var$ .
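
A Monte Carlo sketch of the decomposition, assuming $f^{*} = \sin$ , Gaussian noise, and a deliberately underfit degree-1 polynomial as the estimator. All constants are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = np.sin
x0, sigma, n_train, n_trials = np.pi / 2, 0.3, 20, 5000

# Refit the model on a fresh training set D each trial and predict at the fixed x0.
preds = np.empty(n_trials)
for t in range(n_trials):
    x = rng.uniform(0.0, np.pi, n_train)
    y = f_star(x) + sigma * rng.standard_normal(n_train)
    coeffs = np.polyfit(x, y, deg=1)
    preds[t] = np.polyval(coeffs, x0)

f_bar = preds.mean()
bias_sq = (f_bar - f_star(x0)) ** 2
variance = preds.var()

# Fixed y: the cross-term vanishes exactly for the empirical mean over D.
y_fixed = 1.2
lhs_fixed = np.mean((y_fixed - preds) ** 2)
rhs_fixed = (y_fixed - f_bar) ** 2 + variance

# Noisy y: the match is only approximate, up to Monte Carlo error.
y_test = f_star(x0) + sigma * rng.standard_normal(n_trials)
mse = np.mean((y_test - preds) ** 2)
decomposed = bias_sq + variance + sigma ** 2
print(f"mse = {mse:.4f}, bias^2 + var + sigma^2 = {decomposed:.4f}")
```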

See 27_advanced_theory/, 52_statistical_learning_theory/.

13. Information gain (decision tree split)

Setup: dataset $S$ with class labels.

Entropy: $H (S) = - \sum_{c} p_{c} \log p_{c}$ .

After split on feature $A$ into $\{ S_{v} \}$ :

$H (S ∣ A) = \sum_{v} \frac{∣ S_{v} ∣}{∣ S ∣} H (S_{v})$

Information gain: $IG = H (S) - H (S ∣ A)$ .

Key identity: $IG = I (S; A)$ — IG is exactly the mutual information between class label and feature $A$ . That makes it intuitive: pick the feature that's most informative about the label.

Why $IG \geq 0$ : conditioning never increases entropy (Jensen on concave $H$ , applied to $H (S ∣ A) \leq H (S)$ ). Equality iff $S ⊥ A$ .

Tree picks the split that maximizes IG (or Gini decrease in CART).

Gini: $G (S) = 1 - \sum_{c} p_{c}^{2}$ . Computationally cheaper (no log); similar selection.
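
A minimal NumPy sketch of entropy, information gain, and Gini on a four-example toy dataset. It assumes log base 2 (so entropy is in bits), and the helper names are hypothetical.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())  # bits; only observed classes appear, so p > 0

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def information_gain(labels, feature):
    h_cond = 0.0
    for v in np.unique(feature):
        mask = feature == v
        h_cond += mask.mean() * entropy(labels[mask])  # |S_v| / |S| * H(S_v)
    return entropy(labels) - h_cond

labels = np.array([1, 1, 0, 0])
perfect = np.array(["a", "a", "b", "b"])  # feature that determines the label
useless = np.array(["a", "b", "a", "b"])  # feature independent of the label

ig_perfect = information_gain(labels, perfect)
ig_useless = information_gain(labels, useless)
```

With balanced binary labels $H (S)$ is 1 bit, so the perfect feature gains the full bit and the independent one gains nothing.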

See 26_tree_based_methods/.

14. Common interview gotchas

Question	Common wrong answer	Right answer
Why $\sqrt{d}$ in attention?	Tradition	Variance scaling: keeps $Q K^{⊤}$ entries at unit variance
Cross-entropy + softmax gradient?	Complicated	$p - y$ . Beautifully simple.
Why does EM converge?	Gradient descent	Each E-step gives lower bound; M-step maximizes; likelihood monotone
What does ELBO bound?	Posterior	Log-marginal-likelihood from below
KL forward vs reverse?	Same	Forward mode-covering (MLE); reverse mode-seeking (VI)
SVM dual support vectors?	Random points	Points where $α_{i} > 0$ ; on/violating margin
RoPE relative property?	Magic	$⟨ R_{m} q, R_{n} k ⟩$ depends only on $n - m$

15. Eight derivations to drill cold

2-layer MLP backprop with cross-entropy + softmax.
Scaled dot-product attention with multi-head + masking.
OLS gradient + closed form with PSD Hessian.
Logistic regression gradient showing convexity.
EM for GMM: E-step posterior, M-step updates.
DPO loss from RLHF + Bradley-Terry.
ELBO derivation via Jensen's inequality.
Bias-variance decomposition.

For each: 5 minutes on a whiteboard. Until automatic.

16. Drill plan

1 derivation per day for 8 days. Then cycle.
Time yourself: 5 min per derivation cold; 3 min after a week of practice.
Practice teaching each: explain to an imaginary interviewer.
Pair the derivation with the relevant deep dive's "8 most-asked interview questions" to make sure you can recite both proof and intuition.

17. Further reading

This deep dive is a meta-collection. The full derivations live in:

31_neural_networks for backprop.
04_transformers and 05_attention_mechanisms for attention.
01_classical_ml for OLS and logistic.
19_advanced_clustering for EM.
08_training_techniques for DPO.
21_dimensionality_reduction for ELBO/VAE.
27_advanced_theory for bias-variance.

Drill the derivations in those locations and you'll be ready for the whiteboard rounds.

ML & LLM Interview Prep — Deep Dives