Information Theory — Interview Grill

45 questions on information theory in ML. Drill until you can answer 30+ cold.

A. Foundations

1. Define entropy. $H(p) = -\sum_{x} p(x) \log p(x)$. Average surprise, or the number of bits (or nats) needed to encode an outcome from $p$. Maximum at the uniform distribution; minimum ($= 0$) at a deterministic one.

2. State the bounds on $H(p)$. $0 \leq H(p) \leq \log |\mathcal{X}|$, where $\mathcal{X}$ is the set of outcomes. Lower bound at deterministic distributions; upper bound at uniform.
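A minimal sketch of the definition and its bounds, using only the standard library. The function name `entropy` and the three example distributions are illustrative choices, not part of the original notes.

```python
import math

def entropy(p, base=math.e):
    """Shannon entropy of a discrete distribution given as a list of probabilities."""
    # Terms with p(x) = 0 are skipped, following the convention 0 * log 0 = 0.
    return -sum(px * math.log(px, base) for px in p if px > 0)

uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
deterministic = [1.0, 0.0, 0.0, 0.0]

h_uniform = entropy(uniform, base=2)              # upper bound: log2(4) bits
h_skewed = entropy(skewed, base=2)                # strictly between the bounds
h_deterministic = entropy(deterministic, base=2)  # lower bound: 0 bits
```

Passing `base=2` reports bits; the default natural log reports nats.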

3. Why is $H$ concave? A mixture of two distributions has entropy at least as high as the average of their entropies. Intuitively: mixing adds uncertainty. Formally: Jensen's inequality applied to the concave function $-p \log p$.

4. Define cross-entropy. $H(p, q) = -\sum_{x} p(x) \log q(x)$. Average code length when encoding samples from $p$ using a code optimal for $q$. Equals $H(p) + KL(p ∥ q)$.

5. Why is cross-entropy bounded below by entropy? $H (p, q) = H (p) + KL (p ∥ q) \geq H (p)$ because KL $\geq 0$ . You can't encode samples from $p$ more efficiently than $H (p)$ (Shannon's source coding theorem).

6. Define KL divergence. $KL(p ∥ q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = E_{x \sim p}[\log p(x) - \log q(x)]$. Measures how $q$ differs from $p$ from $p$'s perspective.

7. Key properties of KL. Non-negative ($KL \geq 0$, with equality iff $p = q$). Asymmetric ($KL(p ∥ q) \neq KL(q ∥ p)$ in general). Not a metric (asymmetric, and the triangle inequality fails). Invariant under invertible reparameterization of $x$.
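The identities in questions 4 to 7 can be checked numerically. The sketch below assumes two arbitrary example distributions `p` and `q` with full support (so no term is infinite); the helper names are illustrative.

```python
import math

def entropy(p):
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    # Assumes q(x) > 0 wherever p(x) > 0.
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

def kl(p, q):
    # Assumes q(x) > 0 wherever p(x) > 0; otherwise the divergence is infinite.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]

# H(p, q) - (H(p) + KL(p || q)) should vanish up to floating-point error.
decomposition_gap = cross_entropy(p, q) - (entropy(p) + kl(p, q))

kl_pq = kl(p, q)  # forward direction
kl_qp = kl(q, p)  # reverse direction; generally a different number
kl_pp = kl(p, p)  # zero when both arguments coincide
```

Comparing `kl_pq` with `kl_qp` makes the asymmetry concrete: swapping the arguments changes which distribution weights the log-ratio.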

B. Forward vs reverse KL

8. What's the difference between forward and reverse KL? Forward KL is mean-seeking; reverse is mode-seeking. Forward $KL (p ∥ q)$ penalizes $q$ being small where $p$ is large → $q$ spreads to cover all of $p$ . Reverse $KL (q ∥ p)$ penalizes $q$ being large where $p$ is small → $q$ collapses to one mode. MLE uses forward; variational inference uses reverse.
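A small experiment that illustrates the mean-seeking versus mode-seeking behaviour. It is a sketch under assumptions not in the original notes: the target is a discretized two-bump mixture on a grid, the approximating family is a single discretized Gaussian, and the fit is a brute-force grid search over hypothetical `(mu, sigma)` candidates. Everything is done in log space so that tiny probabilities do not underflow.

```python
import math

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_normalize(log_weights):
    z = log_sum_exp(log_weights)
    return [lw - z for lw in log_weights]

def gaussian_log_weights(grid, mu, sigma):
    # Unnormalized log density of N(mu, sigma^2) evaluated on the grid.
    return [-(x - mu) ** 2 / (2 * sigma ** 2) for x in grid]

def kl_from_logs(log_a, log_b):
    # KL(a || b) for discrete distributions given as log-probabilities.
    return sum(math.exp(la) * (la - lb) for la, lb in zip(log_a, log_b))

grid = [-8 + 0.1 * i for i in range(161)]

# Target p: equal-weight mixture of two well-separated bumps at -3 and +3.
left = log_normalize(gaussian_log_weights(grid, -3.0, 0.5))
right = log_normalize(gaussian_log_weights(grid, 3.0, 0.5))
log_p = [log_sum_exp([l, r]) + math.log(0.5) for l, r in zip(left, right)]

# Candidate single-Gaussian fits: mu in [-4, 4], sigma in [0.3, 3.9].
candidates = [(mu / 2, s / 10) for mu in range(-8, 9) for s in range(3, 41, 2)]

def forward_kl(params):
    log_q = log_normalize(gaussian_log_weights(grid, *params))
    return kl_from_logs(log_p, log_q)  # KL(p || q)

def reverse_kl(params):
    log_q = log_normalize(gaussian_log_weights(grid, *params))
    return kl_from_logs(log_q, log_p)  # KL(q || p)

mu_fwd, sigma_fwd = min(candidates, key=forward_kl)
mu_rev, sigma_rev = min(candidates, key=reverse_kl)
```

The forward-KL fit should land between the bumps with a wide `sigma`, while the reverse-KL fit should sit on one bump with a narrow `sigma`. Which of the two bumps it picks is arbitrary, since the target is symmetric.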

9. Which one does MLE optimize? Forward KL. Minimizing cross-entropy = minimizing $KL (p_{data} ∥ p_{θ})$ . Mean-seeking — the model tries to cover all of the data distribution.

10. Why do MLE-trained models often produce "average" outputs? Forward KL is mean-seeking. If the data has multiple modes (e.g., translations have multiple correct outputs), the model spreads probability across them. Sampling produces an average-looking output that may not match any single mode.

11. When would you use reverse KL? Variational inference (where you want a tractable $q$ to fit the most likely mode of the posterior). Some RL methods. Knowledge distillation in some forms.

12. Why do GANs use Jensen-Shannon? $JS = (1/2) KL(p ∥ M) + (1/2) KL(q ∥ M)$ where $M = (p + q)/2$. Symmetric, bounded in $[0, \log 2]$. With an optimal discriminator, the original GAN objective (Goodfellow et al. 2014) reduces to $2 \cdot JS(p_{data} ∥ p_{g}) - \log 4$. Unlike KL, JS stays finite when the supports of $p$ and $q$ do not overlap.

C. Cross-entropy as ML loss

13. Why is cross-entropy the standard ML loss? Three views: (a) MLE under categorical distribution (likelihood-justified); (b) Forward KL between data and model (mean-seeking); (c) Compression-optimal code length (Shannon).

14. Cross-entropy gradient w.r.t. logits? Predicted minus actual. $\partial L / \partial z = \mathrm{softmax}(z) - \mathrm{onehot}(y) = \hat{p} - y$. Same form as logistic regression: the GLM canonical-link cancellation (the sigmoid/softmax derivative cancels the $1/p$ from the log).
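The "predicted minus actual" gradient can be sanity-checked against central finite differences. A minimal sketch; the logits, the label, and the helper names are made up for illustration.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_loss(z, y):
    # y is the integer index of the true class.
    return -math.log(softmax(z)[y])

def analytic_grad(z, y):
    # softmax(z) minus the one-hot encoding of y.
    return [p - (1.0 if i == y else 0.0) for i, p in enumerate(softmax(z))]

def numeric_grad(z, y, eps=1e-6):
    grads = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grads.append((cross_entropy_loss(up, y) - cross_entropy_loss(down, y)) / (2 * eps))
    return grads

logits = [2.0, -1.0, 0.5]
label = 0
grad = analytic_grad(logits, label)
max_abs_diff = max(abs(a - n) for a, n in zip(grad, numeric_grad(logits, label)))
```

The gradient components sum to zero, because both the softmax output and the one-hot target sum to one.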

15. Why don't we use MSE for classification? Two reasons. (a) MLE under Bernoulli/categorical mandates cross-entropy; MSE corresponds to a different (Gaussian) generative assumption. (b) MSE+sigmoid has vanishing gradients on confidently-wrong predictions and is non-convex.

16. Walk me through MLE = forward KL minimization. One-line story: Maximizing log-likelihood = minimizing KL from data to model. Entropy of the data is fixed, so it drops out.

Algebra: $-E_{p_{data}}[\log p_{θ}] = E_{p_{data}}[\log p_{data} - \log p_{θ}] - E_{p_{data}}[\log p_{data}] = KL(p_{data} ∥ p_{θ}) + H(p_{data})$. Hence $\arg\max_{θ} E_{p_{data}}[\log p_{θ}] = \arg\min_{θ} \{ KL(p_{data} ∥ p_{θ}) + H(p_{data}) \} = \arg\min_{θ} KL(p_{data} ∥ p_{θ})$. The $H$ term doesn't depend on $θ$, so MLE = forward KL minimization.

D. Mutual information

17. Define mutual information. $I (X; Y) = KL (P (X, Y) ∥ P (X) P (Y)) = H (X) + H (Y) - H (X, Y) = H (Y) - H (Y ∣ X)$ . Multiple equivalent forms.

18. What does MI measure? How much knowing $Y$ reduces uncertainty about $X$ . If $X ⊥ Y$ , MI $= 0$ . If $Y$ perfectly determines $X$ , $I (X; Y) = H (X)$ .

19. Properties of MI? Non-negative. Symmetric $I (X; Y) = I (Y; X)$ . $I (X; X) = H (X)$ .
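The equivalent forms of mutual information can be checked on a small joint table. The 2 x 2 joint distribution below is a made-up example, and the helper names are illustrative.

```python
import math

# Hypothetical joint distribution P(X, Y); rows index x, columns index y.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]
flat = [p for row in joint for p in row]
cells = [(i, j) for i in range(2) for j in range(2) if joint[i][j] > 0]

# Form 1: KL between the joint and the product of the marginals.
mi_kl = sum(joint[i][j] * math.log(joint[i][j] / (p_x[i] * p_y[j])) for i, j in cells)

# Form 2: H(X) + H(Y) - H(X, Y).
mi_entropies = entropy(p_x) + entropy(p_y) - entropy(flat)

# Form 3: H(Y) - H(Y | X), with H(Y | X) computed directly from p(y | x).
h_y_given_x = -sum(joint[i][j] * math.log(joint[i][j] / p_x[i]) for i, j in cells)
mi_conditional = entropy(p_y) - h_y_given_x
```

Replacing `joint` with a product of marginals (for example, all four cells equal to 0.25) drives all three quantities to zero.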

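20. What's InfoNCE? $L = -E\left[\log \frac{\exp(f(x, y_{+}))}{\sum_{i} \exp(f(x, y_{i}))}\right]$, where the sum runs over the positive and the negatives. Contrastive loss; with $N$ candidates, $\log N - L$ is a lower bound on $I(X; Y_{+})$, so lowering the loss raises the bound. Used in CLIP, MoCo, SimCLR. Trains representations that score positives high and negatives low.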
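A minimal sketch of the InfoNCE loss over a batch, assuming a precomputed score matrix in which the positive for each row sits on the diagonal (a common batch layout, assumed here rather than taken from the notes). The function name `info_nce` and the two score matrices are hypothetical.

```python
import math

def info_nce(scores):
    """scores[i][j] = f(x_i, y_j); the positive for x_i is y_i (the diagonal)."""
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in row))
        total += -(row[i] - log_denominator)  # -log softmax at the positive
    return total / len(scores)

n = 4
uninformative = [[0.0] * n for _ in range(n)]
aligned = [[10.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

loss_uninformative = info_nce(uninformative)   # the critic cannot tell positives apart
loss_aligned = info_nce(aligned)               # positives clearly outscored the negatives
mi_bound_aligned = math.log(n) - loss_aligned  # bound estimate; can never exceed log(n)
```

Because the bound is capped at `log(n)`, large batches are needed before InfoNCE can certify a large mutual information.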

21. What's the information bottleneck? Tishby et al. 2000. Train representations $Z$ to maximize $I (Y; Z)$ (predictive of label) while minimizing $I (X; Z)$ (compress input). Theoretical framework for learning compressed yet predictive representations.

E. Conditional and joint entropy

22. Define conditional entropy. $H(X ∣ Y) = -\sum_{x, y} p(x, y) \log p(x ∣ y)$. Average uncertainty about $X$ given known $Y$. For discrete variables, always between 0 and $H(X)$.

23. Chain rule for entropy. $H (X, Y) = H (X) + H (Y ∣ X) = H (Y) + H (X ∣ Y)$ . Joint = marginal + conditional. Same as probability chain rule but for entropy.

24. What's $H (Y ∣ X)$ in ML? The irreducible "noise" any model has to contend with — the lower bound on cross-entropy loss when predicting $Y$ from $X$ . If $H (Y ∣ X) = 0$ , the input perfectly determines the output (deterministic mapping). Otherwise, there's a fundamental limit on prediction quality.

F. KL in machine learning

25. Where does KL appear in VAE training? ELBO: $\log p(x) \geq E_{q(z ∣ x)}[\log p(x ∣ z)] - KL(q(z ∣ x) ∥ p(z))$. The KL term penalizes the variational posterior $q$ for being far from the prior $p(z)$.

26. Where does KL appear in RLHF? The objective: $max E [r] - β \cdot KL (π ∥ π_{ref})$ . KL anchor prevents the policy from drifting too far from the SFT reference. Bounds reward hacking.

27. Where does KL appear in distillation? Train student to match teacher's distribution: $min_{student} KL (p_{teacher} ∥ p_{student})$ . Student inherits teacher's full confidence pattern, not just hard predictions.

28. How does the KL-regularized RLHF objective lead to DPO? The objective has a closed-form optimum: $π^{*} = π_{ref} \cdot \exp(r / β) / Z$. Solve for $r$ and substitute into Bradley-Terry; $Z$ cancels because only reward differences matter. The result is the DPO loss. See 08_training_techniques/ALIGNMENT_DEEP_DIVE.md.
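Spelled out: solving the closed form for the reward gives $r = β \log(π^{*} / π_{ref}) + β \log Z$, and Bradley-Terry only uses the difference $r(y_{w}) - r(y_{l})$, so the $β \log Z$ term drops out. The sketch below computes the resulting per-pair loss from sequence log-probabilities. The argument names and the example numbers are hypothetical, and `beta=0.1` is just a placeholder value.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss; w is the preferred response, l the rejected one."""
    # Implicit reward difference; the beta * log Z terms cancel here.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(margin) = softplus(-margin), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: zero margin.
loss_at_reference = dpo_loss(-5.0, -6.0, -5.0, -6.0)

# Policy has shifted probability toward the preferred response.
loss_improved = dpo_loss(-4.0, -7.0, -5.0, -6.0)
```

At the reference policy the margin is zero and the loss equals $\log 2$; moving probability toward the preferred response lowers it.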

29. KL between two Gaussians? For $p = N (μ_{1}, Σ_{1}), q = N (μ_{2}, Σ_{2})$ :

$KL(p ∥ q) = \frac{1}{2} \left[ \log \frac{|Σ_{2}|}{|Σ_{1}|} - d + \mathrm{tr}(Σ_{2}^{-1} Σ_{1}) + (μ_{2} - μ_{1})^{⊤} Σ_{2}^{-1} (μ_{2} - μ_{1}) \right]$

Closed form in the means, the covariances, and the dimension $d$. Famous formula; sometimes asked.
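For $d = 1$ the formula reduces to $\log(σ_{2} / σ_{1}) + (σ_{1}^{2} + (μ_{1} - μ_{2})^{2}) / (2 σ_{2}^{2}) - 1/2$. The sketch below checks that special case against a midpoint-rule integral; the integration range and step count are arbitrary choices, and the function names are illustrative.

```python
import math

def kl_gauss_1d(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)), the d = 1 case of the general formula."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)

def log_pdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def kl_numeric(mu1, sigma1, mu2, sigma2, lo=-20.0, hi=20.0, steps=40000):
    # Midpoint-rule approximation of the integral of p * (log p - log q).
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        lp = log_pdf(x, mu1, sigma1)
        total += math.exp(lp) * (lp - log_pdf(x, mu2, sigma2)) * dx
    return total

closed = kl_gauss_1d(0.0, 1.0, 1.0, 2.0)
numeric = kl_numeric(0.0, 1.0, 1.0, 2.0)

# Against a standard normal this is the per-dimension KL term of a VAE:
# 0.5 * (mu^2 + sigma^2 - 1 - log sigma^2).
vae_term = kl_gauss_1d(0.5, 0.8, 0.0, 1.0)
```

With diagonal covariances the general formula is a sum of such one-dimensional terms, which is why the VAE KL term is usually computed per latent dimension.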

G. Other divergences

30. What's the relationship between KL and total variation? Pinsker's inequality: $TV(p, q) \leq \sqrt{KL(p ∥ q) / 2}$, with KL in nats and $TV(p, q) = \frac{1}{2} \sum_{x} |p(x) - q(x)|$. Bounds TV by KL. Used in concentration bounds and convergence proofs.

31. What's an f-divergence? A family $D_{f}(p ∥ q) = \sum_{x} q(x) f(p(x) / q(x))$ for convex $f$ with $f(1) = 0$. KL: $f(t) = t \log t$. Reverse KL: $f(t) = -\log t$. JS, Hellinger, $χ^{2}$ are also f-divergences.
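A minimal sketch of the family, checking that the two generators quoted above recover forward and reverse KL. It assumes full-support example distributions so every ratio is finite; the Pearson $χ^{2}$ generator $f(t) = (t - 1)^{2}$ is added as a third instance.

```python
import math

def f_divergence(p, q, f):
    # D_f(p || q) = sum_x q(x) * f(p(x) / q(x)); assumes p(x) > 0 and q(x) > 0 everywhere.
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q))

p = [0.5, 0.4, 0.1]
q = [0.2, 0.3, 0.5]

forward_kl = f_divergence(p, q, lambda t: t * math.log(t))  # f(t) = t log t
reverse_kl = f_divergence(p, q, lambda t: -math.log(t))     # f(t) = -log t
chi_squared = f_divergence(p, q, lambda t: (t - 1) ** 2)    # Pearson chi-squared
```

Each generator satisfies $f(1) = 0$, so every divergence in the family vanishes when $p = q$.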

32. What's Wasserstein distance and how is it different? Optimal transport distance: minimum cost to "move" mass to transform $p$ into $q$ , where cost is integrated over the underlying space. Considers geometry of the space (not just distribution mass). Used in WGAN, optimal transport, distribution matching. Stronger smoothness properties than KL.

33. Why might WGAN beat vanilla GAN? Wasserstein gives smoother gradients than JS, especially when $p$ and $q$ have disjoint supports. Vanilla GAN's JS-based objective can saturate; WGAN's continuous Wasserstein landscape doesn't.
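The disjoint-support case can be made concrete with two point masses on a line. In this sketch (a toy setup, not taken from the original notes) one mass stays at position 0 while the other moves; JS is compared with the one-dimensional Wasserstein-1 distance, computed as the area between the CDFs.

```python
import math

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def js(p, q):
    m = [(px + qx) / 2 for px, qx in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein_1d(p, q, spacing=1.0):
    # In one dimension, W1 equals the area between the two CDFs.
    total, cdf_p, cdf_q = 0.0, 0.0, 0.0
    for px, qx in zip(p, q):
        cdf_p += px
        cdf_q += qx
        total += abs(cdf_p - cdf_q) * spacing
    return total

def point_mass(index, size=10):
    return [1.0 if i == index else 0.0 for i in range(size)]

p = point_mass(0)
offsets = (1, 4, 8)
js_values = [js(p, point_mass(k)) for k in offsets]              # stuck at log 2
w1_values = [wasserstein_1d(p, point_mass(k)) for k in offsets]  # grows with the offset
```

JS is pinned at $\log 2$ regardless of how far apart the masses are, so it carries no signal about which direction to move; the Wasserstein distance shrinks as the masses approach each other.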

H. Compression connections

34. State Shannon's source coding theorem. No uniquely decodable lossless code has an average length per symbol below $H(p)$ (in bits, using $\log_{2}$), and block codes can get arbitrarily close to it. You cannot compress below entropy.

35. What does cross-entropy tell us about compression? Cross-entropy $H (p, q)$ is the average code length when using a code optimal for $q$ to encode samples from $p$ . Always $\geq H (p)$ . Minimizing cross-entropy = finding a near-optimal code (compressor) for the data.

36. How does this relate to LLMs? LLMs are compressors of their training distribution. Better LM → lower cross-entropy → better compression. Paired with an arithmetic coder, modern LLMs compress text to fewer bits than general-purpose compressors such as gzip (Delétang et al. 2023).

I. Numerical and gotcha

37. What's the log-sum-exp trick? For numerical stability: $\log \sum_{i} \exp(z_{i}) = m + \log \sum_{i} \exp(z_{i} - m)$ with $m = \max_{i} z_{i}$. Without this, large logits overflow $\exp$. Standard in softmax/cross-entropy implementations.
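A minimal sketch of the trick in pure Python, with logits chosen large enough that the naive formula overflows. The helper names are illustrative.

```python
import math

def log_sum_exp(z):
    m = max(z)
    # Every exponent is <= 0 after the shift, so exp cannot overflow.
    return m + math.log(sum(math.exp(v - m) for v in z))

def log_softmax(z):
    lse = log_sum_exp(z)
    return [v - lse for v in z]

large_logits = [1000.0, 999.0, 998.0]

try:
    naive = math.log(sum(math.exp(v) for v in large_logits))
except OverflowError:
    naive = None  # math.exp(1000.0) exceeds the float range

stable = log_sum_exp(large_logits)
log_probs = log_softmax(large_logits)
```

Computing cross-entropy from `log_softmax` directly, rather than taking the log of a softmax, also avoids `log(0)` when a probability underflows.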

38. Can KL be infinite? Yes. If $p (x) > 0$ but $q (x) = 0$ for some $x$ , then $KL (p ∥ q) = \infty$ . (Encoding samples from $p$ with $q$ 's code is impossible — $q$ assigns 0 probability to outcomes that occur.)

39. Is the entropy of a mixture always greater than the average entropy? Yes (concavity). $H ((p + q) /2) \geq (H (p) + H (q)) /2$ . Mixing increases entropy.

40. Why do KL divergences appear in PAC-Bayes / generalization bounds? KL between learned posterior and prior bounds generalization error. Lower KL (posterior close to prior) = tighter generalization bound. PAC-Bayesian framework underpins much of modern generalization theory.

Quick fire

41. Entropy in bits if log base 2. True. 42. Entropy in nats if log base $e$ . True. 43. KL is a metric? No. 44. Cross-entropy = entropy + KL. True. 45. MLE = forward KL minimization. True.

Self-grading

If you can't answer 1-15, you don't know information theory. If you can't answer 16-30, you'll struggle on RLHF/distillation interviews. If you can't answer 31-45, frontier-lab interviews will go past you.

Aim for 30+/45 cold.
