Statistical Learning Theory — Interview Grill

40 questions on ERM, PAC, VC, Rademacher, bias-variance, double descent. Drill until you can answer 28+ cold.

A. Empirical risk minimization

1. Define population risk. $R (f) = E_{(x, y) \sim D} [ℓ (f (x), y)]$ — expected loss on the true distribution.

2. Define empirical risk. $\hat{R}_{n} (f) = \frac{1}{n} \sum_{i} ℓ (f (x_{i}), y_{i})$ — average loss on training sample.

3. ERM definition? $\hat{f}_{n} = ar g min_{f \in F} \hat{R}_{n} (f)$ .

4. Approximation vs estimation error? Approximation: gap between best in $F$ and true optimum. Estimation: gap between ERM solution and best in $F$ . Bias-variance in formal terms.

5. Why does ERM fail for too-large $F$ ? Many functions interpolate the training data with very different test behavior. Empirical winner overfits.

B. PAC learning

6. State PAC learnability. Algorithm returns $\hat{f}$ such that $P (R (\hat{f}) - R^{*} \leq ϵ) \geq 1 - δ$ , with sample complexity $n (ϵ, δ)$ polynomial.

7. Sample complexity for finite $F$ realizable? $n \geq (lo g ∣ F ∣ + lo g (1/ δ)) / ϵ$ .

8. Sample complexity agnostic case? $n \geq (lo g ∣ F ∣ + lo g (1/ δ)) / ϵ^{2}$ . Slower rate.

9. Why log $∣ F ∣$ and not $∣ F ∣$ ? Union bound over $F$ gives $∣ F ∣$ , log appears via Hoeffding's exponential concentration → log of count.

10. Realizable vs agnostic? Realizable: some $f \in F$ has zero error. Agnostic: best $f$ has positive error.

C. VC dimension

11. Define VC dimension. Size of largest set shattered by $F$ — i.e., for which $F$ realizes every binary labeling.

12. VC of linear classifiers in $R^{d}$ ? $d + 1$ .

13. VC of axis-aligned rectangles in $R^{2}$ ? $4$ .

14. VC of decision stumps? 1D axis-aligned threshold: VC = 2 (a stump can shatter any 2 points but not 3 collinear). Over $d$ binary features (axis-aligned thresholds), VC = $Θ (lo g d)$ .

15. VC bound on generalization gap? $O ((VC + lo g (1/ δ)) / n)$ . Shrinks as $1/ n$ for fixed VC.

16. Why is VC bound vacuous for deep nets? VC dim of a deep net is enormous (exponential in some parameters). Bound says "you might be wildly overfitting" — but empirically you're not.

D. Rademacher complexity

17. Rademacher complexity intuition? How well can $F$ fit random binary labels (Rademacher variables)? Larger = more capacity.

18. Rademacher generalization bound? $R (f) \leq \hat{R}_{n} (f) + 2 R_{n} (F) + O (lo g (1/ δ) / n)$ .

19. Rademacher of linear classifiers with $∥ w ∥ \leq B$ ? $O (B / n)$ . Depends on norm, not dimension!

20. Why is Rademacher tighter than VC? Distribution-aware. VC is worst case over all distributions; Rademacher uses the actual training sample.

21. Margin-based bounds — what's the idea? Replace VC dim with norm-times-margin terms. Tighter for trained networks (Bartlett-Foster-Telgarsky).

E. Bias-variance and double descent

22. State the bias-variance trade-off. More capacity → less approximation error, more estimation error. U-shaped test error. Find sweet spot.

23. What's double descent? Test error has second descent in over-parameterized regime (params ≫ data).

24. Where's the double descent peak? At interpolation threshold (params ≈ data), test error spikes. Past that, decreases.

25. Why does double descent happen? Implicit regularization (SGD finds particular interpolators), margin-based bounds, structure of overparameterized loss landscape.

26. Lottery ticket hypothesis? Dense networks contain sparse subnetworks ("winning tickets") that, retrained from same init, match dense performance. Frankle & Carbin 2018.

27. NTK — what is it? Neural Tangent Kernel. In infinite-width limit, deep networks behave like a kernel method with a specific kernel. Provides theoretical handle on generalization.

F. No-free-lunch and inductive bias

28. State no-free-lunch. Averaged over all possible data distributions, all learning algorithms have the same expected performance.

29. What does NFL imply? ML works because of bias toward useful structure in real data. Without inductive bias, no algorithm is universally better.

30. Examples of inductive bias? Convolutions: locality and translation equivariance. Attention: content-based mixing. RNN: sequential. MLP: smooth. GBDT: hierarchical splits.

31. Why do CNNs work for images? Inductive bias matches structure of natural images: local features, translation invariance, hierarchy.

32. Why don't CNNs work as well for tabular? Tabular features don't have local spatial structure. GBDT inductive bias (axis-aligned splits) matches better.

G. Regularization

33. Regularization as inductive bias — explain. Regularizer adds preference for some functions over others. Equivalent to a prior in the Bayesian sense.

34. $ℓ_{2}$ regularization corresponds to which prior? Gaussian on weights.

35. $ℓ_{1}$ regularization corresponds to which prior? Laplace on weights → sparsity.

36. Why does early stopping regularize? GD started from small weights stays close to them; small effective norm → regularization. Analogous to $ℓ_{2}$ .

37. Data augmentation as regularization? Encodes invariance — model must be robust to specified transformations. Implicit prior.

H. Modern bounds

38. PAC-Bayes idea? Bound generalization gap by $KL (posterior ∥ prior)$ . Trained model = posterior; init = prior. Empirically gives nonvacuous bounds for deep nets.

39. Stability-based generalization? If algorithm output is stable to small training set changes, it generalizes. SGD is approximately stable.

40. Compression-based bounds? If a trained network compresses to few effective parameters, that's the relevant capacity. Lottery-ticket flavor.

41. VC linear classifier in $R^{d}$ ? $d + 1$ . 42. Rademacher of linear with norm $B$ ? $O (B / n)$ . 43. Sample complexity rate, agnostic? $1/ ϵ^{2}$ . 44. NFL — implication? Need inductive bias. 45. Double descent location? At interpolation. 46. Lottery ticket? Sparse subnetwork matching dense performance. 47. NTK regime? Infinite width. 48. PAC stands for? Probably Approximately Correct. 49. VC for over-parameterized nets? Vacuous bounds. 50. Inductive bias of CNN? Translation equivariance + locality.

Self-grading

If you can't answer 1-15, you don't know SLT basics. If you can't answer 16-30, you'll struggle on capacity / generalization questions. If you can't answer 31-45, frontier-lab theory questions will go past you.

Aim for 30+/50 cold.

ML & LLM Interview Prep — Deep Dives