Advanced ML Theory — Deep Dive
Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.
This is the "ML theory you should actually know cold" — bias-variance with proof, cross-validation theory, learning curves, model selection (AIC/BIC), and ROC analysis. Some of this overlaps with the SLT and generalization deep dives but here we focus on the practical decisions these theories inform.
1. Bias-variance — the proof
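For a regression model $\hat f_D$ trained on a random dataset $D$, evaluating at a fixed test point $x$ with $y = f(x) + \varepsilon$, $\mathbb{E}[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$:
$$\mathbb{E}_{D,\varepsilon}\big[(y - \hat f_D(x))^2\big] = \mathrm{Bias}^2 + \mathrm{Var} + \sigma^2$$
where:
- $\mathrm{Bias} = \mathbb{E}_D[\hat f_D(x)] - f(x)$: average error of the model from truth.
- $\mathrm{Var} = \mathbb{E}_D\big[(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)])^2\big]$: variability across training sets.
- $\sigma^2 = \mathrm{Var}(\varepsilon)$: irreducible noise.
Derivation
Let $\bar f(x) = \mathbb{E}_D[\hat f_D(x)]$ (average prediction across training sets).
Expanding (cross-term vanishes by definition of $\bar f$, because the test noise $\varepsilon$ is independent of $D$):
$$\mathbb{E}\big[(y - \hat f_D(x))^2\big] = \mathbb{E}\big[(y - \bar f(x))^2\big] + \mathbb{E}_D\big[(\bar f(x) - \hat f_D(x))^2\big]$$
The first term is bias² + noise:
$$\mathbb{E}\big[(y - \bar f(x))^2\big] = (f(x) - \bar f(x))^2 + \sigma^2$$
The second term is variance.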
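The decomposition can be checked numerically. The sketch below is a Monte Carlo illustration under assumed settings that are not from the notes: a sine target, Gaussian noise with standard deviation 0.3, a cubic polynomial model, and 2000 resampled training sets. It estimates each term at one test point and compares their sum with the directly measured squared error.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed ground-truth function, chosen only for illustration.
    return np.sin(2 * np.pi * x)

sigma = 0.3         # noise standard deviation (assumption)
n_train = 30        # size of each training set
degree = 3          # polynomial degree of the fitted model
x0 = 0.25           # fixed test point
n_datasets = 2000   # number of independent training sets D

# Prediction at x0 from a model trained on each resampled dataset.
preds = np.empty(n_datasets)
for d in range(n_datasets):
    x = rng.uniform(0, 1, n_train)
    y = true_f(x) + rng.normal(0, sigma, n_train)
    preds[d] = np.polyval(np.polyfit(x, y, degree), x0)

# Fresh noisy test targets at x0, independent of the training sets.
y0 = true_f(x0) + rng.normal(0, sigma, n_datasets)

mse = np.mean((y0 - preds) ** 2)               # left-hand side
bias_sq = (preds.mean() - true_f(x0)) ** 2     # squared bias
variance = preds.var()                         # variance across training sets
noise = sigma ** 2                             # irreducible noise

print(f"mse={mse:.4f}  bias^2+var+noise={bias_sq + variance + noise:.4f}")
```

The two printed numbers should agree up to Monte Carlo error. Raising `degree` typically moves error from the bias term to the variance term.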
Implications
- Underfit: high bias (model too simple), low variance.
- Overfit: low bias, high variance.
- Tradeoff: total error minimized at intermediate capacity.
- Modern over-parameterized regime: double descent (see SLT deep dive). The classical U-shaped curve does not describe test error past the interpolation threshold.
2. Cross-validation
k-fold CV
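Split data into $k$ folds. For each fold: train on the other $k - 1$, test on the held-out 1. Average the test errors.
$$\mathrm{CV}_k = \frac{1}{k} \sum_{i=1}^{k} L\big(\hat f^{(-i)}, D_i\big)$$
where $\hat f^{(-i)}$ is the model trained without fold $i$, $D_i$ is fold $i$, and $L$ is the average loss on that fold.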
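As a concrete instance of the procedure, here is a minimal k-fold loop written with NumPy only. The model (ordinary least squares), the loss (mean squared error), and the synthetic data are assumptions made for the example; `k_fold_cv_mse` is a hypothetical helper name.

```python
import numpy as np

def k_fold_cv_mse(X, y, k, seed=0):
    """k-fold CV estimate of test MSE for ordinary least squares."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)   # shuffled, near-equal folds
    fold_errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the k-1 training folds only.
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        resid = y[test_idx] - X[test_idx] @ beta
        fold_errors.append(np.mean(resid ** 2))
    return float(np.mean(fold_errors))              # average over folds

# Synthetic linear data with noise variance 0.25 (assumed for the example).
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(0, 0.5, 200)

cv5 = k_fold_cv_mse(X, y, k=5)
print(f"5-fold CV MSE: {cv5:.3f}")
```

Because the model is correctly specified here, the estimate should land near the noise variance of 0.25, slightly inflated by parameter estimation error.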
Why $k$ matters
- $k = 2$: high bias (each fold trains on only half the data → underestimates large-$n$ performance); low variance (the two training sets are disjoint, so the estimates are nearly independent).
- $k = n$ (LOO): low bias (uses $n - 1$ samples, almost all data) but high variance (training sets differ in only one example → estimates highly correlated).
- $k = 5$ or $k = 10$: standard compromise between the two.
Variants
- Stratified k-fold: preserve class ratios. Default for classification.
- Group k-fold: keep groups (users, patients) entirely on one side.
- Time-series split: sliding or expanding window. Never random for time series.
- Repeated k-fold: run k-fold multiple times with different seeds; average.
- Nested CV: outer for evaluation, inner for hyperparameter tuning. Avoids contamination.
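Two of these variants differ from plain k-fold only in how indices are assigned, which a short sketch makes concrete. Both generators below are hypothetical helpers written for illustration, not a library API: one yields expanding-window splits for ordered data, the other keeps every group on one side of the split. The group version does not shuffle or balance fold sizes.

```python
import numpy as np

def expanding_window_splits(n, n_splits, test_size):
    """Yield (train_idx, test_idx); training data always precedes test data."""
    for s in range(n_splits):
        test_end = n - (n_splits - 1 - s) * test_size
        test_start = test_end - test_size
        if test_start <= 0:
            raise ValueError("not enough samples for the requested splits")
        yield np.arange(0, test_start), np.arange(test_start, test_end)

def group_k_fold_splits(groups, k):
    """Yield (train_idx, test_idx) so that no group appears on both sides."""
    groups = np.asarray(groups)
    unique = np.unique(groups)
    for fold_groups in np.array_split(unique, k):
        test_mask = np.isin(groups, fold_groups)
        yield np.where(~test_mask)[0], np.where(test_mask)[0]

ts_splits = list(expanding_window_splits(n=10, n_splits=3, test_size=2))
grp_splits = list(group_k_fold_splits(["a", "a", "b", "c", "c", "d"], k=2))

for train_idx, test_idx in ts_splits:
    print("time-series train:", train_idx.tolist(), "test:", test_idx.tolist())
for train_idx, test_idx in grp_splits:
    print("group train:", train_idx.tolist(), "test:", test_idx.tolist())
```

In the time-series output every test index is later than every training index, which is the property a random split would violate.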
Common pitfalls
- Hyperparameter tuning + final evaluation on same fold → optimistic bias.
- Preprocessing on full data before splitting → leakage.
- Not stratifying for imbalanced classes → high CV variance.
- Random split for time-series → temporal leakage.
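The preprocessing pitfall is easy to reproduce. The sketch below uses pure-noise features and random labels (an assumed setup, so the honest accuracy is about 50%) and compares two ways of doing feature selection: once on the full data before splitting, and separately inside each training fold. The helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, top = 100, 2000, 20
X = rng.normal(size=(n, p))
y = rng.integers(0, 2, n) * 2 - 1     # labels in {-1, +1}, independent of X

def select_features(X, y, top):
    """Indices of the `top` features with the largest absolute covariance with y."""
    score = np.abs((X - X.mean(0)).T @ (y - y.mean()))
    return np.argsort(score)[-top:]

def cv_accuracy(X, y, k, select_inside_fold):
    folds = np.array_split(np.random.default_rng(1).permutation(len(y)), k)
    leaked = select_features(X, y, top)   # sees every label, test folds included
    correct = 0
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        cols = select_features(X[tr], y[tr], top) if select_inside_fold else leaked
        # Least-squares classifier on the selected columns.
        w, *_ = np.linalg.lstsq(X[tr][:, cols], y[tr], rcond=None)
        correct += np.sum(np.sign(X[te][:, cols] @ w) == y[te])
    return correct / len(y)

leaky_acc = cv_accuracy(X, y, k=5, select_inside_fold=False)
honest_acc = cv_accuracy(X, y, k=5, select_inside_fold=True)
print(f"selection before split: {leaky_acc:.2f}   selection inside folds: {honest_acc:.2f}")
```

There is no signal in this data, so any accuracy well above chance in the first number is produced entirely by leakage.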
LOO-CV closed forms
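For linear regression:
$$\mathrm{CV}_{\mathrm{LOO}} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat y_i}{1 - h_{ii}} \right)^2$$
where $\hat y_i$ is the fitted value from the full-data fit and $h_{ii}$ is the $i$-th diagonal of the hat matrix $H = X (X^\top X)^{-1} X^\top$. Computed without retraining $n$ times.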
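A quick numerical check of the shortcut, on assumed synthetic data: compute the leave-one-out error once from the hat-matrix diagonal of a single least-squares fit, then again by refitting with each row removed, and confirm the two agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ rng.normal(size=p + 1) + rng.normal(0, 0.5, n)

# Shortcut: one fit, then rescale each residual by 1 - h_ii.
H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
resid = y - H @ y                              # full-data residuals
loo_shortcut = np.mean((resid / (1 - np.diag(H))) ** 2)

# Brute force: refit n times, each time leaving one row out.
errs = []
for i in range(n):
    keep = np.arange(n) != i
    beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    errs.append((y[i] - X[i] @ beta) ** 2)
loo_brute = np.mean(errs)

print(f"shortcut={loo_shortcut:.6f}  brute force={loo_brute:.6f}")
```

Forming the full $n \times n$ hat matrix is fine at this size; for large $n$ one would compute only its diagonal.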
3. Learning curves
Plot training error and validation error vs training set size $n$.
What they tell you
High bias (underfitting):
- Train error high.
- Validation error converges to train error from above.
- Gap small.
- More data won't help — model is fundamentally too simple.
High variance (overfitting):
- Train error low.
- Validation error high.
- Big gap.
- More data will help (gap closes as $n$ grows).
Decision-making
- See big gap? → more data, regularize, or simpler model.
- See high training error? → bigger model, better features, less regularization.
Practical use
Always plot learning curves before deciding "we need more data" vs "we need a better model." Often answers it definitively.
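The two regimes can be generated on purpose. The sketch below assumes a sine target on $[-1, 1]$ with Gaussian noise and fits polynomials of degree 1 (too simple) and degree 9 (flexible) at several training sizes, averaging over repeated draws. It prints the numbers rather than plotting, to stay dependency-free.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Assumed data-generating process for the illustration.
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.2, n)
    return x, y

x_val, y_val = make_data(2000)   # large fixed validation set

def learning_curve(degree, sizes, repeats=50):
    """Mean (train MSE, validation MSE) of a polynomial fit at each training size."""
    rows = []
    for n in sizes:
        tr, va = [], []
        for _ in range(repeats):
            x, y = make_data(n)
            c = np.polyfit(x, y, degree)
            tr.append(np.mean((y - np.polyval(c, x)) ** 2))
            va.append(np.mean((y_val - np.polyval(c, x_val)) ** 2))
        rows.append((n, float(np.mean(tr)), float(np.mean(va))))
    return rows

sizes = [15, 30, 100, 300]
high_bias = learning_curve(degree=1, sizes=sizes)   # underfits sin(3x)
high_var = learning_curve(degree=9, sizes=sizes)    # overfits at small n

for (n, tr_b, va_b), (_, tr_v, va_v) in zip(high_bias, high_var):
    print(f"n={n:4d}  deg1 train={tr_b:.3f} val={va_b:.3f}   "
          f"deg9 train={tr_v:.3f} val={va_v:.3f}")
```

For degree 1, both errors should plateau well above the noise floor of 0.04 with a small gap. For degree 9, the gap should be large at small $n$ and shrink as $n$ grows.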
4. Validation curves
Plot training error and validation error vs a hyperparameter (e.g., model capacity, regularization strength).
Reveals the bias-variance trade-off across hyperparameter values.
Sweet spot: minimum of validation error. Train error keeps improving past this; validation error rises again — overfitting.
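A minimal validation-curve sketch, assuming ridge regression on synthetic data with the regularization strength `lam` as the swept hyperparameter. Note the direction: capacity increases as `lam` decreases, so training error improves toward small `lam` while validation error is lowest somewhere in between.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_val, p = 40, 2000, 30
beta_true = rng.normal(size=p)
X_tr = rng.normal(size=(n_train, p))
X_va = rng.normal(size=(n_val, p))
y_tr = X_tr @ beta_true + rng.normal(0, 3.0, n_train)
y_va = X_va @ beta_true + rng.normal(0, 3.0, n_val)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lams = np.logspace(-3, 3, 13)
train_mse, val_mse = [], []
for lam in lams:
    b = ridge_fit(X_tr, y_tr, lam)
    train_mse.append(np.mean((y_tr - X_tr @ b) ** 2))
    val_mse.append(np.mean((y_va - X_va @ b) ** 2))

best_lam = lams[int(np.argmin(val_mse))]   # the sweet spot
for lam, tr, va in zip(lams, train_mse, val_mse):
    print(f"lam={lam:9.3f}  train={tr:7.3f}  val={va:7.3f}")
print("best lam:", best_lam)
```

With few samples relative to features and noisy targets, as assumed here, both extremes of `lam` should validate worse than the middle of the grid.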
5. Information criteria for model selection
When you can compute model likelihood, criteria let you compare models without held-out data.
AIC (Akaike Information Criterion)
$$\mathrm{AIC} = 2k - 2 \ln \hat L$$
where $k$ = number of parameters, $\hat L$ = max likelihood. Lower is better.
Derivation: estimates, up to a constant shared by all candidate models, the expected KL divergence between the fitted model and the true distribution. The $2k$ penalty corrects the optimism from using the data twice (training + evaluation).
BIC (Bayesian Information Criterion)
$$\mathrm{BIC} = k \ln n - 2 \ln \hat L$$
with $n$ = number of observations. Lower is better.
Derivation: large-sample approximation of $-2$ times the log marginal likelihood (Bayesian model evidence). Penalty grows with $\ln n$.
AIC vs BIC
- BIC penalty grows with $n$ → BIC selects simpler models for large $n$.
- AIC: optimal for prediction; doesn't assume true model in candidate set.
- BIC: consistent for true model selection if true model is in candidate set.
- BIC > AIC penalty for $n > e^2 \approx 7.4$, i.e. $n \ge 8$.
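To make the comparison concrete, the sketch below scores polynomial fits of increasing degree on assumed synthetic data whose true degree is 2. It uses the Gaussian maximum log-likelihood $\ln \hat L = -\tfrac{n}{2}\,(\ln(2\pi\hat\sigma^2) + 1)$ with $\hat\sigma^2 = \mathrm{RSS}/n$, and counts the noise variance as one of the $k$ parameters. `gaussian_aic_bic` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, n)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.5, n)   # true degree is 2

def gaussian_aic_bic(x, y, degree):
    """AIC and BIC of a polynomial fit under an i.i.d. Gaussian noise model."""
    n = len(y)
    resid = y - np.polyval(np.polyfit(x, y, degree), x)
    sigma2_hat = np.mean(resid ** 2)                  # MLE of the noise variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)
    k = degree + 2                                    # coefficients + noise variance
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return aic, bic

scores = {d: gaussian_aic_bic(x, y, d) for d in range(0, 7)}
best_aic = min(scores, key=lambda d: scores[d][0])
best_bic = min(scores, key=lambda d: scores[d][1])

for d, (aic, bic) in scores.items():
    print(f"degree={d}  AIC={aic:8.2f}  BIC={bic:8.2f}")
print("AIC picks degree", best_aic, "| BIC picks degree", best_bic)
```

Whenever $\ln n > 2$, the degree chosen by BIC can never exceed the degree chosen by AIC, which the output should reflect.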
Limitations
- Both require evaluating likelihood — only meaningful when likelihood is well-defined.
- Don't directly apply to regularized models (effective $k$ unclear).
- Assume model is correctly specified.
6. ROC and PR curves
ROC curve
Plot True Positive Rate (TPR) vs False Positive Rate (FPR) as threshold varies.
- TPR = TP / (TP + FN) — sensitivity / recall.
- FPR = FP / (FP + TN) — fall-out.
- Top-left corner = perfect classifier.
- Diagonal = random classifier.
AUROC = area under ROC. Probability that a random positive ranks above a random negative.
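The ranking interpretation can be verified directly. The sketch below, on assumed synthetic scores with no ties, computes AUROC two ways: as the area under the ROC curve swept over thresholds, and as the fraction of (positive, negative) pairs in which the positive has the higher score.

```python
import numpy as np

def roc_points(y_true, scores):
    """FPR and TPR at every threshold, sweeping from the highest score down."""
    order = np.argsort(-scores, kind="stable")
    y = y_true[order]
    tps = np.cumsum(y)          # true positives accepted so far
    fps = np.cumsum(1 - y)      # false positives accepted so far
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

def auroc_pairwise(y_true, scores):
    """P(score of random positive > score of random negative); ties count half."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean(diff > 0) + 0.5 * np.mean(diff == 0))

# Assumed data: 30% positives, positive scores shifted up by one standard deviation.
rng = np.random.default_rng(0)
y_true = (rng.uniform(size=500) < 0.3).astype(int)
scores = rng.normal(loc=1.0 * y_true, scale=1.0)

fpr, tpr = roc_points(y_true, scores)
auc_area = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))   # trapezoid rule
auc_pairs = auroc_pairwise(y_true, scores)
print(f"area under ROC={auc_area:.4f}  pairwise ranking={auc_pairs:.4f}")
```

With tied scores the threshold sweep would need to group equal scores into a single step; this sketch assumes continuous scores where ties do not occur.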
PR curve
Plot Precision vs Recall as threshold varies.
- Better for imbalanced data (where most negatives are easy).
- AUPRC: more informative than AUROC for severe imbalance.
Choosing operating point
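- Cost-aware: $t^* = \arg\min_t \big[ c_{FP} \cdot \mathrm{FP}(t) + c_{FN} \cdot \mathrm{FN}(t) \big]$.
- Recall constraint: pick $t$ such that recall ≥ X.
- F-score optimization: $t^* = \arg\max_t F_\beta(t)$.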
F-beta score
$$F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 P + R}$$
$\beta = 1$: F1. $\beta > 1$: weight recall more (e.g., disease screening). $\beta < 1$: weight precision more (e.g., spam).
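A short sketch of threshold selection on assumed synthetic scores (10% positives). It picks the threshold that maximizes $F_\beta = (1+\beta^2)\,PR / (\beta^2 P + R)$ for three values of $\beta$, and separately the threshold that minimizes a weighted count of false positives and false negatives. The costs `c_fp` and `c_fn` are hypothetical values chosen for illustration.

```python
import numpy as np

def confusion_at(y_true, scores, t):
    """TP, FP, FN when predicting positive for scores >= t."""
    pred = scores >= t
    tp = int(np.sum(pred & (y_true == 1)))
    fp = int(np.sum(pred & (y_true == 0)))
    fn = int(np.sum(~pred & (y_true == 1)))
    return tp, fp, fn

def f_beta(tp, fp, fn, beta):
    """F_beta from counts; defined as 0 when there are no true positives."""
    if tp == 0:
        return 0.0
    p, r = tp / (tp + fp), tp / (tp + fn)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

rng = np.random.default_rng(0)
y_true = (rng.uniform(size=2000) < 0.1).astype(int)
scores = rng.normal(loc=1.5 * y_true, scale=1.0)
thresholds = np.linspace(-2, 4, 121)

def best_threshold(beta):
    vals = [f_beta(*confusion_at(y_true, scores, t), beta) for t in thresholds]
    return float(thresholds[int(np.argmax(vals))])

t_f05, t_f1, t_f2 = best_threshold(0.5), best_threshold(1.0), best_threshold(2.0)

# Cost-aware choice with hypothetical costs: a miss is 10x worse than a false alarm.
c_fp, c_fn = 1.0, 10.0
costs = []
for t in thresholds:
    _, fp, fn = confusion_at(y_true, scores, t)
    costs.append(c_fp * fp + c_fn * fn)
t_cost = float(thresholds[int(np.argmin(costs))])

print(f"t*(F0.5)={t_f05:.2f}  t*(F1)={t_f1:.2f}  t*(F2)={t_f2:.2f}  t*(cost)={t_cost:.2f}")
```

Larger $\beta$ rewards recall, so the chosen threshold should move down (or stay put) as $\beta$ increases.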
7. Confusion matrix and derived metrics
| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |
- Accuracy: $(TP + TN) / (TP + TN + FP + FN)$.
- Precision: $TP / (TP + FP)$, the fraction of positive predictions that were right.
- Recall (sensitivity, TPR): $TP / (TP + FN)$, the fraction of actual positives that were found.
- Specificity (TNR): $TN / (TN + FP)$.
- F1: harmonic mean of P and R.
- MCC (Matthews Correlation Coefficient): balanced metric for imbalanced data; uses all four cells of the confusion matrix.
Why F1 not arithmetic mean?
Harmonic mean penalizes imbalance more: it is dominated by the smaller of P and R. With P = 1.0 and R = 0.1 the arithmetic mean is 0.55 but F1 ≈ 0.18. F1 = 0 if either is 0.
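A small worked example with hypothetical counts (10 positives among 1000 samples, one of them found, no false alarms) shows how the metrics disagree under imbalance. `metrics` is an illustrative helper and assumes no denominator is zero.

```python
import math

def metrics(tp, fp, fn, tn):
    """Derived metrics from confusion-matrix counts (assumes nonzero denominators)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
        "arithmetic_mean": (precision + recall) / 2,
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Hypothetical imbalanced case: 10 positives, 990 negatives.
m = metrics(tp=1, fp=0, fn=9, tn=990)
for name, value in m.items():
    print(f"{name:16s} {value:.4f}")
```

Accuracy comes out at 0.991 even though 9 of 10 positives are missed. The arithmetic mean of P and R is 0.55, while F1 (about 0.18) and MCC (about 0.31) both register the poor recall.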
8. Common interview gotchas
| Question | Common wrong answer | Right answer |
|---|---|---|
| Bias-variance — what's the third term? | "Just bias and variance" | Irreducible noise |
| Why is LOO-CV high variance? | Lots of data | Training sets are highly correlated → predictions are correlated → empirical mean has high variance |
| Why does k=10 work well? | Tradition | Empirical compromise: most data used, manageable variance |
| AIC vs BIC — same purpose? | Yes | AIC for prediction, BIC for model selection (true model in candidates) |
| AUROC vs AUPRC for imbalance? | Same | AUPRC much more informative; AUROC dominated by easy negatives |
| Time-series with k-fold? | Sure | Never — temporal leakage |
| F1 = arithmetic mean of P and R? | Yes | Harmonic mean — penalizes imbalance |
9. Eight most-asked interview questions
- Derive the bias-variance decomposition. (Add and subtract $\bar f(x) = \mathbb{E}_D[\hat f_D(x)]$; expand; cross-term zero.)
- What's the main purpose of cross-validation? (Estimate generalization without leaking test data.)
- What does a learning curve tell you? (High bias vs high variance via train-val gap; informs "more data" vs "better model".)
- AIC vs BIC? (Both penalize complexity; BIC penalty grows; AIC for prediction, BIC for true-model identification.)
- What's wrong with AUROC for severe imbalance? (Negatives dominate; many easy negatives keep FPR low and lift AUROC; AUPRC focuses on positives.)
- F1 vs accuracy? (Accuracy misleading for imbalance; F1 is harmonic mean of P and R.)
- Why use stratified k-fold? (Preserve class ratios; reduces CV variance.)
- What's nested CV? (Outer for evaluation; inner for hyperparameter tuning. Prevents tuning bias in outer estimate.)
10. Drill plan
- Derive bias-variance decomposition on paper.
- For each CV variant (k-fold, stratified, group, time-series, nested), recite when used.
- Recite AIC and BIC formulas + when each.
- Sketch ROC and PR curves for: random, perfect, threshold-based binary classifier.
- For each F-score variant ($\beta = 1$, $\beta > 1$, $\beta < 1$), recite when used.
- Plot a learning curve for "high bias" vs "high variance" — describe to interviewer.
11. Further reading
- Hastie, Tibshirani, Friedman, The Elements of Statistical Learning — chapters 7 (model assessment), 8 (model inference).
- Bishop, Pattern Recognition and Machine Learning — chapter 3, section 3.2 (bias-variance decomposition).
- Kohavi (1995), A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.
- Saito & Rehmsmeier (2015), The Precision-Recall Plot is More Informative than the ROC Plot...
- Burnham & Anderson, Model Selection and Multi-Model Inference — AIC/BIC reference.