Advanced ML Theory — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

This is the "ML theory you should actually know cold" — bias-variance with proof, cross-validation theory, learning curves, model selection (AIC/BIC), and ROC analysis. Some of this overlaps with the SLT and generalization deep dives but here we focus on the practical decisions these theories inform.


1. Bias-variance — the proof

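For a regression model $\hat f_D$ trained on a random dataset $D$, evaluated at a fixed test point $x$ with label $y = f(x) + \varepsilon$, where $\mathbb{E}[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\varepsilon$ is independent of $D$:

$$\mathbb{E}_{D,\varepsilon}\big[(y - \hat f_D(x))^2\big] = \mathrm{Bias}^2 + \mathrm{Variance} + \sigma^2$$

where:

  • $\mathrm{Bias} = \mathbb{E}_D[\hat f_D(x)] - f(x)$: average error of the model from truth.
  • $\mathrm{Variance} = \mathbb{E}_D\big[(\hat f_D(x) - \mathbb{E}_D[\hat f_D(x)])^2\big]$: variability across training sets.
  • $\sigma^2$: irreducible noise.

Derivation

Let $\bar f(x) = \mathbb{E}_D[\hat f_D(x)]$ (average prediction across training sets).

Add and subtract $\bar f(x)$ inside the square and expand (the cross-term vanishes because $\mathbb{E}_D[\bar f(x) - \hat f_D(x)] = 0$ by definition of $\bar f$, and $\varepsilon$ is independent of $D$):

$$\mathbb{E}_{D,\varepsilon}\big[(y - \hat f_D(x))^2\big] = \mathbb{E}_{\varepsilon}\big[(y - \bar f(x))^2\big] + \mathbb{E}_D\big[(\bar f(x) - \hat f_D(x))^2\big]$$

The first term is bias² + noise:

$$\mathbb{E}_{\varepsilon}\big[(y - \bar f(x))^2\big] = (f(x) - \bar f(x))^2 + \sigma^2$$

The second term is variance.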
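The decomposition can be checked numerically. The sketch below is a Monte Carlo illustration under assumed settings that are not from the text (a sine target, a cubic polynomial fit, Gaussian noise): it draws many training sets, records the prediction at one test point, and compares bias² + variance + noise with the directly measured expected squared error.

```python
import numpy as np

def bias_variance_at_point(x0=0.5, n_train=30, degree=3, sigma=0.3,
                           n_datasets=2000, seed=0):
    """Monte Carlo estimate of bias^2, variance, noise, and total error at x0."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(2 * np.pi * x)           # assumed "true" function
    preds = np.empty(n_datasets)
    sq_errs = np.empty(n_datasets)
    for d in range(n_datasets):
        x = rng.uniform(0.0, 1.0, n_train)         # fresh training set D
        y = f(x) + rng.normal(0.0, sigma, n_train)
        coefs = np.polyfit(x, y, degree)           # least-squares polynomial fit
        preds[d] = np.polyval(coefs, x0)
        y0 = f(x0) + rng.normal(0.0, sigma)        # fresh noisy label at x0
        sq_errs[d] = (y0 - preds[d]) ** 2
    bias_sq = (f(x0) - preds.mean()) ** 2          # (f - average prediction)^2
    variance = preds.var()                         # spread across training sets
    return bias_sq, variance, sigma ** 2, sq_errs.mean()

bias_sq, variance, noise, total = bias_variance_at_point()
print(f"bias^2 + var + noise = {bias_sq + variance + noise:.4f}, "
      f"measured error = {total:.4f}")
```

The two printed numbers should agree up to Monte Carlo error. Raising `degree` should shift error from the bias term to the variance term.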

Implications

  • Underfit: high bias (model too simple), low variance.
  • Overfit: low bias, high variance.
  • Tradeoff: total error minimized at intermediate capacity.
  • Modern over-parameterized regime: double descent (see SLT deep dive). The classical U-shaped curve does not describe test error past the interpolation threshold.

2. Cross-validation

k-fold CV

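Split data into $k$ folds. For each fold: train on the other $k-1$ folds, test on the held-out one. Average the test errors.

$$\mathrm{CV}_k = \frac{1}{k} \sum_{j=1}^{k} \frac{1}{|F_j|} \sum_{i \in F_j} L\big(y_i, \hat f^{(-j)}(x_i)\big)$$

where $\hat f^{(-j)}$ is the model trained without fold $j$, $F_j$ is fold $j$, and $L$ is the loss.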
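A minimal NumPy sketch of the procedure, assuming ordinary least squares as the model and squared error as the loss (both are illustrative choices, not something the text prescribes):

```python
import numpy as np

def k_fold_cv_mse(X, y, k=5, seed=0):
    """k-fold CV estimate of MSE for ordinary least squares."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)    # k disjoint index sets
    fold_errors = []
    for j in range(k):
        test_idx = folds[j]
        train_idx = np.concatenate([folds[m] for m in range(k) if m != j])
        # Fit on the other k-1 folds only: the model trained without fold j.
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        residuals = y[test_idx] - X[test_idx] @ beta
        fold_errors.append(np.mean(residuals ** 2))
    return float(np.mean(fold_errors))

# Synthetic data (assumed): linear signal plus noise with variance 0.25.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)
cv_mse = k_fold_cv_mse(X, y, k=5)
print(f"5-fold CV MSE: {cv_mse:.3f}")
```

Since the model here is correctly specified, the estimate should land a little above the noise variance of 0.25.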

Why $k$ matters

  • $k = 2$: high bias (each model trains on only half the data → underestimates large-$n$ performance); low variance (the two training sets are disjoint, so the fold estimates are nearly independent).
  • $k = n$ (LOO): low bias (uses $n-1$ samples, almost all data) but high variance (training sets differ in only one example → estimates highly correlated).
  • $k = 5$ or $k = 10$: standard compromise between the two.

Variants

  • Stratified k-fold: preserve class ratios. Default for classification.
  • Group k-fold: keep groups (users, patients) entirely on one side.
  • Time-series split: sliding or expanding window. Never random for time series.
  • Repeated k-fold: run k-fold multiple times with different seeds; average.
  • Nested CV: outer for evaluation, inner for hyperparameter tuning. Avoids contamination.
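Two of the variants above are easy to get wrong by hand, so here is a pure-Python sketch of how their index splits could be generated. The function names and the round-robin group assignment are illustrative assumptions, not a library API.

```python
def expanding_window_splits(n_samples, n_splits=4):
    """Yield (train_idx, test_idx); every test index is later than every train index."""
    fold_size = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_end = i * fold_size
        test_end = min(train_end + fold_size, n_samples)
        yield list(range(train_end)), list(range(train_end, test_end))

def group_k_fold_splits(groups, k=3):
    """Yield (train_idx, test_idx) so that no group appears on both sides."""
    unique_groups = sorted(set(groups))
    for j in range(k):
        test_groups = set(unique_groups[j::k])       # round-robin group assignment
        test_idx = [i for i, g in enumerate(groups) if g in test_groups]
        train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train_idx, test_idx

time_splits = list(expanding_window_splits(10, n_splits=4))
patients = ["a", "a", "b", "b", "c", "c", "d", "e", "f"]   # hypothetical group labels
group_splits = list(group_k_fold_splits(patients, k=3))
```

In the time-series splits the training window grows while the test block always lies strictly in the future. In the group splits all rows from one patient stay together.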

Common pitfalls

  • Hyperparameter tuning and final evaluation on the same folds → optimistically biased estimate.
  • Preprocessing on full data before splitting → leakage.
  • Not stratifying for imbalanced classes → high CV variance.
  • Random split for time-series → temporal leakage.
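The preprocessing pitfall is the easiest one to show in code. The sketch below assumes standardization as the preprocessing step and fits its statistics on the training fold only, then reuses them for the held-out fold. The helper name is hypothetical.

```python
import numpy as np

def standardize_train_test(X_train, X_test):
    """Standardize both splits with statistics computed on the training split only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)      # avoid division by zero on constant columns
    return (X_train - mean) / std, (X_test - mean) / std

X_train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
X_test = np.array([[7.0, 10.0]])
train_z, test_z = standardize_train_test(X_train, X_test)
```

Calling a helper like this inside each CV fold, rather than once on the full dataset, keeps held-out rows from influencing the training-time statistics.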

LOO-CV closed forms

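For linear regression:

$$\mathrm{CV}_{\mathrm{LOO}} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{y_i - \hat y_i}{1 - h_{ii}} \right)^2$$

where $\hat y_i$ is the full-data fitted value and $h_{ii}$ is the $i$-th diagonal of the hat matrix $H = X (X^\top X)^{-1} X^\top$. Computed from a single fit, without retraining $n$ times.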
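A quick way to convince yourself of the shortcut is to compare it with brute-force leave-one-out. The sketch assumes ordinary least squares with a full-column-rank design matrix. The data is synthetic.

```python
import numpy as np

def loo_mse_closed_form(X, y):
    """LOO-CV MSE for OLS from a single fit, using hat-matrix leverages."""
    H = X @ np.linalg.pinv(X)                # equals X (X^T X)^{-1} X^T for full-rank X
    residuals = y - H @ y
    leverages = np.diag(H)
    return float(np.mean((residuals / (1.0 - leverages)) ** 2))

def loo_mse_brute_force(X, y):
    """LOO-CV MSE for OLS by refitting n times (reference implementation)."""
    n = len(y)
    errs = []
    for i in range(n):
        mask = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs.append((y[i] - X[i] @ beta) ** 2)
    return float(np.mean(errs))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + rng.normal(scale=0.3, size=50)
closed = loo_mse_closed_form(X, y)
brute = loo_mse_brute_force(X, y)
```

The two values should match to floating-point precision, while the closed form needs only one fit.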


3. Learning curves

Plot training error and validation error vs training set size $n$.

What they tell you

High bias (underfitting):

  • Train error high.
  • Validation error converges to train error from above.
  • Gap small.
  • More data won't help — model is fundamentally too simple.

High variance (overfitting):

  • Train error low.
  • Validation error high.
  • Big gap.
  • More data will help (gap closes as $n$ grows).

Decision-making

  • See big gap? → more data, regularize, or simpler model.
  • See high training error? → bigger model, better features, less regularization.

Practical use

Always plot learning curves before deciding "we need more data" vs "we need a better model." The shape of the curves often settles the question.
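The numbers behind such a plot can be produced with a few lines of NumPy. This sketch assumes a degree-8 polynomial model (Legendre basis) on a synthetic sine target. The sizes and noise level are arbitrary illustrative choices.

```python
import numpy as np
from numpy.polynomial.legendre import legvander

def learning_curve(x_train, y_train, x_val, y_val, sizes, degree=8):
    """Train/validation MSE of a polynomial least-squares fit at each training size."""
    train_errs, val_errs = [], []
    Phi_val = legvander(x_val, degree)
    for n in sizes:
        Phi = legvander(x_train[:n], degree)           # first n training points
        beta, *_ = np.linalg.lstsq(Phi, y_train[:n], rcond=None)
        train_errs.append(np.mean((y_train[:n] - Phi @ beta) ** 2))
        val_errs.append(np.mean((y_val - Phi_val @ beta) ** 2))
    return np.array(train_errs), np.array(val_errs)

rng = np.random.default_rng(0)
target = lambda x: np.sin(3 * x)                       # assumed ground truth
x_tr = rng.uniform(-1, 1, 400)
y_tr = target(x_tr) + rng.normal(0, 0.3, 400)
x_va = rng.uniform(-1, 1, 1000)
y_va = target(x_va) + rng.normal(0, 0.3, 1000)
sizes = [15, 30, 60, 120, 400]
train_errs, val_errs = learning_curve(x_tr, y_tr, x_va, y_va, sizes)
gaps = val_errs - train_errs
for n, tr, va in zip(sizes, train_errs, val_errs):
    print(f"n={n:4d}  train={tr:.3f}  val={va:.3f}")
```

With a flexible model like this one, the small-$n$ rows should show the high-variance pattern (low train error, large gap) and the gap should shrink as $n$ grows.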


4. Validation curves

Plot training error and validation error vs a hyperparameter (e.g., model capacity, regularization strength).

Reveals the bias-variance trade-off across hyperparameter values.

Sweet spot: minimum of validation error. Train error keeps improving past this; validation error rises again — overfitting.
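As a sketch of how a validation curve could be computed, the example below sweeps the ridge penalty $\lambda$ for a degree-12 polynomial model. The closed-form ridge fit, the Legendre basis, and the synthetic data are all assumptions made for illustration. Note that larger $\lambda$ means lower capacity, so here training error rises with the hyperparameter.

```python
import numpy as np
from numpy.polynomial.legendre import legvander

def ridge_fit(Phi, y, lam):
    """Closed-form ridge solution (penalizes every coefficient, intercept included)."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

def validation_curve(Phi_tr, y_tr, Phi_va, y_va, lambdas):
    """Train/validation MSE for each regularization strength."""
    train_errs, val_errs = [], []
    for lam in lambdas:
        beta = ridge_fit(Phi_tr, y_tr, lam)
        train_errs.append(np.mean((y_tr - Phi_tr @ beta) ** 2))
        val_errs.append(np.mean((y_va - Phi_va @ beta) ** 2))
    return np.array(train_errs), np.array(val_errs)

rng = np.random.default_rng(0)
target = lambda x: np.sin(3 * x)                       # assumed ground truth
x_tr = rng.uniform(-1, 1, 40)
y_tr = target(x_tr) + rng.normal(0, 0.3, 40)
x_va = rng.uniform(-1, 1, 500)
y_va = target(x_va) + rng.normal(0, 0.3, 500)
lambdas = np.logspace(-4, 2, 13)
train_errs, val_errs = validation_curve(
    legvander(x_tr, 12), y_tr, legvander(x_va, 12), y_va, lambdas)
best_lambda = lambdas[np.argmin(val_errs)]
print(f"lambda with lowest validation error: {best_lambda:.4g}")
```

Training error is lowest at the smallest $\lambda$ and never decreases as $\lambda$ grows. The value worth keeping is the one that minimizes validation error.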


5. Information criteria for model selection

When you can compute model likelihood, criteria let you compare models without held-out data.

AIC (Akaike Information Criterion)

$$\mathrm{AIC} = 2k - 2 \ln \hat L$$

where $k$ = number of parameters, $\hat L$ = maximized likelihood. Lower is better.

Derivation: up to a constant shared by all candidate models, AIC estimates the expected KL divergence between the fitted model and the true distribution. The penalty corrects the optimism that comes from using the same data twice (fitting + evaluation).

BIC (Bayesian Information Criterion)

$$\mathrm{BIC} = k \ln n - 2 \ln \hat L$$

with $n$ = number of observations. Lower is better.

Derivation: large-sample (Laplace) approximation of $-2$ times the log marginal likelihood (Bayesian model evidence). Penalty grows with $n$.

AIC vs BIC

  • BIC penalty grows with $n$ → BIC selects simpler models for large $n$.
  • AIC: optimal for prediction; doesn't assume true model in candidate set.
  • BIC: consistent for true model selection if true model is in candidate set.
  • BIC > AIC penalty once $\ln n > 2$, i.e. for $n \geq 8$.
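A small sketch of both criteria for Gaussian linear regression, where the maximized log-likelihood has a closed form in the residual sum of squares. The two candidate models and their RSS values are made-up numbers for illustration, and $k$ is assumed to count the noise variance as a parameter.

```python
import math

def gaussian_max_log_likelihood(rss, n):
    """Maximized log-likelihood of a Gaussian regression with MLE variance rss / n."""
    sigma2 = rss / n
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    return k * math.log(n) - 2 * log_likelihood

n = 100
# Hypothetical candidates: (residual sum of squares, parameter count).
small = {"rss": 50.0, "k": 3}
large = {"rss": 48.0, "k": 6}
scores = {}
for name, m in (("small", small), ("large", large)):
    ll = gaussian_max_log_likelihood(m["rss"], n)
    scores[name] = (aic(ll, m["k"]), bic(ll, m["k"], n))
    print(f"{name}: AIC={scores[name][0]:.2f}  BIC={scores[name][1]:.2f}")
```

In this made-up case the larger model fits slightly better but not by enough to pay for three extra parameters under either criterion, and BIC separates the two models more strongly than AIC does.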

Limitations

  • Both require evaluating likelihood — only meaningful when likelihood is well-defined.
  • Don't directly apply to regularized models (effective $k$ unclear).
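  • Both derivations are asymptotic and rely on regularity conditions; the $2k$ penalty in AIC is exact only when the candidate model is close to correctly specified.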

6. ROC and PR curves

ROC curve

Plot True Positive Rate (TPR) vs False Positive Rate (FPR) as threshold varies.

  • TPR = TP / (TP + FN) — sensitivity / recall.
  • FPR = FP / (FP + TN) — fall-out.
  • Top-left corner = perfect classifier.
  • Diagonal = random classifier.

AUROC = area under ROC. Probability that a random positive ranks above a random negative.
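The probabilistic reading of AUROC translates directly into code. This is a deliberately naive pairwise sketch (quadratic in the number of examples, ties counted as half a win), meant to illustrate the definition and not to be an efficient implementation.

```python
def auroc(scores, labels):
    """P(random positive scores above random negative), ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Toy example: one positive (0.7) is ranked below one negative (0.8).
example = auroc([0.9, 0.8, 0.7, 0.3], [1, 0, 1, 0])
print(example)
```

Of the four positive-negative pairs in the toy example, three are ordered correctly, so the value is 0.75.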

PR curve

Plot Precision vs Recall as threshold varies.

  • Better for imbalanced data, where most negatives are easy and precision exposes the false positives that FPR hides.
  • AUPRC: more informative than AUROC for severe imbalance.
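One common estimator of AUPRC is average precision: the mean of precision measured at each rank where a positive appears. The sketch below assumes untied scores, and the imbalanced ranking is a hypothetical example constructed for illustration.

```python
def average_precision(scores, labels):
    """Mean of precision@rank over ranks holding a positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)

# Hypothetical imbalanced ranking: 100 items, 2 positives at ranks 3 and 6.
labels = [0] * 100
labels[2] = labels[5] = 1
scores = [1.0 - i / 100 for i in range(100)]   # strictly decreasing: index = rank - 1
ap = average_precision(scores, labels)
print(f"average precision: {ap:.3f}")
```

Here average precision is 1/3, because two thirds of the items retrieved by the time both positives are found are false alarms. The same ranking has a pairwise AUROC of 190/196 ≈ 0.97, since each positive still outranks almost all of the 98 negatives.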

Choosing operating point

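  • Cost-aware: $t^* = \arg\min_t \big[ c_{FP} \cdot FP(t) + c_{FN} \cdot FN(t) \big]$, where $c_{FP}$ and $c_{FN}$ are the costs of a false positive and a false negative.
  • Recall constraint: pick $t$ such that recall ≥ X.
  • F-score optimization: $t^* = \arg\max_t F_\beta(t)$.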
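The first two rules can be sketched as a scan over candidate thresholds. The assumptions here: an example is predicted positive when its score is at least $t$, candidate thresholds are the observed scores, and the costs and data are made up for illustration.

```python
def pick_threshold_by_cost(scores, labels, cost_fp=1.0, cost_fn=5.0):
    """Threshold (among observed scores) minimizing cost_fp * FP + cost_fn * FN."""
    best_t, best_cost = None, float("inf")
    for t in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

def pick_threshold_for_recall(scores, labels, min_recall):
    """Highest threshold (among observed scores) whose recall is at least min_recall."""
    n_pos = sum(labels)
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        if tp / n_pos >= min_recall:
            return t
    return None

scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 0, 0]
cost_choice = pick_threshold_by_cost(scores, labels)
recall_choice = pick_threshold_for_recall(scores, labels, min_recall=0.6)
```

With false negatives five times as costly as false positives, the cost rule lowers the threshold to 0.60 and accepts one false positive to catch every positive. The recall rule with a 0.6 floor stops at 0.85. Thresholds should be chosen on validation data, not on the test set.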

F-beta score

$$F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 P + R}$$

$\beta = 1$: F1. $\beta > 1$: weight recall more (e.g., disease screening). $\beta < 1$: weight precision more (e.g., spam).
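A small sketch of the standard F-beta formula, $(1 + \beta^2) P R / (\beta^2 P + R)$, showing how the same precision and recall score differently under different $\beta$. The zero-denominator convention (return 0) is an assumption.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; returns 0.0 when the denominator is zero (assumed convention)."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom

# A recall-heavy classifier: precision 0.5, recall 1.0.
p, r = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: {f_beta(p, r, beta):.3f}")
```

Because recall exceeds precision in this example, the score increases with $\beta$: the recall-weighted $F_2$ is the highest and the precision-weighted $F_{0.5}$ the lowest.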


7. Confusion matrix and derived metrics

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |

  • Accuracy: (TP + TN) / (TP + TN + FP + FN).
  • Precision: TP / (TP + FP), what fraction of positive predictions were right.
  • Recall (sensitivity, TPR): TP / (TP + FN), what fraction of actual positives were found.
  • Specificity (TNR): TN / (TN + FP).
  • F1: harmonic mean of P and R.
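  • MCC (Matthews Correlation Coefficient): correlation between predicted and actual labels, ranging from −1 to +1; a balanced metric for imbalanced data because it uses all four cells.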
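All of these metrics follow from the four counts. The sketch below uses the standard definitions. Returning 0.0 on a zero denominator is an assumed convention, and the counts in the example are hypothetical.

```python
import math

def confusion_metrics(tp, fn, fp, tn):
    """Derived metrics from confusion-matrix counts (0.0 on zero denominators)."""
    def ratio(num, den):
        return num / den if den else 0.0
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }

# Hypothetical imbalanced problem: 10 positives among 1000 examples.
m = confusion_metrics(tp=5, fn=5, fp=10, tn=980)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

Accuracy is 0.985 here even though the classifier finds only half of the positives and two thirds of its positive predictions are wrong. F1 (0.4) and MCC (about 0.4) reflect that.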

Why F1 not arithmetic mean?

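The harmonic mean is pulled toward the smaller of the two values, so a large gap between P and R is penalized: with P = 1.0 and R = 0.1 the arithmetic mean is 0.55, but F1 ≈ 0.18. F1 is high only when both P and R are high, and F1 = 0 if either is 0.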


8. Common interview gotchas

| Question | Common wrong answer | Right answer |
|---|---|---|
| Bias-variance: what's the third term? | "Just bias and variance" | Irreducible noise |
| Why is LOO-CV high variance? | Lots of data | Training sets are highly correlated → predictions are correlated → empirical mean has high variance |
| Why does k=10 work well? | Tradition | Empirical compromise: most data used, manageable variance |
| AIC vs BIC: same purpose? | Yes | AIC for prediction, BIC for model selection (true model in candidates) |
| AUROC vs AUPRC for imbalance? | Same | AUPRC much more informative; AUROC dominated by easy negatives |
| Time-series with k-fold? | Sure | Never: temporal leakage |
| F1 = arithmetic mean of P and R? | Yes | Harmonic mean, which penalizes imbalance between P and R |

9. Eight most-asked interview questions

  1. Derive the bias-variance decomposition. (Add and subtract $\bar f(x) = \mathbb{E}_D[\hat f_D(x)]$; expand; cross-term zero.)
  2. What's the main purpose of cross-validation? (Estimate generalization without leaking test data.)
  3. What does a learning curve tell you? (High bias vs high variance via train-val gap; informs "more data" vs "better model".)
  4. AIC vs BIC? (Both penalize complexity; BIC penalty grows; AIC for prediction, BIC for true-model identification.)
  5. What's wrong with AUROC for severe imbalance? (Negatives dominate; the many easy negatives keep FPR low and lift AUROC; AUPRC focuses on positives.)
  6. F1 vs accuracy? (Accuracy misleading for imbalance; F1 is harmonic mean of P and R.)
  7. Why use stratified k-fold? (Preserve class ratios; reduces CV variance.)
  8. What's nested CV? (Outer for evaluation; inner for hyperparameter tuning. Prevents tuning bias in outer estimate.)

10. Drill plan

  • Derive bias-variance decomposition on paper.
  • For each CV variant (k-fold, stratified, group, time-series, nested), recite when used.
  • Recite AIC and BIC formulas + when each.
  • Sketch ROC and PR curves for: random, perfect, threshold-based binary classifier.
  • For each F-score variant ($\beta = 1$, $\beta > 1$, $\beta < 1$), recite when used.
  • Plot a learning curve for "high bias" vs "high variance" — describe to interviewer.

11. Further reading

  • Hastie, Tibshirani, Friedman, The Elements of Statistical Learning — chapters 7 (model assessment), 8 (model inference).
  • Bishop, Pattern Recognition and Machine Learning, section 3.2 (the bias-variance decomposition).
  • Kohavi (1995), A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection.
  • Saito & Rehmsmeier (2015), The Precision-Recall Plot is More Informative than the ROC Plot...
  • Burnham & Anderson, Model Selection and Multi-Model Inference — AIC/BIC reference.