Topic 49: Generalization and Evaluation

🔥 For interviews, read these first:

GENERALIZATION_DEEP_DIVE.md — frontier-lab deep dive: data leakage (4 types), calibration (ECE, Platt/isotonic/temperature), distribution shift (covariate/label/concept), class imbalance, double descent, cross-validation done right, ablations, metric uncertainty.

INTERVIEW_GRILL.md — 60 active-recall questions.

What You'll Learn

This topic is about the questions that separate "I trained a model" from "I know whether the model actually works."

You will learn:

Train / validation / test roles
Overfitting and underfitting
Data leakage
Class imbalance
Calibration
Slice-based error analysis
Distribution shift
Ablations and experiment interpretation
Metric confidence intervals

Why This Matters for Research Scientist Interviews

A strong research answer is rarely just:

"My model got a better score."

A strong answer is:

"The gain appears on the right slices, survives repeated runs, uses the correct metric, and is not explained by leakage or evaluation artifacts."

That is the mindset this topic is meant to build.

Core Intuition

1. Train, Validation, Test

Train: fit model parameters
Validation: tune hyperparameters and compare design choices
Test: final unbiased evaluation, run only after all choices are frozen

Easy rule:

If the test set affects your design choices, it is no longer a real test set.
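
A minimal sketch of a three-way split, assuming NumPy and a 70/15/15 ratio chosen only for illustration. The function name `three_way_split` and its parameters are hypothetical, not part of this topic's files.

```python
import numpy as np

def three_way_split(n_examples, val_frac=0.15, test_frac=0.15, seed=0):
    """Return disjoint index arrays (train, val, test) that cover range(n_examples)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_examples)          # shuffle once, then slice
    n_test = int(round(n_examples * test_frac))
    n_val = int(round(n_examples * val_frac))
    test_idx = perm[:n_test]                    # set aside first, touched last
    val_idx = perm[n_test:n_test + n_val]       # used for tuning choices
    train_idx = perm[n_test + n_val:]           # used for fitting parameters
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

A random split like this assumes examples are independent. For time-ordered or grouped data (for example several rows per user), splitting by time or by group is usually the safer choice.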

2. Overfitting and Underfitting

Underfitting

training error high
validation error high
model is too weak, or optimization is poor

Overfitting

training error low
validation error high
model memorizes training patterns that do not transfer

Interview explanation:

"Overfitting is not just about having a big model. It is a mismatch between fit to observed training data and ability to generalize to unseen data."
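
One way to see the train/validation gap is to fit polynomials of increasing degree to a small noisy sample. This is a toy sketch using NumPy; the synthetic data, the degrees, and the helper names `mse` and `fit_and_score` are all assumptions made for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def fit_and_score(degree, x_train, y_train, x_val, y_val):
    """Least-squares polynomial fit; returns (train_mse, val_mse)."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    return (mse(y_train, np.polyval(coeffs, x_train)),
            mse(y_val, np.polyval(coeffs, x_val)))

# Synthetic data: a smooth signal plus noise (illustrative only).
rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 12)
x_val = np.linspace(-0.95, 0.95, 50)
y_train = np.sin(3 * x_train) + rng.normal(0, 0.2, size=x_train.shape)
y_val = np.sin(3 * x_val) + rng.normal(0, 0.2, size=x_val.shape)

errors = {}
for degree in (1, 3, 11):
    errors[degree] = fit_and_score(degree, x_train, y_train, x_val, y_val)
    train_err, val_err = errors[degree]
    print(f"degree={degree:2d}  train_mse={train_err:.4f}  val_mse={val_err:.4f}")
```

Training error can only go down as the degree grows, because each larger polynomial family contains the smaller one. The signal to watch is whether validation error follows it down or moves the other way.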

3. Data Leakage

Leakage is one of the highest-value topics in interviews because many candidates ignore it.

Leakage means information from the future, the label, or the evaluation set sneaks into training or feature creation.

Common forms:

fitting preprocessing on all data before splitting
duplicate records across train and test
time leakage from future features
target leakage hidden inside engineered columns

Good answer:

"Before trusting any result, I would check split logic, duplication, feature generation timestamps, and whether any preprocessing was fit outside the training partition."
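
Two of these checks are easy to sketch in NumPy: fitting a standardizer on the training partition only, and counting exact duplicate rows across splits. The helper names below are hypothetical, and the duplicate check assumes both arrays share a dtype and that duplicates are byte-for-byte identical (near-duplicates need fuzzier matching).

```python
import numpy as np

def fit_standardizer(x_train):
    """Compute mean/std from the TRAINING partition only."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)   # avoid division by zero on constant columns
    return mean, std

def apply_standardizer(x, mean, std):
    """Apply statistics that were fit elsewhere; never refit on val/test."""
    return (x - mean) / std

def count_cross_split_duplicates(x_train, x_test):
    """Count test rows that also appear, byte for byte, in the training set."""
    train_rows = {row.tobytes() for row in np.ascontiguousarray(x_train)}
    return sum(row.tobytes() in train_rows for row in np.ascontiguousarray(x_test))

x_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_test = np.array([[2.0, 20.0], [4.0, 40.0]])

mean, std = fit_standardizer(x_train)
x_train_scaled = apply_standardizer(x_train, mean, std)
x_test_scaled = apply_standardizer(x_test, mean, std)
n_dupes = count_cross_split_duplicates(x_train, x_test)
print("duplicate rows across splits:", n_dupes)
```

The leaky version of this pipeline would call `fit_standardizer` on the concatenation of train and test, which lets test statistics influence the features the model is trained on.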

4. Class Imbalance

Accuracy can be misleading.

If 99% of examples are negative, a dumb classifier that always predicts negative gets 99% accuracy.

That is why interviewers often ask about:

precision
recall
F1
ROC-AUC
PR-AUC

Rule of thumb:

use PR-focused metrics when positives are rare and important
use recall when missing positives is costly
use precision when false positives are costly
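
The 99% example can be reproduced with a small confusion-matrix utility. This is a sketch with hypothetical helper names, and it adopts the convention (an assumption, not a standard) that an undefined ratio is reported as 0.0.

```python
import numpy as np

def binary_confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for 0/1 labels and predictions."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1; undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = binary_confusion(y_true, y_pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# 1% positives, and a classifier that always predicts negative.
y_true = np.array([1] * 10 + [0] * 990)
always_negative = np.zeros_like(y_true)
metrics = classification_metrics(y_true, always_negative)
print(metrics)
```

Accuracy looks excellent while recall and F1 are zero, which is exactly the failure the accuracy number hides.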

5. Calibration

Calibration asks:

"When the model says 0.8 confidence, is it right about 80% of the time?"

This matters whenever predicted probabilities feed thresholds, costs, or downstream decisions.

A model can rank examples well but still be poorly calibrated.

Important distinction:

discrimination: can it rank positives above negatives?
calibration: do probabilities mean what they claim?
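
A common summary of calibration is expected calibration error (ECE): bin predictions by confidence, then take the weighted average gap between mean confidence and accuracy in each bin. The sketch below assumes equal-width bins, that `confidences` holds the model's confidence in its predicted class, and that `correct` is 1 when that prediction was right. The function name and the choice of 10 bins are illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins (lo, hi]; confidence 0.0 falls in the first bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids run from 0 to n_bins - 1.
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue                      # empty bins contribute nothing
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap          # weight by fraction of examples in bin
    return float(ece)

# Overconfident model: says 0.9 but is right only 60% of the time.
overconfident = expected_calibration_error([0.9] * 10, [1] * 6 + [0] * 4)
# Well-calibrated at this confidence level: says 0.8, right 80% of the time.
calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
print(round(overconfident, 3), round(calibrated, 3))
```

ECE depends on the binning scheme and can be noisy on small samples, so it is worth reporting the number of bins alongside the value.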

6. Slice-Based Evaluation

Average performance can hide serious failures.

Always ask:

Does the model fail on long inputs?
Does it fail on rare classes?
Does it fail on low-resource languages?
Does it fail on certain customer segments?

Research scientist interviews often reward this kind of thinking.
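
A per-slice breakdown takes only a few lines. This sketch assumes each example carries one slice label (for instance a length bucket or a language code); the function name `slice_accuracy` and the toy data are hypothetical.

```python
from collections import defaultdict

def slice_accuracy(y_true, y_pred, slice_labels):
    """Return {slice_label: (accuracy, n_examples)} for each slice."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for truth, pred, label in zip(y_true, y_pred, slice_labels):
        totals[label] += 1
        hits[label] += int(truth == pred)
    return {label: (hits[label] / totals[label], totals[label])
            for label in totals}

# Toy data: the model is perfect on short inputs and poor on long ones.
y_true = [1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 1]
slices = ["short"] * 4 + ["long"] * 4

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_slice = slice_accuracy(y_true, y_pred, slices)
print("overall:", overall)
print("per slice:", per_slice)
```

Returning the slice size next to the accuracy matters: a slice with a handful of examples can look alarming or reassuring purely by chance.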

7. Distribution Shift

Performance can collapse when train and deployment distributions differ.

Common shifts:

covariate shift: input distribution changes
label shift: class frequencies change
concept shift: relationship between input and label changes

Useful answer:

"I would compare feature distributions, error slices, and calibration between the training period and the deployment period. Then I would check whether the shift is in inputs, labels, or the mapping itself."
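
Two first-pass checks can be sketched with NumPy: a two-sample Kolmogorov-Smirnov statistic on a single feature (a covariate-shift signal) and the change in class frequencies (a label-shift signal). The helper names are hypothetical. Concept shift cannot be detected from inputs or label frequencies alone; it needs labeled data from the new period.

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    grid = np.concatenate([a, b])         # the gap is maximized at an observed point
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def label_frequency_shift(train_labels, deploy_labels):
    """Per-class change in frequency: deploy fraction minus train fraction."""
    train_labels = np.asarray(train_labels)
    deploy_labels = np.asarray(deploy_labels)
    classes = np.union1d(train_labels, deploy_labels)
    return {c.item(): float(np.mean(deploy_labels == c) - np.mean(train_labels == c))
            for c in classes}

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, size=500)
deploy_feature = rng.normal(1.0, 1.0, size=500)   # synthetic mean shift
print("KS statistic:", round(ks_statistic(train_feature, deploy_feature), 3))

freq_shift = label_frequency_shift([0, 0, 0, 1], [0, 1, 1, 1])
print("label frequency shift:", freq_shift)
```

If SciPy is available, `scipy.stats.ks_2samp` computes the same kind of statistic and also returns a p-value.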

8. Ablations

Ablations answer:

"Which part of the system caused the gain?"

Good ablations:

remove one change at a time
keep compute and data consistent
report the base model clearly
show failure cases, not just the best metric
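
A small table of ablation deltas makes the attribution explicit. In this sketch the component names and scores are made-up placeholders, not results from any experiment, and `ablation_deltas` is a hypothetical helper.

```python
def ablation_deltas(full_score, ablated_scores):
    """Return rows (name, score, drop_vs_full), largest drop first.

    Each entry of ablated_scores is the metric with exactly ONE component
    removed, with compute and data held fixed.
    """
    rows = [(name, score, full_score - score)
            for name, score in ablated_scores.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Placeholder numbers for illustration only.
full_score = 0.80
ablated_scores = {
    "no_augmentation": 0.78,
    "no_pretraining": 0.70,
    "no_label_smoothing": 0.80,
}

table = ablation_deltas(full_score, ablated_scores)
for name, score, drop in table:
    print(f"{name:20s} score={score:.2f}  drop={drop:+.2f}")
```

A drop near zero suggests the component is not contributing on this metric, but small drops should be read alongside run-to-run variance before drawing conclusions.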

9. Confidence Intervals for Metrics

If your metric moves from 84.1 to 84.4, the gain may be real or may be within sampling noise.

Bootstrap is often the easiest way to estimate uncertainty for:

accuracy
F1
recall@k
exact match

This is especially useful for metrics such as F1, where a closed-form variance is awkward to derive.
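
A percentile bootstrap works for any metric that maps labels and predictions to a number. The sketch below assumes examples are independent and uses 2000 resamples and a 95% interval as illustrative defaults; `bootstrap_ci` and `accuracy` are hypothetical helper names.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return float(np.mean(y_true == y_pred))

def bootstrap_ci(y_true, y_pred, metric_fn, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point_estimate, lower, upper) using the percentile bootstrap."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    n = y_true.shape[0]
    stats = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample examples with replacement
        stats[i] = metric_fn(y_true[idx], y_pred[idx])
    lower, upper = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(metric_fn(y_true, y_pred)), float(lower), float(upper)

# Toy evaluation set: 500 examples, 420 predicted correctly.
y_true = np.array([1] * 250 + [0] * 250)
y_pred = y_true.copy()
y_pred[:80] = 1 - y_pred[:80]             # flip 80 predictions to create errors

point, lower, upper = bootstrap_ci(y_true, y_pred, accuracy)
print(f"accuracy = {point:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

To compare two models on the same test set, one reasonable variant is to resample the same indices for both and bootstrap the difference in the metric, since overlapping individual intervals do not by themselves show that the difference is insignificant.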

Common Failure Modes

2. Hidden Leakage in the Pipeline

preprocessing fit on full data
duplicate rows across splits
future information in historical features
target-derived engineered columns

3. Treating Average Performance as Complete Evidence

Average metrics can hide failure on:

rare classes
long inputs
specific languages or user groups
safety-critical slices

Edge Cases and Follow-Up Questions

What if validation improves but test does not?

overfitting to the validation set
accidental tuning on the test set earlier
shift between validation and test
metric instability

Boilerplate Code

Binary confusion matrix
Accuracy / precision / recall / F1
Expected calibration error (ECE)
Bootstrap confidence interval for any metric
Slice accuracy
Simple ablation deltas

These are the kinds of compact utilities that help during coding rounds and during your own experiment analysis.

What to Practice Saying Out Loud

Why is accuracy a bad metric under class imbalance?
How would you detect leakage in an offline pipeline?
What does calibration measure that ROC-AUC does not?
Why should ablations keep compute and data fixed?
If validation gets better but test does not, what are your first hypotheses?

Next Steps

After this topic:

Use Topic 50 for fast coding-round patterns
Use Topic 51 for LLM-specific research interview prep
