Statistical Inference — Deep Dive
Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.
Statistical inference is what separates "I trained a model and it has 87% accuracy" from "I have evidence that my model's true accuracy is 87% ± 1.2% and that's a statistically significant 0.4-point improvement over the baseline." Senior interviews probe this hard because production ML decisions hinge on it.
1. Estimators — what they are and what makes one "good"
An estimator $\hat\theta = T(X_1, \dots, X_n)$ is a function of the data that tries to recover an unknown parameter $\theta$.
Properties
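Unbiased: $\mathbb{E}[\hat\theta] = \theta$. The sample mean is unbiased for the population mean. The sample variance with denominator $n-1$ is unbiased; with denominator $n$ it isn't (Bessel's correction).
Consistent: $\hat\theta_n \xrightarrow{p} \theta$ as $n \to \infty$. Most useful estimators are consistent. (Note: unbiasedness and consistency are different properties, and neither implies the other.)
Efficient: minimum variance among unbiased estimators. Cramér-Rao lower bound for an unbiased estimator from $n$ i.i.d. observations:
$$\mathrm{Var}(\hat\theta) \ge \frac{1}{n\,I(\theta)}$$
where $I(\theta)$ is the Fisher information of a single observation. The MLE is asymptotically efficient: its variance attains the CRLB as $n \to \infty$.
Bias-variance decomposition for MSE:
$$\mathrm{MSE}(\hat\theta) = \mathbb{E}\big[(\hat\theta - \theta)^2\big] = \mathrm{Bias}(\hat\theta)^2 + \mathrm{Var}(\hat\theta)$$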
Important: a biased estimator with low variance can have lower MSE than an unbiased one with high variance. This is the whole point of regularization.
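A small simulation can make this concrete. The sketch below compares three variance estimators on Gaussian data: dividing the sum of squared deviations by $n-1$ (unbiased), by $n$ (the MLE), and by $n+1$. The function name, sample size, and trial count are illustrative choices, not part of the notes above. For Gaussian data the theoretical MSEs are $2\sigma^4/(n-1)$ for the unbiased version and $(2n-1)\sigma^4/n^2$ for the divide-by-$n$ version, so the biased estimators should come out ahead.

```python
import random

def variance_estimator_mse(n=10, trials=20000, sigma2=1.0, seed=0):
    """Monte Carlo MSE of the variance estimators that divide by n-1, n, and n+1."""
    rng = random.Random(seed)
    sq_err = {n - 1: 0.0, n: 0.0, n + 1: 0.0}  # keyed by divisor
    for _ in range(trials):
        x = [rng.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
        mean = sum(x) / n
        ss = sum((xi - mean) ** 2 for xi in x)  # sum of squared deviations
        for d in sq_err:
            sq_err[d] += (ss / d - sigma2) ** 2
    return {d: total / trials for d, total in sq_err.items()}

mse = variance_estimator_mse()
for divisor, value in sorted(mse.items()):
    print(f"divide by {divisor}: MSE ~ {value:.4f}")
```

The ordering (more shrinkage, lower MSE, up to a point) is the same trade that ridge and lasso make on regression weights.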
2. Maximum likelihood estimation
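The likelihood: $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$. The log-likelihood: $\ell(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$.
MLE: $\hat\theta_{\mathrm{MLE}} = \arg\max_\theta \ell(\theta)$.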
Properties of MLE
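- Consistent: $\hat\theta_{\mathrm{MLE}} \xrightarrow{p} \theta$
- Asymptotically normal: $\sqrt{n}\,(\hat\theta_{\mathrm{MLE}} - \theta) \xrightarrow{d} \mathcal{N}\big(0,\, I(\theta)^{-1}\big)$, with $I(\theta)$ the Fisher information of a single observation
- Asymptotically efficient: variance hits CRLB
- Invariant to reparameterization: if $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$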
Worked examples
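Bernoulli (coin flip): $X_i \sim \mathrm{Bernoulli}(p)$. $\ell(p) = \sum_i \big[x_i \log p + (1 - x_i)\log(1 - p)\big]$. $\hat p = \frac{1}{n}\sum_i x_i$ (sample mean).
Gaussian (known variance): $X_i \sim \mathcal{N}(\mu, \sigma^2)$. $\hat\mu = \bar x$.
Gaussian (both unknown): $\hat\mu = \bar x$, $\hat\sigma^2 = \frac{1}{n}\sum_i (x_i - \bar x)^2$, which is biased! The unbiased estimator divides by $n-1$.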
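One way to sanity-check a closed-form MLE is to maximise the log-likelihood numerically and compare. The sketch below does this for the Bernoulli case with a brute-force grid search on simulated flips; the true $p = 0.3$, the sample size, and the grid resolution are arbitrary assumptions made for illustration.

```python
import math
import random

def bernoulli_log_likelihood(p, xs):
    """l(p) = sum_i [x_i log p + (1 - x_i) log(1 - p)] for 0/1 data."""
    k = sum(xs)
    return k * math.log(p) + (len(xs) - k) * math.log(1.0 - p)

rng = random.Random(0)
xs = [1 if rng.random() < 0.3 else 0 for _ in range(200)]  # simulated flips, true p = 0.3

# Brute-force maximisation over a fine grid in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, xs))
p_closed_form = sum(xs) / len(xs)  # the sample mean

print(p_grid, p_closed_form)
```

The grid maximiser and the sample mean should agree up to the grid spacing, which is the numerical counterpart of setting the derivative of $\ell(p)$ to zero.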
3. Confidence intervals
A CI is a random interval $[L(X), U(X)]$ with $P\big(L(X) \le \theta \le U(X)\big) = 1 - \alpha$, where the probability is over repeated sampling.
Common misinterpretation: "There's a 95% probability $\theta$ is in [1.2, 3.4]." Wrong (under frequentist interpretation). $\theta$ is fixed; the interval is random. The correct statement: "If we repeated this procedure many times, 95% of intervals would contain $\theta$."
Wald CI (asymptotic)
For an asymptotically normal estimator:
$$\hat\theta \pm z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat\theta)$$
with $z_{0.975} \approx 1.96$ for 95%. Standard error from Fisher information or sample variance.
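The repeated-sampling interpretation can be checked by simulation. The sketch below builds a Wald interval for a mean using the plug-in standard error $s/\sqrt{n}$, then counts how often the interval covers a fixed true mean across many simulated datasets. The function names and simulation settings are illustrative assumptions.

```python
import random
from statistics import NormalDist

def wald_ci_mean(x, alpha=0.05):
    """Wald interval for a mean: x_bar +/- z_{1-alpha/2} * s / sqrt(n)."""
    n = len(x)
    m = sum(x) / n
    s2 = sum((xi - m) ** 2 for xi in x) / (n - 1)  # unbiased sample variance
    z = NormalDist().inv_cdf(1 - alpha / 2)        # about 1.96 for alpha = 0.05
    half_width = z * (s2 / n) ** 0.5
    return m - half_width, m + half_width

def coverage(true_mu=2.0, n=100, reps=2000, seed=0):
    """Fraction of repeated-sample intervals that contain the fixed true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        lo, hi = wald_ci_mean([rng.gauss(true_mu, 1.0) for _ in range(n)])
        hits += lo <= true_mu <= hi
    return hits / reps

cov = coverage()
print(f"empirical coverage: {cov:.3f}")
```

Coverage should land close to the nominal 95%; with small $n$ the normal quantile is slightly too narrow, which is one reason to prefer a $t$ quantile there.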
Bootstrap CI
When you can't compute SE analytically: resample data with replacement $B$ times, compute $\hat\theta^*_b$ for each, then take quantiles (percentile method) or use bootstrap-t.
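    import random

    def bootstrap_ci(X, T, B=2000, alpha=0.05, seed=0):
        rng = random.Random(seed)
        # B times: resample n points with replacement, recompute the statistic.
        thetas = sorted(T(rng.choices(X, k=len(X))) for _ in range(B))
        # Percentile method: read the CI off the empirical quantiles.
        return thetas[int(B * alpha / 2)], thetas[int(B * (1 - alpha / 2)) - 1]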
Bootstrap is non-parametric, simple, and extremely useful in ML for things like AUC confidence intervals.
Bayesian credible interval
The interval that contains 95% of the posterior probability mass. This is a different concept from the Wald CI: the credible interval does support the natural-language "$\theta$ is in [...] with 95% probability" interpretation, conditional on the prior.
4. Hypothesis testing
Testing a claim $H_0$ (the null hypothesis) vs alternative $H_1$.
Components
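- Test statistic $T(X)$: function of data.
- Null distribution: distribution of $T$ under $H_0$.
- Rejection region: values of $T$ where we reject $H_0$.
- Significance level $\alpha$: $P(\text{reject } H_0 \mid H_0 \text{ true})$ (Type I error).
- Power $1 - \beta$: $P(\text{reject } H_0 \mid H_1 \text{ true})$.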
p-value
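$p$-value = $P(T \ge t_{\mathrm{obs}} \mid H_0)$ (use $|T|$ for a two-sided test): the probability of seeing data at least this extreme if $H_0$ is true.
Common interpretation errors:
- The $p$-value is NOT $P(H_0 \mid \text{data})$.
- A small $p$-value doesn't mean a large effect, just that the observed data would be unlikely under $H_0$.
- $p > 0.05$ doesn't prove $H_0$; it only shows a lack of evidence against it.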
Standard tests
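z-test: Gaussian, known variance. $z = \dfrac{\bar x - \mu_0}{\sigma / \sqrt{n}}$.
t-test: Gaussian, unknown variance. Use sample SD $s$; the statistic $t = \dfrac{\bar x - \mu_0}{s / \sqrt{n}}$ follows $t_{n-1}$ under $H_0$.
Chi-squared: categorical data goodness-of-fit, contingency tables. $\chi^2 = \sum_i \dfrac{(O_i - E_i)^2}{E_i}$.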
Mann-Whitney U / Wilcoxon: non-parametric two-sample.
A/B test (proportions): binomial / two-proportion z-test.
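As an illustration of the A/B case, here is a minimal two-proportion z-test using the pooled standard error under $H_0: p_A = p_B$. The function name and the conversion counts are made up for the example; a real analysis would also check that the normal approximation is reasonable (enough successes and failures in each arm).

```python
from statistics import NormalDist

def two_proportion_z_test(k_a, n_a, k_b, n_b):
    """Two-sided z-test of H0: p_a == p_b using the pooled standard error."""
    p_a, p_b = k_a / n_a, k_b / n_b
    p_pool = (k_a + k_b) / (n_a + n_b)  # best estimate of the common rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: 10.0% vs 12.5% conversion, 2000 users per arm.
z, p = two_proportion_z_test(k_a=200, n_a=2000, k_b=250, n_b=2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Reporting the observed difference (here 2.5 points) alongside the p-value keeps effect size and significance separate.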
Type I vs Type II
- Type I (false positive): reject $H_0$ when it is true. Controlled by $\alpha$.
- Type II (false negative): fail to reject $H_0$ when $H_1$ is true. Its probability $\beta$ depends on effect size, $n$, $\alpha$.
Power analysis picks $n$ to achieve target power $1 - \beta$ (typically 80%) for a minimum detectable effect.
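A rough sample-size calculation for a two-arm proportion test can be sketched with the usual normal approximation, $n \approx (z_{1-\alpha/2} + z_{1-\beta})^2 \, [p_1(1-p_1) + p_2(1-p_2)] / (p_2 - p_1)^2$ per arm. The function name, baseline rate, and minimum detectable effect below are assumptions for illustration, and other approximations (pooled variance, continuity corrections) give slightly different numbers.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p_base, mde, alpha=0.05, power=0.80):
    """Approximate n per arm to detect p_base -> p_base + mde (two-sided z-test)."""
    p_new = p_base + mde
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    var_sum = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_power) ** 2 * var_sum / mde ** 2)

n_per_arm = sample_size_two_proportions(p_base=0.10, mde=0.02)
print(n_per_arm)
```

Because $n$ scales with $1/\text{MDE}^2$, halving the minimum detectable effect roughly quadruples the required sample.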
5. Multiple testing
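When you run $m$ tests at level $\alpha$, the family-wise probability of any false rejection grows: under independence, $\mathrm{FWER} = 1 - (1 - \alpha)^m \approx m\alpha$ for small $m\alpha$. With $m = 20$ tests at $\alpha = 0.05$ and every null true, you expect $m\alpha = 1$ false positive, and the chance of at least one is $1 - 0.95^{20} \approx 0.64$.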
Corrections
- Bonferroni: use $\alpha / m$ per test. Conservative; controls family-wise error rate (FWER).
- Holm-Bonferroni: step-down version — less conservative.
- Benjamini-Hochberg: controls false discovery rate (FDR = expected proportion of false positives among rejections). Less conservative; standard in genomics, A/B testing at scale.
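The three corrections above can be sketched in a few lines each. The implementations below return one reject/keep flag per input p-value, in the original order; the function names and example p-values are illustrative, and the Benjamini-Hochberg version assumes independent (or positively dependent) tests.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i when p_i <= alpha / m. Controls FWER."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down: compare the k-th smallest p to alpha / (m - k + 1); stop at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank = 0 for the smallest p-value
        if pvals[i] > alpha / (m - rank):
            break
        reject[i] = True
    return reject

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up: find the largest k with p_(k) <= k * alpha / m; reject the k smallest. Controls FDR."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

pvals = [0.001, 0.012, 0.014, 0.039, 0.200]  # made-up p-values
print(sum(bonferroni(pvals)), sum(holm(pvals)), sum(benjamini_hochberg(pvals)))
```

On this example the number of rejections grows from Bonferroni to Holm to Benjamini-Hochberg, which matches the ordering from most to least conservative.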
When this matters in ML
- Hyperparameter search: 100 hyperparam combos → some "win" by luck.
- Many A/B tests on the same data: false positives.
- Feature selection: testing each feature for significance inflates Type I.
- Subgroup analysis ("but the model works better for users in California!") — almost always overstated without correction.
6. The bootstrap — workhorse for ML
The bootstrap (Efron 1979) lets you estimate sampling distributions when you can't derive them analytically.
Recipe (non-parametric bootstrap):
- Resample $X^*_b$ from your data with replacement, size $n$.
- Compute $\hat\theta^*_b = T(X^*_b)$.
- Repeat $B$ times (typically 1000–10000).
- The empirical distribution of $\hat\theta^*_1, \dots, \hat\theta^*_B$ approximates the sampling distribution of $\hat\theta$.
What you can do:
- SE estimate: SD of the bootstrap distribution.
- CI: quantiles (percentile method) or bias-corrected accelerated (BCa).
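- Hypothesis test: reject $H_0$ if the observed statistic falls in the tail of a bootstrap distribution generated under the null (for example, after recentering the data so that $H_0$ holds).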
Bootstrap in ML practice:
- AUC CI: bootstrap test set predictions.
- Model comparison: paired bootstrap of metric differences.
- Random forest internals: bagging is bootstrapping.
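The first two items can be combined in one sketch: a paired bootstrap of the AUC difference between two models scored on the same test set. The same resampled indices are used for both models, so per-example noise shared by the two largely cancels. Everything here (function names, the synthetic labels and scores, $B$) is an illustrative assumption; the pairwise AUC is $O(n_{+} n_{-})$ and only suitable for small test sets.

```python
import random

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap_auc_diff(labels, scores_a, scores_b, B=400, alpha=0.05, seed=0):
    """Approximate percentile CI for AUC(A) - AUC(B), resampling test examples jointly."""
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < B:
        idx = [rng.randrange(n) for _ in range(n)]  # same indices for both models
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # skip degenerate resamples with a single class
            a = auc(y, [scores_a[i] for i in idx])
            b = auc(y, [scores_b[i] for i in idx])
            diffs.append(a - b)
    diffs.sort()
    return diffs[int(B * alpha / 2)], diffs[min(B - 1, int(B * (1 - alpha / 2)))]

# Synthetic test set: model B is a noisier version of model A.
rng = random.Random(1)
labels = [i % 2 for i in range(100)]
scores_a = [y + rng.gauss(0, 1.0) for y in labels]
scores_b = [y + rng.gauss(0, 2.0) for y in labels]
lo, hi = paired_bootstrap_auc_diff(labels, scores_a, scores_b)
print(f"95% CI for AUC difference: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the difference is unlikely to be resampling noise alone; if it straddles zero, "0.85 vs 0.84" is not yet evidence of a better model.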
Limitations:
- Doesn't work for extreme order statistics (e.g., min/max).
- Doesn't work well for time series without block bootstrap.
- Computationally expensive for large $n$ or costly statistics, since the statistic is recomputed on each of the $B$ resamples.
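The failure for extreme order statistics is easy to see empirically. A resample can never exceed the sample maximum, and it reproduces that maximum exactly with probability $1 - (1 - 1/n)^n \approx 0.632$, so the bootstrap distribution of the max is mostly a point mass rather than a smooth approximation. The sketch below measures that fraction on a simulated Uniform(0, 1) sample; the function name and settings are illustrative.

```python
import random

def frac_resamples_hitting_sample_max(n=50, B=5000, seed=0):
    """Fraction of bootstrap resamples whose maximum equals the original sample maximum."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]  # Uniform(0, 1) sample; the true upper bound is 1
    x_max = max(x)
    hits = sum(max(rng.choices(x, k=n)) == x_max for _ in range(B))
    return hits / B

frac = frac_resamples_hitting_sample_max()
print(f"{frac:.3f} of resamples return the sample max exactly")
```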
7. Bayesian inference
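Frequentist: $\theta$ is fixed, data is random. Bayesian: $\theta$ has a probability distribution, updated by Bayes' rule:
$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$
- $p(\theta)$: prior, your belief before seeing data.
- $p(D \mid \theta)$: likelihood, same as in MLE.
- $p(\theta \mid D)$: posterior, updated belief.
- $p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$: marginal likelihood / evidence.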
Conjugate priors
Posterior in the same family as prior. Examples:
- Beta prior + Bernoulli likelihood → Beta posterior.
- Gamma prior + Poisson likelihood → Gamma posterior.
- Dirichlet prior + multinomial likelihood → Dirichlet posterior.
- Gaussian prior + Gaussian likelihood (known variance) → Gaussian posterior.
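Beta-Bernoulli example: prior $p \sim \mathrm{Beta}(\alpha, \beta)$. After observing $k$ successes in $n$ trials: posterior $\mathrm{Beta}(\alpha + k,\ \beta + n - k)$. Posterior mean: $\dfrac{\alpha + k}{\alpha + \beta + n}$.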
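The conjugate update is a two-line computation, and a credible interval can be read off posterior draws. The sketch below uses a uniform Beta(1, 1) prior and 7 successes in 10 trials (both made up for illustration) and estimates an equal-tailed 95% interval by Monte Carlo with the standard library's `random.betavariate`; exact Beta quantiles would need a numerical library.

```python
import random

def beta_bernoulli_update(alpha, beta, k, n):
    """Beta(alpha, beta) prior + k successes in n trials -> Beta(alpha + k, beta + n - k)."""
    return alpha + k, beta + n - k

def credible_interval(alpha, beta, level=0.95, draws=20000, seed=0):
    """Equal-tailed credible interval estimated from Monte Carlo posterior draws."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1 - level) / 2
    return samples[int(tail * draws)], samples[int((1 - tail) * draws) - 1]

a_post, b_post = beta_bernoulli_update(alpha=1, beta=1, k=7, n=10)  # uniform prior
post_mean = a_post / (a_post + b_post)
lo, hi = credible_interval(a_post, b_post)
print(f"posterior Beta({a_post}, {b_post}), mean {post_mean:.3f}, 95% CrI [{lo:.2f}, {hi:.2f}]")
```

The posterior mean sits between the prior mean (0.5) and the sample proportion (0.7), which is the shrinkage effect of the prior.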
MAP
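Maximum a posteriori: $\hat\theta_{\mathrm{MAP}} = \arg\max_\theta\, p(\theta \mid D) = \arg\max_\theta \big[\log p(D \mid \theta) + \log p(\theta)\big]$.
This is exactly MLE + log-prior penalty. The penalty is the regularizer.
- Gaussian prior on weights → $L_2$ regularization (ridge).
- Laplace prior → $L_1$ (lasso).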
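The correspondence follows from taking the log of each prior. As a sketch, assume a zero-mean isotropic Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$ and an i.i.d. Laplace prior with scale $b$; the symbols $w$, $\tau$, $b$, and $\lambda$ are notation introduced here, not taken from the text above.

$$
\begin{aligned}
\hat w_{\mathrm{MAP}} &= \arg\max_w \big[\log p(D \mid w) + \log p(w)\big] = \arg\min_w \big[-\log p(D \mid w) - \log p(w)\big] \\
\text{Gaussian:}\quad -\log p(w) &= \frac{1}{2\tau^2}\lVert w \rVert_2^2 + \text{const}
\;\Rightarrow\; \hat w_{\mathrm{MAP}} = \arg\min_w \big[-\log p(D \mid w) + \lambda \lVert w \rVert_2^2\big],\quad \lambda = \frac{1}{2\tau^2} \\
\text{Laplace:}\quad -\log p(w) &= \frac{1}{b}\lVert w \rVert_1 + \text{const}
\;\Rightarrow\; \hat w_{\mathrm{MAP}} = \arg\min_w \big[-\log p(D \mid w) + \lambda \lVert w \rVert_1\big],\quad \lambda = \frac{1}{b}
\end{aligned}
$$

A tighter prior (smaller $\tau$ or $b$) means a larger $\lambda$, that is, stronger regularization.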
Posterior summaries
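- Posterior mean: $\mathbb{E}[\theta \mid D] = \int \theta\, p(\theta \mid D)\, d\theta$
- Posterior median, mode (MAP)
- Credible interval: $[a, b]$ with $P(a \le \theta \le b \mid D) = 1 - \alpha$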
Bayesian inference in practice
- Conjugate cases: closed-form (rare beyond simple models).
- MCMC (Metropolis-Hastings, Gibbs, HMC): sample from posterior.
- Variational inference: approximate posterior with simpler distribution; minimize KL.
- Laplace approximation: Gaussian centered at MAP.
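To make the MCMC item concrete, here is a minimal random-walk Metropolis sampler for the Beta-Bernoulli posterior, chosen because the conjugate answer is available as a check. This is a sketch under simple assumptions (a fixed Gaussian proposal width, a fixed burn-in, no convergence diagnostics), and all names and tuning values are illustrative.

```python
import math
import random

def log_posterior(p, k, n, a, b):
    """Unnormalised log posterior for Bernoulli data with a Beta(a, b) prior."""
    if not 0.0 < p < 1.0:
        return -math.inf
    return (a + k - 1) * math.log(p) + (b + n - k - 1) * math.log(1.0 - p)

def metropolis(k, n, a=1.0, b=1.0, steps=20000, burn_in=2000, step_sd=0.2, seed=0):
    """Random-walk Metropolis sampler; returns post-burn-in draws of p."""
    rng = random.Random(seed)
    p = 0.5
    logp = log_posterior(p, k, n, a, b)
    draws = []
    for t in range(steps):
        prop = p + rng.gauss(0.0, step_sd)  # symmetric proposal
        logp_prop = log_posterior(prop, k, n, a, b)
        # Accept with probability min(1, posterior ratio); the normaliser p(D) cancels.
        if rng.random() < math.exp(min(0.0, logp_prop - logp)):
            p, logp = prop, logp_prop
        if t >= burn_in:
            draws.append(p)
    return draws

draws = metropolis(k=7, n=10)
mcmc_mean = sum(draws) / len(draws)
exact_mean = (1 + 7) / (1 + 1 + 10)  # conjugate answer: (a + k) / (a + b + n)
print(f"MCMC mean {mcmc_mean:.3f} vs exact {exact_mean:.3f}")
```

The sampler only needs the posterior up to a constant, which is why MCMC works when the evidence is intractable.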
8. Common ML stats gotchas
| Mistake | Why it's wrong | Fix |
|---|---|---|
| "p > 0.05 → no effect" | Absence of evidence ≠ evidence of absence | Report effect size + CI |
| "p = 0.001 → big effect" | A small p can come from a tiny effect estimated very precisely (large $n$) | Report effect size separately |
| "Train/test gap shows generalization" | Single split is noisy | Cross-validation or bootstrap |
| "AUC = 0.85 vs 0.84 → better model" | Without CI, can be noise | Bootstrap CIs, paired tests |
| "Multiple A/B tests at $\alpha = 0.05$" | FWER blows up | Bonferroni / BH correction |
| "Use confidence interval as 'probability in interval'" | That's a credible interval | Be precise about interpretation |
| "MLE is always optimal" | Only asymptotically; can overfit, can be biased in finite samples | Consider MAP / regularization |
| "Bootstrap fixes any sample size problem" | Tiny $n$ → biased bootstrap | Need $n$ large enough for the empirical distribution to approximate the true one |
9. Eight most-asked interview questions
- What's the difference between a confidence interval and a credible interval? (Frequentist vs Bayesian; "interval random vs $\theta$ random.")
- Derive the MLE for a Gaussian. (Lock down log-likelihood + zero-derivative routine.)
- What does a p-value mean exactly? (Probability of data at least this extreme under $H_0$, NOT $P(H_0 \mid \text{data})$.)
- When would you use bootstrap? (No analytic SE, ML metrics like AUC, paired model comparison.)
- What's the bias-variance tradeoff for estimators? (MSE = bias² + variance; biased estimators can win.)
- Why use Bessel's correction ($n-1$)? (Sample variance with denominator $n$ underestimates; dividing by $n-1$ unbiases it.)
- What's MAP and how does it relate to regularization? (MLE + log-prior; Gaussian prior = $L_2$, Laplace = $L_1$.)
- You ran 20 A/B tests, two were significant at $\alpha = 0.05$. What do you do? (Multiple testing: apply Bonferroni or BH correction.)
10. Drill plan
- For Bernoulli, Gaussian (both params), Poisson — derive MLE on paper. 5 minutes each.
- For Beta-Bernoulli — derive posterior. Recite posterior mean.
- Bootstrap loop in 30 lines of NumPy. AUC CI on a real dataset.
- For each common test (z, t, chi-squared, two-prop), recite: assumptions, statistic, null distribution, when to use.
- Interpret 5 different p-values and CI statements; flag the wrong ones.
11. Further reading
- Casella & Berger, Statistical Inference — the canonical text.
- Wasserman, All of Statistics — fast & broad, ML-friendly.
- Efron & Hastie, Computer Age Statistical Inference — bootstrap, modern methods.
- Gelman et al., Bayesian Data Analysis — Bayesian bible.
- xkcd 882 (jelly beans) — the canonical multiple-testing comic.