Statistical Inference — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

Statistical inference is what separates "I trained a model and it has 87% accuracy" from "I have evidence that my model's true accuracy is 87% ± 1.2% and that's a statistically significant 0.4-point improvement over the baseline." Senior interviews probe this hard because production ML decisions hinge on it.


1. Estimators — what they are and what makes one "good"

An estimator $\hat\theta = T(X_1, \dots, X_n)$ is a function of the data that tries to recover an unknown parameter $\theta$.

Properties

Unbiased: $\mathbb{E}[\hat\theta] = \theta$. The sample mean is unbiased for the population mean. The sample variance with denominator $n - 1$ is unbiased; with denominator $n$ it isn't (Bessel's correction).

Consistent: $\hat\theta_n \to \theta$ in probability as $n \to \infty$. Most useful estimators are consistent. (Note: unbiased ≠ consistent in general; they are different properties and neither implies the other.)

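Efficient: minimum variance among unbiased estimators. Cramér-Rao lower bound (CRLB), for an unbiased estimator built from $n$ i.i.d. observations:

$$\mathrm{Var}(\hat\theta) \ge \frac{1}{n\, I(\theta)}$$

where $I(\theta)$ is the Fisher information of a single observation. The MLE is asymptotically efficient: its variance attains the CRLB as $n \to \infty$.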
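As a quick worked instance of the bound (a sketch, writing $I(p)$ for the Fisher information of one Bernoulli observation), the sample mean of $n$ Bernoulli($p$) draws attains the CRLB exactly:

```latex
\begin{align*}
\log p(x \mid p) &= x \log p + (1 - x)\log(1 - p), \qquad x \in \{0, 1\} \\
\frac{\partial}{\partial p} \log p(x \mid p) &= \frac{x}{p} - \frac{1 - x}{1 - p} \\
I(p) &= \mathbb{E}\!\left[\left(\frac{x}{p} - \frac{1 - x}{1 - p}\right)^{2}\right]
      = \frac{1}{p} + \frac{1}{1 - p} = \frac{1}{p(1 - p)} \\
\text{CRLB} &= \frac{1}{n\, I(p)} = \frac{p(1 - p)}{n} = \mathrm{Var}(\bar{x})
\end{align*}
```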

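Bias-variance decomposition for MSE:

$$\mathrm{MSE}(\hat\theta) = \mathbb{E}\big[(\hat\theta - \theta)^2\big] = \mathrm{Bias}(\hat\theta)^2 + \mathrm{Var}(\hat\theta), \qquad \mathrm{Bias}(\hat\theta) = \mathbb{E}[\hat\theta] - \theta$$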

Important: a biased estimator with low variance can have lower MSE than an unbiased one with high variance. This is the whole point of regularization.
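The claim above can be checked numerically. The sketch below is a Monte Carlo illustration under assumed settings (Gaussian data, $n = 10$, $\sigma^2 = 4$; the function name `variance_estimator_mse` is made up for this example). It compares variance estimators that divide the centered sum of squares by $n - 1$, $n$, and $n + 1$.

```python
import numpy as np

def variance_estimator_mse(n=10, sigma2=4.0, trials=200_000, seed=0):
    """Monte Carlo bias, variance and MSE of sum((x - xbar)^2) / d for several d."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
    # Centered sum of squares for each simulated dataset.
    ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    results = {}
    for name, d in [("n-1 (unbiased)", n - 1), ("n (MLE)", n), ("n+1", n + 1)]:
        est = ss / d
        bias = est.mean() - sigma2
        var = est.var()
        results[name] = {"bias": bias, "variance": var, "mse": bias**2 + var}
    return results

results = variance_estimator_mse()
for name, r in results.items():
    print(f"{name:>15}: bias={r['bias']:+.3f}  var={r['variance']:.3f}  mse={r['mse']:.3f}")
```

Under these assumptions the unbiased $n - 1$ version should show the largest MSE and the $n + 1$ version the smallest: a little bias buys a larger drop in variance.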


2. Maximum likelihood estimation

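The likelihood (i.i.d. data): $L(\theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$. The log-likelihood: $\ell(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$.

MLE: $\hat\theta_{\text{MLE}} = \arg\max_\theta \ell(\theta)$.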

Properties of MLE

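  • Consistent: $\hat\theta_{\text{MLE}} \to \theta_0$ in probability (under regularity conditions), where $\theta_0$ is the true parameter.
  • Asymptotically normal: $\sqrt{n}\,(\hat\theta_{\text{MLE}} - \theta_0) \xrightarrow{d} \mathcal{N}\big(0,\, I(\theta_0)^{-1}\big)$, with $I$ the Fisher information of a single observation.
  • Asymptotically efficient: variance hits the CRLB.
  • Invariant to reparameterization: if $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$.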

Worked examples

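Bernoulli (coin flip): $X_i \sim \text{Bernoulli}(p)$. $\ell(p) = \sum_i \big[x_i \log p + (1 - x_i)\log(1 - p)\big]$. $\hat p = \frac{1}{n}\sum_i x_i$ (sample mean).

Gaussian (known variance): $X_i \sim \mathcal{N}(\mu, \sigma^2)$ with $\sigma^2$ known. $\hat\mu = \bar x = \frac{1}{n}\sum_i x_i$.

Gaussian (both unknown): $\hat\mu = \bar x$, $\hat\sigma^2 = \frac{1}{n}\sum_i (x_i - \bar x)^2$, which is biased! The unbiased estimator uses $n - 1$ in the denominator.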
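For reference, the Gaussian case worked through with the usual log-likelihood plus zero-derivative routine (a sketch, treating $\sigma^2$ itself as the parameter):

```latex
\begin{align*}
\ell(\mu, \sigma^2) &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 \\
\frac{\partial \ell}{\partial \mu} &= \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0
  \;\Rightarrow\; \hat\mu = \bar{x} \\
\frac{\partial \ell}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0
  \;\Rightarrow\; \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \\
\mathbb{E}[\hat\sigma^2] &= \frac{n - 1}{n}\,\sigma^2 \quad \text{(hence the bias, and the } n - 1 \text{ fix)}
\end{align*}
```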


3. Confidence intervals

A $(1 - \alpha)$ CI is a random interval $[L(X), U(X)]$ with $P\big(L(X) \le \theta \le U(X)\big) = 1 - \alpha$ over repeated sampling.

Common misinterpretation: "There's a 95% probability $\theta$ is in [1.2, 3.4]." Wrong (under the frequentist interpretation). $\theta$ is fixed; the interval is random. The correct statement: "If we repeated this procedure many times, 95% of the intervals would contain $\theta$."

Wald CI (asymptotic)

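For an asymptotically normal estimator:

$$\hat\theta \pm z_{1 - \alpha/2} \cdot \widehat{\mathrm{SE}}(\hat\theta)$$

with $z_{0.975} \approx 1.96$ for 95%. The standard error comes from the Fisher information or the sample variance.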
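A minimal sketch of the Wald interval for a proportion such as test-set accuracy, assuming i.i.d. examples and $\widehat{\mathrm{SE}} = \sqrt{\hat p (1 - \hat p)/n}$. The function name `wald_ci_proportion` and the 870-out-of-1000 figures are illustrative, not taken from any particular library or dataset.

```python
import math
from statistics import NormalDist

def wald_ci_proportion(successes, n, confidence=0.95):
    """Wald (normal-approximation) CI for a binomial proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)              # plug-in standard error
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    return p_hat - z * se, p_hat + z * se

# Hypothetical example: 870 correct predictions out of 1000 test examples.
lo, hi = wald_ci_proportion(870, 1000)
print(f"accuracy 95% Wald CI: [{lo:.3f}, {hi:.3f}]")
```

The Wald interval is known to behave poorly when $\hat p$ is near 0 or 1 or when $n$ is small; in those cases a Wilson interval or a bootstrap is usually the safer choice.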

Bootstrap CI

When you can't compute the SE analytically: resample the data with replacement $B$ times, compute $\hat\theta^*_b$ for each resample, then take quantiles (percentile method) or use bootstrap-t.

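import numpy as np
rng = np.random.default_rng(0)
X, T, B, alpha = rng.normal(size=200), np.mean, 2000, 0.05  # illustrative inputs
thetas = np.empty(B)
for b in range(B):
    X_b = rng.choice(X, size=len(X), replace=True)  # resample with replacement (size n)
    thetas[b] = T(X_b)                              # statistic on the resample
ci = np.quantile(thetas, [alpha / 2, 1 - alpha / 2])  # percentile CI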

Bootstrap is non-parametric, simple, and extremely useful in ML for things like AUC confidence intervals.

Bayesian credible interval

The interval that contains 95% of the posterior probability mass. It is a different concept from a frequentist CI such as the Wald interval: the credible interval does support the natural-language reading "$\theta$ is in [...] with 95% probability", conditional on the prior and the model.


4. Hypothesis testing

Testing a claim $H_0$ (the null hypothesis) vs an alternative $H_1$.

Components

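  • Test statistic $T(X)$: function of data.
  • Null distribution: distribution of $T$ under $H_0$.
  • Rejection region: values of $T$ where we reject $H_0$.
  • Significance level $\alpha$: $P(\text{reject } H_0 \mid H_0 \text{ true})$ (Type I error).
  • Power $1 - \beta$: $P(\text{reject } H_0 \mid H_1 \text{ true})$.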

p-value

$p$-value = $P(T \ge t_{\text{obs}} \mid H_0)$: the probability of seeing data at least this extreme if $H_0$ is true.

Common interpretation errors:

  • The $p$-value is NOT $P(H_0 \mid \text{data})$.
  • A small $p$-value doesn't mean a large effect, just that the observed data would be unlikely under $H_0$.
  • $p > 0.05$ doesn't prove $H_0$; it only signals a lack of evidence against it.

Standard tests

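z-test: Gaussian, known variance. $z = \dfrac{\bar x - \mu_0}{\sigma / \sqrt{n}}$.

t-test: Gaussian, unknown variance. Use the sample SD $s$ in place of $\sigma$; under $H_0$ the statistic follows $t_{n-1}$.

Chi-squared: categorical data goodness-of-fit, contingency tables. $\chi^2 = \sum_i \dfrac{(O_i - E_i)^2}{E_i}$, with observed counts $O_i$ and expected counts $E_i$.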

Mann-Whitney U / Wilcoxon: non-parametric two-sample.

A/B test (proportions): binomial / two-proportion z-test.
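A minimal sketch of the two-proportion z-test used for A/B conversion rates, assuming independent arms, large samples, and a pooled standard error under $H_0: p_A = p_B$. The function name and the counts in the example are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x_a, n_a, x_b, n_b):
    """Two-sided z-test for H0: p_a == p_b using the pooled proportion."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)                   # estimate of the common p under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))         # two-sided tail probability
    return z, p_value

# Hypothetical A/B test: 1000/10000 conversions in control, 1100/10000 in treatment.
z, p = two_proportion_z_test(1000, 10_000, 1100, 10_000)
print(f"z = {z:.2f}, p-value = {p:.4f}")
```

Reporting the estimated lift and its CI alongside the $p$-value keeps effect size and significance separate.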

Type I vs Type II

  • Type I (false positive): reject $H_0$ when it is true. Controlled by $\alpha$.
  • Type II (false negative): fail to reject $H_0$ when $H_1$ is true. Its probability $\beta$ depends on effect size, $n$, and $\alpha$.

Power analysis picks $n$ to achieve a target power $1 - \beta$ (typically 80%) for a minimum detectable effect.
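A rough sketch of a power calculation for comparing two proportions, using the standard normal-approximation formula $n \approx (z_{1-\alpha/2} + z_{1-\beta})^2 \,[p_1(1-p_1) + p_2(1-p_2)] / (p_2 - p_1)^2$ per arm. The function name, the 10% baseline, and the 1-point minimum detectable effect are assumptions for illustration.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p_base, mde, alpha=0.05, power=0.80):
    """Approximate per-arm n for a two-sided two-proportion z-test."""
    p_new = p_base + mde                                  # rate under the alternative
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)         # critical value for the test
    z_beta = NormalDist().inv_cdf(power)                  # quantile for the target power
    var_sum = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * var_sum / mde ** 2)

# Hypothetical: 10% baseline conversion, want to detect an absolute +1 point lift.
n_per_arm = sample_size_two_proportions(0.10, 0.01)
print(f"need about {n_per_arm} users per arm")
```

Because $n$ scales with $1/\text{MDE}^2$, halving the minimum detectable effect roughly quadruples the required sample.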


5. Multiple testing

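When you run $m$ tests, each at level $\alpha$, the family-wise probability of at least one false rejection grows: under independence (and with every null true) it is $1 - (1 - \alpha)^m \approx m\alpha$ when $m\alpha$ is small. With $m = 20$ tests at $\alpha = 0.05$, you expect 1 false positive, and the chance of at least one is about 64%.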

Corrections

  • Bonferroni: use $\alpha / m$ per test. Conservative; controls family-wise error rate (FWER).
  • Holm-Bonferroni: step-down version — less conservative.
  • Benjamini-Hochberg: controls false discovery rate (FDR = expected proportion of false positives among rejections). Less conservative; standard in genomics, A/B testing at scale.
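The three corrections above can be sketched in a few lines of NumPy. This is an illustrative implementation that returns a boolean reject mask in the original order of the $p$-values; the function names and the example $p$-values are made up.

```python
import numpy as np

def bonferroni(p, alpha=0.05):
    """Reject p_i <= alpha / m. Controls FWER."""
    p = np.asarray(p, dtype=float)
    return p <= alpha / p.size

def holm(p, alpha=0.05):
    """Holm step-down: compare sorted p-values to alpha/m, alpha/(m-1), ..., alpha."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= alpha / (m - np.arange(m))
    # Stop at the first failure; nothing after it is rejected.
    n_reject = m if passed.all() else int(np.argmin(passed))
    reject = np.zeros(m, dtype=bool)
    reject[order[:n_reject]] = True
    return reject

def benjamini_hochberg(p, alpha=0.05):
    """BH step-up: reject up to the largest rank k with p_(k) <= alpha * k / m. Controls FDR."""
    p = np.asarray(p, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    n_reject = int(np.max(np.nonzero(passed)[0])) + 1 if passed.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:n_reject]] = True
    return reject

# Made-up p-values from 10 tests.
pvals = [0.001, 0.0055, 0.012, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
for name, fn in [("Bonferroni", bonferroni), ("Holm", holm), ("BH", benjamini_hochberg)]:
    print(f"{name:>10}: {int(fn(pvals).sum())} rejections")
```

On this example the number of rejections grows from Bonferroni to Holm to BH, matching the ordering from most to least conservative.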

When this matters in ML

  • Hyperparameter search: 100 hyperparam combos → some "win" by luck.
  • Many A/B tests on the same data: false positives.
  • Feature selection: testing each feature for significance inflates Type I.
  • Subgroup analysis ("but the model works better for users in California!") — almost always overstated without correction.

6. The bootstrap — workhorse for ML

The bootstrap (Efron 1979) lets you estimate sampling distributions when you can't derive them analytically.

Recipe (non-parametric bootstrap):

  1. Resample $X^*$ from your data with replacement, size $n$.
  2. Compute $\hat\theta^* = T(X^*)$.
  3. Repeat $B$ times (typically 1000–10000).
  4. The empirical distribution of $\hat\theta^*_1, \dots, \hat\theta^*_B$ approximates the sampling distribution of $\hat\theta$.

What you can do:

  • SE estimate: SD of the bootstrap distribution.
  • CI: quantiles (percentile method) or bias-corrected accelerated (BCa).
  • Hypothesis test: reject $H_0$ if the observed value falls in the tail of a bootstrap distribution generated under $H_0$.

Bootstrap in ML practice:

  • AUC CI: bootstrap test set predictions.
  • Model comparison: paired bootstrap of metric differences.
  • Random forest internals: bagging is bootstrapping.
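A sketch of the AUC confidence interval mentioned above, using the percentile bootstrap over test-set rows. AUC is computed directly as $P(\text{score}_{+} > \text{score}_{-})$ with ties counted as 1/2, so only NumPy is needed. The function names and the synthetic labels and scores are assumptions for illustration.

```python
import numpy as np

def auc(y_true, scores):
    """AUC as P(score_pos > score_neg), counting ties as 1/2."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]            # all positive-negative score pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)

def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC; returns (point estimate, lower, upper)."""
    rng = np.random.default_rng(seed)
    n = y_true.size
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, n, size=n)          # resample test rows with replacement
        y_b = y_true[idx]
        if y_b.min() == y_b.max():                # AUC is undefined with a single class
            continue
        stats.append(auc(y_b, scores[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return auc(y_true, scores), lo, hi

# Synthetic test set (illustrative only): positives score higher on average.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=400)
s = rng.normal(loc=y * 1.0, scale=1.0)
point, lo, hi = bootstrap_auc_ci(y, s)
print(f"AUC = {point:.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```

For a paired model comparison, the same `idx` would be applied to both models' scores in each resample and the CI taken over the AUC differences; an interval that excludes 0 is evidence of a real gap.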

Limitations:

  • Doesn't work for extreme order statistics (e.g., min/max).
  • Doesn't work well for time series without block bootstrap.
  • Computationally expensive for large $n$ or costly statistics, since the statistic is recomputed $B$ times.

7. Bayesian inference

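Frequentist: $\theta$ is fixed, data is random. Bayesian: $\theta$ has a probability distribution, updated from data $x$ by Bayes' rule: $p(\theta \mid x) = \dfrac{p(x \mid \theta)\, p(\theta)}{p(x)}$.

  • $p(\theta)$: prior, your belief before seeing data.
  • $p(x \mid \theta)$: likelihood, same as in MLE.
  • $p(\theta \mid x)$: posterior, the updated belief.
  • $p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta$: marginal likelihood / evidence.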

Conjugate priors

Posterior in the same family as prior. Examples:

  • Beta prior + Bernoulli likelihood → Beta posterior.
  • Gamma prior + Poisson likelihood → Gamma posterior.
  • Dirichlet prior + multinomial likelihood → Dirichlet posterior.
  • Gaussian prior + Gaussian likelihood (known variance) → Gaussian posterior.

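Beta-Bernoulli example: prior $p \sim \text{Beta}(\alpha, \beta)$. After observing $k$ successes in $n$ trials: posterior $\text{Beta}(\alpha + k,\; \beta + n - k)$. Posterior mean: $\dfrac{\alpha + k}{\alpha + \beta + n}$.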
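A small sketch of the conjugate update, assuming a uniform Beta(1, 1) prior and made-up data (7 successes in 10 trials). The credible interval is read off Monte Carlo draws from the posterior so that only NumPy is needed; the function name is hypothetical.

```python
import numpy as np

def beta_bernoulli_posterior(k, n, a=1.0, b=1.0):
    """Posterior Beta parameters after k successes in n trials with a Beta(a, b) prior."""
    return a + k, b + (n - k)

a_post, b_post = beta_bernoulli_posterior(k=7, n=10)      # uniform Beta(1, 1) prior
post_mean = a_post / (a_post + b_post)                      # (a + k) / (a + b + n)

# Equal-tailed 95% credible interval from Monte Carlo draws of the posterior.
rng = np.random.default_rng(0)
draws = rng.beta(a_post, b_post, size=200_000)
lo, hi = np.quantile(draws, [0.025, 0.975])
print(f"posterior Beta({a_post:g}, {b_post:g}), mean = {post_mean:.3f}, "
      f"95% credible interval = [{lo:.3f}, {hi:.3f}]")
```

The posterior mean $8/12 \approx 0.667$ sits between the prior mean (0.5) and the MLE (0.7), which is the usual shrinkage effect of the prior.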

MAP

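Maximum a posteriori: $\hat\theta_{\text{MAP}} = \arg\max_\theta p(\theta \mid x) = \arg\max_\theta \big[\log p(x \mid \theta) + \log p(\theta)\big]$.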

This is exactly MLE + log-prior penalty. The penalty is the regularizer.

  • Gaussian prior on weights → $L_2$ regularization (ridge).
  • Laplace prior → $L_1$ regularization (lasso).
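To see where the penalties come from, here is a sketch for linear regression, assuming Gaussian noise $y_i = w^\top x_i + \varepsilon_i$, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$, and independent zero-mean priors on the weights (the scale symbols $\tau$ and $b$ are introduced here for illustration):

```latex
\begin{align*}
\hat{w}_{\text{MAP}}
  &= \arg\max_{w}\Big[\log p(y \mid X, w) + \log p(w)\Big] \\
\text{Gaussian prior } w \sim \mathcal{N}(0, \tau^2 I):\quad
\hat{w}_{\text{MAP}}
  &= \arg\min_{w}\Big[\tfrac{1}{2\sigma^2}\sum_{i}(y_i - w^\top x_i)^2 + \tfrac{1}{2\tau^2}\lVert w\rVert_2^2\Big] \\
  &= \arg\min_{w}\Big[\sum_{i}(y_i - w^\top x_i)^2 + \lambda\lVert w\rVert_2^2\Big],
  \qquad \lambda = \frac{\sigma^2}{\tau^2} \\
\text{Laplace prior } p(w_j) \propto e^{-|w_j|/b}:\quad
\hat{w}_{\text{MAP}}
  &= \arg\min_{w}\Big[\sum_{i}(y_i - w^\top x_i)^2 + \lambda\lVert w\rVert_1\Big],
  \qquad \lambda = \frac{2\sigma^2}{b}
\end{align*}
```

A tighter prior (smaller $\tau$ or $b$) means a larger $\lambda$, i.e. stronger regularization.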

Posterior summaries

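  • Posterior mean: $\mathbb{E}[\theta \mid x] = \int \theta\, p(\theta \mid x)\, d\theta$
  • Posterior median, mode (MAP)
  • Credible interval: $[a, b]$ with $P(a \le \theta \le b \mid x) = 1 - \alpha$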

Bayesian inference in practice

  • Conjugate cases: closed-form (rare beyond simple models).
  • MCMC (Metropolis-Hastings, Gibbs, HMC): sample from posterior.
  • Variational inference: approximate posterior with simpler distribution; minimize KL.
  • Laplace approximation: Gaussian centered at MAP.
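As a toy illustration of the MCMC option, the sketch below runs random-walk Metropolis on the Beta-Bernoulli posterior, where the closed form is known and can serve as a check. The step size, burn-in fraction, data (7 successes in 10 trials), and function names are all assumptions chosen for the example, not tuned recommendations.

```python
import numpy as np

def log_posterior(p, k, n, a=1.0, b=1.0):
    """Unnormalized log posterior for Bernoulli data with a Beta(a, b) prior."""
    if not 0.0 < p < 1.0:
        return -np.inf                               # zero density outside (0, 1)
    return (a + k - 1) * np.log(p) + (b + n - k - 1) * np.log(1 - p)

def metropolis(k, n, n_samples=20_000, step=0.1, seed=0):
    """Random-walk Metropolis sampler for the success probability p."""
    rng = np.random.default_rng(seed)
    p = 0.5
    logp = log_posterior(p, k, n)
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = p + step * rng.normal()           # symmetric random-walk proposal
        logp_new = log_posterior(proposal, k, n)
        if np.log(rng.uniform()) < logp_new - logp:  # accept with prob min(1, ratio)
            p, logp = proposal, logp_new
        samples[i] = p
    return samples[n_samples // 4:]                  # drop the first quarter as burn-in

samples = metropolis(k=7, n=10)
print(f"MCMC posterior mean = {samples.mean():.3f} (closed form: {8 / 12:.3f})")
```

The sampler only needs the posterior up to a constant, which is why the evidence term never has to be computed.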

8. Common ML stats gotchas

| Mistake | Why it's wrong | Fix |
|---|---|---|
| "p > 0.05 → no effect" | Absence of evidence ≠ evidence of absence | Report effect size + CI |
| "p = 0.001 → big effect" | A small p can come from a precise estimate (large $n$), not a large effect | Report effect size separately |
| "Train/test gap shows generalization" | Single split is noisy | Cross-validation or bootstrap |
| "AUC = 0.85 vs 0.84 → better model" | Without a CI, the gap can be noise | Bootstrap CIs, paired tests |
| "Multiple A/B tests at $\alpha = 0.05$" | FWER blows up | Bonferroni / BH correction |
| "Use confidence interval as 'probability in interval'" | That's a credible interval | Be precise about interpretation |
| "MLE is always optimal" | Only asymptotically; can overfit, can be biased in finite samples | Consider MAP / regularization |
| "Bootstrap fixes any sample size problem" | Tiny $n$ → biased bootstrap | Need $n$ large enough for the empirical distribution to approximate the true one |

9. Eight most-asked interview questions

  1. What's the difference between a confidence interval and a credible interval? (Frequentist vs Bayesian; "interval random vs $\theta$ random.")
  2. Derive the MLE for a Gaussian. (Lock down log-likelihood + zero-derivative routine.)
  3. What does a p-value mean exactly? (Probability of data at least this extreme under $H_0$, NOT $P(H_0 \mid \text{data})$.)
  4. When would you use bootstrap? (No analytic SE, ML metrics like AUC, paired model comparison.)
  5. What's the bias-variance tradeoff for estimators? (MSE = bias² + variance; biased estimators can win.)
  6. Why use Bessel's correction ($n - 1$)? (Sample variance with denominator $n$ underestimates; $n - 1$ unbiases it.)
  7. What's MAP and how does it relate to regularization? (MLE + log-prior; Gaussian prior = $L_2$, Laplace = $L_1$.)
  8. You ran 20 A/B tests, two were significant at $\alpha = 0.05$. What do you do? (Multiple testing: apply Bonferroni or BH correction.)

10. Drill plan

  • For Bernoulli, Gaussian (both params), Poisson — derive MLE on paper. 5 minutes each.
  • For Beta-Bernoulli — derive posterior. Recite posterior mean.
  • Bootstrap loop in 30 lines of NumPy. AUC CI on a real dataset.
  • For each common test (z, t, chi-squared, two-prop), recite: assumptions, statistic, null distribution, when to use.
  • Interpret 5 different p-values and CI statements; flag the wrong ones.

11. Further reading

  • Casella & Berger, Statistical Inference — the canonical text.
  • Wasserman, All of Statistics — fast & broad, ML-friendly.
  • Efron & Hastie, Computer Age Statistical Inference — bootstrap, modern methods.
  • Gelman et al., Bayesian Data Analysis — Bayesian bible.
  • xkcd 882 (jelly beans) — the canonical multiple-testing comic.