Statistical Inference — Interview Grill

50 questions on estimators, MLE, CIs, bootstrap, hypothesis testing, Bayesian inference. Drill until you can answer 35+ cold.


A. Estimators

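1. What's an estimator? A function of the data that approximates an unknown parameter: $\hat\theta = T(X_1, \dots, X_n)$.

2. Define unbiased. $E[\hat\theta] = \theta$: on average across repeated samples, the estimator hits the true value.

3. Define consistent. $\hat\theta_n \xrightarrow{p} \theta$ as $n \to \infty$.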

4. Unbiased vs consistent: give an example of one but not the other. Using a single observation $X_1$ as the estimate of the mean: unbiased, not consistent. Estimator $\bar X_n + 1/n$: consistent but biased for any finite $n$.

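5. State the bias-variance decomposition for MSE. $\mathrm{MSE}(\hat\theta) = E[(\hat\theta - \theta)^2] = \mathrm{Bias}(\hat\theta)^2 + \mathrm{Var}(\hat\theta)$. Implication: biased estimators can have lower MSE than unbiased ones.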
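
A short derivation of the decomposition, in case an interviewer asks for it. The only trick is adding and subtracting $E[\hat\theta]$; the cross term vanishes because $E[\hat\theta - E[\hat\theta]] = 0$.

```latex
\begin{aligned}
E\big[(\hat\theta - \theta)^2\big]
  &= E\big[(\hat\theta - E[\hat\theta] + E[\hat\theta] - \theta)^2\big] \\
  &= E\big[(\hat\theta - E[\hat\theta])^2\big]
     + 2\,(E[\hat\theta] - \theta)\,E\big[\hat\theta - E[\hat\theta]\big]
     + (E[\hat\theta] - \theta)^2 \\
  &= \mathrm{Var}(\hat\theta) + 0 + \mathrm{Bias}(\hat\theta)^2 .
\end{aligned}
```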

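6. What's the Cramér-Rao lower bound? $\mathrm{Var}(\hat\theta) \ge \frac{1}{n I(\theta)}$ for an i.i.d. sample of size $n$, where $I(\theta)$ is the Fisher information of a single observation. Lower bound on variance for any unbiased estimator.

7. Why $n - 1$ in sample variance (Bessel's correction)? $\frac{1}{n}\sum_i (X_i - \bar X)^2$ underestimates $\sigma^2$ because $\bar X$ is closer to the data than $\mu$. Dividing by $n - 1$ corrects the bias.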
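
A quick simulation can make the bias concrete. The sketch below is illustrative only: the function names, the sample size of 5, and the standard normal data are assumptions chosen for the demo. It averages both variance estimators over many samples with true variance 1.

```python
import random

def variance_estimates(sample):
    """Return (divide-by-n, divide-by-(n-1)) variance estimates for one sample."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    return ss / n, ss / (n - 1)

def average_estimates(n=5, trials=20000, seed=0):
    """Average both estimators over many N(0, 1) samples (true variance = 1)."""
    rng = random.Random(seed)
    total_biased = total_unbiased = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        biased, unbiased = variance_estimates(sample)
        total_biased += biased
        total_unbiased += unbiased
    return total_biased / trials, total_unbiased / trials

biased_avg, unbiased_avg = average_estimates()
print(f"divide by n:   {biased_avg:.3f}  (theory: (n-1)/n = 0.8)")
print(f"divide by n-1: {unbiased_avg:.3f}  (theory: 1.0)")
```

With $n = 5$ the divide-by-$n$ estimator should average close to $(n-1)/n = 0.8$ of the true variance, while the corrected one should average close to 1.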


B. MLE

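8. Define MLE. $\hat\theta_{\mathrm{MLE}} = \arg\max_\theta \prod_i p(x_i \mid \theta) = \arg\max_\theta \sum_i \log p(x_i \mid \theta)$.

9. Derive MLE for Bernoulli. $\ell(p) = \sum_i x_i \log p + \left(n - \sum_i x_i\right) \log(1 - p)$. Set $\ell'(p) = 0$: $\hat p = \frac{1}{n}\sum_i x_i$.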
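
As a sanity check on the derivation, a brute-force grid search over the log-likelihood should land on the sample mean. This is a minimal sketch; the data values and the grid resolution are made up for illustration.

```python
import math

def bernoulli_log_likelihood(p, data):
    """Log-likelihood of i.i.d. Bernoulli(p) observations (each 0 or 1)."""
    k = sum(data)   # number of successes
    n = len(data)
    return k * math.log(p) + (n - k) * math.log(1 - p)

data = [1, 0, 1, 1, 0, 1, 0, 1]  # hypothetical coin flips: 5 successes in 8

# Evaluate the log-likelihood on a grid over (0, 1) and keep the maximizer.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, data))

p_closed_form = sum(data) / len(data)  # the MLE from the derivation
print(p_grid, p_closed_form)
```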

10. Derive MLE for Gaussian (mean and variance). $\hat\mu = \bar x = \frac{1}{n}\sum_i x_i$, $\hat\sigma^2 = \frac{1}{n}\sum_i (x_i - \bar x)^2$. The MLE for variance is biased; Bessel's correction unbiases it.

11. Why is MLE biased for variance but consistent? Bias is $-\sigma^2/n = O(1/n)$, which vanishes as $n \to \infty$, and the variance of the estimator also shrinks to zero. So MLE is consistent but not unbiased in finite samples.

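12. Asymptotic properties of MLE? Consistent, asymptotically normal: $\sqrt{n}\,(\hat\theta_n - \theta) \xrightarrow{d} \mathcal{N}(0, I(\theta)^{-1})$, asymptotically efficient (achieves CRLB).

13. Invariance of MLE: what is it? If $\hat\theta$ estimates $\theta$, then $g(\hat\theta)$ estimates $g(\theta)$. E.g., MLE of $\sigma$ is $\sqrt{\hat\sigma^2}$.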

14. When does MLE fail? Small samples (high variance, biased), unbounded likelihood (e.g., a Gaussian mixture component collapsing onto a single data point as its variance shrinks to zero), non-identifiable models.


C. Confidence intervals

15. Define a 95% confidence interval. A random interval $[L(X), U(X)]$ such that under repeated sampling, 95% of intervals contain $\theta$. Frequency interpretation, not "$\theta$ is in [...] with 95% probability."

16. Wald CI formula? $\hat\theta \pm 1.96 \cdot \widehat{\mathrm{SE}}(\hat\theta)$ for 95%. Relies on asymptotic normality.

17. CI vs credible interval? CI is frequentist: interval random, $\theta$ fixed. Credible interval is Bayesian: interval fixed, $\theta$ has posterior probability mass. A credible interval supports the natural "$\theta$ in [...] with 95% probability" interpretation.

18. How do you compute a bootstrap CI? Resample data with replacement $B$ times, compute $\hat\theta^*_b$ each time. CI = $[q_{0.025}, q_{0.975}]$ of $\{\hat\theta^*_b\}_{b=1}^{B}$ (percentile method).
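
The percentile method fits in a few lines of plain Python. This is a sketch under stated assumptions: the helper name, the data values, and the simple order-statistic quantile rule are illustrative choices, not a reference implementation.

```python
import random

def percentile_bootstrap_ci(data, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for `statistic` computed on `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Recompute the statistic on n_boot resamples drawn with replacement.
    stats = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    # Read off the alpha/2 and 1 - alpha/2 empirical quantiles.
    lo_idx = round(alpha / 2 * n_boot)
    hi_idx = round((1 - alpha / 2) * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]

def mean(xs):
    return sum(xs) / len(xs)

data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]  # hypothetical sample
ci_low, ci_high = percentile_bootstrap_ci(data, mean)
print(f"mean = {mean(data):.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

Any statistic that maps a list to a number (median, trimmed mean, a metric) can be passed in place of `mean`.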

19. When can a CI go negative for a positive quantity? When the CI is constructed without constraints (e.g., a Wald CI for a probability close to 0 can dip below 0; close to 1 it can exceed 1). Use a logit transform or bootstrap.
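
A minimal sketch of the logit-transform fix, with a hypothetical proportion of 2 successes in 100 trials. The interval is built on the log-odds scale using the delta-method standard error $1/\sqrt{n\hat p(1-\hat p)}$ and then mapped back, so it cannot leave $(0, 1)$. Function names are illustrative.

```python
import math

def wald_ci(p_hat, n, z=1.96):
    """Plain Wald interval for a proportion; can fall outside [0, 1]."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def logit_ci(p_hat, n, z=1.96):
    """Wald interval on the log-odds scale, mapped back to (0, 1)."""
    logit = math.log(p_hat / (1 - p_hat))
    se = 1 / math.sqrt(n * p_hat * (1 - p_hat))  # delta-method SE of the log-odds

    def expit(t):
        return 1 / (1 + math.exp(-t))

    return expit(logit - z * se), expit(logit + z * se)

p_hat, n = 0.02, 100  # hypothetical: 2 successes in 100 trials
print("Wald: ", wald_ci(p_hat, n))   # lower bound is negative
print("logit:", logit_ci(p_hat, n))  # stays inside (0, 1)
```

Note that the logit interval is undefined when $\hat p$ is exactly 0 or 1; this sketch does not handle that case.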


D. Hypothesis testing

20. State the components of a hypothesis test. Null $H_0$, alternative $H_1$, test statistic $T$, null distribution, rejection region, significance $\alpha$.

21. What's a p-value? Probability under $H_0$ of observing a test statistic at least as extreme as the one observed. NOT $P(H_0 \mid \text{data})$.

22. Why is "p < 0.05 means the result is true" wrong? The $p$-value isn't $P(H_0 \mid \text{data})$. With multiple tests, $p < 0.05$ alone is meaningless. Even with one test, low $p$ is "data is unlikely under $H_0$," not "$H_0$ is unlikely."

23. Type I vs Type II error? Type I: reject true $H_0$ (false positive, controlled by $\alpha$). Type II: fail to reject false $H_0$ (false negative, controlled by power $1 - \beta$).

24. What's statistical power? $P(\text{reject } H_0 \mid H_1 \text{ true}) = 1 - \beta$. Depends on effect size, $n$, $\alpha$, variance.
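
To see how power moves with effect size and $n$, here is a small sketch for the simplest case: a one-sided $z$-test with known $\sigma$. The function name and the example numbers are assumptions for illustration.

```python
import math
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided z-test of H0: mu = mu0 vs H1: mu = mu0 + effect.

    Assumes known sigma and effect >= 0.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)    # rejection threshold under H0
    shift = effect * math.sqrt(n) / sigma     # mean of the z statistic under H1
    return 1 - std_normal.cdf(z_crit - shift)

for n in (10, 25, 50, 100):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

With zero effect the "power" collapses to $\alpha$, which is a handy check that the formula is wired correctly.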

25. When do you use a t-test vs z-test? $z$-test: variance known (rare). $t$-test: variance estimated from sample (almost always).

26. When do you use a chi-squared test? Goodness-of-fit, contingency tables (test of independence). Categorical data. Statistic: $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$.
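
A minimal sketch of the statistic for a test of independence, where expected counts come from the row and column totals. The function name and the 2x2 table are made up for illustration.

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

table = [[30, 10],
         [20, 40]]  # hypothetical counts
stat, dof = chi_squared_independence(table)
print(f"chi2 = {stat:.3f}, dof = {dof}")
```

For 1 degree of freedom the 5% critical value is about 3.84, so this hypothetical table would reject independence.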

27. What's a one-sided vs two-sided test? One-sided: $H_1: \theta > \theta_0$ (or $\theta < \theta_0$). Two-sided: $H_1: \theta \ne \theta_0$. One-sided has more power but you must commit to direction a priori.

28. Paired vs unpaired t-test? Paired: same subjects measured twice (before/after). Unpaired: independent groups. Paired has more power because it removes between-subject variation.


E. Multiple testing

29. The multiple testing problem? With $m$ independent tests at level $\alpha$, FWER $= 1 - (1 - \alpha)^m \approx m\alpha$ for small $\alpha$. Run 20 tests at $\alpha = 0.05$, expect ~1 false positive even with no real effect.
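
The numbers are worth seeing once. This small sketch (the function name is illustrative) evaluates the FWER formula for independent tests, with and without dividing $\alpha$ by the number of tests.

```python
def fwer(m, alpha=0.05):
    """Probability of at least one false positive in m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 50):
    uncorrected = fwer(m)
    corrected = fwer(m, alpha=0.05 / m)  # each test run at alpha / m
    print(f"m={m:3d}  uncorrected={uncorrected:.3f}  alpha/m={corrected:.3f}")
```

At $m = 20$ the uncorrected FWER is already about 0.64, while testing each hypothesis at $\alpha/m$ keeps it just under 0.05.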

30. Bonferroni correction? Test each at $\alpha / m$ instead of $\alpha$. Controls FWER. Conservative; loses power.

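31. What's Benjamini-Hochberg? Controls false discovery rate (FDR = expected proportion of false positives among rejections). Order the p-values $p_{(1)} \le \dots \le p_{(m)}$; find the largest $k$ for which $p_{(k)} \le \frac{k}{m} q$, where $q$ is the target FDR level, and reject hypotheses $1, \dots, k$. Less conservative than Bonferroni.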
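
A minimal sketch of the Benjamini-Hochberg step-up rule in plain Python. The function name and the example p-values are illustrative assumptions. The detail people often miss is that everything up to the largest passing rank is rejected, even if smaller ranks fail their own threshold.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank  # remember the largest rank that passes its threshold
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

p_values = [0.02, 0.03, 0.035, 0.04]  # hypothetical p-values
print(benjamini_hochberg(p_values))                 # BH decisions
print([p <= 0.05 / len(p_values) for p in p_values])  # Bonferroni decisions
```

Here the smallest p-value (0.02) misses its own threshold of 0.0125, but the largest (0.04) passes its threshold of 0.05, so BH rejects all four while Bonferroni rejects none.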

32. FWER vs FDR — when each? FWER: when any false positive is bad (e.g., medical diagnosis). FDR: when discovery is exploratory and some false positives are tolerable (e.g., gene expression).

33. Where does multiple testing show up in ML? Hyperparameter sweeps, A/B test farms, feature selection (test each feature), subgroup analysis.


F. Bootstrap

34. What's the bootstrap? Resample data with replacement $B$ times to approximate the sampling distribution of an estimator. Non-parametric, simple, broadly applicable.

35. When does bootstrap fail? Extreme order statistics (min/max), heavy-tailed distributions without enough data, time series (without block bootstrap), very small $n$.

36. Bootstrap a confusion-matrix metric — how? Resample (predictions, labels) pairs with replacement. Compute metric on resample. Repeat 1000+ times. Quantiles of the resulting distribution give CI.

37. What's a paired bootstrap for model comparison? For each bootstrap sample, compute metric for both models on the same sample. Look at distribution of differences. Reject "no difference" if 0 not in CI.
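
The paired bootstrap can be sketched as follows, using accuracy as the metric. The per-example correctness vectors are synthetic and the function name is illustrative. The key point is that both models are scored on the same resampled indices.

```python
import random

def paired_bootstrap_diff(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(A) - accuracy(B) using shared resampled indices."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # one resample, used for both models
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    lo = diffs[round(alpha / 2 * n_boot)]
    hi = diffs[round((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic per-example correctness (1 = correct) on the same 200 test examples.
correct_a = [1] * 170 + [0] * 30   # model A: 85% accuracy
correct_b = [1] * 150 + [0] * 50   # model B: 75% accuracy
diff_lo, diff_hi = paired_bootstrap_diff(correct_a, correct_b)
print(f"95% CI for accuracy difference: ({diff_lo:.3f}, {diff_hi:.3f})")
```

If the interval excludes 0, the "no difference" hypothesis is rejected at roughly the 5% level.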

38. Bagging is bootstrap of what? Bagging = "Bootstrap Aggregating." Train each tree on a bootstrap resample of data. Random Forests add feature subsampling.


G. Bayesian inference

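39. State Bayes' theorem. $P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$.

40. What's a conjugate prior? Example? A prior whose posterior stays in the same family. Beta-Bernoulli: prior $\mathrm{Beta}(\alpha, \beta)$ + $s$ successes / $f$ failures → posterior $\mathrm{Beta}(\alpha + s, \beta + f)$.

41. Beta-Bernoulli posterior mean? $\frac{\alpha + s}{\alpha + \beta + s + f}$. Smoothing: prior acts like $\alpha + \beta$ "pseudo-observations."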
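
The conjugate update is just addition, which a tiny sketch makes obvious. The function names, the Beta(2, 2) prior, and the counts are hypothetical.

```python
def beta_bernoulli_posterior(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing Bernoulli outcomes."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (2, 2)                    # hypothetical prior: 4 pseudo-observations centered at 0.5
successes, failures = 7, 3        # hypothetical data
posterior = beta_bernoulli_posterior(*prior, successes, failures)

mle = successes / (successes + failures)
print("posterior:", posterior)
print("posterior mean:", round(beta_mean(*posterior), 4), "vs MLE:", mle)
```

The posterior mean (9/14) sits between the prior mean (0.5) and the MLE (0.7), which is the smoothing effect in action.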

42. What's MAP estimation? $\hat\theta_{\mathrm{MAP}} = \arg\max_\theta \left[\log p(D \mid \theta) + \log p(\theta)\right]$. MLE + log-prior penalty.

43. Connection between MAP and regularization? Gaussian prior on weights → $L_2$ penalty (ridge). Laplace prior → $L_1$ penalty (lasso). Regularization is MAP with a particular prior.
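
A worked version of the Gaussian case, as a sketch under assumed notation that is not in the original: linear model $y_i \mid x_i, w \sim \mathcal{N}(w^\top x_i, \sigma^2)$ with prior $w \sim \mathcal{N}(0, \tau^2 I)$. Taking the negative log-posterior and dropping constants gives the ridge objective with $\lambda = \sigma^2 / \tau^2$.

```latex
\begin{aligned}
-\log p(w \mid D)
  &= \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^\top x_i)^2
     + \frac{1}{2\tau^2} \lVert w \rVert_2^2 + \text{const} \\
\hat w_{\mathrm{MAP}}
  &= \arg\min_w \; \sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda \lVert w \rVert_2^2,
  \qquad \lambda = \frac{\sigma^2}{\tau^2}.
\end{aligned}
```

A tighter prior (smaller $\tau^2$) means a larger $\lambda$, i.e. stronger shrinkage. Swapping the Gaussian prior for a Laplace prior replaces $\lVert w \rVert_2^2$ with $\lVert w \rVert_1$.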

44. What's the marginal likelihood / evidence and why does it matter? $p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$. Used for Bayesian model comparison (Bayes factors). Hard to compute in general.

45. MCMC vs variational inference? MCMC: sample from posterior; asymptotically exact, slow. VI: approximate posterior with a simpler distribution by minimizing KL; biased, fast. ML practitioners usually use VI when scale matters.


H. Practical ML stats

46. You report a model AUC of 0.85. How do you give it a CI? Bootstrap the test set 1000+ times; compute AUC on each; take 2.5%/97.5% quantiles.
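
A dependency-free sketch of that recipe. AUC is computed here by direct pairwise comparison (the Mann-Whitney form), which is fine for small test sets; the synthetic labels and scores and the function names are assumptions for illustration. Resamples that contain only one class are skipped because AUC is undefined for them.

```python
import random

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC, resampling (label, score) pairs together."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # need both classes for AUC to be defined
            aucs.append(auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    return aucs[round(alpha / 2 * n_boot)], aucs[round((1 - alpha / 2) * n_boot) - 1]

# Synthetic test set: positives score higher on average than negatives.
data_rng = random.Random(1)
labels = [i % 2 for i in range(100)]
scores = [data_rng.gauss(float(y), 1.0) for y in labels]

point = auc(labels, scores)
auc_lo, auc_hi = bootstrap_auc_ci(labels, scores)
print(f"AUC = {point:.3f}, 95% CI = ({auc_lo:.3f}, {auc_hi:.3f})")
```

For large test sets a rank-based AUC is much faster than the pairwise loop used here.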

47. Two models: AUC 0.85 vs 0.84. Is the difference significant? Paired bootstrap of AUC differences. CI for difference; reject "no difference" if 0 not in CI. Or DeLong's test for AUC specifically.

48. You run 50 A/B tests and 3 are "significant" at $\alpha = 0.05$. Are any real? Possibly not: $50 \times 0.05 = 2.5$ false positives are expected by chance alone if every null is true. Apply Bonferroni ($\alpha / 50 = 0.001$) or BH correction.

49. Model accuracy = 87% on test set of 1000. CI? Wald: $0.87 \pm 1.96 \sqrt{\frac{0.87 \cdot 0.13}{1000}} \approx 0.87 \pm 0.021$. Or Wilson interval (better for proportions). Or bootstrap.
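
Both intervals are one-liners in closed form. This sketch (function names are illustrative) computes them for the numbers in the question.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

wald = wald_interval(0.87, 1000)
wilson = wilson_interval(0.87, 1000)
print("Wald:  ", tuple(round(x, 4) for x in wald))
print("Wilson:", tuple(round(x, 4) for x in wilson))
```

At $n = 1000$ and $\hat p = 0.87$ the two agree to about two decimal places. They diverge for small $n$ or $\hat p$ near 0 or 1, where Wilson stays inside $[0, 1]$ and Wald may not.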

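50. Train accuracy 95%, test 87%. Statistically significant gap? Compute a CI for each accuracy and for their difference. If the individual CIs overlap heavily, the gap might be noise. Better: bootstrap the difference directly (resampling the train and test sets independently, since they contain different examples), or evaluate on multiple test splits.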


Quick fire

51. MLE for Bernoulli? Sample mean.
52. Bessel's correction divisor? $n - 1$.
53. 95% z-value? 1.96.
54. CRLB lower-bounds what? Variance of unbiased estimator.
55. Conjugate of Bernoulli? Beta.
56. Conjugate of Poisson? Gamma.
57. Conjugate of multinomial? Dirichlet.
58. Bonferroni: divide $\alpha$ by? Number of tests $m$.
59. MAP equals MLE when? Uniform prior.
60. CLT statement? Sample mean is asymptotically Gaussian regardless of underlying distribution (with finite variance).


Self-grading

If you can't answer 1-15, you don't know basic statistics. If you can't answer 16-35, you'll get tripped up on every interview that probes ML evaluation rigor. If you can't answer 36-50, frontier-lab interviews on experimental rigor will go past you.

Aim for 40+/60 cold.