Statistical Inference — Interview Grill
50 questions on estimators, MLE, CIs, bootstrap, hypothesis testing, Bayesian inference. Drill until you can answer 35+ cold.
A. Estimators
1. What's an estimator? A function of the data that approximates an unknown parameter: $\hat\theta = g(X_1, \dots, X_n)$.
2. Define unbiased. $\mathbb{E}[\hat\theta] = \theta$: on average across repeated samples, the estimator hits the true value.
3. Define consistent. $\hat\theta_n \xrightarrow{p} \theta$ as $n \to \infty$.
4. Unbiased vs consistent: give an example of one but not the other. Using only the first observation, $\hat\mu = X_1$: unbiased, not consistent (its variance never shrinks). Estimator $\bar{X}_n + 1/n$: consistent but biased for any finite $n$.
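5. State the bias-variance decomposition for MSE. $\mathrm{MSE}(\hat\theta) = \mathbb{E}[(\hat\theta - \theta)^2] = \mathrm{Bias}(\hat\theta)^2 + \mathrm{Var}(\hat\theta)$. Implication: biased estimators can have lower MSE than unbiased ones.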
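6. What's the Cramér-Rao lower bound? $\mathrm{Var}(\hat\theta) \ge 1 / I_n(\theta)$, where $I_n(\theta)$ is the Fisher information of the sample ($I_n = n I_1$ for $n$ i.i.d. observations). Lower bound on variance for any unbiased estimator.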
7. Why $n - 1$ in sample variance (Bessel's correction)? $\frac{1}{n}\sum_i (X_i - \bar{X})^2$ underestimates $\sigma^2$ because $\bar{X}$ is closer to the data than $\mu$. Dividing by $n - 1$ corrects the bias.
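A quick simulation can make Bessel's correction concrete. This is a minimal sketch, assuming standard normal data (true variance 1) and a hypothetical sample size of 5; the helper names are illustrative only.

```python
import random

def variance_estimates(sample):
    """Return (divide-by-n, divide-by-(n-1)) variance estimates."""
    n = len(sample)
    mean = sum(sample) / n
    sum_sq = sum((x - mean) ** 2 for x in sample)
    return sum_sq / n, sum_sq / (n - 1)

def average_estimates(n=5, trials=20000, seed=0):
    """Average both estimators over many N(0, 1) samples of size n."""
    rng = random.Random(seed)
    biased_total = 0.0
    unbiased_total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        biased, unbiased = variance_estimates(sample)
        biased_total += biased
        unbiased_total += unbiased
    return biased_total / trials, unbiased_total / trials

biased_avg, unbiased_avg = average_estimates()
# Theory: E[divide-by-n] = (n - 1) / n = 0.8, E[divide-by-(n-1)] = 1.0
print(f"divide by n:     {biased_avg:.3f}")
print(f"divide by n - 1: {unbiased_avg:.3f}")
```

The divide-by-$n$ average should land near $(n-1)/n$ of the true variance, which is the bias the correction removes.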
B. MLE
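8. Define MLE. $\hat\theta_{\mathrm{MLE}} = \arg\max_\theta \prod_{i=1}^n p(x_i \mid \theta) = \arg\max_\theta \sum_{i=1}^n \log p(x_i \mid \theta)$.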
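9. Derive MLE for Bernoulli. $\ell(p) = \sum_i [x_i \log p + (1 - x_i) \log(1 - p)]$. Set $\partial \ell / \partial p = 0$: $\hat{p} = \frac{1}{n}\sum_i x_i$.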
10. Derive MLE for Gaussian (mean and variance). $\hat\mu = \bar{x} = \frac{1}{n}\sum_i x_i$, $\hat\sigma^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2$. The MLE for variance is biased; Bessel's correction unbiases it.
11. Why is MLE biased for variance but consistent? Bias is $-\sigma^2/n = O(1/n)$, which vanishes as $n \to \infty$, and the estimator's variance shrinks too. So MLE is consistent but not unbiased in finite samples.
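12. Asymptotic properties of MLE? Consistent, asymptotically normal: $\sqrt{n}(\hat\theta_n - \theta) \xrightarrow{d} \mathcal{N}(0, I_1(\theta)^{-1})$ with $I_1$ the per-observation Fisher information, asymptotically efficient (achieves CRLB).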
13. Invariance of MLE: what is it? If $\hat\theta$ is the MLE of $\theta$, then $g(\hat\theta)$ is the MLE of $g(\theta)$. E.g., the MLE of $\sigma$ is $\sqrt{\hat\sigma^2}$.
14. When does MLE fail? Small samples (high variance, biased), unbounded likelihood (e.g., a Gaussian mixture where one component's variance shrinks to zero around a single data point), non-identifiable models.
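The Bernoulli derivation can be sanity-checked numerically. The sketch below, on a small hypothetical dataset, maximizes the log-likelihood over a grid of $p$ values and compares the result with the closed-form answer (the sample mean).

```python
import math

def bernoulli_log_likelihood(p, data):
    """Log-likelihood of i.i.d. Bernoulli(p) observations (0 < p < 1)."""
    k = sum(data)          # number of successes
    n = len(data)
    return k * math.log(p) + (n - k) * math.log(1 - p)

# Hypothetical data: 7 successes in 10 trials.
data = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

# Brute-force search over p in {0.001, 0.002, ..., 0.999}.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, data))

# Closed form from setting the derivative of the log-likelihood to zero.
p_closed_form = sum(data) / len(data)

print(p_grid, p_closed_form)
```

A grid search is only for illustration; for models without a closed form, a numerical optimizer on the negative log-likelihood plays the same role.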
C. Confidence intervals
15. Define a 95% confidence interval. A random interval $[L(X), U(X)]$ such that under repeated sampling, 95% of intervals contain $\theta$. Frequency interpretation, not "$\theta$ is in [...] with 95% probability."
16. Wald CI formula? $\hat\theta \pm 1.96 \cdot \widehat{\mathrm{SE}}(\hat\theta)$ for 95%. Relies on asymptotic normality.
17. CI vs credible interval? CI is frequentist: interval random, $\theta$ fixed. Credible interval is Bayesian: interval fixed, $\theta$ has posterior probability mass. A credible interval supports the natural "$\theta$ in [...] with 95% probability" interpretation.
18. How do you compute a bootstrap CI? Resample data with replacement $B$ times, compute $\hat\theta^*_b$ each time. CI = the 2.5% and 97.5% quantiles of $\{\hat\theta^*_b\}_{b=1}^B$ (percentile method).
19. When can a CI go negative for a positive quantity? When the CI is constructed without respecting the parameter's range (e.g., a Wald CI for a probability close to 0 can dip below 0; close to 1 it can exceed 1). Use a logit transform, a Wilson interval, or bootstrap.
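The percentile bootstrap from question 18 can be sketched in a few lines. This is an illustrative version under stated assumptions: synthetic Gaussian data, the mean as the statistic, and a simple index-based quantile rule; `percentile_bootstrap_ci` is a hypothetical helper name.

```python
import random

def percentile_bootstrap_ci(data, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-method bootstrap CI for statistic(data)."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Draw n points with replacement from the observed data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(statistic(resample))
    estimates.sort()
    lo_idx = round(alpha / 2 * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]

def mean(xs):
    return sum(xs) / len(xs)

# Synthetic sample (assumption): 200 draws from N(10, 2^2).
data_rng = random.Random(1)
sample = [data_rng.gauss(10.0, 2.0) for _ in range(200)]

low, high = percentile_bootstrap_ci(sample, mean)
print(f"mean = {mean(sample):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Any function of a list can be passed as `statistic` (median, a trimmed mean, and so on), which is the main appeal over the Wald formula.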
D. Hypothesis testing
20. State the components of a hypothesis test. Null $H_0$, alternative $H_1$, test statistic $T$, null distribution, rejection region, significance level $\alpha$.
21. What's a p-value? Probability under $H_0$ of observing a test statistic at least as extreme as the one observed. NOT $P(H_0 \mid \text{data})$.
22. Why is "p < 0.05 means the result is true" wrong? The $p$-value isn't $P(H_0 \mid \text{data})$. With multiple tests, $p < 0.05$ alone is meaningless. Even with one test, low $p$ is "data is unlikely under $H_0$," not "$H_0$ is unlikely."
23. Type I vs Type II error? Type I: reject true $H_0$ (false positive, rate controlled by $\alpha$). Type II: fail to reject false $H_0$ (false negative, rate $\beta$; power is $1 - \beta$).
24. What's statistical power? $1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ true})$. Depends on effect size, $n$, $\alpha$, variance.
25. When do you use a t-test vs z-test? $z$-test: variance known (rare). $t$-test: variance estimated from sample (almost always).
26. When do you use a chi-squared test? Goodness-of-fit, contingency tables (test of independence). Categorical data. Statistic: $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, with observed counts $O_i$ and expected counts $E_i$.
27. What's a one-sided vs two-sided test? One-sided: $H_1: \theta > \theta_0$ (or $\theta < \theta_0$). Two-sided: $H_1: \theta \ne \theta_0$. One-sided has more power but you must commit to direction a priori.
28. Paired vs unpaired t-test? Paired: same subjects measured twice (before/after). Unpaired: independent groups. Paired has more power because it removes between-subject variation.
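The power gain from pairing (question 28) shows up directly in the test statistics. The sketch below uses made-up before/after measurements where subjects differ a lot from each other but each improves by about 1 unit; it computes only the t statistics, not p-values, to stay within the standard library.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """t statistic for the mean of within-subject differences."""
    diffs = [a - b for a, b in zip(after, before)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se

def unpaired_t_statistic(group_a, group_b):
    """Welch-style t statistic treating the two groups as independent."""
    se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )
    return (statistics.mean(group_b) - statistics.mean(group_a)) / se

# Hypothetical data: large between-subject spread, small consistent shift.
before = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
after = [11.0, 12.8, 15.2, 16.9, 19.1, 21.0]

t_paired = paired_t_statistic(before, after)
t_unpaired = unpaired_t_statistic(before, after)
print(f"paired t = {t_paired:.2f}, unpaired t = {t_unpaired:.2f}")
```

The same mean shift gives a large paired statistic and a small unpaired one, because the unpaired standard error still contains the between-subject variation.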
E. Multiple testing
29. The multiple testing problem? With $m$ independent tests at level $\alpha$, FWER $= 1 - (1 - \alpha)^m \approx m\alpha$ for small $\alpha$. Run 20 tests at $\alpha = 0.05$, expect ~1 false positive even with no real effect.
30. Bonferroni correction? Test each at $\alpha/m$ instead of $\alpha$. Controls FWER. Conservative; loses power.
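31. What's Benjamini-Hochberg? Controls false discovery rate (FDR = expected proportion of false positives among rejections). Order p-values $p_{(1)} \le \dots \le p_{(m)}$; find the largest $k$ for which $p_{(k)} \le \frac{k}{m}\alpha$ and reject the $k$ hypotheses with the smallest p-values. Less conservative than Bonferroni.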
32. FWER vs FDR — when each? FWER: when any false positive is bad (e.g., medical diagnosis). FDR: when discovery is exploratory and some false positives are tolerable (e.g., gene expression).
33. Where does multiple testing show up in ML? Hyperparameter sweeps, A/B test farms, feature selection (test each feature), subgroup analysis.
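Both corrections are short enough to write out. The following is a minimal sketch with hypothetical p-values; the function names are illustrative, not a library API.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i when p_i <= alpha / m (controls FWER)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Reject the k smallest p-values, k = max rank with p_(k) <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject

# Hypothetical p-values from 8 tests.
p_values = [0.001, 0.008, 0.012, 0.041, 0.2, 0.5, 0.7, 0.9]

bonferroni = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)
print("Bonferroni rejections:", sum(bonferroni))
print("BH rejections:        ", sum(bh))
```

On this input Bonferroni keeps only the smallest p-value while BH keeps the three smallest, which illustrates the power difference between FWER and FDR control.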
F. Bootstrap
34. What's the bootstrap? Resample data with replacement $B$ times to approximate the sampling distribution of an estimator. Non-parametric, simple, broadly applicable.
35. When does bootstrap fail? Extreme order statistics (min/max), heavy-tailed distributions without enough data, time series (without block bootstrap), very small $n$.
36. Bootstrap a confusion-matrix metric — how? Resample (predictions, labels) pairs with replacement. Compute metric on resample. Repeat 1000+ times. Quantiles of the resulting distribution give CI.
37. What's a paired bootstrap for model comparison? For each bootstrap sample, compute metric for both models on the same sample. Look at distribution of differences. Reject "no difference" if 0 not in CI.
38. Bagging is bootstrap of what? Bagging = "Bootstrap Aggregating." Train each tree on a bootstrap resample of data. Random Forests add feature subsampling.
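The paired bootstrap from question 37 can be sketched as follows, using accuracy as the metric. The per-example correctness vectors are made up for illustration (model A at 90%, model B at 80% on the same 500 test examples), and `paired_bootstrap_diff_ci` is a hypothetical helper.

```python
import random

def paired_bootstrap_diff_ci(correct_a, correct_b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(A) - accuracy(B) on shared test examples."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # One set of resampled indices is used for BOTH models (the pairing).
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    lo = diffs[round(alpha / 2 * n_resamples)]
    hi = diffs[round((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical per-example correctness (1 = correct) on the same test set.
correct_a = [1] * 450 + [0] * 50    # model A: 90% accuracy
correct_b = [1] * 400 + [0] * 100   # model B: 80% accuracy

diff_low, diff_high = paired_bootstrap_diff_ci(correct_a, correct_b)
print(f"95% CI for accuracy difference: [{diff_low:.3f}, {diff_high:.3f}]")
```

Sharing the resampled indices is what makes the comparison paired: example-level difficulty affects both models in the same resample and cancels in the difference.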
G. Bayesian inference
39. State Bayes' theorem. $P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$.
40. What's a conjugate prior? Example? A prior whose posterior stays in the same family. Beta-Bernoulli: prior $\mathrm{Beta}(\alpha, \beta)$ + $s$ successes / $f$ failures → posterior $\mathrm{Beta}(\alpha + s, \beta + f)$.
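41. Beta-Bernoulli posterior mean? With prior $\mathrm{Beta}(\alpha, \beta)$ and $s$ successes in $n$ trials: $\frac{\alpha + s}{\alpha + \beta + n}$. Smoothing: prior acts like $\alpha + \beta$ "pseudo-observations."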
42. What's MAP estimation? $\hat\theta_{\mathrm{MAP}} = \arg\max_\theta \left[\log p(D \mid \theta) + \log p(\theta)\right]$. MLE + log-prior penalty.
43. Connection between MAP and regularization? Gaussian prior on weights → $L_2$ penalty (ridge). Laplace prior → $L_1$ penalty (lasso). Regularization is MAP with a particular prior.
44. What's the marginal likelihood / evidence and why does it matter? $p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$. Used for Bayesian model comparison (Bayes factors). Hard to compute in general.
45. MCMC vs variational inference? MCMC: sample from posterior; asymptotically exact, slow. VI: approximate posterior with a simpler distribution by minimizing KL; biased, fast. ML practitioners usually use VI when scale matters.
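The Beta-Bernoulli update is simple enough to compute by hand, and a short sketch shows how the posterior mean, MAP, and MLE relate. The prior Beta(2, 2) and the 7-successes-in-10 data are assumptions chosen for illustration.

```python
def beta_bernoulli_update(alpha, beta, successes, failures):
    """Conjugate update: Beta(alpha, beta) prior -> Beta posterior parameters."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    return alpha / (alpha + beta)

def beta_mode(alpha, beta):
    """Mode of Beta(alpha, beta); valid when alpha > 1 and beta > 1."""
    return (alpha - 1) / (alpha + beta - 2)

# Hypothetical prior and data.
prior_alpha, prior_beta = 2.0, 2.0
successes, failures = 7, 3

post_alpha, post_beta = beta_bernoulli_update(prior_alpha, prior_beta, successes, failures)
posterior_mean = beta_mean(post_alpha, post_beta)   # (2 + 7) / (2 + 2 + 10)
map_estimate = beta_mode(post_alpha, post_beta)     # posterior mode
mle = successes / (successes + failures)

print(f"posterior: Beta({post_alpha}, {post_beta})")
print(f"posterior mean = {posterior_mean:.3f}, MAP = {map_estimate:.3f}, MLE = {mle:.3f}")
```

Both Bayesian point estimates are pulled from the MLE toward the prior mean of 0.5, and the pull weakens as the number of observations grows relative to the pseudo-observations.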
H. Practical ML stats
46. You report a model AUC of 0.85. How do you give it a CI? Bootstrap the test set 1000+ times; compute AUC on each; take 2.5%/97.5% quantiles.
47. Two models: AUC 0.85 vs 0.84. Is the difference significant? Paired bootstrap of AUC differences. CI for difference; reject "no difference" if 0 not in CI. Or DeLong's test for AUC specifically.
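48. You run 50 A/B tests and 3 are "significant" at $\alpha = 0.05$. Are any real? About $50 \times 0.05 = 2.5$ false positives are expected by chance alone, so 3 hits is consistent with no real effects. Apply Bonferroni ($\alpha/50 = 0.001$) or BH correction.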
49. Model accuracy = 87% on test set of 1000. CI? Wald: $0.87 \pm 1.96\sqrt{0.87 \cdot 0.13 / 1000} \approx 0.87 \pm 0.021$. Or Wilson interval (better for proportions). Or bootstrap.
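50. Train accuracy 95%, test 87%. Statistically significant gap? Compute CIs on each and compare. If the CIs overlap heavily, the gap might be noise. Better: build a CI for the difference directly (a two-sample comparison, since train and test are different examples, e.g., by bootstrapping each set separately), or check the gap across multiple train/test splits.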
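The Wald and Wilson intervals from question 49 can be computed directly. This is a small sketch using the numbers from that question (accuracy 0.87, $n = 1000$) and $z = 1.96$; the function names are illustrative.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation interval: p_hat +/- z * sqrt(p_hat (1 - p_hat) / n)."""
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

wald = wald_interval(0.87, 1000)
wilson = wilson_interval(0.87, 1000)
print(f"Wald:   [{wald[0]:.3f}, {wald[1]:.3f}]")
print(f"Wilson: [{wilson[0]:.3f}, {wilson[1]:.3f}]")
```

At this sample size the two intervals nearly agree; the Wilson interval matters more for small $n$ or proportions near 0 or 1, where it stays inside $[0, 1]$.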
Quick fire
51. MLE for Bernoulli? Sample mean.
52. Bessel's correction divisor? $n - 1$.
53. 95% z-value? 1.96.
54. CRLB lower-bounds what? Variance of unbiased estimator.
55. Conjugate of Bernoulli? Beta.
56. Conjugate of Poisson? Gamma.
57. Conjugate of multinomial? Dirichlet.
58. Bonferroni: divide $\alpha$ by? Number of tests $m$.
59. MAP equals MLE when? Uniform prior.
60. CLT statement? Sample mean is asymptotically Gaussian regardless of underlying distribution (with finite variance).
Self-grading
If you can't answer 1-15, you don't know basic statistics. If you can't answer 16-35, you'll get tripped up on every interview that probes ML evaluation rigor. If you can't answer 36-50, frontier-lab interviews on experimental rigor will go past you.
Aim for 40+/60 cold.