Statistical Inference — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

Statistical inference is what separates "I trained a model and it has 87% accuracy" from "I have evidence that my model's true accuracy is 87% ± 1.2% and that's a statistically significant 0.4-point improvement over the baseline." Senior interviews probe this hard because production ML decisions hinge on it.

1. Estimators — what they are and what makes one "good"

An estimator is a function $\hat{θ} = T (X_{1}, \dots, X_{n})$ of the data that tries to recover an unknown parameter $θ$ .

Properties

Unbiased: $E [\hat{θ}] = θ$ . Sample mean is unbiased for population mean. Sample variance with $n - 1$ denominator is unbiased; with $n$ it isn't (Bessel's correction).

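Consistent: $\hat{θ}_{n} \to_{p} θ$ as $n \to \infty$ . Most useful estimators are consistent. (Note: unbiasedness and consistency are distinct properties, and neither implies the other. A single observation $X_{1}$ is unbiased for the mean but not consistent; the divide-by- $n$ sample variance is consistent but biased.)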

Efficient: minimum variance among unbiased estimators. Cramér-Rao lower bound:

$Var (\hat{θ}) \geq \frac{1}{n I ( θ )}$

where $I (θ) = - E [\partial^{2} \log p (X ∣ θ) / \partial θ^{2}]$ is the Fisher information of a single observation and the $n$ observations are i.i.d. The MLE is asymptotically efficient: its variance attains the CRLB as $n \to \infty$ .

Bias-variance decomposition for MSE:

$MSE (\hat{θ}) = Bias (\hat{θ})^{2} + Var (\hat{θ})$

Important: a biased estimator with low variance can have lower MSE than an unbiased one with high variance. This is the whole point of regularization.
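
A quick simulation makes this concrete for the two variance estimators mentioned above. This is a minimal sketch assuming Gaussian data; the function name `variance_estimator_mse` and its defaults ($n = 10$, $σ^{2} = 4$) are illustrative choices, not part of the original notes.

```python
import numpy as np

def variance_estimator_mse(n=10, sigma2=4.0, reps=200_000, seed=0):
    """Monte Carlo bias, variance, and MSE of the two sample-variance estimators."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
    out = {}
    for name, ddof in (("divide_by_n", 0), ("divide_by_n_minus_1", 1)):
        est = x.var(axis=1, ddof=ddof)      # one estimate per simulated dataset
        bias = est.mean() - sigma2
        var = est.var()
        out[name] = {"bias": bias, "var": var, "mse": bias**2 + var}
    return out

results = variance_estimator_mse()
for name, r in results.items():
    print(name, {k: round(float(v), 3) for k, v in r.items()})
```

Under these assumptions the divide-by- $n$ estimator is biased downward yet ends up with the smaller MSE, because its variance reduction outweighs the squared bias.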

2. Maximum likelihood estimation

The likelihood: $L (θ) = \prod_{i} p (x_{i} ∣ θ)$ . The log-likelihood: $ℓ (θ) = \sum_{i} \log p (x_{i} ∣ θ)$ .

MLE: $\hat{θ}_{MLE} = \arg\max_{θ} ℓ (θ)$ .

Properties of MLE

Consistent: $\hat{θ}_{MLE} \to_{p} θ_{0}$
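Asymptotically normal: $\sqrt{n} (\hat{θ} - θ_{0}) \to_{d} N (0, I (θ_{0})^{- 1})$ , where $I (θ_{0})$ is the per-observation Fisher information
Asymptotically efficient: the asymptotic variance attains the CRLB
Invariant to reparameterization: the MLE of $g (θ)$ is $g (\hat{θ}_{MLE})$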

Worked examples

Bernoulli (coin flip): $p (x ∣ θ) = θ^{x} (1 - θ)^{1 - x}$ . $ℓ (θ) = \sum x_{i} \log θ + (n - \sum x_{i}) \log (1 - θ)$ . $\hat{θ}_{MLE} = \bar{x}$ (sample mean).

Gaussian (known variance): $p (x ∣ μ) = N (μ, σ^{2})$ . $\hat{μ}_{MLE} = \bar{x}$ .

Gaussian (both unknown): $\hat{μ} = \bar{x}$ , $\hat{σ}^{2} = \frac{1}{n} \sum (x_{i} - \bar{x})^{2}$ , which is biased. The unbiased estimator uses $1/ (n - 1)$ .
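
As a sketch of the zero-derivative routine for the Bernoulli case, write $s = \sum_{i} x_{i}$ (a shorthand introduced here). Setting the score to zero:

$\frac{\partial ℓ}{\partial θ} = \frac{s}{θ} - \frac{n - s}{1 - θ} = 0 \;\Rightarrow\; s (1 - θ) = (n - s) θ \;\Rightarrow\; \hat{θ}_{MLE} = \frac{s}{n} = \bar{x}$

The same example illustrates efficiency. The Fisher information of a single observation is

$I (θ) = - E \left[ - \frac{X}{θ^{2}} - \frac{1 - X}{(1 - θ)^{2}} \right] = \frac{1}{θ} + \frac{1}{1 - θ} = \frac{1}{θ (1 - θ)}$

so the Cramér-Rao bound for $n$ i.i.d. observations is $θ (1 - θ) / n$ , which is exactly $Var (\bar{x})$ . In this case the sample mean attains the bound at every $n$ , not only asymptotically.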

3. Confidence intervals

A $1 - α$ CI is a random interval $[L, U]$ with $P (L \leq θ \leq U) = 1 - α$ — over repeated sampling.

Common misinterpretation: "There's a 95% probability $θ$ is in [1.2, 3.4]." Wrong (under frequentist interpretation). $θ$ is fixed; the interval is random. The correct statement: "If we repeated this procedure many times, 95% of intervals would contain $θ$ ."
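
The repeated-sampling statement can be checked by simulation. The sketch below assumes Gaussian data with known $σ$ , so the z-interval has exact 95% coverage; `ci_coverage` and its default parameters are illustrative, not from the original notes.

```python
import numpy as np
from statistics import NormalDist

def ci_coverage(mu=3.0, sigma=2.0, n=50, reps=10_000, alpha=0.05, seed=0):
    """Fraction of z-intervals (known sigma) that contain the true mean mu."""
    rng = np.random.default_rng(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sigma / np.sqrt(n)
    # One sample mean per simulated experiment
    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    covered = (xbar - half_width <= mu) & (mu <= xbar + half_width)
    return covered.mean()

coverage = ci_coverage()
print(f"empirical coverage: {coverage:.3f}")
```

The true mean never moves; what varies from run to run is the interval. The empirical coverage should land close to the nominal 95%.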

Wald CI (asymptotic)

For an asymptotically normal estimator:

$\hat{θ} \pm z_{α /2} \cdot SE (\hat{θ})$

with $z_{0.025} = 1.96$ for 95%. Standard error from Fisher information or sample variance.
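
For a test-set accuracy, the Wald interval takes a few lines. This is a minimal sketch using the binomial standard error $\sqrt{\hat{p} (1 - \hat{p}) / n}$ ; the function name `wald_ci_proportion` and the example counts (870 correct out of 1000) are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def wald_ci_proportion(successes, n, alpha=0.05):
    """Wald (normal-approximation) confidence interval for a proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)          # binomial standard error
    z = NormalDist().inv_cdf(1 - alpha / 2)     # 1.96 for alpha = 0.05
    return p_hat - z * se, p_hat + z * se

lo, hi = wald_ci_proportion(successes=870, n=1000)
print(f"accuracy 0.870, 95% Wald CI: [{lo:.3f}, {hi:.3f}]")
```

With 1000 test examples the half-width is roughly 2 accuracy points, which is why small differences between models usually need larger test sets or paired comparisons. The Wald interval is known to undercover when $n$ is small or $\hat{p}$ is near 0 or 1.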

Bootstrap CI

When you can't compute SE analytically: resample data with replacement $B$ times, compute $\hat{θ}^{(b)}$ for each, then take quantiles (percentile method) or use bootstrap-t.

for b in 1 .. B:
  sample X_b with replacement from X (size n)
  compute theta_b = T(X_b)
CI = [quantile(thetas, alpha/2), quantile(thetas, 1-alpha/2)]

Bootstrap is non-parametric, simple, and extremely useful in ML for things like AUC confidence intervals.
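
A runnable version of the pseudocode above, as a sketch in NumPy. The function name `bootstrap_percentile_ci`, the default $B = 2000$ , and the synthetic exponential sample are assumptions made for illustration.

```python
import numpy as np

def bootstrap_percentile_ci(data, statistic, B=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for statistic(data), resampling rows of data."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.shape[0]
    thetas = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)     # row indices drawn with replacement
        thetas[b] = statistic(data[idx])
    lo, hi = np.quantile(thetas, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Usage: CI for the median of a skewed synthetic sample
sample = np.random.default_rng(1).exponential(scale=2.0, size=300)
lo, hi = bootstrap_percentile_ci(sample, np.median)
print(f"median {np.median(sample):.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```

Because resampling is done on row indices, the same function works for paired data: pass a 2-D array whose rows are (label, score) pairs and a `statistic` that computes AUC from those rows.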

Bayesian credible interval

The interval that contains 95% of the posterior probability mass. A different concept than Wald CI — and the credible interval supports the natural-language " $θ$ is in [...] with 95% probability" interpretation, conditional on prior.

4. Hypothesis testing

Testing a claim $H_{0}$ vs alternative $H_{1}$ .

Components

Test statistic $T (X)$ : function of data.
Null distribution: distribution of $T$ under $H_{0}$ .
Rejection region: values of $T$ where we reject $H_{0}$ .
Significance level $α$ : $P (reject ∣ H_{0}) \leq α$ (Type I error).
Power $1 - β$ : $P (reject ∣ H_{1})$ .

p-value

$p$ -value = $P (T \geq t_{obs} ∣ H_{0})$ for a one-sided test: the probability of seeing a test statistic at least as extreme as the one observed if $H_{0}$ is true.

Common interpretation errors:

$p$ -value is NOT $P (H_{0} ∣ data)$ .
A small $p$ -value doesn't mean a large effect, only that the observed data would be unlikely if $H_{0}$ were true.
$p > 0.05$ doesn't prove $H_{0}$ — just lack of evidence against it.

Standard tests

z-test: Gaussian, known variance. $z = (\bar{x} - μ_{0}) / (σ / \sqrt{n})$ .

t-test: Gaussian, unknown variance. Use sample SD; statistic follows $t_{n - 1}$ .

Chi-squared: categorical data goodness-of-fit, contingency tables. $χ^{2} = \sum (O - E)^{2} / E$ .

Mann-Whitney U / Wilcoxon: non-parametric two-sample.

A/B test (proportions): binomial / two-proportion z-test.
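
The two-proportion z-test for an A/B test can be sketched with the standard library alone. This version uses the pooled standard error under $H_{0}: p_{A} = p_{B}$ ; the function name and the conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(x_a, n_a, x_b, n_b):
    """Two-sided z-test of H0: p_a == p_b, using the pooled standard error."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)                       # estimate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))             # two-sided
    return z, p_value

# Hypothetical experiment: 10.0% vs 11.2% conversion, 5000 users per arm
z, p = two_proportion_ztest(x_a=500, n_a=5000, x_b=560, n_b=5000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

In this made-up example a 1.2-point lift on 5000 users per arm gives $z$ of about 1.95 and a two-sided p-value just above 0.05, a reminder that borderline results depend heavily on sample size.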

Type I vs Type II

Type I (false positive): reject $H_{0}$ when true. Controlled by $α$ .
Type II (false negative): fail to reject when $H_{1}$ true. $β$ , depends on effect size, $n$ , $α$ .

Power analysis picks $n$ to achieve target $1 - β$ (typically 80%) for a minimum detectable effect.
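
For a two-proportion test, a common normal-approximation formula for the per-arm sample size is $n \approx (z_{α /2} + z_{β})^{2} [p_{1} (1 - p_{1}) + p_{2} (1 - p_{2})] / (p_{1} - p_{2})^{2}$ . The sketch below implements that approximation; `n_per_arm` is an illustrative name, and exact methods or continuity corrections will give somewhat different numbers.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p_base, p_target, alpha=0.05, power=0.80):
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    var_sum = p_base * (1 - p_base) + p_target * (1 - p_target)
    return ceil((z_alpha + z_beta) ** 2 * var_sum / (p_target - p_base) ** 2)

n = n_per_arm(0.10, 0.12)          # detect a lift from 10% to 12%
print(f"about {n} users per arm")
```

Sample size scales with the inverse square of the minimum detectable effect, so halving the effect you want to detect roughly quadruples the required $n$ .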

5. Multiple testing

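When you run $m$ tests at $α = 0.05$ , the family-wise probability of at least one false rejection grows: under independence and with every null true, it is $1 - (1 - α)^{m}$ , which is approximately $m α$ only when $m α$ is small. With $m = 20$ tests at $α = 0.05$ , the expected number of false positives is $m α = 1$ and the probability of at least one is $1 - 0.95^{20} \approx 0.64$ .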

Corrections

Bonferroni: use $α / m$ per test. Conservative; controls family-wise error rate (FWER).
Holm-Bonferroni: step-down version — less conservative.
Benjamini-Hochberg: controls false discovery rate (FDR = expected proportion of false positives among rejections). Less conservative; standard in genomics, A/B testing at scale.
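
The three corrections differ only in the thresholds applied to the sorted p-values. A minimal NumPy sketch follows; the function names and the example p-values are made up for illustration, and each function returns a boolean rejection mask in the original order.

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    p = np.asarray(pvals, dtype=float)
    return p <= alpha / p.size

def holm(pvals, alpha=0.05):
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha / (m - np.arange(m))      # alpha/m, alpha/(m-1), ..., alpha
    passed = p[order] <= thresholds
    # Step-down: reject in sorted order until the first failure
    n_reject = m if passed.all() else int(np.argmin(passed))
    reject = np.zeros(m, dtype=bool)
    reject[order[:n_reject]] = True
    return reject

def benjamini_hochberg(pvals, q=0.05):
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = int(np.nonzero(passed)[0].max())     # largest rank meeting its threshold
        reject[order[:k + 1]] = True             # reject everything up to that rank
    return reject

pvals = [0.009, 0.035, 0.012, 0.004, 0.5]        # hypothetical p-values
print("Bonferroni:", bonferroni(pvals).sum(), "rejections")
print("Holm:      ", holm(pvals).sum(), "rejections")
print("BH:        ", benjamini_hochberg(pvals).sum(), "rejections")
```

On these five p-values Bonferroni rejects 2 hypotheses, Holm 3, and Benjamini-Hochberg 4, which matches the ordering from most to least conservative. Note that BH controls a different error rate (FDR) than the other two (FWER).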

When this matters in ML

Hyperparameter search: 100 hyperparam combos → some "win" by luck.
Many A/B tests on the same data: false positives.
Feature selection: testing each feature for significance inflates Type I.
Subgroup analysis ("but the model works better for users in California!") — almost always overstated without correction.

6. The bootstrap — workhorse for ML

The bootstrap (Efron 1979) lets you estimate sampling distributions when you can't derive them analytically.

Recipe (non-parametric bootstrap):

Resample $X^{(b)}$ from your data with replacement, size $n$ .
Compute $\hat{θ}^{(b)}$ .
Repeat $B$ times (typically 1000–10000).
The empirical distribution of $\{\hat{θ}^{(b)}\}_{b = 1}^{B}$ approximates the sampling distribution.

What you can do:

SE estimate: SD of the bootstrap distribution.
CI: quantiles (percentile method) or bias-corrected accelerated (BCa).
Hypothesis test: resample under $H_{0}$ and reject if the observed statistic falls in the tail of the resulting null distribution.

Bootstrap in ML practice:

AUC CI: bootstrap test set predictions.
Model comparison: paired bootstrap of metric differences.
Random forest internals: bagging is bootstrapping.
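
For model comparison, the key detail is that both models are scored on the same resampled examples, so the per-example pairing is preserved. The sketch below does this for accuracy; `paired_bootstrap_diff` and the synthetic predictions are illustrative assumptions.

```python
import numpy as np

def paired_bootstrap_diff(y_true, pred_a, pred_b, B=2000, alpha=0.05, seed=0):
    """Percentile CI for accuracy(A) - accuracy(B) on a shared test set."""
    rng = np.random.default_rng(seed)
    correct_a = (np.asarray(pred_a) == np.asarray(y_true)).astype(float)
    correct_b = (np.asarray(pred_b) == np.asarray(y_true)).astype(float)
    per_example_diff = correct_a - correct_b
    n = per_example_diff.size
    # (B, n) index array: each row is one resample, shared by both models.
    # For very large test sets, loop over b instead to save memory.
    idx = rng.integers(0, n, size=(B, n))
    diffs = per_example_diff[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return per_example_diff.mean(), (lo, hi)

# Synthetic example: model A is wrong on 100 of 1000 examples, model B on 200
y = np.zeros(1000, dtype=int)
a = y.copy(); a[:100] = 1
b = y.copy(); b[:200] = 1
diff, (lo, hi) = paired_bootstrap_diff(y, a, b)
print(f"accuracy difference {diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes zero, the difference is unlikely to be resampling noise. Pairing typically gives a much tighter interval than bootstrapping each model separately, because errors shared by both models cancel in the difference.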

Limitations:

Doesn't work for extreme order statistics (e.g., min/max).
Doesn't work well for time series without block bootstrap.
Computationally expensive for large $n$ .

7. Bayesian inference

Frequentist: $θ$ is fixed, data is random. Bayesian: $θ$ has a probability distribution.

$p (θ ∣ x) = \frac{p ( x ∣ θ ) p ( θ )}{p ( x )} \propto p (x ∣ θ) p (θ)$

$p (θ)$ : prior — your belief before seeing data.
$p (x ∣ θ)$ : likelihood — same as in MLE.
$p (θ ∣ x)$ : posterior — updated belief.
$p (x) = \int p (x ∣ θ) p (θ) d θ$ : marginal likelihood / evidence.

Conjugate priors

Posterior in the same family as prior. Examples:

Beta prior + Bernoulli likelihood → Beta posterior.
Gamma prior + Poisson likelihood → Gamma posterior.
Dirichlet prior + multinomial likelihood → Dirichlet posterior.
Gaussian prior + Gaussian likelihood (known variance) → Gaussian posterior.

Beta-Bernoulli example: prior $θ \sim Beta (α, β)$ . After observing $s$ successes in $n$ trials: posterior $θ ∣ x \sim Beta (α + s, β + n - s)$ . Posterior mean: $(α + s) / (α + β + n)$ .
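
The update is a pair of additions, and a credible interval can be read off the posterior. This sketch estimates the equal-tailed interval from Monte Carlo draws to stay within NumPy; the function names and the example (uniform Beta(1, 1) prior, 7 successes in 10 trials) are assumptions for illustration.

```python
import numpy as np

def beta_bernoulli_posterior(alpha, beta, successes, n):
    """Conjugate update: Beta(alpha, beta) prior -> Beta posterior parameters."""
    return alpha + successes, beta + n - successes

def credible_interval(a, b, level=0.95, draws=200_000, seed=0):
    """Equal-tailed credible interval, estimated from posterior samples."""
    rng = np.random.default_rng(seed)
    samples = rng.beta(a, b, size=draws)
    tail = (1 - level) / 2
    lo, hi = np.quantile(samples, [tail, 1 - tail])
    return lo, hi

a, b = beta_bernoulli_posterior(alpha=1, beta=1, successes=7, n=10)   # Beta(8, 4)
post_mean = a / (a + b)                                               # 8 / 12
lo, hi = credible_interval(a, b)
print(f"posterior Beta({a}, {b}), mean {post_mean:.3f}, 95% CrI [{lo:.3f}, {hi:.3f}]")
```

The posterior mean 8/12 sits between the prior mean 1/2 and the MLE 7/10, and moves toward the MLE as $n$ grows.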

MAP

Maximum a posteriori: $\hat{θ}_{MAP} = \arg\max_{θ} p (θ ∣ x) = \arg\max_{θ} [\log p (x ∣ θ) + \log p (θ)]$ .

This is exactly MLE + log-prior penalty. The penalty is the regularizer.

Gaussian prior on weights → $ℓ_{2}$ regularization (ridge).
Laplace prior → $ℓ_{1}$ (lasso).
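
To see where the ridge penalty comes from, take a linear-Gaussian model as an illustrative assumption: $y_{i} ∣ x_{i}, w \sim N (w^{\top} x_{i}, σ^{2})$ with prior $w \sim N (0, τ^{2} I)$ . The symbols $w$ , $τ$ , and $λ$ are introduced here. Dropping terms that do not depend on $w$ :

$- \log p (w ∣ data) = \frac{1}{2 σ^{2}} \sum_{i} (y_{i} - w^{\top} x_{i})^{2} + \frac{1}{2 τ^{2}} \lVert w \rVert_{2}^{2} + const$

Multiplying by $2 σ^{2}$ does not change the minimizer, so

$\hat{w}_{MAP} = \arg\min_{w} \sum_{i} (y_{i} - w^{\top} x_{i})^{2} + λ \lVert w \rVert_{2}^{2} , \quad λ = σ^{2} / τ^{2}$

A tighter prior (smaller $τ$ ) means a larger $λ$ and stronger shrinkage. Repeating the steps with a Laplace prior $p (w_{j}) \propto \exp (- ∣w_{j}∣ / b)$ replaces the squared norm with $\frac{1}{b} \sum_{j} ∣w_{j}∣$ , which is the $ℓ_{1}$ penalty.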

Posterior summaries

Posterior mean: $E [θ ∣ x]$
Posterior median, mode (MAP)
Credible interval: $[L, U]$ with $P (θ \in [L, U] ∣ x) = 0.95$

Bayesian inference in practice

Conjugate cases: closed-form (rare beyond simple models).
MCMC (Metropolis-Hastings, Gibbs, HMC): sample from posterior.
Variational inference: approximate posterior with simpler distribution; minimize KL.
Laplace approximation: Gaussian centered at MAP.
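
As a sketch of the MCMC option, here is random-walk Metropolis (the symmetric-proposal special case of Metropolis-Hastings) targeting the Beta-Bernoulli posterior, where the exact answer is known and can be used as a check. The function name, step size, chain length, and burn-in fraction are arbitrary illustrative choices.

```python
import numpy as np

def metropolis_beta_posterior(successes, n, alpha=1.0, beta=1.0,
                              steps=20_000, step_sd=0.2, seed=0):
    """Random-walk Metropolis draws from the Beta-Bernoulli posterior over theta."""
    rng = np.random.default_rng(seed)

    def log_post(theta):
        if not 0.0 < theta < 1.0:
            return -np.inf                       # zero density outside (0, 1)
        # Unnormalized log posterior: log-likelihood + log-prior
        return ((alpha + successes - 1) * np.log(theta)
                + (beta + n - successes - 1) * np.log(1 - theta))

    theta, samples = 0.5, np.empty(steps)
    for t in range(steps):
        proposal = theta + rng.normal(0.0, step_sd)
        # Symmetric proposal, so the acceptance ratio is just the posterior ratio
        if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
            theta = proposal
        samples[t] = theta
    return samples[steps // 10:]                 # drop the first 10% as burn-in

draws = metropolis_beta_posterior(successes=7, n=10)
print(f"MCMC posterior mean {draws.mean():.3f} (exact: {8 / 12:.3f})")
```

Only the unnormalized posterior is needed, which is the reason MCMC works when the evidence $p (x)$ is intractable. In real models the chain should also be checked for convergence and autocorrelation.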

8. Common ML stats gotchas

Mistake	Why it's wrong	Fix
"p > 0.05 → no effect"	Absence of evidence ≠ evidence of absence	Report effect size + CI
"p = 0.001 → big effect"	A small p means strong evidence against $H_{0}$ , which a tiny effect can produce when $n$ is large	Report effect size separately
"Train/test gap shows generalization"	Single split is noisy	Cross-validation or bootstrap
"AUC = 0.85 vs 0.84 → better model"	Without CI, can be noise	Bootstrap CIs, paired tests
"Multiple A/B tests at $α = 0.05$ "	FWER blows up	Bonferroni / BH correction
"Use confidence interval as 'probability $θ$ in interval'"	That's a credible interval	Be precise about interpretation
"MLE is always optimal"	Only asymptotically; can overfit, can be biased in finite samples	Consider MAP / regularization
"Bootstrap fixes any sample size problem"	Tiny $n$ → biased bootstrap	Need $n$ large enough for empirical to approximate true

9. Eight most-asked interview questions

What's the difference between a confidence interval and a credible interval? (Frequentist vs Bayesian; "interval random vs $θ$ random.")
Derive the MLE for a Gaussian. (Lock down log-likelihood + zero-derivative routine.)
What does a p-value mean exactly? (Probability of data this extreme under $H_{0}$ , NOT $P (H_{0} ∣ data)$ .)
When would you use bootstrap? (No analytic SE, ML metrics like AUC, paired model comparison.)
What's the bias-variance tradeoff for estimators? (MSE = bias² + variance; biased estimators can win.)
Why use Bessel's correction ( $n - 1$ )? (Sample variance with $n$ underestimates; $n - 1$ unbiases it.)
What's MAP and how does it relate to regularization? (MLE + log-prior; Gaussian prior = $ℓ_{2}$ , Laplace = $ℓ_{1}$ .)
You ran 20 A/B tests, two were significant at $p < 0.05$ . What do you do? (Multiple testing — apply Bonferroni or BH correction.)

10. Drill plan

For Bernoulli, Gaussian (both params), Poisson — derive MLE on paper. 5 minutes each.
For Beta-Bernoulli — derive posterior. Recite posterior mean.
Bootstrap loop in 30 lines of NumPy. AUC CI on a real dataset.
For each common test (z, t, chi-squared, two-prop), recite: assumptions, statistic, null distribution, when to use.
Interpret 5 different p-values and CI statements; flag the wrong ones.

11. Further reading

Casella & Berger, Statistical Inference — the canonical text.
Wasserman, All of Statistics — fast & broad, ML-friendly.
Efron & Hastie, Computer Age Statistical Inference — bootstrap, modern methods.
Gelman et al., Bayesian Data Analysis — Bayesian bible.
xkcd 882 (jelly beans) — the canonical multiple-testing comic.
