Frontier Intuitive Probability / Statistics Questions — Deep Dive

Frontier-lab research-scientist interview-grade reference for the open-ended Bayesian / probabilistic reasoning questions OpenAI / DeepMind / Anthropic ask. Built around 25 worked examples — including the canonical "two distributions, new sample, which one is it from" question — plus the underlying frameworks.

These questions are not "memorize a formula." They test whether you can frame an open scenario in probabilistic terms, identify the right tool, and reason cleanly to an answer. Frontier interviewers love them because they reveal depth in seconds: the answer is rarely the point; the framing is.

The framing checklist
Framework 1 — Bayesian classification / hypothesis testing
Framework 2 — Maximum likelihood and method of moments
Framework 3 — Concentration and tail bounds
Framework 4 — KL divergence as test statistic / "distance"
Framework 5 — Sequential decision making and bandits
Framework 6 — Importance sampling, rejection sampling
Framework 7 — Stein's paradox and shrinkage
The DeepMind two-distribution question — fully worked
25 worked frontier-lab questions
Common follow-up probes
Senior-level signals
References

1. The framing checklist

When a probabilistic-scenario question lands, ask yourself in this order:

(a) What random variables are involved? Define them precisely.

(b) What are the candidate hypotheses or models? (Two distributions? A null and an alternative? A prior over models?)

(c) Is this a classification, an estimation, or a decision problem?

Classification → Bayes' rule, likelihood ratio.
Estimation → MLE/MAP/posterior mean.
Decision → expected loss minimization.

(d) Do I have a prior, or am I purely frequentist?

Prior available → Bayesian: posterior $\propto$ likelihood $\times$ prior.
No prior → frequentist: likelihood ratio test, confidence intervals, p-values.

(e) What's the loss function or success metric?

0-1 loss → MAP / mode of posterior.
Squared error → posterior mean.
Asymmetric loss → tilt the threshold.

(f) How much data do I have? What's the appropriate level of confidence?

Few samples → priors and tail-bound reasoning matter most.
Many samples → asymptotics, CLT, Fisher information.

(g) What can I compute and what's only conceptual? State explicitly when you'd compute numerically vs argue from principle.

This checklist is the difference between a flailing answer and a clean one. State it out loud as you start.

2. Framework 1 — Bayesian classification

The most-tested framework in frontier-lab probability questions.

2.1 The setup

Given hypotheses $H_{1}, H_{2}$ (e.g., "sample came from distribution $P$ " vs "sample came from $Q$ ") and observation $x$ :

$P (H_{i} ∣ x) = \frac{P ( x ∣ H _{i} ) P ( H _{i} )}{\sum _{j} P ( x ∣ H _{j} ) P ( H _{j} )} .$

For the binary case:

$P (H_{1} ∣ x) = \frac{1}{1 + \frac{P ( H _{2} )}{P ( H _{1} )} \cdot \frac{P ( x ∣ H _{2} )}{P ( x ∣ H _{1} )}} .$

Decision rule under 0-1 loss: pick $H_{1}$ if $P (H_{1} ∣ x) > P (H_{2} ∣ x)$ , equivalently:

$Λ (x) = \frac{P ( x ∣ H _{1} )}{P ( x ∣ H _{2} )} > \frac{P ( H _{2} )}{P ( H _{1} )} .$

The test compares the likelihood ratio $Λ (x)$ against the prior odds $P (H_{2}) / P (H_{1})$ . The Neyman-Pearson lemma says: for a fixed false-positive rate, the likelihood ratio test is the most powerful test.

2.2 With multiple samples

For i.i.d. samples $x_{1}, ..., x_{n}$ :

$Λ_{n} = \prod_{i = 1}^{n} \frac{P ( x _{i} ∣ H _{1} )}{P ( x _{i} ∣ H _{2} )} .$

Better in log-space:

$\log Λ_{n} = \sum_{i = 1}^{n} \log \frac{P ( x _{i} ∣ H _{1} )}{P ( x _{i} ∣ H _{2} )} .$

The expected log-likelihood ratio under $H_{1}$ is the KL divergence:

$E_{x \sim P} [\log \frac{P ( x )}{Q ( x )}] = KL (P ∥ Q) \geq 0.$

This is why two distributions with high KL are easy to distinguish; close-to-zero KL means hard.
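The log-space rule above can be sketched in a few lines. This is a minimal illustration, assuming both hypotheses are Gaussians with known parameters; the function names and the sample values are hypothetical, not part of the original notes.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def log_likelihood_ratio(xs, h1, h2):
    """Sum over samples of log p(x | H1) - log p(x | H2); h1, h2 are (mu, sigma)."""
    return sum(gaussian_logpdf(x, *h1) - gaussian_logpdf(x, *h2) for x in xs)

def posterior_h1(xs, h1, h2, prior_h1=0.5):
    """P(H1 | xs): posterior log-odds = log-likelihood ratio + prior log-odds."""
    log_odds = log_likelihood_ratio(xs, h1, h2) + math.log(prior_h1 / (1 - prior_h1))
    return 1.0 / (1.0 + math.exp(-log_odds))

xs = [0.9, 1.4, 0.2, 1.1]  # hypothetical observations
print(log_likelihood_ratio(xs, h1=(1.0, 1.0), h2=(0.0, 1.0)))
print(posterior_h1(xs, h1=(1.0, 1.0), h2=(0.0, 1.0)))
```

For unit-variance Gaussians with means 1 and 0, each sample contributes $x - 0.5$ to the log-likelihood ratio, which makes the result easy to check by hand.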

2.3 Sample complexity

How many samples do you need to distinguish $H_{1}$ from $H_{2}$ with confidence $1 - δ$ ?

By the central limit theorem, the log-likelihood ratio's mean grows linearly in $n$ (slope = KL or $-$ KL depending on the truth) and its standard deviation grows like $\sqrt{n}$ . So discriminability (mean over standard deviation) scales as $\sqrt{n}$ . Specifically:

$n^{*} \approx \frac{( z _{α} + z _{β} ) ^{2} \cdot Var _{H_{1}} [ \log Λ ( x ) ]}{( KL ( P ∥ Q ) ) ^{2}}$

(roughly; the precise formula depends on test type).

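Memorize: for nearby distributions the per-sample variance of $\log Λ$ is itself of order KL (exactly $2 KL$ for two equal-variance Gaussians), so distinguishing two distributions takes $O (1/ KL)$ samples, not $O (1/ KL^{2})$ .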
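For the equal-variance Gaussian case the normal-approximation sample size can be computed directly as $n \approx (z_{α} + z_{β})^{2} σ^{2} / (μ_{1} - μ_{2})^{2}$ . The sketch below assumes a one-sided test with known $σ$ ; the function name and default error rates are illustrative choices.

```python
import math
from statistics import NormalDist

def samples_to_distinguish(mu1, mu2, sigma, alpha=0.05, beta=0.05):
    """Normal-approximation n for a one-sided test of N(mu1, s^2) vs N(mu2, s^2)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # false-positive quantile
    z_beta = NormalDist().inv_cdf(1 - beta)    # false-negative quantile
    return math.ceil((z_alpha + z_beta) ** 2 * sigma**2 / (mu1 - mu2) ** 2)

print(samples_to_distinguish(0.0, 1.0, 1.0))   # gap of one sigma
print(samples_to_distinguish(0.0, 0.5, 1.0))   # halving the gap roughly quadruples n
```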

2.4 Connection to Chernoff information

The exponent of the optimal Bayes error rate is the Chernoff information:

$C (P, Q) = - \min_{0 \leq λ \leq 1} \log \int p (x)^{λ} q (x)^{1 - λ} d x .$

The optimal classifier's error rate decays as $e^{- n C (P, Q)}$ . KL only upper-bounds this exponent, since $C (P, Q) \leq \min (KL (P ∥ Q), KL (Q ∥ P))$ ; the Chernoff exponent is the tight one.
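For discrete distributions the integral becomes a sum, so the Chernoff information can be evaluated numerically. A minimal sketch for two Bernoulli distributions, assuming a simple grid search over $λ$ (the grid size and function names are illustrative):

```python
import math

def chernoff_information(p, q, grid=10_000):
    """Chernoff information between Bernoulli(p) and Bernoulli(q), grid search over lambda."""
    best = 0.0
    for i in range(grid + 1):
        lam = i / grid
        # sum over the two outcomes of p(x)^lam * q(x)^(1 - lam)
        s = p**lam * q ** (1 - lam) + (1 - p) ** lam * (1 - q) ** (1 - lam)
        best = max(best, -math.log(s))  # -min log(s) equals max of -log(s)
    return best

def kl_bernoulli(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

c = chernoff_information(0.6, 0.5)
print(c, kl_bernoulli(0.6, 0.5), kl_bernoulli(0.5, 0.6))
```

The printed Chernoff value should come out below both KL directions, which is the sense in which KL is only an upper bound on the Bayes-error exponent.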

3. Framework 2 — Maximum likelihood and method of moments

3.1 MLE

$\hat{θ}_{MLE} = \arg\max_{θ} \prod_{i} p (x_{i} ∣ θ)$ .

Properties:

Consistent. $\hat{θ} \to θ_{0}$ as $n \to \infty$ .
Asymptotically normal. $\sqrt{n} (\hat{θ} - θ_{0}) \to N (0, I (θ_{0})^{- 1})$ where $I$ is the per-sample Fisher information.
Asymptotically efficient. Reaches the Cramér-Rao lower bound.
Sometimes biased in finite samples. Common interview gotcha.
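The finite-sample bias is easiest to see on the Gaussian variance, whose MLE divides by $n$ and therefore has expectation $(n - 1) σ^{2} / n$ . A small simulation sketch, assuming standard-normal data; the trial count and seed are arbitrary choices.

```python
import random

def variance_mle(xs):
    """MLE of a Gaussian variance: divides by n, not n - 1."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def average_variance_mle(n, trials=20_000, seed=0):
    """Average the variance MLE over many samples of size n drawn from N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        total += variance_mle(xs)
    return total / trials

# True variance is 1.0; the average should land near (n - 1) / n = 0.8 for n = 5.
print(average_variance_mle(n=5))
```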

3.2 MAP

$\hat{θ}_{MAP} = \arg\max_{θ} \prod_{i} p (x_{i} ∣ θ) \cdot p (θ)$ .

Reduces to MLE under uniform prior. Useful when data is sparse and prior is informative.

3.3 Method of moments

Solve $\hat{μ}_{k} (X) = μ_{k} (θ)$ for the first few moments.

Often less efficient than MLE.
Sometimes more robust.
Easier when likelihood is intractable.

3.4 Posterior mean vs MAP

For squared-error loss, posterior mean is optimal. For continuous $θ$ , exact 0-1 loss is degenerate (any point estimate is wrong with probability 1), so the MAP is the Bayes estimator only in the limit of a shrinking tolerance window around the estimate. It is still a reasonable summary when the posterior is unimodal.

4. Framework 3 — Concentration and tail bounds

For "with high probability X is small" questions.

4.1 Markov

$P (X \geq a) \leq E [X] / a$ for non-negative $X$ . Crude but always valid.

4.2 Chebyshev

$P (∣ X - μ ∣ \geq kσ) \leq 1/ k^{2}$ . Two-sided, no distributional assumption.

4.3 Hoeffding

For bounded i.i.d. $X_{i} \in [a, b]$ :

$P (∣ \frac{1}{n} \sum X_{i} - μ ∣ \geq t) \leq 2 \exp (\frac{- 2 n t ^{2}}{( b - a ) ^{2}}) .$

Sub-Gaussian tail; the workhorse for many concentration arguments.
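Inverting the two-sided Hoeffding bound gives a sample size: setting $2 \exp (- 2 n t^{2} / (b - a)^{2}) \leq δ$ yields $n \geq (b - a)^{2} \log (2/ δ) / (2 t^{2})$ . A minimal sketch with hypothetical helper names:

```python
import math

def hoeffding_bound(n, t, a=0.0, b=1.0):
    """Two-sided Hoeffding bound on P(|sample mean - mu| >= t) for X_i in [a, b]."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * t**2 / (b - a) ** 2))

def hoeffding_samples(t, delta, a=0.0, b=1.0):
    """Smallest n for which the bound above is at most delta."""
    return math.ceil((b - a) ** 2 * math.log(2.0 / delta) / (2.0 * t**2))

n = hoeffding_samples(t=0.05, delta=0.05)
print(n, hoeffding_bound(n, 0.05))
```

For variables in $[0, 1]$ , a tolerance of 0.05 at 95% confidence needs several hundred samples under this bound, which is conservative compared with a CLT-based interval.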

4.4 Bernstein

Sharper than Hoeffding when variance is known and small. Gives sub-exponential tail.

4.5 Chernoff

Generalizes via moment generating function. Tightest bound from MGF.

4.6 When to reach for which

"I know $X$ is bounded" → Hoeffding.
"I know the variance" → Chebyshev (loose) / Bernstein (tight).
"I want a numerical CI without assumption" → CLT for $n ≳ 30$ .
"I have a rate / count" → Poisson / Chernoff.

4.7 Common gotcha

Hoeffding has the 2 in the numerator; some forms have it in the denominator. Memorize the version with $- 2 n t^{2} / (b - a)^{2}$ in the exponent.

5. Framework 4 — KL divergence

Already invoked in §2. Three uses:

5.1 As "distance" (asymmetric)

$KL (P ∥ Q) = E_{P} [\log \frac{P ( x )}{Q ( x )}] \geq 0,$

$= 0$ iff $P = Q$ . Asymmetric, doesn't satisfy triangle inequality. Used as objective in distillation, alignment, regularization.

5.2 As Bayes-error exponent

The error rate of the optimal classifier decays at exponent $C (P, Q)$ (Chernoff), upper-bounded by KL.

5.3 As coding excess

If you encode $P$ -distributed data with a $Q$ -optimized code, expected excess bits per symbol = $KL (P ∥ Q)$ .

5.4 KL between two Gaussians

$KL (N (μ_{1}, Σ_{1}) ∥ N (μ_{2}, Σ_{2})) = \frac{1}{2} [\log \frac{∣ Σ _{2} ∣}{∣ Σ _{1} ∣} - d + tr (Σ_{2}^{- 1} Σ_{1}) + (μ_{2} - μ_{1})^{⊤} Σ_{2}^{- 1} (μ_{2} - μ_{1})] .$

For 1D: $KL (N (μ_{1}, σ^{2}) ∥ N (μ_{2}, σ^{2})) = (μ_{1} - μ_{2})^{2} / (2 σ^{2})$ .

So distinguishing means $μ_{1}, μ_{2}$ at known variance $σ^{2}$ takes $n^{*} \propto (μ_{1} - μ_{2})^{- 2} σ^{2}$ samples — a classical result.
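Specializing the general formula to $d = 1$ with possibly different variances gives $\log (σ_{2} / σ_{1}) + (σ_{1}^{2} + (μ_{1} - μ_{2})^{2}) / (2 σ_{2}^{2}) - 1/2$ . A short sketch (the function name is illustrative) that also shows the asymmetry:

```python
import math

def kl_gaussian_1d(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) in nats."""
    return (
        math.log(sigma2 / sigma1)
        + (sigma1**2 + (mu1 - mu2) ** 2) / (2 * sigma2**2)
        - 0.5
    )

print(kl_gaussian_1d(0.0, 1.0, 1.0, 1.0))  # equal variances: (mu1 - mu2)^2 / (2 sigma^2)
print(kl_gaussian_1d(0.0, 1.0, 0.0, 2.0))  # narrow || wide
print(kl_gaussian_1d(0.0, 2.0, 0.0, 1.0))  # wide || narrow: a different value
```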

6. Framework 5 — Sequential decision making and bandits

For "design a strategy" questions.

6.1 Multi-armed bandit

$K$ arms, unknown reward distributions, sequential pulls, regret = best-arm-reward minus chosen-arm-reward summed.

UCB: pick arm with highest $\hat{μ}_{a} + \sqrt{2 \log t / N_{a}}$ .
Thompson sampling: sample $θ_{a}$ from posterior, pull $\arg\max_{a} θ_{a}$ .
$ϵ$ -greedy: simplest; doesn't achieve $O (\log T)$ regret in general.

For stochastic bandits with fixed gaps between arms, the optimal regret is $O (\log T)$ .
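The UCB rule can be sketched on Bernoulli arms as follows. This is a minimal illustration of the UCB1 index with exploration bonus $\sqrt{2 \log t / N_{a}}$ ; the arm means, horizon, and seed are hypothetical.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms and return the number of pulls per arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k   # N_a: pulls of each arm so far
    sums = [0.0] * k   # total reward collected from each arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once so each index is defined
        else:
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

print(ucb1([0.2, 0.5, 0.8], horizon=5000))
```

Most pulls should concentrate on the best arm, with the suboptimal arms sampled only often enough to keep their confidence intervals below it.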

6.2 Best-arm identification

Different objective: minimize samples to confidently identify the best arm with prob $1 - δ$ . Different optimal algorithms (LUCB, Track-and-Stop).

6.3 Connections to RL

Bandit = stateless RL. Many ideas (exploration, regret) generalize.

7. Framework 6 — Importance sampling, rejection sampling

For "estimate this expectation under a hard distribution" questions.

7.1 Importance sampling

To estimate $E_{P} [f (X)]$ when $P$ is hard to sample but $Q$ is easy:

$E_{P} [f (X)] = E_{Q} [\frac{P ( X )}{Q ( X )} f (X)] .$

Variance of the estimator is small if $Q$ is well-matched to $∣ f ∣ P$ . Bad if $P$ has support where $Q$ has near-zero density (heavy tails of $P / Q$ ).
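A rare-event example shows why the choice of $Q$ matters. The sketch below estimates $P (X > 4)$ for $X \sim N (0, 1)$ by sampling from a proposal $N (4, 1)$ centred on the rare region; the proposal, sample size, and seed are assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

def tail_prob_importance(threshold, n=20_000, seed=0):
    """Estimate P(X > threshold), X ~ N(0, 1), with proposal Q = N(threshold, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)  # draw from Q
        if x > threshold:              # f(x) is the indicator of the tail event
            # importance weight P(x) / Q(x) for two unit-variance Gaussians
            total += math.exp(-threshold * x + threshold**2 / 2)
    return total / n

estimate = tail_prob_importance(4.0)
exact = 1.0 - NormalDist().cdf(4.0)
print(estimate, exact)
```

Naive sampling from $N (0, 1)$ with the same budget would typically see zero or one hit in the tail, so its relative error would be far larger.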

7.2 Rejection sampling

Sample from $Q$ , accept with probability $P (x) / (M \cdot Q (x))$ where $M = sup P / Q$ . Acceptance rate = $1/ M$ . Inefficient if $M$ is large.

7.3 In RLHF / alignment

Importance sampling is how PPO reuses rollouts collected under the previous policy. The ratio $r (θ) = π_{θ} / π_{θ_{old}}$ is the importance weight, and clipping it keeps the variance of the update in check.

8. Framework 7 — Stein's paradox and shrinkage

A classic "intuitive" topic.

8.1 The result

Estimate $K \geq 3$ Gaussian means $μ_{1}, ..., μ_{K}$ from one observation each $x_{k} \sim N (μ_{k}, 1)$ . The James-Stein estimator:

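$\hat{μ}_{k}^{JS} = (1 - \frac{K - 2}{\sum _{j} x _{j}^{2}}) x_{k}$

has strictly lower total MSE $E ∥ \hat{μ} - μ ∥^{2}$ than the obvious $\hat{μ}_{k} = x_{k}$ , regardless of the truth, when $K \geq 3$ . The guarantee is on the sum over coordinates; an individual coordinate can be estimated worse.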
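The dominance result can be checked by simulation. A minimal sketch, assuming unit-variance observations and an arbitrary hypothetical mean vector with $K = 5$ :

```python
import random

def james_stein(xs):
    """Shrink the observation vector toward the origin by the James-Stein factor."""
    k = len(xs)
    shrink = 1.0 - (k - 2) / sum(x * x for x in xs)
    return [shrink * x for x in xs]

def compare_risk(mu, trials=20_000, seed=0):
    """Monte Carlo total squared error of the raw observations vs James-Stein."""
    rng = random.Random(seed)
    raw_loss = js_loss = 0.0
    for _ in range(trials):
        xs = [rng.gauss(m, 1.0) for m in mu]
        js = james_stein(xs)
        raw_loss += sum((x - m) ** 2 for x, m in zip(xs, mu))
        js_loss += sum((e - m) ** 2 for e, m in zip(js, mu))
    return raw_loss / trials, js_loss / trials

raw_risk, js_risk = compare_risk([1.0, -0.5, 2.0, 0.0, 0.5])
print(raw_risk, js_risk)  # raw risk is near K = 5; the James-Stein risk is lower
```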

8.2 The intuition

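The means need not be related. The gain comes from geometry: $E ∥ x ∥^{2} = ∥ μ ∥^{2} + K$ , so the observation vector is on average farther from the origin than the true mean vector, and the overshoot grows with dimension. Shrinking toward the origin corrects for it, and for $K \geq 3$ the correction lowers total risk for every $μ$ .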

8.3 Connection to ML

Regularization, weight decay, and Bayesian priors are all flavors of shrinkage. The bias-variance tradeoff lives here.

9. The DeepMind two-distribution question — fully worked

The interview question as it was asked:

"You have two arrays of numbers from two distributions. A new number comes. Describe how you determine from which distribution it came from."

This is the canonical two-class classification with empirical density estimation. A clean answer walks through:

9.1 Set up

Data. Two arrays $A = \{a_{1}, ..., a_{n}\}$ from distribution $P$ , $B = \{b_{1}, ..., b_{m}\}$ from $Q$ .
Observation. New value $x$ .
Question. Decide which distribution $x$ came from.

9.2 Bayes formulation

$P (from P ∣ x) = \frac{p ( x ) π _{P}}{p ( x ) π _{P} + q ( x ) π _{Q}}$

where $π_{P}, π_{Q}$ are priors (typically $n / (n + m)$ and $m / (n + m)$ if both arrays are samples in proportion to base rates).

Decision: classify as $P$ iff $P (from P ∣ x) > 0.5$ (under 0-1 loss with equal class weights), equivalently:

$Λ (x) = \frac{p ( x )}{q ( x )} > \frac{π _{Q}}{π _{P}} .$

9.3 Estimating $p$ and $q$

The interesting depth — this is where the interviewer probes.

Option 1: parametric. Assume both are Gaussian. Estimate $\hat{μ}_{P}, \hat{σ}_{P}$ from $A$ via MLE; same for $Q$ . Plug into Gaussian density. Fast, low-variance, biased if assumption is wrong.

Option 2: non-parametric (KDE). Kernel density estimate from $A$ and from $B$ . Bandwidth chosen via cross-validation or Silverman's rule. More flexible; needs more data.

Option 3: empirical CDF + smoothing. Compute empirical CDFs and use a smoothing kernel to estimate density. Variant of KDE.

Option 4: discriminative. Don't estimate $p, q$ separately; train a classifier (logistic regression, neural net) directly on $(A, 0)$ and $(B, 1)$ . Output the predicted probability for $x$ . Often better than density estimation (Hastie/Tibshirani: discriminative > generative when modeling assumptions are wrong).
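Options 1 and 2 can be sketched with the standard library alone. This is an illustrative sketch, not a production classifier: it assumes 1-D data, estimates priors from array sizes, and uses a normal-reference rule of thumb for the KDE bandwidth. All function names and the example arrays are hypothetical.

```python
import math
from statistics import fmean, stdev

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_gaussian_density(data):
    """Option 1: parametric fit (stdev divides by n - 1; the exact MLE divides by n)."""
    mu, sigma = fmean(data), stdev(data)
    return lambda x: gaussian_pdf(x, mu, sigma)

def fit_kde_density(data):
    """Option 2: Gaussian-kernel KDE with a rule-of-thumb bandwidth."""
    n = len(data)
    h = 1.06 * stdev(data) * n ** (-1 / 5)
    return lambda x: sum(gaussian_pdf(x, a, h) for a in data) / n

def posterior_p(x, a, b, fit=fit_gaussian_density):
    """P(x came from P | x), with priors n/(n+m) and m/(n+m)."""
    p, q = fit(a), fit(b)
    prior_p = len(a) / (len(a) + len(b))
    num = p(x) * prior_p
    den = num + q(x) * (1 - prior_p)
    return num / den if den > 0 else float("nan")  # nan: no evidence for either side

A = [0.1, -0.4, 0.3, 1.2, -0.8, 0.5]
B = [2.9, 3.4, 2.2, 4.1, 3.0, 3.6]
print(posterior_p(0.0, A, B), posterior_p(3.2, A, B, fit=fit_kde_density))
```

Returning `nan` when both estimated densities underflow to zero is one simple way to surface the "I don't know" case instead of forcing a label.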

9.4 Diagnostics and follow-up answers

Quantify confidence. $∣ \log Λ (x) ∣$ measures evidence in nats. Convert to posterior probability via the Bayes formula above.
Sample complexity. How many samples do you need from each side? Roughly $O (1/ KL)$ for distinguishability (the per-sample variance of the log-likelihood ratio is itself of order KL when the distributions are close) plus $O (1/ ϵ^{2})$ for density estimation accuracy. KL is between $P$ and $Q$ .
What if the new sample is in a region with no training data on either side? The likelihoods are both ~0 estimates; the answer is "I don't know" — and a robust system flags it as out-of-distribution. This is where you mention OOD detection (Mahalanobis distance, energy score, ensemble disagreement).
What if the priors are unknown? You can still compute the likelihood ratio; the prior is a multiplicative factor in the threshold.
What if A and B are huge but a new sample is one number? Fast lookup: nearest-neighbor density estimation in $O (\log n)$ with a sorted array.
What if $P$ and $Q$ have heavy overlap? Even the optimal Bayes classifier will have high error. Quantify via the Bayes error rate, $\int \min (π_{P} p, π_{Q} q)$ , which is $\frac{1}{2} \int \min (p, q)$ for equal priors.
What loss are you optimizing? 0-1 loss → MAP. Asymmetric (false-A worse than false-B) → shift threshold. Multi-class extension is straightforward.
What if A and B are not independent of $x$ (covariate shift)? Doesn't apply if $x$ is just a sample; applies if there's structured dependence (time-series, locality).

9.5 The 90-second oral answer

This is binary classification: hypothesis $H_{P}$ that $x$ came from distribution $P$ (with array $A$ as samples) vs $H_{Q}$ from $Q$ (array $B$ ). Bayes-optimal under 0-1 loss is the likelihood ratio test: classify as $P$ if $p (x) / q (x) > π_{Q} / π_{P}$ , where $π$ 's are class priors estimated as $n / (n + m)$ and $m / (n + m)$ .

The interesting part is estimating $p$ and $q$ . Three approaches: parametric (assume Gaussian, fit MLE — fast, biased if wrong), non-parametric KDE (more flexible, needs more data, bandwidth via cross-validation), or discriminative (train a logistic regression / neural net on combined labeled data — often better than density estimation, per the discriminative-vs-generative literature).

I'd quantify confidence by $∣ \log Λ (x) ∣$ ; flag out-of-distribution if both $p (x)$ and $q (x)$ are very low; and note that sample complexity for discriminability scales as $1/ KL (P ∥ Q)$ , so if the two distributions are very close, you need a lot of samples to be confident regardless of method.

This answer, in 90 seconds, hits: framing, Bayes rule, prior, likelihood ratio, three estimation strategies, OOD flagging, sample complexity, and KL connection. That's a frontier-lab answer.

10. 25 worked frontier-lab questions

Brief but enough to seed the thought.

Q1. "Two arrays from two distributions, classify a new sample." (above, §9)

Q2. "How many coin flips to confirm a coin is biased toward heads?"

Hypothesis test. $H_{0} : p = 0.5$ vs $H_{1} : p > 0.5$ . Use the binomial / normal approximation. For detecting $p = 0.6$ at $α = β = 0.05$ :

$n^{*} \approx \frac{( z _{0.05} + z _{0.05} ) ^{2} \cdot p ( 1 - p )}{( p - 0.5 ) ^{2}} = \frac{( 1.645 + 1.645 ) ^{2} \cdot 0.24}{0.01} \approx 260.$

KL framing: $KL (Bern (0.6) ∥ Bern (0.5)) \approx 0.0201$ nats, and for nearby hypotheses $n \approx (z_{α} + z_{β})^{2} / (2 KL) \approx 10.8/0.040 \approx 270$ , which agrees with the direct calculation.

Q3. "Estimate the mean of a normal distribution given 3 samples. What's your confidence interval?"

Use $t$ -distribution (small $n$ ): $\bar{x} \pm t_{n - 1, α /2} \cdot s / \sqrt{n}$ . With $n = 3$ , $t_{2, 0.025} \approx 4.30$ , so the CI is wide.

Q4. "You sample $X_{1}, ..., X_{n}$ i.i.d. from a distribution with bounded variance. How concentrated is the sample mean?"

Chebyshev: $P (∣ \bar{X} - μ ∣ \geq t) \leq σ^{2} / (n t^{2})$ . Or CLT for $n ≳ 30$ . Or Hoeffding if bounded support.

Q5. "What's the variance of the sample variance for a Gaussian?"

$Var (s^{2}) = 2 σ^{4} / (n - 1)$ .

Q6. "Why can't you just use empirical CDF for likelihood?"

Empirical CDF gives $\hat{F} (x)$ , but the density is $\hat{F}^{'} (x)$ — a sum of delta functions at observations. Useless for new points. Need smoothing.

Q7. "How would you test if a sample came from a normal distribution?"

Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov, Q-Q plots, Jarque-Bera.

Q8. "Two samples — same distribution test?"

Two-sample Kolmogorov-Smirnov. Or Mann-Whitney. Or $t$ -test if assuming Gaussian. Or permutation test (most flexible).

Q9. "Estimate KL between two empirical distributions."

KDE both, then numerically integrate; or use the $k$ -NN-based estimator (Pérez-Cruz); or train a discriminator and use the bound from the GAN/ $f$ -divergence literature.

Q10. "If KL between two distributions is $ϵ$ , how easy to discriminate?"

Sample complexity $\sim 1/ ϵ$ for small $ϵ$ . Bayes error rate decays as $e^{- n C (P, Q)}$ where $C \leq KL$ .

Q11. "Coin flip game: I flip; you guess; I pay you 2× your bet if right; how much do you bet?"

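Kelly criterion: $f^{*} = (b p - q) / b$ where $b$ is the net odds (profit per unit staked), $p$ is win prob, $q = 1 - p$ . Reading "2× your bet" as net odds $b = 2$ with $p = 0.5$ : $f^{*} = (1 - 0.5) /2 = 0.25$ , so bet 25% of bankroll. If the 2× is the total payout (even money, $b = 1$ ), the edge is zero and $f^{*} = 0$ .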
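The Kelly formula is a one-liner. The sketch below assumes $b$ is the net odds and floors the fraction at zero, since a non-positive edge means no bet; the function name is illustrative.

```python
def kelly_fraction(p, b):
    """Kelly bet as a fraction of bankroll for win probability p and net odds b."""
    return max(0.0, (b * p - (1 - p)) / b)

print(kelly_fraction(0.5, 2.0))  # 0.25
print(kelly_fraction(0.5, 1.0))  # 0.0: a fair even-money bet has no edge
```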

Q12. " $X$ uniform on [0,1]. What's $E [max (X, 0.5)]$ ?"

Split: with prob 0.5 max=0.5; otherwise max= $X$ with $X \in [0.5, 1]$ , so $E [X ∣ X > 0.5] = 0.75$ , with prob 0.5. Total: $0.5 \cdot 0.5 + 0.5 \cdot 0.75 = 0.625$ .

Q13. "Two coins; one fair, one always heads. You pick one and flip 10 times, all heads. What's $P (fair)$ ?"

Bayes: $P (fair ∣ 10 H) = P (10 H ∣ fair) \cdot 0.5/ [P (10 H ∣ fair) \cdot 0.5 + P (10 H ∣ biased) \cdot 0.5] = (1/1024) \cdot 0.5/ [(1/1024) \cdot 0.5 + 1 \cdot 0.5] = 1/1025 \approx 0.001$ . Almost certainly biased.
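The same posterior can be computed exactly with rational arithmetic. A minimal sketch, assuming a 50/50 prior over which coin was picked (the function name is hypothetical):

```python
from fractions import Fraction

def posterior_fair(n_heads, prior_fair=Fraction(1, 2)):
    """P(fair coin | n_heads heads in a row) when the other coin always lands heads."""
    like_fair = Fraction(1, 2) ** n_heads
    like_two_headed = Fraction(1)
    num = like_fair * prior_fair
    return num / (num + like_two_headed * (1 - prior_fair))

print(posterior_fair(10))  # 1/1025
```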

Q14. " $X, Y$ i.i.d. uniform on $[0, 1]$ . What's $E [max (X, Y)]$ ?"

CDF of max: $F_{M} (z) = z^{2}$ . PDF: $2 z$ . $E [M] = \int_{0}^{1} z \cdot 2 z d z = 2/3$ .

Q15. "How many people for 50% birthday collision probability?"

Birthday problem. $P (no collision among n) = \prod_{k = 0}^{n - 1} (365 - k) /365 \approx e^{- n (n - 1) / (2 \cdot 365)}$ . Set $\approx 0.5$ → $n \approx 23$ .
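The exact product is cheap to evaluate, so the approximation can be checked directly. A short sketch assuming 365 equally likely birthdays (function names are illustrative):

```python
def collision_probability(n, days=365):
    """Exact P(at least two of n people share a birthday)."""
    p_no_collision = 1.0
    for k in range(n):
        p_no_collision *= (days - k) / days
    return 1.0 - p_no_collision

def smallest_group(target=0.5, days=365):
    """Smallest n whose collision probability reaches the target."""
    n = 1
    while collision_probability(n, days) < target:
        n += 1
    return n

print(smallest_group())  # 23
```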

Q16. "Three doors, prize behind one, you pick door 1, host opens door 3 (no prize). Switch?"

Monty Hall. Yes — prob of prize behind door 2 is 2/3 (host's action carries information).

Q17. "Why does Monty Hall break if host acts randomly?"

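Then the host's action carries less information. If the host opens one of the other two doors at random and it happens to show no prize, the posterior is 1/2 for your door and 1/2 for the remaining door, so switching no longer helps. The 2/3 answer depends on the host knowingly avoiding the prize.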
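Both host behaviours can be enumerated exactly. The sketch below assumes the player always holds door 0 and conditions on the opened door showing no prize; the function name and door numbering are illustrative.

```python
from fractions import Fraction

def switch_win_probability(host_knows):
    """P(switching wins | host opened a door with no prize); the player holds door 0."""
    win = total = Fraction(0)
    for prize in range(3):
        for opened in (1, 2):
            if opened == prize:
                # An informed host never does this; for a random host we condition it away.
                continue
            if host_knows:
                # Forced choice unless the player already holds the prize.
                p_open = Fraction(1, 2) if prize == 0 else Fraction(1)
            else:
                p_open = Fraction(1, 2)       # random host picks door 1 or 2 uniformly
            weight = Fraction(1, 3) * p_open  # P(prize location) * P(host opens this door)
            total += weight
            if 3 - opened == prize:           # 3 - opened is the other unopened door
                win += weight
    return win / total

print(switch_win_probability(True), switch_win_probability(False))  # 2/3 1/2
```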

Q18. "Estimate $π$ via random sampling."

Sample $(x, y) \sim U [0, 1]^{2}$ . Fraction in unit quarter circle is $π /4$ . Multiply by 4. Variance $\propto 1/ n$ .
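A minimal Monte Carlo sketch of this estimator; the sample size and seed are arbitrary choices.

```python
import random

def estimate_pi(n, seed=0):
    """Estimate pi as 4 * (fraction of uniform points in the unit quarter circle)."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(n) if rng.random() ** 2 + rng.random() ** 2 <= 1.0)
    return 4.0 * inside / n

print(estimate_pi(200_000))
```

The standard error is $4 \sqrt{p (1 - p) / n}$ with $p = π /4$ , so each extra digit of accuracy costs roughly 100× more samples.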

Q19. " $X \sim Exp (λ)$ . What's $P (X > a + b ∣ X > a)$ ?"

Memorylessness: $P (X > a + b ∣ X > a) = P (X > b)$ .

Q20. "Sum of two i.i.d. exponentials — what distribution?"

Gamma(2, $λ$ ). Sum of $k$ i.i.d. Exp( $λ$ ) is Gamma( $k, λ$ ).

Q21. "Why is the median more robust than the mean?"

Median has 50% breakdown point; mean has 0%. One outlier can move the mean arbitrarily far, but it shifts the median by at most one order statistic.

Q22. "Detect a change-point in a stream of values from a known distribution."

CUSUM, GLR (generalized likelihood ratio), Bayesian online change-point detection. Sequential framework — every time, compute likelihood ratio under "no change" vs "change at $t$ " hypothesis; if exceeds threshold, declare change.
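The simplest of these, one-sided CUSUM, can be sketched as follows. This assumes a Gaussian stream with known pre-change mean `mu0`, a hypothesised post-change mean `mu1`, and known `sigma`; the threshold and the toy stream are illustrative.

```python
def cusum_alarm(xs, mu0, mu1, sigma, threshold):
    """Index of the first sample at which the CUSUM statistic exceeds threshold, else None."""
    s = 0.0
    for i, x in enumerate(xs):
        # per-sample log-likelihood ratio log N(x; mu1, sigma) / N(x; mu0, sigma)
        llr = (mu1 - mu0) * (x - (mu0 + mu1) / 2) / sigma**2
        s = max(0.0, s + llr)  # reset at zero so pre-change evidence does not accumulate
        if s > threshold:
            return i
    return None

stream = [0.0] * 20 + [1.0] * 20  # noiseless toy stream with a change at index 20
print(cusum_alarm(stream, mu0=0.0, mu1=1.0, sigma=1.0, threshold=3.0))  # 26
```

Raising the threshold lowers the false-alarm rate at the cost of a longer detection delay.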

Q23. "Estimate the size of a population with unique IDs from a single sample."

German tank problem. If max observed = $m$ from $n$ samples: MLE estimate $= m$ , but minimum-variance unbiased estimator $= m \cdot (n + 1) / n - 1 = m + m / n - 1$ .
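Both estimators in code, assuming serial numbers start at 1 and are sampled without replacement; the function name and example serials are illustrative.

```python
def german_tank_estimates(serials):
    """MLE and minimum-variance unbiased estimate of the population size."""
    n, m = len(serials), max(serials)
    return {"mle": m, "mvue": m * (n + 1) / n - 1}

print(german_tank_estimates([19, 40, 42, 60]))  # {'mle': 60, 'mvue': 74.0}
```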

Q24. "Two-sample mean test — but the variances differ."

Welch's $t$ -test. Adjusted degrees of freedom. Non-parametric: Mann-Whitney.

Q25. "AB test: significant at $p = 0.04$ , $n = 10000$ . Should you ship?"

Discuss: practical significance vs statistical, multiple testing, peeking, effect size, business cost of being wrong. Senior signal: don't take the p-value at face value.

11. Common follow-up probes

Frontier interviewers always probe one or two of these after your initial answer:

"What if your prior is wrong?" → Bayesian sensitivity analysis. Posterior dominated by data when $n$ is large; dominated by prior when $n$ is small.
"What's the variance of your estimator?" → Cramér-Rao, asymptotic variance via Fisher info.
"What if the distributions overlap heavily?" → Bayes error floor; quantify via $\int \min (π_{P} p, π_{Q} q)$ , which is $\frac{1}{2} \int \min (p, q)$ for equal priors.
"What's your sample complexity?" → Concentration inequality + KL/Chernoff.
"What if you don't know the parametric family?" → Non-parametric (KDE, $k$ -NN) or discriminative.
"What's the asymmetric-loss version?" → Shift threshold; minimize expected loss.
"How would this fail in production?" → distribution shift, OOD, label noise, data drift.
"Compare your method with X." → Bias-variance tradeoff; sample efficiency.
"What's the connection to information theory / KL / Fisher info?" → Reach for the unifying theorem.
"Why are you confident in your estimator?" → CI, bootstrap, robustness.

12. Senior-level signals

You start with the framing checklist. Don't jump to a formula.
You name the framework (Bayes / MLE / Concentration / Bandit / Importance / Stein) explicitly.
You quantify confidence: $\log Λ$ , posterior, CI, sample complexity in $1/ KL$ .
You discuss assumptions and what fails when they're wrong.
You name the connection to information theory (KL, Fisher, Chernoff).
You think about OOD / failure modes, not just the happy path.
You distinguish frequentist vs Bayesian when relevant.
You mention the production-grade variant (online estimation, drift detection, hypothesis testing under multiple comparisons).
You don't over-claim. "Optimal under 0-1 loss with these priors" — not "optimal."
You can pivot from the analytical answer to a programmatic one if asked.

13. References

Casella & Berger, Statistical Inference. The standard reference.
Cover & Thomas, Elements of Information Theory. KL, Fisher, Chernoff.
Wasserman, All of Statistics. Concise and broad.
Bishop, Pattern Recognition and Machine Learning. Bayesian flavor.
Hastie, Tibshirani, Friedman, Elements of Statistical Learning. Discriminative-vs-generative debate.
Robert, The Bayesian Choice. Decision theory.
Lattimore & Szepesvári, Bandit Algorithms. Sequential decision making.
Berger, Statistical Decision Theory and Bayesian Analysis. Stein's paradox, shrinkage.
Lehmann & Romano, Testing Statistical Hypotheses. Frequentist hypothesis testing.

How to use this chapter

Read §1 (framing checklist) until automatic.
Memorize the seven frameworks (§2-§8) at a level where you can name them and the canonical formula on demand.
Drill §10 — 25 worked questions — until each has a 30-second answer.
Memorize §9 (the DeepMind two-distribution question) verbatim as your "model answer" template.
Pair with INTERVIEW_GRILL.md for active recall.
Practice out loud — these are oral exams in real interviews.

Single sentence to remember: frame as Bayesian classification or MLE / decision / concentration, name the framework explicitly, quantify with KL or Fisher or Chernoff, discuss assumptions and OOD, and end with sample complexity.
