Frontier Intuitive Probability / Statistics — Interview Grill

100+ active-recall questions calibrated for OpenAI / DeepMind / Anthropic research-scientist rounds. Each should be answerable aloud in about 60 seconds. Pair with INTUITIVE_QUESTIONS_DEEP_DIVE.md.

Section A — Framing checklist (Q1–7)

List the 7 framing checklist items in order.
When does a problem call for Bayesian vs frequentist framing?
How do you tell whether a problem is classification, estimation, or decision?
Why is it important to state the loss function before computing?
What's the difference between MAP and posterior mean as point estimates? Under what loss is each optimal?
What does "asymptotic" mean and when do you reach for it?
State a single-sentence summary of the framing approach.

Section B — Bayesian classification (Q8–18)

Sketch Bayes' rule with priors and likelihoods.
Define the likelihood ratio and the prior odds.
State the Neyman-Pearson lemma.
Why is the LRT optimal at fixed false-positive rate?
For i.i.d. samples, how does the log-likelihood ratio scale with $n$?
What's the expected log-likelihood ratio under $H_{1}$? (Hint: it's a familiar quantity.)
State the sample complexity formula for distinguishability.
Does sample complexity scale as $O(1/\mathrm{KL})$ or $O(1/\mathrm{KL}^{2})$? Justify your answer.
What's Chernoff information? How is it related to Bayes error rate?
Walk through the Bayes error rate formula $\int \min(p, q)$.
Two-class question — under asymmetric loss, how does the threshold shift?
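
The likelihood-ratio questions above can be rehearsed against a small sketch. It assumes two Gaussian hypotheses with a shared, known variance; the function names and the cost parameters `cost_fp` and `cost_fn` are illustrative, not taken from the deep dive. It shows that the log-likelihood ratio is a sum over i.i.d. samples, and that priors and asymmetric costs only move the threshold.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def log_likelihood_ratio(xs, mu1, mu0, sigma):
    """Sum of per-sample log p1(x) / p0(x); additive in n for i.i.d. data."""
    return sum(
        gaussian_logpdf(x, mu1, sigma) - gaussian_logpdf(x, mu0, sigma) for x in xs
    )

def decide_h1(xs, mu1, mu0, sigma, prior1=0.5, cost_fp=1.0, cost_fn=1.0):
    """Choose H1 when the LLR exceeds log[(pi0 * cost_fp) / (pi1 * cost_fn)].

    Assumption: zero cost for correct decisions, so only the two error
    costs and the prior odds enter the threshold.
    """
    threshold = math.log((1 - prior1) * cost_fp / (prior1 * cost_fn))
    return log_likelihood_ratio(xs, mu1, mu0, sigma) > threshold

if __name__ == "__main__":
    xs = [1.0]
    print(decide_h1(xs, mu1=1.0, mu0=0.0, sigma=1.0))                # symmetric loss
    print(decide_h1(xs, mu1=1.0, mu0=0.0, sigma=1.0, cost_fp=10.0))  # costly false positives
```

Raising `cost_fp` raises the threshold, so the same evidence that favored $H_{1}$ under symmetric loss may no longer be enough.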

Section C — MLE, MAP, method of moments (Q19–27)

State the MLE objective.
Three asymptotic properties of MLE.
What's Fisher information?
What's the Cramér-Rao lower bound?
When is MLE biased in finite samples? Give an example.
Compare MLE with MAP — when are they the same?
Why is MAP not the optimal Bayesian estimator under squared loss?
When does method of moments beat MLE?
What does "asymptotically efficient" mean?
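
For the finite-sample bias question, one standard example is the Gaussian variance MLE, which divides by $n$ and has expectation $\frac{n-1}{n}\sigma^{2}$. The simulation below is a rough check under the assumption of standard normal data; the helper names are hypothetical.

```python
import random

def mle_variance(xs):
    """Gaussian MLE of the variance: divides by n, not n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

def average_mle_variance(n, trials=20000, seed=0):
    """Monte Carlo estimate of E[mle_variance] for N(0, 1) samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += mle_variance([rng.gauss(0.0, 1.0) for _ in range(n)])
    return total / trials

if __name__ == "__main__":
    n = 5
    # The simulated average should sit near (n - 1) / n, below the true variance 1.
    print(average_mle_variance(n), (n - 1) / n)
```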

Section D — Concentration and tail bounds (Q28–36)

State Markov's inequality.
State Chebyshev's inequality.
State Hoeffding's inequality (precise form).
When does Hoeffding apply but Bernstein doesn't?
When does Bernstein give a sharper bound than Hoeffding?
What's the moment generating function and why does it matter for Chernoff?
When does CLT apply?
CLT — what's the rate of convergence (Berry-Esseen)?
For binary outcomes with $n = 200$, $p = 0.5$, what's the 95% CI half-width?
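
The half-width question can be checked numerically. The sketch below assumes the normal-approximation (Wald) interval with $z = 1.96$ and, for comparison, the two-sided Hoeffding radius $\sqrt{\log(2/\delta)/(2n)}$ for $[0, 1]$-valued variables; the function names are illustrative.

```python
import math

def wald_half_width(p, n, z=1.96):
    """Normal-approximation CI half-width for a binomial proportion."""
    return z * math.sqrt(p * (1 - p) / n)

def hoeffding_half_width(n, delta=0.05):
    """Distribution-free radius from Hoeffding for variables in [0, 1]."""
    return math.sqrt(math.log(2 / delta) / (2 * n))

if __name__ == "__main__":
    print(round(wald_half_width(0.5, 200), 4))   # about 0.069
    print(round(hoeffding_half_width(200), 4))   # about 0.096, wider because it ignores variance
```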

Section E — KL divergence and information theory (Q37–46)

Define KL divergence and state two key properties.
Why is KL asymmetric?
Sketch KL between two univariate Gaussians with the same variance.
Why does KL matter for distinguishability?
State the relationship between KL and Bayes error exponent.
What's the Fano inequality?
What's mutual information in one sentence?
Why is KL the "natural" loss in maximum-likelihood / VAE / diffusion?
Why is reverse-KL different from forward-KL in posterior approximation?
KL as coding excess — explain.
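
The Gaussian KL and asymmetry questions can be verified with the closed form $\mathrm{KL}(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$, which reduces to $(\mu_1 - \mu_2)^2 / (2\sigma^2)$ for equal variances. A minimal sketch, with a hypothetical function name:

```python
import math

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats."""
    return (
        math.log(sigma2 / sigma1)
        + (sigma1**2 + (mu1 - mu2) ** 2) / (2 * sigma2**2)
        - 0.5
    )

if __name__ == "__main__":
    # Equal variances: reduces to (mu1 - mu2)^2 / (2 sigma^2).
    print(kl_gaussians(0.0, 1.0, 1.0, 1.0))
    # Unequal variances: swapping the arguments changes the value.
    print(kl_gaussians(0.0, 1.0, 0.0, 2.0), kl_gaussians(0.0, 2.0, 0.0, 1.0))
```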

Section F — Sequential decision / bandits (Q47–53)

Define the multi-armed bandit problem.
What's UCB and what regret does it achieve?
What's Thompson sampling?
Why doesn't $\epsilon$-greedy achieve $O(\log T)$ regret in general?
Distinguish regret minimization from best-arm identification.
What's the Track-and-Stop algorithm for?
How does bandit theory connect to RLHF?
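
For the UCB question, a compact sketch of UCB1 (the variant with exploration bonus $\sqrt{2 \ln t / n_a}$) may help when explaining it aloud. It assumes Bernoulli rewards and simulates the arms itself; `arm_means` and the function name are illustrative.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on simulated Bernoulli arms and return pull counts per arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialize
        else:
            # Optimism: empirical mean plus a bonus that shrinks with pulls.
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

if __name__ == "__main__":
    # Most pulls should go to the better arm; the worse arm is still tried occasionally.
    print(ucb1([0.2, 0.8], horizon=2000))
```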

Section G — Importance and rejection sampling (Q54–58)

State the importance-sampling identity.
When does importance sampling have high variance?
Why does importance sampling appear in PPO?
Walk through rejection sampling.
When is rejection sampling impractical (acceptance rate)?
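
Both sampling schemes in this section fit in a few lines. The sketch below makes its own choices: importance sampling estimates $E_{p}[f(X)]$ for $p = \mathcal{N}(0, 1)$ from draws of a wider proposal $q = \mathcal{N}(0, 2^2)$, and rejection sampling targets the density $p(x) = 2x$ on $[0, 1]$ with a uniform proposal and envelope constant $M = 2$. All names are hypothetical.

```python
import math
import random

def normal_pdf(x, sigma):
    """Density of N(0, sigma^2) at x."""
    return math.exp(-x * x / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

def importance_estimate(f, n=20000, seed=0):
    """Estimate E_p[f(X)], p = N(0, 1), by reweighting draws from q = N(0, 4)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)
        weight = normal_pdf(x, 1.0) / normal_pdf(x, 2.0)  # p(x) / q(x)
        total += weight * f(x)
    return total / n

def rejection_sample(n, seed=0):
    """Draw n samples from p(x) = 2x on [0, 1]; also return the acceptance rate."""
    rng = random.Random(seed)
    samples, proposals = [], 0
    while len(samples) < n:
        x = rng.random()          # proposal q = Uniform(0, 1)
        proposals += 1
        if rng.random() < x:      # accept w.p. p(x) / (M q(x)) = 2x / 2 = x
            samples.append(x)
    return samples, len(samples) / proposals

if __name__ == "__main__":
    print(importance_estimate(lambda x: x * x))  # should be near E_p[X^2] = 1
    samples, rate = rejection_sample(5000)
    print(sum(samples) / len(samples), rate)     # mean near 2/3, rate near 1/M = 0.5
```

The proposal here is wider than the target, so the weights stay bounded; reversing the roles (narrow proposal, wide target) is the classic high-variance case.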

Section H — Stein and shrinkage (Q59–62)

State the James-Stein result.
Why is James-Stein "paradoxical"?
How does shrinkage relate to Bayesian priors?
How does this connect to weight decay in deep learning?
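
The James-Stein claim can be checked by simulation. The sketch below uses the positive-part variant (a common refinement, named here explicitly) that shrinks toward the origin, and assumes $X \sim \mathcal{N}(\theta, I)$ with known unit variance and dimension at least 3. The function names are illustrative.

```python
import random

def james_stein(x, sigma2=1.0):
    """Positive-part James-Stein shrinkage toward the origin (needs len(x) >= 3)."""
    d = len(x)
    norm2 = sum(v * v for v in x)
    if norm2 == 0.0:
        return list(x)
    shrink = max(0.0, 1.0 - (d - 2) * sigma2 / norm2)
    return [shrink * v for v in x]

def simulated_risks(theta, trials=5000, seed=0):
    """Average squared error of the raw observation vs James-Stein."""
    rng = random.Random(seed)
    raw = js = 0.0
    for _ in range(trials):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        est = james_stein(x)
        raw += sum((a - t) ** 2 for a, t in zip(x, theta))
        js += sum((a - t) ** 2 for a, t in zip(est, theta))
    return raw / trials, js / trials

if __name__ == "__main__":
    # In dimension 10 the raw risk is about 10; the shrunk estimate should do better.
    print(simulated_risks([1.0] * 10))
```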

Section I — The two-distribution scenario, fully drilled (Q63–75)

State the question in one sentence.
What's the Bayes-optimal decision rule?
Three approaches to estimating $p (x)$ and $q (x)$ from arrays.
Tradeoff between parametric (Gaussian) and KDE density estimates.
When is discriminative classification (logistic regression on combined data) better than generative?
How do you quantify confidence in the classification of a new sample?
What if both $p(x)$ and $q(x)$ are tiny? How do you handle it?
Sample complexity scaling with $\mathrm{KL}(P \| Q)$: derive the intuition.
What if priors $π_{P}, π_{Q}$ are unknown?
What if the loss is asymmetric?
KDE bandwidth — how do you pick it?
What's Silverman's rule of thumb?
Walk me through the 90-second oral answer end to end.
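
One way to rehearse the two-distribution scenario end to end is a small generative classifier. The sketch assumes one-dimensional data, a Gaussian-kernel KDE for each array, Silverman's rule of thumb in the form $h = 0.9 \min(\hat\sigma, \mathrm{IQR}/1.34)\, n^{-1/5}$, and a known prior; it works in log space so that tiny densities do not underflow. The names `posterior_p` and `kde_logpdf` are hypothetical, and the bandwidth helper assumes the sample has nonzero spread.

```python
import math
import statistics

def silverman_bandwidth(xs):
    """Silverman's rule of thumb; assumes the sample is not degenerate."""
    n = len(xs)
    sd = statistics.stdev(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4)
    return 0.9 * min(sd, (q3 - q1) / 1.34) * n ** (-0.2)

def kde_logpdf(x, xs, h):
    """Log of a Gaussian-kernel KDE at x, computed with log-sum-exp."""
    logs = [-0.5 * ((x - xi) / h) ** 2 for xi in xs]
    m = max(logs)
    lse = m + math.log(sum(math.exp(v - m) for v in logs))
    return lse - math.log(len(xs) * h * math.sqrt(2 * math.pi))

def posterior_p(x, xs_p, xs_q, prior_p=0.5):
    """Posterior probability that x came from P, via Bayes' rule on KDE fits."""
    lp = math.log(prior_p) + kde_logpdf(x, xs_p, silverman_bandwidth(xs_p))
    lq = math.log(1 - prior_p) + kde_logpdf(x, xs_q, silverman_bandwidth(xs_q))
    m = max(lp, lq)
    return math.exp(lp - m) / (math.exp(lp - m) + math.exp(lq - m))

if __name__ == "__main__":
    xs_p = [i / 50 - 2 for i in range(201)]   # toy array spread over [-2, 2]
    xs_q = [i / 50 + 1 for i in range(201)]   # toy array spread over [1, 5]
    for x in (0.0, 1.5, 3.0, -50.0):
        print(x, posterior_p(x, xs_p, xs_q))
```

The returned posterior doubles as a confidence score. Far from both arrays it is still well defined numerically, but it reflects only the kernel tails, which is worth saying out loud in the interview.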

Section J — Brain-teaser style (Q76–95)

Coin flip: 10 heads in a row. What is $P(\text{biased})$ given a prior of $0.5$ on the coin being biased?
Two arrays of size $n$ from continuous distributions. New point. Decide source.
Birthday problem — formula and answer for 50%.
Monty Hall, and why it breaks under a random host.
$X, Y$ independent uniform on $[0, 1]$: compute $E[\max(X, Y)]$.
$X \sim \operatorname{Exp}(\lambda)$: what's $P(X > a + b \mid X > a)$?
Sum of $k$ i.i.d. exponentials — what distribution?
Why is median more robust than mean?
Estimate $π$ via Monte Carlo.
Detect a change-point in a Gaussian stream — algorithm?
German tank problem — MLE and MVUE.
Welch's $t$-test: when?
AB test: $p = 0.04, n = 10000$ — should you ship?
Power calculation: detect $p = 0.6$ vs $p = 0.5$ at 5% Type-I, 5% Type-II — sample size?
Variance of sample variance for Gaussian — formula?
Estimate the mean from 3 samples — what's the CI?
Empirical CDF vs density estimation — what's the gotcha?
Test if a sample is normal — three methods.
Two-sample distribution test: Kolmogorov-Smirnov vs Mann-Whitney vs $t$-test.
Estimate KL between two empirical distributions — three methods.
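
Several of the numeric teasers above can be sanity-checked in code after you have answered them on paper. The sketch covers the birthday threshold (exact), $E[\max(X, Y)]$ and $\pi$ (Monte Carlo), and the power calculation. The last one assumes a one-sided, one-sample test with the normal approximation, which is only one reading of the question; a two-sided or two-sample setup gives a larger $n$. All function names are illustrative.

```python
import math
import random
from statistics import NormalDist

def birthday_threshold(days=365, target=0.5):
    """Smallest group size whose shared-birthday probability reaches target."""
    p_distinct, k = 1.0, 0
    while 1.0 - p_distinct < target:
        p_distinct *= (days - k) / days
        k += 1
    return k

def mc_expected_max(n=100000, seed=0):
    """Monte Carlo estimate of E[max(X, Y)] for independent Uniform(0, 1)."""
    rng = random.Random(seed)
    return sum(max(rng.random(), rng.random()) for _ in range(n)) / n

def mc_pi(n=100000, seed=0):
    """Estimate pi from the fraction of uniform points inside the quarter disc."""
    rng = random.Random(seed)
    inside = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0 for _ in range(n))
    return 4.0 * inside / n

def proportion_sample_size(p0, p1, alpha=0.05, beta=0.05):
    """One-sided normal-approximation sample size for testing p0 against p1."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(1 - beta)
    num = z_a * math.sqrt(p0 * (1 - p0)) + z_b * math.sqrt(p1 * (1 - p1))
    return math.ceil((num / (p1 - p0)) ** 2)

if __name__ == "__main__":
    print(birthday_threshold())
    print(mc_expected_max(), 2 / 3)
    print(mc_pi(), math.pi)
    print(proportion_sample_size(0.5, 0.6))
```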

Section K — Common follow-up probes (Q96–105)

"What if your prior is wrong?"
"What's the variance of your estimator?"
"What if the distributions overlap heavily?"
"What's your sample complexity?"
"What if you don't know the parametric family?"
"What if the loss is asymmetric?"
"How would this fail in production?"
"Why are you confident in your estimator?"
"Compare with another method — bias-variance trade-off?"
"Connection to information theory?"

Quick fire (Q106–125)

One line: Bayes' rule.
One line: likelihood ratio test.
One line: Neyman-Pearson lemma.
One line: KL between two Gaussians.
One line: Cramér-Rao bound.
One line: Hoeffding inequality.
One line: CLT.
One line: UCB.
One line: Thompson sampling.
One line: importance sampling.
One line: James-Stein.
One line: Chernoff information.
One line: Bayes error rate.
One line: empirical CDF.
One line: KDE.
One line: Welch's $t$-test.
One line: power of a test.
One line: change-point detection.
One line: German tank problem.
One line: discriminative vs generative classification.

Self-grading

110+ correct: ready for frontier-lab probability rounds.
80–109: re-read the framework sections (§2–§8) and the worked examples (§10).
50–79: re-read the full deep dive, then redo.
<50: spend three days drilling the deep dive.

5-day drill plan

Day 1: §1 (framing) + §2 (Bayesian classification). Drill A, B.
Day 2: §3 (MLE) + §4 (concentration). Drill C, D.
Day 3: §5 (KL) + §6 (bandits) + §7 (importance) + §8 (Stein). Drill E, F, G, H.
Day 4: §9 (two-distribution scenario, memorize the 90-second answer) + §10 (25 worked questions). Drill I, J.
Day 5: §11 (follow-up probes) + §12 (senior signals) + Quick fire. Whiteboard 5 random questions end-to-end out loud.
