Frontier Intuitive Probability / Statistics — Interview Grill
100+ active-recall questions calibrated for OpenAI / DeepMind / Anthropic research-scientist rounds. Each calls for a 60-second oral answer. Pair with INTUITIVE_QUESTIONS_DEEP_DIVE.md.
Section A — Framing checklist (Q1–7)
- List the 7 framing checklist items in order.
- When does a problem call for Bayesian vs frequentist framing?
- When is a problem a classification vs estimation vs decision?
- Why is it important to state the loss function before computing?
- What's the difference between MAP and posterior mean as point estimates? Under what loss is each optimal?
- What does "asymptotic" mean and when do you reach for it?
- State a single-sentence summary of the framing approach.
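As a self-check for the MAP versus posterior-mean question above, here is a minimal sketch assuming a Beta prior on a coin's heads probability. The prior, the data counts, and the function names are illustrative assumptions, not taken from the deep dive. The posterior mean minimizes expected squared-error loss, while the MAP estimate is the posterior mode.

```python
def beta_posterior(a, b, heads, tails):
    """Conjugate update: a Beta(a, b) prior plus coin flips gives Beta(a + heads, b + tails)."""
    return a + heads, b + tails


def posterior_mean(a, b):
    # Point estimate that minimizes expected squared-error loss.
    return a / (a + b)


def posterior_mode(a, b):
    # MAP estimate; this closed form is only valid for a > 1 and b > 1.
    if a <= 1 or b <= 1:
        raise ValueError("mode formula needs a > 1 and b > 1")
    return (a - 1) / (a + b - 2)


# Illustrative numbers: Beta(2, 2) prior, then 7 heads and 1 tail -> Beta(9, 3).
a_post, b_post = beta_posterior(2, 2, heads=7, tails=1)
print(posterior_mean(a_post, b_post))  # 0.75
print(posterior_mode(a_post, b_post))  # 0.8
```

With a uniform Beta(1, 1) prior the posterior mode equals the sample frequency, which is one case where MAP and MLE coincide.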
Section B — Bayesian classification (Q8–18)
- Sketch Bayes' rule with priors and likelihoods.
- Define the likelihood ratio and the prior odds.
- State the Neyman-Pearson lemma.
- Why is the LRT optimal at fixed false-positive rate?
- For $n$ i.i.d. samples, how does the log-likelihood ratio scale with $n$?
- What's the expected per-sample log-likelihood ratio $\log \frac{p(x)}{q(x)}$ under $P$? (Hint: it's a familiar quantity.)
- State the sample complexity formula for distinguishability.
- Why is sample complexity rather than ?
- What's Chernoff information? How is it related to Bayes error rate?
- Walk through the Bayes error rate formula $P_e = \int \min\{\pi_P\, p(x),\ \pi_Q\, q(x)\}\, dx$.
- Two-class question — under asymmetric loss, how does the threshold shift?
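To check answers in this section numerically, the sketch below assumes two equal-variance Gaussian classes P and Q with known means. That setup and all function names are assumptions for illustration. It accumulates the log-likelihood ratio over i.i.d. samples, adds the log prior odds, and converts the result to a posterior probability.

```python
import math


def gaussian_llr(xs, mu_p, mu_q, sigma):
    """Sum over samples of log p(x)/q(x) for N(mu_p, sigma^2) versus N(mu_q, sigma^2)."""
    return sum(((x - mu_q) ** 2 - (x - mu_p) ** 2) / (2 * sigma ** 2) for x in xs)


def posterior_p(llr, prior_p=0.5):
    """Posterior probability of P, using log posterior odds = log prior odds + LLR."""
    log_odds = math.log(prior_p / (1 - prior_p)) + llr
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1 + z)


def kl_same_variance(mu_p, mu_q, sigma):
    """KL(P || Q) for equal-variance Gaussians: the expected per-sample LLR under P."""
    return (mu_p - mu_q) ** 2 / (2 * sigma ** 2)


samples = [0.9, 1.2, 0.8]  # illustrative data
llr = gaussian_llr(samples, mu_p=1.0, mu_q=0.0, sigma=1.0)
print(llr, posterior_p(llr), kl_same_variance(1.0, 0.0, 1.0))
```

For these three samples the summed log-likelihood ratio is about 1.4. Because the sum grows linearly in the number of samples with slope equal to the KL divergence under P, roughly 1/KL samples are needed before the evidence is of order one.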
Section C — MLE, MAP, method of moments (Q19–27)
- State the MLE objective.
- Three asymptotic properties of MLE.
- What's Fisher information?
- What's the Cramér-Rao lower bound?
- When is MLE biased in finite samples? Give an example.
- Compare MLE with MAP — when are they the same?
- Why is MAP not the optimal Bayesian estimator under squared loss?
- When does method of moments beat MLE?
- What does "asymptotically efficient" mean?
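One standard answer to the finite-sample bias question above is the Gaussian variance MLE, which divides by n instead of n - 1. The sketch below uses illustrative function names and data to compare the two estimators. The MLE has expectation (n - 1)/n times the true variance, so the bias vanishes as n grows.

```python
def variance_mle(xs):
    """Gaussian MLE of the variance: divides by n, biased downward in finite samples."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def variance_unbiased(xs):
    """Bessel-corrected estimator: divides by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


data = [1.0, 2.0, 3.0, 4.0]  # illustrative data
print(variance_mle(data))       # 1.25
print(variance_unbiased(data))
```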
Section D — Concentration and tail bounds (Q28–36)
- State Markov's inequality.
- State Chebyshev's inequality.
- State Hoeffding's inequality (precise form).
- When does Hoeffding apply but Bernstein doesn't?
- When does Bernstein give a sharper bound than Hoeffding?
- What's the moment generating function and why does it matter for Chernoff?
- When does CLT apply?
- CLT — what's the rate of convergence (Berry-Esseen)?
- For binary outcomes with sample size $n$ and observed rate $\hat{p}$, what's the 95% CI half-width?
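For the confidence-interval question above, a small sketch comparing the normal-approximation (Wald) half-width with the distribution-free half-width implied by Hoeffding's inequality for [0, 1]-valued samples. The values n = 1000 and an observed rate of 0.5 are assumptions chosen for illustration, as are the function names.

```python
import math


def wald_half_width(p_hat, n, z=1.96):
    """CLT-based half-width: z * sqrt(p_hat * (1 - p_hat) / n)."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


def hoeffding_half_width(n, delta=0.05):
    """Solve 2 * exp(-2 * n * t^2) = delta for t; valid for samples bounded in [0, 1]."""
    return math.sqrt(math.log(2 / delta) / (2 * n))


print(wald_half_width(0.5, 1000))   # roughly 0.031
print(hoeffding_half_width(1000))   # roughly 0.043
```

The Hoeffding width is wider because it holds for any bounded distribution and ignores the variance, which is the gap Bernstein-type bounds narrow when the variance is small.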
Section E — KL divergence and information theory (Q37–46)
- Define KL divergence and state two key properties.
- Why is KL asymmetric?
- Sketch KL between two univariate Gaussians with same variance.
- Why does KL matter for distinguishability?
- State the relationship between KL and Bayes error exponent.
- What's the Fano inequality?
- What's mutual information in one sentence?
- Why is KL the "natural" loss in maximum-likelihood / VAE / diffusion?
- Why is reverse-KL different from forward-KL in posterior approximation?
- KL as coding excess — explain.
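The closed form for KL between two univariate Gaussians makes a quick check for several questions in this section. The sketch below (function name is illustrative) implements the general formula, which reduces to (mu1 - mu2)^2 / (2 sigma^2) when the variances match, and shows the asymmetry by swapping the arguments.

```python
import math


def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2)
            - 0.5)


print(kl_gaussian(0.0, 1.0, 1.0, 1.0))  # 0.5, the equal-variance case
print(kl_gaussian(0.0, 1.0, 0.0, 2.0))  # narrow relative to wide
print(kl_gaussian(0.0, 2.0, 0.0, 1.0))  # wide relative to narrow, noticeably larger
```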
Section F — Sequential decision / bandits (Q47–53)
- Define the multi-armed bandit problem.
- What's UCB and what regret does it achieve?
- What's Thompson sampling?
- Why doesn't $\epsilon$-greedy achieve $O(\log T)$ regret in general?
- Distinguish regret minimization from best-arm identification.
- What's the Track-and-Stop algorithm for?
- How does bandit theory connect to RLHF?
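For the UCB question above, a minimal sketch of UCB1 on a Bernoulli bandit. The arm means, horizon, and seed are illustrative assumptions, and the exploration bonus sqrt(2 ln t / n_i) is the classic UCB1 choice rather than anything specific to the deep dive.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms and return how many times each arm was pulled."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: pull every arm once
        else:
            # Optimism: empirical mean plus a confidence bonus that shrinks with pulls.
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


print(ucb1([0.2, 0.5, 0.8], horizon=5000))
```

Suboptimal arms keep being pulled, but only at a rate that grows logarithmically in the horizon, which is where the logarithmic regret comes from.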
Section G — Importance and rejection sampling (Q54–58)
- State the importance-sampling identity.
- When does importance sampling have high variance?
- Why does importance sampling appear in PPO?
- Walk through rejection sampling.
- When is rejection sampling impractical (acceptance rate)?
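The importance-sampling identity E_p[f(X)] = E_q[f(X) p(X)/q(X)] can be checked with a short simulation. In this sketch the target p = N(0, 1), the wider proposal q = N(0, 2^2), and the test function f(x) = x^2 are all illustrative assumptions. A proposal with heavier tails than the target keeps the weights bounded. Reversing the roles is the classic way to get exploding variance.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def importance_estimate(f, n, seed=0):
    """Estimate E_p[f(X)] for p = N(0, 1) using draws from the proposal q = N(0, 2^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 2.0)                              # sample from q
        w = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 0.0, 2.0)  # weight p(x) / q(x)
        total += w * f(x)
    return total / n


# E_p[X^2] is 1 for a standard normal, so the estimate should land near 1.
print(importance_estimate(lambda x: x * x, n=20000))
```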
Section H — Stein and shrinkage (Q59–62)
- State the James-Stein result.
- Why is James-Stein "paradoxical"?
- How does shrinkage relate to Bayesian priors?
- How does this connect to weight decay in deep learning?
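The James-Stein result can be seen empirically with a small simulation. This sketch uses the positive-part variant of the estimator, shrinking toward the origin with unit noise variance. The dimension, true mean vector, trial count, and function names are illustrative assumptions.

```python
import random


def james_stein(x):
    """Positive-part James-Stein estimate, shrinking x toward 0 (unit noise variance assumed)."""
    d = len(x)
    norm_sq = sum(v * v for v in x)
    shrink = max(0.0, 1.0 - (d - 2) / norm_sq)
    return [shrink * v for v in x]


def compare_risk(theta, trials=2000, seed=0):
    """Monte Carlo total squared error of the MLE (x itself) versus James-Stein."""
    rng = random.Random(seed)
    mle_loss = 0.0
    js_loss = 0.0
    for _ in range(trials):
        x = [t + rng.gauss(0.0, 1.0) for t in theta]
        js = james_stein(x)
        mle_loss += sum((xi - t) ** 2 for xi, t in zip(x, theta))
        js_loss += sum((ji - t) ** 2 for ji, t in zip(js, theta))
    return mle_loss / trials, js_loss / trials


mle_risk, js_risk = compare_risk([1.0] * 10)
print(mle_risk, js_risk)  # the MLE risk is near d = 10; the James-Stein risk is lower
```

The shrinkage factor plays the same role as a zero-mean Gaussian prior whose strength is estimated from the data, which is the link to ridge penalties and weight decay.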
Section I — The two-distribution scenario, fully drilled (Q63–75)
- State the question in one sentence.
- What's the Bayes-optimal decision rule?
- Three approaches to estimating $p$ and $q$ from arrays.
- Tradeoff between parametric (Gaussian) vs KDE.
- When is discriminative classification (logistic regression on combined data) better than generative?
- How do you quantify confidence in the classification of a new sample?
- What if both $p(x)$ and $q(x)$ are tiny? How do you handle that?
- Sample complexity scaling: $n \gtrsim 1/\mathrm{KL}(P \,\|\, Q)$. Derive the intuition.
- What if priors are unknown?
- What if the loss is asymmetric?
- KDE bandwidth — how do you pick it?
- What's Silverman's rule of thumb?
- Walk me through the 90-second oral answer end to end.
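One way to rehearse the two-distribution scenario is to code the nonparametric route end to end. The sketch below is an illustration under stated assumptions, not the deep dive's reference answer. It fits a Gaussian KDE to each array with Silverman's rule of thumb, evaluates both densities in log space so tiny values do not underflow, and turns the log-likelihood ratio plus log prior odds into a posterior. The arrays and function names are hypothetical.

```python
import math
import statistics


def silverman_bandwidth(xs):
    """Silverman's rule of thumb: 0.9 * min(sd, IQR / 1.34) * n^(-1/5)."""
    n = len(xs)
    sd = statistics.stdev(xs)
    q1, _, q3 = statistics.quantiles(xs, n=4)
    iqr = q3 - q1
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-0.2)


def log_kde(x, xs, h):
    """Log of a Gaussian KDE at x, using log-sum-exp to avoid underflow."""
    exponents = [-0.5 * ((x - xi) / h) ** 2 for xi in xs]
    m = max(exponents)
    log_sum = m + math.log(sum(math.exp(e - m) for e in exponents))
    return log_sum - math.log(len(xs) * h * math.sqrt(2 * math.pi))


def posterior_from_a(x, a, b, prior_a=0.5):
    """Posterior probability that x came from the distribution behind array a."""
    log_odds = (math.log(prior_a / (1 - prior_a))
                + log_kde(x, a, silverman_bandwidth(a))
                - log_kde(x, b, silverman_bandwidth(b)))
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1 + z)


a = [-1.2, -0.5, 0.0, 0.3, 0.8, 1.1]  # hypothetical samples from the first source
b = [3.9, 4.4, 5.0, 5.2, 5.7, 6.3]    # hypothetical samples from the second source
print(posterior_from_a(0.2, a, b), posterior_from_a(5.1, a, b))
```

An asymmetric loss would move the decision threshold away from a posterior of 0.5 without changing the density estimates, and swapping the KDE for a fitted Gaussian gives the parametric version of the same rule.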
Section J — Brain-teaser style (Q76–95)
- Coin flip: 10 heads in a row. What's $P(\text{next flip is heads})$ given a prior on the bias?
- Two arrays of size $n$ from continuous distributions. New point. Decide source.
- Birthday problem — formula and answer for 50%.
- Monty Hall — and why it breaks under random host.
- uniform — compute .
- — what's ?
- Sum of i.i.d. exponentials — what distribution?
- Why is median more robust than mean?
- Estimate $\pi$ via Monte Carlo.
- Detect a change-point in a Gaussian stream — algorithm?
- German tank problem — MLE and MVUE.
- Welch's $t$-test: when do you use it?
- AB test: — should you ship?
- Power calculation: detect vs at 5% Type-I, 5% Type-II — sample size?
- Variance of sample variance for Gaussian — formula?
- Estimate the mean from 3 samples — what's the CI?
- Empirical CDF vs density estimation — what's the gotcha?
- Test if a sample is normal — three methods.
- Two-sample distribution test: Kolmogorov-Smirnov vs Mann-Whitney vs $t$-test.
- Estimate KL between two empirical distributions — three methods.
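Two of the brain-teasers above have closed forms that are easy to verify in code: the birthday problem and the German tank problem. The sketch below assumes uniform birthdays over 365 days and serial numbers drawn without replacement from 1..N. The sample serials and function names are illustrative.

```python
def birthday_collision_prob(k, days=365):
    """P(at least two of k people share a birthday), assuming uniform birthdays."""
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (days - i) / days
    return 1.0 - p_distinct


def smallest_group(target=0.5, days=365):
    """Smallest group size whose collision probability reaches the target."""
    k = 1
    while birthday_collision_prob(k, days) < target:
        k += 1
    return k


def german_tank(serials):
    """Return (MLE, MVUE) for N: the sample maximum m, and m + m / k - 1."""
    k = len(serials)
    m = max(serials)
    return m, m + m / k - 1


print(smallest_group())                # 23
print(german_tank([19, 40, 42, 60]))   # (60, 74.0)
```

The MLE is biased low because the sample maximum can never exceed N, and the correction term m / k - 1 is the average gap between observed serials.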
Section K — Common follow-up probes (Q96–105)
- "What if your prior is wrong?"
- "What's the variance of your estimator?"
- "What if the distributions overlap heavily?"
- "What's your sample complexity?"
- "What if you don't know the parametric family?"
- "What if the loss is asymmetric?"
- "How would this fail in production?"
- "Why are you confident in your estimator?"
- "Compare with another method — bias-variance trade-off?"
- "Connection to information theory?"
Quick fire (Q106–125)
- One line: Bayes' rule.
- One line: likelihood ratio test.
- One line: Neyman-Pearson lemma.
- One line: KL between two Gaussians.
- One line: Cramér-Rao bound.
- One line: Hoeffding inequality.
- One line: CLT.
- One line: UCB.
- One line: Thompson sampling.
- One line: importance sampling.
- One line: James-Stein.
- One line: Chernoff information.
- One line: Bayes error rate.
- One line: empirical CDF.
- One line: KDE.
- One line: Welch's $t$-test.
- One line: power of a test.
- One line: change-point detection.
- One line: German tank problem.
- One line: discriminative vs generative classification.
Self-grading
- 110+ correct: ready for frontier-lab probability rounds.
- 80–109: re-read framework sections (§2–§8) and the worked examples (§10).
- 50–79: re-read full deep dive then redo.
- <50: spend three days drilling the deep dive.
5-day drill plan
- Day 1: §1 (framing) + §2 (Bayesian classification). Drill A, B.
- Day 2: §3 (MLE) + §4 (concentration). Drill C, D.
- Day 3: §5 (KL) + §6 (bandits) + §7 (importance) + §8 (Stein). Drill E, F, G, H.
- Day 4: §9 (two-distribution scenario, memorize the 90-second answer) + §10 (25 worked questions). Drill I, J.
- Day 5: §11 (follow-up probes) + §12 (senior signals) + Quick fire. Whiteboard 5 random questions end-to-end out loud.