Frontier Intuitive Probability / Statistics Questions — Deep Dive
Frontier-lab research-scientist interview-grade reference for the open-ended Bayesian / probabilistic reasoning questions OpenAI / DeepMind / Anthropic ask. Built around 25 worked examples — including the canonical "two distributions, new sample, which one is it from" question — plus the underlying frameworks.
These questions are not "memorize a formula." They test whether you can frame an open scenario in probabilistic terms, identify the right tool, and reason cleanly to an answer. Frontier interviewers love them because they reveal depth in seconds: the answer is rarely the point; the framing is.
Table of contents
- The framing checklist
- Framework 1 — Bayesian classification / hypothesis testing
- Framework 2 — Maximum likelihood and method of moments
- Framework 3 — Concentration and tail bounds
- Framework 4 — KL divergence as test statistic / "distance"
- Framework 5 — Sequential decision making and bandits
- Framework 6 — Importance sampling, rejection sampling
- Framework 7 — Stein's paradox and shrinkage
- The DeepMind two-distribution question — fully worked
- 25 worked frontier-lab questions
- Common follow-up probes
- Senior-level signals
- References
1. The framing checklist
When a probabilistic-scenario question lands, ask yourself in this order:
(a) What random variables are involved? Define them precisely.
(b) What are the candidate hypotheses or models? (Two distributions? A null and an alternative? A prior over models?)
(c) Is this a classification, an estimation, or a decision problem?
- Classification → Bayes' rule, likelihood ratio.
- Estimation → MLE/MAP/posterior mean.
- Decision → expected loss minimization.
(d) Do I have a prior, or am I purely frequentist?
- Prior available → Bayesian: posterior ∝ likelihood × prior.
- No prior → frequentist: likelihood ratio test, confidence intervals, p-values.
(e) What's the loss function or success metric?
- 0-1 loss → MAP / mode of posterior.
- Squared error → posterior mean.
- Asymmetric loss → tilt the threshold.
(f) How much data do I have? What's the appropriate level of confidence?
- Few samples → priors and tail-bound reasoning matter most.
- Many samples → asymptotics, CLT, Fisher information.
(g) What can I compute and what's only conceptual? State explicitly when you'd compute numerically vs argue from principle.
This checklist is the difference between a flailing answer and a clean one. State it out loud as you start.
2. Framework 1 — Bayesian classification
The most-tested framework in frontier-lab probability questions.
2.1 The setup
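Given hypotheses $H_1, \dots, H_K$ (e.g., "sample came from distribution $P$" vs "sample came from $Q$") and observation $x$:
$$P(H_k \mid x) = \frac{p(x \mid H_k)\, P(H_k)}{\sum_{j} p(x \mid H_j)\, P(H_j)}$$
For the binary case, with densities $p$, $q$ and priors $\pi_P$, $\pi_Q$:
$$P(H_P \mid x) = \frac{\pi_P\, p(x)}{\pi_P\, p(x) + \pi_Q\, q(x)}$$
Decision rule under 0-1 loss: pick $H_P$ if $P(H_P \mid x) > P(H_Q \mid x)$, equivalently:
$$\frac{p(x)}{q(x)} > \frac{\pi_Q}{\pi_P}$$
That is, compare the likelihood ratio against the inverse prior odds. The Neyman-Pearson lemma says: for a fixed false-positive rate, the likelihood ratio test is the most powerful test of one simple hypothesis against another.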
2.2 With multiple samples
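For i.i.d. samples $x_1, \dots, x_n$ and candidate densities $p$ and $q$:
$$\frac{p(x_1, \dots, x_n)}{q(x_1, \dots, x_n)} = \prod_{i=1}^{n} \frac{p(x_i)}{q(x_i)}$$
Better in log-space:
$$\Lambda_n = \sum_{i=1}^{n} \log \frac{p(x_i)}{q(x_i)}$$
The expected log-likelihood ratio under $P$ is the KL divergence:
$$\mathbb{E}_{x \sim P}\left[\log \frac{p(x)}{q(x)}\right] = \mathrm{KL}(P \,\|\, Q)$$
This is why two distributions with high KL are easy to distinguish; close-to-zero KL means hard.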
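As a minimal sketch of the log-space computation, the snippet below sums per-sample log-likelihood ratios for two Gaussian hypotheses and converts the result to a posterior. The Gaussian choice, the helper names, and the sample values are illustrative assumptions, not part of the original text.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def log_likelihood_ratio(xs, mu_p, sigma_p, mu_q, sigma_q):
    """Sum over samples of log p(x_i) - log q(x_i)."""
    return sum(
        gaussian_logpdf(x, mu_p, sigma_p) - gaussian_logpdf(x, mu_q, sigma_q)
        for x in xs
    )

def posterior_p(xs, mu_p, sigma_p, mu_q, sigma_q, prior_p=0.5):
    """Posterior probability of hypothesis P: logistic of (LLR + prior log-odds)."""
    log_odds = log_likelihood_ratio(xs, mu_p, sigma_p, mu_q, sigma_q)
    log_odds += math.log(prior_p / (1 - prior_p))
    return 1.0 / (1.0 + math.exp(-log_odds))

samples = [0.9, 1.4, 0.3, 1.1]  # illustrative data, assumed for this example
# P = N(1, 1), Q = N(0, 1): each term reduces to x_i - 0.5
print(log_likelihood_ratio(samples, 1.0, 1.0, 0.0, 1.0))
print(posterior_p(samples, 1.0, 1.0, 0.0, 1.0))
```

Working in log-space avoids underflow when $n$ is large, since the product of many small densities would round to zero.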
2.3 Sample complexity
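How many samples $n$ do you need to distinguish $P$ from $Q$ with confidence $1 - \delta$?
By the central limit theorem, the log-likelihood ratio's mean grows linearly in $n$ (slope $\mathrm{KL}(P \,\|\, Q)$ if the truth is $P$, $-\mathrm{KL}(Q \,\|\, P)$ if the truth is $Q$) and its standard deviation grows like $\sqrt{n}$. So the signal-to-noise ratio scales as $\sqrt{n}$. Specifically:
$$n \gtrsim \frac{\log(1/\delta)}{\mathrm{KL}(P \,\|\, Q)}$$
(roughly; the precise formula depends on test type).
Memorize: distinguishing two distributions takes $O(1/\mathrm{KL})$ samples.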
2.4 Connection to Chernoff information
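The exponent of the optimal Bayes error rate is the Chernoff information:
$$C(P, Q) = -\min_{0 \le \lambda \le 1} \log \int p(x)^{\lambda}\, q(x)^{1-\lambda}\, dx$$
The optimal classifier's error rate decays as $e^{-n\,C(P,Q)}$. KL only upper-bounds this exponent, $C(P,Q) \le \min\{\mathrm{KL}(P \,\|\, Q), \mathrm{KL}(Q \,\|\, P)\}$, so a KL-based rate is optimistic; Chernoff is tight.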
3. Framework 2 — Maximum likelihood and method of moments
3.1 MLE
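$\hat\theta_{\mathrm{MLE}} = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid \theta)$.
Properties:
- Consistent. $\hat\theta_n \to \theta^*$ in probability as $n \to \infty$.
- Asymptotically normal. $\sqrt{n}\,(\hat\theta_n - \theta^*) \xrightarrow{d} \mathcal{N}(0, I(\theta^*)^{-1})$, where $I$ is the Fisher information.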
- Asymptotically efficient. Reaches the Cramér-Rao lower bound.
- Sometimes biased in finite samples. Common interview gotcha.
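The finite-sample bias gotcha is easiest to see with the Gaussian variance: the MLE divides by $n$, while the unbiased estimator divides by $n-1$. A small illustration using Python's standard library (the data values are made up for the example):

```python
import statistics

data = [1.0, 2.0, 3.0, 4.0]  # illustrative sample, assumed for this example
n = len(data)

mle_var = statistics.pvariance(data)      # divides by n: the Gaussian MLE
unbiased_var = statistics.variance(data)  # divides by n - 1

print(mle_var)       # sum of squared deviations is 5.0, so 5.0 / 4
print(unbiased_var)  # 5.0 / 3
# The MLE is the unbiased estimate scaled by (n - 1) / n, so it is biased low.
print(mle_var / unbiased_var, (n - 1) / n)
```

The bias factor $(n-1)/n$ vanishes as $n$ grows, which is consistent with the MLE being consistent and asymptotically efficient.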
3.2 MAP
$\hat\theta_{\mathrm{MAP}} = \arg\max_{\theta} \left[\log p(x \mid \theta) + \log p(\theta)\right]$.
Reduces to MLE under uniform prior. Useful when data is sparse and prior is informative.
3.3 Method of moments
Solve $\mathbb{E}_\theta[X^k] = \frac{1}{n}\sum_{i} x_i^k$ for $\theta$, using the first few moments $k = 1, 2, \dots$
- Often less efficient than MLE.
- Sometimes more robust.
- Easier when likelihood is intractable.
3.4 Posterior mean vs MAP
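For squared-error loss, posterior mean is optimal. For 0-1 loss on continuous $\theta$, every point estimate is wrong with probability 1, so the MAP is not literally the Bayes estimator (zero-set issue). It is the limit of the Bayes estimators under the loss $\mathbf{1}\{|\hat\theta - \theta| > \epsilon\}$ as $\epsilon \to 0$, and a reasonable approximation when the posterior is unimodal.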
4. Framework 3 — Concentration and tail bounds
For "with high probability X is small" questions.
4.1 Markov
$P(X \ge a) \le \mathbb{E}[X] / a$ for non-negative $X$ and $a > 0$. Crude but always valid.
4.2 Chebyshev
$P(|X - \mu| \ge k\sigma) \le 1/k^2$. Two-sided, no distributional assumption beyond finite variance.
4.3 Hoeffding
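For bounded i.i.d. $X_i \in [a, b]$ with mean $\mu$ and sample mean $\bar X_n$:
$$P\left(|\bar X_n - \mu| \ge t\right) \le 2 \exp\left(-\frac{2 n t^2}{(b - a)^2}\right)$$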
Sub-Gaussian tail; the workhorse for many concentration arguments.
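Inverting the two-sided Hoeffding bound for the sample mean gives a sample-size rule, $n \ge (b-a)^2 \log(2/\delta) / (2t^2)$. A short sketch (function names are illustrative):

```python
import math

def hoeffding_bound(n, t, a=0.0, b=1.0):
    """Two-sided Hoeffding bound on P(|sample mean - mu| >= t) for X_i in [a, b]."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * t**2 / (b - a) ** 2))

def hoeffding_sample_size(t, delta, a=0.0, b=1.0):
    """Smallest n for which the bound above is at most delta."""
    return math.ceil((b - a) ** 2 * math.log(2.0 / delta) / (2.0 * t**2))

# Example: estimate a [0, 1]-valued mean to within 0.05 with probability 0.95.
n_needed = hoeffding_sample_size(t=0.05, delta=0.05)
print(n_needed, hoeffding_bound(n_needed, 0.05))
```

Note the $1/t^2$ dependence: halving the tolerance quadruples the required sample size, while tightening $\delta$ costs only logarithmically.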
4.4 Bernstein
Sharper than Hoeffding when variance is known and small. Gives sub-exponential tail.
4.5 Chernoff
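Generalizes via the moment generating function: apply Markov to $e^{tX}$ and optimize, $P(X \ge a) \le \inf_{t > 0} e^{-ta}\, \mathbb{E}[e^{tX}]$. This is the tightest bound obtainable from the MGF.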
4.6 When to reach for which
- "I know is bounded" → Hoeffding.
- "I know the variance" → Chebyshev (loose) / Bernstein (tight).
- "I want a numerical CI without a distributional assumption" → CLT for $\bar X_n$ (asymptotic; needs finite variance).
- "I have a rate / count" → Poisson / Chernoff.
4.7 Common gotcha
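Hoeffding for the sample mean has the 2 in the numerator of the exponent; some forms (stated for sums or in sub-Gaussian parameterization) have it in the denominator. Memorize the version with $-2nt^2/(b-a)^2$ in the exponent.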
5. Framework 4 — KL divergence
Already invoked in §2. Three uses:
5.1 As "distance" (asymmetric)
$\mathrm{KL}(P \,\|\, Q) = \mathbb{E}_{x \sim P}[\log (p(x)/q(x))] \ge 0$, with equality iff $P = Q$. Asymmetric, doesn't satisfy triangle inequality. Used as objective in distillation, alignment, regularization.
5.2 As Bayes-error exponent
The error rate of the optimal classifier decays at exponent $C(P, Q)$ (Chernoff), upper-bounded by KL.
5.3 As coding excess
If you encode $P$-distributed data with a $Q$-optimized code, expected excess bits per symbol = $\mathrm{KL}(P \,\|\, Q)$ (with the log taken base 2).
5.4 KL between two Gaussians
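For 1D: $\mathrm{KL}\left(\mathcal{N}(\mu_1, \sigma_1^2) \,\|\, \mathcal{N}(\mu_2, \sigma_2^2)\right) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$.
So distinguishing means at known common variance $\sigma^2$, where KL reduces to $(\mu_1 - \mu_2)^2 / (2\sigma^2)$, takes $O(\sigma^2 / (\mu_1 - \mu_2)^2)$ samples, a classical result.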
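The closed form for two univariate Gaussians is easy to sanity-check numerically. The sketch below (function name is illustrative) also shows the asymmetry mentioned in §5.1:

```python
import math

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) in nats."""
    return (
        math.log(sigma2 / sigma1)
        + (sigma1**2 + (mu1 - mu2) ** 2) / (2 * sigma2**2)
        - 0.5
    )

# Equal variances: KL = (mu1 - mu2)^2 / (2 sigma^2)
print(kl_gaussian(0.0, 1.0, 1.0, 1.0))
# Asymmetry: swapping the arguments changes the value
print(kl_gaussian(0.0, 1.0, 0.0, 2.0), kl_gaussian(0.0, 2.0, 0.0, 1.0))
```

With unit variance and a mean gap of 1 the divergence is 0.5 nats per sample, so a handful of samples already separates the two hypotheses; shrink the gap by 10× and the sample requirement grows by 100×.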
6. Framework 5 — Sequential decision making and bandits
For "design a strategy" questions.
6.1 Multi-armed bandit
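$K$ arms, unknown reward distributions, sequential pulls, regret = best-arm-reward minus chosen-arm-reward, summed over $T$ rounds.
- UCB: pick arm with highest $\hat\mu_a + \sqrt{2 \log t / n_a}$, where $n_a$ is the number of pulls of arm $a$ so far.
- Thompson sampling: sample $\tilde\mu_a$ from each arm's posterior, pull $\arg\max_a \tilde\mu_a$.
- $\epsilon$-greedy: simplest; with fixed $\epsilon$ it doesn't achieve $O(\log T)$ regret in general.
Optimal instance-dependent regret is $\Theta(\log T)$ (Lai-Robbins); the worst-case (minimax) rate is $\Theta(\sqrt{KT})$.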
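To make the UCB rule concrete, here is a minimal UCB1 simulation on Bernoulli arms. The arm means, horizon, and seed are hypothetical values chosen for illustration, and the exploration bonus $\sqrt{2\log t / n_a}$ is the standard UCB1 choice.

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; return the number of pulls per arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k   # pulls per arm
    sums = [0.0] * k   # total reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialize
        else:
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

pulls = ucb1([0.2, 0.5, 0.8], horizon=5000)
print(pulls)  # the arm with mean 0.8 should receive the large majority of pulls
```

Suboptimal arms are still pulled occasionally because their bonus grows with $\log t$, which is what yields logarithmic rather than linear regret.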
6.2 Best-arm identification
Different objective: minimize samples to confidently identify the best arm with prob $\ge 1 - \delta$. Different optimal algorithms (LUCB, Track-and-Stop).
6.3 Connections to RL
Bandit = stateless RL. Many ideas (exploration, regret) generalize.
7. Framework 6 — Importance sampling, rejection sampling
For "estimate this expectation under a hard distribution" questions.
7.1 Importance sampling
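To estimate $\mathbb{E}_p[f(X)]$ when $p$ is hard to sample but $q$ is easy:
$$\mathbb{E}_p[f(X)] = \mathbb{E}_q\left[f(X)\, \frac{p(X)}{q(X)}\right] \approx \frac{1}{n} \sum_{i=1}^{n} f(x_i)\, \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q$$
Variance of the estimator is small if $q$ is well-matched to $p$. Bad if $p$ has support where $q$ has near-zero density (tails of $p$ heavier than those of $q$), because the weights $p/q$ blow up there.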
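A classic use is rare-event estimation. The sketch below estimates $P(X > 3)$ for $X \sim \mathcal{N}(0, 1)$ by sampling from a proposal centered on the threshold, $q = \mathcal{N}(3, 1)$. The threshold, proposal, sample size, and seed are assumptions made for this example.

```python
import math
import random

def tail_prob_importance(threshold, n, seed=0):
    """Estimate P(X > threshold), X ~ N(0,1), with proposal q = N(threshold, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)  # draw from the proposal q
        if x > threshold:
            # weight p(x)/q(x) = exp(-x^2/2) / exp(-(x-c)^2/2) = exp(-c*x + c^2/2)
            total += math.exp(-threshold * x + 0.5 * threshold**2)
    return total / n

estimate = tail_prob_importance(3.0, n=20000)
exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))  # closed-form Gaussian tail
print(estimate, exact)
```

Naive Monte Carlo from $p$ would see only about 27 hits in 20,000 draws; the shifted proposal places roughly half its samples in the region of interest, which is why the weighted estimate is far less noisy.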
7.2 Rejection sampling
Sample $x$ from $q$, accept with probability $p(x) / (M q(x))$ where $M \ge \sup_x p(x)/q(x)$. Acceptance rate = $1/M$. Inefficient if $M$ is large.
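As a small worked instance, take the target $p = \mathrm{Beta}(2, 2)$ with density $6x(1-x)$ on $[0, 1]$ and a uniform proposal. The density peaks at $1.5$, so $M = 1.5$ and the expected acceptance rate is $1/M = 2/3$. The target and seed are illustrative choices.

```python
import random

def sample_beta22(n, seed=0):
    """Rejection-sample n draws from Beta(2,2) using a Uniform(0,1) proposal."""
    rng = random.Random(seed)
    M = 1.5  # sup of p(x)/q(x): density 6x(1-x) peaks at x = 0.5
    samples = []
    proposals = 0
    while len(samples) < n:
        x = rng.random()          # proposal draw, q(x) = 1 on [0, 1]
        proposals += 1
        if rng.random() < 6.0 * x * (1.0 - x) / M:  # accept w.p. p(x) / (M q(x))
            samples.append(x)
    return samples, n / proposals

draws, acceptance_rate = sample_beta22(20000)
print(acceptance_rate)            # close to 1 / M = 0.667
print(sum(draws) / len(draws))    # close to the Beta(2,2) mean of 0.5
```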
7.3 In RLHF / alignment
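PPO's surrogate objective reuses samples collected under the old policy by reweighting them. The ratio $\pi_\theta(a \mid s) / \pi_{\theta_{\mathrm{old}}}(a \mid s)$ is the importance weight, clipped in PPO to keep its variance under control.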
8. Framework 7 — Stein's paradox and shrinkage
A classic "intuitive" topic.
8.1 The result
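Estimate $d$ Gaussian means $\theta_1, \dots, \theta_d$ from one observation each, $X_i \sim \mathcal{N}(\theta_i, 1)$. The James-Stein estimator:
$$\hat\theta^{\mathrm{JS}} = \left(1 - \frac{d - 2}{\|X\|^2}\right) X$$
has strictly lower total MSE than the obvious $\hat\theta = X$, regardless of the truth, when $d \ge 3$.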
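The dominance result can be checked by simulation. The sketch below compares the average total squared error of the raw observation and of the plain (not positive-part) James-Stein estimator; the dimension, true means, trial count, and seed are illustrative assumptions.

```python
import random

def compare_risks(theta, trials=2000, seed=0):
    """Monte Carlo total squared error of X vs. James-Stein, X_i ~ N(theta_i, 1)."""
    rng = random.Random(seed)
    d = len(theta)
    mle_loss = 0.0
    js_loss = 0.0
    for _ in range(trials):
        x = [rng.gauss(t, 1.0) for t in theta]
        norm_sq = sum(v * v for v in x)
        shrink = 1.0 - (d - 2) / norm_sq  # James-Stein shrinkage factor
        mle_loss += sum((v - t) ** 2 for v, t in zip(x, theta))
        js_loss += sum((shrink * v - t) ** 2 for v, t in zip(x, theta))
    return mle_loss / trials, js_loss / trials

risk_mle, risk_js = compare_risks([1.0] * 10)
print(risk_mle)  # close to d = 10
print(risk_js)   # noticeably smaller
```

The gain is largest when $\|\theta\|$ is small relative to $\sqrt{d}$ and fades as the true means move far from the shrinkage target, but it never turns into a loss for $d \ge 3$.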
8.2 The intuition
The means need not be related. The gain comes from geometry: $\mathbb{E}\|X\|^2 = \|\theta\|^2 + d$, so in high dimensions the observation vector is, on average, farther from the origin than the truth. Shrinking toward the origin corrects this overshoot and lowers total squared error for every $\theta$.
8.3 Connection to ML
Regularization, weight decay, and Bayesian priors are all flavors of shrinkage. The bias-variance tradeoff lives here.
9. The DeepMind two-distribution question — fully worked
The question as it was asked in the interview:
"You have two arrays of numbers from two distributions. A new number comes. Describe how you determine from which distribution it came from."
This is the canonical two-class classification with empirical density estimation. A clean answer walks through:
9.1 Set up
- Data. Two arrays: $A$, sampled from distribution $p_A$, and $B$, sampled from $p_B$.
- Observation. New value $x$.
- Question. Decide which distribution $x$ came from.
9.2 Bayes formulation
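$$P(A \mid x) = \frac{\pi_A\, p_A(x)}{\pi_A\, p_A(x) + \pi_B\, p_B(x)}$$
where $\pi_A, \pi_B$ are priors (typically $\pi_A = |A| / (|A| + |B|)$ and $\pi_B = |B| / (|A| + |B|)$ if both arrays are samples in proportion to base rates).
Decision: classify as $A$ iff $P(A \mid x) > 1/2$ (under 0-1 loss with equal class weights), equivalently:
$$\frac{p_A(x)}{p_B(x)} > \frac{\pi_B}{\pi_A}$$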
9.3 Estimating $p_A$ and $p_B$
The interesting depth — this is where the interviewer probes.
Option 1: parametric. Assume both are Gaussian. Estimate $(\mu_A, \sigma_A^2)$ from $A$ via MLE; same for $B$. Plug into Gaussian density. Fast, low-variance, biased if assumption is wrong.
Option 2: non-parametric (KDE). Kernel density estimate $\hat p_A$ from $A$ and $\hat p_B$ from $B$. Bandwidth chosen via cross-validation or Silverman's rule. More flexible; needs more data.
Option 3: empirical CDF + smoothing. Compute empirical CDFs and use a smoothing kernel to estimate density. Variant of KDE.
Option 4: discriminative. Don't estimate $p_A$ and $p_B$ separately; train a classifier (logistic regression, neural net) directly on the labeled union of $A$ and $B$. Output the predicted probability for $x$. Often better than density estimation (Hastie/Tibshirani: discriminative > generative when modeling assumptions are wrong).
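Options 1 and 2 can be sketched in a few lines of standard-library Python. The helper names, the toy arrays, and the normal-reference bandwidth $1.06\,\hat\sigma\, n^{-1/5}$ (a common rule of thumb in the Silverman family) are assumptions for illustration, not a prescribed implementation.

```python
import math
import statistics

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_gaussian_density(data):
    """Option 1: plug the Gaussian MLE (mean, population std) into the density."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return lambda x: gaussian_pdf(x, mu, sigma)

def fit_kde_density(data):
    """Option 2: Gaussian-kernel KDE with a normal-reference bandwidth."""
    n = len(data)
    h = 1.06 * statistics.stdev(data) * n ** (-1 / 5)
    return lambda x: sum(gaussian_pdf(x, xi, h) for xi in data) / n

def posterior_a(x, a, b, fit=fit_gaussian_density):
    """P(A | x) with priors proportional to array sizes; None if no evidence."""
    p_a, p_b = fit(a), fit(b)
    pi_a = len(a) / (len(a) + len(b))
    num = pi_a * p_a(x)
    den = num + (1 - pi_a) * p_b(x)
    if den == 0.0:
        return None  # both densities underflow: treat as out-of-distribution
    return num / den

a = [-1.2, -0.4, 0.1, 0.3, 0.8, 1.1]  # toy sample from the first distribution
b = [3.9, 4.6, 5.0, 5.2, 5.7, 6.3]    # toy sample from the second distribution
print(posterior_a(0.5, a, b), posterior_a(0.5, a, b, fit=fit_kde_density))
print(posterior_a(1000.0, a, b))      # far from both arrays
```

Returning `None` when both estimated densities vanish is one simple way to surface the "I don't know" case instead of forcing a label.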
9.4 Diagnostics and follow-up answers
- Quantify confidence. $\log\left(\hat p_A(x) / \hat p_B(x)\right)$ measures evidence in nats. Convert to posterior probability via the Bayes formula above.
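- Sample complexity. How many samples do you need from each side? Roughly $O(1/\mathrm{KL})$ for distinguishability, plus enough for accurate density estimation (for 1D KDE with a well-chosen bandwidth, mean integrated squared error decays as $n^{-4/5}$). KL is between $p_A$ and $p_B$.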
- What if the new sample is in a region with no training data on either side? The likelihoods are both ~0 estimates; the answer is "I don't know" — and a robust system flags it as out-of-distribution. This is where you mention OOD detection (Mahalanobis distance, energy score, ensemble disagreement).
- What if the priors are unknown? You can still compute the likelihood ratio; the prior is a multiplicative factor in the threshold.
- What if A and B are huge but a new sample is one number? Fast lookup: nearest-neighbor density estimation in $O(\log n)$ per query with a sorted array (binary search).
- What if $p_A$ and $p_B$ have heavy overlap? Even the optimal Bayes classifier will have high error. Quantify via Bayes error rate, $\int \min\left(\pi_A\, p_A(x),\, \pi_B\, p_B(x)\right) dx$.
- What loss are you optimizing? 0-1 loss → MAP. Asymmetric (false-A worse than false-B) → shift threshold. Multi-class extension is straightforward.
- What if A and B are not independent of $x$ (covariate shift)? Doesn't apply if $x$ is just an independent draw; applies if there's structured dependence (time-series, locality).
9.5 The 90-second oral answer
This is binary classification: hypothesis that $x$ came from distribution $p_A$ (with array $A$ as samples) vs from $p_B$ (array $B$). Bayes-optimal under 0-1 loss is the likelihood ratio test: classify as $A$ if $p_A(x) / p_B(x) > \pi_B / \pi_A$, where the $\pi$'s are class priors estimated as $|A| / (|A| + |B|)$ and $|B| / (|A| + |B|)$.
The interesting part is estimating $p_A(x)$ and $p_B(x)$. Three approaches: parametric (assume Gaussian, fit MLE: fast, biased if wrong), non-parametric KDE (more flexible, needs more data, bandwidth via cross-validation), or discriminative (train a logistic regression / neural net on combined labeled data; often better than density estimation, per the discriminative-vs-generative literature).
I'd quantify confidence by $\log\left(\hat p_A(x) / \hat p_B(x)\right)$; flag out-of-distribution if both $\hat p_A(x)$ and $\hat p_B(x)$ are very low; and note that sample complexity for discriminability scales as $1 / \mathrm{KL}(p_A \,\|\, p_B)$, so if the two distributions are very close, you need a lot of samples to be confident regardless of method.
This answer, in 90 seconds, hits: framing, Bayes rule, prior, likelihood ratio, three estimation strategies, OOD flagging, sample complexity, and KL connection. That's a frontier-lab answer.
10. 25 worked frontier-lab questions
Brief but enough to seed the thought.
Q1. "Two arrays from two distributions, classify a new sample." (above, §9)
Q2. "How many coin flips to confirm a coin is biased toward heads?"
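Hypothesis test. $H_0: p = 1/2$ vs $H_1: p = 1/2 + \epsilon$. Use the binomial / normal approximation. For detecting a bias $\epsilon$ at level $\alpha$ with power $1 - \beta$:
$$n \approx \frac{(z_\alpha + z_\beta)^2}{4 \epsilon^2}$$
KL framing: $\mathrm{KL}\left(\mathrm{Bern}(1/2 + \epsilon) \,\|\, \mathrm{Bern}(1/2)\right) \approx 2\epsilon^2$, so $n \gtrsim 1/(2\epsilon^2)$ for distinguishability, an order-of-magnitude consistency check.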
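For a concrete feel, the sketch below computes the normal-approximation sample size for a one-sided test of a fair coin against a coin with heads probability $0.5 + \epsilon$, and compares it with the KL heuristic. The values $\epsilon = 0.05$, $\alpha = 0.05$, and power $0.8$ are hypothetical inputs chosen for the example.

```python
import math
from statistics import NormalDist

def flips_needed(eps, alpha=0.05, power=0.8):
    """One-sided normal-approximation sample size for p = 0.5 vs p = 0.5 + eps."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    z_beta = NormalDist().inv_cdf(power)
    p1 = 0.5 + eps
    # Standard deviation of one flip is 0.5 under H0 and sqrt(p1 (1 - p1)) under H1
    numerator = z_alpha * 0.5 + z_beta * math.sqrt(p1 * (1 - p1))
    return math.ceil((numerator / eps) ** 2)

def bernoulli_kl(p, q):
    """KL( Bern(p) || Bern(q) ) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

eps = 0.05
print(flips_needed(eps))                 # a few hundred flips
print(1.0 / bernoulli_kl(0.5 + eps, 0.5))  # KL heuristic, same order of magnitude
```

The two numbers differ by a constant factor (the $z$ quantiles play the role of $\log(1/\delta)$), but both scale as $1/\epsilon^2$.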
Q3. "Estimate the mean of a normal distribution given 3 samples. What's your confidence interval?"
Use $t$-distribution (small $n$): $\bar x \pm t_{n-1,\,1-\alpha/2}\; s / \sqrt{n}$. With $n = 3$, $t_{2,\,0.975} \approx 4.30$, so the CI is wide.
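Q4. "You sample $n$ i.i.d. from a distribution with bounded variance. How concentrated is the sample mean?"
Chebyshev: $P(|\bar X_n - \mu| \ge t) \le \sigma^2 / (n t^2)$. Or CLT for $\bar X_n \approx \mathcal{N}(\mu, \sigma^2 / n)$. Or Hoeffding if bounded support.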
Q5. "What's the variance of the sample variance for a Gaussian?"
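For the unbiased sample variance $S^2$ of $n$ Gaussian samples, $\mathrm{Var}(S^2) = 2\sigma^4 / (n - 1)$.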
Q6. "Why can't you just use empirical CDF for likelihood?"
Empirical CDF gives $\hat F_n(x) = \frac{1}{n} \sum_i \mathbf{1}\{x_i \le x\}$, but its density is $\frac{1}{n} \sum_i \delta(x - x_i)$, a sum of delta functions at observations. Useless for new points. Need smoothing.
Q7. "How would you test if a sample came from a normal distribution?"
Shapiro-Wilk, Anderson-Darling, Kolmogorov-Smirnov, Q-Q plots, Jarque-Bera.
Q8. "Two samples — same distribution test?"
Two-sample Kolmogorov-Smirnov. Or Mann-Whitney. Or $t$-test if assuming Gaussian. Or permutation test (most flexible).
Q9. "Estimate KL between two empirical distributions."
KDE both, then numerically integrate; or use the $k$-NN-based estimator (Pérez-Cruz); or train a discriminator and use the bound from the GAN/$f$-divergence literature.
Q10. "If KL between two distributions is $\epsilon$, how easy to discriminate?"
Sample complexity $O(1/\epsilon)$. Bayes error rate decays as $e^{-nC}$ where $C$ is the Chernoff information, $C \le \epsilon$.
Q11. "Coin flip game: I flip; you guess; I pay you 2× your bet if right; how much do you bet?"
Kelly criterion: $f^* = (bp - q) / b$ where $b$ is the net odds, $p$ is win prob, $q = 1 - p$. With $b = 2$, $p = 0.5$: $f^* = (1 - 0.5) / 2 = 0.25$, i.e. bet 25% of bankroll.
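The 25% figure can be verified directly: the Kelly fraction maximizes expected log growth. The sketch below reads "2× your bet" as net odds of 2-to-1 on a fair coin, which is an interpretation of the question rather than something it states outright.

```python
import math

def kelly_fraction(b, p):
    """Kelly bet fraction for net odds b and win probability p."""
    return (b * p - (1 - p)) / b

def expected_log_growth(f, b, p):
    """Expected log-wealth growth per round when betting fraction f."""
    return p * math.log(1 + b * f) + (1 - p) * math.log(1 - f)

f_star = kelly_fraction(b=2, p=0.5)
print(f_star)
# Growth at the Kelly fraction beats nearby fractions on either side
for f in (0.20, f_star, 0.30):
    print(f, expected_log_growth(f, b=2, p=0.5))
```

Betting more than $f^*$ lowers long-run growth even though the expected payoff per round keeps rising, which is the point interviewers usually want to hear.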
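Q12. "$X$ uniform on [0,1]. What's $\mathbb{E}[\max(X, 0.5)]$?"
Split: with prob 0.5, $X < 0.5$ and max $= 0.5$; otherwise max $= X$ with $X \sim U[0.5, 1]$, so $\mathbb{E} = 0.75$, with prob 0.5. Total: $0.5 \cdot 0.5 + 0.5 \cdot 0.75 = 0.625$.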
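Q13. "Two coins; one fair, one always heads. You pick one and flip 10 times, all heads. What's $P(\text{always-heads coin} \mid 10 \text{ heads})$?"
Bayes: $\frac{0.5 \cdot 1}{0.5 \cdot 1 + 0.5 \cdot 2^{-10}} = \frac{1024}{1025} \approx 0.999$. Almost certainly biased.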
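Q14. "$X_1, \dots, X_n$ i.i.d. uniform on $[0, 1]$. What's $\mathbb{E}[\max_i X_i]$?"
CDF of max: $F(x) = x^n$. PDF: $n x^{n-1}$. $\mathbb{E}[\max_i X_i] = \int_0^1 x \cdot n x^{n-1}\, dx = \frac{n}{n+1}$.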
Q15. "How many people for 50% birthday collision probability?"
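Birthday problem. $P(\text{no collision}) \approx e^{-n^2 / (2 \cdot 365)}$. Set $= 0.5$ → $n \approx \sqrt{2 \cdot 365 \cdot \ln 2} \approx 23$.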
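The exponential approximation can be checked against the exact product. A short sketch (function name is illustrative) that finds the smallest group size whose collision probability reaches the target:

```python
def min_people_for_collision(target=0.5, days=365):
    """Smallest n such that P(at least two of n people share a birthday) >= target."""
    p_no_collision = 1.0
    n = 0
    while 1.0 - p_no_collision < target:
        n += 1
        # The n-th person must avoid the n - 1 birthdays already taken
        p_no_collision *= (days - (n - 1)) / days
    return n

print(min_people_for_collision())
```

The answer grows like $\sqrt{\text{days}}$, not linearly, because the number of pairs grows quadratically in $n$.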
Q16. "Three doors, prize behind one, you pick door 1, host opens door 3 (no prize). Switch?"
Monty Hall. Yes — prob of prize behind door 2 is 2/3 (host's action carries information).
Q17. "Why does Monty Hall break if host acts randomly?"
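If the host opens one of the other two doors at random and it merely happens to show no prize, that event is equally likely whether your door or the remaining door hides the prize, so the posterior is 1/2 each and switching gives no advantage. The 2/3 result depends on the host knowingly avoiding the prize; always ask what procedure generated the observation.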
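Q18. "Estimate $\pi$ via random sampling."
Sample $(x, y) \sim U[0, 1]^2$. Fraction in unit quarter circle is $\pi / 4$. Multiply by 4. Variance $= 16 \cdot \frac{\pi}{4}\left(1 - \frac{\pi}{4}\right) / n = \pi (4 - \pi) / n \approx 2.7 / n$.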
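A minimal Monte Carlo version of this estimator, with the sample size and seed chosen arbitrarily for the example:

```python
import math
import random

def estimate_pi(n, seed=0):
    """Estimate pi as 4 * (fraction of uniform points inside the quarter circle)."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n

n = 100_000
estimate = estimate_pi(n)
std_error = math.sqrt(math.pi * (4 - math.pi) / n)  # theoretical standard error
print(estimate, std_error)
```

The standard error shrinks only as $1/\sqrt{n}$, so each extra decimal digit of accuracy costs roughly 100× more samples.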
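Q19. "$X \sim \mathrm{Exp}(\lambda)$. What's $P(X > s + t \mid X > s)$?"
Memorylessness: $P(X > s + t \mid X > s) = P(X > t) = e^{-\lambda t}$.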
Q20. "Sum of two i.i.d. exponentials — what distribution?"
Gamma(2, $\lambda$). Sum of $n$ i.i.d. Exp($\lambda$) is Gamma($n$, $\lambda$).
Q21. "Why is the median more robust than the mean?"
Median has 50% breakdown point; mean has 0%. One outlier moves the mean unboundedly, doesn't move the median.
Q22. "Detect a change-point in a stream of values from a known distribution."
CUSUM, GLR (generalized likelihood ratio), Bayesian online change-point detection. Sequential framework: at each time step, compute the likelihood ratio for "change at $\tau$" vs "no change", maximized over candidate change times $\tau$; if it exceeds a threshold, declare a change.
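As a sketch of the simplest of these, one-sided CUSUM for a Gaussian mean shift accumulates per-sample log-likelihood ratios and resets at zero. The pre- and post-change means, the threshold, and the synthetic stream are assumptions made for the example; CUSUM in this form requires the post-change mean to be specified.

```python
def cusum_alarm(stream, mu0, mu1, sigma, threshold):
    """Return the first index where the CUSUM statistic reaches threshold, else None."""
    s = 0.0
    for t, x in enumerate(stream):
        # Log-likelihood ratio of N(mu1, sigma^2) vs N(mu0, sigma^2) at x
        llr = (mu1 - mu0) / sigma**2 * (x - 0.5 * (mu0 + mu1))
        s = max(0.0, s + llr)  # resetting at 0 discards evidence against a change
        if s >= threshold:
            return t
    return None

# Synthetic stream: 50 pre-change values, then a shift (noise-free for clarity)
stream = [0.0] * 50 + [1.5] * 20
print(cusum_alarm(stream, mu0=0.0, mu1=1.0, sigma=1.0, threshold=5.0))
print(cusum_alarm([0.0] * 70, mu0=0.0, mu1=1.0, sigma=1.0, threshold=5.0))
```

The threshold trades detection delay against false-alarm rate: here each post-change sample adds 1.0 to the statistic, so the alarm fires five samples after the change.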
Q23. "Estimate the size of a population with unique IDs from a single sample."
German tank problem. If max observed = $m$ from $k$ samples: MLE estimate $\hat N = m$, but minimum-variance unbiased estimator $\hat N = m (1 + 1/k) - 1$.
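Both estimators are one-liners. The serial numbers below are made up for illustration, and the sketch assumes IDs run from 1 to $N$ and are sampled without replacement.

```python
def german_tank_estimates(serials):
    """Return (MLE, minimum-variance unbiased estimate) of the population size."""
    k = len(serials)
    m = max(serials)
    mle = m                        # the likelihood is maximized at the sample maximum
    umvue = m * (1 + 1 / k) - 1    # adds the average gap between observed serials
    return mle, umvue

print(german_tank_estimates([19, 40, 42, 60]))
```

The MLE can never overshoot the true size, so it is biased low; the correction term $m/k - 1$ is the average gap between sorted observations.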
Q24. "Two-sample mean test — but the variances differ."
Welch's $t$-test. Adjusted degrees of freedom. Non-parametric: Mann-Whitney.
Q25. "AB test: statistically significant at your chosen $\alpha$. Should you ship?"
Discuss: practical significance vs statistical, multiple testing, peeking, effect size, business cost of being wrong. Senior signal: don't take the p-value at face value.
11. Common follow-up probes
Frontier interviewers always probe one or two of these after your initial answer:
- "What if your prior is wrong?" → Bayesian sensitivity analysis. Posterior dominated by data when $n$ is large; dominated by prior when $n$ is small.
- "What's the variance of your estimator?" → Cramér-Rao, asymptotic variance via Fisher info.
- "What if the distributions overlap heavily?" → Bayes error floor; quantify via $\int \min\left(\pi_A\, p_A(x),\, \pi_B\, p_B(x)\right) dx$.
- "What's your sample complexity?" → Concentration inequality + KL/Chernoff.
- "What if you don't know the parametric family?" → Non-parametric (KDE, $k$-NN) or discriminative.
- "What's the asymmetric-loss version?" → Shift threshold; minimize expected loss.
- "How would this fail in production?" → distribution shift, OOD, label noise, data drift.
- "Compare your method with X." → Bias-variance tradeoff; sample efficiency.
- "What's the connection to information theory / KL / Fisher info?" → Reach for the unifying theorem.
- "Why are you confident in your estimator?" → CI, bootstrap, robustness.
12. Senior-level signals
- You start with the framing checklist. Don't jump to a formula.
- You name the framework (Bayes / MLE / Concentration / Bandit / Importance / Stein) explicitly.
- You quantify confidence: log-likelihood ratio, posterior, CI, sample complexity in $1/\mathrm{KL}$.
- You discuss assumptions and what fails when they're wrong.
- You name the connection to information theory (KL, Fisher, Chernoff).
- You think about OOD / failure modes, not just the happy path.
- You distinguish frequentist vs Bayesian when relevant.
- You mention the production-grade variant (online estimation, drift detection, hypothesis testing under multiple comparisons).
- You don't over-claim. "Optimal under 0-1 loss with these priors" — not "optimal."
- You can pivot from the analytical answer to a programmatic one if asked.
13. References
- Casella & Berger, Statistical Inference. The standard reference.
- Cover & Thomas, Elements of Information Theory. KL, Fisher, Chernoff.
- Wasserman, All of Statistics. Concise and broad.
- Bishop, Pattern Recognition and Machine Learning. Bayesian flavor.
- Hastie, Tibshirani, Friedman, Elements of Statistical Learning. Discriminative-vs-generative debate.
- Robert, The Bayesian Choice. Decision theory.
- Lattimore & Szepesvári, Bandit Algorithms. Sequential decision making.
- Berger, Statistical Decision Theory and Bayesian Analysis. Stein's paradox, shrinkage.
- Lehmann & Romano, Testing Statistical Hypotheses. Frequentist hypothesis testing.
How to use this chapter
- Read §1 (framing checklist) until automatic.
- Memorize the seven frameworks (§2-§8) at a level where you can name them and the canonical formula on demand.
- Drill §10 — 25 worked questions — until each has a 30-second answer.
- Memorize §9 (the DeepMind two-distribution question) verbatim as your "model answer" template.
- Pair with INTERVIEW_GRILL.md for active recall.
- Practice out loud; these are oral exams in real interviews.
Single sentence to remember: frame as Bayesian classification or MLE / decision / concentration, name the framework explicitly, quantify with KL or Fisher or Chernoff, discuss assumptions and OOD, and end with sample complexity.