Mock Research Interview Questions

Use these as spoken-practice prompts.

Probability and Statistics

1. Two Arrays, One New Value

You have two arrays, each sampled from a different distribution. A new scalar value arrives. How do you determine which distribution it most likely came from?

Strong answer outline:

assume a distributional family and estimate its parameters from each array
compute the likelihood p(x | class) for each class
multiply by class priors if the classes are not equally likely
choose the class with the larger posterior score
mention KDE or nearest-neighbor density estimation if parametric assumptions are weak
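
The outline above can be sketched in a few lines. This is a minimal illustration, assuming each array is roughly Gaussian; the function names (gaussian_logpdf, classify) and the prior_a parameter are invented here for practice and do not come from any library.

```python
import math
from statistics import mean, stdev


def gaussian_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)


def classify(x, a, b, prior_a=0.5):
    """Return (label, posterior of class A) for a scalar x.

    Assumption: both arrays are modeled as Gaussians whose mean and
    standard deviation are estimated from the samples.
    """
    score_a = gaussian_logpdf(x, mean(a), stdev(a)) + math.log(prior_a)
    score_b = gaussian_logpdf(x, mean(b), stdev(b)) + math.log(1 - prior_a)
    diff = score_b - score_a
    # Posterior of A is a logistic function of the log-odds; guard against overflow.
    post_a = 1.0 / (1.0 + math.exp(diff)) if diff < 700 else 0.0
    label = "A" if score_a >= score_b else "B"
    return label, post_a


a = [0.1, -0.2, 0.3, 0.0, -0.1]
b = [4.9, 5.2, 5.0, 4.8, 5.1]
print(classify(0.2, a, b))
print(classify(5.0, a, b))
```

Working in log space avoids underflow when x is far from both distributions. If the Gaussian assumption is doubtful, the two gaussian_logpdf calls are the part to swap for a KDE or nearest-neighbor density estimate.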

2. Same Mean, Different Variance

If two Gaussian distributions have the same mean but different variance, can a single point still be classified?

What to discuss:

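yes, by comparing the two densities at the point
values near the shared mean favor the lower-variance distribution, which has the taller peak (assuming comparable priors)
values far from the mean favor the higher-variance distribution, which puts more mass in the tails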
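
To make the discussion points concrete, the crossover distance can be worked out by setting the two log densities equal. With equal priors and standard deviations s1 < s2, the densities match where (x - mean)^2 = 2 * s1^2 * s2^2 * ln(s2 / s1) / (s2^2 - s1^2). The sketch below assumes equal priors; the function names are illustrative.

```python
import math


def crossover_distance(sigma_narrow, sigma_wide):
    """Distance |x - mu| at which two same-mean Gaussians have equal density.

    Assumption: equal class priors and 0 < sigma_narrow < sigma_wide.
    """
    if not 0 < sigma_narrow < sigma_wide:
        raise ValueError("need 0 < sigma_narrow < sigma_wide")
    num = 2 * sigma_narrow**2 * sigma_wide**2 * math.log(sigma_wide / sigma_narrow)
    return math.sqrt(num / (sigma_wide**2 - sigma_narrow**2))


def classify_same_mean(x, mu, sigma_narrow, sigma_wide):
    """Pick the narrow class inside the crossover band, the wide class outside it."""
    t = crossover_distance(sigma_narrow, sigma_wide)
    return "narrow" if abs(x - mu) < t else "wide"


print(crossover_distance(1.0, 2.0))
print(classify_same_mean(0.5, 0.0, 1.0, 2.0))
print(classify_same_mean(3.0, 0.0, 1.0, 2.0))
```

The decision region for the narrow class is a band around the mean rather than a half-line, which is the point worth saying out loud in an interview.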

3. Overlapping Distributions

If the two class densities overlap heavily, what should you report besides the predicted class?

What to discuss:

posterior probability or confidence for the prediction
expected error rate given the overlap (the Bayes error)
which range of values is ambiguous
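
A small numerical sketch of these quantities, assuming both class densities are Gaussians with known parameters. The helper names, the integration range, and the step count are choices made here for illustration only.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))


def posterior_a(x, params_a, params_b, prior_a=0.5):
    """P(class A | x) by Bayes' rule; params are (mu, sigma) tuples."""
    joint_a = prior_a * normal_pdf(x, *params_a)
    joint_b = (1 - prior_a) * normal_pdf(x, *params_b)
    return joint_a / (joint_a + joint_b)


def bayes_error(params_a, params_b, prior_a=0.5, lo=-20.0, hi=20.0, steps=40000):
    """Midpoint-rule estimate of the integral of min(joint_a, joint_b).

    Assumption: [lo, hi] covers essentially all of the mass of both classes.
    """
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        joint_a = prior_a * normal_pdf(x, *params_a)
        joint_b = (1 - prior_a) * normal_pdf(x, *params_b)
        total += min(joint_a, joint_b) * dx
    return total


print(posterior_a(0.5, (0.0, 1.0), (1.0, 1.0)))
print(bayes_error((0.0, 1.0), (1.0, 1.0)))
```

A posterior near 0.5 marks the ambiguous region, and the Bayes error is the floor on misclassification that no classifier can beat for these two densities.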

Experiment Judgment

4. One Metric Improved, Another Got Worse

Your model improves perplexity but hurts downstream exact match. What are your first hypotheses?

5. Better Retriever, Worse QA

Your retrieval recall improved but answer quality declined. Explain how that can happen and how you would debug it.

6. One Seed Works

A proposed method beats the baseline on only one seed. What is the correct scientific conclusion?
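
One way to ground an answer is to show what a multi-seed comparison would report. The sketch below is illustrative only: the scores are made-up placeholder numbers, and paired_seed_summary is a hypothetical helper, not part of any library.

```python
import math
from statistics import mean, stdev


def paired_seed_summary(method_scores, baseline_scores):
    """Summarize per-seed differences (method minus baseline).

    Assumption: scores are paired by seed and higher is better.
    """
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two seeds are needed to estimate variance")
    return {
        "mean_diff": mean(diffs),
        "std_err": stdev(diffs) / math.sqrt(n),
        "wins": sum(d > 0 for d in diffs),
        "n": n,
    }


# Placeholder scores for five seeds (not real results).
method = [0.71, 0.68, 0.69, 0.67, 0.70]
baseline = [0.69, 0.69, 0.70, 0.68, 0.69]
print(paired_seed_summary(method, baseline))
```

If the mean difference is small relative to its standard error, the honest conclusion is that the evidence does not yet distinguish the method from the baseline.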

Paper Discussion

7. Summarize a Paper in 5 Minutes

Use this structure:

problem
method
why it might work
main assumptions
missing ablations
likely failure modes

8. Strong Benchmark, Weak Evidence

What kinds of evidence are missing if a paper reports only one benchmark number?

What to discuss:

variance across seeds
slice metrics
compute/data controls
ablations
robustness checks

LLM-Specific

9. Why Did the Model Hallucinate?

Give a stage-by-stage diagnosis framework.

What to discuss:

retrieval miss
context truncation
poor ranking
model ignoring evidence
unsupported generation
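
The framework can be practiced as a sequence of checks on a single failed example. This is a sketch under stated assumptions: the Trace fields are hypothetical signals you would need to log yourself, and the checks run in a typical pipeline order (retrieve, rank, build prompt, generate), so ranking is tested before truncation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    """Hypothetical per-example signals from a retrieval-augmented pipeline."""
    gold_retrieved: bool           # gold passage appears among retrieved candidates
    gold_rank: Optional[int]       # 1-based rank of the gold passage, if retrieved
    gold_in_prompt: bool           # gold passage survived prompt truncation
    answer_uses_evidence: bool     # answer agrees with the gold passage
    has_unsupported_claims: bool   # answer adds claims not backed by the context


def diagnose(trace, top_k=5):
    """Return the earliest pipeline stage that explains the failure."""
    if not trace.gold_retrieved:
        return "retrieval miss"
    if trace.gold_rank is None or trace.gold_rank > top_k:
        return "poor ranking"
    if not trace.gold_in_prompt:
        return "context truncation"
    if not trace.answer_uses_evidence:
        return "model ignoring evidence"
    if trace.has_unsupported_claims:
        return "unsupported generation"
    return "no failure found at these stages"


print(diagnose(Trace(True, 12, False, False, True)))
```

Returning the earliest failing stage keeps the diagnosis actionable: a fix upstream can make the downstream symptoms disappear without further work.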

10. Why Did Preference Tuning Hurt Factuality?

What to discuss:

reward misspecification
preference data not aligned with truthfulness
style improvements masking factual regressions
evaluation mismatch
