LLM Evaluation — Interview Grill
115 active-recall questions. Pair with LLM_EVALUATION_DEEP_DIVE.md. Answer each out loud in under 60 seconds. Mark any you can't answer cleanly and re-read the relevant section.
Section A — Why LLM eval is hard (Q1–8)
- Why is evaluating an LLM harder than evaluating a binary classifier?
- Give three reasons reference-based metrics like BLEU and ROUGE fail for instruction following.
- What is "Goodhart's law" and how does it apply to LLM benchmarks?
- Why does prompt sensitivity matter for benchmark reporting?
- What does it mean that "capability ≠ helpfulness"? Give an example.
- Why does an LLM-judge become useless when the testee approaches the judge in capability?
- Describe the offline / online gap and why benchmarks alone don't predict product success.
- In terms of cost and latency, what makes LLM eval different from traditional ML eval?
Section B — Taxonomy (Q9–14)
- Distinguish capability eval, product eval, and safety eval.
- Distinguish reference-based, reference-free, pairwise, and programmatic eval. Give one example of each.
- What's the difference between offline eval and shadow / canary deployment?
- What does "verifiable instruction following" mean? Why is IFEval valuable?
- When would you use a closed-form (multiple choice) eval vs an open-ended eval?
- What's the difference between token-level, output-level, conversation-level, and session-level evaluation?
Section C — Capability benchmarks (Q15–28)
- What does MMLU measure? Why is MMLU-Pro the modern replacement?
- What is GPQA-Diamond? What does it measure that MMLU-Pro doesn't?
- Why are GSM8K and HumanEval saturated? What replaced them?
- How is SWE-Bench-Verified different from SWE-Bench? Why does verification matter?
- Why is LiveCodeBench important relative to HumanEval?
- What does RULER measure? Why is it more informative than vanilla NIAH?
- What is "Lost in the Middle"? How would you test for it?
- What is the difference between MMMU and MM-Vet?
- What does GAIA measure? What's special about its construction?
- Why is TAU-bench an interesting agent eval?
- What is the difference between TruthfulQA and SimpleQA?
- What does XSTest measure? Why is over-refusal eval important?
- Roughly, what's a defensible capability eval suite for a frontier model in 2026?
- Why might you weight HumanEval+ over HumanEval?
Section D — Instruction following and chat quality (Q29–34)
- What's the difference between IFEval and MT-Bench?
- What does AlpacaEval 2 length-controlled correct for? Why is it necessary?
- Why does multi-turn evaluation reveal different weaknesses than single-turn?
- How do you test persona / system-prompt adherence?
- What's Arena-Hard-Auto and how does it relate to Chatbot Arena?
- Give three programmatic checks you would always include in a chat eval.
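For the programmatic-checks question above, a minimal sketch of what such checks can look like in code. The specific checks, the function names, and the 400-word limit are illustrative assumptions, not a list taken from the deep dive.

```python
import json

def check_non_empty(response: str) -> bool:
    """The response contains something other than whitespace."""
    return bool(response.strip())

def check_max_words(response: str, limit: int = 400) -> bool:
    """The response stays within a word budget (limit is an assumed value)."""
    return len(response.split()) <= limit

def check_valid_json(response: str) -> bool:
    """The response parses as JSON; only relevant when JSON was requested."""
    try:
        json.loads(response)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False

def check_no_prompt_leak(response: str, system_prompt: str) -> bool:
    """The system prompt does not appear verbatim in the response."""
    needle = system_prompt.strip().lower()
    if not needle:
        return True
    return needle not in response.lower()

def run_checks(response: str, system_prompt: str, expect_json: bool = False) -> dict:
    """Run every applicable check and return a name -> pass/fail mapping."""
    results = {
        "non_empty": check_non_empty(response),
        "max_words": check_max_words(response),
        "no_prompt_leak": check_no_prompt_leak(response, system_prompt),
    }
    if expect_json:
        results["valid_json"] = check_valid_json(response)
    return results
```

Checks like these are cheap and deterministic, so they can run on every response before any judge model is involved.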
Section E — LLM-as-judge (Q35–46)
- What is LLM-as-judge? Why does it work at all?
- List five biases of LLM judges.
- How do you mitigate position bias in pairwise comparison?
- How do you mitigate length bias?
- How do you mitigate self-preference / family bias?
- Walk me through how you'd calibrate an LLM judge.
- What is a multi-judge ensemble and why use it?
- What are Prometheus, G-Eval, and PandaLM, and how do they differ from "ask GPT-4"?
- When does an LLM judge stop working?
- What's the typical structured output format for a pairwise judge?
- Why might you strip formatting (markdown, headers) before judging?
- Suppose your judge agreement with humans is κ=0.45 — what do you do?
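Two of the questions above (position bias and judge-human agreement) are easier to answer with a concrete picture. The sketch below shows one common mitigation, judging both orderings and keeping only consistent verdicts, plus a plain Cohen's kappa. The `judge` callable is a hypothetical stand-in for whatever judge model you call, assumed to return "first", "second", or "tie".

```python
from collections import Counter

def swap_consistent_verdict(judge, prompt, resp_a, resp_b):
    """Ask the judge in both orders; call it a tie unless both orders agree."""
    forward = judge(prompt, resp_a, resp_b)    # A shown first
    backward = judge(prompt, resp_b, resp_a)   # B shown first
    # Map the swapped verdict back into the forward frame of reference.
    backward_in_forward = {"first": "second", "second": "first", "tie": "tie"}[backward]
    verdict = forward if forward == backward_in_forward else "tie"
    return {"first": "A", "second": "B", "tie": "tie"}[verdict]

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A judge that always picks whichever response it sees first yields "tie" on every pair under this scheme, which makes the bias visible instead of letting it inflate one model's win rate.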
Section F — Pairwise and Elo (Q47–53)
- Why is pairwise more reliable than absolute scoring for open-ended quality?
- Sketch the Elo update formula.
- How is Elo computed from pairwise comparisons in practice (Bradley-Terry)?
- What does Chatbot Arena measure? What are its limitations?
- Why does Arena-Hard-Auto correlate so well with Arena Elo at <1% the cost?
- To distinguish 50% from 55% pairwise win-rate at 95% confidence, roughly how many comparisons do you need?
- Sketch a Bradley-Terry MLE in pseudo-code.
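For self-checking the three "sketch" questions in this section, here is a minimal reference in Python: the online Elo update, a Bradley-Terry fit using the standard minorization-maximization iteration, and a normal-approximation sample-size estimate. The function names, K=32, and the iteration count are assumptions for illustration. The Bradley-Terry fit assumes every model has at least one win.

```python
import math

def elo_update(r_a, r_b, score_a, k=32.0):
    """One Elo update. score_a is 1 for an A win, 0 for a loss, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a win matrix.

    wins[i][j] is the number of times model i beat model j.
    Returns strengths normalized to sum to 1, so that
    P(i beats j) = p[i] / (p[i] + p[j]).
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        p = [x / norm for x in new_p]
    return p

def comparisons_needed(p_null, p_alt, z_alpha=1.96, z_power=0.0):
    """Normal-approximation sample size for telling p_alt apart from p_null.

    z_power=0.0 only asks that the 95% interval half-width equal the gap;
    pass about 0.84 to also demand 80% power.
    """
    numerator = (
        z_alpha * math.sqrt(p_null * (1 - p_null))
        + z_power * math.sqrt(p_alt * (1 - p_alt))
    )
    return math.ceil((numerator / (p_alt - p_null)) ** 2)
```

With these assumptions, `comparisons_needed(0.5, 0.55)` lands near 385, and adding 80% power (`z_power=0.8416`) roughly doubles it to about 780. A strength ratio converts to an Elo-scale gap via `400 * math.log10(p[i] / p[j])`.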
Section G — Open-ended generation eval (Q54–57)
- Why don't BLEU and ROUGE work for instruction following?
- When does BERTScore / COMET make sense?
- What rubric would you use for an LLM judge scoring open-ended responses?
- How do you measure diversity vs quality for creative tasks?
Section H — Factuality (Q58–66)
- What is the difference between TruthfulQA, SimpleQA, FactScore, and LongFact?
- Walk through SAFE.
- What does RAGAS measure? List the four metrics.
- Distinguish citation existence from citation faithfulness.
- Why is calibration a factuality proxy?
- What is Expected Calibration Error?
- Why does RLHF often hurt calibration?
- What's FACTS Grounding?
- How would you eval the factuality of a long-form answer (no single ground truth)?
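The calibration questions above are easier to answer if you have computed Expected Calibration Error once by hand. A minimal sketch with equal-width confidence bins follows; the 10-bin default and the function name are assumptions, and other binning schemes exist.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence| per bin.

    confidences: model-stated probabilities in [0, 1], one per answer.
    correct: booleans, True where the answer was right.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two equal-length, non-empty sequences")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that confidence 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - mean_conf)
    return ece
```

A model that says 0.9 on ten answers and gets nine right scores close to zero; the same model getting only five right scores about 0.4.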
Section I — Contamination (Q67–73)
- What is benchmark contamination?
- List four ways contamination can happen.
- What is Min-K%-prob? How does it detect membership in training data?
- How do you build a contamination-resistant eval going forward?
- Why do "perturbation tests" detect memorization?
- What is a canary string and how is it used?
- What does it mean to "decontaminate" a benchmark?
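For the Min-K%-prob question above, the scoring step itself is only a few lines once you have per-token log-probabilities. The sketch below assumes you already obtained those from the model under test (how to get them depends on your inference stack and is out of scope here); the function name and the 20% default are illustrative.

```python
def min_k_percent_prob(token_logprobs, k=0.2):
    """Average log-probability of the k fraction of least likely tokens.

    A higher (less negative) score means even the hardest tokens were
    predicted confidently, which is weak evidence the text was seen in
    training. The decision threshold has to be tuned on known member and
    non-member texts.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n_keep = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n_keep]
    return sum(lowest) / n_keep
```

Comparing the score on an original benchmark item against a paraphrased version of the same item is one way to connect this to the perturbation-test question.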
Section J — Robustness and statistics (Q74–82)
- How do you measure prompt sensitivity?
- Why does few-shot ordering affect benchmark scores?
- What is BBQ? What does it measure?
- Approximately what is the 95% CI half-width for accuracy with n=200 and p=0.5?
- What's pass@k? When does it matter?
- Multiple-comparisons problem: if you eval on 20 benchmarks at α=0.05, how many false positives do you expect by chance?
- Why is reporting CIs alongside benchmark numbers important?
- If you sample 5 responses per prompt, what's the unit of analysis?
- Bootstrap CI vs Wilson interval — when would you use each?
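The arithmetic questions in this section can be checked with a few short functions. This is a sketch using the normal approximation, the Wilson score interval, the combinatorial pass@k estimator, and the simplest multiple-comparisons arithmetic; function names are assumptions, and the family-wise rate assumes the 20 tests are independent with every null hypothesis true.

```python
import math

def normal_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation CI for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval; better behaved than the normal CI near 0 or 1."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def pass_at_k(n, c, k):
    """Probability that at least one of k samples passes,
    estimated from n samples of which c passed."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k draws include a pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def expected_false_positives(num_tests, alpha):
    """Expected number of spurious 'significant' results if every null is true."""
    return num_tests * alpha

def family_wise_error_rate(num_tests, alpha):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests
```

For n=200 and p=0.5 the half-width comes out near 0.069, so two models within about 7 points of each other on a 200-item benchmark are hard to separate. Twenty benchmarks at α=0.05 give one expected false positive and roughly a 64% chance of at least one.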
Section K — Harnesses (Q83–86)
- What does lm-eval-harness do? Why is it the academic default?
- What's HELM and what makes it different from lm-eval-harness?
- What's Inspect (UK AISI) and when do you use it?
- Compare RAGAS, TruLens, DeepEval for RAG eval.
Section L — Online eval and A/B (Q87–95)
- What surrogate quality metrics would you log for a chat product?
- What does "regenerate rate" tell you?
- How do you sample production traffic for online eval?
- How do you size an A/B test for a chat product (binary success metric, p≈0.3, lift δ=2%)?
- What is CUPED? Why does it matter for LLM A/B tests?
- Why does latency matter as a guardrail in LLM A/B?
- Why is "selection bias from refusals" a concern?
- Walk through offline → shadow → canary → A/B for an LLM product.
- What's sequential testing (mSPRT)? When would you use it?
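The A/B sizing and CUPED questions above both reduce to short formulas. The sketch below uses the standard two-proportion sample-size approximation and the basic CUPED adjustment. It reads "lift δ=2%" as an absolute 2-point lift (0.30 to 0.32), which is an assumption; a relative 2% lift would need far more traffic. Two-sided α=0.05, 80% power, and the function names are also assumptions.

```python
import math
from statistics import NormalDist, mean

def samples_per_arm(p_base, abs_lift, alpha=0.05, power=0.8):
    """Users needed in each arm to detect an absolute lift in a binary metric."""
    p_new = p_base + abs_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_base + p_new) / 2
    term_null = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    term_alt = z_beta * math.sqrt(p_base * (1 - p_base) + p_new * (1 - p_new))
    return math.ceil(((term_null + term_alt) / abs_lift) ** 2)

def cuped_adjust(metric, covariate):
    """CUPED: subtract the part of the metric explained by a pre-experiment covariate.

    metric: in-experiment values per user.
    covariate: the same users' pre-experiment values.
    The adjusted values keep the same mean but have lower variance whenever
    the covariate correlates with the metric.
    """
    mean_x, mean_y = mean(covariate), mean(metric)
    var_x = sum((x - mean_x) ** 2 for x in covariate)
    if var_x == 0:
        return list(metric)  # constant covariate carries no information
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric))
    theta = cov_xy / var_x
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]
```

Under these assumptions `samples_per_arm(0.3, 0.02)` comes out a little above 8,000 users per arm. In a real experiment the CUPED coefficient is usually estimated on data pooled across arms so both arms receive the same adjustment.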
Section M — Product eval design (Q96–100)
- Walk me through designing the eval for a customer-support chatbot. Use the four-layer pattern.
- How do you build a 500-prompt golden set for a chatbot?
- How often do you refresh the golden set? Why?
- What does it mean to "calibrate the LLM judge to humans" for a product? Walk through.
- List five failure modes a good eval suite catches.
Quick fire (Q101–115)
- One line: what does IFEval measure?
- One line: what does RULER measure?
- One line: what does FactScore measure?
- One line: what is pass@k?
- One line: what is length-controlled win rate?
- One line: SimpleQA vs TruthfulQA.
- One line: SAFE.
- One line: Min-K%-prob.
- One line: Elo.
- One line: CUPED.
- One line: Lost in the Middle.
- One line: HELM.
- One line: Inspect framework.
- One line: Arena-Hard-Auto.
- One line: Bradley-Terry.
Self-grading
- 90+ correct: ready for frontier-lab eval rounds.
- 70–89: re-read §5 (judges), §11 (stats), §15 (case study).
- 50–69: re-read full deep dive then redo.
- <50: spend two days on the deep dive, then come back.
5-day drill plan
- Day 1: §1–4 (why hard, taxonomy, knowledge benchmarks). Drill A, B, C.
- Day 2: §5–7 (LLM judge, pairwise, open-ended). Drill E, F, G.
- Day 3: §8–9 (factuality, contamination). Drill H, I.
- Day 4: §11 + §13–14 (stats, online, A/B). Drill J, L.
- Day 5: §15 case study + §16 senior signals + Quick fire. Whiteboard a product eval suite end-to-end.