LLM Evaluation — Interview Grill

115 active-recall questions. Pair with LLM_EVALUATION_DEEP_DIVE.md. Answer each out loud in under 60 seconds. Mark any you can't answer cleanly and re-read the relevant section.

Section A — Why LLM eval is hard (Q1–8)

Why is evaluating an LLM harder than evaluating a binary classifier?
Give three reasons reference-based metrics like BLEU and ROUGE fail for instruction following.
What is "Goodhart's law" and how does it apply to LLM benchmarks?
Why does prompt sensitivity matter for benchmark reporting?
What does it mean that "capability ≠ helpfulness"? Give an example.
Why does an LLM judge become useless when the model under test approaches the judge in capability?
Describe the offline / online gap and why benchmarks alone don't predict product success.
In terms of cost and latency, what makes LLM eval different from traditional ML eval?

Section B — Taxonomy (Q9–14)

Distinguish capability eval, product eval, and safety eval.
Distinguish reference-based, reference-free, pairwise, and programmatic eval. Give one example of each.
What's the difference between offline eval and shadow / canary deployment?
What does "verifiable instruction following" mean? Why is IFEval valuable?
When would you use a closed-form (multiple choice) eval vs an open-ended eval?
What's the difference between token-level, output-level, conversation-level, and session-level evaluation?

Section C — Capability benchmarks (Q15–28)

What does MMLU measure? Why is MMLU-Pro the modern replacement?
What is GPQA-Diamond? What does it measure that MMLU-Pro doesn't?
Why are GSM8K and HumanEval saturated? What replaced them?
How is SWE-Bench-Verified different from SWE-Bench? Why does verification matter?
Why is LiveCodeBench important relative to HumanEval?
What does RULER measure? Why is it more informative than vanilla NIAH?
What is "Lost in the Middle"? How would you test for it?
What is the difference between MMMU and MM-Vet?
What does GAIA measure? What's special about its construction?
Why is TAU-bench an interesting agent eval?
What is the difference between TruthfulQA and SimpleQA?
What does XSTest measure? Why is over-refusal eval important?
Roughly, what's a defensible capability eval suite for a frontier model in 2026?
Why might you weight HumanEval+ over HumanEval?

Section D — Instruction following and chat quality (Q29–34)

What's the difference between IFEval and MT-Bench?
What does AlpacaEval 2 length-controlled correct for? Why is it necessary?
Why does multi-turn evaluation reveal different weaknesses than single-turn?
How do you test persona / system-prompt adherence?
What's Arena-Hard-Auto and how does it relate to Chatbot Arena?
Give three programmatic checks you would always include in a chat eval.
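
One possible shape for the programmatic checks asked about above, as a minimal sketch. The three checks chosen here (valid JSON, a word limit, no leaked system-prompt text) and every function name are illustrative assumptions, not a list taken from the deep dive.

```python
import json


def check_valid_json(response: str) -> bool:
    """True if the whole response parses as JSON (for prompts that demand JSON)."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def check_max_words(response: str, limit: int) -> bool:
    """True if the response respects a whitespace-tokenized word limit."""
    return len(response.split()) <= limit


def check_no_leak(response: str, secret_phrases: list[str]) -> bool:
    """True if none of the hypothetical secret phrases appear (case-insensitive)."""
    lowered = response.lower()
    return not any(phrase.lower() in lowered for phrase in secret_phrases)
```

Checks like these are cheap and deterministic, so they can run on every response before any judge call is made.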

Section E — LLM-as-judge (Q35–46)

What is LLM-as-judge? Why does it work at all?
List five biases of LLM judges.
How do you mitigate position bias in pairwise comparison?
How do you mitigate length bias?
How do you mitigate self-preference / family bias?
Walk me through how you'd calibrate an LLM judge.
What is a multi-judge ensemble and why use it?
What is Prometheus / G-Eval / PandaLM and how do they differ from "ask GPT-4"?
When does an LLM judge stop working?
What's the typical structured output format for a pairwise judge?
Why might you strip formatting (markdown, headers) before judging?
Suppose your judge agreement with humans is κ=0.45 — what do you do?
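
For the κ question above, it helps to know how the number is produced. A minimal sketch of Cohen's kappa between judge verdicts and human verdicts on the same items; the label lists in the usage note are made-up data.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between two equal-length label sequences."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    # Observed agreement: fraction of items where both raters give the same label.
    p_o = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    p_e = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1 - p_e)
```

With judge = A A B B A B A A and human = A B B B A A A A, raw agreement is 75% but κ is about 0.47, because both raters pick A most of the time and much of the agreement is expected by chance.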

Section F — Pairwise and ELO (Q47–53)

Why is pairwise more reliable than absolute scoring for open-ended quality?
Sketch the Elo update formula.
How is Elo computed from pairwise comparisons in practice (Bradley-Terry)?
What does Chatbot Arena measure? What are its limitations?
Why does Arena-Hard-Auto correlate so well with Arena Elo at <1% the cost?
To distinguish 50% from 55% pairwise win-rate at 95% confidence, roughly how many comparisons?
Sketch a Bradley-Terry MLE in pseudo-code.
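
A reference sketch for the Elo and Bradley-Terry questions above, to check your own answer against. It assumes a win-count matrix with no ties, uses the standard minorization-maximization (MM) iteration for the Bradley-Terry MLE, and uses K=32 and a base rating of 1000 as arbitrary illustrative constants.

```python
import math


def elo_update(r_a, r_b, score_a, k=32.0):
    """One online Elo update. score_a is 1.0 (A wins), 0.5 (tie) or 0.0 (A loses)."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta


def bradley_terry(wins, n_iter=200):
    """MM iterations for Bradley-Terry strengths.

    wins[i][j] = number of times model i beat model j.
    Assumes every model has at least one win; otherwise its strength goes to 0.
    """
    m = len(wins)
    p = [1.0] * m
    for _ in range(n_iter):
        new_p = []
        for i in range(m):
            total_wins = sum(wins[i][j] for j in range(m) if j != i)
            # Sum over opponents actually played: games_ij / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(m)
                if j != i and wins[i][j] + wins[j][i] > 0
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        scale = m / sum(new_p)  # strengths are only identified up to a constant
        p = [x * scale for x in new_p]
    return p


def to_elo_scale(strengths, base=1000.0):
    """Map positive Bradley-Terry strengths onto an Elo-like scale (400 * log10)."""
    return [base + 400.0 * math.log10(s) for s in strengths]
```

Online Elo depends on the order in which games arrive, while the Bradley-Terry fit treats all comparisons as one batch, which is why leaderboards built from a fixed pool of votes tend to fit Bradley-Terry and report the result on an Elo-like scale.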

Section G — Open-ended generation eval (Q54–57)

Why don't BLEU and ROUGE work for instruction following?
When does BERTScore / COMET make sense?
What rubric would you use for an LLM judge scoring open-ended responses?
How do you measure diversity vs quality for creative tasks?

Section H — Factuality (Q58–66)

What is the difference between TruthfulQA, SimpleQA, FactScore, and LongFact?
Walk through SAFE.
What does RAGAS measure? List the four metrics.
Distinguish citation existence from citation faithfulness.
Why is calibration a factuality proxy?
What is Expected Calibration Error?
Why does RLHF often hurt calibration?
What's FACTS Grounding?
How would you eval the factuality of a long-form answer (no single ground truth)?
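
For the Expected Calibration Error question above, a minimal sketch of the usual equal-width-bin estimator. The bin count of 10 is a common default, not something the deep dive prescribes, and the inputs are assumed to be per-answer confidences in [0, 1] with matching correctness flags.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted average of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece
```

Four answers all stated at 0.9 confidence with three correct give an ECE of 0.15: the model claims 90% and delivers 75%.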

Section I — Contamination (Q67–73)

What is benchmark contamination?
List four ways contamination can happen.
What is Min-K%-prob? How does it detect membership in training data?
How do you build a contamination-resistant eval going forward?
Why do "perturbation tests" detect memorization?
What is a canary string and how is it used?
What does it mean to "decontaminate" a benchmark?
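
For the Min-K%-prob question above, the scoring step is small enough to sketch. This assumes you already have per-token log-probabilities of the candidate text under the model being audited; obtaining them, and picking the decision threshold, are outside this sketch, and k=0.2 is only an illustrative setting.

```python
def min_k_percent_prob(token_logprobs, k=0.2):
    """Average log-probability of the k fraction of least likely tokens.

    Higher (closer to 0) scores suggest the text was seen in training,
    because memorized text has few surprising tokens.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n_keep = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n_keep]
    return sum(lowest) / n_keep
```

The score is only meaningful relative to a threshold calibrated on text known to be inside and outside the training set.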

Section J — Robustness and statistics (Q74–82)

How do you measure prompt sensitivity?
Why does few-shot ordering affect benchmark scores?
What is BBQ? What does it measure?
Approximately what is the 95% CI half-width for accuracy at n=200, p=0.5?
What's pass@k? When does it matter?
Multiple-comparisons problem: if you eval on 20 benchmarks at α=0.05, how many false positives do you expect by chance?
Why is reporting CIs alongside benchmark numbers important?
If you sample 5 responses per prompt, what's the unit of analysis?
Bootstrap CI vs Wilson interval — when would you use each?
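
Several questions in this section reduce to short formulas. The sketch below collects them for checking answers: the normal-approximation half-width, the Wilson interval, the commonly used unbiased pass@k estimator, and the family-wise error rate under an independence assumption. Function names are illustrative.

```python
import math


def normal_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation CI for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval; better behaved than the normal CI near 0 or 1."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def pass_at_k(n, c, k):
    """Unbiased pass@k from n samples per problem, c of which passed."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def familywise_error(n_tests, alpha=0.05):
    """P(at least one false positive) across independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests
```

At n=200 and p=0.5 the half-width is about 0.069, so a 3-point gap between two models on such a benchmark sits inside the noise. With 20 independent tests at α=0.05 the expected number of false positives is 20 × 0.05 = 1, and the chance of at least one is about 64%.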

Section K — Harnesses (Q83–86)

What does lm-eval-harness do? Why is it the academic default?
What's HELM and what makes it different from lm-eval-harness?
What's Inspect (UK AISI) and when do you use it?
Compare RAGAS, TruLens, DeepEval for RAG eval.

Section L — Online eval and A/B (Q87–95)

What surrogate quality metrics would you log for a chat product?
What does "regenerate rate" tell you?
How do you sample production traffic for online eval?
How do you size an A/B test for a chat product (binary success metric, p≈0.3, lift δ=2%)?
What is CUPED? Why does it matter for LLM A/B tests?
Why does latency matter as a guardrail in LLM A/B?
Why is "selection bias from refusals" a concern?
Walk through offline → shadow → canary → A/B for an LLM product.
What's sequential testing (mSPRT)? When would you use it?
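
For the A/B sizing and CUPED questions above, a rough sketch under stated assumptions: a two-sided test at α=0.05 with 80% power, the standard two-proportion sample-size approximation, and the 2% lift read as absolute (0.30 to 0.32). The source does not say whether the lift is absolute or relative, so that reading is a guess. The CUPED function uses a single pre-experiment covariate per user.

```python
import math
from statistics import NormalDist


def two_proportion_sample_size(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect p_control vs p_treatment."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    delta = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


def cuped_adjust(y, x):
    """CUPED: subtract the part of metric y explained by pre-period covariate x.

    theta = cov(x, y) / var(x); the adjusted metric keeps the same mean as y
    but has lower variance when x and y are correlated.
    """
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        return list(y)  # covariate carries no information
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    theta = cov_xy / var_x
    return [yi - theta * (xi - mean_x) for xi, yi in zip(x, y)]
```

Under the absolute reading, two_proportion_sample_size(0.30, 0.32) comes out at roughly 8,400 users per arm. Under a relative reading (0.30 to 0.306) the requirement is about eleven times larger, which is why the interpretation of the lift should be stated before sizing.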

Section M — Product eval design (Q96–100)

Walk me through designing the eval for a customer-support chatbot. Use the four-layer pattern.
How do you build a 500-prompt golden set for a chatbot?
How often do you refresh the golden set? Why?
What does it mean to "calibrate the LLM judge to humans" for a product? Walk through.
List five failure modes a good eval suite catches.

Quick fire (Q101–115)

One line: what does IFEval measure?
One line: what does RULER measure?
One line: what does FactScore measure?
One line: what is pass@k?
One line: what is length-controlled win rate?
One line: SimpleQA vs TruthfulQA.
One line: SAFE.
One line: Min-K%-prob.
One line: Elo.
One line: CUPED.
One line: Lost in the Middle.
One line: HELM.
One line: Inspect framework.
One line: Arena-Hard-Auto.
One line: Bradley-Terry.

Self-grading

90+ correct: ready for frontier-lab eval rounds.
70–89: re-read §5 (judges), §11 (stats), §15 (case study).
50–69: re-read full deep dive then redo.
<50: spend two days on the deep dive, then come back.

5-day drill plan

Day 1: §1–4 (why hard, taxonomy, knowledge benchmarks). Drill A, B, C.
Day 2: §5–7 (LLM judge, pairwise, open-ended). Drill E, F, G.
Day 3: §8–9 (factuality, contamination). Drill H, I.
Day 4: §11 + §13–14 (stats, online, A/B). Drill J, L.
Day 5: §15 case study + §16 senior signals + Quick fire. Whiteboard a product eval suite end-to-end.
