Topic 60: Research Judgment Rounds

What You'll Learn

This topic focuses on the interview segment where you are shown a result, claim, or experiment and asked whether it is convincing.

You will practice:

judging evidence quality
asking for missing ablations
identifying leakage and confounds
checking metric validity
evaluating robustness and variance
proposing the next experiment

Why This Matters

This is where many strong coders underperform.

They describe methods well, but they do not interrogate evidence well.

A research scientist interview expects you to say:

what is missing
what is weak
what you would run next
what conclusion is and is not justified

Core Intuition

Research judgment is the ability to separate:

the result that was measured
the claim being made
the causal explanation being implied

Those are not the same thing.

A paper, experiment, or benchmark can show a real improvement while still not proving the story attached to it.

That is why good research answers sound careful but not vague.

You are trying to identify:

what the evidence supports
what alternative explanations remain
what experiment would reduce that uncertainty next

Files in This Topic

rounds.md: scenario-based research rounds
judgment_checklist.md: compact interview checklist

Technical Details Interviewers Often Want

Fair Comparison

Whenever a result improves, ask:

what changed?
what stayed fixed?

Important hidden confounds include:

more data
more compute
different decoding
different retrieval setup
different prompt template
better hyperparameter tuning

If multiple things changed, causal attribution is weak.

Metric Fit

A metric can be computed correctly and still be the wrong metric.

Examples:

perplexity for a task where factual grounding matters more
recall@k without checking whether generation used the evidence
average score hiding catastrophic failures on important slices

The question is not just "did the metric improve?"

It is "does this metric track the behavior we actually care about?"

Variance and Slicing

Average improvement is often not enough.

Interviewers want to know whether you would ask for:

multiple seeds
confidence intervals
subgroup analysis
failure-category analysis
long-tail examples

This is especially important for LLM systems where improvements may be concentrated in narrow prompt families.

Next-Experiment Logic

A good follow-up experiment should reduce uncertainty about the main claim.

That means the next experiment should be chosen because it tests a specific alternative explanation, not because it is merely interesting.

Common Failure Modes

1. Confusing Correlation with Mechanism

A method may correlate with improvement without being the true cause if other variables moved at the same time.

2. Accepting Headline Averages Too Quickly

Average gain can hide:

instability
regression on rare cases
benchmark artifacts
narrow prompt-template gains

Clarify the claim.
Ask what changed.
Ask what stayed fixed.
Check metric fit.
Ask for variance and slices.
State the strongest conclusion justified by the evidence.

What to Practice Saying Out Loud

What exact claim is this experiment trying to support?
What changed, and what must be held fixed for this comparison to be fair?
Which metric could improve while the true user outcome gets worse?
What is the single best next experiment to reduce uncertainty?
What conclusion is justified, and what conclusion is still too strong?

ML & LLM Interview Prep — Deep Dives

Topic 60: Research Judgment Rounds

What You'll Learn

Why This Matters

Core Intuition

Files in This Topic

Technical Details Interviewers Often Want

Fair Comparison

Metric Fit

Variance and Slicing

Next-Experiment Logic

Common Failure Modes

1. Confusing Correlation with Mechanism

2. Accepting Headline Averages Too Quickly

3. Asking for More Experiments Without Prioritizing

4. Ignoring Baseline Strength

Edge Cases and Follow-Up Questions

What if the new method is better on average but worse on critical slices?

What if one seed is much better than the others?

What if the method improves offline metrics but slows serving dramatically?

What if the gain disappears after controlling for compute?

Standard Response Pattern

What to Practice Saying Out Loud