Research Judgment Rounds
Round 1: One Metric Improved
Prompt
A new model improves perplexity by 3% but shows no gain on downstream QA.
Good answer structure
- perplexity (the objective metric) and QA accuracy (the task metric) measure different things
- possible mismatch between training objective and task success
- decoding or calibration may matter
- inspect slices and answer-type breakdown
- do not overclaim downstream improvement
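The slice inspection above can be sketched as a per-answer-type accuracy breakdown. This is a minimal illustration, not a prescribed evaluation harness: the slice names, the record format, and the per-example results are all hypothetical.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """Return {slice_name: accuracy} for a list of (slice_name, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        hits[slice_name] += int(is_correct)
    return {name: hits[name] / totals[name] for name in totals}

# Hypothetical per-example QA results, made up for illustration.
baseline = [("numeric", True), ("numeric", False), ("entity", True), ("entity", True)]
new_model = [("numeric", True), ("numeric", True), ("entity", True), ("entity", False)]

base_acc = accuracy_by_slice(baseline)
new_acc = accuracy_by_slice(new_model)
for name in sorted(base_acc):
    print(f"{name}: baseline={base_acc[name]:.2f} new={new_acc[name]:.2f}")
```

In this toy data both models score 0.75 overall, yet the slices move in opposite directions, which is the kind of pattern a flat aggregate can hide.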
Round 2: One Seed Only
Prompt
A method beats the baseline on one seed but not on the others.
Good answer structure
- do not claim robust improvement
- report mean and variance
- increase number of seeds
- inspect sensitivity to initialization or optimizer noise
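Reporting mean and variance across seeds might look like the sketch below. The scores are invented for illustration, and the sample standard deviation is only one reasonable choice of spread.

```python
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) over per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical scores, one per seed, made up for illustration.
baseline_scores = [71.0, 72.0, 73.0]
method_scores = [75.0, 71.0, 70.0]

base_mean, base_std = summarize(baseline_scores)
meth_mean, meth_std = summarize(method_scores)

# Count the seeds on which the method beats the baseline (paired by seed).
wins = sum(m > b for m, b in zip(method_scores, baseline_scores))

print(f"baseline: {base_mean:.2f} +/- {base_std:.2f}")
print(f"method:   {meth_mean:.2f} +/- {meth_std:.2f} (wins on {wins} of {len(method_scores)} seeds)")
```

Here the method wins clearly on one seed, but the means are equal and the method's spread is larger, so the single-seed win does not support a claim of robust improvement.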
Round 3: Strong Gain, Weak Baseline
Prompt
A paper reports a big gain, but the baseline is outdated and under-tuned.
Good answer structure
- the result is not yet convincing
- a stronger, properly tuned baseline is needed
- both systems need the same data and compute budget
- isolate whether the gain is real or due to weak comparison
Round 4: Retrieval Metric Improves, Final System Gets Worse
Prompt
Recall@10 improved, but final answer accuracy dropped.
Good answer structure
- retrieved context may be noisier
- ordering and truncation may hurt
- generator may ignore evidence
- retrieval and generation objectives are not identical
- inspect failure stage explicitly
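One way to inspect the failure stage explicitly is to label each example by where it went wrong. The sketch below assumes two hypothetical per-example flags: whether the gold evidence appeared in the retrieved set, and whether the final answer was correct. The stage names are illustrative.

```python
from collections import Counter

def failure_stage(gold_retrieved, answer_correct):
    """Attribute an example to a pipeline stage using two boolean flags."""
    if answer_correct:
        return "correct"
    # The answer is wrong: blame the generator only if the evidence was available.
    return "generation_failure" if gold_retrieved else "retrieval_failure"

# Hypothetical (gold_retrieved, answer_correct) pairs, made up for illustration.
examples = [(True, True), (True, False), (True, False), (False, False)]

counts = Counter(failure_stage(g, a) for g, a in examples)
for stage, n in sorted(counts.items()):
    print(f"{stage}: {n}")
```

If most errors land in `generation_failure`, higher Recall@10 is not the bottleneck, and ordering, truncation, or the generator's use of evidence deserve a closer look. Note that this simple split does not separate answers that were correct without using the evidence.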
Round 5: Small Reported Improvement
Prompt
A paper reports a 0.2-point gain on a benchmark.
Good answer structure
- ask for variance across runs
- ask whether the metric is saturated
- ask whether the gain is consistent across slices
- check whether compute/data changed
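The variance question can be made concrete by comparing the reported gain with run-to-run noise. The sketch below uses invented scores and takes the larger of the two sample standard deviations as a rough noise estimate, which is an assumption rather than a standard test.

```python
import statistics

def gain_vs_noise(baseline_runs, method_runs):
    """Return (mean gain, noise), where noise is the larger sample std of the two groups."""
    gain = statistics.mean(method_runs) - statistics.mean(baseline_runs)
    noise = max(statistics.stdev(baseline_runs), statistics.stdev(method_runs))
    return gain, noise

# Hypothetical benchmark scores over repeated runs, made up for illustration.
baseline_runs = [80.0, 80.5, 79.5]
method_runs = [80.2, 80.7, 79.7]

gain, noise = gain_vs_noise(baseline_runs, method_runs)
print(f"gain={gain:.2f} noise={noise:.2f} ratio={gain / noise:.2f}")
```

With these numbers the gain is about 0.2 points against a spread of about 0.5, so the improvement sits well inside the noise and would need more runs before it could be trusted.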
Round 6: Bigger Model Wins
Prompt
A method improves results, but it also uses a much larger model.
Good answer structure
- model capacity confounds the method's effect
- need matched-size comparison
- need compute-normalized or parameter-normalized evidence
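A matched-size comparison can be sketched by grouping results by parameter count and comparing only where both systems were run at the same size. The method labels, sizes, and scores below are hypothetical.

```python
def matched_size_deltas(results):
    """results: list of (label, params_in_millions, score) tuples.

    Returns {size: method_score - baseline_score} for sizes where both labels exist.
    """
    by_size = {}
    for label, size, score in results:
        by_size.setdefault(size, {})[label] = score
    return {
        size: scores["method"] - scores["baseline"]
        for size, scores in sorted(by_size.items())
        if "method" in scores and "baseline" in scores
    }

# Hypothetical results, made up for illustration.
results = [
    ("baseline", 125, 60.0),
    ("baseline", 350, 64.0),
    ("method", 350, 64.5),
    ("method", 1300, 70.0),
]

deltas = matched_size_deltas(results)
print(deltas)
```

In this toy data the headline comparison (70.0 versus 64.0) mixes method and capacity, while the only matched point, at 350M parameters, shows a 0.5-point difference.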
Round 7: Preference Win Rate Improved
Prompt
Human preference win rate improved, but factuality declined.
Good answer structure
- preference signal may reward style over truth
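- the reward may be misspecified: it optimizes what raters prefer, not what is correct
- evaluation mismatch: win rate and factuality measure different properties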
- need factuality and robustness checks in parallel
Round 8: Benchmark Leakage Suspicion
Prompt
Results look unusually strong on one benchmark but not others.
Good answer structure
- check contamination
- check preprocessing overlap
- inspect dataset construction
- compare transfer to other benchmarks
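A first-pass contamination check is often an n-gram overlap between benchmark items and training text. The sketch below is a simplified version under stated assumptions: whitespace tokenization, lowercasing, and a default n of 8 are illustrative choices, and the example strings are made up.

```python
def ngrams(text, n):
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(test_example, training_corpus, n=8):
    """Fraction of the test example's n-grams that also occur in the training corpus."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return 0.0  # Example is shorter than n tokens: nothing to compare.
    train_grams = set()
    for doc in training_corpus:
        train_grams |= ngrams(doc, n)
    return len(test_grams & train_grams) / len(test_grams)

# Hypothetical training text and benchmark items, made up for illustration.
training_corpus = ["the quick brown fox jumps over the lazy dog"]
suspect = "quick brown fox jumps high"
clean = "an unrelated question about chemistry"

print(overlap_fraction(suspect, training_corpus, n=3))
print(overlap_fraction(clean, training_corpus, n=3))
```

A high overlap fraction on many benchmark items is a reason to inspect dataset construction and preprocessing more closely; a low one does not rule out paraphrased or translated leakage.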