Research Judgment Rounds

Round 1: One Metric Improved

Prompt

A new model improves perplexity by 3% but shows no gain on downstream QA.

Good answer structure

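perplexity measures the training objective; QA accuracy measures task success, and the two need not move together
the training objective may be mismatched with task success: lower perplexity can come from tokens that do not decide the answer
decoding settings or calibration may hide the gain at generation time
inspect slices and the answer-type breakdown to see where, if anywhere, QA changes
do not claim downstream improvement from perplexity alone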
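
The slice inspection above can be sketched as a per-answer-type accuracy comparison. This is a minimal illustration, not a standard tool: the field names `answer_type` and `correct`, and the function names, are assumptions made for the example.

```python
from collections import defaultdict


def accuracy_by_slice(examples):
    """Return {slice_name: (accuracy, count)}.

    Each example is assumed to be a dict with a hypothetical
    'answer_type' label and a boolean 'correct' flag.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for ex in examples:
        totals[ex["answer_type"]] += 1
        hits[ex["answer_type"]] += int(ex["correct"])
    return {k: (hits[k] / totals[k], totals[k]) for k in totals}


def slice_deltas(baseline, candidate):
    """Accuracy change per slice, for slices present in both runs."""
    base = accuracy_by_slice(baseline)
    cand = accuracy_by_slice(candidate)
    return {k: cand[k][0] - base[k][0] for k in base if k in cand}
```

A flat aggregate can hide a gain on one answer type that is cancelled by a loss on another, so the per-slice deltas are worth reading alongside the slice counts.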

Round 2: One Seed Only

Prompt

A method beats baseline on one seed but not the others.

Good answer structure

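do not claim a robust improvement from a single winning seed
report mean and variance across all seeds, not the best run
increase the number of seeds until the comparison is stable
inspect sensitivity to initialization and optimizer noise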
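
Reporting mean and variance across seeds might look like the sketch below. It assumes per-seed scores are already collected as plain lists, paired by seed index; the function names are hypothetical.

```python
import statistics


def summarize(scores):
    """Mean and sample standard deviation of per-seed scores."""
    mean = statistics.mean(scores)
    # stdev needs at least two points; report 0.0 for a single seed.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


def seeds_won(method_scores, baseline_scores):
    """Number of seeds on which the method beats the baseline.

    Assumes both lists are ordered by the same seeds.
    """
    return sum(m > b for m, b in zip(method_scores, baseline_scores))
```

If the difference in means is smaller than the per-seed standard deviation, or the method wins on only one seed, the honest summary is "no robust improvement shown yet".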

Round 3: Strong Gain, Weak Baseline

Prompt

A paper reports a big gain, but the baseline is outdated and under-tuned.

Good answer structure

the result is not yet convincing
a stronger, current, and properly tuned baseline is needed
both methods need the same data and compute budget
isolate whether the gain comes from the method or from the weak comparison

Round 4: Retrieval Metric Improves, Final System Gets Worse

Prompt

Recall@10 improved, but final answer accuracy dropped.

Good answer structure

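higher recall may come with noisier retrieved context
ordering and truncation may push the relevant passage out of the generator's input
generator may ignore the evidence even when it is present
retrieval and generation objectives are not identical
attribute each failure to a stage explicitly: retrieval, context construction, or generation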
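
One way to make the failure stage explicit is to label each example with the first stage that went wrong. The sketch below assumes three hypothetical boolean fields per example (`gold_retrieved`, `gold_in_context`, `answer_correct`); a real pipeline would need its own way to compute them.

```python
from collections import Counter


def failure_stage(ex):
    """Label an example with the earliest stage that failed."""
    if ex["answer_correct"]:
        return "correct"
    if not ex["gold_retrieved"]:
        return "retrieval_miss"        # gold passage not in the top-k
    if not ex["gold_in_context"]:
        return "lost_in_truncation"    # retrieved, but cut or reordered out
    return "generator_failure"         # evidence was present, answer still wrong


def stage_counts(examples):
    """Count examples per stage label."""
    return Counter(failure_stage(ex) for ex in examples)
```

Comparing these counts before and after the retrieval change shows whether the accuracy drop comes from truncation, from the generator, or from somewhere else.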

Round 5: Small Reported Improvement

Prompt

A paper reports a 0.2-point gain on a benchmark.

Good answer structure

ask for variance across runs; a 0.2-point gain may be within noise
ask whether the metric is saturated
ask whether the gain is consistent across slices
check whether compute or data changed between the two systems
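
As a complement to run-to-run variance, a paired bootstrap over test examples gives a rough sense of whether a small gain survives resampling of the test set. This is a minimal sketch under stated assumptions: per-example scores for both systems are available on the same examples in the same order, and the function name and defaults are illustrative.

```python
import random


def paired_bootstrap(base, cand, n_resamples=1000, seed=0):
    """Fraction of resamples in which the candidate's mean beats the baseline's.

    base and cand are per-example scores on the same examples, same order.
    """
    if len(base) != len(cand):
        raise ValueError("scores must be paired per example")
    rng = random.Random(seed)
    n = len(base)
    wins = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(cand[i] - base[i] for i in idx) / n
        if delta > 0:
            wins += 1
    return wins / n_resamples
```

A value near 0.5 suggests the gain is indistinguishable from test-set sampling noise. This check says nothing about training variance, so it does not replace multiple runs.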

Round 6: Bigger Model Wins

Prompt

A method improves results, but it also uses a much larger model.

Good answer structure

capacity confounds the method effect
need a matched-size comparison
need compute-normalized or parameter-normalized evidence
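
A matched-size comparison can be sketched as keeping only the model sizes at which both the method and the baseline were run. The result layout `{(method, n_params): score}` and the labels `"new"` and `"baseline"` are assumptions for this example.

```python
def matched_size_effects(results):
    """Return {n_params: new_score - baseline_score} for sizes with both runs.

    results maps (method, n_params) to a score; sizes evaluated for only
    one of the two methods are dropped, since they cannot separate the
    method effect from the capacity effect.
    """
    effects = {}
    for (method, size), score in results.items():
        if method != "new":
            continue
        base = results.get(("baseline", size))
        if base is not None:
            effects[size] = score - base
    return effects
```

If the only available "new" result sits at a size the baseline never ran at, the returned dictionary is empty, which is itself the finding: there is no matched evidence yet.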

Round 7: Preference Win Rate Improved

Prompt

Human preference win rate improved, but factuality declined.

Good answer structure

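preference signal may reward style over truth
this is a form of reward misspecification
evaluation mismatch: win rate measures what raters prefer, not what is correct
run factuality and robustness checks in parallel with preference evaluation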
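
To check whether preference is rewarding style over truth, one rough diagnostic is the share of comparisons in which raters preferred the less factual response. The sketch assumes each comparison carries two hypothetical boolean labels, `preferred_factual` and `rejected_factual`, produced by a separate factuality check.

```python
def style_over_truth_rate(comparisons):
    """Fraction of comparisons where the preferred response is not factual
    but the rejected one is."""
    if not comparisons:
        return 0.0
    flagged = sum(
        1
        for c in comparisons
        if c["rejected_factual"] and not c["preferred_factual"]
    )
    return flagged / len(comparisons)
```

A rising rate across model versions would support the reward-misspecification reading of the prompt.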

Round 8: Benchmark Leakage Suspicion

Prompt

Results look unusually strong on one benchmark but not others.

Good answer structure

check for test-set contamination in the training data
check for overlap introduced by preprocessing
inspect how the dataset was constructed
compare against transfer to other benchmarks
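
A first-pass contamination check can be sketched as n-gram overlap between test examples and training documents. The whitespace tokenization, lowercasing, and the default `n=8` are simplifying assumptions for this example, not a recommended standard; exact-match n-grams will miss paraphrased leakage.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_examples, train_docs, n=8):
    """Fraction of test examples sharing at least one n-gram with training data."""
    if not test_examples:
        return 0.0
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_grams)
    return flagged / len(test_examples)
```

A high rate on the one strong benchmark and a low rate on the others would be consistent with leakage rather than a genuine capability gain.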