Frontier Alignment + RL — Interview Grill

130+ active-recall questions calibrated for OpenAI / DeepMind / Anthropic research-scientist rounds. Pair with REASONING_MODELS_DEEP_DIVE.md, FRONTIER_REWARD_MODELING.md, OPEN_SOURCE_POSTTRAIN_PLAYBOOKS.md in this folder. Answer each in <60 seconds aloud. Mark anything unclear and re-read the relevant section.

Section A — Reasoning paradigm and test-time compute (Q1–10)

What changed about LLM training between mid-2024 and 2025? Why is "reasoning RL" a paradigm shift?
State Snell et al.'s test-time compute scaling claim in one sentence.
Walk through three test-time compute strategies (best-of-N, sequential revision, search-via-verifier) and when each is best.
Why is the compute-optimal frontier task-difficulty-dependent?
A 7B model given extra inference compute can match a model roughly 14× larger (about 100B) in a FLOPs-matched comparison. What's the trick?
Sketch the test-time compute scaling curve from memory (axes, log-linear region, saturation).
What's the relationship between training-time compute and inference-time compute?
Why does a reasoning model's per-query cost matter for product design?
What's a "router" in a reasoning-model deployment, and why?
Why is reasoning RL different from classical RLHF?
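
Self-check sketch for the difficulty-dependence question above, best read after attempting it. It assumes independent samples, a fixed per-sample success probability p, and a perfect verifier that always selects a correct sample when one exists. `coverage` is a hypothetical helper name, not something taken from Snell et al.

```python
def coverage(p: float, n: int) -> float:
    """Probability that at least one of n independent samples is correct.

    Assumption: each sample succeeds independently with probability p,
    and an oracle verifier picks a correct sample whenever one exists.
    """
    return 1.0 - (1.0 - p) ** n


# A medium-difficulty prompt (p = 0.3) versus a very hard one (p = 0.001).
for n in (1, 4, 16, 64):
    print(n, round(coverage(0.3, n), 4), round(coverage(0.001, n), 4))
```

Under these assumptions the medium prompt is nearly saturated by 16 samples while the very hard prompt has barely moved, which is one way to see why the best use of a fixed inference budget depends on difficulty.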

Section B — RLVR (Q11–22)

Define RLVR. What makes a reward "verifiable"?
Give five examples of verifiable rewards. Five examples of non-verifiable.
Why is verifiable reward strictly preferred over preference reward when available?
Sketch the RLVR objective formula with KL.
What's a format reward? Why is it usually small relative to correctness?
What's a language-consistency reward? Why did R1 add one?
Why do verifiable rewards resist most reward-hacking patterns?
What can be hacked even with verifiable rewards? (verifier exploits)
Why is GRPO/RLOO often preferred over PPO in RLVR?
What's the role of $\pi_{\text{ref}}$ in RLVR?
What happens if the success rate on the training set is <1%?
Walk through curriculum design for RLVR.
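
Self-check sketch for the objective, GRPO, and low-success-rate questions above. The KL-regularized objective is commonly written as $\max_\theta \; \mathbb{E}_{x,\, y \sim \pi_\theta}[r(x, y)] - \beta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$. The code below illustrates only the group-relative advantage step used by GRPO-style methods. The function name, the `eps` value, and the use of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each completion's reward against its own sampling group.

    Assumption: one group = all completions sampled for a single prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, eight sampled completions, binary verifier reward.
mixed = group_relative_advantages([1, 0, 0, 1, 0, 0, 0, 0])

# A prompt the policy never solves: every reward is 0.
unsolved = group_relative_advantages([0] * 8)

print(mixed)
print(unsolved)
```

In this sketch the never-solved prompt yields all-zero advantages, so it contributes no policy-gradient signal. That is the mechanical reason a training set with a near-zero success rate stalls, and one motivation for curriculum design.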

Section C — PRMs vs ORMs (Q23–33)

Define PRM and ORM.
Why are PRMs theoretically attractive for long CoT?
Cite Lightman et al. 2023 — what dataset did OpenAI release?
How does Math-Shepherd auto-label step correctness?
How does OmegaPRM extend Math-Shepherd?
What did DeepSeek-R1 conclude about PRM vs ORM?
Two ways to use a PRM in RL training — what are they?
Why might PRM data be noisier than ORM data?
How can a policy hack a PRM more easily than an ORM?
What's a generative reward model? How does it differ from a scalar RM?
Mahan et al. 2024 — why did genRMs match scalar RMs on hard tasks?
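
Self-check sketch for the Math-Shepherd question above. It shows the general idea of labeling a step by rolling out completions from that step's prefix and checking final answers against the gold answer. `label_steps` and `toy_complete` are hypothetical names, and `toy_complete` is a deterministic stand-in for sampling from a real policy.

```python
def label_steps(steps, complete, gold, n_rollouts=4):
    """Return one (hard_label, soft_label) pair per step prefix.

    hard_label: 1 if any rollout from the prefix reaches the gold answer.
    soft_label: fraction of rollouts that reach the gold answer.
    """
    labels = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        hits = sum(complete(prefix) == gold for _ in range(n_rollouts))
        labels.append((int(hits > 0), hits / n_rollouts))
    return labels


def toy_complete(prefix):
    # Hypothetical stand-in for a sampled completion: once a wrong step
    # is in the prefix, no rollout recovers the correct answer.
    return "41" if any("ERROR" in s for s in prefix) else "42"


labels = label_steps(["6 * 7 = 42", "ERROR: 42 - 1", "answer"], toy_complete, "42")
print(labels)
```

With a real sampler the soft labels are noisy Monte Carlo estimates, which bears on the question of why PRM data can be noisier than ORM data.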

Section D — Search + RL (Q34–43)

Walk through STaR (Zelikman 2022).
Why does rationalization work in STaR?
What's Quiet-STaR doing differently?
Walk through V-STaR.
Walk through ReST^EM.
Compare expert iteration (Anthony 2017) with ReST^EM.
Why is MCTS hard to combine with discrete-token LM state spaces?
Sketch how AlphaProof uses Lean for the value function.
What does AlphaGeometry use for the verifier?
Why is the rejection-sampling SFT pattern essentially "free signal"?
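
Self-check sketch for the rejection-sampling question above. It covers only the generate-and-filter step that STaR-style methods, ReST^EM, and rejection-sampling SFT share; the fine-tuning step is omitted. The names `rejection_sample`, `toy_sample`, and `toy_verify` are hypothetical, and the toy sampler is a noisy adder standing in for a policy.

```python
import random


def rejection_sample(prompts, sample, verify, k=4):
    """Sample k candidates per prompt and keep only verified ones."""
    kept = []
    for prompt in prompts:
        for _ in range(k):
            candidate = sample(prompt)
            if verify(prompt, candidate):
                kept.append((prompt, candidate))
    return kept


rng = random.Random(0)  # seeded so the toy run is repeatable


def toy_sample(prompt):
    # Hypothetical policy: adds two numbers, sometimes off by one.
    a, b = prompt
    return a + b + rng.choice([0, 0, 1])


def toy_verify(prompt, candidate):
    # Verifiable reward: exact match with the true sum.
    a, b = prompt
    return candidate == a + b


prompts = [(1, 2), (3, 4), (10, 5)]
sft_data = rejection_sample(prompts, toy_sample, toy_verify)
print(len(sft_data), "verified examples out of", len(prompts) * 4)
```

The only supervision is the verifier, so the kept examples cost no extra labeling, which is the sense in which the signal is "free".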

Section E — R1-Zero (Q44–52)

What was the starting point of R1-Zero?
What rewards did it use?
What's the "aha moment"?
Why is the aha moment surprising? (Note: the model was never given supervised self-correction demonstrations; the behavior emerged under RL.)
What's R1-Zero's headline AIME score curve?
What does R1-Zero prove about latent capability vs RL?
Three failure modes of R1-Zero.
Why didn't this work in 2022?
Could you replace verifiable reward with preference data? Why or why not?

Section F — R1 full pipeline (Q53–63)

List R1's four stages.
What's "cold-start SFT" and why is it needed?
What rewards are added in stage 2 vs R1-Zero?
How big is the rejection-sampling SFT dataset in stage 3?
What's the math/non-math split in stage 3?
Why re-SFT V3-base in stage 3 instead of stage-2's weights?
What does the final RLHF stage target?
Why four stages instead of one?
What's the data ratio between reasoning and chat in stage 3?
How does R1-Distill work?
R1-Distill-Qwen-32B beats GPT-4o on what benchmarks? Why does that matter?

Section G — Tülu 3 / Llama 3 / Qwen (Q64–73)

What's Tülu 3's three-stage recipe?
What's RLVR's contribution beyond standard SFT+DPO in Tülu 3?
Why does Tülu 3 use length-normalized DPO?
What's Llama 3's iterative SFT+DPO loop?
Why did Meta choose DPO over PPO at 405B?
What's "rejection-sampled SFT data" in Llama 3?
Why is there no reasoning-RL stage in Llama 3.1?
What does QwQ-32B's recipe look like?
Why might PRMs help in Qwen but not in R1?
Compare R1's recipe with Tülu 3's stage by stage.

Section H — Reward modeling (Q74–86)

Sketch the BT loss for RM training.
Why does a scalar RM score have no absolute meaning?
What's reward overoptimization (Gao et al. 2023)?
Sketch the overoptimization curve. What's on each axis?
What does it mean that "RM goes OOD as policy drifts"?
Why does ensembling RMs help?
Why does iterative RM refresh help?
Why does a KL penalty bound overoptimization?
What's RewardBench? What does it measure?
What's RLAIF? Cite the canonical paper.
Walk through Constitutional AI (Bai et al. 2022).
What's a self-rewarding LM (Yuan et al.)?
Why does self-rewarding plateau without external signal?
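
Self-check sketch for the BT-loss and "no absolute meaning" questions above. It is a per-pair Bradley-Terry loss in plain Python, not any particular library's implementation. `bt_loss` is a hypothetical name.

```python
import math


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected)."""
    margin = r_chosen - r_rejected
    # log(1 + exp(-margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


print(bt_loss(2.0, 1.0))      # chosen scored higher: lower loss
print(bt_loss(1.0, 2.0))      # chosen scored lower: higher loss
print(bt_loss(102.0, 101.0))  # same margin as the first call
```

The loss depends only on the difference of the two scores, so adding any constant to every reward leaves it unchanged. That shift invariance is why a scalar RM score is meaningful only relative to other scores on the same prompt.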

Section I — Reward hacking (Q87–96)

Define reward hacking and Goodhart's law in this context.
List five named reward-hack patterns and one mitigation each.
Length bias — diagnose and mitigate.
Sycophancy — diagnose and mitigate.
Format bias — diagnose and mitigate.
Refusal-rate bias — diagnose and mitigate.
Verifier hack — what is it and how do you defend?
Prompt-injection of a genRM — what is it and how do you defend?
How do you detect overoptimization in a production training run?
What's the role of "held-out judge from a different family" in monitoring?
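
Self-check sketch for the overoptimization-detection and held-out-judge questions above. It flags the first checkpoint where the training (proxy) reward is still rising while a held-out judge's score has started falling. The function name, the window size, and the example numbers are all made up for illustration.

```python
def overoptimization_onset(proxy, heldout, window=3):
    """Return the first checkpoint index where, over the last `window`
    checkpoints, proxy reward rose while held-out score fell; else None."""
    for t in range(window, len(proxy)):
        proxy_up = proxy[t] > proxy[t - window]
        heldout_down = heldout[t] < heldout[t - window]
        if proxy_up and heldout_down:
            return t
    return None


# Illustrative per-checkpoint scores (not real training data).
proxy_reward = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70]
heldout_score = [0.50, 0.55, 0.60, 0.62, 0.61, 0.58, 0.54]

onset = overoptimization_onset(proxy_reward, heldout_score)
print("first flagged checkpoint:", onset)
```

A real monitor would also track KL from the reference policy and response length, and would smooth noisy evaluations before comparing them.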

Section J — Inference-time strategies (Q97–104)

What does self-consistency (Wang et al. 2022) do?
Best-of-N + RM — when is this strictly better than self-consistency?
What's MBR decoding and when is it better than best-of-N?
What's verifier-guided beam search?
What's compute-optimal inference allocation across difficulties?
What temperature does R1 default to and why?
Why does greedy decoding sometimes underperform on reasoning?
What's a "fast/slow" routing layer in a reasoning-model deployment?
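
Self-check sketch contrasting self-consistency with best-of-N + RM from the questions above. The answers and reward-model scores are invented for illustration, and both function names are hypothetical.

```python
from collections import Counter


def self_consistency(answers):
    """Majority vote over final answers; no reward model needed."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers, scores):
    """Return the answer whose sample received the highest RM score."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]


# Five sampled final answers and made-up RM scores for each sample.
answers = ["12", "12", "15", "12", "15"]
rm_scores = [0.2, 0.3, 0.9, 0.1, 0.4]

voted = self_consistency(answers)
picked = best_of_n(answers, rm_scores)
print(voted, picked)
```

The two methods disagree here: voting follows the most frequent answer, while best-of-N follows the reward model. Best-of-N can win when the correct answer is in the minority and the RM is reliable, and it also applies to open-ended outputs where exact-match voting is undefined.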

Section K — Failure modes and safety (Q105–112)

What's overthinking? How do you mitigate?
Why are reasoning models worse-calibrated on factual QA than on math?
What's hallucinated reasoning? Why is it dangerous?
What's deliberative alignment (Guan et al. 2024, OpenAI)?
How does deliberative alignment differ from refusal training?
Why must safety operate over the CoT, not just the answer?
Why does Constitutional AI matter for reasoning models specifically?
How would you red-team a reasoning model?

Section L — Open frontier questions (Q113–120)

Can RL elicit capabilities the base model doesn't have?
Is the inference-compute scaling law universal? When does it break?
Should production frontier models use PRMs?
Will multi-agent debate scale as a reward source?
Will self-play (SPIN, Self-Rewarding) eventually plateau or keep climbing?
What's the moat in frontier labs — weights, data, or RL infrastructure?
How would you design RL on long-horizon agent trajectories?
What's the role of formal verifiers (Lean, Coq) in future reasoning RL?

Section M — Senior scenario questions (Q121–130)

Scenario. Design a 6-stage post-training pipeline for a 70B reasoning model from scratch.
Scenario. You're seeing length blow up over RL training. What's wrong and what do you ship?
Scenario. Your RM's RewardBench score is 92% but your policy is regressing on chat-hard. Why?
Scenario. A red-teamer demonstrates a verifier hack. How do you fix it?
Scenario. Your reasoning model overthinks easy questions. Walk through routing + budget design.
Scenario. You only have 5k high-quality long-CoT examples. Can you train a reasoning model? How?
Scenario. Sketch out how you'd use an LLM-as-judge as an RL reward signal — including the fail-safes.
Scenario. Production telemetry shows refusal rate climbing 20% over the last week with no model update. Diagnose.
Scenario. Compare PPO + scalar RM vs DPO + iterative refresh vs GRPO + verifier — pick one for a math-only task and justify.
Scenario. You want to distill a 70B reasoning model into a 7B. Walk through the recipe and the key knobs.

Quick fire (Q131–150)

One line: RLVR.
One line: GRPO vs PPO.
One line: PRM vs ORM.
One line: STaR vs ReST^EM.
One line: R1-Zero vs R1.
One line: Cold-start SFT.
One line: Rejection-sampling SFT.
One line: R1-Distill.
One line: Math-Shepherd.
One line: OmegaPRM.
One line: Constitutional AI.
One line: RLAIF.
One line: Reward overoptimization.
One line: RewardBench.
One line: Length-normalized DPO.
One line: Self-Rewarding LMs.
One line: Deliberative alignment.
One line: Test-time compute scaling.
One line: Generative reward model.
One line: AlphaProof.

Self-grading

130+ correct: ready for OpenAI / DeepMind / Anthropic research-scientist rounds.
100–129: re-read REASONING_MODELS §1–7, FRONTIER_REWARD_MODELING §3, §8.
70–99: re-read all three deep dives once more, then redo.
<70: spend 4 days on the deep dives + read the actual R1 paper, then come back.

7-day drill plan

Day 1: REASONING_MODELS §1–4 (paradigm, test-time, RLVR, PRM/ORM). Drill A, B, C.
Day 2: REASONING_MODELS §5–7 (search+RL, R1-Zero, R1). Drill D, E, F.
Day 3: REASONING_MODELS §8–14 (o1 inferences, distillation, inference, failure modes, open Qs). Drill G, J, K, L.
Day 4: FRONTIER_REWARD_MODELING all sections. Drill H, I.
Day 5: OPEN_SOURCE_POSTTRAIN_PLAYBOOKS all sections. Memorize 60-90s answers for R1, Tülu 3, Llama 3. Drill F, G again.
Day 6: Read DeepSeek-R1 paper (arXiv 2501.12948) cover-to-cover.
Day 7: Drill M (scenarios) + Quick fire. Whiteboard the 6-stage recipe end-to-end.

ML & LLM Interview Prep — Deep Dives