Frontier Alignment + RL — Interview Grill
130+ active-recall questions calibrated for OpenAI / DeepMind / Anthropic research-scientist rounds. Pair with REASONING_MODELS_DEEP_DIVE.md, FRONTIER_REWARD_MODELING.md, and OPEN_SOURCE_POSTTRAIN_PLAYBOOKS.md in this folder.
Answer each aloud in under 60 seconds. Mark anything unclear and re-read the relevant section.
Section A — Reasoning paradigm and test-time compute (Q1–10)
- What changed about LLM training between mid-2024 and 2025? Why is "reasoning RL" a paradigm shift?
- State Snell et al.'s test-time compute scaling claim in one sentence.
- Walk through three test-time compute strategies (best-of-N, sequential revision, search-via-verifier) and when each is best.
- Why is the compute-optimal frontier task-difficulty-dependent?
- A ~7B model given extra inference compute can match a model roughly 14× larger (~100B). What's the trick, and on which problems does it hold?
- Sketch the test-time compute scaling curve from memory (axes, log-linear region, saturation).
- What's the relationship between training-time compute and inference-time compute?
- Why does a reasoning model's per-query cost matter for product design?
- What's a "router" in a reasoning-model deployment, and why?
- Why is reasoning RL different from classical RLHF?
Section B — RLVR (Q11–22)
- Define RLVR. What makes a reward "verifiable"?
- Give five examples of verifiable rewards. Five examples of non-verifiable.
- Why is verifiable reward strictly preferred over preference reward when available?
- Sketch the RLVR objective formula with KL.
- What's a format reward? Why is it usually small relative to correctness?
- What's a language-consistency reward? Why did R1 add one?
- Why do verifiable rewards resist most reward-hacking patterns?
- What can be hacked even with verifiable rewards? (verifier exploits)
- Why is GRPO/RLOO often preferred over PPO in RLVR?
- What's the role of in RLVR?
- What happens if the success rate on the training set is <1%?
- Walk through curriculum design for RLVR.
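For the GRPO and low-success-rate questions above, a minimal sketch of group-relative advantages may help when checking an answer. It assumes binary verifier rewards and a population standard deviation within each sampling group; the function name `group_relative_advantages` and the `eps` value are illustrative, not taken from any specific codebase.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style sketch)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with mixed outcomes: correct samples get positive advantage.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every sample fails the verifier: all advantages are zero.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

When every sample in a group fails, all advantages are zero and the group contributes no policy-gradient signal, which is one way to see why a near-zero success rate stalls training.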
Section C — PRMs vs ORMs (Q23–33)
- Define PRM and ORM.
- Why are PRMs theoretically attractive for long CoT?
- Cite Lightman et al. 2023 — what dataset did OpenAI release?
- How does Math-Shepherd auto-label step correctness?
- How does OmegaPRM extend Math-Shepherd?
- What did DeepSeek-R1 conclude about PRM vs ORM?
- Two ways to use a PRM in RL training — what are they?
- Why might PRM data be noisier than ORM data?
- How can a policy hack a PRM more easily than an ORM?
- What's a generative reward model? How does it differ from a scalar RM?
- Mahan et al. 2024 — why did genRMs match scalar RMs on hard tasks?
Section D — Search + RL (Q34–43)
- Walk through STaR (Zelikman et al. 2022).
- Why does rationalization work in STaR?
- What's Quiet-STaR doing differently?
- Walk through V-STaR.
- Walk through ReST^EM.
- Compare expert iteration (Anthony et al. 2017) with ReST^EM.
- Why is MCTS hard to combine with discrete-token LM state spaces?
- Sketch how AlphaProof uses Lean for the value function.
- What does AlphaGeometry use for the verifier?
- Why is the rejection-sampling SFT pattern essentially "free signal"?
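The filter step shared by STaR, ReST^EM, and rejection-sampling SFT can be sketched as follows. This is a toy illustration under stated assumptions: `sample_fn` and `verify_fn` are hypothetical stand-ins for a policy sampler and an answer checker, and keeping one verified solution per problem is a simplification.

```python
from itertools import cycle

def rejection_sample(problems, sample_fn, verify_fn, k=4):
    """Sample up to k solutions per problem and keep the first one that verifies."""
    kept = []
    for problem in problems:
        for _ in range(k):
            solution = sample_fn(problem)
            if verify_fn(problem, solution):
                kept.append((problem, solution))
                break  # one verified solution per problem in this sketch
    return kept

# Toy stand-ins (hypothetical): the "policy" adds two numbers with a cycling error.
_offsets = cycle([1, 0, -1])

def toy_sampler(problem):
    a, b = problem
    return a + b + next(_offsets)

def toy_verifier(problem, solution):
    a, b = problem
    return solution == a + b

# Only verified (problem, solution) pairs survive to become SFT data.
sft_data = rejection_sample([(2, 3), (10, 1)], toy_sampler, toy_verifier)
```

The verifier does the labeling, so no human annotation is needed for the kept pairs; the cost is the sampling compute spent on rejected attempts.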
Section E — R1-Zero (Q44–52)
- What was the starting point of R1-Zero?
- What rewards did it use?
- What's the "aha moment"?
- Why is the aha moment surprising? (Note: the model wasn't trained on self-correction text.)
- What's R1-Zero's headline AIME score curve?
- What does R1-Zero prove about latent capability vs RL?
- Three failure modes of R1-Zero.
- Why didn't this work in 2022?
- Could you replace verifiable reward with preference data? Why or why not?
Section F — R1 full pipeline (Q53–63)
- List R1's four stages.
- What's "cold-start SFT" and why is it needed?
- What rewards are added in stage 2 vs R1-Zero?
- How big is the rejection-sampling SFT dataset in stage 3?
- What's the math/non-math split in stage 3?
- Why does stage 3 restart SFT from V3-Base instead of continuing from stage-2's weights?
- What does the final RLHF stage target?
- Why four stages instead of one?
- What's the data ratio between reasoning and chat in stage 3?
- How does R1-Distill work?
- R1-Distill-Qwen-32B beats GPT-4o on what benchmarks? Why does that matter?
Section G — Tülu 3 / Llama 3 / Qwen (Q64–73)
- What's Tülu 3's three-stage recipe?
- What's RLVR's contribution beyond standard SFT+DPO in Tülu 3?
- Why does Tülu 3 use length-normalized DPO?
- What's Llama 3's iterative SFT+DPO loop?
- Why did Meta choose DPO over PPO at 405B?
- What's "rejection-sampled SFT data" in Llama 3?
- Why is there no reasoning-RL stage in Llama 3.1?
- What does QwQ-32B's recipe look like?
- Why might PRMs help in Qwen but not in R1?
- Compare R1's recipe with Tülu 3's stage by stage.
Section H — Reward modeling (Q74–86)
- Sketch the BT loss for RM training.
- Why does scalar RM score not have absolute meaning?
- What's reward overoptimization (Gao et al. 2023)?
- Sketch the overoptimization curve. What's on each axis?
- What does it mean that "RM goes OOD as policy drifts"?
- Why does ensembling RMs help?
- Why does iterative RM refresh help?
- Why does KL penalty bound the overoptimization?
- What's RewardBench? What does it measure?
- What's RLAIF? Cite the canonical paper.
- Walk through Constitutional AI (Bai et al. 2022).
- What's a self-rewarding LM (Yuan et al. 2024)?
- Why does self-rewarding plateau without external signal?
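For the Bradley-Terry and "no absolute meaning" questions above, a minimal sketch of the pairwise loss is given here for self-checking. It assumes scalar rewards for a chosen and a rejected response; the function name `bt_loss` is illustrative.

```python
import math

def bt_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one.

    Equals -log(sigmoid(r_chosen - r_rejected)), written in a numerically
    stable softplus form.
    """
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

correct_order = bt_loss(2.0, 0.5)    # RM ranks the pair correctly: lower loss
wrong_order = bt_loss(0.5, 2.0)      # RM ranks the pair incorrectly: higher loss
shifted = bt_loss(102.0, 100.5)      # both rewards shifted by the same constant
```

The loss depends only on the difference between the two rewards, so adding a constant to every score leaves it unchanged. That shift invariance is why a raw scalar RM score carries no absolute meaning.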
Section I — Reward hacking (Q87–96)
- Define reward hacking and Goodhart's law in this context.
- List five named reward-hack patterns and one mitigation each.
- Length bias — diagnose and mitigate.
- Sycophancy — diagnose and mitigate.
- Format bias — diagnose and mitigate.
- Refusal-rate bias — diagnose and mitigate.
- Verifier hack — what is it and how do you defend?
- Prompt-injection of a genRM — what is it and how do you defend?
- How do you detect overoptimization in a production training run?
- What's the role of "held-out judge from a different family" in monitoring?
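For the length-bias question above, one simple diagnostic is the correlation between reward-model score and response length on a held-out batch. The sketch below computes a Pearson correlation by hand; the lengths and scores are made-up illustrative numbers, and any threshold for "too correlated" is a judgment call.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical held-out batch: response lengths in tokens and RM scores.
lengths = [120, 340, 560, 800, 1500]
rm_scores = [0.1, 0.3, 0.5, 0.6, 0.9]

length_corr = pearson(lengths, rm_scores)
```

A strong positive correlation alone does not prove a hack, since longer answers can be better. It becomes a warning sign when it persists after controlling for quality with a held-out judge or verifier.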
Section J — Inference-time strategies (Q97–104)
- What does self-consistency (Wang et al. 2022) do?
- Best-of-N + RM — when is this strictly better than self-consistency?
- What's MBR decoding and when is it better than best-of-N?
- What's verifier-guided beam search?
- What's compute-optimal inference allocation across difficulties?
- What temperature does R1 default to and why?
- Why does greedy decoding sometimes underperform on reasoning?
- What's a "fast/slow" routing layer in a reasoning-model deployment?
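The difference between self-consistency and best-of-N with a reward model can be sketched in a few lines. This assumes final answers have already been extracted from each sampled chain; the answers and RM scores are hypothetical.

```python
from collections import Counter

def self_consistency(answers):
    """Majority vote over final answers extracted from sampled chains."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, scores):
    """Return the answer from the sample with the highest reward-model score."""
    best_index = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best_index]

# Hypothetical samples for one prompt, with hypothetical RM scores.
samples = ["42", "41", "42", "17", "42"]
rm_scores = [0.2, 0.9, 0.3, 0.1, 0.4]

voted = self_consistency(samples)      # picks the most frequent answer
picked = best_of_n(samples, rm_scores)  # picks the top-scored sample
```

Here the two strategies disagree: the vote returns "42" while the reward model picks "41". Which one is right depends on whether the RM is more reliable than the policy's own answer distribution on that task.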
Section K — Failure modes and safety (Q105–112)
- What's overthinking? How do you mitigate?
- Why are reasoning models worse-calibrated on factual QA than on math?
- What's hallucinated reasoning? Why is it dangerous?
- What's deliberative alignment (Guan et al. 2024, OpenAI)?
- How does deliberative alignment differ from refusal training?
- Why must safety operate over the CoT, not just the answer?
- Why does Constitutional AI matter for reasoning models specifically?
- How would you red-team a reasoning model?
Section L — Open frontier questions (Q113–120)
- Can RL elicit capabilities the base model doesn't have?
- Is the inference-compute scaling law universal? When does it break?
- Should production frontier models use PRMs?
- Will multi-agent debate scale as a reward source?
- Will self-play (SPIN, Self-Rewarding) eventually plateau or keep climbing?
- What's the moat in frontier labs — weights, data, or RL infrastructure?
- How would you design RL on long-horizon agent trajectories?
- What's the role of formal verifiers (Lean, Coq) in future reasoning RL?
Section M — Senior scenario questions (Q121–130)
- Scenario. Design a 6-stage post-training pipeline for a 70B reasoning model from scratch.
- Scenario. You're seeing length blow up over RL training. What's wrong and what do you ship?
- Scenario. Your RM RewardBench score is 92% but your policy is regressing on chat-hard. Why?
- Scenario. A red-teamer demonstrates a verifier hack. How do you fix it?
- Scenario. Your reasoning model overthinks easy questions. Walk through routing + budget design.
- Scenario. You only have 5k high-quality long-CoT examples. Can you train a reasoning model? How?
- Scenario. Sketch out how you'd use an LLM-as-judge as an RL reward signal — including the fail-safes.
- Scenario. Production telemetry shows refusal rate climbing 20% over the last week with no model update. Diagnose.
- Scenario. Compare PPO + scalar RM vs DPO + iterative refresh vs GRPO + verifier — pick one for a math-only task and justify.
- Scenario. You want to distill a 70B reasoning model into a 7B. Walk through the recipe and the key knobs.
Quick fire (Q131–150)
- One line: RLVR.
- One line: GRPO vs PPO.
- One line: PRM vs ORM.
- One line: STaR vs ReST^EM.
- One line: R1-Zero vs R1.
- One line: Cold-start SFT.
- One line: Rejection-sampling SFT.
- One line: R1-Distill.
- One line: Math-Shepherd.
- One line: OmegaPRM.
- One line: Constitutional AI.
- One line: RLAIF.
- One line: Reward overoptimization.
- One line: RewardBench.
- One line: Length-normalized DPO.
- One line: Self-Rewarding LMs.
- One line: Deliberative alignment.
- One line: Test-time compute scaling.
- One line: Generative reward model.
- One line: AlphaProof.
Self-grading
- 130+ correct: ready for OpenAI / DeepMind / Anthropic research-scientist rounds.
- 100–129: re-read REASONING_MODELS §1–7, FRONTIER_REWARD_MODELING §3, §8.
- 70–99: re-read all three deep dives once more, then redo.
- <70: spend 4 days on the deep dives + read the actual R1 paper, then come back.
7-day drill plan
- Day 1: REASONING_MODELS §1–4 (paradigm, test-time, RLVR, PRM/ORM). Drill A, B, C.
- Day 2: REASONING_MODELS §5–7 (search+RL, R1-Zero, R1). Drill D, E, F.
- Day 3: REASONING_MODELS §8–14 (o1 inferences, distillation, inference, failure modes, open Qs). Drill G, J, K.
- Day 4: FRONTIER_REWARD_MODELING all sections. Drill H, I.
- Day 5: OPEN_SOURCE_POSTTRAIN_PLAYBOOKS all sections. Memorize 60-90s answers for R1, Tülu 3, Llama 3. Drill F, G again.
- Day 6: Read DeepSeek-R1 paper (arXiv 2501.12948) cover-to-cover.
- Day 7: Drill M (scenarios) + Quick fire. Whiteboard the 6-stage recipe end-to-end.