Open-Source Post-Training Playbooks — Frontier Recipes

Memorizable recipes from the actual public reports of DeepSeek R1, AllenAI Tülu 3, Meta Llama 3, Alibaba Qwen 2.5, and HuggingFace Open-R1. For frontier-lab interviews, you should be able to walk an interviewer through any of these end-to-end.

This chapter is recipe-centric: each section is a stage-by-stage walkthrough of one published frontier recipe. If asked "design a post-training pipeline for a 70B reasoning model," you sketch one of these and adapt.


Table of contents

  1. DeepSeek-R1 (Jan 2025) — full reasoning recipe
  2. Tülu 3 (AllenAI, Nov 2024) — open SOTA general-purpose post-training
  3. Llama 3 (Meta, Jul 2024) — large-scale post-training
  4. Qwen 2.5 / Qwen3 — Alibaba's recipe
  5. Open-R1 (HuggingFace community, Jan 2025+) — reproduction notes
  6. The frontier "interview cookbook" — synthesized recipe

1. DeepSeek-R1

The single most-discussed post-training paper of 2025. Memorize this.

1.1 Starting point

  • DeepSeek-V3 base — 671B MoE, 37B activated; trained on 14.8T tokens; strong math/code priors.
  • No SFT, no instruction tuning at start.

1.2 Two parallel tracks

The R1 paper actually presents two models:

Track A: R1-Zero

Pure RL from base. Goal: see how far reasoning can go with no SFT.

  • Algorithm: GRPO (group-relative; no value head).
  • Reward: correctness (verifier on math/code/logic) + format reward only.
  • Prompt template: <think>...</think><answer>...</answer>.
  • KL: penalty toward the reference policy (the base model), applied as a term in the GRPO loss rather than subtracted from the reward.
  • Group size: 16 rollouts per prompt for advantage normalization.
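
The group-relative advantage from the bullets above can be sketched in a few lines. This is a minimal illustration, not DeepSeek's code: the function name, the eps term, and the choice of population standard deviation are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against the other rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    # eps keeps the division finite when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, 16 rollouts: 4 verified correct (reward 1.0), 12 wrong (reward 0.0).
rewards = [1.0] * 4 + [0.0] * 12
advantages = group_relative_advantages(rewards)
```

No value head is involved: the group mean plays the role of the baseline. When every rollout in a group gets the same reward, all advantages are zero and that prompt contributes no policy gradient.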

Result: AIME 2024 pass@1 climbs from 15.6% to 71.0% across training; CoT length grows from <100 to >2000 tokens. The "aha moment": the model spontaneously starts to self-correct ("Let me re-check") without being explicitly trained or prompted to do so.
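
The R1 report computes pass@1 as the mean correctness over k sampled responses per question, and also reports a majority-vote score (cons@64). A small sketch of both metrics follows; the function names are hypothetical, and exact string matching stands in for a real answer checker.

```python
from collections import Counter


def pass_at_1(correct_flags):
    """Mean correctness over the k sampled responses to one question."""
    return sum(correct_flags) / len(correct_flags)


def cons_at_k(final_answers, reference):
    """Majority vote over k sampled final answers, scored against the reference."""
    majority, _count = Counter(final_answers).most_common(1)[0]
    return 1.0 if majority == reference else 0.0


# Four samples for one question whose reference answer is "42".
final_answers = ["42", "41", "42", "42"]
flags = [a == "42" for a in final_answers]
question_pass_at_1 = pass_at_1(flags)
question_cons = cons_at_k(final_answers, "42")
```

The benchmark number is the average of the per-question value over all questions.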

Issues with R1-Zero:

  • Mid-CoT language mixing (English ↔ Chinese).
  • Illegible / non-standard formatting.
  • Doesn't generalize beyond reasoning tasks; bad as a chatbot.

Track B: R1 (production model)

Multi-stage pipeline:

Stage 1: Cold-start SFT
  • Curate a small (~few thousand) high-quality long-CoT dataset.
    • Sources: few-shot long-CoT prompting, prompting for reflection and verification, readable R1-Zero outputs, and human post-processing.
  • SFT V3-base on this data.
  • Output: legible long-CoT model, weak in pure capability but with good format.
Stage 2: Reasoning-oriented RL
  • Same RLVR setup as R1-Zero: GRPO, correctness + format rewards.
  • Plus language-consistency reward: penalize CoT that switches between English/Chinese.
  • Train until convergence on math/code/logic benchmarks.
  • Output: strong reasoning capability + legible CoTs.
Stage 3: Rejection-sampling SFT (data construction)
  • Generate 600k+ samples on math/code/reasoning from Stage-2 model.
    • Filter by verifier (correct only) and judge (helpful, well-formatted).
  • Collect 200k+ samples on writing, factual QA, role-play, largely by reusing the DeepSeek-V3 SFT pipeline and data.
    • Filter by judge (helpfulness, no-harm).
  • Total: ~800k SFT examples.
Stage 4: SFT + Final RLHF
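  • SFT V3-base again on the full 800k mixed dataset. The Stage-2 weights are discarded; that model served only as a data generator.
  • Run RL across all scenarios: rule-based verifiable rewards are kept for reasoning prompts, and reward models trained on preference data cover helpfulness, safety, and persona.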

Output: R1 — reasoning capability of Stage-2 + general usefulness + safety.
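
The Stage 2 reward (correctness + format + language consistency) can be sketched as below. The R1 report describes the language-consistency term as the proportion of target-language words in the CoT; everything else here is an assumption for illustration: the weights, the exact-match correctness check, and the ASCII test used as a crude proxy for "English word".

```python
import re

# Expected layout: <think>reasoning</think><answer>final answer</answer>
TEMPLATE = re.compile(r"^<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL)


def format_reward(completion):
    return 1.0 if TEMPLATE.match(completion.strip()) else 0.0


def correctness_reward(completion, reference):
    m = TEMPLATE.match(completion.strip())
    if m is None:
        return 0.0
    # assumption: exact string match stands in for a real math/code verifier
    return 1.0 if m.group(2).strip() == reference.strip() else 0.0


def language_consistency(completion):
    """Fraction of CoT words in the target language (proxy: ASCII-only words)."""
    m = TEMPLATE.match(completion.strip())
    if m is None:
        return 0.0
    words = m.group(1).split()
    if not words:
        return 0.0
    return sum(w.isascii() for w in words) / len(words)


def total_reward(completion, reference, w_format=0.5, w_lang=0.5):
    # hypothetical weights; the report only says the terms are combined
    return (correctness_reward(completion, reference)
            + w_format * format_reward(completion)
            + w_lang * language_consistency(completion))


clean = "<think>2 + 2 is 4</think><answer>4</answer>"
clean_score = total_reward(clean, "4")
```

A CoT that switches language mid-way keeps its correctness and format credit but loses part of the language term, which is the pressure Stage 2 adds on top of the R1-Zero setup.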

1.3 R1-Distill family

  • Take ~800k generations from R1.
  • SFT on Llama / Qwen base models of various sizes (1.5B, 7B, 8B, 14B, 32B, 70B).
  • No RL.
  • Headline: R1-Distill-Qwen-32B beats GPT-4o-0513 on AIME 2024 and MATH-500 by a wide margin, and the report places it ahead of o1-mini on most reasoning benchmarks.

1.4 Why this paper matters

  • First public end-to-end frontier recipe with full numerical results.
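  • First widely cited demonstration that outcome rewards alone can suffice for reasoning RL, with no PRM needed.
  • First demonstration that distillation transfers long-CoT reasoning at scale.
  • Weights (R1, R1-Zero, and the distilled models) released openly alongside the report; the training data and training code were not, which kicked off the Open-R1 reproductions.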

1.5 Interview-ready 90-second answer

R1 is a four-stage post-training pipeline starting from DeepSeek-V3 base. Stage one is cold-start SFT on a small, high-quality long-CoT dataset to give the model a legible reasoning format. Stage two is reasoning RL with GRPO and verifiable rewards on math, code, and logic (the same recipe as R1-Zero, but starting from the SFT'd model), with an added language-consistency reward to fix R1-Zero's English/Chinese mixing. Stage three builds ~800k SFT examples, about 600k rejection-sampled reasoning traces from the stage-two model plus about 200k chat/writing/QA examples, and re-SFTs from V3-base. Stage four is a final RL pass for helpfulness and safety that keeps rule-based rewards on reasoning prompts and uses reward models elsewhere. Distillation: SFT smaller bases on R1's generations, no RL needed, and you get strong reasoning models very cheaply.


2. Tülu 3

AllenAI's fully open recipe. Released November 2024. The most reproducible frontier recipe before R1 — and the cleanest one to reference for "open-source SOTA."

2.1 Starting point

  • Llama 3.1 base (8B, 70B). Pretrained but no instruction tuning.

2.2 Stages

Stage 1: SFT
  • Tülu 3 SFT mix: open and synthetic data with explicit per-skill curation (chat, math, code, reasoning, multilingual, persona).
  • ~939k examples after deduplication / decontamination.
  • Standard cross-entropy SFT.
Stage 2: DPO
  • Tülu 3 DPO Mix: ~270k preference pairs.
    • Mostly synthetic: on-policy and off-policy completions ranked by an LLM judge (UltraFeedback-style pipeline), plus existing preference datasets.
  • Length-normalized DPO loss to combat length bias.
  • KL anchored to stage-1 SFT model.
Stage 3: RLVR (the new contribution)
  • Verifiable reward training on:
    • Math. GSM8K-style; final-answer extraction; sympy verification.
    • Precise instruction following. IFEval-style verifiable constraints (length, JSON, etc.).
    • Code-graded rewards. When applicable.
  • Algorithm: PPO with a learned value model (Tülu 3 uses PPO; open-RL libraries such as TRL and verl also support GRPO).
  • KL anchored to DPO model.
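
Tülu 3's DPO stage divides each log-probability ratio by the response length before taking the margin, so a longer response cannot win on token count alone. A minimal sketch on scalar sequence log-probabilities, assuming they are already summed over tokens; the argument names and the default beta are illustrative assumptions, not values taken from the release.

```python
import math


def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def length_normalized_dpo_loss(policy_chosen_logp, ref_chosen_logp, chosen_len,
                               policy_rejected_logp, ref_rejected_logp, rejected_len,
                               beta=5.0):
    # Per-token average log-ratio of policy vs. reference for each response.
    chosen = (policy_chosen_logp - ref_chosen_logp) / chosen_len
    rejected = (policy_rejected_logp - ref_rejected_logp) / rejected_len
    return neg_log_sigmoid(beta * (chosen - rejected))


# Policy equals reference: margin is 0, loss is log(2).
at_init = length_normalized_dpo_loss(-10.0, -10.0, 5, -20.0, -20.0, 10)
# Policy raised the chosen response and lowered the rejected one.
after_update = length_normalized_dpo_loss(-8.0, -10.0, 5, -24.0, -20.0, 10)
```

Dropping the two divisions recovers the standard DPO loss, which makes the difference easy to ablate.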

2.3 Headline

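  • 70B Tülu 3 is reported to outperform Llama 3.1 70B Instruct and to be competitive with closed models such as GPT-4o-mini and Claude 3.5 Haiku on the paper's evaluation suite.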
  • 8B Tülu 3 beats most open-source 8B models.
  • The full data, code, evals, and intermediate checkpoints are public — best teaching artifact for open-RL.

2.4 Why this matters for interviews

  • Fully reproducible. "How would you build an instruction model from scratch?" → walk through Tülu 3.
  • Demonstrates RLVR helps even without a reasoning-RL phase — verifiable rewards work for instruction-following too.
  • Shows DPO + RLVR can coexist (DPO for general preferences, RLVR for verifiable subsets).
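
Verifiable rewards for instruction following reduce to deterministic checkers, one per constraint. A hedged sketch of the idea; the checker names, the binary 1.0/0.0 reward, and the all-constraints-must-pass rule are assumptions for illustration, not the Tülu 3 implementation.

```python
import json


def max_words(limit):
    """Constraint: response has at most `limit` whitespace-separated words."""
    return lambda response: len(response.split()) <= limit


def is_valid_json(response):
    """Constraint: the whole response parses as JSON."""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False


def mentions(keyword, min_count=1):
    """Constraint: keyword appears at least `min_count` times (case-insensitive)."""
    return lambda response: response.lower().count(keyword.lower()) >= min_count


def constraint_reward(response, checks):
    # Binary reward: every constraint attached to the prompt must hold.
    return 1.0 if all(check(response) for check in checks) else 0.0


checks = [is_valid_json, max_words(20), mentions("grpo")]
good = '{"summary": "GRPO drops the value model."}'
good_reward = constraint_reward(good, checks)
```

Because each checker is a pure function of the response, the same code can serve as the training reward and as the evaluation metric.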

2.5 Interview answer (60-second)

Tülu 3 is AllenAI's open-source post-training recipe. Three stages from Llama 3.1 base: SFT on a curated mix of ~939k examples, DPO on ~270k preference pairs with a length-normalized loss, and RLVR with PPO on math final-answer verification and IFEval-style verifiable instruction-following constraints. The key contribution beyond standard SFT+DPO is RLVR: using verifiable rewards on a subset of skills where you can write a deterministic checker, even outside math/code. Headline is that 70B Tülu 3 outperforms Llama 3.1 70B Instruct and is competitive with GPT-4o-mini and Claude 3.5 Haiku, with a fully reproducible pipeline.


3. Llama 3

Meta's recipe published in The Llama 3 Herd of Models (Jul 2024). The most thorough public industrial-scale post-training description as of mid-2024. No reasoning-RL stage (Llama 3.1 family, as of writing); strong on multi-round SFT + DPO.

3.1 Starting point

  • Llama 3.1 base (8B, 70B, 405B). Pretrained on 15.6T tokens.

3.2 Stages

Stage 1: SFT data construction
  • Rejection sampling. Sample many outputs from the previous (or initial) model; filter by RM; keep best.
  • Per-capability mixes. Separate data pipelines for code, math, reasoning, multilingual, tool-use, dialogue.
  • Iterate over 6+ rounds, each refining data quality.
Stage 2: SFT
  • Standard cross-entropy on the rejection-sampled data.
  • Run for multiple epochs.
Stage 3: DPO
  • DPO on preference pairs constructed from the rejection-sampled data and from human labelers.
  • Per-capability fine-tuning loops for safety, code, reasoning, etc.
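Stage 4: Tool-use training
  • Covers a search engine, a Python interpreter, and a mathematical computation engine (Wolfram Alpha).
  • The model learns to invoke tools, parse results, and integrate them. This is taught through SFT and preference data, not a separate RL stage.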
Stage 5: Adversarial safety
  • Red-team-generated adversarial prompts.
  • DPO on safe-vs-unsafe pairs.
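
The rejection-sampling step in Stage 1 is best-of-N selection under a reward model. A minimal sketch, assuming `generate` and `score` are caller-supplied callables (both hypothetical stand-ins for the policy and the RM); the toy versions at the bottom exist only so the example runs.

```python
import itertools


def rejection_sample(prompts, generate, score, n=8):
    """For each prompt, draw n candidates and keep the highest-scoring one."""
    dataset = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        best = max(candidates, key=lambda c: score(prompt, c))
        dataset.append({"prompt": prompt, "response": best})
    return dataset


# Toy stand-ins: a fixed pool of "samples" and a length-based "reward model".
_pool = itertools.cycle(["short", "a longer answer", "mid answer"])


def toy_generate(prompt):
    return next(_pool)


def toy_score(prompt, response):
    return len(response)


sft_data = rejection_sample(["Explain KL."], toy_generate, toy_score, n=3)
```

The quality of the resulting SFT set is bounded by the reward model: anything the RM over-rewards, such as length in this toy, gets baked into the next round.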

3.3 Why no PPO?

Meta explicitly chose DPO over PPO for cost / stability reasons at scale. With careful preference-pair construction (rejection-sampling with RM), DPO is competitive with PPO and far cheaper.

3.4 Iterative DPO

The recipe is iterated: improved model → better rejection-sampled SFT data → next-round SFT → next-round DPO. Multiple rounds. This is iterative DPO / online DPO, not PPO.
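
The round structure can be sketched as a loop in which each round's model produces the data for its own SFT and DPO updates. All callables here (`sample`, `score`, `sft`, `dpo`) are hypothetical placeholders, and pairing best-vs-worst by RM score is an assumption of this sketch; the source also mentions pairs from human labelers, which this loop does not model.

```python
def best_and_worst(prompt, candidates, score):
    """Return (highest-scoring, lowest-scoring) candidate for one prompt."""
    ranked = sorted(candidates, key=lambda c: score(prompt, c))
    return ranked[-1], ranked[0]


def run_rounds(model, prompts, sample, score, sft, dpo, rounds=6, n=4):
    for _ in range(rounds):
        sft_data, pairs = [], []
        for prompt in prompts:
            candidates = [sample(model, prompt) for _ in range(n)]
            chosen, rejected = best_and_worst(prompt, candidates, score)
            sft_data.append((prompt, chosen))        # rejection-sampled SFT example
            if chosen != rejected:
                pairs.append((prompt, chosen, rejected))  # preference pair for DPO
        # The improved model generates the next round's data.
        model = dpo(sft(model, sft_data), pairs)
    return model
```

Each round is offline once the data is sampled, which is what keeps the cost below PPO: no rollouts happen inside the optimizer loop.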

3.5 Interview answer (60-second)

Llama 3 is Meta's iterative SFT + DPO recipe. They construct SFT data via rejection sampling (sample many outputs, score with a reward model, keep the best) and iterate this for 6+ rounds, each round improving the model's outputs, which in turn improve the next round's data. Per-capability SFT pipelines for code, math, reasoning, multilingual, tool-use, and dialogue. After SFT, DPO with carefully constructed pairs, then tool-use training and adversarial safety tuning with DPO on safe-vs-unsafe pairs. They explicitly chose DPO over PPO for stability and cost at the 405B scale. There is no reasoning-RL stage in 3.1.


4. Qwen 2.5 / Qwen3

Alibaba's recipe. Most detailed in the Qwen2.5 and Qwen2.5-Math technical reports; QwQ-32B is documented mainly through release blog posts.

4.1 Qwen 2.5 base

  • Trained on 18T tokens.
  • Math/code-heavy mix.

4.2 Qwen 2.5 Math (specialized)

  • SFT on curated math reasoning data (200k high-quality CoTs).
  • DPO on math preference pairs.
  • RL with GRPO, using a math reward model combined with a rule-based answer verifier.
  • PRM evaluated; included in re-ranking but not in the RL reward (after experiments showed it didn't help training).
  • Tool-integrated reasoning: the model can invoke a Python interpreter mid-CoT for exact computation.
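
Tool-integrated reasoning means the CoT contains tool calls whose results are spliced back into the text before generation continues. A hedged sketch follows: the `<calc>...</calc>` tag is a hypothetical call format, not Qwen's, and a restricted arithmetic evaluator stands in for a real sandboxed interpreter.

```python
import ast
import operator
import re

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def safe_eval(expr):
    """Evaluate +, -, *, / on numeric literals only; reject everything else."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)


CALL = re.compile(r"<calc>(.*?)</calc>")  # hypothetical tool-call tag


def run_tool_calls(cot):
    """Replace each tool call in the CoT with 'expression = result'."""
    return CALL.sub(lambda m: f"{m.group(1)} = {safe_eval(m.group(1))}", cot)


resolved = run_tool_calls("Total cost: <calc>12 * (3 + 4)</calc>.")
```

Walking the AST instead of calling `eval` is the design point: model-written text is untrusted input, so only whitelisted node types are ever executed.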

4.3 QwQ-32B (preview Nov 2024, release Mar 2025) and Qwen3-thinking models

  • Cold-start SFT on long-CoT data (similar in spirit to R1).
  • RLVR for reasoning.
  • Final RLHF.

The Qwen team's published ablations report that PRMs helped modestly in some experiments but the canonical recipe used ORM only — agreeing with R1's experience.

4.4 Why interesting

  • Strong open-weights reasoning models (QwQ, Qwen3-thinking).
  • Tool-integrated reasoning (calculator + Python via execution).
  • Per-domain specialization recipes (Qwen2.5-Math, Qwen2.5-Coder).

5. Open-R1 and community reproductions

Following R1's release, HuggingFace launched Open-R1 (Jan 2025+) — community reproduction of the R1 recipe.

5.1 What's open-source

  • R1's weights are public (including R1-Zero and the distilled models); its training code is not.
  • R1's exact training data is not public.
  • Open-R1 is reproducing the data + recipe with open data sources.

5.2 Variants

  • OpenThinker (Stanford / Bespoke). Apply R1-distill methodology to open models with open data. Strong AIME numbers.
  • Bespoke-Stratos. Distillation reproduction at scale.
  • Sky-T1. Open reasoning model from Berkeley.
  • Tülu 3 RLVR variants. Add longer CoT to Tülu's RLVR phase.

5.3 Lessons from reproductions

  • R1's pipeline reproduces well at smaller scales.
  • Distillation works robustly: SFT on long-CoT generations transfers reasoning.
  • Pure-RL from a non-frontier base is hard — R1-Zero needed V3-base's quality. Open reproductions of pure-RL on Llama 3 base get smaller gains.

5.4 Implications

  • The data + RL infrastructure is the moat, not weights.
  • Open-source is ~3-6 months behind frontier, closing fast on reasoning.

6. The "interview cookbook" — synthesized recipe

If asked "design a post-training pipeline for a frontier reasoning model from scratch", answer with the staged recipe below, Stage 0 through Stage 7 (synthesizing R1, Tülu 3, Llama 3 best practices):

Stage 0: Pretraining priors

  • Strong math / code / reasoning fraction in pretraining.
  • High-quality long-CoT documents in pretraining (textbooks, math papers, code with comments, problem-solution pairs).
  • This sets the ceiling of what RL can elicit.

Stage 1: Cold-start SFT (legibility + format)

  • Curate ~5-50k high-quality long-CoT examples.
    • Sources: human-written, generated by an existing reasoning model (rejection-sampled), manually cleaned.
  • SFT base on this.
  • Output: a model that knows the long-CoT format and can produce legible reasoning.

Stage 2: Reasoning-RL (capability)

  • Verifiable problem set: math (GSM8K, MATH, AIME-level), code (HumanEval, LiveCodeBench), logic (LogiQA-style).
  • Algorithm: GRPO (or RLOO / REINFORCE++).
  • Reward: correctness + format + language-consistency. No PRM (start simple; add later if helpful).
  • KL to Stage-1 SFT.
  • Train until reasoning saturates.
  • Output: strong reasoning, possibly weak chat.

Stage 3: Rejection-sampling SFT data

  • Generate ~500k-1M outputs from Stage-2 model.
  • Filter:
    • Verifiable subset (math, code): keep correct.
    • Non-verifiable subset (writing, factual QA, persona): filter by judge (genRM).
  • Mix: ~70% reasoning, ~30% chat.
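
The filter-then-mix step can be sketched as follows. The sample schema (a dict with a "verifiable" flag), the `verify` and `judge` callables, and the judge threshold are all hypothetical; only the 70/30 split comes from the text above.

```python
import random


def filter_samples(samples, verify, judge, judge_threshold=0.5):
    """Keep verified-correct reasoning samples and judge-approved chat samples."""
    kept = {"reasoning": [], "chat": []}
    for s in samples:
        if s["verifiable"]:
            if verify(s):
                kept["reasoning"].append(s)
        elif judge(s) >= judge_threshold:
            kept["chat"].append(s)
    return kept


def mix(kept, total, reasoning_frac=0.7, seed=0):
    """Draw a shuffled SFT set with roughly reasoning_frac reasoning samples."""
    rng = random.Random(seed)
    n_reason = min(len(kept["reasoning"]), round(total * reasoning_frac))
    n_chat = min(len(kept["chat"]), total - n_reason)
    out = rng.sample(kept["reasoning"], n_reason) + rng.sample(kept["chat"], n_chat)
    rng.shuffle(out)
    return out


samples = (
    [{"verifiable": True, "correct": i < 10} for i in range(20)]
    + [{"verifiable": False, "judge_score": 0.9 if i < 5 else 0.1} for i in range(10)]
)
kept = filter_samples(samples, verify=lambda s: s["correct"],
                      judge=lambda s: s["judge_score"])
sft_mix = mix(kept, total=10)
```

If one pool runs short, this sketch returns fewer than `total` samples rather than silently changing the ratio; a production pipeline would likely generate more data instead.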

Stage 4: SFT (broaden distribution)

  • Re-SFT base on the Stage-3 mix.
  • Output: reasoning + chat capable, but possibly weaker on safety / persona.

Stage 5: Final RLHF (helpfulness + safety)

  • Preference data: human + AI-labeled (RLAIF) + Constitutional revisions.
  • DPO or PPO with KL to Stage-4 SFT.
  • Targets: helpfulness, harmlessness, persona, refusal calibration.
  • Iterate 1-2 rounds.

Stage 6: Distillation (productization)

  • Sample ~1M generations from Stage-5 model.
  • SFT smaller bases.
  • Optionally: light DPO on top.

Stage 7: Production

  • Deploy reasoning model with test-time-compute knobs.
  • Routing layer: cheap LM detects when reasoning is needed.
  • Online telemetry, RM calibration tracking, safety monitoring.

What you can additionally mention to impress

  • PRMs / process supervision as an option, with the caveat that R1 reported ORM was sufficient.
  • MCTS at training data generation (AlphaProof-style) for math.
  • Generative reward models for non-verifiable subsets.
  • Deliberative alignment for reasoning-aware safety.
  • Iterative DPO as a cheaper alternative to PPO.
  • Test-time compute scaling (Snell et al.): extra inference compute can substitute for a larger model on easy-to-medium problems, with smaller returns on the hardest ones.
  • Multi-objective reward (Pareto / MORLHF) for helpful-honest-harmless tradeoffs.

How to use this chapter

  1. Memorize §1.5 (R1 90-second), §2.5 (Tülu 3 60-second), §3.5 (Llama 3 60-second). These are oral exam answers.
  2. For an open-ended "design from scratch" question, give the §6 synthesized recipe.
  3. Be ready for follow-ups on any single stage (data construction, KL choice, GRPO vs PPO, why ORM not PRM, etc.).
  4. Read the actual papers (R1, Tülu 3, Llama 3) once for ground truth.

Single sentence to remember: R1 is the canonical 4-stage reasoning recipe (cold-start SFT → reasoning-RL → rejection-sampling SFT → final RLHF); Tülu 3 is the canonical open SFT+DPO+RLVR recipe; Llama 3 is the canonical iterative-SFT+DPO recipe at 405B scale; combine the best of all three for any frontier interview "design a pipeline" question.