Topic 8: Training Techniques (RLHF, DPO, PPO, GRPO)

🔥 For interviews, read these first:

ALIGNMENT_DEEP_DIVE.md — frontier-lab interview deep dive: full RLHF math, Bradley-Terry preference model, complete DPO derivation (whiteboard-ready), the alphabet soup (IPO/KTO/ORPO/SimPO/GRPO), Constitutional AI, RLAIF, process vs outcome supervision, reward hacking, KL blowup, mode collapse, sycophancy, alignment tax, Goodhart curves.

INTERVIEW_GRILL.md — 60 active-recall questions with strong answers covering the full post-training stack. Drill until you can answer 40+ cold.

The README below is the conceptual overview. The two files above are where the interview-grade depth lives.

What You'll Learn

This topic teaches you advanced LLM training techniques:

RLHF (Reinforcement Learning from Human Feedback)
DPO (Direct Preference Optimization)
PPO (Proximal Policy Optimization)
GRPO (Group Relative Policy Optimization)
Theory and implementations

Why We Need This

Interview Importance

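Hot topic: RLHF and DPO are core post-training methods and frequent interview topics
Understanding: Shows deep LLM knowledge
Implementation: Interviewers may ask you to implement a loss such as DPO or PPO clipping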

Real-World Application

ChatGPT training: Uses RLHF
Model alignment: Make models helpful and harmless
Preference learning: Learn from human preferences

Industry Use Cases

1. RLHF

Use Case: ChatGPT, Claude

Align models with human preferences
Make models helpful and safe
Improve response quality

2. DPO

Use Case: Modern LLM training

Simpler than PPO-based RLHF
Optimizes directly on preference pairs
No separate reward model needed (the reward is implicit in the policy/reference log ratio)

3. PPO

Use Case: Reinforcement learning

Stable policy updates
Used in RLHF pipeline
General RL algorithm

Core Intuition

These training techniques exist because next-token prediction alone does not fully capture desired assistant behavior.

A base language model can be:

fluent
knowledgeable
still not aligned with user preferences

Alignment methods try to push the model toward preferred behavior without letting it drift arbitrarily far from a useful reference policy.
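
The trade-off described above is usually written as a KL-regularized objective. The following is a sketch in standard notation, not a formula taken from this repo: r(x, y) is a reward signal, π_ref is the reference policy, and β is the penalty strength (the symbols are assumptions chosen to match the π_θ / π_ref notation used later in this README).

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r(x, y) \,\big]
\;-\;
\beta\, \mathrm{KL}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```

A larger β keeps the policy closer to the reference; a smaller β lets the reward term dominate.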

RLHF

RLHF breaks the problem into stages:

supervised fine-tuning on demonstrations
reward modeling from preference data
policy optimization using the reward signal

The intuition is:

first learn how to answer at all
then learn what humans prefer
then optimize behavior against that preference signal
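
The reward-modeling stage is commonly trained with a pairwise Bradley-Terry loss: the reward model should score the chosen response above the rejected one. The sketch below assumes scalar scores per response; `reward_model_loss` and the toy numbers are hypothetical names and values for illustration, not code from this repo.

```python
import torch
import torch.nn.functional as F


def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)."""
    # The loss shrinks as the chosen score exceeds the rejected score.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()


# Toy scores for three preference pairs (made-up numbers).
chosen = torch.tensor([2.0, 0.5, 1.0])
rejected = torch.tensor([0.0, 0.5, 3.0])
print(reward_model_loss(chosen, rejected).item())
```

When both scores are equal the per-pair loss is log 2, which is a useful sanity check for an untrained reward model.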

DPO

DPO skips the explicit reward-model-plus-RL loop and directly optimizes relative preference between chosen and rejected responses.

That makes it easier to train and reason about in many settings.
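
For reference, the DPO loss in standard notation is sketched below, where y_w is the chosen response, y_l is the rejected response, and σ is the logistic sigmoid. The symbols are assumptions chosen to match the π_θ / π_ref notation used elsewhere in this README.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
= -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
\;-\;
\beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)
\right]
```

Each term β log(π_θ / π_ref) plays the role of an implicit reward, which is why no separate reward model is trained.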

PPO

PPO is mainly about making policy updates stable.

Its clipping mechanism is trying to stop the new policy from moving too far in one step.

That is why PPO is used for the policy-optimization stage of RLHF: RLHF is the overall pipeline, and PPO is the update rule inside it.
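
The clipping mechanism described above can be sketched in a few lines. This is a minimal illustration, assuming per-sample log probabilities under the new policy and under the policy that generated the data; `ppo_clip_loss` and `clip_eps` are hypothetical names, not the API of ppo.py.

```python
import math

import torch


def ppo_clip_loss(new_logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped surrogate loss (negated so that it can be minimized)."""
    # Probability ratio between the new policy and the data-collecting policy.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise minimum removes any incentive to push the
    # ratio outside [1 - eps, 1 + eps] in the direction that helps.
    return -torch.min(unclipped, clipped).mean()


# Both samples have ratio 1.5; only the sign of the advantage differs.
new_lp = torch.tensor([math.log(1.5), math.log(1.5)], requires_grad=True)
old_lp = torch.zeros(2)
adv = torch.tensor([1.0, -1.0])
ppo_clip_loss(new_lp, old_lp, adv).backward()
print(new_lp.grad)
```

In this example the first sample has already moved past the clip range in the rewarded direction, so it contributes no gradient; the second sample has moved in the penalized direction, so its gradient is kept.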

Technical Details Interviewers Often Want

Why a Reference Model Matters

Without a reference or KL-style constraint, the policy can drift too far:

exploit reward model quirks
become unstable
collapse into weird outputs

The reference model acts like an anchor.
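
One common way to apply this anchor in PPO-based RLHF code is to fold a per-token KL penalty into the reward and add the reward-model score on the final token. The sketch below assumes that convention; it is not necessarily what ppo_complete.py does, and `kl_shaped_rewards` is a hypothetical helper name.

```python
def kl_shaped_rewards(policy_logprobs, reference_logprobs, score, beta=0.1):
    """
    Per-token rewards for one response.

    Each token is penalized by beta * (log pi - log pi_ref); the scalar
    reward-model `score` is added to the last token only.
    """
    rewards = [
        -beta * (p - r)
        for p, r in zip(policy_logprobs, reference_logprobs)
    ]
    rewards[-1] += score
    return rewards


# Toy numbers: the policy is more confident than the reference on token 2.
policy_lp = [-1.0, -0.2, -0.5]
reference_lp = [-1.0, -1.2, -0.5]
print(kl_shaped_rewards(policy_lp, reference_lp, score=2.0, beta=0.1))
```

Tokens where the policy assigns more probability than the reference receive a negative reward, so drifting away from the reference has a cost that the reward-model score must outweigh.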

Why Reward Models Are Risky

A reward model is only an approximation of human preference.

That means the policy can learn to:

game the reward
sound good without being correct
optimize style more than truth

This is one of the most important conceptual follow-ups in alignment interviews.

Why DPO Is Attractive

DPO is appealing because it:

avoids explicit on-policy RL optimization
is simpler to implement and train
often works well with preference pairs directly

But it is not "strictly better" in all cases. It just changes the optimization setup.

Common Failure Modes

reward hacking
over-optimizing preference style while hurting factuality
too-weak KL control causing drift
too-strong KL control preventing meaningful improvement
claiming DPO and RLHF are identical when they are not

Edge Cases and Follow-Up Questions

Why can preference optimization hurt factual accuracy?
Why do we need a reference model or KL penalty?
Why might DPO be simpler than PPO-based RLHF?
What happens if the reward model is misspecified?
Why is SFT still needed before preference optimization in many pipelines?

What to Practice Saying Out Loud

The three-stage RLHF pipeline
The conceptual difference between SFT, reward modeling, and PPO
Why DPO is simpler but not universally better
Why alignment metrics must be paired with truthfulness and robustness checks

Industry-Standard Boilerplate Code

RLHF (Simplified)

"""
RLHF: Reinforcement Learning from Human Feedback
Simplified implementation
"""
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """
    Reward Model: Predicts how good a response is
    Trained on human preferences
    """
    def __init__(self, model_dim: int):
        super().__init__()
        self.linear = nn.Linear(model_dim, 1)
    
    def forward(self, hidden_states):
        return self.linear(hidden_states).squeeze(-1)

def rlhf_loss(policy_logprobs: torch.Tensor,
              reference_logprobs: torch.Tensor,
              rewards: torch.Tensor,
              beta: float = 0.1) -> torch.Tensor:
    """
    Simplified RLHF policy loss: REINFORCE with a KL-shaped reward.

    This is not full PPO: there is no clipping, no old-policy ratio,
    and no value baseline. It assumes the responses were sampled from
    the current policy.

    Args:
        policy_logprobs: Sequence log probabilities from the current policy
        reference_logprobs: Sequence log probabilities from the frozen reference model
        rewards: Scores from the reward model, one per response
        beta: KL penalty coefficient
    """
    # Log ratio: per-sample estimate of KL(policy || reference)
    log_ratio = policy_logprobs - reference_logprobs

    # Fold the KL penalty into the reward; detach so it acts as a constant signal
    shaped_rewards = (rewards - beta * log_ratio).detach()

    # Policy gradient term: raise log-probs of responses with high shaped reward
    loss = -(policy_logprobs * shaped_rewards)

    return loss.mean()

DPO (Direct Preference Optimization)

"""
DPO: Direct Preference Optimization
Simpler alternative to PPO-based RLHF
No separate reward model needed
"""
import torch
import torch.nn.functional as F

def dpo_loss(policy_logprobs_chosen: torch.Tensor,
             policy_logprobs_rejected: torch.Tensor,
             reference_logprobs_chosen: torch.Tensor,
             reference_logprobs_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """
    DPO Loss

    Directly optimizes the policy to prefer chosen over rejected responses.
    No separate reward model is needed: the implicit reward is
    beta * (policy log prob - reference log prob).

    Args:
        policy_logprobs_chosen: Summed log probs of the chosen response (from policy)
        policy_logprobs_rejected: Summed log probs of the rejected response (from policy)
        reference_logprobs_chosen: Summed log probs of chosen (from frozen reference)
        reference_logprobs_rejected: Summed log probs of rejected (from frozen reference)
        beta: Strength of the implicit KL constraint; larger values keep the
            policy closer to the reference
    """
    # Log ratio for chosen
    log_ratio_chosen = policy_logprobs_chosen - reference_logprobs_chosen

    # Log ratio for rejected
    log_ratio_rejected = policy_logprobs_rejected - reference_logprobs_rejected

    # DPO objective: widen the margin between chosen and rejected log ratios.
    # The reference terms keep the policy anchored; there is no separate KL term.
    loss = -F.logsigmoid(
        beta * (log_ratio_chosen - log_ratio_rejected)
    )
    
    return loss.mean()
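
The DPO loss above takes one log probability per response. A minimal sketch of how such a value can be computed from model logits follows. It assumes the logits are already aligned with the labels (position t predicts label t) and that a mask marks response tokens; `sequence_logprob` is a hypothetical helper, not part of rlhf_dpo.py.

```python
import torch
import torch.nn.functional as F


def sequence_logprob(logits: torch.Tensor,
                     labels: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """
    Sum of token log probabilities over the response tokens.

    logits: (batch, seq_len, vocab), aligned so position t predicts labels[:, t]
    labels: (batch, seq_len) token ids
    mask:   (batch, seq_len) 1.0 for response tokens, 0.0 for prompt or padding
    """
    logprobs = F.log_softmax(logits, dim=-1)
    # Pick the log probability assigned to each observed token.
    token_logprobs = torch.gather(
        logprobs, dim=-1, index=labels.unsqueeze(-1)
    ).squeeze(-1)
    return (token_logprobs * mask).sum(dim=-1)


# Toy shapes: batch of 2, sequence length 3, vocabulary of 5.
logits = torch.randn(2, 3, 5)
labels = torch.tensor([[0, 1, 2], [3, 4, 0]])
mask = torch.tensor([[0.0, 1.0, 1.0], [1.0, 1.0, 1.0]])
print(sequence_logprob(logits, labels, mask).shape)
```

The same helper would be called four times per batch (policy and reference, chosen and rejected), with the reference calls wrapped in `torch.no_grad()`.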

PPO (Proximal Policy Optimization)

Detailed Implementation: See ppo.py for basic implementation and ppo_complete.py for complete version with all four models.

Complete Guide: See ppo_models_detailed.md for comprehensive explanation of all four models:

Policy Model (π_θ): Generates responses, being optimized
Critic Model (V_φ): Estimates state values, which are used to compute advantages
Reference Model (π_ref): Frozen, provides KL penalty
Reward Model (r_ψ): Scores responses, provides learning signal

Key Concepts:

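Clipping: Prevents large policy updates in a single step
Importance Sampling: The ratio π_θ / π_old lets PPO take several gradient steps on each batch of rollouts
Advantages: How much better an action is than the critic's expected value
Why used in RLHF: Stable updates and reasonable sample efficiency for a policy-gradient method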
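
Advantages are often computed with Generalized Advantage Estimation (GAE). The sketch below assumes GAE with per-step rewards and critic values for a single trajectory; whether ppo.py uses GAE is not stated here, and `compute_gae`, `gamma`, and `lam` are hypothetical names.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """
    Generalized Advantage Estimation for one trajectory.

    rewards: list of length T
    values:  list of length T + 1 (the last entry is the bootstrap value,
             0.0 if the episode ended)
    Returns (advantages, returns), each of length T.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Work backwards so each step can reuse the accumulated future term.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # Return targets for the value loss: advantage plus the value baseline.
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns


advantages, returns = compute_gae(
    rewards=[0.0, 0.0, 1.0],
    values=[0.2, 0.4, 0.6, 0.0],
)
print(advantages, returns)
```

Setting `lam=0` reduces each advantage to the one-step TD error, while `lam=1` with `gamma=1` gives the full reward-to-go minus the critic's value.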

Mathematical Details:

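Probability ratio: r(θ) = π_θ(a|s) / π_old(a|s), where π_old is the policy that generated the data
Surrogate gradient: ∇_θ E[r(θ) A] = E[r(θ) * A * ∇_θ log π_θ(a|s)]
PPO clipping: L^CLIP = E[min(r(θ)A, clip(r(θ), 1-ε, 1+ε)A)] (maximized)
Value loss: L^VF = E[(V_φ(s) - R)^2], where R is the return target
KL penalty: β * KL(π_θ || π_ref)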

GRPO (Group Relative Policy Optimization)

Detailed Implementation: See grpo.py for complete implementation with:

Group-based optimization
Relative rewards
Multi-group handling

Key Concepts:

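Group sampling: Sample several responses per prompt; the "group" is that set of responses, not a set of users
Relative advantages: Score each response against its group's mean reward, often normalized by the group's standard deviation
No critic: The group mean serves as the baseline, so no separate value model is trained
Use case: Reasoning-style tasks where rewards are cheap to compute and easy to check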
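
In the GRPO formulation introduced with DeepSeekMath, the group is a set of responses sampled for the same prompt, and each response's advantage is its reward standardized within that group. The sketch below shows only that advantage step; `group_relative_advantages` is a hypothetical helper, and the use of the population standard deviation with a small `eps` is an assumption.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """
    Standardize rewards within one group of responses to the same prompt.

    rewards: list of scalar rewards, one per sampled response
    Returns a list of advantages with (approximately) zero mean.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps guards against division by zero when all rewards are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt; 1.0 means the answer was correct.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

These advantages then replace the critic-based advantages in a PPO-style clipped objective. If every response in a group gets the same reward, all advantages are zero and that prompt contributes no gradient.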

Theory

RLHF Pipeline

Supervised Fine-tuning: Train on human demonstrations
Reward Model: Train on human preferences
RL Optimization: Use PPO to optimize policy with reward model

DPO vs RLHF

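RLHF: Needs a reward model and on-policy sampling, so it has more moving parts
DPO: No separate reward model, simpler, direct optimization on preference pairs
Trade-off: DPO is simpler, but RLHF is more flexible because the reward model can score fresh on-policy samples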

Code Files

rlhf_dpo.py: RLHF and DPO loss implementations
ppo.py: Basic PPO implementation with detailed explanations
ppo_complete.py: Complete PPO with all four models (Policy, Critic, Reference, Reward)
ppo_models_detailed.md: Comprehensive guide explaining all four models, their roles, and mathematical details
ppo_process_explanation.md: NEW - Complete paragraph-style explanations of PPO, GRPO, and DPO processes for interviews
rlhf_pipeline_explanation.md: NEW - Complete paragraph-style explanation of the three-stage RLHF pipeline
grpo.py: GRPO implementation for group-based optimization
rl_alignment_qa.md: Detailed interview Q&A on RL alignment

Exercises

Implement DPO loss
Compare RLHF vs DPO
Implement PPO clipping
Test GRPO on groups of sampled responses per prompt
Understand KL penalty role

Next Steps

Topic 9: Sampling techniques
Topic 10: Optimizers

ML & LLM Interview Prep — Deep Dives