RL Alignment Interview Q&A: Detailed Answers

Q1: Explain the RLHF (Reinforcement Learning from Human Feedback) pipeline in detail.

Answer:

RLHF is a three-stage process used to align language models with human preferences. Here's the detailed pipeline:

Stage 1: Supervised Fine-Tuning (SFT)

Purpose: Create a baseline model that can follow instructions
Data: Human-written demonstrations (prompt-response pairs)
Training: Standard supervised learning (cross-entropy loss)
Result: Model that can generate reasonable responses but may not align with human preferences

Stage 2: Reward Model Training

Purpose: Learn a function that scores how good a response is
Data: Human preference comparisons (chosen response vs rejected response)
Training: Binary classification - learn to rank chosen > rejected
Loss: Binary cross-entropy on preference pairs
Result: Reward model r(x, y) that scores response quality

Mathematical Formulation:

P(y_w > y_l | x) = σ(r(x, y_w) - r(x, y_l))

Where:
- y_w: Winning (chosen) response
- y_l: Losing (rejected) response
- σ: Sigmoid function

Stage 3: RL Optimization (PPO)

Purpose: Optimize policy to maximize reward while staying close to reference
Algorithm: PPO (Proximal Policy Optimization)
Objective: Maximize E[r(x, y)] - β * KL(π_θ || π_ref)
Result: Aligned model that generates preferred responses

Why this works:

SFT gives model capability
Reward model captures human preferences
RL optimization aligns model with preferences

Challenges:

Need large amounts of human feedback
Reward model may have biases
RL optimization can be unstable
Cost: Expensive to collect human preferences

Q2: How does DPO differ from RLHF? When would you use each?

Answer:

DPO (Direct Preference Optimization):

Key Difference:

RLHF: Needs separate reward model, uses RL (PPO) to optimize
DPO: No reward model, directly optimizes policy on preferences

DPO Mathematical Formulation:

L_DPO = -log σ(β * (log π_θ(y_w|x) - log π_θ(y_l|x) - log π_ref(y_w|x) + log π_ref(y_l|x)))

Where:
- y_w: Chosen response
- y_l: Rejected response
- π_θ: Current policy
- π_ref: Reference policy (frozen)
- β: Temperature parameter

How DPO Works:

Uses reference model instead of reward model
Directly optimizes policy to prefer chosen over rejected
KL penalty prevents deviation from reference
No RL needed - just supervised learning on preferences

Comparison:

Aspect	RLHF	DPO
Reward Model	Required	Not needed
Optimization	RL (PPO)	Supervised learning
Complexity	High (3 stages)	Lower (2 stages)
Flexibility	Can use any reward	Limited to preferences
Stability	Can be unstable	More stable
Data Needs	Preference + demonstrations	Just preferences

When to Use RLHF:

Need flexible reward shaping
Have complex reward structure
Want to iterate on reward model
Have resources for complex pipeline

When to Use DPO:

Want simpler pipeline
Have preference data but no demonstrations
Need faster training
Want more stable optimization

Trade-off:

DPO is simpler but less flexible
RLHF is more complex but more powerful

Q3: Explain PPO (Proximal Policy Optimization) in detail. Why is it used in RLHF?

Answer:

What is PPO? PPO is a policy gradient algorithm that prevents large policy updates by clipping the objective function.

The Four Models in PPO/RLHF:

1. Policy Model (π_θ):

Generates responses/actions
Outputs probability distribution: π_θ(a|s)
Being optimized during training
Used for: generation, policy gradient computation

2. Critic Model (V_φ):

Estimates state value: V_φ(s) = E[R | s]
Predicts expected future return
Used for: advantage computation (A = Q - V), baseline for variance reduction
Trained with: value loss L^VF = (V_φ(s) - R)^2

3. Reference Model (π_ref):

Frozen copy of policy before RL training
Typically the SFT (Supervised Fine-Tuned) model
Used for: KL penalty computation, importance sampling ratio
Mathematical role: KL(π_θ || π_ref) = E[log(π_θ/π_ref)]

4. Reward Model (r_ψ):

Scores responses: r_ψ(x, y)
Trained on human preferences before RL
Used for: computing rewards during RL training
Typically frozen during RL (can be updated)

Mathematical Formulation:

Standard Policy Gradient:

L_PG = E[r(θ) * A]

Where:
- r(θ) = π_θ(a|s) / π_θ_old(a|s) (importance sampling ratio)
- A: Advantage estimate

Problem with Standard PG:

Large updates can destabilize training
Policy can change too quickly
Can lead to poor performance

PPO Solution - Clipped Objective:

L^CLIP(θ) = E[min(r(θ)A, clip(r(θ), 1-ε, 1+ε)A)]

Where:
- ε: Clipping parameter (typically 0.1-0.3)
- clip(r(θ), 1-ε, 1+ε): Clips ratio to [1-ε, 1+ε]
- min: Takes pessimistic estimate

Why Clipping Works:

Prevents large updates: Ratio is clipped, so updates are bounded
Pessimistic: Taking minimum prevents over-optimization
Stable: Policy changes gradually
Sample efficient: Can use same data multiple times

PPO Algorithm:

1. Collect trajectories with current policy
2. Compute advantages A(s,a)
3. For K epochs:
   a. Compute r(θ) = π_θ(a|s) / π_θ_old(a|s)
   b. Compute clipped objective
   c. Update policy
4. Update old policy: π_θ_old = π_θ

Why PPO in RLHF:

Stability: Language models are sensitive - need stable updates
Sample efficiency: Human feedback is expensive - reuse data
KL constraint: Keeps policy close to reference (prevents mode collapse)
Proven: Works well in practice (ChatGPT, Claude)

PPO Loss Components:

L_PPO = L^CLIP + c_v * L^VF + β * KL(π_θ || π_ref)

Where:
- L^CLIP: Clipped policy loss (uses Policy Model π_θ)
- L^VF: Value function loss (uses Critic Model V_φ)
- KL: KL penalty (uses Reference Model π_ref)
- Rewards: From Reward Model r_ψ
- c_v, β: Coefficients

How All Four Models Work Together:

Training Loop:

Generate: Policy Model π_θ generates responses
Score: Reward Model r_ψ scores responses → rewards
Evaluate: Critic Model V_φ estimates values → V(s)
Compare: Reference Model π_ref provides logprobs → KL penalty
Compute: Advantages A = returns - V(s)
Update: Policy π_θ and Critic V_φ (Reference π_ref and Reward r_ψ frozen)

Mathematical Flow:

responses = π_θ.generate(prompts)
rewards = r_ψ(prompts, responses)
values = V_φ(prompts)
policy_logprobs = log π_θ(responses | prompts)
ref_logprobs = log π_ref(responses | prompts)

advantages = returns - values
ratio = exp(policy_logprobs - ref_logprobs)

L = min(ratio*A, clip(ratio)*A) + c_v*(V-R)² + β*KL(π_θ||π_ref)

See ppo_models_detailed.md for complete mathematical details!

Q4: What is GRPO (Group Relative Policy Optimization)? When is it useful?

Answer:

What is GRPO? GRPO extends PPO to handle multiple groups with different preferences. Instead of optimizing absolute reward, it optimizes relative to group baseline.

Mathematical Formulation:

L_GRPO = -E[r(θ) * (R_group - R_baseline)] + β * KL(π_θ || π_ref)

Where:
- R_group: Reward for specific group
- R_baseline: Average reward across all groups
- r(θ): Importance sampling ratio
- β: KL penalty coefficient

Why GRPO?

Multiple preferences: Different user groups have different preferences
Relative optimization: Optimize to be better than baseline, not absolute
Fairness: Ensures all groups improve relative to average
Prevents over-optimization: KL penalty keeps policy reasonable

Use Cases:

Demographic groups: Different age groups, regions, cultures
Use case groups: Different applications (coding, writing, analysis)
Skill level groups: Beginners vs experts
Domain groups: Different topics (science, literature, etc.)

Example:

Group A (young users): Prefer concise, casual responses
Group B (professionals): Prefer detailed, formal responses
Group C (students): Prefer educational, step-by-step responses

GRPO optimizes policy to be better than baseline for each group.

How it differs from PPO:

PPO: Optimizes absolute reward
GRPO: Optimizes relative reward (group - baseline)
GRPO: Handles multiple groups simultaneously
GRPO: Ensures fairness across groups

Implementation:

# Compute group rewards
group_rewards = [reward_model(group_responses) for group in groups]
baseline = mean(group_rewards)

# Relative advantages
relative_advantages = group_rewards - baseline

# Optimize with relative advantages
loss = -ratio * relative_advantages + β * KL_penalty

Q5: What are the main challenges in RL alignment? How do you address them?

Answer:

Challenge 1: Reward Hacking

Problem: Model finds ways to maximize reward that don't align with intent
Example: Model generates "I can't answer" to avoid negative reward
Solution:
- Careful reward design
- Multiple reward signals
- Human evaluation
- Regularization (KL penalty)

Challenge 2: Distribution Shift

Problem: Policy changes, but reward model trained on old distribution
Solution:
- Retrain reward model periodically
- Use on-policy data
- Regularization to prevent large shifts

Challenge 3: Mode Collapse

Problem: Policy collapses to single response pattern
Solution:
- KL penalty (keeps policy diverse)
- Entropy bonus
- Diverse training data

Challenge 4: Instability

Problem: Training can be unstable, performance can degrade
Solution:
- PPO clipping (prevents large updates)
- Gradient clipping
- Learning rate scheduling
- Checkpointing and rollback

Challenge 5: Human Feedback Quality

Problem: Inconsistent or biased human feedback
Solution:
- Multiple annotators
- Quality control
- Bias detection
- Diverse annotator pool

Challenge 6: Scalability

Problem: Need large amounts of human feedback
Solution:
- Active learning (prioritize important examples)
- Synthetic data generation
- Transfer learning
- Few-shot learning

Challenge 7: Evaluation

Problem: Hard to measure alignment
Solution:
- Multiple metrics (helpfulness, harmlessness, honesty)
- Human evaluation
- Red teaming
- Real-world testing

Q6: How do you prevent reward hacking in RLHF?

Answer:

What is Reward Hacking? Model finds unintended ways to maximize reward that don't align with human intent.

Examples:

Always says "I can't answer" to avoid negative reward
Generates very long responses (more tokens = higher reward)
Repeats high-reward phrases
Exploits reward model biases

Prevention Strategies:

1. Careful Reward Design

Multiple reward signals (not just one)
Penalize obvious hacks (length, repetition)
Reward diversity
Use human evaluation as ground truth

2. Regularization

KL Penalty: Prevents policy from deviating too much
```
L = E[r(θ)A] - β * KL(π_θ || π_ref)
```
Keeps policy reasonable
Prevents extreme behaviors

3. Reward Model Robustness

Train on diverse data
Detect and remove biases
Regular updates
Multiple reward models (ensemble)

4. Monitoring

Track reward distribution
Detect anomalies (sudden spikes)
Monitor response patterns
Human spot checks

5. Constrained Optimization

Hard constraints (max length, no repetition)
Soft constraints (penalties)
Multi-objective optimization

6. Iterative Refinement

Start with simple reward
Identify hacks
Refine reward
Repeat

Example Implementation:

def robust_reward(response, base_reward):
    # Base reward from reward model
    reward = base_reward
    
    # Penalize hacks
    if is_too_long(response):
        reward -= 0.1
    if has_repetition(response):
        reward -= 0.1
    if is_evasive(response):
        reward -= 0.2
    
    # Encourage diversity
    if is_diverse(response):
        reward += 0.05
    
    return reward

Q7: Explain the KL penalty in RLHF. Why is it important?

Answer:

What is KL Penalty? KL (Kullback-Leibler) divergence measures how different two probability distributions are. In RLHF, we penalize the policy for deviating from a reference policy.

Mathematical Formulation:

KL(π_θ || π_ref) = E[log(π_θ(a|s) / π_ref(a|s))]

In practice:
KL_penalty = β * (log π_θ - log π_ref)

Why KL Penalty?

1. Prevents Mode Collapse

Without KL: Policy might collapse to single response
With KL: Keeps policy diverse (similar to reference)

2. Prevents Reward Hacking

Without KL: Model finds hacks to maximize reward
With KL: Constrains model to reasonable behaviors

3. Maintains Capabilities

Reference model has good capabilities (from SFT)
KL penalty preserves these capabilities
Prevents catastrophic forgetting

4. Stability

Prevents large policy changes
More stable training
Gradual optimization

5. Trust Region

KL penalty creates trust region
Policy can't deviate too far
Similar to PPO clipping

How to Choose β (KL Coefficient):

Too small (β < 0.01): Policy can deviate too much, risk of hacks
Too large (β > 1.0): Policy can't learn, stays too close to reference
Typical (β = 0.1-0.5): Balance between learning and stability

In Practice:

# RLHF loss with KL penalty
ratio = exp(policy_logprob - reference_logprob)
policy_loss = -ratio * reward
kl_penalty = beta * (policy_logprob - reference_logprob)
total_loss = policy_loss + kl_penalty

Monitoring KL:

Track KL during training
If KL too high: Increase β
If KL too low: Decrease β
Target: KL ≈ 0.1-0.5 nats per token

Q8: How would you implement a complete RLHF pipeline?

Answer:

Complete Implementation Steps:

Step 1: Supervised Fine-Tuning

# Train on human demonstrations
def train_sft(model, demonstrations):
    for prompt, response in demonstrations:
        outputs = model(prompt)
        loss = cross_entropy(outputs, response)
        loss.backward()
        optimizer.step()

Step 2: Train Reward Model

# Train on preference pairs
def train_reward_model(reward_model, preferences):
    for prompt, chosen, rejected in preferences:
        chosen_score = reward_model(prompt, chosen)
        rejected_score = reward_model(prompt, rejected)
        
        # Binary classification: chosen > rejected
        loss = -log_sigmoid(chosen_score - rejected_score)
        loss.backward()
        optimizer.step()

Step 3: RL Optimization (PPO)

def rlhf_training(policy, reference, reward_model, preferences):
    optimizer = Adam(policy.parameters())
    
    for epoch in range(num_epochs):
        # Generate responses
        responses = policy.generate(prompts)
        
        # Score with reward model
        rewards = reward_model(prompts, responses)
        
        # Get logprobs
        policy_logprobs = policy.get_logprobs(prompts, responses)
        ref_logprobs = reference.get_logprobs(prompts, responses)
        
        # Compute advantages
        advantages = compute_advantages(rewards)
        
        # PPO loss with KL penalty
        ratio = exp(policy_logprobs - ref_logprobs)
        policy_loss = -min(ratio * advantages, 
                          clip(ratio, 1-ε, 1+ε) * advantages)
        kl_penalty = beta * (policy_logprobs - ref_logprobs)
        
        loss = policy_loss + kl_penalty
        loss.backward()
        optimizer.step()

Key Components:

Data: Demonstrations + preferences
Models: Policy, reference, reward model
Training: SFT → Reward → RL
Monitoring: Reward, KL, human evaluation

Summary

These questions cover:

RLHF pipeline (detailed)
DPO vs RLHF
PPO (mathematical details)
GRPO (group-based optimization)
Challenges and solutions
Reward hacking prevention
KL penalty importance
Complete implementation

All with detailed explanations, mathematical formulations, and code examples!

Additional Resources for Interview Preparation

For detailed paragraph-style explanations suitable for interviews, see:

ppo_process_explanation.md: Complete process explanations of:
- PPO training process (full paragraph style)
- GRPO training process (full paragraph style)
- DPO training process (full paragraph style)
- When to use each approach
- Complete mathematical flow in narrative form
rlhf_pipeline_explanation.md: Complete three-stage RLHF pipeline:
- Stage 1: Supervised Fine-Tuning (detailed process)
- Stage 2: Reward Model Training (detailed process)
- Stage 3: RL Optimization with PPO (detailed process)
- Challenges and solutions
- Evaluation and iteration

These documents provide comprehensive, flowing explanations that you can use directly in interviews to explain the complete processes from start to finish.

ML & LLM Interview Prep — Deep Dives