PPO Models: Detailed Explanation of All Components

Overview

In PPO (Proximal Policy Optimization) used for RLHF, there are four key models/components that work together. This document explains each one in detail: what they are, their mathematical role, how they're used, and where they appear in the training pipeline.

Part 1: The Four Models in PPO/RLHF

Model 1: Policy Model ( $π_{θ}$ )

What it is:

The main model being trained
Generates responses/actions
Outputs probability distribution over actions
This is what we're optimizing

Mathematical role. $π_{θ} (a ∣ s)$ is the probability of action $a$ given state $s$ .

In language models. $π_{θ} (y ∣ x)$ is the probability of generating response $y$ given prompt $x$ .

Outputs:

Log probabilities: $\log π_{θ} (a ∣ s)$
Action probabilities: $π_{θ} (a ∣ s)$
Can also include value estimate (if using actor-critic architecture)

Where it's used:

Generation: generate responses during training.
Loss computation: compute policy gradient.
Importance sampling: compute the ratio $r (θ) = π_{θ} / π_{θ_{old}}$ .

Mathematical formulation in PPO:

$L^{CLIP} (θ) = E [min (r (θ) A, clip (r (θ), 1 - ϵ, 1 + ϵ) A)]$

where

$r (θ) = \frac{π _{θ} ( a ∣ s )}{π _{θ_{old}} ( a ∣ s )} .$
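A minimal sketch of the clipped objective for a single sample, in plain Python. The function name `clipped_surrogate` and the default `eps=0.2` are illustrative assumptions, not taken from any particular library.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped objective (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)                  # r(theta)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(r, 1-eps, 1+eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
capped = clipped_surrogate(math.log(1.5), 0.0, advantage=2.0)
# Ratio 1.5 with a negative advantage: the unclipped, more pessimistic term is kept.
uncapped = clipped_surrogate(math.log(1.5), 0.0, advantage=-2.0)
```

With a positive advantage the objective stops growing once the ratio passes $1 + ϵ$, so there is no incentive to push the policy further in that direction; with a negative advantage the min keeps the larger penalty.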

Key point. This is the model we're training. It learns to maximize reward, constrained by a KL penalty to stay close to the reference.

Model 2: Critic Model (Value Function $V_{ϕ}$ )

What it is:

Estimates the value of a state
Predicts expected future return
Used to compute advantages
Can be a separate model or share parameters with the policy

Mathematical role.

$V_{ϕ} (s) = E [\sum_{t = 0}^{\infty} γ^{t} r_{t} ∣ s_{0} = s] .$

In language models. $V_{ϕ} (x)$ is the expected reward for prompt $x$ .

Outputs:

Scalar value estimate $V (s)$ .
Used to compute advantages $A (s, a) = Q (s, a) - V (s)$ .

Where it's used:

Advantage computation: $A = Q - V$ .
Value loss: $L^{VF} = (V_{ϕ} (s) - R)^{2}$ .
Baseline: reduces variance in the policy gradient.

Mathematical formulation.

$A (s, a) = Q (s, a) - V (s),$

with

$Q (s, a) = E [\sum_{t = 0}^{\infty} γ^{t} r_{t} ∣ s_{0} = s, a_{0} = a], V (s) = E [\sum_{t = 0}^{\infty} γ^{t} r_{t} ∣ s_{0} = s] .$

Value loss.

$L^{VF} = E [(V_{ϕ} (s) - R)^{2}],$

where $R$ is the actual return (discounted sum of rewards).
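As a small worked example, the return and the value loss can be computed in plain Python. The helper names and the toy numbers are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma=0.99):
    """R = sum_t gamma^t * r_t, accumulated from the last step backwards."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

def value_loss(predicted_values, returns):
    """Mean squared error between V_phi(s) and the observed return R."""
    errors = [(v - ret) ** 2 for v, ret in zip(predicted_values, returns)]
    return sum(errors) / len(errors)

R = discounted_return([0.0, 0.0, 1.0], gamma=0.5)   # 0 + 0.5*0 + 0.25*1 = 0.25
loss = value_loss([0.5, 1.0], [0.25, 1.0])          # (0.25**2 + 0) / 2 = 0.03125
```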

Key point. Estimates how good a state is, used to compute advantages (how much better than average), trained with MSE loss against actual returns.

Architecture options:

Separate critic: independent model $V_{ϕ} (s)$ .
Shared base: policy and critic share base layers, separate heads.
Actor-critic: single model with policy and value heads.

Model 3: Reference Model ( $π_{ref}$ )

What it is:

Frozen copy of the policy before RL training
Used to compute the KL penalty
Prevents the policy from deviating too much
Typically the SFT (supervised fine-tuned) model

Mathematical role. $π_{ref} (a ∣ s)$ is the (frozen) reference policy. For language models, $π_{ref} (y ∣ x)$ is the reference model's probability of response $y$ .

Outputs:

Log probabilities $\log π_{ref} (a ∣ s)$ .
Used to compute KL divergence.

Where it's used:

KL penalty computation: $KL (π_{θ} ∥ π_{ref})$ .
Not the importance sampling ratio: PPO's $r (θ)$ divides by $π_{θ_{old}}$ (the policy snapshot that sampled the rollout), not by $π_{ref}$ .
Regularization: prevents policy collapse.

Mathematical formulation.

$KL penalty = β \cdot KL (π_{θ} ∥ π_{ref}),$

where

$KL (π_{θ} ∥ π_{ref}) = E_{π_{θ}} [\log \frac{π _{θ} ( a ∣ s )}{π _{ref} ( a ∣ s )}] = E_{π_{θ}} [\log π_{θ} (a ∣ s) - \log π_{ref} (a ∣ s)] .$

In the PPO loss.

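$L_{total} = - L^{CLIP} + β \cdot KL (π_{θ} ∥ π_{ref}),$

with

$L^{CLIP} = E [min (r (θ) A, clip (r (θ), 1 - ϵ, 1 + ϵ) A)], r (θ) = \frac{π _{θ} ( a ∣ s )}{π _{θ_{old}} ( a ∣ s )} .$

Here $L^{CLIP}$ is an objective to maximize, so it enters the minimized loss with a negative sign. The ratio is taken against $π_{θ_{old}}$ , the policy that generated the rollout; $π_{ref}$ appears only in the KL term.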

Key point. Frozen (not trained); provides stability, prevents mode collapse, and ensures the policy doesn't forget SFT capabilities.

Why important:

Prevents mode collapse: keeps the policy diverse.
Prevents reward hacking: constrains the policy.
Maintains capabilities: preserves SFT knowledge.
Stability: prevents large policy changes.

Model 4: Reward Model ( $r_{ψ}$ )

What it is:

Predicts a reward for a response
Trained on human preferences
Scores how good a response is
Used to compute rewards during RL training

Mathematical role. $r_{ψ} (x, y)$ is the scalar reward for response $y$ to prompt $x$ . Higher means better response.

Where it's used:

Reward computation: score generated responses.
Return computation: $R = \sum_{t} γ^{t} r_{t}$ .
Advantage computation: $A = Q - V$ .

Mathematical formulation.

$r_{t} = r_{ψ} (x_{t}, y_{t}), R = \sum_{t = 0}^{T} γ^{t} r_{t}, A (s, a) = Q (s, a) - V (s) = E [R ∣ s, a] - V (s) .$

Training (before RL). Bradley–Terry preference loss:

$L_{reward} = - \log σ (r_{ψ} (x, y_{w}) - r_{ψ} (x, y_{l})),$

where $y_{w}$ is the chosen (winning) response, $y_{l}$ is the rejected (losing) response, and $σ$ is the sigmoid function.
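The preference loss for one pair can be sketched as follows. The function name is an assumption, and the softplus rewrite is a standard numerical-stability trick rather than something specified above.

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_w - r_l), written as softplus(-(r_w - r_l)) for stability."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) computed without overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

tie = bradley_terry_loss(0.0, 0.0)        # equal scores: loss is log 2
correct = bradley_terry_loss(2.0, 0.0)    # chosen scored higher: small loss
wrong = bradley_terry_loss(0.0, 2.0)      # rejected scored higher: large loss
```

The loss shrinks as the reward model ranks the chosen response further above the rejected one, and grows roughly linearly when the ranking is reversed.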

Key point. Trained separately before RL; captures human preferences; used to score responses during RL training; can be frozen or updated during RL.

Why important:

Human preferences: encodes what humans want.
Reward signal: provides the learning signal for the policy.
Quality assessment: measures response quality.

Part 2: How They Work Together in PPO Training

Complete PPO Training Loop

Step 1 — Generate responses. Using policy model $π_{θ}$ :

$responses = π_{θ} . generate (prompts) .$

Step 2 — Score with reward model. Using reward model $r_{ψ}$ :

$rewards = r_{ψ} (prompts, responses) .$

Step 3 — Get log probabilities.

$policy-logprobs = \log π_{θ} (responses ∣ prompts),$
$ref-logprobs = \log π_{ref} (responses ∣ prompts) .$

Step 4 — Compute returns.

$returns = compute-discounted-returns (rewards) .$

Step 5 — Compute values. Using critic model $V_{ϕ}$ :

$values = V_{ϕ} (prompts) .$

Step 6 — Compute advantages.

$advantages = returns - values, A = Q - V .$
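Steps 4 to 6 can be sketched per token in plain Python. The toy rewards (a single score on the last token) and the critic outputs are assumed values for illustration only.

```python
def returns_to_go(rewards, gamma=1.0):
    """Discounted return from each step t onward: R_t = r_t + gamma * R_{t+1}."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

rewards = [0.0, 0.0, 1.0]   # assumed: the reward model score arrives on the last token
values = [0.4, 0.7, 0.9]    # assumed critic outputs V_phi for each step
returns = returns_to_go(rewards, gamma=1.0)
advantages = [ret - v for ret, v in zip(returns, values)]
```

A positive advantage means the observed return was higher than the critic predicted for that step, so the sampled action is reinforced.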

Step 7 — Compute PPO loss.

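$ratio = exp (policy-logprobs - old-logprobs),$
$unclipped = ratio \cdot advantages,$
$clipped = clip (ratio, 1 - ϵ, 1 + ϵ) \cdot advantages,$
$policy-loss = - min (unclipped, clipped),$
$value-loss = (values - returns)^{2},$
$kl-penalty = β \cdot (policy-logprobs - ref-logprobs),$
$total-loss = policy-loss + c_{v} \cdot value-loss + kl-penalty .$

Here $old-logprobs$ are the policy log probabilities recorded at rollout time in Step 3 and held fixed, while $policy-logprobs$ are recomputed under the current $π_{θ}$ at every optimization step. The ratio is taken against the rollout-time policy $π_{θ_{old}}$ ; the reference model enters only through the KL penalty.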
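The loss assembly can be sketched for a small batch in plain Python. The function name and the defaults `eps=0.2`, `c_v=0.5`, `beta=0.1` are assumptions for illustration. Note that the ratio uses the rollout-time log probabilities (`old_logprobs`), matching the definition of $r (θ)$ with $π_{θ_{old}}$ , while the reference log probabilities feed only the KL term.

```python
import math

def ppo_total_loss(new_logprobs, old_logprobs, ref_logprobs,
                   advantages, values, returns,
                   eps=0.2, c_v=0.5, beta=0.1):
    """Batch-mean PPO loss (to be minimized), built term by term."""
    n = len(new_logprobs)
    policy_loss = value_loss = kl_penalty = 0.0
    for lp_new, lp_old, lp_ref, adv, v, ret in zip(
            new_logprobs, old_logprobs, ref_logprobs, advantages, values, returns):
        ratio = math.exp(lp_new - lp_old)                  # current vs rollout-time policy
        clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
        policy_loss += -min(ratio * adv, clipped_ratio * adv) / n
        value_loss += (v - ret) ** 2 / n
        kl_penalty += beta * (lp_new - lp_ref) / n
    return policy_loss + c_v * value_loss + kl_penalty

total = ppo_total_loss(
    new_logprobs=[-1.0, -2.0], old_logprobs=[-1.0, -2.0], ref_logprobs=[-1.5, -2.0],
    advantages=[1.0, -1.0], values=[0.5, 0.0], returns=[1.0, 0.0],
)
```

In this toy batch the policy has not moved since the rollout, so every ratio is 1 and the two advantages cancel in the policy term; the total comes from the value error and the KL penalty alone.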

Step 8 — Update models.

Update policy $π_{θ}$ : optimize $total-loss$ .
Update critic $V_{ϕ}$ : optimize $value-loss$ .
Reference $π_{ref}$ : frozen (no update).
Reward $r_{ψ}$ : typically frozen (can be updated).

Part 3: Mathematical Details for Each Model

Policy Model ( $π_{θ}$ ) — detailed mathematics

Forward pass. Input prompt $x$ , output response $y$ with probability $π_{θ} (y ∣ x)$ . For each token:

$logits = π_{θ} (x, y_{< t}), probs = softmax (logits), y_{t} \sim Categorical (probs) .$

Log probability.

$\log π_{θ} (y ∣ x) = \sum_{t = 1}^{T} \log π_{θ} (y_{t} ∣ x, y_{< t}) .$
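The token-level sum can be illustrated with a two-token vocabulary. The logits and sampled token ids below are made-up values, and `log_softmax` is written out by hand so the sketch needs only the standard library.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def sequence_logprob(step_logits, token_ids):
    """Sum of per-token log probabilities log pi(y_t | x, y_<t)."""
    return sum(log_softmax(logits)[tok]
               for logits, tok in zip(step_logits, token_ids))

step_logits = [[0.0, 0.0], [math.log(3.0), 0.0]]   # probs [0.5, 0.5] then [0.75, 0.25]
token_ids = [0, 0]                                 # hypothetical sampled tokens
logp = sequence_logprob(step_logits, token_ids)    # log(0.5) + log(0.75)
```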

Policy gradient.

$\nabla_{θ} L = E [r (θ) \cdot A \cdot \nabla_{θ} lo g π_{θ} (a ∣ s)],$

where $r (θ) = π_{θ} (a ∣ s) / π_{θ_{old}} (a ∣ s)$ and $A$ is the advantage.

PPO clipping.

$L^{CLIP} = E [min (r (θ) A, clip (r (θ), 1 - ϵ, 1 + ϵ) A)] .$

This prevents large policy updates, over-optimization, and training instability.

Critic Model ( $V_{ϕ}$ ) — detailed mathematics

Value function.

$V_{ϕ} (s) = E_{π} [\sum_{t = 0}^{\infty} γ^{t} r_{t} ∣ s_{0} = s],$

where $γ$ is the discount factor, $r_{t}$ the reward at time $t$ , and $π$ the current policy.

Bellman equation.

$V_{ϕ} (s) = E [r + γ V_{ϕ} (s^{'}) ∣ s] ⟹ V_{ϕ} (s) \approx r + γ V_{ϕ} (s^{'}) .$

Value loss.

$L^{VF} = E [(V_{ϕ} (s) - R)^{2}], R = \sum_{t = 0}^{T} γ^{t} r_{t} .$

Gradient.

$\nabla_{ϕ} L^{VF} = E [2 (V_{ϕ} (s) - R) \cdot \nabla_{ϕ} V_{ϕ} (s)] .$
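The gradient formula can be checked numerically on a toy case. This sketch assumes a linear value function $V (s) = w \cdot s$ (an illustrative choice, not how an LLM critic is parameterized) and compares the analytic gradient with a central finite difference.

```python
def value(w, s):
    """Linear value function V(s) = w . s (assumed for illustration)."""
    return sum(wi * si for wi, si in zip(w, s))

def value_loss_grad(w, s, ret):
    """Analytic gradient 2 * (V(s) - R) * dV/dw, with dV/dw = s."""
    err = value(w, s) - ret
    return [2.0 * err * si for si in s]

def numeric_grad(w, s, ret, h=1e-6):
    """Central finite-difference gradient of (V(s) - R)^2 with respect to w."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        loss_up = (value(up, s) - ret) ** 2
        loss_down = (value(down, s) - ret) ** 2
        grads.append((loss_up - loss_down) / (2.0 * h))
    return grads

w, s, ret = [0.5, -1.0], [1.0, 2.0], 1.0
analytic = value_loss_grad(w, s, ret)   # V = -1.5, error = -2.5, gradient = [-5, -10]
numeric = numeric_grad(w, s, ret)
```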

Why a value function:

Baseline: reduces variance in the policy gradient.
Advantages: $A = Q - V$ — how much better than average.
Stability: more stable than raw returns.

Reference Model ( $π_{ref}$ ) — detailed mathematics

KL divergence.

$KL (π_{θ} ∥ π_{ref}) = E_{π_{θ}} [\log \frac{π _{θ} ( a ∣ s )}{π _{ref} ( a ∣ s )}] = E_{π_{θ}} [\log π_{θ} (a ∣ s) - \log π_{ref} (a ∣ s)] .$

In practice.

$KL-penalty = β \cdot E [\log π_{θ} - \log π_{ref}] .$
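In code, the expectation is usually replaced by an average over tokens sampled from the policy. A minimal sketch, with an assumed function name and made-up log probabilities:

```python
def kl_penalty(policy_logprobs, ref_logprobs, beta=0.1):
    """beta * mean(log pi_theta - log pi_ref) over tokens sampled from pi_theta."""
    diffs = [lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs)]
    return beta * sum(diffs) / len(diffs)

# Per-token differences are 0.2, 0.4 and -0.1; their mean is 0.5 / 3.
penalty = kl_penalty([-1.0, -0.5, -2.0], [-1.2, -0.9, -1.9], beta=0.1)
```

Individual token differences can be negative, as in the third token here; only the expectation under $π_{θ}$ is guaranteed to be non-negative.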

Properties.

$KL \geq 0$ (always non-negative).
$KL = 0$ iff $π_{θ} = π_{ref}$ .
Asymmetric: $KL (π_{θ} ∥ π_{ref}) \neq KL (π_{ref} ∥ π_{θ})$ in general.
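These properties can be checked on two small categorical distributions. The probabilities are arbitrary illustrative values.

```python
import math

def kl_divergence(p, q):
    """Exact KL(p || q) for two categorical distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

policy = [0.7, 0.2, 0.1]
reference = [0.4, 0.4, 0.2]

forward = kl_divergence(policy, reference)   # KL(pi_theta || pi_ref)
reverse = kl_divergence(reference, policy)   # KL(pi_ref || pi_theta)
same = kl_divergence(policy, policy)         # identical distributions
```

Both directions are positive but differ from each other, and the divergence of a distribution from itself is zero.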

Why a KL penalty:

Trust region: keeps the policy close to the reference.
Prevents collapse: maintains diversity.
Stability: prevents large changes.
Capability preservation: keeps SFT knowledge.

Typical values. $β \in [0.1, 0.5]$ ; target KL $\in [0.1, 0.5]$ nats per token. If KL is too high, increase $β$ ; if too low, decrease $β$ .
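The adjustment rule in the last sentence can be sketched as a simple multiplicative controller. The function name, the `factor` of 1.5, and the `tolerance` band are hypothetical choices for illustration; practical schedules vary.

```python
def update_beta(beta, observed_kl, target_kl, factor=1.5, tolerance=1.5):
    """Raise beta when KL overshoots the target band, lower it when KL undershoots."""
    if observed_kl > target_kl * tolerance:
        return beta * factor      # policy drifting too far: penalize harder
    if observed_kl < target_kl / tolerance:
        return beta / factor      # policy barely moving: relax the penalty
    return beta                   # inside the band: leave beta unchanged

raised = update_beta(0.2, observed_kl=0.5, target_kl=0.2)
lowered = update_beta(0.2, observed_kl=0.05, target_kl=0.2)
unchanged = update_beta(0.2, observed_kl=0.2, target_kl=0.2)
```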

Reward Model ( $r_{ψ}$ ) — detailed mathematics

Reward function. $r_{ψ} (x, y) : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ maps a (prompt, response) pair to a scalar reward.

Training objective (Bradley–Terry).

$L_{reward} = - \log σ (r_{ψ} (x, y_{w}) - r_{ψ} (x, y_{l})),$

where $y_{w}$ is the chosen (winning) response, $y_{l}$ the rejected (losing) response, and $σ$ the sigmoid function.

Interpretation.

$P (y_{w} ≻ y_{l} ∣ x) = σ (r_{ψ} (x, y_{w}) - r_{ψ} (x, y_{l})),$

the probability that the chosen response is preferred over the rejected one.

During RL. For a generated response $y$ :

$reward = r_{ψ} (x, y),$

used to compute returns $R = \sum_{t} γ^{t} r_{t}$ and advantages $A = Q - V$ .

Reward shaping (optional).

$r_{total} = r_{ψ} (x, y) + r_{KL} (x, y) + r_{length} (x, y),$

where $r_{KL}$ is a KL penalty (can live in the reward or in the loss) and $r_{length}$ is a length penalty.
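One possible shaping is sketched below. The specific forms are assumptions made for illustration: $r_{KL}$ is taken as $- β$ times the summed log-probability gap to the reference, and $r_{length}$ as a linear penalty on tokens beyond a hypothetical `max_len`.

```python
def shaped_reward(rm_score, policy_logprobs, ref_logprobs,
                  beta=0.1, length_coef=0.01, max_len=4):
    """r_total = r_psi + r_KL + r_length under the assumed forms described above."""
    r_kl = -beta * sum(lp - lr for lp, lr in zip(policy_logprobs, ref_logprobs))
    r_length = -length_coef * max(0, len(policy_logprobs) - max_len)
    return rm_score + r_kl + r_length

# Five tokens, each 0.5 nats more likely under the policy than under the reference.
total_reward = shaped_reward(1.0, [-1.0] * 5, [-1.5] * 5)   # 1.0 - 0.25 - 0.01
```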

Part 4: Architecture Details

Policy Model Architecture

Option 1 — separate policy network:

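import torch.nn as nn

class PolicyModel(nn.Module):
    def __init__(self, base, d_model, vocab_size):
        super().__init__()                          # required before registering submodules
        self.base = base                            # backbone, e.g. Transformer(...)
        self.head = nn.Linear(d_model, vocab_size)  # language-modeling head

    def forward(self, x):
        hidden = self.base(x)        # (batch, seq_len, d_model)
        logits = self.head(hidden)   # (batch, seq_len, vocab_size)
        return logits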

Option 2 — actor–critic (shared base):

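import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, base, d_model, vocab_size):
        super().__init__()
        self.base = base                                   # shared backbone, e.g. Transformer(...)
        self.policy_head = nn.Linear(d_model, vocab_size)  # actor: next-token logits
        self.value_head  = nn.Linear(d_model, 1)           # critic: scalar value per position

    def forward(self, x):
        hidden = self.base(x)               # (batch, seq_len, d_model)
        logits = self.policy_head(hidden)   # (batch, seq_len, vocab_size)
        values = self.value_head(hidden)    # (batch, seq_len, 1)
        return logits, values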

Critic Model Architecture

Option 1 — separate critic:

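import torch.nn as nn

class CriticModel(nn.Module):
    def __init__(self, base, d_model):
        super().__init__()
        self.base = base                    # own backbone, e.g. Transformer(...)
        self.head = nn.Linear(d_model, 1)   # scalar value head

    def forward(self, x):
        hidden = self.base(x)       # (batch, seq_len, d_model)
        value = self.head(hidden)   # (batch, seq_len, 1)
        return value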

Option 2 — shared with policy (actor–critic): same as above, but shares the base with the policy.

Reference Model Architecture

Same as the policy model — a copy of the policy before RL training. Frozen (no gradients), used only for log-probability computation.

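import copy

# Initialize the reference model as a copy of the policy before RL training
reference_model = copy.deepcopy(policy_model)
reference_model.eval()  # disables dropout; eval() alone does not freeze weights
for param in reference_model.parameters():
    param.requires_grad = False  # this is what actually freezes the parameters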

Reward Model Architecture

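import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model, d_model):
        super().__init__()
        self.base = base_model        # can be initialized from the policy base
        self.head = nn.Linear(d_model, 1)

    def forward(self, x, y):
        # Concatenate prompt and response token ids along the sequence dimension
        input_ids = torch.cat([x, y], dim=1)
        hidden = self.base(input_ids)             # (batch, seq_len, d_model)
        # Use the last token's hidden state (or mean pooling: hidden.mean(dim=1))
        reward = self.head(hidden[:, -1])         # (batch, 1)
        return reward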

Part 5: Training Phases

Phase 1: Supervised Fine-Tuning (SFT)

Models used. Policy model $π_{θ}$ (being trained).

Objective (standard language modeling loss).

$L_{SFT} = - \log π_{θ} (y ∣ x) .$

Result. A policy model that can follow instructions; this becomes the reference model $π_{ref}$ .

Phase 2: Reward Model Training

Models used. Reward model $r_{ψ}$ (being trained).

Data. Preference pairs $(x, y_{w}, y_{l})$ .

Objective.

$L_{reward} = - \log σ (r_{ψ} (x, y_{w}) - r_{ψ} (x, y_{l})) .$

Result. A reward model that scores responses, trained to prefer chosen over rejected.

Phase 3: RL Optimization (PPO)

Models used.

Policy model $π_{θ}$ (being trained).
Critic model $V_{ϕ}$ (being trained).
Reference model $π_{ref}$ (frozen).
Reward model $r_{ψ}$ (typically frozen).

Objective.

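$L_{PPO} = - L^{CLIP} + c_{v} \cdot L^{VF} + β \cdot KL (π_{θ} ∥ π_{ref}),$

with

$L^{CLIP} = E [min (r (θ) A, clip (r (θ), 1 - ϵ, 1 + ϵ) A)],$
$L^{VF} = E [(V_{ϕ} (s) - R)^{2}],$
$KL = E [\log π_{θ} - \log π_{ref}] .$

$L_{PPO}$ is minimized; the clipped surrogate $L^{CLIP}$ is a quantity to maximize, hence its negative sign.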

Training loop.

Generate responses with $π_{θ}$ .
Score with $r_{ψ}$ .
Get logprobs from $π_{θ}$ and $π_{ref}$ .
Compute values with $V_{ϕ}$ .
Compute advantages.
Update $π_{θ}$ and $V_{ϕ}$ .

Result. An aligned policy model, better at generating preferred responses.

Part 6: Summary Table

Model	Role	Trained?	Used for	Mathematical form
Policy $π_{θ}$	Generate responses	Yes	Generation, loss	$π_{θ} (a ∣ s)$
Critic $V_{ϕ}$	Estimate state value	Yes	Advantages	$V_{ϕ} (s) = E [R ∣ s]$
Reference $π_{ref}$	Regularization	No (frozen)	KL penalty	$π_{ref} (a ∣ s)$
Reward $r_{ψ}$	Score responses	Before RL	Rewards	$r_{ψ} (x, y)$

Key relationships.

Advantage: $A = Q - V \approx R - V_{ϕ}$ (the sampled return $R$ stands in for $Q$ ).
Ratio: $r (θ) = π_{θ} / π_{θ_{old}}$ .
KL: $KL = E [\log π_{θ} - \log π_{ref}]$ .
Reward: $r = r_{ψ} (x, y)$ .

Training.

SFT: train $π_{θ}$ .
Reward: train $r_{ψ}$ .
RL: train $π_{θ}$ and $V_{ϕ}$ ( $π_{ref}$ and $r_{ψ}$ frozen).

Conclusion

Understanding these four models is crucial for PPO/RLHF:

Policy model: what we're optimizing; generates responses.
Critic model: estimates values; computes advantages.
Reference model: provides stability; prevents collapse.
Reward model: scores responses; provides the learning signal.

Each has a specific mathematical role and is used at different stages of training. Together, they enable stable and effective RLHF training.
