Reinforcement Learning Fundamentals — Deep Dive
Frontier-lab interview prep. Pair with
INTERVIEW_GRILL.md.
RL is the foundation underneath RLHF, agentic systems, and tool-use training. Frontier-lab interviews probe RL not because they want game-playing agents but because RLHF/PPO/GRPO fluency requires understanding the underlying machinery. This deep dive covers what you need.
1. The MDP framework
A Markov Decision Process is :
- : state space.
- : action space.
- : transition probability.
- (or ): reward function.
- : discount factor.
Markov property: . Future depends only on current state and action.
Policy : distribution over actions given state. Deterministic: . Stochastic: .
Trajectory: .
Return (cumulative discounted reward):
The agent maximizes .
2. Value functions
State-value — expected return starting from following .
Action-value — expected return from taking first, then .
Advantage:
How much better is action than the policy's average behavior in state ?
Bellman equations
satisfies (one-step decomposition):
Bellman optimality
For optimal policy :
These are fixed-point equations. The Bellman operator is a contraction → unique solution → value iteration converges.
3. Dynamic programming methods
When the model is known, you can compute and exactly.
Value iteration
Iterate the Bellman optimality operator:
Converges geometrically with rate . Optimal policy: .
Policy iteration
- Policy evaluation: solve (linear system).
- Policy improvement: .
- Repeat until convergence.
Each step strictly improves (or terminates). Often faster than value iteration in practice.
4. Model-free methods — when you don't know and
Monte Carlo
Run full episodes; average returns to estimate :
Pros: unbiased. Cons: high variance, requires episodic structure.
Temporal Difference (TD) learning
Bootstrap from current value estimate:
The bracketed quantity is the TD error . TD trades variance for bias.
Q-learning (off-policy)
Update toward the greedy next-action value, even if behavior policy was exploratory. Off-policy: learn while acting -greedy.
SARSA (on-policy)
Update toward the action actually taken. Learns for the behavior policy.
5. Function approximation and DQN
For continuous or huge state spaces, use a function approximator .
DQN (Deep Q-Network, Mnih et al. 2015)
Loss:
Tricks that made DQN work
- Experience replay: store transitions in a buffer; sample uniformly. Breaks temporal correlations.
- Target network : snapshot of updated infrequently. Prevents the target from chasing itself.
- Frame stacking + CNN: handles partial observability of single-frame Atari.
Improvements
- Double DQN: decouple action selection (online net) from evaluation (target net) to reduce overestimation bias.
- Dueling DQN: separate value and advantage heads.
- Prioritized experience replay: sample by TD error magnitude.
- Rainbow: combines all of these.
6. Policy gradient methods
Directly parameterize the policy and optimize via gradient ascent on expected return.
Policy gradient theorem
The gradient of the return equals the expectation of (gradient of log-probability) × (Q-value).
REINFORCE
Use Monte Carlo return as an unbiased estimator of :
Pros: simple, unbiased. Cons: high variance.
Variance reduction with baselines
For any baseline that doesn't depend on . Standard choice: , giving advantage:
Actor-critic
Train both:
- Actor: policy .
- Critic: value (or ).
Use the critic's advantage estimate in the policy gradient. Reduces variance vs Monte Carlo at cost of some bias.
A2C / A3C
Advantage Actor-Critic / Asynchronous A3C. Synchronous (A2C) and asynchronous (A3C) variants. Standard before PPO.
7. Trust-region and PPO
Vanilla policy gradient suffers from destructive updates: large step → policy collapses.
Natural policy gradient
Use the Fisher metric to control update magnitude:
Step size in the KL geometry, not the parameter geometry. Computationally expensive (Fisher matrix inversion).
TRPO (Schulman et al. 2015)
Constrained optimization: maximize the surrogate objective subject to . Solve via conjugate gradient + line search.
PPO (Schulman et al. 2017)
Replace the constraint with a clipped surrogate. The clean way to write it (and to code it):
r = pi_theta(a|s) / pi_old(a|s) # importance ratio
surr1 = r * A
surr2 = clip(r, 1 - eps, 1 + eps) * A # clipped version
loss = -min(surr1, surr2).mean() # negate for gradient ascent
Equivalent formula:
Standard . When the new policy moves too far in the direction the advantage points, the clip kills the gradient — that's the trust-region effect.
PPO is simpler than TRPO, more stable than vanilla PG, and the workhorse of modern RL — including RLHF.
GAE (Generalized Advantage Estimation)
A flexible advantage estimator:
with TD error . trades bias and variance:
- : pure TD (low variance, high bias).
- : Monte Carlo (high variance, low bias).
- Standard: .
8. Exploration vs exploitation
Without exploration, the agent can be stuck on suboptimal policies.
- -greedy: with prob , random; else greedy.
- Boltzmann (softmax): sample from .
- UCB: bonus to less-tried actions: .
- Thompson sampling: maintain posterior over ; sample and act greedily w.r.t. sample.
- Entropy bonus: add to the objective. Used in PPO for LLM alignment.
- Curiosity / intrinsic motivation: reward novelty. Useful in sparse-reward tasks.
In LLM RLHF, the KL penalty serves as a regularizer that prevents over-specialization (a form of soft exploration constraint).
9. RL for LLMs (RLHF connection)
In RLHF:
- State: prompt + tokens generated so far.
- Action: next token.
- Reward: from a learned reward model (or rule-based for verifiable tasks like math).
- Policy: the LLM itself, .
- Reference policy: , the SFT model. KL penalty prevents drift.
The PPO objective for RLHF:
GRPO (DeepSeekMath/R1) is a simplification: drops the learned value/critic network. Advantage is computed from group-relative reward normalization (sample responses per prompt; advantage is ).
def grpo_advantage(rewards):
"""rewards: [B, K] — K sampled responses per prompt. Returns [B, K] advantages."""
mu = rewards.mean(dim=-1, keepdim=True)
sigma = rewards.std(dim=-1, keepdim=True) + 1e-8
return (rewards - mu) / sigma # group-relative, no critic needed
Recent follow-ups (DAPO, Dr. GRPO, 2025) drop the normalization to reduce length bias.
10. Common interview gotchas
| Question | Common wrong answer | Right answer |
|---|---|---|
| Q-learning is on or off policy? | On | Off — uses max over next actions, regardless of behavior |
| SARSA — on or off? | Off | On — uses the action actually taken |
| Why discount? | "Convention" | Stationary fixed-point of Bellman; bounded value when reward is bounded; preference for sooner rewards |
| Why not just use return as Q? | "It's biased" | Monte Carlo is unbiased but high variance; bootstrap reduces variance |
| Why does PPO clip the ratio? | "Why not?" | Prevents destructive policy updates; stable training |
| Advantage = return - baseline. Any baseline works? | Yes | Any baseline that doesn't depend on doesn't change the gradient's expectation |
| RLHF uses what RL algo? | DQN | Usually PPO; sometimes DPO (which isn't RL); GRPO in DeepSeek-R1 |
11. Eight most-asked interview questions
- State the Bellman equation for and explain. (Recursive expectation; one-step decomposition.)
- Q-learning vs SARSA — what's the difference? (Off-policy max vs on-policy actual action.)
- Why does DQN need a target network? (Stabilize the target; prevent oscillation.)
- Derive the policy gradient theorem. (Log-derivative trick; expectation of .)
- Why use a baseline in REINFORCE? (Reduce variance without changing bias.)
- What does PPO clip and why? (Probability ratio; prevent destructive updates.)
- GAE — what does control? (Bias-variance: 0 = TD, 1 = Monte Carlo.)
- In RLHF, what role does the KL penalty play? (Prevents the policy from drifting too far from SFT/reference; soft constraint.)
12. Drill plan
- Memorize Bellman equations (V, Q, optimal V, optimal Q).
- Derive policy gradient theorem on paper. 5 minutes.
- For each algorithm (Q-learning, SARSA, REINFORCE, A2C, PPO), recite: update rule, on/off-policy, key properties.
- Trace one episode of Q-learning with -greedy on a 2-state MDP.
- For RLHF, write the full PPO objective with KL penalty.
13. Further reading
- Sutton & Barto, Reinforcement Learning: An Introduction — the canonical text.
- Mnih et al. (2015), Human-level control through deep reinforcement learning — DQN.
- Schulman et al. (2015), Trust Region Policy Optimization.
- Schulman et al. (2017), Proximal Policy Optimization Algorithms.
- Schulman et al. (2016), High-Dimensional Continuous Control Using Generalized Advantage Estimation — GAE.
- Christiano et al. (2017), Deep RL from Human Preferences — RLHF foundation.