Reinforcement Learning Fundamentals — Deep Dive

Frontier-lab interview prep. Pair with INTERVIEW_GRILL.md.

RL is the foundation underneath RLHF, agentic systems, and tool-use training. Frontier-lab interviews probe RL not because they want game-playing agents but because RLHF/PPO/GRPO fluency requires understanding the underlying machinery. This deep dive covers what you need.

1. The MDP framework

A Markov Decision Process is $(S, A, P, R, γ)$ :

$S$ : state space.
$A$ : action space.
$P (s^{'} ∣ s, a)$ : transition probability.
$R (s, a)$ (or $R (s, a, s^{'})$ ): reward function.
$γ \in [0, 1)$ : discount factor.

Markov property: $P (s_{t + 1} ∣ s_{t}, a_{t}, s_{t - 1}, \dots) = P (s_{t + 1} ∣ s_{t}, a_{t})$ . Future depends only on current state and action.

Policy $π$ : distribution over actions given state. Deterministic: $a = π (s)$ . Stochastic: $π (a ∣ s)$ .

Trajectory: $τ = (s_{0}, a_{0}, r_{0}, s_{1}, a_{1}, r_{1}, \dots)$ .

Return (cumulative discounted reward):

$G_{t} = \sum_{k = 0}^{\infty} γ^{k} r_{t + k}$

The agent maximizes $E_{π} [G_{0}]$ .
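As a quick illustration of the return definition, here is a minimal sketch that computes $G_{t}$ for every step of a finite episode. It relies on the recursion $G_{t} = r_{t} + γ G_{t + 1}$, which follows directly from the sum; the function name and the toy rewards are made up for the example.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for one finite episode of rewards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards using G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

The backward pass is linear in episode length, whereas evaluating the sum separately for each $t$ would be quadratic.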

2. Value functions

State-value $V^{π} (s) = E_{π} [G_{t} ∣ s_{t} = s]$ — expected return starting from $s$ following $π$ .

Action-value $Q^{π} (s, a) = E_{π} [G_{t} ∣ s_{t} = s, a_{t} = a]$ — expected return from $s$ taking $a$ first, then $π$ .

Advantage:

$A^{π} (s, a) = Q^{π} (s, a) - V^{π} (s)$

How much better is action $a$ than the policy's average behavior in state $s$ ?

Bellman equations

$V^{π}$ satisfies (one-step decomposition):

$V^{π} (s) = \sum_{a} π (a ∣ s) \sum_{s^{'}} P (s^{'} ∣ s, a) [R (s, a, s^{'}) + γ V^{π} (s^{'})]$

$Q^{π} (s, a) = \sum_{s^{'}} P (s^{'} ∣ s, a) [R (s, a, s^{'}) + γ \sum_{a^{'}} π (a^{'} ∣ s^{'}) Q^{π} (s^{'}, a^{'})]$

Bellman optimality

For optimal policy $π^{*}$ :

$V^{*} (s) = \max_{a} \sum_{s^{'}} P (s^{'} ∣ s, a) [R (s, a, s^{'}) + γ V^{*} (s^{'})]$

$Q^{*} (s, a) = \sum_{s^{'}} P (s^{'} ∣ s, a) [R (s, a, s^{'}) + γ \max_{a^{'}} Q^{*} (s^{'}, a^{'})]$

These are fixed-point equations. The Bellman optimality operator $T^{*}$ is a $γ$-contraction in the sup norm, so it has a unique fixed point ($V^{*}$) and value iteration converges to it.

3. Dynamic programming methods

When the model is known, you can compute $V^{*}$ and $Q^{*}$ exactly.

Value iteration

Iterate the Bellman optimality operator:

$V_{k + 1} (s) = \max_{a} \sum_{s^{'}} P (s^{'} ∣ s, a) [R (s, a, s^{'}) + γ V_{k} (s^{'})]$

Converges geometrically with rate $γ$ . Optimal policy: $π^{*} (s) = \arg\max_{a} Q^{*} (s, a)$ .
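The update above can be sketched in a few lines for a tabular MDP. This is a minimal illustration, not a reference implementation: the `P[s][a] = [(prob, next_state, reward), ...]` layout, the helper names, and the two-state "stay"/"go" MDP are all assumptions made up for the example.

```python
def backup(P, V, s, a, gamma):
    """Expected one-step value of action a in state s: sum_s' P(s'|s,a) [R + gamma V(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma, tol=1e-10):
    V = {s: 0.0 for s in P}
    while True:
        # Synchronous Bellman optimality backup: V_{k+1} is built only from V_k
        new_V = {s: max(backup(P, V, s, a, gamma) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    # Greedy policy with respect to the converged values
    policy = {s: max(P[s], key=lambda a: backup(P, V, s, a, gamma)) for s in P}
    return V, policy


# Hypothetical toy MDP: deterministic transitions, two actions per state
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
V, policy = value_iteration(P, gamma=0.9)
print(V, policy)
```

By hand: staying in state 1 forever is worth $2 / (1 - 0.9) = 20$, and going from state 0 is worth $1 + 0.9 \cdot 20 = 19$, which is what the iteration approaches.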

Policy iteration

Policy evaluation: solve $V^{π} = T^{π} V^{π}$ (linear system).
Policy improvement: $π^{'} (s) = \arg\max_{a} Q^{π} (s, a)$ .
Repeat until convergence.

Each step strictly improves (or terminates). Often faster than value iteration in practice.
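A minimal sketch of the evaluate/improve loop follows. One substitution to note: instead of solving the linear system directly, the evaluation step here iterates $V \leftarrow T^{π} V$ until it stops changing, which keeps the example dependency-free. The MDP layout and the toy "stay"/"go" problem are hypothetical.

```python
def q_value(P, V, s, a, gamma):
    """Q(s, a) computed from a value table V and the known model P."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def policy_iteration(P, gamma, eval_tol=1e-10):
    policy = {s: next(iter(P[s])) for s in P}  # arbitrary initial policy
    V = {s: 0.0 for s in P}
    while True:
        # Policy evaluation (iterative stand-in for the linear solve)
        while True:
            new_V = {s: q_value(P, V, s, policy[s], gamma) for s in P}
            delta = max(abs(new_V[s] - V[s]) for s in P)
            V = new_V
            if delta < eval_tol:
                break
        # Policy improvement: act greedily with respect to Q^pi
        new_policy = {
            s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P
        }
        if new_policy == policy:
            return V, policy
        policy = new_policy


# Hypothetical toy MDP: P[s][a] = [(prob, next_state, reward), ...]
P = {
    0: {"stay": [(1.0, 0, 0.0)], "go": [(1.0, 1, 1.0)]},
    1: {"stay": [(1.0, 1, 2.0)], "go": [(1.0, 0, 0.0)]},
}
V, policy = policy_iteration(P, gamma=0.9)
print(V, policy)
```

On this toy problem the policy stops changing after a single improvement step, which is the kind of behavior behind the "often faster in practice" remark.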

4. Model-free methods — when you don't know $P$ and $R$

Monte Carlo

Run full episodes; average returns to estimate $V^{π} (s)$ :

$V^{π} (s) \leftarrow V^{π} (s) + α (G_{t} - V^{π} (s))$

Pros: unbiased. Cons: high variance, requires episodic structure.

Temporal Difference (TD) learning

Bootstrap from current value estimate:

$V (s_{t}) \leftarrow V (s_{t}) + α [r_{t} + γV (s_{t + 1}) - V (s_{t})]$

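The bracketed quantity is the TD error $δ_{t}$ . Compared with Monte Carlo, TD has lower variance but is biased, because the target bootstraps from the current estimate $V (s_{t + 1})$ .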
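To make the update concrete, here is a small sketch of TD(0) on a deterministic two-step chain. The `terminal` flag (which zeroes the bootstrap term at episode end) and the chain itself are assumptions added for the example.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update in place and return the TD error delta_t."""
    target = r + (0.0 if terminal else gamma * V[s_next])
    delta = target - V[s]
    V[s] += alpha * delta
    return delta


# Hypothetical chain: state 0 -> state 1 (reward 0), state 1 -> end (reward 1)
V = {0: 0.0, 1: 0.0}
episode = [(0, 0.0, 1, False), (1, 1.0, None, True)]
for _ in range(500):
    for s, r, s_next, terminal in episode:
        td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=terminal)

print(V)  # approaches V(1) = 1.0 and V(0) = 0.9
```

Note that $V (0)$ is updated from the estimate of $V (1)$ rather than from the observed return; that is the bootstrapping step.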

Q-learning (off-policy)

$Q (s_{t}, a_{t}) \leftarrow Q (s_{t}, a_{t}) + α [r_{t} + γ \max_{a^{'}} Q (s_{t + 1}, a^{'}) - Q (s_{t}, a_{t})]$

Update toward the greedy next-action value, even if behavior policy was exploratory. Off-policy: learn $Q^{*}$ while acting $ϵ$ -greedy.
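A minimal tabular sketch of this loop, assuming a hypothetical `step(s, a) -> (next_state, reward)` environment function and a made-up two-state problem. The behavior policy is $ϵ$ -greedy while the update target is greedy, which is what makes it off-policy.

```python
import random


def q_learning(step, states, actions, episodes, steps_per_episode,
               alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in actions} for s in states}
    for _ in range(episodes):
        s = rng.choice(states)
        for _ in range(steps_per_episode):
            # Behavior policy: epsilon-greedy over the current Q
            if rng.random() < epsilon:
                a = rng.choice(actions)
            else:
                a = max(Q[s], key=Q[s].get)
            s_next, r = step(s, a)
            # Target policy: greedy, regardless of what is executed next
            target = r + gamma * max(Q[s_next].values())
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q


def step(s, a):
    """Hypothetical deterministic 2-state environment."""
    if s == 0:
        return (0, 0.0) if a == "stay" else (1, 1.0)
    return (1, 2.0) if a == "stay" else (0, 0.0)


Q = q_learning(step, states=[0, 1], actions=["stay", "go"],
               episodes=200, steps_per_episode=50)
greedy = {s: max(Q[s], key=Q[s].get) for s in Q}
print(greedy)
```

Even though 20% of the executed actions are random, the learned greedy policy should match the optimal one for this toy problem ("go" in state 0, "stay" in state 1).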

SARSA (on-policy)

$Q (s_{t}, a_{t}) \leftarrow Q (s_{t}, a_{t}) + α [r_{t} + γ Q (s_{t + 1}, a_{t + 1}) - Q (s_{t}, a_{t})]$

Update toward the action actually taken. Learns $Q^{π}$ for the behavior policy.
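The on-policy/off-policy distinction comes down to one term in the target. The sketch below evaluates both targets on the same transition; the Q-table values and action names are invented for illustration.

```python
def q_learning_target(Q, r, s_next, gamma):
    # Off-policy: bootstrap from the greedy action in s_next
    return r + gamma * max(Q[s_next].values())


def sarsa_target(Q, r, s_next, a_next, gamma):
    # On-policy: bootstrap from the action the behavior policy actually took
    return r + gamma * Q[s_next][a_next]


Q = {"s1": {"left": 1.0, "right": 3.0}}
# Suppose epsilon-greedy explored and picked "left" in s1
print(q_learning_target(Q, r=0.5, s_next="s1", gamma=0.9))            # 0.5 + 0.9 * 3.0
print(sarsa_target(Q, r=0.5, s_next="s1", a_next="left", gamma=0.9))  # 0.5 + 0.9 * 1.0
```

When the behavior policy happens to pick the greedy action, the two targets coincide; they differ only on exploratory steps.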

5. Function approximation and DQN

For continuous or huge state spaces, use a function approximator $Q_{θ}$ .

DQN (Deep Q-Network, Mnih et al. 2015)

Loss:

$L (θ) = E_{(s, a, r, s^{'})} [(r + γ \max_{a^{'}} Q_{θ^{-}} (s^{'}, a^{'}) - Q_{θ} (s, a))^{2}]$

Tricks that made DQN work

Experience replay: store transitions in a buffer; sample uniformly. Breaks temporal correlations.
Target network $θ^{-}$ : snapshot of $θ$ updated infrequently. Prevents the target from chasing itself.
Frame stacking + CNN: handles partial observability of single-frame Atari.

Improvements

Double DQN: decouple action selection (online net) from evaluation (target net) to reduce overestimation bias.
Dueling DQN: separate value $V (s)$ and advantage $A (s, a)$ heads.
Prioritized experience replay: sample by TD error magnitude.
Rainbow: combines these with multi-step returns, distributional RL, and noisy networks.
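The difference between the DQN and Double DQN targets is easiest to see on a single transition. This sketch uses plain lists in place of network outputs; the `done` flag for terminal transitions and the numbers are assumptions for the example.

```python
def dqn_target(r, q_target_next, gamma, done):
    """Standard DQN: the target net both selects and evaluates the next action."""
    return r if done else r + gamma * max(q_target_next)


def double_dqn_target(r, q_online_next, q_target_next, gamma, done):
    """Double DQN: the online net selects the action, the target net evaluates it."""
    if done:
        return r
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[best_action]


q_online_next = [1.0, 2.0, 1.5]  # Q_theta(s', .)   (made-up values)
q_target_next = [1.2, 1.8, 2.5]  # Q_theta^-(s', .) (made-up values)

print(dqn_target(1.0, q_target_next, gamma=0.99, done=False))
print(double_dqn_target(1.0, q_online_next, q_target_next, gamma=0.99, done=False))
```

Here the target net's largest value (2.5) belongs to an action the online net does not prefer, so the Double DQN target is lower. That is the mechanism by which it reduces overestimation.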

6. Policy gradient methods

Directly parameterize the policy $π_{θ} (a ∣ s)$ and optimize via gradient ascent on expected return.

Policy gradient theorem

$\nabla_{θ} J (θ) = E_{π} [\nabla_{θ} \log π_{θ} (a ∣ s) Q^{π} (s, a)]$

The gradient of the return equals the expectation of (gradient of log-probability) × (Q-value).
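A sketch of the standard derivation via the log-derivative trick, written in trajectory form. It assumes an initial-state distribution $p (s_{0})$ (not part of the tuple above) and writes $G (τ)$ for the return $G_{0}$ of trajectory $τ$ .

```latex
\begin{aligned}
J(\theta) &= \mathbb{E}_{\tau \sim p_\theta}[G(\tau)] = \int p_\theta(\tau)\, G(\tau)\, d\tau \\
\nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta(\tau)\, G(\tau)\, d\tau
  = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, G(\tau)\, d\tau \\
\log p_\theta(\tau) &= \log p(s_0) + \sum_t \log \pi_\theta(a_t \mid s_t)
  + \sum_t \log P(s_{t+1} \mid s_t, a_t) \\
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_\theta}\Big[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G(\tau)\Big]
\end{aligned}
```

Only the policy terms depend on $θ$ , so the dynamics drop out of the gradient. Rewards earned before step $t$ do not depend on $a_{t}$ , so $G (τ)$ can be replaced by the return from step $t$ onward, whose conditional expectation given $(s_{t}, a_{t})$ is $Q^{π} (s_{t}, a_{t})$ . That recovers the form above, up to the convention used for weighting states.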

REINFORCE

Use Monte Carlo return $G_{t}$ as an unbiased estimator of $Q$ :

$\nabla_{θ} J \approx \frac{1}{N} \sum_{i} \nabla_{θ} \log π_{θ} (a_{i} ∣ s_{i}) G_{i}$

Pros: simple, unbiased. Cons: high variance.
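A minimal sketch of REINFORCE on the simplest possible problem: a two-armed bandit with a softmax policy, where each episode is a single step so the return equals the reward. The bandit setup, function names, and hyperparameters are all assumptions for illustration. The update uses the fact that for a softmax over logits, the gradient of $\log π (a)$ with respect to logit $k$ is $1 [k = a] - π (k)$ .

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)  # one logit per arm
    for _ in range(steps):
        probs = softmax(theta)
        a = rng.choices(range(len(theta)), weights=probs)[0]
        G = arm_rewards[a]  # one-step episode: return equals reward
        # Gradient ascent on log pi(a) * G
        for k in range(len(theta)):
            grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
            theta[k] += lr * grad_log_pi * G
    return softmax(theta)


probs = reinforce_bandit([0.0, 1.0])
print(probs)  # probability mass should concentrate on the rewarding arm
```

With stochastic rewards the same code still works in expectation, but individual updates become much noisier, which is the variance problem noted above.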

Variance reduction with baselines

$\nabla_{θ} J = E [\nabla_{θ} \log π_{θ} (a ∣ s) (Q^{π} (s, a) - b (s))]$

This holds for any baseline $b (s)$ that doesn't depend on $a$ . Standard choice: $b (s) = V^{π} (s)$ , giving the advantage form:

$\nabla_{θ} J = E [\nabla_{θ} \log π_{θ} (a ∣ s) A^{π} (s, a)]$
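The reason an action-independent baseline leaves the gradient unchanged is a one-line calculation, shown here for a discrete action space (the continuous case replaces the sum with an integral):

```latex
\begin{aligned}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\big]
  &= b(s) \sum_a \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s) \\
  &= b(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)
   = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s)
   = b(s)\, \nabla_\theta 1 = 0
\end{aligned}
```

The step that pulls $b (s)$ out of the sum is exactly where independence from $a$ is needed; a baseline that depends on the action would not factor out and would generally bias the gradient.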

Actor-critic

Train both:

Actor: policy $π_{θ}$ .
Critic: value $V_{ϕ}$ (or $Q_{ϕ}$ ).

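Use the critic to estimate the advantage $A^{π}$ (for example, with the TD error $r_{t} + γ V_{ϕ} (s_{t + 1}) - V_{ϕ} (s_{t})$ ) in the policy gradient. This reduces variance relative to Monte Carlo returns at the cost of some bias from the imperfect critic.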

A2C / A3C

Advantage Actor-Critic (A2C) and Asynchronous Advantage Actor-Critic (A3C) are the synchronous and asynchronous variants of the same advantage-based actor-critic method. Standard before PPO.

7. Trust-region and PPO

Vanilla policy gradient suffers from destructive updates: large step → policy collapses.

Natural policy gradient

Use the Fisher metric to control update magnitude:

$θ \leftarrow θ + α F (θ)^{- 1} \nabla J (θ)$

Step size in the KL geometry, not the parameter geometry. Computationally expensive (Fisher matrix inversion).

TRPO (Schulman et al. 2015)

Constrained optimization: maximize the surrogate objective subject to $KL (π_{old} ∥ π_{θ}) \leq δ$ . Solve via conjugate gradient + line search.

PPO (Schulman et al. 2017)

Replace the constraint with a clipped surrogate. The clean way to write it (and to code it):

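import torch
# Assumed inputs: logp_new, logp_old (log-probs of the taken actions), adv (advantages), eps (e.g. 0.2)
ratio = torch.exp(logp_new - logp_old)              # importance ratio pi_theta(a|s) / pi_old(a|s)
surr1 = ratio * adv
surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * adv  # clipped version
loss = -torch.min(surr1, surr2).mean()              # negated: minimizing the loss ascends the objective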

Equivalent formula:

$L^{CLIP} (θ) = E [\min (r_{t} A_{t}, \text{clip} (r_{t}, 1 - ϵ, 1 + ϵ) A_{t})]$

Standard $ϵ = 0.2$ . When the new policy moves too far in the direction the advantage points, the clip kills the gradient — that's the trust-region effect.
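To see exactly when the clip removes the gradient, it helps to evaluate the per-sample term for the four combinations of ratio side and advantage sign. This is a scalar sketch with hypothetical helper names, not a training loop.

```python
def ppo_clip_term(ratio, adv, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * adv, clipped * adv)


def gradient_flows(ratio, adv, eps=0.2):
    """True when the unclipped term is active, so the term still depends on the ratio."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return ratio * adv <= clipped * adv


for ratio, adv in [(1.5, 1.0), (1.5, -1.0), (0.5, 1.0), (0.5, -1.0)]:
    print(ratio, adv, ppo_clip_term(ratio, adv), gradient_flows(ratio, adv))
```

The gradient vanishes only for (ratio above $1 + ϵ$ , positive advantage) and (ratio below $1 - ϵ$ , negative advantage): the cases where the policy has already moved far in the direction the advantage favors. In the other two out-of-range cases the policy moved the wrong way, and the unclipped term stays active so the update can correct it.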

PPO is simpler than TRPO, more stable than vanilla PG, and the workhorse of modern RL — including RLHF.

GAE (Generalized Advantage Estimation)

A flexible advantage estimator:

$A_{t}^{GAE (λ)} = \sum_{l = 0}^{\infty} (γλ)^{l} δ_{t + l}$

with TD error $δ_{t} = r_{t} + γV (s_{t + 1}) - V (s_{t})$ . $λ$ trades bias and variance:

$λ = 0$ : pure TD (low variance, high bias).
$λ = 1$ : Monte Carlo (high variance, low bias).
Standard: $λ \approx 0.95$ .
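In practice the infinite sum is truncated at the end of the rollout and computed with a backward recursion, $A_{t} = δ_{t} + γλ A_{t + 1}$ . The sketch below assumes `values` carries one extra entry for the bootstrap value of the final state (0 if the episode terminated); that convention and the toy numbers are assumptions for the example.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Advantages for one rollout. values has len(rewards) + 1 entries;
    the last one is the bootstrap value V(s_T), or 0.0 if s_T is terminal."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # A_t = delta_t + gamma*lam*A_{t+1}
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0]
values = [0.5, 0.5, 0.0]  # terminal bootstrap value of 0
print(gae(rewards, values, gamma=0.5, lam=0.0))  # [0.75, 0.5]: the raw TD errors
print(gae(rewards, values, gamma=0.5, lam=1.0))  # [1.0, 0.5]: return minus V(s_t)
```

The two printed cases match the endpoints listed above: with $λ = 0$ each advantage is the one-step TD error, and with $λ = 1$ it is the Monte Carlo return minus the value estimate.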

8. Exploration vs exploitation

Without exploration, the agent can get stuck in a suboptimal policy.

$ϵ$ -greedy: with prob $ϵ$ , random; else greedy.
Boltzmann (softmax): sample from $π (a ∣ s) \propto exp (Q (s, a) / T)$ .
UCB: bonus for less-tried actions: $a = \arg\max_{a} [Q (s, a) + c \sqrt{\log t / N (s, a)}]$ .
Thompson sampling: maintain posterior over $Q$ ; sample and act greedily w.r.t. sample.
Entropy bonus: add $β H (π (\cdot ∣ s))$ to the objective. Used in PPO for LLM alignment.
Curiosity / intrinsic motivation: reward novelty. Useful in sparse-reward tasks.
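The first three strategies in the list are simple enough to sketch directly for a single state with a list of Q-values. The function names are made up, and the UCB variant shown uses a UCB1-style square-root bonus and tries each untried action once first, which is one common convention rather than the only one.

```python
import math
import random


def epsilon_greedy(q, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])


def boltzmann_probs(q, temperature):
    """Softmax over Q / T; high T approaches uniform, low T approaches greedy."""
    m = max(q)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / temperature) for x in q]
    z = sum(exps)
    return [e / z for e in exps]


def ucb_action(q, counts, t, c=2.0):
    """Greedy with respect to Q plus an exploration bonus that shrinks with visit count."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # try every action once before using the bonus
    return max(range(len(q)),
               key=lambda a: q[a] + c * math.sqrt(math.log(t) / counts[a]))


rng = random.Random(0)
q = [0.0, 1.0, 0.5]
print(epsilon_greedy(q, epsilon=0.1, rng=rng))
print(boltzmann_probs(q, temperature=1.0))
print(ucb_action([1.0, 0.9], counts=[100, 1], t=101))  # picks the rarely tried action
```

In the last call, action 1 has a slightly lower value estimate but has been tried only once, so its bonus dominates.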

In LLM RLHF, the KL penalty serves as a regularizer that prevents over-specialization (a form of soft exploration constraint).

9. RL for LLMs (RLHF connection)

In RLHF:

State: prompt + tokens generated so far.
Action: next token.
Reward: from a learned reward model (or rule-based for verifiable tasks like math).
Policy: the LLM itself, $π_{θ} (token ∣ context)$ .
Reference policy: $π_{ref}$ , the SFT model. KL penalty $β KL (π_{θ} ∥ π_{ref})$ prevents drift.

The PPO objective for RLHF:

$L (θ) = E [\text{clipped surrogate} (θ) - β \, KL (π_{θ} ∥ π_{ref})]$
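One common way to realize the KL term in practice is to fold it into the per-token reward rather than the loss: each token is penalized by $β$ times the log-ratio between policy and reference, and the reward-model score is added on the final token. This is a hedged sketch of that convention, not something the objective above prescribes; the function name, argument layout, and numbers are assumptions.

```python
def kl_shaped_rewards(logp_policy, logp_ref, rm_score, beta=0.1):
    """Per-token rewards for one sampled response.

    logp_policy / logp_ref: log-probs of the sampled tokens under pi_theta and pi_ref.
    The log-ratio on sampled tokens is a single-sample estimate of the KL term.
    """
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += rm_score  # sequence-level score lands on the last token
    return rewards


# Made-up numbers: the policy is more confident than the reference on token 0
print(kl_shaped_rewards([-1.0, -2.0], [-1.5, -2.0], rm_score=1.0, beta=0.1))
```

Tokens where the policy assigns higher probability than the reference receive a negative shaping reward, which is what pulls the policy back toward $π_{ref}$ .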

GRPO (DeepSeekMath/R1) is a simplification: drops the learned value/critic network. Advantage is computed from group-relative reward normalization (sample $K$ responses per prompt; advantage is $(r_{i} - μ_{group}) / σ_{group}$ ).

def grpo_advantage(rewards):
    """rewards: [B, K] — K sampled responses per prompt. Returns [B, K] advantages."""
    mu = rewards.mean(dim=-1, keepdim=True)
    sigma = rewards.std(dim=-1, keepdim=True) + 1e-8
    return (rewards - mu) / sigma     # group-relative, no critic needed

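Recent follow-ups (2025) revisit these normalizations. Dr. GRPO drops the $σ$ normalization, which otherwise up-weights prompts with low reward variance, and the per-response length normalization in the loss, which it identifies as a source of length bias. DAPO replaces per-response averaging with a token-level loss.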

10. Common interview gotchas

Question	Common wrong answer	Right answer
Q-learning is on or off policy?	On	Off — uses max over next actions, regardless of behavior
SARSA — on or off?	Off	On — uses the action actually taken
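Why discount?	"Convention"	Keeps the infinite-horizon return finite when rewards are bounded; makes the Bellman operator a $γ$-contraction with a unique fixed point; encodes a preference for sooner rewards
Why not just use return as Q?	"It's biased"	Monte Carlo $G_{t}$ is an unbiased estimate of $Q$ but has high variance; bootstrapping lowers variance at the cost of bias
Why does PPO clip the ratio?	"Why not?"	Removes the incentive to push the ratio outside $[1 - ϵ, 1 + ϵ]$ , which prevents destructive policy updates and keeps training stable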
Advantage = return - baseline. Any baseline works?	Yes	Any baseline that doesn't depend on $a$ doesn't change the gradient's expectation
RLHF uses what RL algo?	DQN	Usually PPO; sometimes DPO (which isn't RL); GRPO in DeepSeek-R1

11. Eight most-asked interview questions

State the Bellman equation for $V^{π}$ and explain. (Recursive expectation; one-step decomposition.)
Q-learning vs SARSA — what's the difference? (Off-policy max vs on-policy actual action.)
Why does DQN need a target network? (Stabilize the target; prevent oscillation.)
Derive the policy gradient theorem. (Log-derivative trick; expectation of $\nabla \log π \cdot Q$ .)
Why use a baseline in REINFORCE? (Reduces variance without introducing bias.)
What does PPO clip and why? (Probability ratio; prevent destructive updates.)
GAE — what does $λ$ control? (Bias-variance: 0 = TD, 1 = Monte Carlo.)
In RLHF, what role does the KL penalty play? (Prevents the policy from drifting too far from SFT/reference; soft constraint.)

12. Drill plan

Memorize Bellman equations (V, Q, optimal V, optimal Q).
Derive policy gradient theorem on paper. 5 minutes.
For each algorithm (Q-learning, SARSA, REINFORCE, A2C, PPO), recite: update rule, on/off-policy, key properties.
Trace one episode of Q-learning with $ϵ$ -greedy on a 2-state MDP.
For RLHF, write the full PPO objective with KL penalty.

13. Further reading

Sutton & Barto, Reinforcement Learning: An Introduction — the canonical text.
Mnih et al. (2015), Human-level control through deep reinforcement learning — DQN.
Schulman et al. (2015), Trust Region Policy Optimization.
Schulman et al. (2017), Proximal Policy Optimization Algorithms.
Schulman et al. (2016), High-Dimensional Continuous Control Using Generalized Advantage Estimation — GAE.
Christiano et al. (2017), Deep RL from Human Preferences — RLHF foundation.
