RL Fundamentals — Interview Grill

50 questions on MDPs, value functions, Q-learning, policy gradients, and PPO, followed by 10 quick-fire checks. Drill until you can answer 35+ of the 50 cold.

A. MDPs and value functions

1. State the components of an MDP. $(S, A, P, R, γ)$ — states, actions, transitions, reward, discount.

2. State the Markov property. $P (s_{t + 1} ∣ s_{t}, a_{t}, s_{t - 1}, \dots) = P (s_{t + 1} ∣ s_{t}, a_{t})$ . Future depends only on current state-action.

3. Define discounted return. $G_{t} = \sum_{k = 0}^{\infty} γ^{k} r_{t + k}$ .
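
A minimal sketch of the return for a finite episode, assuming rewards arrive as a plain list (the function name `discounted_returns` is illustrative). It uses the recursion $G_{t} = r_{t} + γ G_{t + 1}$, which follows directly from the sum above.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] with G_t = sum_k gamma^k * r_{t+k}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```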

4. Why discount? Bounded value when rewards bounded; favors sooner rewards; mathematical convenience (Bellman fixed point unique with $γ < 1$ ).

5. State-value $V^{π}$ vs action-value $Q^{π}$ ? $V^{π} (s) = E_{π} [G_{t} ∣ s_{t} = s]$ . $Q^{π} (s, a) = E_{π} [G_{t} ∣ s_{t} = s, a_{t} = a]$ .

6. Define advantage. $A^{π} (s, a) = Q^{π} (s, a) - V^{π} (s)$ . How much better is action $a$ than the policy's average.

7. Bellman equation for $V^{π}$ ? $V^{π} (s) = E_{a \sim π, s^{'} \sim P} [R + γ V^{π} (s^{'})]$ , with the expectation taken over both the policy's action choice and the environment dynamics.

8. Bellman optimality for $V^{*}$ ? $V^{*} (s) = \max_{a} E [R + γ V^{*} (s^{'})]$ . Take max over actions.

9. Why does value iteration converge? Each iteration shrinks the error by a factor of $γ$ , so it converges geometrically. Formally: the Bellman optimality operator is a $γ$ -contraction in sup-norm — Banach fixed-point theorem then guarantees a unique fixed point and convergence from any start.

B. Dynamic programming

10. Value iteration update? $V_{k + 1} (s) = \max_{a} E [R + γ V_{k} (s^{'})]$ .

11. Convergence rate of value iteration? Geometric, rate $γ$ .
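
The update and its stopping rule can be sketched on a tabular MDP. The transition layout `P[s][a] = [(prob, next_state, reward), ...]` and the toy two-state MDP are assumptions made for illustration only.

```python
def value_iteration(P, gamma, tol=1e-8, max_iters=10_000):
    """Tabular value iteration. P[s][a] is a list of (prob, next_state, reward)."""
    V = [0.0] * len(P)
    for _ in range(max_iters):
        # Bellman optimality backup: max over actions of expected r + gamma * V(s').
        V_new = [
            max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s]
            )
            for s in range(len(P))
        ]
        delta = max(abs(a - b) for a, b in zip(V_new, V))
        V = V_new
        if delta < tol:  # sup-norm change is small, so we are near the fixed point
            break
    return V


# Toy MDP (hypothetical): state 0 can stay (reward 0) or move to state 1 (reward 1);
# state 1 loops on itself with reward 2.
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],
    [[(1.0, 1, 2.0)]],
]
V = value_iteration(P, gamma=0.9)
```

Here $V^{*} (1) = 2 / (1 - 0.9) = 20$ and $V^{*} (0) = 1 + 0.9 \cdot 20 = 19$, which the loop approaches geometrically at rate $γ$.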

12. Policy iteration steps? (1) Policy evaluation: solve $V^{π}$ as a linear system. (2) Policy improvement: $π^{'} (s) = \arg\max_{a} Q^{π} (s, a)$ .

13. Value vs policy iteration — when each? Both find optimal policy. Policy iteration often converges in fewer iterations but each iteration is more expensive (exact policy evaluation).

C. Model-free TD methods

14. TD(0) update for $V$ ? $V (s_{t}) \leftarrow V (s_{t}) + α [r_{t} + γV (s_{t + 1}) - V (s_{t})]$ .

15. What's the TD error? $δ_{t} = r_{t} + γV (s_{t + 1}) - V (s_{t})$ .

16. Q-learning update? $Q (s_{t}, a_{t}) \leftarrow Q (s_{t}, a_{t}) + α [r_{t} + γ \max_{a^{'}} Q (s_{t + 1}, a^{'}) - Q (s_{t}, a_{t})]$ .

17. SARSA update? $Q (s_{t}, a_{t}) \leftarrow Q (s_{t}, a_{t}) + α [r_{t} + γ Q (s_{t + 1}, a_{t + 1}) - Q (s_{t}, a_{t})]$ . Uses next action actually taken.

18. Q-learning vs SARSA: on or off-policy? Q-learning: off-policy (uses max regardless of behavior). SARSA: on-policy (uses behavior policy's action).
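
The two updates differ only in the bootstrap term, which a tabular sketch makes concrete. The table layout `Q[state][action]` and the function names are assumptions; terminal-state handling is left out for brevity.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: bootstrap from the greedy value of the next state."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action the behavior policy actually took."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


# Same transition, same starting table; only the bootstrap differs.
Q_ql = [[0.0, 0.0], [1.0, 5.0]]
Q_sarsa = [[0.0, 0.0], [1.0, 5.0]]
q_learning_update(Q_ql, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
sarsa_update(Q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.9)
```

With the exploratory next action `a_next=0`, SARSA's estimate moves toward $1 + 0.9 \cdot 1$ while Q-learning's moves toward $1 + 0.9 \cdot 5$.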

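19. Why might SARSA learn safer policies? SARSA's targets include the exploratory actions the agent actually takes (e.g., $ϵ$ -greedy), so it learns to avoid paths where a random step is costly (the cliff-walking example). Q-learning learns the values of the greedy policy regardless of how it explores.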

20. Monte Carlo vs TD: bias and variance? MC: unbiased, high variance (uses the full return). TD: biased, lower variance (bootstraps from the current value estimate).

D. DQN

21. DQN loss? $L = E [(r + γ \max_{a^{'}} Q_{θ^{-}} (s^{'}, a^{'}) - Q_{θ} (s, a))^{2}]$ .

22. Why experience replay? Breaks temporal correlation between consecutive samples; allows reuse of data; more iid-like batches for SGD.

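23. Why a target network? Stabilizes training. Without it, the regression target is computed from $Q_{θ}$ itself, so it shifts with every gradient step and the network ends up chasing its own tail. The target network $Q_{θ^{-}}$ is a lagged copy, updated slowly (a hard copy every $K$ steps, or a Polyak average).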

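24. Why does Q-learning overestimate? The max over noisy estimates is biased upward: $E [\max_{a} \hat{Q} (s, a)] \geq \max_{a} E [\hat{Q} (s, a)]$ . The max tends to select whichever action's estimate has positive noise, and bootstrapping propagates that error.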

25. Double DQN fix? Use online net to select action, target net to evaluate: $r + γ Q_{θ^{-}} (s^{'}, \arg\max_{a^{'}} Q_{θ} (s^{'}, a^{'}))$ . Decouples selection and evaluation.
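
A small sketch contrasting the two targets for a single transition, assuming the next-state Q-values from each network are already available as lists (the function names and numbers are made up for illustration).

```python
def dqn_target(r, gamma, q_target_next):
    """Standard DQN: the target net both selects and evaluates the next action."""
    return r + gamma * max(q_target_next)


def double_dqn_target(r, gamma, q_online_next, q_target_next):
    """Double DQN: the online net selects, the target net evaluates."""
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]


q_online_next = [1.0, 3.0, 2.0]   # online net prefers action 1
q_target_next = [4.0, 1.5, 2.0]   # target net happens to overrate action 0
print(dqn_target(0.0, 0.5, q_target_next))                        # 2.0
print(double_dqn_target(0.0, 0.5, q_online_next, q_target_next))  # 0.75
```

The Double DQN target is lower here because the action chosen by the online net is not the one the target net overrates.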

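26. Dueling DQN: what does it split? The network outputs $V (s)$ and $A (s, a)$ in separate heads, then recombines them as $Q (s, a) = V (s) + (A (s, a) - \operatorname{mean}_{a^{'}} A (s, a^{'}))$ . Subtracting the mean makes the split identifiable. Helps most in states where the choice of action barely matters, because $V (s)$ is updated on every transition regardless of the action taken.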
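
The aggregation step on its own, as a sketch that takes the two head outputs as plain numbers (the name `dueling_q` is illustrative; a real network would produce these per batch element).

```python
def dueling_q(v, advantages):
    """Combine a state value and per-action advantages into Q-values."""
    mean_adv = sum(advantages) / len(advantages)
    # Centering the advantages forces mean_a Q(s, a) == V(s).
    return [v + (a - mean_adv) for a in advantages]


print(dueling_q(2.0, [1.0, 0.0, -1.0]))  # [3.0, 2.0, 1.0]
```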

27. Prioritized replay? Sample high-TD-error transitions more often. Importance weights correct the bias.
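
A sketch of proportional prioritization, assuming the common parameterization with a priority exponent `alpha` and an importance-weight exponent `beta` (these names and default values are not from the text above and are stated here as assumptions).

```python
def per_probs_and_weights(td_errors, alpha=0.6, beta=0.4, eps=1e-6):
    """Sampling probabilities and normalized importance weights for a buffer."""
    # Priority grows with |TD error|; eps keeps zero-error transitions sampleable.
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    probs = [p / total for p in priorities]
    n = len(td_errors)
    # Importance weights undo the non-uniform sampling: w_i = (N * P(i))^(-beta).
    weights = [(n * p) ** (-beta) for p in probs]
    max_w = max(weights)
    return probs, [w / max_w for w in weights]  # scale so the largest weight is 1


probs, weights = per_probs_and_weights([0.1, 1.0, 2.0])
```

Transitions with large TD error are sampled more often but receive smaller weights in the loss, which is the bias correction mentioned above.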

E. Policy gradient

28. State the policy gradient theorem. $\nabla_{θ} J (θ) = E_{π} [\nabla_{θ} \log π_{θ} (a ∣ s) \cdot Q^{π} (s, a)]$ . Intuition (the whole point): push up the log-probability of actions, weighted by how good they were. Good action → push it up; bad action → push it down. That's it.

29. Log-derivative trick: what is it? $\nabla \log p (x; θ) = \nabla p (x; θ) / p (x; θ)$ . Lets you write expectation gradient as expectation of (log-prob gradient × value).
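
Written out as a short derivation, assuming $f$ does not depend on $θ$ and that the gradient and integral can be exchanged:

$$
\nabla_{θ} E_{x \sim p_{θ}} [f (x)] = \int \nabla_{θ} p (x; θ) \, f (x) \, dx = \int p (x; θ) \, \nabla_{θ} \log p (x; θ) \, f (x) \, dx = E_{x \sim p_{θ}} [f (x) \, \nabla_{θ} \log p (x; θ)]
$$

The middle step is the identity $\nabla p = p \nabla \log p$; the result is an expectation, so it can be estimated from samples.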

30. REINFORCE estimator? $\nabla J \approx \frac{1}{N} \sum_{i} \nabla \log π (a_{i} ∣ s_{i}) G_{i}$ with $G_{i}$ the empirical return.
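
A sketch of the estimator for the simplest case: a stateless softmax policy over a few actions (a bandit), where the score has the closed form $\nabla_{z} \log π (a) = \text{onehot} (a) - π$. The stateless setting and all function names are simplifying assumptions.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_gradient(logits, actions, returns):
    """Average of grad log pi(a_i) * G_i over sampled (action, return) pairs."""
    grad = [0.0] * len(logits)
    for a, g in zip(actions, returns):
        for i, score in enumerate(grad_log_softmax(logits, a)):
            grad[i] += score * g / len(actions)
    return grad


# Action 0 earned return 1, action 1 earned return 0: the gradient favors action 0.
print(reinforce_gradient([0.0, 0.0], actions=[0, 1], returns=[1.0, 0.0]))
```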

31. Why use a baseline? Reduces variance without bias. $E [\nabla \log π \cdot b (s)] = b (s) E [\nabla \log π] = 0$ for any state-only baseline.
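
The zero-expectation step, spelled out for a discrete action space (an assumption made to keep the notation short; the continuous case replaces the sum with an integral):

$$
E_{a \sim π_{θ} (\cdot ∣ s)} [\nabla_{θ} \log π_{θ} (a ∣ s) \, b (s)] = b (s) \sum_{a} π_{θ} (a ∣ s) \frac{\nabla_{θ} π_{θ} (a ∣ s)}{π_{θ} (a ∣ s)} = b (s) \, \nabla_{θ} \sum_{a} π_{θ} (a ∣ s) = b (s) \, \nabla_{θ} 1 = 0
$$

The argument needs $b$ to be independent of the action; a baseline that depends on $a$ would not factor out of the sum.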

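32. What's the optimal baseline? The variance-minimizing baseline weights $Q^{π}$ by the squared norm of the score: $b^{*} (s) = E [\| \nabla \log π \|^{2} Q^{π} (s, a) ∣ s] / E [\| \nabla \log π \|^{2} ∣ s]$ . In practice $V^{π} (s) = E [Q^{π} (s, a) ∣ s]$ is used: it is close to optimal, easy to learn, and turns the weight into the advantage.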

33. Actor-critic — actor and critic do what? Actor: policy $π_{θ}$ . Critic: value function $V_{ϕ}$ (or $Q_{ϕ}$ ). Critic provides advantage estimates.

34. A2C vs A3C? A2C: synchronous (one update from all parallel actors). A3C: asynchronous (workers update parameters independently).

F. PPO

35. Why does naive policy gradient fail with large updates? Policy can collapse — large step takes you to a region where $π$ assigns near-zero probability to actions you're trying to reinforce. Hard to recover.

36. TRPO constraint? Maximize the surrogate subject to $E_{s} [KL (π_{old} (\cdot ∣ s) ∥ π_{θ} (\cdot ∣ s))] \leq δ$ . The step size is bounded in policy space (KL), not in parameter space.

37. PPO clipped surrogate? $L = E [\min (r A, \operatorname{clip} (r, 1 - ϵ, 1 + ϵ) A)]$ with $r = π_{θ} / π_{old}$ . Standard $ϵ = 0.2$ .
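
A sketch of the clipped surrogate over a batch of precomputed ratios and advantages (the function name and list-based interface are assumptions; a training loop would compute this on tensors and maximize it).

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the batch."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_r = min(max(r, 1.0 - eps), 1.0 + eps)
        # The min keeps the more pessimistic of the clipped and unclipped terms.
        terms.append(min(r * a, clipped_r * a))
    return sum(terms) / len(terms)


# Good action, ratio already too high: gain is capped at (1 + eps) * A.
print(ppo_clip_objective([1.5], [1.0]))
# Bad action, ratio already too low: no further reward for pushing it down.
print(ppo_clip_objective([0.5], [-1.0]))
```

The clip only removes the incentive to move further in the direction that already helps; when the ratio has moved the wrong way (for example $r = 1.5$ with $A < 0$), the unclipped term is kept and the gradient still corrects it.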

38. Why clip ratio $r$ instead of constraining KL? Simpler, no Lagrangian. Heuristic but works extremely well in practice.

39. What's GAE and what does $λ$ control? Intuition: GAE blends short-horizon TD (low variance, bootstrapped from value estimate) and long-horizon Monte Carlo (high variance, true returns). $λ$ slides between them — trade bias vs variance.

Formula: $A_{t}^{GAE (λ)} = \sum_{l \geq 0} (γλ)^{l} δ_{t + l}$ where $δ_{t} = r_{t} + γV (s_{t + 1}) - V (s_{t})$ . $λ = 0$ → pure TD; $λ = 1$ → Monte Carlo. Standard for PPO: $λ \approx 0.95$ .
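
The sum can be computed in one backward pass using $A_{t} = δ_{t} + γλ A_{t + 1}$. The sketch below assumes a single trajectory with no episode boundary inside it, and that `values` carries one extra entry for the bootstrap value of the final state.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE for one trajectory. len(values) == len(rewards) + 1 (last is bootstrap)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0]
values = [0.5, 0.5, 0.0]  # final 0.0: the episode ended, nothing to bootstrap
print(gae_advantages(rewards, values, gamma=1.0, lam=0.0))  # [1.0, 0.5]
print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))  # [1.5, 0.5]
```

With `lam=0.0` each advantage is just the one-step TD error; with `lam=1.0` the first entry equals the Monte Carlo return minus the value estimate, $2 - 0.5$.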

40. Standard $λ$ for PPO? 0.95.

G. Exploration

41. $ϵ$ -greedy? With prob $ϵ$ , random action; else greedy. Simple but widely used.
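
A minimal sketch of the selection rule, assuming Q-values for the current state are given as a list (the `rng` parameter is an illustrative addition that allows seeding).

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # 1
```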

42. Boltzmann exploration? $π (a ∣ s) \propto \exp (Q (s, a) / T)$ . $T$ controls exploration; $T \to 0$ greedy, $T \to \infty$ uniform.
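
The two temperature limits are easy to check numerically. This sketch assumes Q-values for one state as a list; `boltzmann_probs` is an illustrative name.

```python
import math


def boltzmann_probs(q_values, temperature):
    """Softmax over Q / T, computed with the max subtracted for stability."""
    scaled = [q / temperature for q in q_values]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


cold = boltzmann_probs([1.0, 2.0], temperature=0.01)  # nearly all mass on action 1
hot = boltzmann_probs([1.0, 2.0], temperature=1e6)    # nearly uniform
```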

43. UCB principle? Optimism in the face of uncertainty. Add a bonus to less-tried actions: $a = \arg\max_{a} [Q (s, a) + c \sqrt{\log t / N (s, a)}]$ .
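
A sketch of UCB action selection for one state, using the usual UCB1-style bonus $c \sqrt{\log t / N}$. Trying each untried action first, and the default `c=2.0`, are conventions assumed here.

```python
import math


def ucb_action(q_values, counts, t, c=2.0):
    """Choose argmax of Q + c * sqrt(log t / N); untried actions go first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a  # infinite bonus in the limit, so try it once
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])


# Action 1 looks slightly worse but has been tried once versus 100 times.
print(ucb_action([1.0, 0.9], counts=[100, 1], t=101))  # 1
```

Setting `c=0.0` removes the bonus and the rule falls back to greedy selection.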

44. Entropy bonus: what does it do? Adds $β H (π (\cdot ∣ s))$ to the objective being maximized (equivalently, subtracts it from the loss). Encourages diverse actions; prevents premature collapse to a deterministic policy.

45. Curiosity-driven exploration? Reward novelty (unpredicted states). Useful in sparse-reward problems where extrinsic reward signal is rare.

H. RL for LLMs

46. RLHF state, action, reward? State: prompt + generated tokens so far. Action: next token. Reward: from learned reward model at end of sequence (or rule-based for verifiable tasks).

47. Why KL penalty in RLHF? Prevents the policy from drifting too far from the SFT model. Acts as regularization; prevents reward hacking.

48. PPO objective for RLHF? $L = E [\text{clipped surrogate} - β \, KL (π_{θ} ∥ π_{ref})]$ .
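
One common way to apply the KL term in practice is to fold a per-token penalty into the reward before running PPO, rather than adding it to the loss. The sketch below shows that variant under stated assumptions: the per-token log-ratio is used as the KL estimate, the reward-model score lands on the last token only, and `beta=0.1` is an arbitrary illustrative value.

```python
def kl_shaped_rewards(logp_policy, logp_ref, final_reward, beta=0.1):
    """Per-token rewards: -beta * (log pi_theta - log pi_ref), plus RM score at the end."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += final_reward  # sequence-level reward is credited to the last token
    return rewards


# Token 0: policy is more confident than the reference, so it pays a small penalty.
shaped = kl_shaped_rewards([-1.0, -2.0], [-1.5, -2.0], final_reward=1.0)
```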

49. GRPO simplification over PPO? Drops value/critic network. Computes advantage via group-relative reward normalization (sample $K$ responses per prompt, compare rewards within group). Used in DeepSeekMath, DeepSeek-R1.
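
The group-relative advantage can be sketched as a z-score of each response's reward within its prompt group. Using the population standard deviation and a small `eps` for stability are implementation choices assumed here.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of K responses to one prompt: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt, two judged correct (reward 1) and two not.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Every token of a response shares that response's advantage, so no critic is needed. If all responses in a group get the same reward, the advantages are all zero and the group contributes no gradient.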

50. Reward hacking in RLHF? Policy finds high-reward outputs that don't correspond to truly good behavior — exploits reward model errors. Mitigated by KL penalty, robust reward modeling, evaluation on held-out tasks.

Quick fire

51. Q-learning is on/off-policy? Off.

52. SARSA is on/off-policy? On.

53. Discount factor $γ$ range? $[0, 1)$ .

54. DQN target network update? Slowly (every $K$ steps or Polyak).

55. Policy gradient log trick? $\nabla p = p \nabla \log p$ .

56. PPO standard $ϵ$ ? 0.2.

57. GAE $λ$ ? Trade variance vs bias.

58. RLHF main RL algo? PPO (or GRPO).

59. Bellman optimality is fixed point of? $T^{*}$ operator.

60. DPO is RL? No: direct preference optimization, no RL loop.

Self-grading

If you can't answer 1-15, you don't know RL basics. If you can't answer 16-35, you'll struggle on RLHF/PPO interview questions. If you can't answer 36-50, frontier-lab interviews on alignment will go past you.

Aim for 40+/60 cold.

ML & LLM Interview Prep — Deep Dives