RL Fundamentals — Interview Grill

50 questions on MDPs, value functions, Q-learning, policy gradients, PPO, plus 10 quick-fire checks. Drill until you can answer 40+ of the 60 cold.


A. MDPs and value functions

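1. State the components of an MDP. $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: states, actions, transitions, reward, discount.

2. State the Markov property. $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)$. The future depends only on the current state and action.

3. Define discounted return. $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$.

4. Why discount? Bounded value when rewards are bounded; favors sooner rewards; mathematical convenience (the Bellman fixed point is unique with $\gamma < 1$).

5. State-value $V^\pi$ vs action-value $Q^\pi$? $V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$. $Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$.

6. Define advantage. $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$. How much better action $a$ is than the policy's average.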

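7. Bellman equation for $V^\pi$? $V^\pi(s) = \sum_a \pi(a \mid s) [R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^\pi(s')]$: expectation over policy and dynamics.

8. Bellman optimality for $V^*$? $V^*(s) = \max_a [R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s')]$. Take max over actions.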

9. Why does value iteration converge? Each iteration shrinks the error by a factor of $\gamma$, so it converges geometrically. Formally: the Bellman optimality operator is a $\gamma$-contraction in sup-norm, so the Banach fixed-point theorem guarantees a unique fixed point and convergence from any start.
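
A quick numerical check can make the contraction claim in question 9 concrete. The sketch below applies the Bellman optimality operator to two arbitrary value vectors and tracks their sup-norm distance. The 3-state, 2-action MDP (arrays `P` and `R`), the starting vectors, and the helper names are invented for illustration; only NumPy is assumed.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP; all numbers are illustrative.
gamma = 0.9
# P[a, s, t] = probability of moving from state s to state t under action a
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.3, 0.0, 0.7]],
])
# R[s, a] = expected immediate reward
R = np.array([[0.0, 1.0], [0.5, 0.0], [2.0, -1.0]])

def bellman_optimality(V):
    """(T*V)(s) = max_a [R(s,a) + gamma * sum_t P(t|s,a) V(t)]."""
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q.max(axis=1)

def sup_norm_gaps(V, U, sweeps=5):
    """Apply T* to both vectors and record ||V - U||_inf after each sweep."""
    gaps = [np.max(np.abs(V - U))]
    for _ in range(sweeps):
        V, U = bellman_optimality(V), bellman_optimality(U)
        gaps.append(np.max(np.abs(V - U)))
    return gaps

gaps = sup_norm_gaps(np.zeros(3), np.array([10.0, -5.0, 3.0]))
ratios = [after / before for before, after in zip(gaps, gaps[1:])]
print(ratios)  # each ratio should be at most gamma
```

Each printed ratio should come out at or below `gamma`, which is the contraction property at work: two different starting points are pulled together at a geometric rate.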


B. Dynamic programming

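10. Value iteration update? $V_{k+1}(s) = \max_a [R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V_k(s')]$.

11. Convergence rate of value iteration? Geometric, rate $\gamma$: $\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty$.

12. Policy iteration steps? (1) Policy evaluation: solve $V^\pi = R^\pi + \gamma P^\pi V^\pi$ as a linear system. (2) Policy improvement: $\pi'(s) = \arg\max_a Q^\pi(s, a)$.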

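13. Value vs policy iteration: when each? Both find the optimal policy. Policy iteration often converges in fewer iterations, but each iteration is more expensive (exact policy evaluation), so it suits small state spaces where the linear solve is cheap. Value iteration has cheap sweeps and suits larger state spaces.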
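
To see questions 10 through 13 side by side, here is a minimal sketch of both algorithms on the same toy problem. The MDP arrays `P` and `R`, the tolerance, and the function names are assumptions made for illustration, not part of the original notes.

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP; all numbers are illustrative.
gamma = 0.9
P = np.array([  # P[a, s, t] = P(t | s, a)
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.3, 0.0, 0.7]],
])
R = np.array([[0.0, 1.0], [0.5, 0.0], [2.0, -1.0]])  # R[s, a]
n_states = R.shape[0]

def q_from_v(V):
    """One-step lookahead: Q(s,a) = R(s,a) + gamma * sum_t P(t|s,a) V(t)."""
    return R + gamma * np.einsum("ast,t->sa", P, V)

def value_iteration(tol=1e-10):
    V, sweeps = np.zeros(n_states), 0
    while True:
        V_new = q_from_v(V).max(axis=1)
        sweeps += 1
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, q_from_v(V_new).argmax(axis=1), sweeps
        V = V_new

def policy_iteration():
    pi, rounds = np.zeros(n_states, dtype=int), 0
    while True:
        rounds += 1
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi exactly.
        P_pi = P[pi, np.arange(n_states)]   # row s is P(. | s, pi(s))
        R_pi = R[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily with respect to Q^pi.
        new_pi = q_from_v(V).argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return V, pi, rounds
        pi = new_pi

V_vi, pi_vi, sweeps = value_iteration()
V_pi, pi_pi, rounds = policy_iteration()
print(sweeps, rounds)
```

On a problem this small, policy iteration should stop after a handful of rounds while value iteration needs many more sweeps to reach the same tolerance, which is the trade-off question 13 describes.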


C. Model-free TD methods

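14. TD(0) update for $V$? $V(s) \leftarrow V(s) + \alpha [r + \gamma V(s') - V(s)]$.

15. What's the TD error? $\delta = r + \gamma V(s') - V(s)$.

16. Q-learning update? $Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)]$.

17. SARSA update? $Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma Q(s', a') - Q(s, a)]$. Uses next action $a'$ actually taken.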

18. Q-learning vs SARSA: on or off-policy? Q-learning: off-policy (uses max regardless of behavior). SARSA: on-policy (uses behavior policy's action).

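19. Why might SARSA learn safer policies? SARSA's target accounts for the exploration actually performed (e.g., $\epsilon$-greedy), so it may avoid risky paths where a random action is costly. Q-learning learns the optimal greedy values regardless of the behavior policy.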

20. Monte Carlo vs TD: bias and variance? MC: unbiased, high variance (uses the full return). TD: biased, lower variance (bootstraps from the value estimate).
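
The difference between questions 16 and 17 comes down to one term in the target. The tabular sketch below applies both updates to the same transition; the Q-table values, step size, and function names are made up for illustration.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Off-policy target: bootstrap from the best next action."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy target: bootstrap from the next action actually taken."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

# Illustrative Q-table with 2 states and 2 actions.
Q0 = np.array([[0.0, 0.0],
               [1.0, 5.0]])
Q_ql, Q_sarsa = Q0.copy(), Q0.copy()

# Same transition (s=0, a=0, r=1, s'=1); the behavior policy then happened
# to explore and picked the worse action a'=0 in s'.
q_learning_update(Q_ql, s=0, a=0, r=1.0, s_next=1)
sarsa_update(Q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0)
print(Q_ql[0, 0], Q_sarsa[0, 0])
```

Q-learning ignores the exploratory choice and backs up the greedy value, while SARSA backs up the value of the action the agent really took, so its estimate here is lower.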


D. DQN

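21. DQN loss? $L(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} [(r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a))^2]$, where $\theta^-$ are the target-network parameters and $\mathcal{D}$ is the replay buffer.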

22. Why experience replay? Breaks temporal correlation between consecutive samples; allows reuse of data; more iid-like batches for SGD.

23. Why a target network? Stabilizes training. Without it, the target shifts with each update, like chasing your own tail. Update the target slowly (hard copy every $N$ steps, or Polyak averaging).

24. Q-learning overestimates: why? $\max_{a'} Q(s', a')$ tends to overestimate due to noise: $\mathbb{E}[\max_a \hat{Q}(s, a)] \ge \max_a \mathbb{E}[\hat{Q}(s, a)]$. Sampling errors get amplified by the max.

25. Double DQN fix? Use online net to select action, target net to evaluate: $y = r + \gamma Q_{\theta^-}(s', \arg\max_{a'} Q_\theta(s', a'))$. Decouples selection and evaluation.

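26. Dueling DQN: what does it split? Network outputs $V(s)$ and $A(s, a)$ separately, then $Q(s, a) = V(s) + A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a')$. Helps in states where the choice of action barely changes the value.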

27. Prioritized replay? Sample high-TD-error transitions more often. Importance weights correct the bias.
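
For questions 21 and 25, the two targets differ only in how the next action is chosen. The NumPy sketch below computes both for a small batch. The Q-value arrays stand in for network outputs, and the `done` mask for terminal transitions is an added assumption that the formulas above leave out.

```python
import numpy as np

def dqn_target(r, done, q_target_next, gamma=0.99):
    """Standard DQN: the target net both selects and evaluates the action."""
    return r + gamma * (1.0 - done) * q_target_next.max(axis=1)

def double_dqn_target(r, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN: online net selects the action, target net evaluates it."""
    a_star = q_online_next.argmax(axis=1)
    q_eval = q_target_next[np.arange(len(a_star)), a_star]
    return r + gamma * (1.0 - done) * q_eval

# Illustrative batch of 2 transitions with 2 actions each.
r = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])            # second transition is terminal
q_online_next = np.array([[1.0, 2.0], [0.5, 0.1]])
q_target_next = np.array([[3.0, 1.5], [0.2, 0.9]])

y_dqn = dqn_target(r, done, q_target_next)
y_ddqn = double_dqn_target(r, done, q_online_next, q_target_next)
print(y_dqn, y_ddqn)
```

Because the Double DQN target evaluates one particular action rather than the maximum, it can never exceed the standard DQN target computed from the same target network, which is how it damps overestimation.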


E. Policy gradient

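28. State the policy gradient theorem. $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} [\nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a)]$. Intuition (the whole point): push up the log-probability of actions, weighted by how good they were. Good action → push it up; bad action → push it down. That's it.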

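29. Log-derivative trick: what is it? $\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s) \nabla_\theta \log \pi_\theta(a \mid s)$. Lets you write the gradient of an expectation as an expectation of (log-prob gradient × value), which can be estimated from samples.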

30. REINFORCE estimator? $\hat{g} = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t$ with $G_t$ the empirical return.

31. Why use a baseline? Reduces variance without bias. $\mathbb{E}_{a \sim \pi_\theta} [\nabla_\theta \log \pi_\theta(a \mid s) \, b(s)] = 0$ for any state-only baseline $b(s)$.

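32. What's the optimal baseline? $b^*(s) = \frac{\mathbb{E}_a [\|\nabla_\theta \log \pi_\theta(a \mid s)\|^2 \, Q^\pi(s, a)]}{\mathbb{E}_a [\|\nabla_\theta \log \pi_\theta(a \mid s)\|^2]}$ minimizes variance of the gradient estimator. In practice $V^\pi(s)$ is the usual, simpler choice.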

33. Actor-critic: actor and critic do what? Actor: policy $\pi_\theta(a \mid s)$. Critic: value function $V_\phi(s)$ (or $Q_\phi(s, a)$). Critic provides advantage estimates.

34. A2C vs A3C? A2C: synchronous (one update from all parallel actors). A3C: asynchronous (workers update parameters independently).
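
The baseline claims in questions 31 and 32 can be checked exactly on a one-state problem with a softmax policy over logits, where the expectation over actions is a finite sum. The logits, action values, and helper names below are invented for illustration.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def grad_log_pi(logits, a):
    """Gradient of log pi(a) with respect to the logits: one_hot(a) - pi."""
    g = -softmax(logits)
    g[a] += 1.0
    return g

def estimator_stats(logits, q_values, baseline):
    """Exact mean and total variance of grad log pi(a) * (Q(a) - b), a ~ pi."""
    pi = softmax(logits)
    samples = np.array([grad_log_pi(logits, a) * (q_values[a] - baseline)
                        for a in range(len(logits))])
    mean = pi @ samples
    variance = pi @ np.sum((samples - mean) ** 2, axis=1)
    return mean, variance

def optimal_baseline(logits, q_values):
    """Variance-minimizing baseline: Q weighted by squared gradient norms."""
    pi = softmax(logits)
    sq_norms = np.array([np.sum(grad_log_pi(logits, a) ** 2)
                         for a in range(len(logits))])
    return np.sum(pi * sq_norms * q_values) / np.sum(pi * sq_norms)

logits = np.array([0.2, -0.5, 1.0])      # illustrative policy parameters
q_values = np.array([1.0, 3.0, -2.0])    # illustrative action values
v = softmax(logits) @ q_values           # state value under the policy
b_star = optimal_baseline(logits, q_values)

mean_0, var_0 = estimator_stats(logits, q_values, baseline=0.0)
mean_v, var_v = estimator_stats(logits, q_values, baseline=v)
mean_s, var_s = estimator_stats(logits, q_values, baseline=b_star)
print(var_0, var_v, var_s)
```

The three means agree, since any baseline leaves the expected gradient unchanged, while the variance is smallest at `b_star`.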


F. PPO

35. Why does naive policy gradient fail with large updates? The policy can collapse: a large step takes you to a region where $\pi_\theta$ assigns near-zero probability to the actions you're trying to reinforce. Hard to recover.

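36. TRPO constraint? Maximize surrogate subject to $\mathbb{E}_s [D_{\mathrm{KL}}(\pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s))] \le \delta$. Update step in KL geometry.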

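37. PPO clipped surrogate? $L^{\text{CLIP}}(\theta) = \mathbb{E}_t [\min(r_t(\theta) \hat{A}_t, \ \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t)]$ with $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$. Standard $\epsilon = 0.2$.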

38. Why clip ratio instead of constraining KL? Simpler: plain first-order optimization, no constrained solve (TRPO needs conjugate gradient and a line search). Heuristic but works extremely well in practice.

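39. What's GAE and what does $\lambda$ control? Intuition: GAE blends short-horizon TD (low variance, bootstrapped from value estimate) and long-horizon Monte Carlo (high variance, true returns). $\lambda$ slides between them, trading bias vs variance.

Formula: $\hat{A}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$ where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. $\lambda = 0$ → pure TD; $\lambda = 1$ → Monte Carlo. Standard for PPO: $\lambda = 0.95$.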

40. Standard $\lambda$ for PPO? 0.95.
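
Questions 37 and 39 are the two formulas most often asked as whiteboard code. The following is a minimal NumPy sketch of both, assuming a single trajectory with no terminal states in the middle and a `values` array that carries one extra bootstrap entry at the end; the function names and sample numbers are illustrative.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    values has len(rewards) + 1 entries; the last one bootstraps the tail.
    """
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # discounted sum of TD errors
        adv[t] = running
    return adv

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Clipped surrogate (to be maximized), averaged over samples."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))

rewards = np.array([1.0, 0.0, 2.0])
values = np.array([0.5, 0.4, 0.3, 0.0])   # final 0.0: episode ends after step 3

adv_td = gae(rewards, values, gamma=0.9, lam=0.0)   # one-step TD errors
adv_mc = gae(rewards, values, gamma=0.9, lam=1.0)   # Monte Carlo return minus V
adv_std = gae(rewards, values, gamma=0.9, lam=0.95)

# A ratio of 1.5 with positive advantage is clipped at 1 + eps.
obj = ppo_clip_objective(np.log([1.5]), np.log([1.0]), np.array([1.0]))
print(adv_td, adv_mc, adv_std, obj)
```

Running the recursion backwards avoids the explicit infinite sum: each advantage reuses the one computed for the next step.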


G. Exploration

41. $\epsilon$-greedy? With probability $\epsilon$, random action; else greedy. Simple but widely used.

42. Boltzmann exploration? $\pi(a \mid s) \propto \exp(Q(s, a) / \tau)$. Temperature $\tau$ controls exploration; $\tau \to 0$ greedy, $\tau \to \infty$ uniform.

43. UCB principle? Optimism in the face of uncertainty. Add bonus to less-tried actions: $a_t = \arg\max_a [Q(a) + c \sqrt{\ln t / N(a)}]$.

44. Entropy bonus: what does it do? Adds $-\beta H(\pi_\theta(\cdot \mid s))$ to the loss, so minimizing the loss raises entropy. Encourages diverse actions; prevents premature collapse to a deterministic policy.

45. Curiosity-driven exploration? Reward novelty (unpredicted states). Useful in sparse-reward problems where extrinsic reward signal is rare.
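
The three selection rules from questions 41 to 43 fit in a few lines each. This is a sketch for a bandit-style setting with a vector of value estimates `q`; the function names, the exploration constant `c`, and the rule of trying untried actions first are assumptions made for the example.

```python
import numpy as np

def epsilon_greedy(q, eps, rng):
    """With probability eps pick a uniform random action, else the greedy one."""
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def boltzmann_probs(q, tau):
    """Softmax over Q / tau; subtracting the max keeps exp() stable."""
    z = (q - q.max()) / tau
    p = np.exp(z)
    return p / p.sum()

def ucb_action(q, counts, t, c=2.0):
    """Pick argmax of Q plus a bonus that shrinks as an action is tried more."""
    if np.any(counts == 0):
        return int(np.argmin(counts))   # try an untried action first
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmax(q + bonus))

rng = np.random.default_rng(0)
q = np.array([1.0, 2.0, 3.0])

greedy_pick = epsilon_greedy(q, eps=0.0, rng=rng)
cold = boltzmann_probs(q, tau=0.01)   # close to one-hot on the best action
hot = boltzmann_probs(q, tau=1e6)     # close to uniform

# The second action has a slightly lower estimate but was tried only once.
ucb_pick = ucb_action(np.array([1.0, 0.9]), np.array([100, 1]), t=101)
print(greedy_pick, cold, hot, ucb_pick)
```

The UCB call picks the rarely tried action even though its estimate is lower, because its uncertainty bonus dominates.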


H. RL for LLMs

46. RLHF state, action, reward? State: prompt + generated tokens so far. Action: next token. Reward: from learned reward model at end of sequence (or rule-based for verifiable tasks).

47. Why KL penalty in RLHF? Prevents the policy from drifting too far from the SFT model. Acts as regularization; prevents reward hacking.

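48. PPO objective for RLHF? $\max_\theta \ \mathbb{E}_{x \sim \mathcal{D}, \, y \sim \pi_\theta(\cdot \mid x)} [r_\phi(x, y) - \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{SFT}}(y \mid x)}]$, where $r_\phi$ is the learned reward model and $\beta$ scales the KL penalty.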

49. GRPO simplification over PPO? Drops the value/critic network. Computes advantages via group-relative reward normalization (sample $G$ responses per prompt, compare rewards within the group): $\hat{A}_i = (r_i - \mathrm{mean}(r_{1:G})) / \mathrm{std}(r_{1:G})$. Used in DeepSeekMath, DeepSeek-R1.

50. Reward hacking in RLHF? Policy finds high-reward outputs that don't correspond to truly good behavior — exploits reward model errors. Mitigated by KL penalty, robust reward modeling, evaluation on held-out tasks.
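
Two pieces of this section reduce to simple array arithmetic: the KL-penalized reward from questions 47 and 48, and the group-relative advantage from question 49. The sketch below shows one plausible form of each. The function names, the small `eps` added to the standard deviation, and the single-sample KL estimate from summed token log-probabilities are assumptions for illustration, not a reference implementation.

```python
import numpy as np

def kl_penalized_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """Sequence reward minus beta times a single-sample KL estimate.

    logp_policy / logp_ref: per-token log-probs of the sampled response
    under the current policy and the frozen SFT reference.
    """
    kl_estimate = np.sum(logp_policy - logp_ref)
    return rm_score - beta * kl_estimate

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within each prompt's group of sampled responses.

    rewards: array of shape (num_prompts, group_size).
    """
    rewards = np.asarray(rewards, dtype=float)
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Illustrative rewards: 2 prompts, 4 sampled responses each.
rewards = [[1.0, 0.0, 0.0, 1.0],
           [0.5, 0.5, 0.5, 0.5]]
adv = group_relative_advantages(rewards)

shaped = kl_penalized_reward(
    rm_score=2.0,
    logp_policy=np.array([-1.0, -0.5]),
    logp_ref=np.array([-1.5, -1.0]),
    beta=0.1,
)
print(adv, shaped)
```

A group whose responses all earn the same reward gets zero advantage everywhere, so that prompt contributes no gradient signal.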


Quick fire

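51. Q-learning is on/off-policy? Off.

52. SARSA is on/off-policy? On.

53. Discount factor range? $\gamma \in [0, 1)$ ($\gamma = 1$ only for episodic tasks).

54. DQN target network update? Slowly (every $N$ steps or Polyak).

55. Policy gradient log trick? $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$.

56. PPO standard $\epsilon$? 0.2.

57. GAE $\lambda$? Trade variance vs bias.

58. RLHF main RL algo? PPO (or GRPO).

59. $V^*$ is the fixed point of? The Bellman optimality operator $T^*$.

60. DPO is RL? No: direct preference optimization trains on preference pairs with a supervised loss, no RL loop.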


Self-grading

If you can't answer 1-15, you don't know RL basics. If you can't answer 16-35, you'll struggle on RLHF/PPO interview questions. If you can't answer 36-50, frontier-lab interviews on alignment will go past you.

Aim for 40+/60 cold.