RL Fundamentals — Interview Grill
50 questions on MDPs, value functions, Q-learning, policy gradients, PPO, plus 10 quick-fire checks. Drill until you can answer 35+ of the 50 cold.
A. MDPs and value functions
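4. State the components of an MDP. $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$: states, actions, transitions, reward, discount.
5. State the Markov property. $P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)$. The future depends only on the current state-action pair.
6. Define discounted return. $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$.
7. Why discount? Bounded value when rewards bounded; favors sooner rewards; mathematical convenience (Bellman fixed point unique with $\gamma < 1$).
8. State-value $V^\pi$ vs action-value $Q^\pi$? $V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]$. $Q^\pi(s,a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]$.
9. Define advantage. $A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$. How much better action $a$ is than the policy's average.
10. Bellman equation for $V^\pi$? $V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s,a)\,[R(s,a,s') + \gamma V^\pi(s')]$. Expectation over policy and dynamics.
11. Bellman optimality for $V^*$? $V^*(s) = \max_a \sum_{s'} P(s' \mid s,a)\,[R(s,a,s') + \gamma V^*(s')]$. Take max over actions.
12. Why does value iteration converge? Each iteration shrinks the error by a factor of $\gamma$, so it converges geometrically. Formally: the Bellman optimality operator is a $\gamma$-contraction in sup-norm, so the Banach fixed-point theorem guarantees a unique fixed point and convergence from any start.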
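A minimal sketch of the discounted return from question 3, assuming the convention $G_t = \sum_k \gamma^k r_{t+k}$ over a finite episode. The function name `discounted_returns` is illustrative, not from any library. It uses the backward recursion $G_t = r_t + \gamma G_{t+1}$.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] where G_t = sum_k gamma**k * rewards[t + k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backward so each step reuses G_{t+1}: G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)
print(returns)  # [1.5, 1.0, 2.0]
```

The backward pass is O(T), versus O(T^2) for summing each $G_t$ from scratch.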
B. Dynamic programming
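10. Value iteration update? $V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s,a)\,[R(s,a,s') + \gamma V_k(s')]$.
11. Convergence rate of value iteration? Geometric, rate $\gamma$.
12. Policy iteration steps? (1) Policy evaluation: solve $V^\pi = R^\pi + \gamma P^\pi V^\pi$ as a linear system. (2) Policy improvement: $\pi'(s) = \arg\max_a Q^\pi(s,a)$.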
13. Value vs policy iteration — when each? Both find optimal policy. Policy iteration often converges in fewer iterations but each iteration is more expensive (exact policy evaluation).
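A sketch of value iteration on a toy two-state MDP. The MDP, the `P[s][a] = [(prob, next_state, reward), ...]` layout, and the function names are all made up for illustration. It applies the Bellman optimality backup until the sup-norm change falls below a tolerance, then reads off the greedy policy.

```python
def q_value(P, V, s, a, gamma):
    """One-step lookahead: sum over s' of P(s'|s,a) * (R + gamma * V(s'))."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma, tol=1e-10):
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        # Bellman optimality backup for every state.
        new_V = [max(q_value(P, V, s, a, gamma) for a in range(len(P[s])))
                 for s in range(n_states)]
        delta = max(abs(x - y) for x, y in zip(new_V, V))
        V = new_V
        if delta < tol:  # sup-norm change is small enough
            break
    # Greedy policy with respect to the converged values.
    policy = [max(range(len(P[s])), key=lambda a, s=s: q_value(P, V, s, a, gamma))
              for s in range(n_states)]
    return V, policy


# Toy MDP (assumed): state 1 pays 2 per step for staying; state 0 pays 1 to move there.
P = [
    [[(1.0, 0, 0.0)], [(1.0, 1, 1.0)]],  # state 0: action 0 stays, action 1 moves to 1
    [[(1.0, 1, 2.0)], [(1.0, 0, 0.0)]],  # state 1: action 0 stays, action 1 moves to 0
]
V, policy = value_iteration(P, gamma=0.9)
```

By hand: $V^*(1) = 2/(1-0.9) = 20$ and $V^*(0) = 1 + 0.9 \cdot 20 = 19$, which is what the loop converges to.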
C. Model-free TD methods
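14. TD(0) update for $V$? $V(s_t) \leftarrow V(s_t) + \alpha\,[r_t + \gamma V(s_{t+1}) - V(s_t)]$.
15. What's the TD error? $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$.
16. Q-learning update? $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\,[r_t + \gamma \max_{a'} Q(s_{t+1},a') - Q(s_t,a_t)]$.
17. SARSA update? $Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\,[r_t + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)]$. Uses the next action actually taken.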
18. Q-learning vs SARSA: on or off-policy? Q-learning: off-policy (uses max regardless of behavior). SARSA: on-policy (uses behavior policy's action).
19. Why might SARSA learn safer policies? SARSA accounts for the actual exploration (e.g., $\epsilon$-greedy), so it may avoid risky paths. Q-learning learns the optimal greedy policy regardless of how it explores.
20. Monte Carlo vs TD — bias and variance? MC unbiased high variance (uses full return). TD biased lower variance (uses bootstrap).
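A sketch contrasting the two tabular updates on one transition. The `Q[state][action]` table is a list of lists, and the function names are illustrative. The only difference is the bootstrap term: the max over next actions (off-policy) versus the next action actually taken (on-policy).

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: bootstrap from the best next action, whatever was actually taken."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the next action the behavior policy chose."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


# Same transition, same starting table, different bootstrap targets.
Q_ql = [[0.0, 0.0], [1.0, 3.0]]
Q_sarsa = [[0.0, 0.0], [1.0, 3.0]]

q_learning_update(Q_ql, s=0, a=0, r=1.0, s_next=1, alpha=0.5, gamma=0.5)
sarsa_update(Q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0, alpha=0.5, gamma=0.5)

print(Q_ql[0][0])     # 1.25  (target 1 + 0.5 * 3.0)
print(Q_sarsa[0][0])  # 0.75  (target 1 + 0.5 * 1.0, since the exploratory a_next = 0)
```

When the next action is exploratory and poor, SARSA's target drops with it. That is the mechanism behind the "safer policies" answer in question 19.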
D. DQN
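21. DQN loss? $L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q_{\theta^-}(s',a') - Q_\theta(s,a)\right)^2\right]$, with replay buffer $\mathcal{D}$ and target-network parameters $\theta^-$.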
22. Why experience replay? Breaks temporal correlation between consecutive samples; allows reuse of data; more iid-like batches for SGD.
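23. Why a target network? Stabilizes training. Without it, the target shifts with each update, so you end up chasing your own tail. Update the target slowly (hard copy every $C$ steps, or Polyak average).
24. Why does Q-learning overestimate? $\max_{a'} Q(s',a')$ tends to overestimate under noise, because $\mathbb{E}[\max_{a'} Q] \ge \max_{a'} \mathbb{E}[Q]$. Sampling errors get amplified by the max.
25. Double DQN fix? Use the online net to select the action and the target net to evaluate it: $y = r + \gamma\, Q_{\theta^-}(s', \arg\max_{a'} Q_\theta(s',a'))$. Decouples selection and evaluation.
26. Dueling DQN: what does it split? Network outputs $V(s)$ and $A(s,a)$ separately, then $Q(s,a) = V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')$. Better when only some actions matter.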
27. Prioritized replay? Sample high-TD-error transitions more often. Importance weights correct the bias.
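A sketch of the two bootstrap targets from questions 21 and 25, using plain lists of next-state Q-values in place of network outputs. The function names and the `done` flag (no bootstrapping at terminal states) are assumptions added for illustration.

```python
def dqn_target(r, gamma, q_target_next, done):
    """Standard DQN: the target net both selects and evaluates the next action."""
    if done:
        return r
    return r + gamma * max(q_target_next)


def double_dqn_target(r, gamma, q_online_next, q_target_next, done):
    """Double DQN: the online net selects the action, the target net evaluates it."""
    if done:
        return r
    best = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[best]


q_online_next = [1.0, 2.0]   # online net prefers action 1
q_target_next = [3.0, 0.5]   # target net's (possibly noisy) estimates

y_dqn = dqn_target(1.0, 0.5, q_target_next, done=False)
y_double = double_dqn_target(1.0, 0.5, q_online_next, q_target_next, done=False)

print(y_dqn)     # 2.5   (1 + 0.5 * max = 3.0)
print(y_double)  # 1.25  (1 + 0.5 * target value of online argmax = 0.5)
```

Because the selecting and evaluating networks disagree here, the Double DQN target avoids trusting the target net's largest, possibly inflated, estimate.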
E. Policy gradient
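28. State the policy gradient theorem. $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)\right]$. Intuition (the whole point): push up the log-probability of actions, weighted by how good they were. Good action → push it up; bad action → push it down. That's it.
29. Log-derivative trick: what is it? $\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)$. Lets you write the gradient of an expectation as an expectation of (log-prob gradient × value).
30. REINFORCE estimator? $\hat{g} = \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$, with $G_t$ the empirical return.
31. Why use a baseline? Reduces variance without bias. $\mathbb{E}_{a \sim \pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)\right] = 0$ for any state-only baseline $b(s)$.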
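32. What's the optimal baseline? $b(s) = V^\pi(s)$ is the standard choice: it turns the weight into the advantage and gives near-minimal variance. The exact variance minimizer weights by the squared score norm: $b^*(s) = \mathbb{E}_a\left[\lVert \nabla_\theta \log \pi_\theta(a \mid s) \rVert^2\, Q^\pi(s,a)\right] / \mathbb{E}_a\left[\lVert \nabla_\theta \log \pi_\theta(a \mid s) \rVert^2\right]$.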
33. Actor-critic: what do the actor and critic do? Actor: policy $\pi_\theta(a \mid s)$. Critic: value function $V_\phi(s)$ (or $Q_\phi(s,a)$). Critic provides advantage estimates.
34. A2C vs A3C? A2C: synchronous (one update from all parallel actors). A3C: asynchronous (workers update parameters independently).
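A sketch checking the baseline claim from question 31 by exact enumeration, assuming a softmax policy over three actions in a single state. For softmax, the gradient of $\log \pi(a)$ with respect to the logits is one-hot($a$) minus the probabilities. All names and numbers are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(probs, action):
    """Gradient of log pi(action) with respect to the logits of a softmax policy."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def expected_weighted_score(probs, weight_fn):
    """E_{a ~ pi}[grad log pi(a) * weight(a)], computed exactly over all actions."""
    total = [0.0] * len(probs)
    for a, p_a in enumerate(probs):
        g = grad_log_prob(probs, a)
        for i in range(len(probs)):
            total[i] += p_a * g[i] * weight_fn(a)
    return total


probs = softmax([0.5, -1.0, 2.0])
q = [1.0, 0.0, 3.0]   # made-up action values
b = 7.0               # any constant (state-only) baseline

baseline_term = expected_weighted_score(probs, lambda a: b)
grad_q = expected_weighted_score(probs, lambda a: q[a])
grad_adv = expected_weighted_score(probs, lambda a: q[a] - b)
```

`baseline_term` is zero up to floating-point error, so `grad_q` and `grad_adv` agree. The baseline changes the variance of sampled estimates, not the expected gradient.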
F. PPO
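35. Why does naive policy gradient fail with large updates? Policy can collapse: a large step takes you to a region where $\pi_\theta$ assigns near-zero probability to actions you're trying to reinforce. Hard to recover.
36. TRPO constraint? Maximize the surrogate $\mathbb{E}_t\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \hat{A}_t\right]$ subject to $\mathbb{E}_t\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\right)\right] \le \delta_{\mathrm{KL}}$. The update step is taken in KL geometry.
37. PPO clipped surrogate? $L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\hat{A}_t,\ \mathrm{clip}(\rho_t(\theta), 1-\epsilon, 1+\epsilon)\,\hat{A}_t\right)\right]$ with $\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$. Standard $\epsilon = 0.2$.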
38. Why clip ratio instead of constraining KL? Simpler, no Lagrangian. Heuristic but works extremely well in practice.
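39. What's GAE and what does $\lambda$ control? Intuition: GAE blends short-horizon TD (low variance, bootstrapped from the value estimate) and long-horizon Monte Carlo (high variance, true returns). $\lambda$ slides between them, trading bias against variance.
Formula: $\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l\, \delta_{t+l}$ where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. $\lambda = 0$ → pure TD; $\lambda = 1$ → Monte Carlo. Standard for PPO: $\lambda = 0.95$.
40. Standard $\lambda$ for PPO? 0.95.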
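A sketch of the per-sample clipped surrogate from question 37 and the GAE recursion from question 39, in plain Python. It assumes a single finite trajectory where `values` has one extra entry for the bootstrap value of the final state. The function names are illustrative.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one sample."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def gae(rewards, values, gamma, lam):
    """A_t = sum_l (gamma * lam)**l * delta_{t+l}; len(values) == len(rewards) + 1."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0]
values = [0.5, 0.5, 0.0]  # V(s_0), V(s_1), bootstrap V(s_2) = 0 at episode end

adv_td = gae(rewards, values, gamma=0.5, lam=0.0)  # one-step TD errors
adv_mc = gae(rewards, values, gamma=0.5, lam=1.0)  # Monte Carlo return minus V

print(adv_td)  # [0.75, 0.5]
print(adv_mc)  # [1.0, 0.5]
```

For the clip term: with ratio 1.5 and advantage 2.0 the result is capped at about 1.2 × 2.0, so there is no extra gain from pushing the ratio further. With ratio 0.5 and advantage 2.0 the unclipped 1.0 is kept, because the min always takes the more pessimistic value.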
G. Exploration
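41. $\epsilon$-greedy? With prob $\epsilon$, random action; else greedy. Simple but widely used.
42. Boltzmann exploration? $\pi(a \mid s) \propto \exp(Q(s,a)/\tau)$. $\tau$ controls exploration; $\tau \to 0$ greedy, $\tau \to \infty$ uniform.
43. UCB principle? Optimism in the face of uncertainty. Add a bonus to less-tried actions: $a_t = \arg\max_a \left[Q_t(a) + c\sqrt{\ln t / N_t(a)}\right]$.
44. Entropy bonus: what does it do? Adds $-c_{\text{ent}}\, H(\pi(\cdot \mid s))$ to the loss, with a small coefficient $c_{\text{ent}} > 0$. Encourages diverse actions; prevents premature collapse to a deterministic policy.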
45. Curiosity-driven exploration? Reward novelty (unpredicted states). Useful in sparse-reward problems where extrinsic reward signal is rare.
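A sketch of the three action-selection rules from questions 41 to 43 for a single state (bandit-style). Function names, the `rng` argument, and the UCB constant `c=2.0` are assumptions for illustration. Trying every untried action first in `ucb_action` is a common convention, not the only one.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick uniformly at random, otherwise pick the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def boltzmann_probs(q_values, tau):
    """Softmax over Q / tau; small tau approaches greedy, large tau approaches uniform."""
    m = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp((q - m) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def ucb_action(q_values, counts, t, c=2.0):
    """argmax of Q(a) + c * sqrt(ln t / N(a)); untried actions go first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [q + c * math.sqrt(math.log(t) / n) for q, n in zip(q_values, counts)]
    return max(range(len(scores)), key=lambda a: scores[a])


rng = random.Random(0)
greedy_pick = epsilon_greedy([0.0, 1.0, 0.5], epsilon=0.0, rng=rng)
cold = boltzmann_probs([1.0, 2.0, 3.0], tau=0.01)
hot = boltzmann_probs([1.0, 2.0, 3.0], tau=1000.0)
ucb_pick = ucb_action([1.0, 0.9], counts=[100, 1], t=101)
```

In the UCB call, action 1 has a slightly lower estimate but has been tried once against 100 times, so its bonus dominates and it gets picked.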
H. RL for LLMs
46. RLHF state, action, reward? State: prompt + generated tokens so far. Action: next token. Reward: from learned reward model at end of sequence (or rule-based for verifiable tasks).
47. Why KL penalty in RLHF? Prevents the policy from drifting too far from the SFT model. Acts as regularization; prevents reward hacking.
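48. PPO objective for RLHF? $\max_\theta\ \mathbb{E}_{x \sim \mathcal{D}}\left[\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}[r_\phi(x,y)] - \beta\, D_{\mathrm{KL}}\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\right)\right]$, with reward model $r_\phi$ and reference (SFT) policy $\pi_{\text{ref}}$.
49. GRPO simplification over PPO? Drops the value/critic network. Computes advantage via group-relative reward normalization (sample $G$ responses per prompt, compare rewards within the group). Used in DeepSeekMath, DeepSeek-R1.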
50. Reward hacking in RLHF? Policy finds high-reward outputs that don't correspond to truly good behavior — exploits reward model errors. Mitigated by KL penalty, robust reward modeling, evaluation on held-out tasks.
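A sketch of two pieces from this section. The first is a sequence-level KL-penalized reward, using the summed per-token log-ratio as a single-sample KL estimate. The second is GRPO-style group-relative advantages, normalizing by the group mean and standard deviation. The function names, the population standard deviation, and the `eps` stabilizer are assumptions for illustration, not a specific library's implementation.

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta):
    """reward - beta * sum_t [log pi_theta(y_t) - log pi_ref(y_t)] for one response."""
    log_ratio = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward - beta * log_ratio


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards of responses to the same prompt: (r - mean) / (std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


# One response whose tokens are slightly more likely under the policy than the reference.
shaped = kl_penalized_reward(
    reward=1.0, logp_policy=[-1.0, -2.0], logp_ref=[-1.5, -2.0], beta=0.5
)
print(shaped)  # 0.75

# Four sampled responses to one prompt with a verifiable 0/1 reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Here the correct responses get advantages near +1 and the incorrect ones near -1, with no critic involved. If every response in a group gets the same reward, all advantages are zero and that prompt contributes no gradient.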
Quick fire
51. Q-learning is on/off-policy? Off.
52. SARSA is on/off-policy? On.
53. Discount factor range? $\gamma \in [0, 1)$.
54. DQN target network update? Slowly (every $C$ steps or Polyak).
55. Policy gradient log trick? $\nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta$.
56. PPO standard $\epsilon$? 0.2.
57. GAE $\lambda$? Trades variance against bias.
58. RLHF main RL algo? PPO (or GRPO).
59. Bellman optimality is fixed point of? The Bellman optimality operator $T^*$.
60. DPO is RL? No. Direct preference optimization, no RL loop.
Self-grading
If you can't answer 1-15, you don't know RL basics. If you can't answer 16-35, you'll struggle on RLHF/PPO interview questions. If you can't answer 36-50, frontier-lab interviews on alignment will go past you.
Aim for 40+/60 cold.