Topic 45: Reinforcement Learning Fundamentals
🔥 For interviews, read these first:
- RL_DEEP_DIVE.md: frontier-lab deep dive covering MDPs, Bellman equations, value/policy iteration, Q-learning vs SARSA (off-policy vs on-policy), DQN tricks (replay, target net, double/dueling), the policy gradient theorem with derivation, REINFORCE + baselines, actor-critic, A2C, TRPO/PPO with clipped surrogate, GAE, the RLHF connection, and the GRPO simplification.
- INTERVIEW_GRILL.md: 60 active-recall questions.
What You'll Learn
This topic covers the RL foundation underneath modern alignment (RLHF, PPO, GRPO):
- MDP formalism and Bellman equations
- Dynamic programming (value/policy iteration)
- Model-free TD learning (Q-learning, SARSA)
- Function approximation and DQN
- Policy gradient methods (REINFORCE, actor-critic)
- Trust regions and PPO
- Exploration strategies
- RL applied to LLMs (RLHF, GRPO)
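As a concrete taste of the first two items, here is a minimal sketch of value iteration, which applies the Bellman optimality backup repeatedly until the values stop changing. The two-state MDP, the transition table `P`, and the names `q_value` and `value_iteration` are all made up for illustration; they are not taken from the deep-dive file.

```python
# Hypothetical 2-state, 2-action MDP (illustrative only).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9  # discount factor (assumed value)


def q_value(V, s, a, gamma=GAMMA):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-10, gamma=GAMMA):
    """Repeat the Bellman optimality backup until the largest change is below tol."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: max(q_value(V, s, a, gamma) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(V, s, a, gamma)) for s in P}
    return V, policy


if __name__ == "__main__":
    V, policy = value_iteration()
    print(V, policy)
```

In this toy MDP, staying in state 1 with action 1 earns reward 2 forever, so V(1) = 2 / (1 - 0.9) = 20 and V(0) = 1 + 0.9 * 20 = 19. Model-free methods such as Q-learning and SARSA estimate the same quantities from sampled transitions instead of a known `P`.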
Why This Matters
Frontier-lab interviews probe RL not because they want game-playing agents but because RLHF/PPO/GRPO fluency requires understanding the underlying machinery. Bellman equations, advantage estimation, KL regularization — these aren't alignment-specific tricks; they're standard RL.
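To make "advantage estimation" concrete, the following is a rough sketch of generalized advantage estimation (GAE) for a single trajectory. The function name `gae`, the default `gamma` and `lam` values, and the convention that `values` carries one extra bootstrap entry are assumptions chosen for illustration, not the layout of any particular library.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    Assumed convention: values has len(rewards) + 1 entries, where the
    final entry is the bootstrap value of the state after the last step
    (use 0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the accumulated future sum.
    for t in reversed(range(len(rewards))):
        # TD residual: the one-step Bellman error under the current value estimates.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of TD residuals, decayed by gamma * lam.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    print(gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0))
```

With `lam=0` this reduces to the one-step TD residual, and with `lam=1` it becomes the discounted Monte Carlo return minus the value baseline. That bias-variance dial is the same one PPO-based RLHF pipelines typically tune.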
Next Steps
- Topic 8: Post-training and alignment (08_training_techniques), covering RLHF, PPO, DPO, and GRPO in depth.
- Topic 33: Information theory, for the KL divergence machinery used in RLHF.