Topic 45: Reinforcement Learning Fundamentals

🔥 For interviews, read these first:

RL_DEEP_DIVE.md — frontier-lab deep dive: MDPs, Bellman equations, value/policy iteration, Q-learning vs SARSA (on vs off-policy), DQN tricks (replay, target net, double/dueling), policy gradient theorem with derivation, REINFORCE + baselines, actor-critic, A2C, TRPO/PPO with clipped surrogate, GAE, RLHF connection, GRPO simplification.

INTERVIEW_GRILL.md — 60 active-recall questions.

What You'll Learn

This topic covers the RL foundation underneath modern alignment (RLHF, PPO, GRPO):

MDP formalism and Bellman equations
Dynamic programming (value/policy iteration)
Model-free TD learning (Q-learning, SARSA)
Function approximation and DQN
Policy gradient methods (REINFORCE, actor-critic)
Trust regions and PPO
Exploration strategies
RL applied to LLMs (RLHF, GRPO)
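
As a concrete taste of the first two items, here is a minimal sketch of value iteration, which repeatedly applies the Bellman optimality backup V(s) = max_a sum_s' P(s'|s,a) [r + gamma V(s')] until the values settle. The two-state MDP, the `TRANSITIONS` layout, and the function names are hypothetical and chosen only for illustration; they are not taken from the deep-dive material.

```python
# Toy MDP (hypothetical, for illustration only).
# TRANSITIONS[state][action] is a list of (probability, next_state, reward).
TRANSITIONS = [
    [  # state 0
        [(1.0, 0, 0.0)],  # action 0 ("stay"): remain in state 0, reward 0
        [(1.0, 1, 1.0)],  # action 1 ("move"): go to state 1, reward 1
    ],
    [  # state 1
        [(1.0, 1, 2.0)],  # action 0 ("stay"): remain in state 1, reward 2
        [(1.0, 0, 0.0)],  # action 1 ("move"): go to state 0, reward 0
    ],
]


def q_values(transitions, values, state, gamma):
    """One-step lookahead: Q(s, a) = sum_s' P(s'|s,a) * (r + gamma * V(s'))."""
    return [
        sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
        for outcomes in transitions[state]
    ]


def value_iteration(transitions, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Repeat the Bellman optimality backup until the values stop changing."""
    n_states = len(transitions)
    values = [0.0] * n_states
    for _ in range(max_iters):
        new_values = [
            max(q_values(transitions, values, s, gamma)) for s in range(n_states)
        ]
        delta = max(abs(n - o) for n, o in zip(new_values, values))
        values = new_values
        if delta < tol:
            break
    # Greedy policy with respect to the (approximately) converged values.
    policy = []
    for s in range(n_states):
        q = q_values(transitions, values, s, gamma)
        policy.append(q.index(max(q)))
    return values, policy


if __name__ == "__main__":
    values, policy = value_iteration(TRANSITIONS)
    print([round(v, 3) for v in values], policy)
```

For this toy MDP with gamma = 0.9 the fixed point can be checked by hand: V(1) = 2 / (1 - 0.9) = 20 and V(0) = 1 + 0.9 * 20 = 19, with the greedy policy moving out of state 0 and staying in state 1.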

Why This Matters

Frontier-lab interviews probe RL not because they want game-playing agents but because RLHF/PPO/GRPO fluency requires understanding the underlying machinery. Bellman equations, advantage estimation, and KL regularization are not alignment-specific tricks; they are standard RL.

Next Steps

Topic 8: Post-training and alignment (08_training_techniques) — RLHF, PPO, DPO, GRPO in depth.
Topic 33: Information theory — KL divergence machinery used in RLHF.
