Topic 45: Reinforcement Learning Fundamentals
What You'll Learn
This topic covers reinforcement learning fundamentals in plain language:
- Markov Decision Process (MDP)
- Monte Carlo Sampling
- Multi-Armed Bandit
- Q-Learning
- Policy Gradient
- Value Iteration
- Temporal Difference Learning
- Easy-to-understand explanations
Why We Need This
Interview Importance
- Common question: "Explain MDP", "What is Q-learning?"
- RL understanding: Foundation for RLHF, PPO
- Implementation: May ask to implement Q-learning
Real-World Application
- RLHF: Fine-tunes language models with RL against a learned reward model
- Game playing: AlphaGo, game AI
- Robotics: Control systems
- Recommendation: Multi-armed bandit
Industry Use Cases
1. Markov Decision Process
Use Case: Modeling decision problems
- Framework for RL problems
- States, actions, rewards
- Foundation of RL
2. Q-Learning
Use Case: Value-based RL
- Learn optimal action values
- Used in game playing
- Foundation for deep Q-learning
3. Multi-Armed Bandit
Use Case: Exploration vs exploitation
- Online learning
- Recommendation systems
- A/B testing
4. Monte Carlo
Use Case: Policy evaluation
- Estimate values from experience
- Model-free learning
- Used in many RL algorithms
Core Intuition
Reinforcement learning is about learning from interaction rather than from fixed labeled targets.
The core challenge is:
- actions affect future states
- rewards can be delayed
- exploration matters
That is why RL feels different from ordinary supervised learning.
MDP
The MDP is the formal framework for sequential decision-making.
It defines:
- states
- actions
- transitions
- rewards
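The four components above can be sketched as a small data structure. This is a minimal illustration, not a standard library API: the `MDP` class, its `step` method, and the two-state "cold/warm" example are all hypothetical names chosen for this sketch.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    """Hypothetical container for a small, finite MDP."""
    states: list
    actions: list
    # transitions[(state, action)] -> list of (probability, next_state, reward)
    transitions: dict

    def step(self, state, action, rng):
        """Sample (next_state, reward) for taking `action` in `state`."""
        outcomes = self.transitions[(state, action)]
        weights = [p for p, _, _ in outcomes]
        _, next_state, reward = rng.choices(outcomes, weights=weights, k=1)[0]
        return next_state, reward


# Made-up toy problem: heating a room.
toy = MDP(
    states=["cold", "warm"],
    actions=["wait", "heat"],
    transitions={
        ("cold", "wait"): [(1.0, "cold", 0.0)],
        ("cold", "heat"): [(0.8, "warm", 1.0), (0.2, "cold", 0.0)],
        ("warm", "wait"): [(0.5, "warm", 1.0), (0.5, "cold", 0.0)],
        ("warm", "heat"): [(1.0, "warm", 0.5)],
    },
)

rng = random.Random(0)
next_state, reward = toy.step("cold", "heat", rng)
```

Note that `step` only looks at the current state and action, which is the Markov property expressed in code.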
Q-Learning
Q-learning learns action values:
- how good is action a in state s if I continue optimally afterward?
Multi-Armed Bandit
Bandits are the simplest version of the exploration-exploitation problem.
They are useful because the core idea appears in larger RL systems too.
Technical Details Interviewers Often Want
Exploration vs Exploitation
This is one of the most common RL interview themes.
You need to balance:
- using what seems best now
- gathering information that might improve decisions later
Why Q-Learning Is Off-Policy
Q-learning updates toward the greedy future value regardless of the behavior policy that collected the transition.
That is the key reason it is called off-policy.
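A short sketch can make the contrast concrete. The on-policy target shown here is the one used by SARSA, which these notes do not otherwise cover; it is included only for comparison. The function names and the nested-dict Q-table layout are assumptions made for this example.

```python
def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy target: bootstraps from the greedy (max) value in next_state."""
    return reward + gamma * max(Q[next_state].values())


def on_policy_target(Q, reward, next_state, next_action, gamma):
    """On-policy (SARSA-style) target: bootstraps from the action actually taken next."""
    return reward + gamma * Q[next_state][next_action]


# Hypothetical Q-table: Q[state][action]
Q = {"s1": {"left": 1.0, "right": 3.0}}

# Suppose exploration made the behavior policy pick "left" in s1.
off_policy = q_learning_target(Q, reward=0.0, next_state="s1", gamma=0.9)
on_policy = on_policy_target(Q, reward=0.0, next_state="s1", next_action="left", gamma=0.9)
```

The Q-learning target ignores the exploratory "left" and uses the value of "right", so the policy being learned (greedy) differs from the policy that generated the data.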
Monte Carlo vs Temporal Difference
Monte Carlo waits until the end of an episode to compute the full return.
TD methods update earlier, after each step, by bootstrapping from current value estimates.
That distinction is a very common follow-up.
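The difference shows up in what each method needs before it can update. A minimal sketch, assuming a dict-based value table `V` and hypothetical function names: the Monte Carlo helper needs the whole reward sequence, while the TD(0) helper needs a single transition.

```python
def mc_returns(rewards, gamma):
    """Discounted return G_t for every step, computed backward from episode end."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


def td0_update(V, state, reward, next_state, alpha, gamma, done):
    """One TD(0) update, applied right after a single transition."""
    bootstrap = 0.0 if done else V[next_state]  # no future value after a terminal step
    td_error = reward + gamma * bootstrap - V[state]
    V[state] += alpha * td_error
    return td_error


# Monte Carlo: requires the full episode's rewards.
episode_returns = mc_returns([0.0, 0.0, 1.0], gamma=0.5)

# TD(0): uses one transition plus the current estimate of the next state.
V = {"a": 0.0, "b": 1.0}
err = td0_update(V, "a", reward=0.0, next_state="b", alpha=0.5, gamma=0.5, done=False)
```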
Common Failure Modes
- confusing supervised labels with delayed reward signals
- not being able to explain exploration vs exploitation
- forgetting what makes Q-learning off-policy
- treating bandits and full RL as identical problems
Edge Cases and Follow-Up Questions
- Why is RL harder than supervised learning?
- Why do delayed rewards make credit assignment difficult?
- Why is Q-learning off-policy?
- What is the difference between a bandit and a full MDP?
- Why do Monte Carlo and TD methods differ?
What to Practice Saying Out Loud
- The components of an MDP
- Why exploration is necessary
- The conceptual difference between Monte Carlo, TD, and Q-learning
Theory
Markov Decision Process (MDP)
What it is:
- Framework for decision-making under uncertainty
- States, actions, rewards, transitions
- Markov property: the next state and reward depend only on the current state and action, not on earlier history
Q-Learning
What it is:
- Learn action values (Q-values)
- Off-policy learning
- Update rule: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)], where the max is taken over actions a' available in the next state s'
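The update rule can be written as a single tabular step. This is a sketch under stated assumptions: the Q-table is a `defaultdict` keyed by `(state, action)`, unseen pairs default to 0, and `q_update` along with the state and action names are hypothetical.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state, actions, alpha, gamma, done=False):
    """Apply one Q-learning update in place and return the new Q(state, action)."""
    # Greedy value of the next state; zero if the episode ended.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]


Q = defaultdict(float)
actions = ["left", "right"]
Q[("s1", "right")] = 2.0

# target = 1.0 + 0.5 * 2.0 = 2.0, so Q(s0, right) moves halfway from 0.0 to 2.0
new_value = q_update(Q, "s0", "right", reward=1.0, next_state="s1",
                     actions=actions, alpha=0.5, gamma=0.5)
```

In a full agent this step would sit inside a loop that selects actions with some exploratory policy, such as epsilon-greedy.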
Multi-Armed Bandit
What it is:
- Simplest RL problem: a single state, so actions do not affect future states
- Multiple actions (arms), choose best
- Exploration vs exploitation trade-off
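One common way to handle the trade-off is epsilon-greedy selection, sketched below. The Gaussian reward model, the arm means, and the function name `run_epsilon_greedy` are assumptions for illustration only; other strategies (for example UCB or Thompson sampling) exist and are not shown.

```python
import random


def run_epsilon_greedy(true_means, steps, epsilon, seed=0):
    """Simulate an epsilon-greedy agent on a Gaussian bandit (unit variance)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms        # pulls per arm
    estimates = [0.0] * n_arms   # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda i: estimates[i])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental mean: new = old + (x - old) / n
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = run_epsilon_greedy([0.1, 0.5, 0.9], steps=5000, epsilon=0.1)
```

With `epsilon=0` the agent never explores and can lock onto a poor arm after a lucky early reward; with `epsilon=1` it never uses what it has learned.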
Monte Carlo
What it is:
- Learn from complete episodes
- Average returns to estimate values
- Model-free, uses actual experience
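The averaging idea can be sketched as first-visit Monte Carlo policy evaluation. Assumptions for this example: each episode is a list of `(state, reward)` pairs where the reward is the one received after leaving that state, the episodes were already generated by the policy being evaluated, and the function name is hypothetical.

```python
from collections import defaultdict


def mc_policy_evaluation(episodes, gamma):
    """First-visit Monte Carlo: average the return that follows the first visit to each state."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        g = 0.0
        first_visit_return = {}
        # Walk backward so g accumulates the discounted return from each step.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            # Later writes come from earlier time steps, so the first visit wins.
            first_visit_return[state] = g
        for state, g_first in first_visit_return.items():
            returns_sum[state] += g_first
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 0.0)],
]
V = mc_policy_evaluation(episodes, gamma=0.5)
```

No transition model is used anywhere: the estimate comes only from sampled episodes, which is what "model-free" means here.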
Industry-Standard Boilerplate Code
See detailed files for complete implementations:
- rl_fundamentals.py: Complete implementations from scratch
- rl_explanations.md: Easy-to-understand explanations in simple language
- rl_qa.md: Comprehensive interview Q&A
Exercises
- Implement Q-learning
- Implement a multi-armed bandit
- Implement Monte Carlo policy evaluation
- Solve a simple MDP
- Compare different RL algorithms
Next Steps
- Review PPO and RLHF
- Explore deep RL
- Understand policy gradients