Reinforcement Learning Fundamentals: Interview Q&A
Q1: Explain Markov Decision Process (MDP) in simple terms.
Answer:
A Markov Decision Process is like a framework for describing decision-making problems. Imagine you're playing a video game: at each moment, you're in some situation (state), you can take actions, you get rewards, and the game changes. An MDP is just a mathematical way to describe this.
The four key components are:
States: All the possible situations you can be in. In a game, this might be your position, health, items. Think of it as "where you are."
Actions: All the things you can do from each state. In a game: move, jump, shoot. Think of it as "what you can do."
Rewards: When you take an action, you get a reward that tells you if it was good (positive) or bad (negative). Think of it as "how good" your action was.
Transitions: When you take an action, the world changes. The transition tells you: "If I'm in state s and take action a, what's the probability I'll end up in state s'?"
The Markov Property is crucial: it says the future only depends on the current state, not the past. If you know where you are now, you don't need to remember how you got there. This makes problems much simpler.
Why it matters: Almost every RL problem can be described as an MDP. When we "solve an MDP," we find the best way to act (policy) that maximizes total reward over time.
Q2: What is a Multi-Armed Bandit? Explain the exploration vs exploitation trade-off.
Answer:
A Multi-Armed Bandit is the simplest reinforcement learning problem. Imagine you're in a casino with several slot machines. Each machine gives you money when you play it, but you don't know which gives the most. You have limited plays. What should you do?
The Problem:
- Multiple actions (arms/machines)
- Each gives random rewards
- Goal: Find the best one
- Challenge: You can only try them and see what happens
Exploration vs Exploitation:
Exploration means trying different machines to learn which ones are good. You might try a machine you haven't tried before, or one you're not sure about. This helps you discover better options.
Exploitation means playing the machine you think is best based on what you've learned. This maximizes your immediate reward.
The Trade-off:
- If you explore too much, you waste plays on bad machines
- If you exploit too much, you might miss a better machine
- You need to balance both
Simple Strategies:
Epsilon-Greedy: Most of the time (1-ε), play the best machine you know. Sometimes (ε), randomly try a different one. This ensures you keep exploring while mostly playing the best option.
UCB (Upper Confidence Bound): Play machines that either have high average rewards OR haven't been tried much. This automatically balances exploration and exploitation.
Why it matters: Multi-armed bandits are everywhere: A/B testing, recommendation systems, clinical trials, online advertising. They teach you the fundamental trade-off between exploration and exploitation that appears in all RL.
Q3: Explain Q-Learning. How does it work?
Answer:
Q-Learning is a way to learn the best action to take in each situation. The "Q" stands for "Quality" - it learns the quality (value) of taking each action in each state.
The Q-Value:
A Q-value, written as Q(s, a), answers: "If I'm in state s and take action a, then follow the best policy from there, how much total reward will I get?" It's like a score that tells you how good an action is.
How Q-Learning Works:
- Start with random Q-values (you don't know anything yet)
- Try an action, see what reward you get and what state you end up in
- Update your Q-value: "If I was in state s and took action a, I got reward r and ended up in state s'. The best I can do from s' is max Q(s', a'). So the total value should be r + max Q(s', a')."
- Repeat many times
The Update Rule:
Q(s, a) ← Q(s, a) + α [r + γ * max Q(s', a') - Q(s, a)]
Breaking it down:
- Q(s, a): Current estimate of how good action a is in state s
- r: Reward you got
- γ (gamma): Discount factor - how much you care about future rewards
- max Q(s', a'): Best you can do from the new state
- r + γ * max Q(s', a'): What the Q-value should be (target)
- α (alpha): Learning rate - how much to update
Why It Works:
Q-Learning is "off-policy" - it learns the best policy even while following a different policy (like an exploratory one). You can explore randomly but still learn the optimal policy. Over time, Q-values converge to true values, and you just pick the action with the highest Q-value in each state.
Q4: What is Monte Carlo in reinforcement learning? How does it differ from Q-Learning?
Answer:
Monte Carlo methods are named after the famous casino because they use randomness to solve problems. In RL, Monte Carlo methods learn by playing out complete episodes and then looking back at what happened.
How Monte Carlo Works:
Instead of trying to predict what will happen (which is hard), Monte Carlo methods just try things and see what actually happens. You play a complete game (episode), collect all the rewards you got, and then use that experience to learn.
Example: Learning to play chess. A Monte Carlo approach: play a complete game, see if you won or lost, then go back through the game and think "those moves when I won were probably good, those moves when I lost were probably bad."
Monte Carlo Policy Evaluation:
To figure out how good a policy is:
- Follow the policy and play many episodes
- For each state you visited, collect the total reward from that point onward
- Average these rewards - that's your estimate of how good that state is
Key Property: Model-free - you don't need to know how the environment works, just be able to play it and see results.
Difference from Q-Learning:
Monte Carlo:
- Learns from complete episodes (waits until episode ends)
- Uses actual returns (total reward from state to end)
- Can have high variance (depends on episode outcomes)
- Simple but can be slow (need to wait for episodes)
Q-Learning:
- Learns from single steps (updates immediately)
- Uses bootstrapping (estimates future rewards)
- Lower variance (updates more frequently)
- Faster learning (doesn't wait for episodes)
When to Use:
- Monte Carlo: When you have episodic tasks and want simple, model-free learning
- Q-Learning: When you want faster learning and can use bootstrapping
Q5: What is the difference between value-based and policy-based RL methods?
Answer:
Value-Based Methods (like Q-Learning):
Value-based methods learn "how good" each action is in each state (Q-values), then derive the policy from these values. The policy is: "In each state, take the action with the highest Q-value."
Advantages:
- Learn optimal Q-values
- Can be off-policy (learn while exploring)
- Often more sample-efficient
Disadvantages:
- Need to derive policy from values
- Hard to use with continuous actions
- Can only learn deterministic policies (unless combined with other techniques)
Policy-Based Methods (like Policy Gradients):
Policy-based methods learn the policy directly - a function that tells you what action to take in each state. Instead of learning values and deriving actions, you learn actions directly.
Advantages:
- Learn policy directly (what you actually want)
- Can learn stochastic (random) policies
- Work with continuous actions
- More flexible
Disadvantages:
- Often less sample-efficient
- Can have high variance
- Harder to learn optimal policy
Actor-Critic Methods:
These combine both: an "actor" (policy) that learns how to act, and a "critic" (value function) that evaluates how good actions are. This combines benefits of both approaches.
Examples:
- Value-based: Q-Learning, DQN
- Policy-based: REINFORCE, Policy Gradients
- Actor-Critic: A3C, PPO (used in RLHF)
Q6: Explain Value Iteration and Policy Iteration. When would you use each?
Answer:
Value Iteration:
Value Iteration finds the optimal value function (how good each state is), then derives the optimal policy from values.
How it works:
- Start with random values for each state
- For each state, update its value: "My value = best action I can take = max over actions of (immediate_reward + value_of_next_state)"
- Repeat until values stop changing
- Then: optimal policy = in each state, take action that leads to best value
Policy Iteration:
Policy Iteration directly finds the optimal policy:
- Start with a random policy
- Evaluate it: figure out how good each state is under this policy
- Improve it: in each state, switch to the action that's best according to current values
- Repeat until policy stops changing
Key Difference:
- Value Iteration: Updates values until convergence, then derives policy once
- Policy Iteration: Alternates between evaluating policy and improving it
When to Use:
Value Iteration:
- Faster convergence in many cases
- Simpler to implement
- Good when you just need values
Policy Iteration:
- Often converges in fewer iterations
- More intuitive (directly improves policy)
- Good when you care about the policy
Both require: Knowing the environment model (transition probabilities, rewards). If you don't know the model, use model-free methods like Q-Learning or Monte Carlo.
Q7: What is Temporal Difference (TD) Learning?
Answer:
Temporal Difference learning is like learning from your mistakes in real-time, rather than waiting until the end. It's a middle ground between Monte Carlo (wait until episode ends) and dynamic programming (need the model).
The Key Idea:
Instead of waiting to see the complete outcome, TD learning makes a prediction, then immediately updates when it gets new information. It's like guessing the answer to a question, then immediately correcting yourself when you get feedback, rather than waiting until the end of the test.
TD(0) - The Simplest TD:
This updates the value estimate based on the immediate reward and the next state's value:
V(s) ← V(s) + α [r + γ V(s') - V(s)]
The term in brackets is the "TD error" - how wrong your prediction was. If it's positive, the state was better than you thought, so increase its value. If negative, decrease it.
Why TD Learning Matters:
TD learning combines the best of both worlds:
- Model-free like Monte Carlo (doesn't need environment model)
- Learns faster because it doesn't wait for episodes to end
- Lower variance than Monte Carlo (updates more frequently)
Q-Learning is actually a form of TD learning - it uses TD updates to learn Q-values. Understanding TD learning helps you understand many modern RL algorithms.
Summary
These fundamental RL concepts form the foundation for understanding more advanced methods:
- MDP: Framework for describing RL problems
- Multi-Armed Bandit: Simplest RL problem, teaches exploration vs exploitation
- Q-Learning: Value-based method, learns action values
- Monte Carlo: Model-free method, learns from complete episodes
- Policy Gradients: Policy-based method, learns policy directly
- Value/Policy Iteration: Planning methods when you know the model
- TD Learning: Learns from immediate feedback, combines ideas from Monte Carlo and dynamic programming
All of these are tools in the RL toolbox. Modern methods like PPO (used in RLHF) combine ideas from many of these fundamental concepts.