Reinforcement Learning Fundamentals: Easy Explanations

Part 1: Markov Decision Process (MDP)

What is an MDP? (Simple Explanation)

Imagine you're playing a video game. At each moment, you're in a certain situation (state), you can take some action (like move left, jump, shoot), and based on your action, you get a reward (points, health, etc.) and the game moves to a new situation. A Markov Decision Process is just a fancy way of describing this setup mathematically.

The Key Components:

States (S): These are all the possible situations you can be in. In a game, this might be your position, health, what items you have, etc. In a robot, it might be its location and battery level. Think of states as "where you are" in the problem.

Actions (A): These are all the things you can do. In a game: move, jump, shoot. In a robot: turn left, move forward, stop. Actions are "what you can do" from each state.

Rewards (R): When you take an action, you get a reward. This tells you if the action was good (positive reward) or bad (negative reward). In a game, you might get +10 for collecting a coin, -5 for taking damage. Rewards are "how good" your action was.

Transitions (P): When you take an action, the world changes. The transition probability tells you: "If I'm in state s and take action a, what's the probability I'll end up in state s'?" In a deterministic world, this is always 1 for one state and 0 for others. In a stochastic (random) world, you might have probabilities.

The Markov Property: This is a key assumption. It says: "The future only depends on the current state, not the past." In other words, if you know where you are now, you don't need to remember how you got there to decide what to do next. This makes the problem much simpler.

Why MDPs Matter

MDPs are the foundation of reinforcement learning. Almost every RL problem can be described as an MDP. When we say "solve an MDP," we mean: find the best way to act (policy) that maximizes the total reward over time. This is what RL algorithms do - they learn to solve MDPs from experience.


Part 2: Multi-Armed Bandit (Simplest RL Problem)

What is a Multi-Armed Bandit? (Simple Explanation)

Imagine you're in a casino with several slot machines (called "arms"). Each machine gives you money when you play it, but you don't know which machine gives the most money. You have a limited number of plays. What should you do?

This is the multi-armed bandit problem - the simplest reinforcement learning problem. You have multiple actions (playing different machines), and you want to figure out which one is best, but you can only try them and see what happens.

The Exploration vs Exploitation Trade-off:

Exploration: Trying different machines to learn which ones are good. You might try a machine you haven't tried before, or one you're not sure about.

Exploitation: Playing the machine you think is best based on what you've learned so far.

The challenge is: if you explore too much, you waste plays on bad machines. If you exploit too much, you might miss a better machine. You need to balance both.

Simple Strategies:

Epsilon-Greedy: Most of the time (1-ε), play the best machine you know. Sometimes (ε), randomly try a different one. This ensures you keep exploring while mostly playing the best option.

Upper Confidence Bound (UCB): Play machines that either have high average rewards OR haven't been tried much. This automatically balances exploration and exploitation.

Thompson Sampling: Use probability - play machines that have a high probability of being the best, based on what you've seen so far.

Why Multi-Armed Bandits Matter

Multi-armed bandits are everywhere in real life: A/B testing (which website design is better?), recommendation systems (which product to show?), clinical trials (which treatment to test?), online advertising (which ad to show?). They're the simplest RL problem, but they teach you the fundamental trade-off between exploration and exploitation that appears in all RL.


Part 3: Monte Carlo Methods

What is Monte Carlo? (Simple Explanation)

Monte Carlo methods are named after the famous casino in Monaco, because they use randomness (like rolling dice) to solve problems. In reinforcement learning, Monte Carlo methods learn by playing out complete episodes and then looking back at what happened.

How It Works:

Instead of trying to predict what will happen (which is hard), Monte Carlo methods just try things and see what actually happens. You play a complete game (episode), collect all the rewards you got, and then use that experience to learn.

Example: Imagine learning to play chess. A Monte Carlo approach would be: play a complete game, see if you won or lost, and then go back through the game and think "those moves I made when I won were probably good, those moves when I lost were probably bad."

Key Idea: You don't need to know the rules of the game perfectly. You just need to be able to play it and see the final result. Then you can learn from that experience.

Monte Carlo Policy Evaluation:

This is about figuring out how good a policy (way of acting) is. You do this by:

  1. Following the policy and playing many episodes
  2. For each state you visited, collect the total reward you got from that point onward
  3. Average these rewards - that's your estimate of how good that state is under this policy

Why It's Called "Monte Carlo":

The name comes from using randomness (like the randomness in a casino) to solve problems. In RL, the randomness comes from the environment (random outcomes) and sometimes from the policy (random actions for exploration).

Why Monte Carlo Matters

Monte Carlo methods are model-free - you don't need to know how the environment works, you just need to be able to interact with it. This makes them very practical. They're also simple to understand and implement. Many RL algorithms use Monte Carlo ideas, and understanding Monte Carlo helps you understand more complex methods.


Part 4: Q-Learning

What is Q-Learning? (Simple Explanation)

Q-Learning is a way to learn the best action to take in each situation. The "Q" stands for "Quality" - it learns the quality (value) of taking each action in each state.

The Q-Value:

A Q-value, written as Q(s, a), answers the question: "If I'm in state s and take action a, then follow the best policy from there, how much total reward will I get?" It's like a score that tells you how good an action is.

How Q-Learning Works:

Q-Learning learns by trying actions and seeing what happens. Here's the simple idea:

  1. Start with random Q-values (you don't know anything yet)
  2. Try an action, see what reward you get and what state you end up in
  3. Update your Q-value: "If I was in state s and took action a, I got reward r and ended up in state s'. The best I can do from s' is max Q(s', a'). So the total value should be r + max Q(s', a')."
  4. Repeat many times

The Update Rule:

Q(s, a) ← Q(s, a) + α [r + γ * max Q(s', a') - Q(s, a)]

Let's break this down:

  • Q(s, a): Current estimate of how good action a is in state s
  • r: Reward you got
  • γ (gamma): Discount factor - how much you care about future rewards (0 = only care about now, 1 = care equally about future)
  • max Q(s', a'): Best you can do from the new state
  • r + γ * max Q(s', a'): What the Q-value should be (target)
  • α (alpha): Learning rate - how much to update (0 = don't change, 1 = completely replace)

Why It Works:

Q-Learning is "off-policy" - it learns the best policy even while following a different policy (like an exploratory one). This is powerful because you can explore randomly but still learn the optimal policy. Over time, as you try different actions, the Q-values converge to the true values, and then you just pick the action with the highest Q-value in each state.

Why Q-Learning Matters

Q-Learning is one of the most important RL algorithms. It's simple, works well, and is the foundation for Deep Q-Networks (DQN) which use neural networks to approximate Q-values. Understanding Q-Learning helps you understand value-based RL methods and how they differ from policy-based methods like policy gradients.


Part 5: Policy Gradient Methods

What are Policy Gradients? (Simple Explanation)

While Q-Learning learns "which action is best" (value-based), policy gradients learn "how to act" directly (policy-based). Instead of learning Q-values and then picking the best action, policy gradients directly learn a policy - a function that tells you what action to take in each state.

The Policy:

A policy π(a|s) is a probability distribution over actions given a state. It answers: "In state s, what's the probability I should take action a?" A good policy gives high probability to good actions and low probability to bad actions.

How Policy Gradients Work:

The idea is simple: try actions, see which ones led to good outcomes, and increase their probability. Try actions that led to bad outcomes, and decrease their probability.

The Process:

  1. Follow your current policy and collect experience (states, actions, rewards)
  2. For each action you took, look at the total reward you got from that point onward
  3. If the reward was good, increase the probability of that action in that state
  4. If the reward was bad, decrease the probability

The Math (Simplified):

The policy gradient tells you: "To increase the total reward, adjust the policy parameters in this direction." It's like saying "if taking action a in state s led to high reward, make π(a|s) bigger."

Why Policy Gradients Matter:

Policy gradients are flexible - they can learn stochastic (random) policies, work with continuous actions, and are the foundation for methods like PPO used in RLHF. They're also more direct - you're learning exactly what you want (how to act), not learning values and then deriving actions.


Part 6: Value Iteration and Policy Iteration

What are These? (Simple Explanation)

These are methods for solving MDPs when you know everything about the environment (the model). They're like having a map and planning the best route, rather than exploring and learning.

Value Iteration:

This finds the optimal value function (how good each state is). The idea:

  1. Start with random values for each state
  2. For each state, update its value: "My value is the best I can do - take the action that gives me the best immediate reward plus the value of the next state"
  3. Repeat until values stop changing

Once you have optimal values, you can derive the optimal policy: in each state, take the action that leads to the best value.

Policy Iteration:

This directly finds the optimal policy:

  1. Start with a random policy
  2. Evaluate it: figure out how good each state is under this policy
  3. Improve it: in each state, switch to the action that's best according to the current values
  4. Repeat until the policy stops changing

When to Use:

These methods work when you know the environment model (transition probabilities, rewards). In many real problems, you don't know the model, so you use model-free methods like Q-Learning or policy gradients instead.


Part 7: Temporal Difference Learning

What is TD Learning? (Simple Explanation)

Temporal Difference (TD) learning is like learning from your mistakes in real-time, rather than waiting until the end. It's a middle ground between Monte Carlo (wait until episode ends) and dynamic programming (need the model).

The Key Idea:

Instead of waiting to see the complete outcome, TD learning makes a prediction, then immediately updates when it gets new information. It's like guessing the answer to a question, then immediately correcting yourself when you get feedback, rather than waiting until the end of the test.

TD(0) - The Simplest TD:

This updates the value estimate based on the immediate reward and the next state's value:

V(s) ← V(s) + α [r + γ V(s') - V(s)]

The term in brackets is the "TD error" - how wrong your prediction was. If it's positive, the state was better than you thought, so increase its value. If negative, decrease it.

Why TD Learning Matters:

TD learning combines the best of both worlds: it's model-free like Monte Carlo (doesn't need the environment model), but it learns faster because it doesn't wait for episodes to end. Q-Learning is actually a form of TD learning. TD learning is fundamental to understanding many modern RL algorithms.


Summary: How These Concepts Fit Together

MDP is the framework - it describes the problem (states, actions, rewards, transitions).

Multi-Armed Bandit is the simplest RL problem - teaches you exploration vs exploitation.

Monte Carlo is a way to learn - play complete episodes and learn from the outcomes.

Q-Learning is a value-based method - learns which actions are best in each state.

Policy Gradients are a policy-based method - learns how to act directly.

Value/Policy Iteration solve MDPs when you know the model.

TD Learning learns from immediate feedback, combining ideas from Monte Carlo and dynamic programming.

All of these are tools in the RL toolbox. Different problems need different tools, and modern methods like PPO combine ideas from many of these fundamental concepts.