Reinforcement Learning
Lesson 2 ⏱ 14 min

Markov Decision Processes

Video coming soon

The Math Behind RL: Markov Decision Processes

Derives the MDP tuple from the gridworld example, introduces the Markov property, and walks through computing expected return with discounting.

⏱ ~9 min

🧮 Quick refresher

Probability and conditional distributions

Probability distributions describe the likelihood of outcomes. Conditional probability P(A|B) gives the probability of A given B. Expectation E[X] is the weighted average over all possible values.

Example

P(rain tomorrow | cloudy today) is a conditional probability — RL's transition function P(s'|s,a) is exactly this kind of conditional distribution.

From Intuition to Math

In the previous lesson we described RL informally: agent, environment, state, action, reward. But to derive algorithms, we need a precise mathematical framework. The Markov Decision Process (MDP) is that framework.

The MDP is not just theory — it's the specification language for every real RL problem. AlphaGo, OpenAI Five, and robotic manipulation systems are all MDPs with different state spaces, action spaces, and reward functions. Learning to formalize a problem as an MDP is the first practical skill in applied RL, because it forces you to define exactly what the agent observes, what it can do, and what it's rewarded for.

The MDP Tuple

Before the formulas: think of the MDP tuple as a recipe card that fully specifies any decision problem. S is every situation you can be in, A is everything you can do, P captures how the world responds to your choices, R is your score card, and γ is how much you care about future scores versus immediate ones. Fill in all five and you have formally defined an RL problem — before writing a single line of code.

An MDP is defined by five components:

\text{MDP} = (S,\; A,\; P,\; R,\; \gamma)

  • S — state space: the set of all states the environment can be in (e.g., all possible chessboard configurations)
  • A — action space: the set of all actions the agent can take (e.g., all legal chess moves)
  • P(s′ | s, a) — transition function: the probability of moving to state s′ when taking action a in state s
  • R — reward function: the expected immediate reward received for the transition
  • γ ∈ [0, 1) — discount factor: how much to value future rewards
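
As a quick illustration, the tuple can be written down directly as data. The sketch below uses a made-up two-state weather problem; the state names, probabilities, and rewards are invented for illustration, not part of the lesson's running example.

```python
# A toy MDP spelled out as plain Python data structures.
# States, actions, probabilities, and rewards here are invented for illustration.

states = ["sunny", "rainy"]          # S: state space
actions = ["walk", "drive"]          # A: action space
gamma = 0.9                          # gamma: discount factor

# P[s][a] maps each next state s' to the probability P(s' | s, a)
P = {
    "sunny": {"walk":  {"sunny": 0.8, "rainy": 0.2},
              "drive": {"sunny": 0.9, "rainy": 0.1}},
    "rainy": {"walk":  {"sunny": 0.3, "rainy": 0.7},
              "drive": {"sunny": 0.5, "rainy": 0.5}},
}

# R[s][a] is the expected immediate reward for taking action a in state s
R = {
    "sunny": {"walk": 2.0, "drive": 1.0},
    "rainy": {"walk": -1.0, "drive": 0.5},
}

# Sanity check: every transition distribution must sum to 1
for s in states:
    for a in actions:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```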

The Markov Property

The MDP framework rests on a key assumption: the Markov property:

P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t)

  • s_t — current state
  • a_t — current action
  • s_{t+1} — next state
  • h_t — history of all past states and actions

The current state is a sufficient statistic of history. Once you know where you are, you don't need to know how you got there to predict where you'll go.
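
One way to see the Markov property in code: a simulator for an MDP only ever needs the current state and action to sample the next state. The transition table below is a minimal sketch with invented probabilities.

```python
import random

# Markov dynamics: the next-state distribution depends only on (s, a).
# Transition probabilities here are made up for illustration.
P = {
    ("s1", "go"): {"s1": 0.3, "s2": 0.7},
    ("s2", "go"): {"s1": 0.6, "s2": 0.4},
}

def sample_next_state(s, a):
    """Note the signature: only the current state and action are needed,
    never the history of earlier states and actions."""
    dist = P[(s, a)]
    return random.choices(list(dist), weights=list(dist.values()))[0]

s = "s1"
for _ in range(5):
    s = sample_next_state(s, "go")   # history is irrelevant at every step
print(s)
```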

Return and Discounting

The agent's objective is to maximize total future reward. The return G_t from time t is:

G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k}

  • G_t — return (discounted sum of future rewards) from time t
  • r_{t+k} — reward received k steps after time t
  • γ — discount factor: the weight on a reward k steps in the future is γ^k

Why discount? Three reasons:

  1. Mathematical: ensures the sum converges (geometric series with |γ| < 1)
  2. Economic: money now is worth more than money later (time preference)
  3. Uncertainty: the further in the future, the less certain the reward actually arrives

With γ = 0, only the immediate reward matters. As γ → 1, the agent weighs distant rewards nearly as heavily as immediate ones (appropriate for long-horizon tasks). Typical values: 0.95–0.99.

Worked example: rewards [2, 0, 3, 1] with γ = 0.9: G₀ = 2 + 0.9×0 + 0.81×3 + 0.729×1 = 2 + 0 + 2.43 + 0.729 = 5.159
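
The same arithmetic in a few lines of Python (a minimal sketch using the reward list and γ from the worked example):

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = sum_k gamma^k * r_k for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([2, 0, 3, 1], gamma=0.9))  # 5.159
```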

Value Functions

A state-value function V^π(s) gives the expected return from a state under policy π:

V^\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]

  • V^π(s) — expected return starting from state s, following policy π
  • G_t — return from time t
  • π — policy: a mapping from states to action probabilities

An action-value function (Q-function):

Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s,\; a_t = a]

  • Q^π(s, a) — expected return starting from state s, taking action a, then following π

The Q-function gives per-action granularity, and a greedy policy can be recovered directly from it: π(s) = argmax_a Q(s, a).
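
With a tabular Q-function, recovering the greedy policy is a one-line argmax per state. The Q-values below are made up for illustration.

```python
# Hypothetical Q-values for two states and two actions (illustration only).
Q = {
    ("s1", "left"): 0.2, ("s1", "right"): 0.7,
    ("s2", "left"): 0.5, ("s2", "right"): 0.1,
}
actions = ["left", "right"]

# Greedy policy: pi(s) = argmax_a Q(s, a)
policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in ["s1", "s2"]}
print(policy)  # {'s1': 'right', 's2': 'left'}
```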

Gridworld Worked Example

A 3×2 gridworld with states labeled 1–6:

[ 1 ][ 2 ][ G ]   ← G = goal (reward +1, terminal)
[ 4 ][ 5 ][ 6 ]   ← All other transitions: reward 0

Actions: {up, down, left, right}. Hitting a wall stays in place. γ = 0.9. Random policy: uniform over available actions.

From state 2 (one step from the goal), the expected return V^π(2) decomposes over the three available moves:

  • With prob 1/3: move right → reach G, get reward 1. Return = 1.
  • With prob 1/3: move left → state 1, then more steps needed.
  • With prob 1/3: move down → state 5.

Solving the full system iteratively gives V^π(2) ≈ 0.39 under random policy, and V*(2) ≈ 0.90 under the optimal policy (always move toward the goal when possible).
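
For reference, here is a minimal iterative policy evaluation sketch for a gridworld like this one. It assumes the random policy is uniform over all four actions, that moves into a wall leave the agent in place, and that the +1 reward is earned on the transition into the goal; the exact value it reports for each cell depends on those conventions, so treat its output as illustrative rather than a reproduction of the numbers above.

```python
# Iterative policy evaluation for a 3-column x 2-row gridworld (a sketch).
# Assumptions: uniform-random policy over the four actions, bumping into a
# wall keeps the agent in place, reward +1 on the transition into the goal.

GOAL = (0, 2)                                   # top-right cell, terminal state G
GAMMA = 0.9
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right
states = [(r, c) for r in range(2) for c in range(3)]

def step(state, action):
    """Deterministic dynamics: move, or stay in place when hitting a wall."""
    r, c = state[0] + action[0], state[1] + action[1]
    if not (0 <= r < 2 and 0 <= c < 3):
        return state, 0.0                       # bumped into a wall
    next_state = (r, c)
    return next_state, (1.0 if next_state == GOAL else 0.0)

V = {s: 0.0 for s in states}                    # value table; the goal keeps V = 0
for _ in range(100):                            # plenty of sweeps for gamma = 0.9
    for s in states:
        if s == GOAL:
            continue
        V[s] = sum(0.25 * (reward + GAMMA * V[s_next])
                   for s_next, reward in (step(s, a) for a in ACTIONS))

print({s: round(v, 2) for s, v in V.items()})   # estimated V^pi for each cell
```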

Policy and the MDP Connection

A deterministic policy picks one action per state: π(s) = a. A stochastic policy assigns probabilities: π(a|s) = P(action = a | state = s).

The relationship between Q and V:

V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)

  • π(a|s) — probability of taking action a in state s
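
Numerically, this is just a probability-weighted average of Q-values; a tiny sketch with invented numbers:

```python
# Relationship V^pi(s) = sum_a pi(a|s) * Q^pi(s, a), with made-up values.
pi = {"left": 0.25, "right": 0.75}          # pi(a | s) for one state s
Q  = {"left": 0.40, "right": 0.80}          # Q^pi(s, a) for the same state

V = sum(pi[a] * Q[a] for a in pi)
print(V)  # 0.25*0.40 + 0.75*0.80 = 0.70
```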

Interactive example

Set rewards on a gridworld, adjust γ, and compute V^π iteratively for any hand-drawn policy

Coming soon

Summary

  • An MDP is (S, A, P, R, γ) — a formal model of sequential decision-making.
  • The Markov property: future depends only on current state, not history.
  • Return G_t is the discounted sum of future rewards; γ controls how far ahead the agent looks.
  • V^π(s) is the expected return from state s under policy π.
  • Q^π(s,a) is the expected return when taking action a in state s, then following π.
  • RL's goal: find the optimal policy π* that maximizes V^π everywhere.

Quiz


What does the Markov property say about a state s_t?