From Intuition to Math
In the previous lesson we described RL informally: agent, environment, state, action, reward. But to derive algorithms, we need a precise mathematical framework. The Markov Decision Process (MDP) is that framework.
The MDP is not just theory — it's the specification language for every real RL problem. AlphaGo, OpenAI Five, and robotic manipulation systems are all MDPs with different state spaces, action spaces, and reward functions. Learning to formalize a problem as an MDP is the first practical skill in applied RL, because it forces you to define exactly what the agent observes, what it can do, and what it's rewarded for.
The MDP Tuple
Before the formulas: think of the MDP tuple as a recipe card that fully specifies any decision problem. S is every situation you can be in, A is everything you can do, P captures how the world responds to your choices, R is your score card, and γ is how much you care about future scores versus immediate ones. Fill in all five and you have formally defined an RL problem — before writing a single line of code.
An MDP is defined by five components; a small code sketch follows the list:
- State space S: all possible states the environment can be in (e.g., all possible chessboard configurations)
- Action space A: all possible actions the agent can take (e.g., all legal chess moves)
- Transition function P(s' | s, a): the probability of transitioning to state s' when taking action a in state s
- Reward function R(s, a, s'): the expected immediate reward for that transition
- Discount factor γ ∈ [0, 1): how much to value future rewards relative to immediate ones
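To make the tuple concrete, here is a minimal sketch of the five components as plain Python data for a tiny, made-up two-state MDP (the "sunny/rainy" states, actions, and numbers are invented purely for illustration, not taken from any particular environment):

```python
# The five MDP components as plain Python data (illustrative toy problem).
states = ["sunny", "rainy"]      # S
actions = ["walk", "drive"]      # A
gamma = 0.9                      # discount factor γ

# P[s][a] maps next state s' -> probability P(s' | s, a)
P = {
    "sunny": {"walk":  {"sunny": 0.8, "rainy": 0.2},
              "drive": {"sunny": 0.9, "rainy": 0.1}},
    "rainy": {"walk":  {"sunny": 0.3, "rainy": 0.7},
              "drive": {"sunny": 0.5, "rainy": 0.5}},
}

# R[s][a] is the expected immediate reward for taking action a in state s
R = {
    "sunny": {"walk": 1.0,  "drive": 0.5},
    "rainy": {"walk": -1.0, "drive": 0.0},
}

# Sanity check: every transition distribution sums to 1
for s in states:
    for a in actions:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```

Filling in these five pieces is exactly the formalization step described above: once S, A, P, R, and γ are written down, the problem is fully specified and any RL algorithm can be pointed at it.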
The Markov Property
The MDP framework rests on a key assumption: the Markov property:

P(s_{t+1} | s_t, a_t) = P(s_{t+1} | s_0, a_0, s_1, a_1, …, s_t, a_t)

where:
- s_t: current state
- a_t: current action
- s_{t+1}: next state
- s_0, a_0, …, s_{t-1}, a_{t-1}: history of all past states and actions
The current state is a sufficient statistic of the history. Once you know where you are, you don't need to know how you got there to predict where you'll go.
Return and Discounting
The agent's objective is to maximize total future reward. The return G_t from time t is:

G_t = R_t + γR_{t+1} + γ²R_{t+2} + … = Σ_{k=0}^∞ γ^k R_{t+k}

where:
- G_t: the return (discounted sum of future rewards) from time t
- R_{t+k}: the reward received k steps after time t
- γ: the discount factor; the weight on a reward k steps in the future is γ^k
Why discount? Three reasons:
- Mathematical: ensures the sum converges (geometric series with |γ| < 1)
- Economic: money now is worth more than money later (time preference)
- Uncertainty: the further in the future, the less certain the reward actually arrives
With γ = 0, only the immediate reward matters. As γ → 1, distant rewards are weighted nearly as heavily as immediate ones (appropriate for long-horizon tasks). Typical values: 0.95–0.99.
Worked example: rewards [2, 0, 3, 1] with γ = 0.9: G₀ = 2 + 0.9×0 + 0.81×3 + 0.729×1 = 2 + 0 + 2.43 + 0.729 = 5.159
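The same arithmetic in a few lines of Python (a sketch; the function name is just for illustration):

```python
# Compute G_t = sum over k of gamma^k * r_{t+k} for a finite list of rewards.
def discounted_return(rewards, gamma):
    return sum(gamma**k * r for k, r in enumerate(rewards))

print(discounted_return([2, 0, 3, 1], 0.9))  # 5.159 (up to floating-point noise)
```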
Value Functions
A state-value function V^π(s) gives the value of a state under policy π:

V^π(s) = E_π[ G_t | S_t = s ]

where:
- V^π(s): expected return starting from state s, then following policy π
- G_t: the return from time t
- π: the policy, a mapping from states to action probabilities
An action-value function Q^π(s, a) (the Q-function):

Q^π(s, a) = E_π[ G_t | S_t = s, A_t = a ]

- Q^π(s, a): expected return starting from state s, taking action a, then following π
The Q-function gives per-action granularity: a greedy policy can be recovered directly from it via π(s) = argmax_a Q(s, a).
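A small sketch of that recovery, using a hypothetical two-state Q-table (the states, actions, and values are made up for illustration):

```python
# Hypothetical Q-table: Q[state][action] = estimated action value.
Q = {
    "s1": {"left": 0.2, "right": 0.7},
    "s2": {"left": 0.5, "right": 0.1},
}

# Greedy policy: in each state, pick the action with the highest Q-value.
greedy_policy = {s: max(q, key=q.get) for s, q in Q.items()}
print(greedy_policy)  # {'s1': 'right', 's2': 'left'}
```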
Gridworld Worked Example
A 3×2 gridworld (three columns, two rows); the top-right cell is the goal:

[ 1 ][ 2 ][ G ]   ← G = goal (reward +1, terminal)
[ 4 ][ 5 ][ 6 ]   ← all other transitions: reward 0
Actions: {up, down, left, right}. Hitting a wall stays in place. γ = 0.9. Random policy: uniform over available actions.
From state 2 (one step from the goal), the expected return V^π(2) under the random policy averages over the three available moves:
- With prob 1/3: move right → reach G, collect reward 1. Return = 1.
- With prob 1/3: move left → state 1; the eventual +1 is discounted by the extra steps needed to get back.
- With prob 1/3: move down → state 5; likewise discounted.
Averaging these branches, writing one such equation for every state, and solving the system iteratively gives V^π(2) ≈ 0.66 under the random policy. Under the optimal policy (always move toward the goal), V*(2) = 1: from state 2 the agent steps straight into G and collects the +1 immediately.
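Here is a sketch of that iterative evaluation, assuming exactly the setup above (deterministic moves, a uniform choice over the moves that don't hit a wall, and +1 only on the transition into G):

```python
# Iterative policy evaluation for the random policy on the gridworld above.
gamma = 0.9

# For each non-terminal state, the states its legal (non-wall) moves reach.
neighbors = {
    1: [2, 4],         # right, down
    2: [1, "G", 5],    # left, right, down
    4: [1, 5],         # up, right
    5: [2, 4, 6],      # up, left, right
    6: ["G", 5],       # up, left
}

V = {s: 0.0 for s in neighbors}
V["G"] = 0.0  # terminal: nothing more is collected after reaching the goal

for _ in range(200):  # repeat the Bellman expectation backup until it converges
    for s, succs in neighbors.items():
        V[s] = sum((1 / len(succs)) * ((1.0 if s2 == "G" else 0.0) + gamma * V[s2])
                   for s2 in succs)

print(round(V[2], 2))  # ≈ 0.66
```

Replacing the uniform average with a max over successor values would turn the same loop into value iteration and recover V* instead.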
Policy and the MDP Connection
A deterministic policy picks one action per state: π(s) = a. A stochastic policy assigns probabilities: π(a|s) = P(action = a | state = s).
The relationship between Q and V:

V^π(s) = Σ_a π(a|s) Q^π(s, a)

- π(a|s): the probability of taking action a in state s
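A tiny numeric check of that identity, with made-up numbers for a single state:

```python
# Hypothetical π(a|s) and Q^π(s, a) for one state s.
pi = {"left": 0.25, "right": 0.75}
Q  = {"left": 0.20, "right": 0.80}

# V^π(s) is the π-weighted average of the action values.
V = sum(pi[a] * Q[a] for a in pi)
print(V)  # ≈ 0.65
```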
Interactive example (coming soon): set rewards on a gridworld, adjust γ, and compute V^π iteratively for any hand-drawn policy.
Summary
- An MDP is (S, A, P, R, γ) — a formal model of sequential decision-making.
- The Markov property: future depends only on current state, not history.
- Return G_t is the discounted sum of future rewards; γ controls how far ahead the agent looks.
- V^π(s) is the expected return from state s under policy π.
- Q^π(s,a) is the expected return when taking action a in state s, then following π.
- RL's goal: find the optimal policy π* that maximizes V^π everywhere.