
Markov Decision Process

Notes on lectures by D. Silver. A brief introduction to MDPs.

Markov Decision Processes

Intro to MDPs

MDPs formally describe an environment for RL in which the environment is fully observable, i.e. the current state completely characterizes the process.

Almost all RL problems can be formalized as MDPs, e.g.

  • Optimal control primarily deals with continuous MDPs
  • Partially observable problems can be converted into MDPs
  • Bandits are MDPs with one state

Markov Process

Markov property

“The future is independent of the past given the present”

Definition: a state $S_t$ is Markov if and only if

$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$

  • The state captures all relevant information from the history.
  • Once the state is known, the history may be thrown away, i.e. the state is a sufficient statistic of the future.

State Transition Matrix

For a Markov state $s$ and successor state $s'$, the state transition probability is:

$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$

State transition matrix $\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$:

$\mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{bmatrix}$

where each row of the matrix sums to 1.
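A minimal sketch of such a matrix in code, using a hypothetical 3-state chain (the numbers are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical 3-state Markov chain; row s is the distribution over
# successor states s', so each row must sum to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.2, 0.3, 0.5],
])
assert np.allclose(P.sum(axis=1), 1.0)

rng = np.random.default_rng(0)

def step(state):
    """Sample a successor state s' with probability P[s, s']."""
    return rng.choice(len(P), p=P[state])
```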

Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1$, $S_2$, … with the Markov property.

Definition: a Markov process (or Markov chain) is a tuple $<\mathcal{S}, \mathcal{P}>$, where $\mathcal{S}$ is a (finite) set of states and $\mathcal{P}$ is a state transition probability matrix.

Markov Reward Process

A Markov reward process is a Markov chain with values.

Definition: an MRP is a tuple $<\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma>$

  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
  • $\mathcal{R}$ is a reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
  • $\gamma$ is a discount factor, $\gamma \in [0, 1]$


Return

Definition: the return $G_t$ is the total discounted reward from time-step $t$:

$G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

  • The discount $\gamma \in [0,1]$ is the present value of future rewards.
  • The value of receiving reward $R$ after $k+1$ time steps is $\gamma^k R$.
  • This values immediate reward above delayed reward
    • $\gamma$ close to 0 leads to “myopic” evaluation
    • $\gamma$ close to 1 leads to “far-sighted” evaluation
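The return can be computed backwards over a finite reward sequence; a small sketch (the reward values are made up for illustration):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    computed backwards over a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# gamma close to 0 is "myopic": only the first reward counts.
# gamma close to 1 is "far-sighted": all rewards count equally.
print(discounted_return([1, 1, 1], 0.0))  # -> 1.0
print(discounted_return([1, 1, 1], 1.0))  # -> 3.0
```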

Why discount?

Most MRPs and MDPs are discounted. Why?

  • Mathematically convenient to discount rewards.
  • Avoids infinite returns in cyclic Markov processes.
  • Uncertainty about the future may not be fully represented.
  • If the reward is financial, immediate rewards may earn more interest than delayed rewards
  • Animal/human behavior shows preference for immediate reward
  • It is sometimes possible to use undiscounted MRPs (i.e. $\gamma = 1$), e.g. if all sequences terminate.

Value Function

The value function $v(s)$ gives the long-term value of state $s$.
Definition: the state-value function $v(s)$ of an MRP is the expected return starting from state $s$:

$v(s) = \mathbb{E}[G_t \mid S_t = s]$

Bellman Equation for MRPs

The value function can be decomposed into two parts:

  • immediate reward
  • discounted value of successor state

$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s')$

Bellman equation in matrix form

The Bellman equation can be expressed concisely in matrix form:

$v = \mathcal{R} + \gamma \mathcal{P} v$

where $v$ is a column vector with one entry per state.

Solving the Bellman equation

  • The Bellman equation is a linear equation and can be solved directly: $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$

  • Computational complexity is $O(n^3)$ for $n$ states

  • Direct solution is only possible for small MRPs
  • For large MRPs, there are iterative methods:
    • Dynamic programming
    • Monte-Carlo evaluation
    • Temporal-Difference learning
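For a small MRP the direct solution is a few lines of linear algebra; a sketch with a hypothetical 2-state MRP (the numbers are illustrative):

```python
import numpy as np

# Hypothetical 2-state MRP: state 0 pays reward 1 and moves to either
# state with probability 0.5; state 1 is absorbing with reward 0.
P = np.array([[0.5, 0.5],
              [0.0, 1.0]])
R = np.array([1.0, 0.0])
gamma = 0.9

# Bellman equation v = R + gamma P v  =>  (I - gamma P) v = R.
# Solving the linear system costs O(n^3), hence small MRPs only.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)

# Sanity check: v satisfies the Bellman equation.
assert np.allclose(v, R + gamma * P @ v)
```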

Markov Decision Process

Markov Decision Process (MDP): a Markov reward process with decisions.

An MDP is a tuple $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$

  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{A}$ is a finite set of actions
  • $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
  • $\mathcal{R}$ is a reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
  • $\gamma$ is a discount factor, $\gamma \in [0, 1]$


A policy $\pi$ is a distribution over actions given states:

$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$

  • A policy fully defines the behavior of an agent.
  • MDP policies depend on the current state (not the history), i.e. policies are stationary (time-independent)

Given an MDP $\mathcal{M} = <\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$ and a policy $\pi$:

  • The state sequence $S_1, S_2, \ldots$ is a Markov process $<\mathcal{S}, \mathcal{P}^{\pi}>$
  • The state and reward sequence is a Markov reward process $<\mathcal{S}, \mathcal{P}^{\pi}, \mathcal{R}^{\pi}, \gamma>$, where $\mathcal{P}^{\pi}_{ss'} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}^a_{ss'}$ and $\mathcal{R}^{\pi}_s = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}^a_s$

Value Function

The state-value function $v_{\pi}(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$:

$v_{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$

The action-value function $q_{\pi}(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:

$q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$

Bellman Expectation Equation

The state-value function can again be decomposed into immediate reward plus discounted value of the successor state:

$v_{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) \mid S_t = s]$

The action-value function can similarly be decomposed:

$q_{\pi}(s, a) = \mathbb{E}_{\pi}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$

Bellman Expectation equation for $V^{\pi}$

(Eq. 1)

$v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_{\pi}(s, a)$

(Eq. 2)

$v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_{\pi}(s') \right)$

Bellman Expectation equation for $Q^{\pi}$

(Eq. 1)

$q_{\pi}(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_{\pi}(s')$

(Eq. 2)

$q_{\pi}(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_{\pi}(s', a')$

Matrix form

The Bellman expectation equation can be expressed concisely using the induced MRP:

$v_{\pi} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} v_{\pi}$

with direct solution

$v_{\pi} = (I - \gamma \mathcal{P}^{\pi})^{-1} \mathcal{R}^{\pi}$
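The reduction to the induced MRP can be sketched in code, with a hypothetical 2-state, 2-action MDP (the numbers are illustrative, not from the lecture):

```python
import numpy as np

# Hypothetical MDP: P[a] is the transition matrix and R[a] the reward
# vector under action a.
P = {0: np.array([[1.0, 0.0], [0.5, 0.5]]),
     1: np.array([[0.0, 1.0], [0.0, 1.0]])}
R = {0: np.array([0.0, 1.0]),
     1: np.array([2.0, 0.0])}
gamma = 0.9
pi = np.array([[0.5, 0.5],   # pi(a|s): row s, column a
               [1.0, 0.0]])

# Induced MRP: P^pi_ss' = sum_a pi(a|s) P^a_ss',  R^pi_s = sum_a pi(a|s) R^a_s
P_pi = sum(pi[:, a:a + 1] * P[a] for a in (0, 1))
R_pi = sum(pi[:, a] * R[a] for a in (0, 1))

# Direct solution: v_pi = (I - gamma P^pi)^{-1} R^pi
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
```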

Optimal Value Function

The optimal state-value function $v_{*}(s)$ is the maximum value function over all policies:

$v_{*}(s) = \max_{\pi} v_{\pi}(s)$

The optimal action-value function $q_{*}(s, a)$ is the maximum action-value function over all policies:

$q_{*}(s, a) = \max_{\pi} q_{\pi}(s, a)$

An MDP is solved when we know the optimal value function.

Optimal Policy

Define a partial ordering over policies:

$\pi \geq \pi' \iff v_{\pi}(s) \geq v_{\pi'}(s), \forall s$

An optimal policy can be found by maximizing over the optimal action-value function:

$\pi_{*}(a \mid s) = \begin{cases} 1 & \text{if } a = \arg\max_{a \in \mathcal{A}} q_{*}(s, a) \\ 0 & \text{otherwise} \end{cases}$

Solving the Bellman Optimality Equation

  • The Bellman optimality equation, $v_{*}(s) = \max_{a} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'}\, v_{*}(s') \right)$, is non-linear
  • No closed form solution (in general)
  • Many iterative solution methods
    • Value iteration
    • Policy iteration
    • Q-learning
    • Sarsa
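One of these iterative methods, value iteration, repeatedly applies the Bellman optimality backup. A sketch on a hypothetical 2-state, 2-action tabular MDP (illustrative numbers, not from the lecture):

```python
import numpy as np

# P[a, s, s'] is the transition probability, R[a, s] the expected reward.
P = np.array([[[1.0, 0.0],
               [0.5, 0.5]],
              [[0.0, 1.0],
               [0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9

v = np.zeros(2)
for _ in range(1000):
    # Bellman optimality backup:
    # q(a, s) = R^a_s + gamma * sum_s' P^a_ss' v(s'), then v(s) = max_a q(a, s)
    q = R + gamma * (P @ v)
    v_new = q.max(axis=0)
    if np.max(np.abs(v_new - v)) < 1e-10:
        v = v_new
        break
    v = v_new

# A deterministic greedy policy with respect to q_* is optimal.
policy = q.argmax(axis=0)
```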

Extensions to MDPs

Infinite MDPs

  • Countably infinite state and/or action spaces: straightforward
  • Continuous state and/or action spaces
    • Closed form for linear quadratic model (LQR)
  • Continuous time
    • Requires partial differential equations
    • Hamilton-Jacobi-Bellman (HJB) equation
    • Limiting case of the Bellman equation as the time-step $\rightarrow 0$


POMDPs

A POMDP (partially observable MDP) is an MDP with hidden states. It is a hidden Markov model with actions.

A POMDP is a tuple $<\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma>$

  • $\mathcal{S}$ is a finite set of states
  • $\mathcal{A}$ is a finite set of actions
  • $\mathcal{O}$ is a finite set of observations
  • $\mathcal{P}$ is a state transition probability matrix
  • $\mathcal{R}$ is a reward function
  • $\mathcal{Z}$ is an observation function
  • $\gamma$ is a discount factor $\gamma \in [0,1]$

Average Reward MDPs
