Notes from lectures by D. Silver. A brief introduction to MDPs.

# Markov Decision Processes

## Intro to MDPs

MDPs formally describe an environment for RL in which the environment is **fully observable**, i.e. the current state completely characterizes the process.

Almost all RL problems can be formalized as MDPs, e.g.

- Optimal control primarily deals with continuous MDPs
- Partially observable problems can be converted into MDPs
- Bandits are MDPs with one state

## Markov Process

### Markov property

“The future is independent of the past given the present”

**Definition**: a state $S_t$ is Markov if and only if

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$$

- The state captures all relevant information from the history.
- Once the state is known, the history may be thrown away, i.e. the state is a sufficient statistic of the future.

#### State Transition Matrix

For a Markov state $s$ and successor state $s'$, the state transition probability is

$$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$$

The state transition matrix $\mathcal{P}$ defines transition probabilities from all states $s$ to all successor states $s'$:

$$\mathcal{P} = \begin{bmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{bmatrix}$$

where each row of the matrix sums to 1.

### Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states $S_1, S_2, \dots$ with the Markov property.

**Definition**: a Markov Process (or Markov Chain) is a tuple $<\mathcal{S}, \mathcal{P}>$, where $\mathcal{S}$ is a (finite) set of states and $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$.
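As a concrete illustration, here is a minimal sketch (Python with NumPy) of sampling an episode from a Markov chain; the two-state weather example and the names `states`, `P`, and `sample_episode` are assumptions for illustration, not from the lecture.

```python
import numpy as np

# Hypothetical two-state weather chain, purely for illustration.
states = ["Sunny", "Rainy"]
P = np.array([[0.9, 0.1],   # transition probabilities out of "Sunny"
              [0.5, 0.5]])  # transition probabilities out of "Rainy"
assert np.allclose(P.sum(axis=1), 1.0)  # each row of P sums to 1

def sample_episode(P, start=0, length=10, seed=0):
    """Sample S_1, S_2, ...; the next state depends only on the current one (Markov property)."""
    rng = np.random.default_rng(seed)
    s, episode = start, [start]
    for _ in range(length - 1):
        s = rng.choice(len(P), p=P[s])
        episode.append(s)
    return episode

print([states[s] for s in sample_episode(P)])
```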

## Markov Reward Process

A Markov reward process is a Markov chain with values.

**Definition**: a Markov Reward Process (MRP) is a tuple $<\mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma>$

- $\mathcal{S}$ is a finite set of states
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
- $\mathcal{R}$ is a **reward function**, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$
- $\gamma$ is a **discount factor**, $\gamma \in [0, 1]$

### Return

**Definition**: the return $G_t$ is the total discounted reward from time-step $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

- The discount $\gamma \in [0,1]$ is the present value of future rewards.
- The value of receiving reward $R$ after $k+1$ time steps is $\gamma^k R$.
- This values immediate reward above delayed reward.
- $\gamma$ close to 0 leads to “myopic” evaluation
- $\gamma$ close to 1 leads to “far-sighted” evaluation
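To make the effect of the discount concrete, here is a minimal sketch (a hypothetical reward sequence, not from the lecture) that computes $G_t$ for two values of $\gamma$:

```python
def discounted_return(rewards, gamma):
    """Compute G_t = R_{t+1} + gamma * R_{t+2} + ... for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [0.0, 0.0, 0.0, 10.0]  # hypothetical rewards R_{t+1}, ..., R_{t+4}
print(discounted_return(rewards, 0.1))  # ~0.01: "myopic", the delayed reward is nearly ignored
print(discounted_return(rewards, 0.9))  # ~7.29: "far-sighted", the delayed reward still counts
```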

**Why discount?**

Most MRPs and MDPs are discounted. Why?

- Mathematically convenient to discount rewards.
- Avoids infinite returns in cyclic Markov processes.
- Uncertainty about the future may not be fully represented.
- If the reward is financial, immediate rewards may earn more interest than delayed rewards
- Animal/human behavior shows preference for immediate reward
- It is sometimes possible to use undiscounted MRP (i.e. $\gamma = 1$), e.g. if all sequences terminate.

### Value Function

The value function $v(s)$ gives the long-term value of state $s$.

**Definition**: the state value function $v(s)$ of an MRP is the expected return starting from state $s$:

$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$

### Bellman Equation for MRPs

The value function can be decomposed into two parts:

- immediate reward $R_{t+1}$
- discounted value of the successor state $\gamma v(S_{t+1})$

$$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} v(s')$$

The Bellman equation can be written in **matrix form**:

$$v = \mathcal{R} + \gamma \mathcal{P} v$$

where $v$ is a column vector with one entry per state.

**Solving the Bellman equation**

The Bellman equation is a **linear equation** and can be solved directly:

$$v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$$

The computational complexity is $O(n^3)$ for $n$ states.

- Direct solution is only possible for small MRPs
- For large MRPs, iterative methods:
  - Dynamic programming
  - Monte-Carlo evaluation
  - Temporal-Difference learning
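A minimal sketch of the direct solution $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$ in Python with NumPy; the three-state chain and the names `P`, `R`, `gamma` are assumptions for illustration, not the lecture's example:

```python
import numpy as np

# Hypothetical 3-state MRP for illustration; state 2 is absorbing.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])   # expected immediate reward in each state
gamma = 0.9

# Solve (I - gamma * P) v = R rather than forming the inverse explicitly.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)   # value of each state under discount gamma
```

The $O(n^3)$ cost of this solve is exactly why the direct approach only scales to small MRPs.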

## Markov Decision Process

A Markov Decision Process (MDP) is a Markov reward process **with decisions**. It is an environment in which all states are Markov.

An MDP is a tuple $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$

- $\mathcal{S}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{P}$ is a state transition probability matrix, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $\mathcal{R}$ is a reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\gamma$ is a discount factor $\gamma \in [0,1]$

### Policies

A policy $\pi$ is a distribution over actions given states:

$$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$

- A policy fully defines the behavior of an agent.
- MDP policies depend on the current state (not the history), i.e. policies are stationary (time-independent)

Given an MDP $\mathcal{M} = <\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$ and a policy $\pi$:

- The state sequence $S_1, S_2,…$ is a Markov process $<\mathcal{S}, \mathcal{P}^{\pi}>$
- The state and reward sequence $S_1, R_2, S_2, \dots$ is a Markov reward process $<\mathcal{S}, \mathcal{P}^{\pi}, \mathcal{R}^{\pi}, \gamma>$

where

$$\mathcal{P}^{\pi}_{ss'} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}^a_{ss'}, \qquad \mathcal{R}^{\pi}_s = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}^a_s$$
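A minimal sketch (assumed array shapes and a hypothetical helper name, not from the lecture) of averaging the MDP dynamics over a policy to obtain the induced MRP:

```python
import numpy as np

def induced_mrp(P, R, pi):
    """Average the MDP dynamics over a policy pi.

    P:  (A, S, S) array, P[a, s, t] = P^a_{st}
    R:  (A, S)    array, R[a, s]    = R^a_s
    pi: (S, A)    array, pi[s, a]   = pi(a | s)
    Returns P_pi with shape (S, S) and R_pi with shape (S,).
    """
    P_pi = np.einsum('sa,ast->st', pi, P)  # P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
    R_pi = np.einsum('sa,as->s', pi, R)    # R^pi_s     = sum_a pi(a|s) R^a_s
    return P_pi, R_pi
```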

### Value Function

The **state-value function** $V^{\pi}(s)$ of an MDP is the expected return starting from state $s$ and then following policy $\pi$:

$$V^{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]$$

The **action-value function** $Q^{\pi}(s, a)$ is the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]$$

### Bellman Expectation Equation

The **state-value function** can again be decomposed into immediate reward plus discounted value of the successor state:

$$V^{\pi}(s) = \mathbb{E}_{\pi}[R_{t+1} + \gamma V^{\pi}(S_{t+1}) \mid S_t = s]$$

The **action-value function** can similarly be decomposed:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}[R_{t+1} + \gamma Q^{\pi}(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$

#### Bellman Expectation equation for $V^{\pi}$

(Eq. 1)

$$V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, Q^{\pi}(s, a)$$

(Eq. 2)

$$V^{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V^{\pi}(s') \right)$$

#### Bellman Expectation equation for $Q^{\pi}$

(Eq. 1)

$$Q^{\pi}(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V^{\pi}(s')$$

(Eq. 2)

$$Q^{\pi}(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, Q^{\pi}(s', a')$$
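Turning (Eq. 2) for $V^{\pi}$ into repeated sweeps gives a simple iterative policy-evaluation routine. A minimal sketch under the same assumed array shapes as the induced-MRP helper above (an illustration, not the lecture's code):

```python
import numpy as np

def evaluate_policy_iterative(P, R, pi, gamma, tol=1e-8):
    """Sweep V(s) <- sum_a pi(a|s) * (R[a,s] + gamma * sum_s' P[a,s,s'] * V(s')) until convergence."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R.T + gamma * np.einsum('ast,t->sa', P, V)  # Q[s, a]: one-step lookahead
        V_new = np.sum(pi * Q, axis=1)                  # average over the policy
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```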

#### Matrix form

The Bellman expectation equation can be expressed concisely using the induced MRP:

$$V^{\pi} = \mathcal{R}^{\pi} + \gamma \mathcal{P}^{\pi} V^{\pi}$$

with direct solution

$$V^{\pi} = (I - \gamma \mathcal{P}^{\pi})^{-1} \mathcal{R}^{\pi}$$
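Concretely, the direct solution takes one linear solve, reusing the hypothetical `induced_mrp` helper sketched above (again an illustration, not the lecture's code):

```python
import numpy as np

def evaluate_policy_direct(P, R, pi, gamma):
    """V^pi = (I - gamma * P^pi)^{-1} R^pi via the induced MRP of the policy."""
    P_pi, R_pi = induced_mrp(P, R, pi)  # hypothetical helper defined earlier
    return np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)
```

As with the MRP case, the cubic cost of the solve means this is only practical for small state spaces.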

### Optimal Value Function

The **optimal state-value function** $V^*(s)$ is the maximum value function over all policies:

$$V^*(s) = \max_{\pi} V^{\pi}(s)$$

The **optimal action-value function** $Q^*(s, a)$ is the maximum action-value function over all policies:

$$Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$$

An MDP is solved when we know the optimal value function.

#### Optimal Policy

Define a partial ordering over policies:

$$\pi \geq \pi' \iff V^{\pi}(s) \geq V^{\pi'}(s), \ \forall s$$

An optimal policy can be found by maximizing over $Q^*(s, a)$:

$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \underset{a \in \mathcal{A}}{\operatorname{argmax}}\ Q^*(s, a) \\ 0 & \text{otherwise} \end{cases}$$

#### Solving the Bellman Optimality Equation

The optimal value functions are recursively related by the Bellman optimality equations:

$$V^*(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} V^*(s') \right), \qquad Q^*(s, a) = \mathcal{R}^a_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^a_{ss'} \max_{a' \in \mathcal{A}} Q^*(s', a')$$

- The Bellman Optimality Equation is non-linear
- No closed form solution (in general)
- Many iterative solution methods:
  - Value iteration
  - Policy iteration
  - Q-learning
  - Sarsa
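As one instance of the iterative methods listed above, here is a minimal value-iteration sketch that applies the Bellman optimality backup; the array conventions (`P[a, s, s']`, `R[a, s]`) are the same assumptions used in the earlier sketches, not the lecture's code:

```python
import numpy as np

def value_iteration(P, R, gamma, n_sweeps=1000, tol=1e-8):
    """Repeat V(s) <- max_a (R[a,s] + gamma * sum_s' P[a,s,s'] * V(s')) until convergence."""
    V = np.zeros(P.shape[1])
    for _ in range(n_sweeps):
        Q = R.T + gamma * np.einsum('ast,t->sa', P, V)  # Q[s, a]: one-step lookahead
        V_new = Q.max(axis=1)                           # greedy (max) backup over actions
        done = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if done:
            break
    greedy_policy = Q.argmax(axis=1)  # deterministic policy that is greedy w.r.t. V
    return V, greedy_policy
```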

## Extensions to MDPs

### Infinite MDPs

- Countably infinite state and/or action spaces: straightforward
- Continuous state and/or action spaces
  - Closed form for the linear quadratic model (LQR)
- Continuous time
  - Requires partial differential equations
  - Hamilton-Jacobi-Bellman (HJB) equation
  - Limiting case of the Bellman equation as the time-step $\rightarrow 0$

### POMDPs

A POMDP is an MDP **with hidden states**. It is a hidden Markov model with actions.

A POMDP is a tuple $<\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}, \mathcal{Z}, \gamma>$

- $\mathcal{S}$ is a finite set of states
- $\mathcal{A}$ is a finite set of actions
- $\mathcal{O}$ is a finite set of observations
- $\mathcal{P}$ is a state transition probability matrix
- $\mathcal{R}$ is a reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
- $\mathcal{Z}$ is an observation function, $\mathcal{Z}^a_{s'o} = \mathbb{P}[O_{t+1} = o \mid S_{t+1} = s', A_t = a]$
- $\gamma$ is a discount factor $\gamma \in [0,1]$