Notes on lectures by D. Silver.

## Introduction

**Dynamic**: sequential or temporal component to the problem

**Programming**: optimizing a "program", i.e. a policy (c.f. linear programming)

Dynamic programming breaks a complex problem down into subproblems:

- Solve the subproblems
- Combine the solutions to the subproblems

### Requirements for dynamic programming

Dynamic programming (DP) is a general solution method for problems that have two properties:

**Optimal substructure**

- The principle of optimality applies
- The optimal solution can be decomposed into subproblems

**Overlapping subproblems**

- Subproblems recur many times
- Solutions can be cached and reused
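
As a generic (non-MDP) illustration of caching overlapping subproblems, here is a minimal memoization sketch in Python; the Fibonacci example is my own, not from the lecture:

```python
from functools import lru_cache

# Fibonacci has optimal substructure (fib(n) is built from fib(n-1) and fib(n-2))
# and overlapping subproblems (the same fib(k) recurs many times), so caching
# the subproblem solutions turns exponential time into linear time.
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(50))  # 12586269025, computed with only ~50 cached subproblems
```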

Markov decision processes (MDPs) satisfy both properties:

- Bellman equation gives recursive decomposition
- Value function stores and reuses solutions

### Planning by DP

- DP assumes **full knowledge** of the MDP
- It is used for planning in an MDP
- For prediction:
  - Input: MDP $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$ and policy $\pi$, or MRP $<\mathcal{S}, \mathcal{P}^{\pi}, \mathcal{R}^{\pi}, \gamma>$
  - Output: value function $v_\pi$
- For control:
  - Input: MDP $<\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma>$
  - Output: optimal value function $v_*$ and optimal policy $\pi_*$
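
The sketches in the later sections assume a small tabular MDP stored as NumPy arrays; the two-state MDP and the names `P`, `R`, `gamma` below are illustrative assumptions of mine, not part of the lecture:

```python
import numpy as np

# Hypothetical tabular MDP with |S| = 2 states and |A| = 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])   # shape (S, A, S'), each row sums to 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # shape (S, A)
gamma = 0.9                                # discount factor
```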

### Other applications by DP

- Scheduling algorithms
- String algorithms (e.g. sequence alignment)
- Graph algorithms (e.g. shortest path algorithms)
- Graphical models (e.g. Viterbi algorithm)
- Bioinformatics (e.g. lattice model)

## Iterative policy evaluation

- Problem: evaluate a given policy $\pi$
- Solution: iterative application of Bellman expectation backup

Using synchronous backups:

- At each iteration $k+1$
- For all states $s \in \mathcal{S}$
- Update $v_{k+1}(s)$ from $v_k(s')$, where $s'$ is a successor state of $s$:

$v_{k+1}(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \right)$
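
A minimal Python sketch of this backup, assuming the tabular `P`, `R`, `gamma` arrays from the earlier sketch and a stochastic policy array `pi[s, a]` (all names are assumptions of mine):

```python
import numpy as np

def policy_evaluation(P, R, gamma, pi, theta=1e-8):
    """Iteratively apply the Bellman expectation backup until convergence.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards,
    pi: (S, A) policy probabilities pi(a|s).
    """
    n_states = P.shape[0]
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v           # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] v[s']
        v_new = np.sum(pi * q, axis=1)  # v_{k+1}(s) = sum_a pi(a|s) q(s, a)
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new
```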

## Policy iteration

### How to improve a policy

Given a policy $\pi$:

- Evaluate the policy $\pi$
- Improve the policy by acting greedily with respect to $v_\pi$: $\pi' = \text{greedy}(v_\pi)$

**Policy evaluation**: estimate $v_\pi$ (iterative policy evaluation)

**Policy improvement**: generate $\pi' \geq \pi$ (greedy policy improvement)
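
A sketch of this evaluate-then-improve loop under the same assumed tabular representation; the function and argument names are my own:

```python
import numpy as np

def policy_iteration(P, R, gamma, theta=1e-8):
    """Alternate policy evaluation and greedy policy improvement until the policy is stable.

    P: (S, A, S) transition probabilities, R: (S, A) expected rewards.
    """
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)          # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: Bellman expectation backups for the current policy
        v = np.zeros(n_states)
        while True:
            q = R + gamma * P @ v                   # q[s, a]
            v_new = q[np.arange(n_states), policy]  # v(s) = q(s, pi(s)) for deterministic pi
            delta = np.max(np.abs(v_new - v))
            v = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to v_pi
        new_policy = np.argmax(R + gamma * P @ v, axis=1)
        if np.array_equal(new_policy, policy):      # improvements stopped => policy is optimal
            return policy, v
        policy = new_policy
```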

### Policy improvement

Consider a deterministic policy $a=\pi(s)$

- Improve the policy by acting greedily: $\pi'(s) = \arg\max_{a \in \mathcal{A}} q_\pi(s, a)$

This improves the value from any state $s$ over one step: $q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) \geq q_\pi(s, \pi(s)) = v_\pi(s)$

It therefore improves the value function: $v_{\pi'}(s) \geq v_\pi(s)$

- If improvements stop, $q_\pi(s, \pi'(s)) = \max_{a \in \mathcal{A}} q_\pi(s, a) = q_\pi(s, \pi(s)) = v_\pi(s)$,
then the Bellman optimality equation has been satisfied: $v_\pi(s) = \max_{a \in \mathcal{A}} q_\pi(s, a)$

Therefore $v_\pi(s) = v_*(s)$ for all $s \in \mathcal{S}$, so $\pi$ is an optimal policy
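
A quick numerical sanity check of the one-step improvement claim on the hypothetical two-state MDP from earlier; the arrays and the starting policy are assumptions of mine:

```python
import numpy as np

# Check q_pi(s, pi'(s)) >= v_pi(s) on an assumed toy MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])    # P[s, a, s']
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                  # R[s, a]
gamma, pi = 0.9, np.array([0, 0])           # arbitrary deterministic policy a = pi(s)

# Evaluate pi by iterating the Bellman expectation backup
v = np.zeros(2)
for _ in range(1000):
    v = R[np.arange(2), pi] + gamma * (P[np.arange(2), pi] @ v)

q = R + gamma * P @ v                       # q_pi(s, a)
pi_greedy = np.argmax(q, axis=1)            # pi'(s) = argmax_a q_pi(s, a)
assert np.all(q[np.arange(2), pi_greedy] >= v - 1e-9)  # one-step improvement holds
```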

## Value iteration

### Value iteration in MDPs

#### Principle of Optimality

An optimal policy can be subdivided into two components:

- An optimal first action
- Followed by an optimal policy from successor state $S'$

#### Deterministic value iteration

If we know the solution to the subproblems $v_*(s')$, then the solution $v_*(s)$ can be found by one-step lookahead: $v_*(s) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_*(s') \right)$

- The idea of value iteration is to apply this update iteratively
- Intuition: start with the final rewards and work backwards
- Still works with loopy, stochastic MDPs

#### Value iteration

- Problem: find the optimal policy $\pi$
- Solution: iterative application of the Bellman optimality backup

Using synchronous backups: at each iteration $k+1$, for all states $s \in \mathcal{S}$, update $v_{k+1}(s)$ from $v_k(s')$:

$v_{k+1}(s) = \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_k(s') \right)$

- Unlike policy iteration, there is **no explicit policy**; intermediate value functions may not correspond to any policy
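
A minimal value iteration sketch under the same assumed tabular arrays; note that a policy is only extracted at the end, by acting greedily with respect to the final values:

```python
import numpy as np

def value_iteration(P, R, gamma, theta=1e-8):
    """Iteratively apply the Bellman optimality backup; return v* and a greedy policy."""
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v        # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] v[s']
        v_new = np.max(q, axis=1)    # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < theta:
            # No explicit policy is kept during the iterations; extract one at the end
            return v_new, np.argmax(q, axis=1)
        v = v_new
```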

## Synchronous DP algorithms

| Problem    | Bellman equation                                          | Algorithm                   |
|------------|-----------------------------------------------------------|-----------------------------|
| Prediction | Bellman Expectation Equation                              | Iterative Policy Evaluation |
| Control    | Bellman Expectation Equation + Greedy Policy Improvement  | Policy Iteration            |
| Control    | Bellman Optimality Equation                               | Value Iteration             |

## Asynchronous DP

### In-place DP

Synchronous value iteration stores two copies of the value function. For all $s$ in $\mathcal{S}$:

$v_{new}(s) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v_{old}(s') \right)$, then $v_{old} \leftarrow v_{new}$

**In-place** value iteration only stores one copy of the value function. For all $s$ in $\mathcal{S}$:

$v(s) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v(s') \right)$
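
A sketch of the in-place variant, sweeping the states with a single value array; the fixed sweep count and the loop order are assumptions of mine:

```python
import numpy as np

def in_place_value_iteration(P, R, gamma, n_sweeps=100):
    """Sweep the states, overwriting v(s) immediately so later backups in the
    same sweep already see the new values (only one copy of v is stored)."""
    n_states = P.shape[0]
    v = np.zeros(n_states)
    for _ in range(n_sweeps):
        for s in range(n_states):
            # Bellman optimality backup using the current (partially updated) v
            v[s] = np.max(R[s] + gamma * P[s] @ v)
        # a sweep-to-sweep convergence test could replace the fixed sweep count
    return v
```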

### Prioritized sweeping

Use the magnitude of the Bellman error to select which state to back up, e.g. $\left| \max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^a v(s') \right) - v(s) \right|$

Backup the state with the largest remaining Bellman error
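
A simplified sketch: instead of maintaining a priority queue keyed by Bellman error, it recomputes every state's error and backs up the worst one, which keeps the example short but is my own shortcut:

```python
import numpy as np

def prioritized_sweeping(P, R, gamma, theta=1e-8, max_backups=10000):
    """Repeatedly back up the single state with the largest remaining Bellman error."""
    n_states = P.shape[0]
    v = np.zeros(n_states)
    for _ in range(max_backups):
        # Bellman error |max_a (R + gamma * P v) - v| for every state
        backed_up = np.max(R + gamma * P @ v, axis=1)
        errors = np.abs(backed_up - v)
        s = np.argmax(errors)        # state with the largest remaining error
        if errors[s] < theta:
            break
        v[s] = backed_up[s]          # back up only that state
    return v
```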

### Real-time DP

- Idea: only backup states that are relevant to the agent
- Use the agent's experience to guide the selection of states
- After each time-step $S_t, A_t, R_{t+1}$
- Backup the state $S_t$: $v(S_t) \leftarrow \max_{a \in \mathcal{A}} \left( \mathcal{R}_{S_t}^a + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{S_t s'}^a v(s') \right)$
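
A sketch of real-time DP on a simulated agent: it acts in the assumed tabular MDP (epsilon-greedily, which is my own choice of behaviour policy) and backs up only the states it actually visits:

```python
import numpy as np

def real_time_dp(P, R, gamma, n_steps=1000, epsilon=0.1, start_state=0, seed=0):
    """Back up only the states the (simulated) agent actually visits."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    s = start_state
    for _ in range(n_steps):
        q = R[s] + gamma * P[s] @ v             # q(S_t, a) for the current state only
        v[s] = np.max(q)                        # back up the visited state S_t
        # epsilon-greedy behaviour policy so the agent keeps visiting relevant states
        a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(q))
        s = rng.choice(n_states, p=P[s, a])     # sample the next state from the MDP
    return v
```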