Notes on lectures by D. Silver.

# Monte-Carlo Learning

- MC methods learn directly from episodes of experience
- MC is **model-free**: no knowledge of MDP transitions / rewards
- MC learns from **complete episodes**: no bootstrapping
- MC uses the simplest possible idea: value = mean return
- Caveat: can only apply MC to episodic MDPs
- All episodes must terminate

## MC policy evaluation

**Goal**: learn $v_\pi$ from episodes of experience under policy $\pi$

- The return is the total discounted reward: $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_T$
- The value function is the expected return: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- Monte-Carlo policy evaluation uses the **empirical mean return** instead of the expected return

### First-visit MC policy evaluation

To evaluate state $s$ (see the sketch after this list):

- At the first time-step $t$ that state $s$ is visited in an episode:
- Increment counter $N(s) \leftarrow N(s) + 1$
- Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- By law of large numbers, $V(s) \rightarrow v_{\pi}(s)$ as $N(s) \rightarrow \infty$
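A minimal Python sketch of first-visit MC policy evaluation, assuming a hypothetical `sample_episode(policy)` helper that rolls out one episode and returns a list of `(state, reward)` pairs $(S_t, R_{t+1})$:

```python
from collections import defaultdict

def first_visit_mc(sample_episode, policy, num_episodes=1000, gamma=1.0):
    N = defaultdict(int)    # visit counters N(s)
    S = defaultdict(float)  # total returns S(s)
    V = defaultdict(float)  # value estimates V(s) = S(s) / N(s)
    for _ in range(num_episodes):
        episode = sample_episode(policy)        # [(S_t, R_{t+1}), ...]
        # Backwards pass: G_t = R_{t+1} + gamma * G_{t+1}
        G, returns = 0.0, []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()
        seen = set()                            # only the first visit in this episode counts
        for state, G in returns:
            if state not in seen:
                seen.add(state)
                N[state] += 1
                S[state] += G
                V[state] = S[state] / N[state]
    return V
```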

### Every-visit MC policy evaluation

To evaluate state $s$ (see the sketch after this list):

- At every time-step $t$ that state $s$ is visited in an episode:
- Increment counter $N(s) \leftarrow N(s) + 1$
- Increment total return $S(s) \leftarrow S(s) + G_t$
- Value is estimated by mean return $V(s) = S(s) / N(s)$
- By law of large numbers, $V(s) \rightarrow v_{\pi}(s)$ as $N(s) \rightarrow \infty$
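Under the same assumptions as the first-visit sketch above (a hypothetical `sample_episode(policy)` rollout helper), every-visit MC simply drops the first-visit filter:

```python
from collections import defaultdict

def every_visit_mc(sample_episode, policy, num_episodes=1000, gamma=1.0):
    N, S, V = defaultdict(int), defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        G, returns = 0.0, []
        for state, reward in reversed(sample_episode(policy)):
            G = reward + gamma * G
            returns.append((state, G))
        for state, G in returns:    # every occurrence of a state contributes
            N[state] += 1
            S[state] += G
            V[state] = S[state] / N[state]
    return V
```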

## Incremental Monte-Carlo

### Incremental Mean

The mean $\mu_1, \mu_2, \ldots$ of a sequence $x_1, x_2, \ldots$ can be computed incrementally:

$$\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j = \frac{1}{k}\left(x_k + (k-1)\mu_{k-1}\right) = \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)$$
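A quick numeric check of this update in Python (the helper name `incremental_mean` is just illustrative):

```python
def incremental_mean(xs):
    """Running mean via mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k  # pull the estimate toward the new sample by 1/k
    return mu

print(incremental_mean([1.0, 2.0, 6.0]))  # 3.0, same as the batch mean
```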

### Incremental MC updates

- Update $V(s)$ incrementally after episode $S_1, A_1, R_2, \cdots, S_T$
- For each state $S_t$ with return $G_t$: $N(S_t) \leftarrow N(S_t) + 1$, $V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$
- In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes: $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$ (see the sketch below)
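A minimal sketch of the incremental update, assuming `episode_returns` is a list of $(S_t, G_t)$ pairs from one episode (e.g. the backwards pass in the MC sketches above); passing a constant `alpha` gives the forgetting / running-mean variant for non-stationary problems:

```python
from collections import defaultdict

def incremental_mc_update(V, N, episode_returns, alpha=None):
    for state, G in episode_returns:
        N[state] += 1
        # 1/N(S_t) recovers the exact empirical mean; a constant alpha forgets old episodes.
        step = alpha if alpha is not None else 1.0 / N[state]
        V[state] += step * (G - V[state])  # V(S_t) <- V(S_t) + step * (G_t - V(S_t))
    return V

V, N = defaultdict(float), defaultdict(int)
incremental_mc_update(V, N, [("s0", 2.0), ("s1", 1.0)], alpha=0.1)
```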

# Temporal-Difference Learning

- TD methods learn directly from episodes of experience
- TD is **model-free**: no knowledge of MDP transitions / rewards
- TD learns from **incomplete episodes**, by bootstrapping
- TD updates a guess towards a guess

## MC and TD

**Goal**: learn $v_\pi$ online from experience under policy $\pi$

Incremental every-visit MC:

- Update value $V(S_t)$ toward the actual return $G_t$: $V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$

Simplest temporal-difference learning algorithm, TD(0):

- Update value $V(S_t)$ toward the estimated return $R_{t+1} + \gamma V(S_{t+1})$: $V(S_t) \leftarrow V(S_t) + \alpha\left(R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right)$
- $R_{t+1} + \gamma V(S_{t+1})$ is called the **TD target**; $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the **TD error** (a sketch follows below)
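A minimal tabular TD(0) sketch, assuming a hypothetical environment with `env.reset() -> state` and `env.step(action) -> (next_state, reward, done)`, plus a `policy(state) -> action` function:

```python
from collections import defaultdict

def td0(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done = env.step(policy(state))
            # TD target: R_{t+1} + gamma * V(S_{t+1}); bootstrap from the current guess.
            target = reward if done else reward + gamma * V[next_state]
            V[state] += alpha * (target - V[state])  # move V(S_t) toward the TD target
            state = next_state
    return V
```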

# MC vs. TD

## Pros and cons

1. Process

TD can learn before knowing the final outcome

- TD can learn online after every step
- MC must wait until end of episode before return is known

TD can learn without the final outcome

- TD can learn from incomplete sequences
- MC can only learn from complete sequences
- TD works in continuing (non-terminating) environments
- MC only works for episodic (terminating) environments

2. Statistics

- MC has **high variance**, **zero bias**
- Good convergence properties (even with function approximation)
- Not very sensitive to initial value
- Very simple to understand and use

- TD has **low variance**, **some bias**
- Usually more efficient than MC
- TD(0) converges to $v_\pi(s)$ (but not always with function approximation)
- More sensitive to initial value

3. Markov property

- TD exploits Markov property
- Usually more efficient in Markov environments

- MC does not exploit Markov property
- Usually more efficient in non-Markov environments

## Bias / Variance trade-off

- The return $G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_T$ is an **unbiased** estimate of $v_\pi(S_t)$
- The true TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is an **unbiased** estimate of $v_\pi(S_t)$
- The TD target $R_{t+1} + \gamma V(S_{t+1})$ is a **biased** estimate of $v_\pi(S_t)$
- The TD target has much lower variance than the return:
- Return depends on many random actions, transitions, rewards
- TD target depends on one random action, transition, reward