Fork me on GitHub

Model-Free Prediction (RL)

upload successful
Notes of lectures by D. Silver.

Monte-Carlo Learning

  • MC methods learn directly from episodes of experience
  • MC is model-free: no knowledge of MDP transitions / rewards.
  • MC learns from complete episodes: no bootstrapping
  • MC uses the simplest possible idea: value = mean return
  • Caveat: can only apply MC to episodic MDPs
    • All episodes must terminate

MC policy evaluation

  • Goal: learn from episodes of experience under policy $\pi$
  • The return is the total discounted reward:
  • The value function is the expected return:
  • Monte-Carlo policy evaluation uses empirical mean return instead of expected return

First-visit MC policy evaluation

To evaluate state $s$

  • The first time-step $t$ that state $s$ is visited in an episode
  • Increment counter $N(s) \leftarrow N(s) + 1$
  • Increment counter
  • Value is estimated by mean return $V(s) = S(s) / N(s)$
  • By law of large numbers, $V(s) \rightarrow v_{\pi}(s)$ as $N(s) \rightarrow \infty$

Every-visit MC policy evaluation

To evaluate state $s$

  • The every time-step $t$ that state $s$ is visited in an episode
  • Increment counter $N(s) \leftarrow N(s) + 1$
  • Increment counter
  • Value is estimated by mean return $V(s) = S(s) / N(s)$
  • By law of large numbers, $V(s) \rightarrow v_{\pi}(s)$ as $N(s) \rightarrow \infty$

Incremental Monte-Carlo

Incremental Mean

The mean of a sequence can be computed incrementally

Incremental MC updates

  • Update $V(s)$ incrementally after episode $S_1, A_1, R_2, \cdots, S_T$
  • For each state with return
  • In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes

Temporal-Difference Learning

  • TD methods learn directly from episodes of experience
  • TD is model-free: no knowledge of MDP transitions / rewards
  • TD learns from incomplete episodes, by bootstrapping
  • TD updates a guess towards a guess

MC and TD

  • goal: learn online from experience under policy $\pi$
  • Incremental every-visit MC

    • Update value toward actual return

  • Simplest temporal-difference learning algorithm: TD(0)
    • Update toward estimated return

MC v.s TD

Pros and cons

1.Process

  • TD can learn before knowing the final outcome

    • TD can learn online after every step
    • MC must wait until end of episode before return is known
  • TD can learn without the final outcome

    • TD can learn from incomplete sequences
    • MC can only learn from complete sequences
    • TD works in continuing (non-terminating) environments
    • MC only works for episodic (terminating) environments

2.Statistics

  • MC has high variance, zero bias
    • Good convergence properties (even with function approximation)
    • Not very sensitive to initial value
    • Very simple to understand and use
  • TD has low variance, some bias
    • Usually more efficient than MC
    • TD(0) converges to but not always with function approximation
    • More sensitive to initial value

3.Markov property

  • TD exploits Markov property
    • Usually more efficient in Markov environments
  • MC does not exploit Markov property
    • Usually more efficient in non-Markov environments

Bias / Variance trade-off

  • Return is unbiased estimate of
  • True TD target is unbiased estimate of
  • TD target is biased estimate of
  • TD target is much lower variance that the return:
    • Return depends on many random actions, transitions, rewards
    • TD target depends on one random action, transition, reward
Thanks for your reward!