# Yekun's Note

Machine learning notes and writeup.

Notes on the lectures by D. Silver.

For problems such as elevator control, robot walking, and the game of Go, either the MDP model is unknown but experience can be sampled, or the MDP model is known but is too big to use except through samples. Model-free control can solve these problems.

• On-policy learning
• “Learn on the job”
• Learn about policy $\pi$ from experience sampled from $\pmb{\pi}$
• Off-policy learning
• “Look over someone’s shoulder”
• Learn about policy $\pi$ from experience sampled from $\pmb{\mu}$

# On-policy

## On-policy Monte-Carlo control

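• Greedy policy improvement over $V(s)$ requires a model of the MDP:

$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V(s') \right)$$

• Greedy policy improvement over $Q(s,a)$ is model-free:

$$\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s,a)$$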
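
To make the contrast concrete, here is a minimal sketch in plain Python. The data layout is an assumption made for illustration: `P[s][a]` is a dict mapping next states to probabilities, `R[s][a]` is the expected immediate reward, and `Q[s]` is a list of action values. Acting greedily on $V$ needs a one-step lookahead through `P` and `R`, while acting greedily on $Q$ is just an argmax.

```python
def greedy_from_v(V, P, R, gamma, state, n_actions):
    """Greedy action w.r.t. V: needs the model (P, R) for a one-step lookahead."""
    def lookahead(a):
        expected_next = sum(p * V[s2] for s2, p in P[state][a].items())
        return R[state][a] + gamma * expected_next
    return max(range(n_actions), key=lookahead)


def greedy_from_q(Q, state):
    """Greedy action w.r.t. Q: a plain argmax over stored values, no model needed."""
    return max(range(len(Q[state])), key=lambda a: Q[state][a])


# Hypothetical two-state example: from state 0, action 0 stays put, action 1 moves to state 1.
V = [0.0, 10.0]
P = {0: {0: {0: 1.0}, 1: {1: 1.0}}}
R = {0: [1.0, 0.0]}
Q = [[1.0, 9.0], [0.0, 0.0]]

print(greedy_from_v(V, P, R, gamma=0.9, state=0, n_actions=2))
print(greedy_from_q(Q, state=0))
```

Both calls pick the same action here, but only the first one had to touch `P` and `R`.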

### $\epsilon$-greedy exploration

• Simplest idea for ensuring continual exploration
• All $m$ actions are tried with non-zero probability
• With probability $1-\epsilon$ choose the greedy action
• With probability $\epsilon$ choose an action at random
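
A minimal sketch of $\epsilon$-greedy selection, assuming the action values of the current state are given as a plain list `q_values` (a hypothetical layout). Because the random choice can also land on the greedy action, the greedy action ends up with probability $\epsilon/m + 1 - \epsilon$ and every other action with $\epsilon/m$.

```python
import random


def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities under epsilon-greedy for one state."""
    m = len(q_values)
    greedy = max(range(m), key=lambda a: q_values[a])
    probs = [epsilon / m] * m          # every action gets epsilon / m
    probs[greedy] += 1.0 - epsilon     # the greedy action gets the remaining mass
    return probs


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample an action: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


probs = epsilon_greedy_probs([0.0, 1.0, 0.5], epsilon=0.3)
print(probs)
```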

## On-policy Temporal-Difference learning

### MC vs. TD control

• TD learning has several advantages over MC:
• Lower variance
• Online
• Incomplete sequences
• Natural idea: use TD instead of MC in our control loop
• Apply TD to $Q(S,A)$
• Use $\epsilon$-greedy policy improvement
• Update every time-step

### Sarsa($\lambda$)

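#### Sarsa

Sarsa updates the action-value function from each transition $(S, A, R, S', A')$:

$$Q(S,A) \leftarrow Q(S,A) + \alpha \left( R + \gamma Q(S',A') - Q(S,A) \right)$$

Every time-step:

• Policy evaluation: Sarsa, $Q \approx q_{\pi}$
• Policy improvement: $\epsilon$-greedy policy improvement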
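
The loop can be sketched as follows. This is a minimal illustration, not a reference implementation: the corridor environment `corridor_step`, its rewards, and all hyperparameter values are made up for the example. Each step applies the standard Sarsa update towards $R + \gamma Q(S', A')$, with both $A$ and $A'$ drawn from the same $\epsilon$-greedy policy.

```python
import random


def epsilon_greedy(Q, s, epsilon, rng):
    """Pick an action in state s: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(Q[s]))
    return max(range(len(Q[s])), key=lambda a: Q[s][a])


def corridor_step(s, a, n_states):
    """Hypothetical toy environment: action 1 moves right, action 0 moves left.
    Reaching the right end gives +1 and ends the episode; other steps cost 0.01."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    done = s_next == n_states - 1
    return s_next, (1.0 if done else -0.01), done


def sarsa(n_states=5, episodes=200, alpha=0.5, gamma=0.9,
          epsilon=0.1, seed=0, max_steps=1000):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        for _ in range(max_steps):
            s_next, r, done = corridor_step(s, a, n_states)
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            # Sarsa target uses the action actually chosen next (on-policy).
            target = r if done else r + gamma * Q[s_next][a_next]
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
            if done:
                break
    return Q


Q = sarsa()
for s, row in enumerate(Q):
    print(s, [round(q, 3) for q in row])
```

Policy improvement is implicit: since `epsilon_greedy` reads the latest `Q`, the policy changes as soon as the values do.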

#### $n$-step Sarsa

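• Consider the following $n$-step returns for $n=1,2,\infty$:

$$
\begin{aligned}
n=1 \ \text{(Sarsa)} \quad & q_t^{(1)} = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) \\
n=2 \quad & q_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2}, A_{t+2}) \\
n=\infty \ \text{(MC)} \quad & q_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T
\end{aligned}
$$

• Define the $n$-step Q-return:

$$q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$$

• $n$-step Sarsa updates $Q(s,a)$ towards the $n$-step Q-return:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^{(n)} - Q(S_t, A_t) \right)$$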
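
A small sketch of how the $n$-step Q-return could be computed from a recorded episode. The argument layout is an assumption for illustration: `rewards[k]` holds $R_{t+k+1}$ and `next_qs[k]` holds $Q(S_{t+k+1}, A_{t+k+1})$. If the episode ends within $n$ steps, there is nothing to bootstrap from and the return falls back to the Monte-Carlo return.

```python
def n_step_q_return(rewards, next_qs, n, gamma):
    """n-step Q-return from time t.

    rewards[k] = R_{t+k+1}; next_qs[k] = Q(S_{t+k+1}, A_{t+k+1}).
    """
    T = len(rewards)                 # steps remaining until the episode ends
    steps = min(n, T)
    g = sum(gamma ** k * rewards[k] for k in range(steps))
    if n < T:                        # S_{t+n} is non-terminal: bootstrap from Q
        g += gamma ** n * next_qs[n - 1]
    return g


rewards = [1.0, 2.0, 3.0]            # hypothetical episode tail
next_qs = [10.0, 20.0]               # hypothetical Q estimates along the way
for n in (1, 2, 3):
    print(n, n_step_q_return(rewards, next_qs, n, gamma=0.5))
```

The $n$-step Sarsa update then moves `Q[s][a]` a fraction `alpha` of the way towards this value.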

#### Forward-view Sarsa($\lambda$)

• The $q^{\lambda}$ return combines all $n$-step Q-returns $q_t^{(n)}$
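• Using weight $(1-\lambda) \lambda^{n-1}$:

$$q_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}$$

• Forward-view Sarsa($\lambda$):

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^{\lambda} - Q(S_t, A_t) \right)$$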
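
A sketch of the forward-view return for a finite episode. The argument layout (`rewards[k]` $= R_{t+k+1}$, `next_qs[k]` $= Q(S_{t+k+1}, A_{t+k+1})$) is assumed for illustration. In the episodic case every $n$-step return past the end of the episode equals the Monte-Carlo return, so the remaining weight $\lambda^{T-t-1}$ is placed on it.

```python
def n_step_q_return(rewards, next_qs, n, gamma):
    """n-step Q-return; falls back to the Monte-Carlo return past the episode end."""
    T = len(rewards)
    g = sum(gamma ** k * rewards[k] for k in range(min(n, T)))
    if n < T:
        g += gamma ** n * next_qs[n - 1]
    return g


def lambda_q_return(rewards, next_qs, lam, gamma):
    """Weighted mix of all n-step Q-returns with weights (1 - lam) * lam**(n - 1)."""
    T = len(rewards)
    total = 0.0
    for n in range(1, T):
        weight = (1.0 - lam) * lam ** (n - 1)
        total += weight * n_step_q_return(rewards, next_qs, n, gamma)
    # All returns with n >= T equal the Monte-Carlo return; their weights sum to lam**(T - 1).
    total += lam ** (T - 1) * n_step_q_return(rewards, next_qs, T, gamma)
    return total


rewards = [1.0, 2.0, 3.0]            # hypothetical episode tail
next_qs = [10.0, 20.0]               # hypothetical Q estimates
for lam in (0.0, 0.5, 1.0):
    print(lam, lambda_q_return(rewards, next_qs, lam, gamma=0.5))
```

Setting `lam=0` recovers the one-step Sarsa target and `lam=1` recovers the Monte-Carlo return. Note that the forward view needs the whole episode before it can update.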

#### Backward-view Sarsa($\lambda$)

• Like TD($\lambda$), we use eligibility traces
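• Sarsa($\lambda$) has one eligibility trace for each state-action pair:

$$E_0(s,a) = 0, \qquad E_t(s,a) = \gamma \lambda E_{t-1}(s,a) + \mathbf{1}(S_t = s, A_t = a)$$

• $Q(s,a)$ is updated for every state $s$ and action $a$
• In proportion to TD-error $\delta_t$ and eligibility trace $E_t(s,a)$:

$$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t), \qquad Q(s,a) \leftarrow Q(s,a) + \alpha \delta_t E_t(s,a)$$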
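
One backward-view step can be sketched as below, assuming tabular `Q` and `E` stored as nested lists (a layout chosen for illustration) and accumulating traces. The visited pair's trace is bumped by 1, every pair is updated by $\alpha \delta_t E_t(s,a)$, and all traces then decay by $\gamma\lambda$.

```python
def sarsa_lambda_step(Q, E, s, a, r, s_next, a_next,
                      alpha, gamma, lam, done=False):
    """Apply one backward-view Sarsa(lambda) update in place and return the TD error."""
    q_next = 0.0 if done else Q[s_next][a_next]
    delta = r + gamma * q_next - Q[s][a]
    E[s][a] += 1.0                               # accumulating trace for the visited pair
    for si in range(len(Q)):
        for ai in range(len(Q[si])):
            Q[si][ai] += alpha * delta * E[si][ai]
            E[si][ai] *= gamma * lam             # decay, ready for the next time-step
    return delta


# Hypothetical two-step episode: (s=0, a=1) -> (s=1, a=0) -> terminal with reward 1.
Q = [[0.0, 0.0], [0.0, 0.0]]
E = [[0.0, 0.0], [0.0, 0.0]]                     # traces are reset at the start of each episode
sarsa_lambda_step(Q, E, 0, 1, 0.0, 1, 0, alpha=0.5, gamma=0.9, lam=0.5)
sarsa_lambda_step(Q, E, 1, 0, 1.0, None, None, alpha=0.5, gamma=0.9, lam=0.5, done=True)
print(Q)
```

The final reward updates not only the last pair but also the earlier pair `(0, 1)`, scaled by its decayed trace, which is the point of eligibility traces: credit flows backwards online, without waiting for the end of the episode.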