 The landscape of algorithms in modern RL.

# A taxonomy of RL algorithms (OpenAI SpinningUp) # Types of RL algorithms (UCB CS294-112)

• Policy gradient: directly differentiate the above objective
• Value-based: estimate value function or Q-funtion of the optimal policy (no explicit policy)
• Actor-critic: estimate value function or Q-function of the curent policy, use it to improve policy
• Model-based RL: estimate the transition model, and then
• Use it for planning (no explicit policy)
• Use it to improve a policy
• Something else

## Model-free v.s model-based RL

• Branching point: whether the agent has access to (or learns) a model of the environment. By a model of the environment, we mean a function which predicts state transition and rewards.

Advantages of having a model:

• Allows the agent to plan by thinking ahead, seeing what would happen for a range of possible choices, and explicitly deciding between its options. Agents can then distill the results from planning ahead into a learned policy. E.g. AlphaZero .

Disadvantages of having a model:

• A ground-truth model of the environment is usually not available to the agent. Agents have to learn he model purely from the experience, and the model-learning is fundamentally hard.

# What to learn

Branching point: what to learn, include

• Policies, (stochastic or deterministic)
• Action-value functions (Q-functions)
• Value functions
• and/or environment models

## What to learn in model-free RL

### Policy optimization

The policy is denoted as $\pi_{\theta}(a \vert s)$. Optimize the parameters $\theta$ either

1. directly by gradient ascent on the performance objective $J(\pi_{\theta})$
or
2. indirectly by maximizing local approximation of $J(\pi_\theta)$
• on-policy: each update uses data collected from the most recent version of the policy. Policy optimization also usually involves learning an approximator $V_{\phi}(s)$ for the on-policy value function $V^{\pi}(s)$.

Example algorithms:

• A2C / A3C: performs gradient ascent to directly maximize the performance 
• PPO: updates indirectly maximize performance, by instead maximizing a surrogate objective function, which gives a conservative estimate for how much $J(\pi_{\theta})$ will change as a result of the update. 

### Q-learning

Methods learn an appoximator $Q_{\theta}(s,a)$ for the optimal action-value function, $Q^*(s,a)$.

• Typically the objective function is based on the Bellman equation
• Off-policy: each update can use data collected at any point during training. regardless of how the agent was choosing to explore the environment when the data is obtained.

The actions taken by the Q-learning:

Example algorithms:

• DQN 
• C51: a variant that learns a distribution over return whose expectation is $Q^*$ 

Policy optimization:

• Strength: directly optimize the thing you want(i.e. policy); stable and reliable

Q-learning

• Indirectly optimize for agent performance by training $Q_{\theta}$ to satisfy a self-consistency equation
• less stable
• more sample efficient

• DDPG: concurrently learns a deterministic policy and a Q-function by using each to improve the other 
• SAC: a variant using stochastic policies, entropy regularization, and other tricks to stablize learning 

• ExIt .
• AlphaZero

# RL problems

### Sample efficiency

• Sample efficiency = how many samples do we need to get a good policy? • Most important question: is the algorithm off policy?
• Off policy: able to improve the policy without generating new samples from that policy
• On policy: each time the policy is changed, even a little bit, we need to generate new samples

### Stability & ease of use

• Does it converge?
• And if it converges, to what?
• And does it converge every time?
• Supervised learning: almost always gradient descent
• Reinforcement learning: often not gradient descent
• Q-learning: fixed point iteration
• Model-based RL: model is not optimized for expected reward
• Policy gradient: is gradient descent, but also often the least efficient!
1. Value-function fitting
• At best, minimizes error of fit "Bellman error"
• Not the same as expected reward
• At worst, doesnot optimize anything
• Many popular deep RL value fitting algorithms are not guaranteed to converge to anything in the non-linear case
2. Model-based RL
• Model minimizes error of fit
• This will converge
• No gurantee that better model = better policy
• The only one that actually performs gradient descent(ascent) on the true objective

## b. Different assumptions

• Stochastic or deterministic?
• Continuous or discrete?
• Episodic or infinite horizon?
1. Full observability
• Generally assumed by value function fitting methods
• Can be mitigated by adding recurrence
2. Episodic learning
• Often assumed by pure policy gradient methods
• Assumed by some model-based RL methods
3. Continuous or smoothness
• Assumed by some continuous value function learning methods
• Often assumed by some model-based RL methods

## c. Different things are easy or hard in different settings?

• Easier to represent the policy?
• Easier to represent the model?

1. 1.AlphaZero : Silver et al, 2017
2. 2.A2C / A3C (Asynchronous Advantage Actor-Critic): Mnih et al, 2016
3. 3.PPO (Proximal Policy Optimization): Schulman et al, 2017
4. 4.TRPO (Trust Region Policy Optimization): Schulman et al, 2015
5. 5.DDPG (Deep Deterministic Policy Gradient): Lillicrap et al, 2015
6. 6.TD3 (Twin Delayed DDPG): Fujimoto et al, 2018
7. 7.SAC (Soft Actor-Critic): Haarnoja et al, 2018
8. 8.DQN (Deep Q-Networks): Mnih et al, 2013
9. 9.C51 (Categorical 51-Atom DQN): Bellemare et al, 2017
10. 10.QR-DQN (Quantile Regression DQN): Dabney et al, 2017
11. 11.HER (Hindsight Experience Replay): Andrychowicz et al, 2017
12. 12.World Models: Ha and Schmidhuber, 2018
13. 13.I2A (Imagination-Augmented Agents): Weber et al, 2017
14. 14.MBMF (Model-Based RL with Model-Free Fine-Tuning): Nagabandi et al, 2017
15. 15.MBVE (Model-Based Value Expansion): Feinberg et al, 2018
16. 16.Anthony, T., Tian, Z., & Barber, D. (2017). Thinking Fast and Slow with Deep Learning and Tree Search . NIPS.
17. 17.