The landscape of algorithms in modern RL.

A taxonomy of RL algorithms (OpenAI SpinningUp)

Types of RL algorithms (UCB CS294-112)

$\theta^{*} = \arg \max_{\theta} E_{\tau \sim p_{\theta}(\tau)} \big[\sum_t r(\pmb{s}_t, \pmb{a}_t)\big]$

Policy gradient: directly differentiate the above objective
Value-based: estimate value function or Q-funtion of the optimal policy (no explicit policy)
Actor-critic: estimate value function or Q-function of the curent policy, use it to improve policy
Model-based RL: estimate the transition model, and then
- Use it for planning (no explicit policy)
- Use it to improve a policy
- Something else

Model-free v.s model-based RL

Branching point: whether the agent has access to (or learns) a model of the environment. By a model of the environment, we mean a function which predicts state transition and rewards.

Advantages of having a model:

Allows the agent to plan by thinking ahead, seeing what would happen for a range of possible choices, and explicitly deciding between its options. Agents can then distill the results from planning ahead into a learned policy. E.g. AlphaZero ^[1].

Disadvantages of having a model:

A ground-truth model of the environment is usually not available to the agent. Agents have to learn he model purely from the experience, and the model-learning is fundamentally hard.

What to learn

Branching point: what to learn, include

Policies, (stochastic or deterministic)
Action-value functions (Q-functions)
Value functions
and/or environment models

What to learn in `model-free` RL

Policy optimization

The policy is denoted as $\pi_{\theta}(a \vert s)$ . Optimize the parameters $\theta$ either

directly by gradient ascent on the performance objective $J(\pi_{\theta})$
or
indirectly by maximizing local approximation of $J(\pi_\theta)$

on-policy: each update uses data collected from the most recent version of the policy. Policy optimization also usually involves learning an approximator $V_{\phi}(s)$ for the on-policy value function $V^{\pi}(s)$.

Example algorithms:

A2C / A3C: performs gradient ascent to directly maximize the performance ^[2]
PPO: updates indirectly maximize performance, by instead maximizing a surrogate objective function, which gives a conservative estimate for how much $J(\pi_{\theta})$ will change as a result of the update. ^[3]

Q-learning

Methods learn an appoximator $Q_{\theta}(s,a)$ for the optimal action-value function, $Q^*(s,a)$.

Typically the objective function is based on the Bellman equation
Off-policy: each update can use data collected at any point during training. regardless of how the agent was choosing to explore the environment when the data is obtained.

The actions taken by the Q-learning:

$a(s) = \arg\max_a Q_{\theta}(s,a)$

Example algorithms:

DQN ^[8]
C51: a variant that learns a distribution over return whose expectation is $Q^*$ ^[9]

Trade-offs between policy gradient and Q-learning

Policy optimization:

Strength: directly optimize the thing you want(i.e. policy); stable and reliable

Q-learning

Indirectly optimize for agent performance by training $Q_{\theta}$ to satisfy a self-consistency equation
less stable
more sample efficient

Trade-of algorithms:

DDPG: concurrently learns a deterministic policy and a Q-function by using each to improve the other ^[5]
SAC: a variant using stochastic policies, entropy regularization, and other tricks to stablize learning ^[7]

What to learn in model-based RL

Pure planning

MBMF ^[14]

Expert iteration

ExIt ^[16].
AlphaZero

Data augmentation for Model-free methods

MBVE ^[15]
World Models ^[12]

Embedding planning loops into policies

I2A ^[13]

RL problems

a. Different tradeoffs

Sample efficiency

Sample efficiency = how many samples do we need to get a good policy?
Most important question: is the algorithm off policy?
- Off policy: able to improve the policy without generating new samples from that policy
- On policy: each time the policy is changed, even a little bit, we need to generate new samples

Stability & ease of use

Does it converge?
And if it converges, to what?
And does it converge every time?

Supervised learning: almost always gradient descent
Reinforcement learning: often not gradient descent
- Q-learning: fixed point iteration
- Model-based RL: model is not optimized for expected reward
- Policy gradient: is gradient descent, but also often the least efficient!

Value-function fitting
- At best, minimizes error of fit "Bellman error"
  - Not the same as expected reward
- At worst, doesnot optimize anything
  - Many popular deep RL value fitting algorithms are not guaranteed to converge to anything in the non-linear case
Model-based RL
- Model minimizes error of fit
  - This will converge
- No gurantee that better model = better policy
Policy gradiuent
- The only one that actually performs gradient descent(ascent) on the true objective

b. Different assumptions

Stochastic or deterministic?
Continuous or discrete?
Episodic or infinite horizon?

Full observability
- Generally assumed by value function fitting methods
- Can be mitigated by adding recurrence
Episodic learning
- Often assumed by pure policy gradient methods
- Assumed by some model-based RL methods
Continuous or smoothness
- Assumed by some continuous value function learning methods
- Often assumed by some model-based RL methods

c. Different things are easy or hard in different settings?

Easier to represent the policy?
Easier to represent the model?

References

1.AlphaZero : Silver et al, 2017 ↩
2.A2C / A3C (Asynchronous Advantage Actor-Critic): Mnih et al, 2016 ↩
3.PPO (Proximal Policy Optimization): Schulman et al, 2017 ↩
4.TRPO (Trust Region Policy Optimization): Schulman et al, 2015 ↩
5.DDPG (Deep Deterministic Policy Gradient): Lillicrap et al, 2015 ↩
6.TD3 (Twin Delayed DDPG): Fujimoto et al, 2018 ↩
7.SAC (Soft Actor-Critic): Haarnoja et al, 2018 ↩
8.DQN (Deep Q-Networks): Mnih et al, 2013 ↩
9.C51 (Categorical 51-Atom DQN): Bellemare et al, 2017 ↩
10.QR-DQN (Quantile Regression DQN): Dabney et al, 2017 ↩
11.HER (Hindsight Experience Replay): Andrychowicz et al, 2017 ↩
12.World Models: Ha and Schmidhuber, 2018 ↩
13.I2A (Imagination-Augmented Agents): Weber et al, 2017 ↩
14.MBMF (Model-Based RL with Model-Free Fine-Tuning): Nagabandi et al, 2017 ↩
15.MBVE (Model-Based Value Expansion): Feinberg et al, 2018 ↩
16.Anthony, T., Tian, Z., & Barber, D. (2017). Thinking Fast and Slow with Deep Learning and Tree Search . NIPS. ↩
17.OpenAI, Spinning Up, Part 2: Kinds of RL algorithms ↩

A taxonomy of RL algorithms (OpenAI SpinningUp)

Types of RL algorithms (UCB CS294-112)

Model-free v.s model-based RL

What to learn

What to learn in model-free RL

Policy optimization

Q-learning

Trade-offs between policy gradient and Q-learning

What to learn in model-based RL

Pure planning

Expert iteration

Data augmentation for Model-free methods

Embedding planning loops into policies

RL problems

a. Different tradeoffs

Sample efficiency

Stability & ease of use

b. Different assumptions

c. Different things are easy or hard in different settings?

References

What to learn in `model-free` RL