An introduction to key concepts and terminology in reinforcement learning.

Environment and agent

The main components of RL are the environment and the agent.

The environment is the world that the agent lives in and interacts with. At every step of interaction, the agent sees a (possibly partial) observation of the state of the world, then decides on an action to take. The environment changes when the agent acts on it, but may also change on its own.

The agent also perceives a reward signal from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called the return.
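The interaction loop described above can be sketched in a few lines of Python. This is only an illustration: the CorridorEnv class, its reset/step methods, and the reward values are hypothetical and are not taken from any particular RL library.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: positions 0..4 on a line, goal at 4."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # the observation (here it is the full state)

    def step(self, action):
        # action: 0 = move left, 1 = move right (clipped at the walls)
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else -0.1  # small step cost, bonus at the goal
        return self.pos, reward, done


def always_right(obs):
    return 1


def random_policy(obs):
    return random.choice([0, 1])


def run_episode(env, policy, max_steps=100):
    """Run one episode and return the sum of the rewards collected."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)                  # the agent decides
        obs, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


print(run_episode(CorridorEnv(), always_right))
print(run_episode(CorridorEnv(), random_policy))
```

The policy that always moves right reaches the goal in four steps, while the random policy usually pays more step costs before it gets there, which is what a lower return means here.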

State and observations

State:

A state $s$ is a complete description of the state of the world. No information about the world is hidden from the state.

Observation:

An observation $o$ is a partial description of a state, which may omit information.

In deep RL, states and observations are almost always represented by a real-valued vector, matrix, or higher-order tensor.

When the agent can observe the complete state of the environment, we say the environment is fully observed.
When the agent can only see a partial observation, the environment is partially observed (cf. POMDP).

In RL notation, the symbol for the state, $s$, is often used where the observation $o$ would technically be more appropriate. Specifically, the notation often signals that the action is conditioned on the state, when in practice the action is conditioned on the observation, because the agent does not have access to the state. These notes follow the standard notation and write $s$ rather than $o$.

Action spaces

Action space: the set of all valid actions in a given environment.
Discrete action space: only a finite number of moves are available to the agent, e.g. Atari, Go.
Continuous action space: actions are real-valued vectors, e.g. robot walk control.

Policies

A policy is a rule used by an agent to decide what actions to take. It is the agent’s brain. It is common to substitute the word “policy” for “agent”, e.g. saying “The policy is trying to maximize the reward”.

Deterministic policies

A deterministic policy is usually denoted by $\mu$: $a_t = \mu(s_t)$

TensorFlow code snippet:

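import tensorflow as tf  # uses the TensorFlow 1.x API
# obs_dim and act_dim are the observation and action dimensions
obs = tf.placeholder(shape=(None, obs_dim), dtype=tf.float32)
net = mlp(obs, hidden_dims=(64,64), activation=tf.tanh)
actions = tf.layers.dense(net, units=act_dim, activation=None)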

Here mlp is a helper function that stacks several dense (fully connected) layers with the given hidden sizes and activation.

Stochastic policies

A stochastic policy is usually denoted by $\pi$: $a_t \sim \pi(\cdot \vert s_t)$

The two most common kinds of stochastic policies in deep RL are categorical policies and diagonal Gaussian policies.

Categorical policies

Categorical policies are used in discrete action spaces.

A categorical policy is like a classifier over discrete actions:

Network: built the same way as a classifier. The input is the observation, followed by some layers (convolutional or fully connected, depending on the kind of input). A final dense layer gives the logits for each action, followed by a softmax that converts the logits to probabilities.
Sampling: given the probabilities for each action, frameworks like TensorFlow have built-in tools for sampling, e.g. tf.distributions.Categorical or tf.multinomial.
Log-likelihood: denote the last layer of probabilities as $P_{\theta}(s)$, a vector with one entry per action. Treating the actions as indices into this vector, the log-likelihood of an action $a$ is obtained by indexing into it:
$\log \pi_{\theta} (a \vert s) = \log [P_{\theta}(s)]_a$
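The three steps of a categorical policy (softmax, sampling, log-likelihood by indexing) can be sketched without any deep learning framework. This is a minimal standard-library illustration; the logits are made-up numbers standing in for the output of the final dense layer, and the function names are chosen here for clarity only.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng=random):
    """Draw an action index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def log_likelihood(probs, action):
    """log pi(a|s) = log [P(s)]_a: index into the probability vector."""
    return math.log(probs[action])


logits = [2.0, 0.5, -1.0]  # illustrative logits for three discrete actions
probs = softmax(logits)
action = sample_action(probs)
print(probs)
print(action, log_likelihood(probs, action))
```

In a real implementation the logits would come from the network and the log-likelihood would be computed on tensors so that gradients can flow through it.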

Diagonal Gaussian policies

Diagonal Gaussian policies are used in continuous action spaces.

A diagonal Gaussian distribution is a special case of a multivariate Gaussian in which the covariance matrix has nonzero entries only on the diagonal, so it can be represented as a vector.

A neural network maps observations to mean actions, $\mu_{\theta}(s)$. The standard deviations are typically represented in one of two ways:

The first way: there is a single vector of log standard deviations, $\log \sigma$, which is not a function of the state; the $\log \sigma$ are standalone parameters.
The second way: a neural network maps states to log standard deviations, $\log \sigma_{\theta}(s)$. It may optionally share some layers with the mean network.

Both ways output log standard deviations instead of standard deviations directly, because log stds are free to take any value in $(-\infty, \infty)$, while stds must be non-negative. It is easier to train parameters without such constraints.

Sampling: given the mean action $\mu_{\theta}(s)$, the standard deviation $\sigma_{\theta}(s)$, and a vector $z$ of noise from a spherical Gaussian, $z \sim \mathcal{N}(0, I)$, an action sample can be computed as
$a = \mu_{\theta}(s) + \sigma_{\theta}(s) \odot z$
where $\odot$ denotes the element-wise product.
Log-likelihood: the log-likelihood of a $k$-dimensional action $a$, for a diagonal Gaussian with mean $\mu = \mu_{\theta}(s)$ and standard deviation $\sigma = \sigma_{\theta}(s)$, is
$\log \pi_{\theta}(a \vert s) = - \frac{1}{2} \left( \sum_{i=1}^k \left( \frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2 \log \sigma_i \right) + k \log 2 \pi \right)$
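The sampling rule and the log-likelihood formula above translate almost directly into code. The sketch below uses plain Python lists; the mean and log standard deviation values are illustrative stand-ins for network outputs, and the function names are not from any library.

```python
import math
import random


def sample_gaussian_action(mu, log_std, rng=random):
    """a = mu + sigma * z, with z ~ N(0, I) drawn independently per dimension."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mu, log_std)]


def gaussian_log_likelihood(action, mu, log_std):
    """Log-density of a diagonal Gaussian with the given mean and log stds."""
    k = len(action)
    total = 0.0
    for a, m, ls in zip(action, mu, log_std):
        sigma = math.exp(ls)
        # per-dimension term: (a_i - mu_i)^2 / sigma_i^2 + 2 log sigma_i
        total += ((a - m) ** 2) / (sigma ** 2) + 2.0 * ls
    return -0.5 * (total + k * math.log(2.0 * math.pi))


mu = [0.0, 1.0]        # illustrative mean action
log_std = [0.0, -0.5]  # illustrative log standard deviations
a = sample_gaussian_action(mu, log_std)
print(a, gaussian_log_likelihood(a, mu, log_std))
```

Working with log_std rather than the standard deviation itself means any real number is a valid parameter value, which matches the motivation given in the text.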

Parameterized policies:

In deep RL we deal with parameterized policies: policies whose outputs are computable functions that depend on a set of parameters (e.g. the weights and biases of a neural network).
Let $\theta$ or $\phi$ denote the parameters, written as a subscript on the policy symbol:
$a_t = \mu_{\theta}(s_t)$
$a_t \sim \pi_{\theta}(\cdot \vert s_t)$

Trajectories (a.k.a. Episodes, Rollouts)

A trajectory $\tau$ is a sequence of states and actions in the world:

$\tau = (s_0, a_0, s_1, a_1, \cdots)$

The very first state of the world, $s_0$, is randomly sampled from the start-state distribution, sometimes denoted by $\rho_0$:

$s_0 \sim \rho_0(\cdot)$

State transitions are governed by the natural laws of the environment, and depend only on the current state $s_t$ and the most recent action $a_t$. They are either deterministic,

$s_{t+1} = f(s_t, a_t)$

or stochastic

$s_{t+1} \sim P(\cdot \vert s_t, a_t)$

Reward and return

The reward function $R$ depends on the current state of the world, the action just taken, and the next state of the world:

$r_t = R(s_t, a_t, s_{t+1})$

This is frequently simplified to a dependence on just the current state, $r_t = R(s_t)$, or on the state-action pair, $r_t = R(s_t, a_t)$.

The goal of the agent is to maximize some notion of cumulative reward over a trajectory, $R(\tau)$.

Two kinds of returns:

Finite-horizon undiscounted return: the sum of rewards obtained in a fixed window of steps: $R(\tau) = \sum_{t=0}^T r_t$
Infinite-horizon discounted return: the sum of all rewards ever obtained by the agent, but discounted by how far off in the future they are obtained. This includes a discount factor $\gamma \in (0,1)$: $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t$
- Intuition: “cash now is better than cash later”.
- Mathematically: the discount is convenient. An infinite-horizon sum of rewards may not converge to a finite value, and is hard to deal with in equations. With a discount factor and under reasonable conditions, the infinite sum converges.
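Both kinds of return are easy to compute for a recorded list of rewards. The sketch below is illustrative: the reward values are made up, and a finite list can only approximate the infinite-horizon sum by truncation.

```python
def undiscounted_return(rewards):
    """Finite-horizon undiscounted return: the plain sum of rewards."""
    return sum(rewards)


def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t, accumulated backwards as g = r_t + gamma * g."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # illustrative rewards r_0, r_1, r_2
print(undiscounted_return(rewards))     # 3.0
print(discounted_return(rewards, 0.5))  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5

# With a constant reward of 1 the undiscounted sum grows without bound,
# while the discounted sum approaches 1 / (1 - gamma) = 10 for gamma = 0.9.
print(discounted_return([1.0] * 1000, 0.9))
```

The last line illustrates the convergence argument: the discounted sum stays bounded however many reward terms are added.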

The RL Problem

Whatever the choice of return measure and policy, the goal of RL is to select a policy that maximizes expected return when the agent acts according to it.

Let us suppose the environment transitions and the policy are stochastic. The probability of a $T$-step trajectory is:

$P(\tau \vert \pi) = \rho_0 (s_0) \prod_{t=0}^{T-1} P(s_{t+1} \vert s_t, a_t) \pi(a_t \vert s_t)$

The expected return, denoted by $J(\pi)$, is:

$J(\pi) = \int_\tau P(\tau \vert \pi) R(\tau) = \underset{\tau \sim \pi}{E} [R(\tau)]$

The central optimization problem in RL is expressed as:

$\pi^* = \arg\max_\pi J(\pi)$

where $\pi^{*}$ is the optimal policy.
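For a very small problem, the trajectory probability and the expected return can be computed exactly by enumerating every trajectory. The sketch below assumes a hypothetical two-state, two-action MDP with a uniform random policy; all tables and the reward function are invented for illustration. The probability of a trajectory is the start-state probability multiplied by the transition and policy probabilities of each step.

```python
import itertools

# Hypothetical two-state, two-action MDP used only for illustration.
STATES, ACTIONS = [0, 1], [0, 1]
rho0 = {0: 1.0, 1: 0.0}  # always start in state 0
P = {                    # P[(s, a)][s_next] = transition probability
    (0, 0): {0: 1.0, 1: 0.0},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 1.0, 1: 0.0},
    (1, 1): {0: 0.0, 1: 1.0},
}
pi = {s: {0: 0.5, 1: 0.5} for s in STATES}  # uniform random policy


def reward(s, a, s_next):
    return 1.0 if s_next == 1 else 0.0  # reward for landing in state 1


def trajectory_probability(states, actions):
    """rho0(s_0) times the product over steps of P(s'|s,a) * pi(a|s)."""
    prob = rho0[states[0]]
    for t, a in enumerate(actions):
        prob *= P[(states[t], a)][states[t + 1]] * pi[states[t]][a]
    return prob


def expected_return(T):
    """J(pi): sum over all T-step trajectories of P(tau|pi) * R(tau)."""
    total = 0.0
    for states in itertools.product(STATES, repeat=T + 1):
        for actions in itertools.product(ACTIONS, repeat=T):
            ret = sum(reward(states[t], actions[t], states[t + 1]) for t in range(T))
            total += trajectory_probability(states, actions) * ret
        # trajectories that cannot occur contribute probability 0
    return total


print(trajectory_probability((0, 1, 1), (1, 1)))
print(expected_return(T=2))
```

Enumeration grows exponentially with the horizon, so in practice $J(\pi)$ is estimated from sampled trajectories instead; the exact sum is shown here only to make the definition concrete.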

Value Functions

Q-function

Q-function: the total expected reward from taking $\pmb{a}_t$ in $\pmb{s}_t$ and following $\pi$ afterwards:
$Q^{\pi}(\pmb{s}_t, \pmb{a}_t) = \sum_{t'=t}^T E_{\pi_{\theta}} [r(\pmb{s}_{t'}, \pmb{a}_{t'}) \vert \pmb{s}_t, \pmb{a}_t]$

Value function

Value function: the total expected reward from $\pmb{s}_t$:
$V^{\pi}(\pmb{s}_t) = \sum_{t'=t}^T E_{\pi_{\theta}}[r(\pmb{s}_{t'}, \pmb{a}_{t'}) \vert \pmb{s}_t]$
$V^{\pi}(\pmb{s}_t) = E_{\pmb{a}_t \sim \pi(\pmb{a}_t \vert \pmb{s}_t)} [Q^{\pi} (\pmb{s}_t, \pmb{a}_t)]$

$E_{\pmb{s}_1 \sim p(\pmb{s}_1)}[V^{\pi}(\pmb{s}_1)]$ is the RL objective.

Idea: compute a gradient that increases the probability of good actions $\pmb{a}$:

If $Q^{\pi}(\pmb{s}, \pmb{a}) > V^{\pi}(\pmb{s})$, then $\pmb{a}$ is better than average (recall that $V^{\pi}(\pmb{s}) = E[Q^\pi (\pmb{s}, \pmb{a})]$ under $\pi(\pmb{a} \vert \pmb{s})$).
Modify $\pi(\pmb{a} \vert \pmb{s})$ to increase the probability of $\pmb{a}$ if $Q^{\pi}(\pmb{s}, \pmb{a}) > V^{\pi}(\pmb{s})$.

By value, we mean the expected return if you start in that state or state-action pair, and then act according to a particular policy forever after. Value functions are used in almost every RL algorithm.

There are four main value functions:

On-policy value function, $V^\pi(s)$: gives the expected return if you start in state $s$ and always act according to policy $\pi$:
$V^\pi(s) = \underset{\tau \sim \pi}{E}[R(\tau) \vert s_0 = s]$
On-policy action-value function, $Q^\pi(s,a)$: gives the expected return if you start in state $s$, take an arbitrary action $a$ (which may not have come from the policy), and then forever after act according to policy $\pi$:
$Q^{\pi}(s,a) = \underset{\tau \sim \pi}{E}[R(\tau) \vert s_0=s, a_0=a]$
Optimal value function, $V^*(s)$: gives the expected return if you start in state $s$ and always act according to the optimal policy in the environment:
$V^*(s) = \max_\pi \underset{\tau \sim \pi}{E}[R(\tau) \vert s_0 = s]$
Optimal action-value function, $Q^*(s,a)$: gives the expected return if you start in state $s$, take an arbitrary action $a$, and then forever after act according to the optimal policy in the environment:
$Q^*(s,a) = \max_\pi \underset{\tau \sim \pi}{E} [R(\tau) \vert s_0=s, a_0=a]$

The optimal Q-function and the optimal action

Given $Q^*$, the optimal action in state $s$ is the one that maximizes it:
$a^*(s) = \arg\max_{a} Q^*(s,a)$

Bellman equations

Basic idea: the value of your starting point is the reward you expect to get from being there, plus the value of wherever you land next.
Bellman equations for the on-policy value functions:
$V^{\pi}(s) = \underset{\underset{s' \sim P}{a \sim \pi}}{E} [r(s,a) + \gamma V^{\pi}(s')]$
$Q^\pi (s,a) = \underset{s' \sim P}{E} \big[r(s,a) + \gamma \underset{a' \sim \pi}{E} [Q^\pi (s', a')] \big]$

where $s' \sim P$ is shorthand for $s' \sim P(\cdot \vert s,a)$, indicating that the next state $s'$ is sampled from the environment's transition rules; $a \sim \pi$ is shorthand for $a \sim \pi(\cdot \vert s)$; and $a' \sim \pi$ is shorthand for $a' \sim \pi(\cdot \vert s')$.

Bellman equations for the optimal value functions:
$V^*(s) = \max_a \underset{s' \sim P}{E} [r(s,a) + \gamma V^*(s')]$
$Q^*(s,a) = \underset{s' \sim P}{E} \big[ r(s,a) + \gamma \max_{a'} Q^*(s', a') \big]$

Difference between the Bellman equations for the on-policy value functions and the optimal value functions: the absence or presence of the $\max$ over actions.
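One way to see the difference is to apply each Bellman equation repeatedly as an update rule on a tiny tabular problem. The sketch below assumes a hypothetical deterministic two-state MDP with rewards that depend on the state-action pair; the tables, the discount factor of 0.9, and the function names are all illustrative. The two routines differ only in whether they average over the policy or take the max over actions.

```python
GAMMA = 0.9
STATES, ACTIONS = [0, 1], [0, 1]

# Hypothetical MDP: action 0 leads to state 0, action 1 leads to state 1.
P = {(0, 0): {0: 1.0}, (0, 1): {1: 1.0}, (1, 0): {0: 1.0}, (1, 1): {1: 1.0}}
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}
pi = {s: {a: 0.5 for a in ACTIONS} for s in STATES}  # uniform random policy


def backup(s, a, V):
    """E over s' ~ P of r(s, a) + gamma * V(s')."""
    return r[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items())


def policy_evaluation(iters=500):
    """Iterate the on-policy Bellman equation: average over a ~ pi."""
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        V = {s: sum(pi[s][a] * backup(s, a, V) for a in ACTIONS) for s in STATES}
    return V


def value_iteration(iters=500):
    """Iterate the optimal Bellman equation: max over actions."""
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        V = {s: max(backup(s, a, V) for a in ACTIONS) for s in STATES}
    return V


V_pi = policy_evaluation()
V_star = value_iteration()
# Greedy action with respect to Q*(s, a) = backup(s, a, V*).
greedy = {s: max(ACTIONS, key=lambda a: backup(s, a, V_star)) for s in STATES}
print(V_pi)
print(V_star)
print(greedy)
```

For this toy problem the iterates approach $V^\pi \approx (7.25, 7.75)$ and $V^* \approx (19, 20)$, and the greedy action is 1 in both states, which matches solving the two pairs of Bellman equations by hand.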

Advantage functions

Intuition: in RL we often do not need to know how good an action is in an absolute sense, only how much better it is than the others on average, i.e. the relative advantage of that action.
The advantage function $A^\pi (s,a)$ corresponding to a policy $\pi$ describes how much better it is to take a specific action $a$ in state $s$, over randomly selecting an action according to $\pi(\cdot \vert s)$ , assuming you act according to $\pi$ forever after.
$A^\pi(s,a) = Q^\pi (s,a) - V^\pi(s)$
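A small numeric example may make the definition concrete. The Q-values and policy probabilities below are made-up numbers for a single state with three actions; the value is computed as the policy-weighted average of the Q-values, as in the value function section.

```python
# Illustrative numbers for one state s with three actions (not from the source).
q_values = [1.0, 2.0, 4.0]        # Q^pi(s, a) for a = 0, 1, 2
policy_probs = [0.5, 0.25, 0.25]  # pi(a | s)

# V^pi(s) is the expectation of Q^pi(s, a) under the policy.
v = sum(p * q for p, q in zip(policy_probs, q_values))

# A^pi(s, a) = Q^pi(s, a) - V^pi(s)
advantages = [q - v for q in q_values]

# The advantage averages to zero under the policy itself.
mean_advantage = sum(p * adv for p, adv in zip(policy_probs, advantages))

print(v)               # 2.0
print(advantages)      # [-1.0, 0.0, 2.0]
print(mean_advantage)  # 0.0
```

Action 2 has a positive advantage, so it is better than what the policy does on average in this state, while action 0 is worse than average.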

Markov Decision Process

An MDP is a 5-tuple $\langle S, A, R, P, \rho_0 \rangle$, where

$S$ is the set of all valid states
$A$ is the set of all valid actions
$R$: $S \times A \times S \rightarrow \mathbb{R}$ is the reward function, with $r_t = R(s_t, a_t, s_{t+1})$
$P$: $S \times A \rightarrow \mathcal{P}(S)$ is the transition probability function, with $P(s' \vert s,a)$ being the probability of transitioning into state $s'$ if you start in state $s$ and take action $a$
$\rho_0$ is the starting state distribution.

Markov property:

Transitions depend only on the most recent state and action, and on no prior history.
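The 5-tuple maps naturally onto a small data structure. The sketch below is one possible layout, not a standard API: the MDP class, its field names, and the toy numbers are all hypothetical. The step function only needs the current state and action, which is the Markov property in code form.

```python
import random
from typing import Callable, Dict, List, NamedTuple, Tuple


class MDP(NamedTuple):
    """Container mirroring (S, A, R, P, rho_0); field names are illustrative."""
    states: List[int]
    actions: List[int]
    reward: Callable[[int, int, int], float]             # R(s, a, s')
    transition: Dict[Tuple[int, int], Dict[int, float]]  # P(s' | s, a)
    start_dist: Dict[int, float]                         # rho_0(s)


def sample_from(dist, rng=random):
    """Sample a key of dist with probability equal to its value."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]


def step(mdp, s, a, rng=random):
    """Sample s' ~ P(. | s, a) and return (s', reward); no history is used."""
    s_next = sample_from(mdp.transition[(s, a)], rng)
    return s_next, mdp.reward(s, a, s_next)


toy = MDP(
    states=[0, 1],
    actions=[0, 1],
    reward=lambda s, a, s_next: 1.0 if s_next == 1 else 0.0,
    transition={
        (0, 0): {0: 1.0}, (0, 1): {0: 0.2, 1: 0.8},
        (1, 0): {0: 1.0}, (1, 1): {1: 1.0},
    },
    start_dist={0: 1.0},
)

s = sample_from(toy.start_dist)
s_next, rew = step(toy, s, a=1)
print(s, s_next, rew)
```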

References

1. OpenAI, Spinning Up, Part 1: Key Concepts in RL