Notes on lectures by D. Silver. A brief introduction to RL.

# Introduction to Reinforcement Learning

## Characteristics vs. other ML

• No supervisor, only a reward signal
• Feedback is delayed, not instantaneous
• Time really matters (sequential, non i.i.d data)
• Agent’s actions affect the subsequent data it receives

## RL problem

### Rewards

• A reward $R_t$ is a scalar feedback signal
• Indicates how well agent is doing at step $t$
• The agent’s job is to maximize cumulative reward

RL is based on the reward hypothesis: all goals can be described by the maximization of expected cumulative reward.
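
A minimal sketch of the quantity being maximized; the discount factor $\gamma$ is an assumption here ($\gamma$ is introduced later in the course, and $\gamma = 1$ recovers the plain cumulative sum):

```python
# Discounted return G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
# gamma is an assumption (introduced later in the course); gamma = 1 gives
# the plain cumulative reward.

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0 + 0.81 * 2 = 2.62
```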

### Sequential Decision making

• Goal: select actions to maximize total future reward
• Actions may have long term consequences
• Reward may be delayed
• It may be better to sacrifice immediate reward to gain more long-term reward

### Environments

• At each time step $t$ the agent:
• executes action $A_t$
• receives observation $O_t$
• receives scalar reward $R_t$
• The environment:
• receives action $A_t$
• emits observation $O_{t+1}$
• emits scalar reward $R_{t+1}$
• $t$ increments at the environment step
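
A minimal sketch of this interaction loop; `env` and `agent` are hypothetical objects, and the method names (`reset`, `step`, `act`) are assumptions modeled on common RL interfaces, not an API from the lecture:

```python
# Hypothetical agent-environment loop; the `env` and `agent` interfaces
# are assumptions, not a real library API.

def run_episode(env, agent, max_steps=1000):
    obs, reward = env.reset(), 0.0
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(obs, reward)        # agent executes A_t
        obs, reward, done = env.step(action)   # env emits O_{t+1}, R_{t+1}
        total_reward += reward
        if done:
            break
    return total_reward
```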

### State

The history is the sequence of observations, actions, rewards: $H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$

• i.e. all observable variables up to time $t$
• i.e. the sensorimotor stream of a robot or embodied agent
• What happens depends on the history:
• The agent selects actions
• The environment selects observations / rewards
• State is the information used to determine what happens next. Formally, state is a function of the history: $S_t = f(H_t)$
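
As a minimal illustration of $S_t = f(H_t)$, one trivial (and often insufficient) choice of $f$ keeps only the latest observation; the tuple layout of the history is an assumption:

```python
# State as a function of the history, S_t = f(H_t). Keeping only the latest
# observation is a trivial choice of f; the (obs, action, reward) tuple
# layout is an assumption for illustration.

def last_observation_state(history):
    obs, _, _ = history[-1]
    return obs
```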

#### Environment state

The environment state $S_t^e$ is the environment’s private representation, i.e. whatever data the environment uses to pick the next observation / reward.

• The environment state is not usually visible to the agent. Even if $S_t^e$ is visible, it may contain irrelevant information.

#### Agent state

The agent state $S_t^a$ is the agent’s internal representation.

• i.e. whatever information the agent uses to pick the next action.
• i.e. it is the information used by RL algorithms.
• It can be any function of the history: $S_t^a = f(H_t)$

#### Information state

An information state (a.k.a. Markov state) contains all useful information from the history.

Definition: A state $S_t$ is Markov if and only if $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$

• “The future is independent of the past given the present”
• Once the state is known, the history may be thrown away, i.e. the state is a sufficient statistic of the future
• The environment state $S_t^e$ is Markov
• The history $H_t$ is Markov

#### Fully observable environment

Full observability: the agent directly observes the environment state, i.e. $O_t = S_t^a = S_t^e$

• Agent state = environment state = information state
• Formally, this is a Markov decision process (MDP)

#### Partially observable environments

Partial observability: the agent indirectly observes the environment, e.g.:

• A robot with camera vision is not told its absolute location.
• A trading agent only observes current prices.

Now agent state $\neq$ environment state.

• Formally, this is a partially observable Markov decision process (POMDP).
• Agent must construct its own state representation $S_t^a$, e.g.
• Complete history $S_t^a = H_t$
• Beliefs of environment state: $S_t^a = (\mathbb{P}(S_t^e = s^1),...,\mathbb{P}(S_t^e = s^n))$
• Recurrent neural net: $S_t^a = \sigma(S^a_{t-1} W_s + O_t W_o)$
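
A sketch of the recurrent update in the last bullet, in numpy; the dimensions, the sigmoid choice for $\sigma$, and the random weights are all assumptions:

```python
import numpy as np

# Recurrent agent state: S_t^a = sigma(S_{t-1}^a W_s + O_t W_o).
# Dimensions, weights, and the sigmoid choice of sigma are assumptions.

def rnn_agent_state(prev_state, obs, W_s, W_o):
    pre = prev_state @ W_s + obs @ W_o
    return 1.0 / (1.0 + np.exp(-pre))  # elementwise sigmoid

state_dim, obs_dim = 4, 3
rng = np.random.default_rng(0)
W_s = rng.normal(size=(state_dim, state_dim))
W_o = rng.normal(size=(obs_dim, state_dim))

s = np.zeros(state_dim)                  # S_0^a
for o in rng.normal(size=(5, obs_dim)):  # a short stream of observations
    s = rnn_agent_state(s, o, W_s, W_o)  # fold each O_t into the state
```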

## RL agent

### Major components

• Policy: agent’s behavior function
• Value function: how good is each state and/or action
• Model: agent’s representation of the environment

#### Policy

• A policy is the agent’s behavior
• It is a map from state to action, e.g.
• Deterministic policy: $a=\pi(s)$
• Stochastic policy: $\pi(a|s) = \mathbb{P}[A_t=a|S_t=s]$
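
A sketch of both policy types on a toy two-state, two-action problem; the tables are illustrative assumptions:

```python
import numpy as np

# Deterministic policy a = pi(s): a simple lookup table (illustrative).
pi_det = {0: 1, 1: 0}  # state -> action

# Stochastic policy pi(a|s) = P[A_t = a | S_t = s]: each row sums to 1.
pi_stoch = np.array([[0.8, 0.2],   # action probabilities in state 0
                     [0.1, 0.9]])  # action probabilities in state 1

def sample_action(state, rng=np.random.default_rng()):
    return int(rng.choice(len(pi_stoch[state]), p=pi_stoch[state]))
```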

#### Value function

• Value function is a prediction of future reward
• Used to evaluate the goodness / badness of states
• Therefore used to select between actions, e.g. $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s]$
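
As one concrete way to realize this prediction, a sketch that estimates $v_\pi(s)$ by averaging sampled returns (Monte Carlo evaluation, developed later in the course); the `(state, reward)` episode format is an assumption:

```python
from collections import defaultdict

# Estimate v_pi(s) by averaging returns observed from each state.
# Episodes are lists of (state, reward) pairs, where reward is the R_{t+1}
# received after leaving that state -- an assumed format for illustration.

def mc_value_estimate(episodes, gamma=0.9):
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        for state, reward in reversed(episode):
            g = reward + gamma * g   # G_t = R_{t+1} + gamma * G_{t+1}
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```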

#### Model

A model predicts what the environment will do next

• $\mathcal{P}$ predicts the next state, e.g. $\mathcal{P}_{ss'}^a = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
• $\mathcal{R}$ predicts the next (immediate) reward, e.g. $\mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
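
A sketch of a tabular model that estimates $\mathcal{P}_{ss'}^a$ and $\mathcal{R}_s^a$ from counts over observed transitions; the state/action sizes and the update interface are assumptions:

```python
import numpy as np

# Tabular model: count-based estimates of P_ss'^a and R_s^a.
# Sizes and the transition format are assumptions for illustration.

n_states, n_actions = 3, 2
counts = np.zeros((n_states, n_actions, n_states))
reward_sum = np.zeros((n_states, n_actions))

def update_model(s, a, r, s_next):
    counts[s, a, s_next] += 1
    reward_sum[s, a] += r

def model(s, a):
    n = counts[s, a].sum()         # assumes (s, a) has been visited
    p_next = counts[s, a] / n      # estimate of P[S_{t+1} = s' | s, a]
    r_mean = reward_sum[s, a] / n  # estimate of E[R_{t+1} | s, a]
    return p_next, r_mean
```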

### Categorizing RL agents (1)

• Value based
• No policy (implicit)
• Value function
• Policy based
• Policy
• No value function
• Actor Critic
• Policy
• Value function

### Categorizing RL agents (2)

• Model Free
• Policy and/or Value Function
• No Model
• Model based
• Policy and/or Value Function
• Model

## Problems within RL

### Learning and Planning

Two fundamental problems in sequential decision making

• Reinforcement learning
• The environment is initially unknown
• The agent interacts with the environment
• The agent improves its policy
• Planning
• A model of the environment is known
• The agent performs computations with its model (without any external interaction)
• The agent improves its policy
• a.k.a. deliberation, reasoning, introspection, pondering, thought, search

### Exploration and Exploitation

• RL is like trial-and-error learning
• The agent should discover a good policy
• From its experiences of the environment
• Without losing too much reward along the way
• Exploration finds more information about the environment
• Exploitation exploits known information to maximize reward
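
One standard way to trade off the two (not developed in this lecture) is $\epsilon$-greedy action selection; the action-value table `q_values` is an assumed input:

```python
import numpy as np

# Epsilon-greedy: explore with probability epsilon, otherwise exploit the
# action currently believed best. q_values is an assumed per-action estimate.

def epsilon_greedy(q_values, epsilon=0.1, rng=np.random.default_rng()):
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: random action
    return int(np.argmax(q_values))              # exploit: greedy action
```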