Yekun's Note

Machine learning notes and writeup.

Introduction to Reinforcement Learning

Notes on the lectures by D. Silver: a brief introduction to RL.

Characteristics vs. other ML paradigms

  • No supervisor, only a reward signal
  • Feedback is delayed, not instantaneous
  • Time really matters (sequential, non-i.i.d. data)
  • The agent’s actions affect the subsequent data it receives

RL problem


  • A reward $R_t$ is a scalar feedback signal
  • It indicates how well the agent is doing at step $t$
  • The agent’s job is to maximize cumulative reward

RL is based on the reward hypothesis: all goals can be described by the maximization of expected cumulative reward.

Sequential Decision making

  • Goal: select actions to maximize total future reward
  • Actions may have long-term consequences
  • Reward may be delayed
  • It may be better to sacrifice immediate reward to gain more long-term reward


  • At each time step $t$ the agent:
    • Executes action $A_t$
    • Receives observation $O_t$
    • Receives scalar reward $R_t$
  • The environment:
    • Receives action $A_t$
    • Emits observation $O_{t+1}$
    • Emits scalar reward $R_{t+1}$
  • $t$ increments at each environment step
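The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lectures: `CoinFlipEnv`, `AlwaysHeadsAgent`, and `run` are hypothetical names, and the toy environment (guessing a biased coin) is an assumption chosen only to keep the example short.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: the agent guesses a biased coin flip."""

    def __init__(self, p_heads=0.8, seed=0):
        self.p_heads = p_heads
        self.rng = random.Random(seed)

    def step(self, action):
        # The environment receives the action, then emits the next
        # observation (the coin outcome) and a scalar reward.
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else 0.0
        return outcome, reward


class AlwaysHeadsAgent:
    """Hypothetical agent that ignores its observation and always guesses 1."""

    def act(self, observation):
        return 1


def run(env, agent, num_steps):
    """Run the agent-environment loop and return the cumulative reward."""
    observation, total_reward = None, 0.0
    for _ in range(num_steps):
        action = agent.act(observation)         # agent executes an action
        observation, reward = env.step(action)  # environment responds
        total_reward += reward                  # time advances by one step
    return total_reward


if __name__ == "__main__":
    print(run(CoinFlipEnv(seed=0), AlwaysHeadsAgent(), 100))
```

The agent never sees `p_heads`; it only sees observations and rewards, which is the sense in which the environment is initially unknown.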


The history $H_t$ is the sequence of observations $O$, actions $A$, and rewards $R$ up to time $t$:

$$H_t = O_1, R_1, A_1, \ldots, A_{t-1}, O_t, R_t$$

  • i.e. all observable variables up to time $t$
  • i.e. the sensorimotor stream of a robot or embodied agent
  • What happens depends on the history:
    • The agent selects actions
    • The environment selects observations / rewards
  • State is the information used to determine what happens next. Formally, the state $S_t$ is a function of the history $H_t$: $S_t = f(H_t)$
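To make "state is a function of the history" concrete, here is a small sketch. The history is stored as a list of `(observation, reward, action)` tuples, with `None` for the action not yet taken at the latest step; this layout and the two function names are assumptions made for illustration.

```python
def last_observation_state(history):
    """State = most recent observation only (discards everything else)."""
    return history[-1][0]


def stacked_state(history, k=3):
    """State = the last k observations, a common way to retain short-term memory."""
    return tuple(obs for obs, _reward, _action in history[-k:])


# A hypothetical three-step history: (observation, reward, action).
history = [(0, 0.0, 1), (2, 1.0, 0), (5, 0.0, None)]
print(last_observation_state(history))
print(stacked_state(history, k=2))
```

Both functions are valid choices of $f$; they differ in how much of the history they keep, which is exactly the design decision behind the agent state.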

Environment state

The environment state is the environment’s private representation, i.e. whatever data the environment uses to pick the next observation / reward.

  • The environment state is not usually visible to the agent. Even if it is visible, it may contain irrelevant information.

Agent state

The agent state is the agent’s internal representation.

  • i.e. whatever information the agent uses to pick the next action.
  • i.e. it is the information used by RL algorithms.
  • Writing $S_t^a$ for the agent state, it can be any function of the history $H_t$: $S_t^a = f(H_t)$

Information state

An information state (a.k.a. Markov state) contains all useful information from the history.

Definition: A state $S_t$ is Markov if and only if

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \ldots, S_t]$$

  • “The future is independent of the past given the present”
  • Once the state is known, the history may be thrown away, i.e. the state is a sufficient statistic of the future
  • The environment state is Markov
  • The history is Markov

Fully observable environment

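Full observability: the agent directly observes the environment state, i.e. the observation $O_t$ equals both the agent state $S_t^a$ and the environment state $S_t^e$: $O_t = S_t^a = S_t^e$.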

  • Agent state = environment state = information state
  • Formally, this is a Markov decision process (MDP)

Partially observable environments

Partial observability: the agent indirectly observes the environment, e.g.:

  • A robot with camera vision is not told its absolute location.
  • A trading agent only observes current prices.

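Now agent state $\neq$ environment state.

  • Formally, this is a partially observable Markov decision process (POMDP).
  • The agent must construct its own state representation $S_t^a$ (from the history $H_t$ and observations $O_t$, without access to the environment state $S_t^e$), e.g.
    • Complete history: $S_t^a = H_t$
    • Beliefs of environment state: $S_t^a = (\mathbb{P}[S_t^e = s^1], \ldots, \mathbb{P}[S_t^e = s^n])$
    • Recurrent neural net: $S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)$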
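As an illustration of the "beliefs of environment state" option, the sketch below keeps a probability vector over a finite set of hidden states and updates it with one predict-and-correct step of a discrete Bayes filter. The notes do not specify an update rule, so the filter, the function name `belief_update`, and the matrices in the example are all assumptions.

```python
def belief_update(belief, transition, obs_likelihood):
    """One discrete Bayes-filter step over n hidden states.

    belief[i]          current probability of hidden state i
    transition[i][j]   assumed probability of moving from state i to state j
    obs_likelihood[j]  assumed probability of the new observation in state j
    """
    n = len(belief)
    # Predict: push the belief through the transition model.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n))
                 for j in range(n)]
    # Correct: reweight by how well each state explains the observation.
    unnormalized = [predicted[j] * obs_likelihood[j] for j in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under the belief")
    return [u / total for u in unnormalized]


# Hypothetical two-state example: static world, informative observation.
belief = belief_update([0.5, 0.5],
                       [[1.0, 0.0], [0.0, 1.0]],
                       [0.9, 0.1])
print(belief)
```

The resulting vector is itself an agent state: it summarizes the history without storing it.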

RL agent

Major components

  • Policy: agent’s behavior function
  • Value function: how good is each state and/or action
  • Model: agent’s representation of the environment


Policy

  • A policy is the agent’s behavior
  • It is a map from state to action, e.g.
    • Deterministic policy: $a = \pi(s)$
    • Stochastic policy: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
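The two kinds of policy can be contrasted with a short sketch. The state and action names (a battery-powered robot choosing between `"search"` and `"recharge"`) and the probabilities are hypothetical, picked only for illustration.

```python
import random

# Deterministic policy: a plain lookup from state to a single action.
DETERMINISTIC_TABLE = {"low_battery": "recharge", "high_battery": "search"}

# Stochastic policy: a distribution over actions for each state.
STOCHASTIC_TABLE = {
    "low_battery": {"recharge": 0.9, "search": 0.1},
    "high_battery": {"recharge": 0.2, "search": 0.8},
}


def deterministic_policy(state):
    """Return the single action assigned to this state."""
    return DETERMINISTIC_TABLE[state]


def stochastic_policy(state, rng):
    """Sample an action according to the per-state action probabilities."""
    probs = STOCHASTIC_TABLE[state]
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)
print(deterministic_policy("low_battery"))
print(stochastic_policy("low_battery", rng))
```

A deterministic policy is the special case of a stochastic one that puts all probability on one action per state.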

Value function

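  • Value function is a prediction of future reward
  • Used to evaluate the goodness / badness of states
  • And therefore to select between actions, e.g.

$$v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \mid S_t = s\right]$$

where $\gamma \in [0, 1]$ is the discount factor.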
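A value function predicts future reward, and one simple way to estimate such a prediction is to average the discounted returns observed from a state. The sketch below assumes the usual discount factor `gamma` in $[0, 1]$ and uses made-up reward sequences; the function names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Compute r_1 + gamma * r_2 + gamma^2 * r_3 + ... for one episode."""
    g = 0.0
    # Work backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def monte_carlo_value(episodes, gamma):
    """Estimate a state's value as the mean return over sampled episodes.

    Each episode is the list of rewards received after visiting the state.
    """
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# Two hypothetical reward sequences observed from the same start state.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
print(monte_carlo_value([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]], gamma=0.5))
```

With `gamma` close to 0 the estimate is dominated by immediate reward; with `gamma` close to 1 distant rewards count almost as much as near ones.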


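Model

A model predicts what the environment will do next

  • $\mathcal{P}$ predicts the next state
  • $\mathcal{R}$ predicts the next (immediate) reward, e.g.

$$\mathcal{P}_{ss'}^a = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$$
$$\mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$$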
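One simple way an agent might build such a model is by counting observed transitions and averaging observed rewards. The sketch below is a tabular estimate under that assumption; the class `TabularModel` and its method names are hypothetical, not part of the lecture material.

```python
from collections import defaultdict


class TabularModel:
    """Count-based estimate of next-state probabilities and expected reward."""

    def __init__(self):
        self.next_counts = defaultdict(lambda: defaultdict(int))
        self.reward_sum = defaultdict(float)
        self.visits = defaultdict(int)

    def update(self, state, action, reward, next_state):
        """Record one observed transition (s, a, r, s')."""
        key = (state, action)
        self.next_counts[key][next_state] += 1
        self.reward_sum[key] += reward
        self.visits[key] += 1

    def transition_prob(self, state, action, next_state):
        """Empirical probability of reaching next_state; 0.0 if (s, a) is unseen."""
        key = (state, action)
        n = self.visits[key]
        return self.next_counts[key][next_state] / n if n else 0.0

    def expected_reward(self, state, action):
        """Empirical mean immediate reward; 0.0 if (s, a) is unseen."""
        key = (state, action)
        n = self.visits[key]
        return self.reward_sum[key] / n if n else 0.0


model = TabularModel()
model.update("s0", "a", 1.0, "s1")
model.update("s0", "a", 0.0, "s0")
print(model.transition_prob("s0", "a", "s1"), model.expected_reward("s0", "a"))
```

Returning 0.0 for unseen state-action pairs is an arbitrary default; an optimistic default would instead encourage the agent to try them.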

RL agent category 1

  • Value based
    • No policy (implicit)
    • Value function
  • Policy based
    • Policy
    • No value function
  • Actor Critic
    • Policy
    • Value function

RL agent category 2

  • Model Free
    • Policy and/or Value Function
    • No Model
  • Model based
    • Policy and/or Value Function
    • Model

Problems within RL

Learning and Planning

Two fundamental problems in sequential decision making

  • Reinforcement learning
    • The environment is initially unknown
    • The agent interacts with the environment
    • The agent improves its policy
  • Planning
    • A model of the environment is known
    • The agent performs computations with its model (without any external interaction)
    • The agent improves its policy
    • a.k.a. deliberation, reasoning, introspection, pondering, thought, search
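As one possible illustration of planning, the sketch below runs value iteration on a tiny known model, using no interaction with any environment. Value iteration is a choice made here for the example, not a method named in these notes, and the two-state MDP, its rewards, and `gamma = 0.9` are assumptions.

```python
def q_value(P, R, V, s, a, gamma):
    """Expected immediate reward plus discounted value of the next state."""
    return R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])


def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Plan with a known model.

    P[s][a] is a list of (probability, next_state) pairs.
    R[s][a] is the expected immediate reward.
    Returns the state values and a greedy policy.
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(P, R, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    policy = {s: max(P[s], key=lambda a: q_value(P, R, V, s, a, gamma))
              for s in P}
    return V, policy


# Hypothetical MDP: staying in A pays 1 per step, but moving to B (reward 0
# on the way) unlocks a reward of 2 per step.
P = {"A": {"stay": [(1.0, "A")], "go": [(1.0, "B")]},
     "B": {"stay": [(1.0, "B")], "back": [(1.0, "A")]}}
R = {"A": {"stay": 1.0, "go": 0.0},
     "B": {"stay": 2.0, "back": 0.0}}

V, policy = value_iteration(P, R)
print(policy)
```

In this example the planner gives up the immediate reward in A to reach B, a case where sacrificing short-term reward yields more long-term reward.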

Exploration and Exploitation

  • RL is like trial-and-error learning
  • The agent should discover a good policy from its experiences of the environment, without losing too much reward along the way

  • Exploration finds more information about the environment; exploitation exploits known information to maximize reward

  • It is usually important to explore as well as exploit.
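The trade-off can be illustrated with an epsilon-greedy strategy on a multi-armed bandit: with probability `epsilon` the agent explores a random arm, otherwise it exploits the arm with the best current estimate. Epsilon-greedy is one common heuristic chosen here for illustration; the Gaussian reward model, arm means, and parameter values are assumptions.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=2000, seed=0):
    """Run epsilon-greedy on a Gaussian bandit.

    Returns the per-arm value estimates, pull counts, and total reward.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: gather information about a random arm.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pull the arm that currently looks best.
            arm = max(range(n_arms), key=lambda i: estimates[i])
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental running mean of the rewards seen for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return estimates, counts, total_reward


estimates, counts, total_reward = epsilon_greedy_bandit([0.0, 5.0])
print(counts)
```

Setting `epsilon = 0` removes exploration entirely, so the agent can lock onto whichever arm looked best early on; setting `epsilon = 1` never uses what has been learned.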