What is Reinforcement Learning?

Quick informal recap of what RL is, for the audience to get the vibe

Recap

Reminder of MDPs

Remind the audience of the basic structure of an MDP

  • S: state space
  • A: action space
  • P: state transition probability function
  • R: reward function

Policy

A policy is a mapping from states to actions; it can be deterministic, $a = \pi(s)$, or stochastic, $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$.
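
A toy sketch of both cases (plain Python; the states, actions, and probabilities are made up for illustration):

```python
import random

# Toy illustration: deterministic vs. stochastic policy on a two-state problem.
deterministic_policy = {"s0": "right", "s1": "left"}   # pi(s) = a

stochastic_policy = {                                  # pi(a | s)
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.6, "right": 0.4},
}

def act(state):
    """Sample an action from the stochastic policy."""
    probs = stochastic_policy[state]
    return random.choices(list(probs), weights=probs.values())[0]

print(deterministic_policy["s0"], act("s0"))
```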

Value functions

  • return (with discount factor $\gamma \in [0, 1]$): $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
  • state value: $V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
  • action value: $Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
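
As a concrete illustration of the return and of the state value as an expectation over returns, here is a minimal sketch (plain Python; the helper names and reward numbers are made up):

```python
# Minimal sketch: discounted return and a Monte Carlo state-value estimate.
# `rewards` is the reward sequence of one sampled episode; names are illustrative.

def discounted_return(rewards, gamma=0.99):
    """G_0 = sum_k gamma^k * R_{k+1}, computed backwards over one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def mc_state_value(episodes, gamma=0.99):
    """V(s0) estimated by averaging returns of episodes starting in s0."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)

# Example: three made-up episodes sampled from the start state.
print(mc_state_value([[1, 0, 0, 1], [0, 1, 1], [1, 1, 0, 0, 1]]))
```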

Value functions visual representation

insert graphic of one step lookahead

Algorithms

Value based

Learn a value function and derive a policy from the values (e.g. $\varepsilon$-greedy; see the sketch below)
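
To illustrate deriving a policy from action values, a minimal $\varepsilon$-greedy sketch (plain Python; the Q-table entries are made up):

```python
import random

# Minimal sketch of epsilon-greedy action selection from a Q-table.
# The table values below are made up purely for illustration.
Q = {("s0", "left"): 0.1, ("s0", "right"): 0.7}
ACTIONS = ["left", "right"]

def epsilon_greedy(state, epsilon=0.1):
    """With probability epsilon explore uniformly, otherwise take argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

print(epsilon_greedy("s0"))
```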

we will not talk about these, only the pros and cons

pros:

  • easy to understand
  • easy to implement

cons:

  • hard to scale to high-dimensional state and/or action spaces
  • hard to learn a stochastic policy

Policy based

Directly learn a policy; no intermediate value function is needed

pros:

  • adapts well to high-dimensional state and/or action spaces

cons:

  • can get stuck in local optima

Value based vs Policy based

insert infographic

Policy gradient

Define an objective function $J(\theta)$ and move the policy parameters $\theta$ in the direction of steepest ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

Some examples of objective functions:

  • If the problem is episodic, meaning that it always starts at state $s_0$, terminates eventually, and restarts at $s_0$, then taking the state value of the start state makes sense: $J_{\text{start}}(\theta) = V^{\pi_\theta}(s_0)$
  • For continuing environments, that is, environments that never terminate and restart but keep on rolling, the average value of the states weighted by their stationary distribution $d^{\pi_\theta}$ works: $J_{\text{avV}}(\theta) = \sum_s d^{\pi_\theta}(s) \, V^{\pi_\theta}(s)$
  • Still for continuing environments, we can also take the average reward after a single step, weighted by the stationary distribution: $J_{\text{avR}}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s) \, R(s, a)$

Policy gradient theorem

For any differentiable policy $\pi_\theta(a \mid s)$ and any of the policy objective functions $J(\theta)$ above, the policy gradient is the following:

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a)\right]$

where $Q^{\pi_\theta}(s, a)$ is the long-term value of a state-action pair.
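
To make the theorem concrete, here is a minimal REINFORCE-style sketch (NumPy; every name and number below is made up for illustration): a softmax policy with linear scores, where the sampled return $G_t$ stands in for $Q^{\pi_\theta}(s, a)$ as an unbiased single-sample estimate.

```python
import numpy as np

# Minimal REINFORCE-style sketch: softmax policy with a linear score per action,
# using the sampled return G_t as an unbiased estimate of Q(s, a).
rng = np.random.default_rng(0)
n_actions, n_features = 2, 4
theta = np.zeros((n_actions, n_features))          # policy parameters

def policy(s):
    """pi_theta(. | s) as a softmax over the linear scores theta @ s."""
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(s, a):
    """grad_theta log pi_theta(a | s) for the softmax-linear policy."""
    grad = -np.outer(policy(s), s)                  # -pi(b | s) * s for every action b
    grad[a] += s                                    # +s for the action actually taken
    return grad

# One gradient-ascent step from a single made-up trajectory of (state, action, reward).
trajectory = [(rng.normal(size=n_features), int(rng.integers(n_actions)), 1.0)
              for _ in range(5)]
gamma, alpha = 0.99, 0.01
grad_total, G = np.zeros_like(theta), 0.0
for s, a, r in reversed(trajectory):                # accumulate returns backwards
    G = r + gamma * G
    grad_total += G * grad_log_pi(s, a)
theta += alpha * grad_total                         # ascend the policy gradient
```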

Actor Critic

Approximate $Q^{\pi_\theta}(s, a)$ with a learned critic $Q_w(s, a)$ and use that approximation in the policy gradient to update the parameters $\theta$ of the actor.

We will not explain how to approximate $Q_w$ in this presentation.

Baseline function

The policy gradient theorem still holds if we subtract a state-dependent baseline function $B(s)$ from $Q^{\pi_\theta}(s, a)$, since the subtracted term has zero expectation: $\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s) \, B(s)] = 0$.

We can take the state value function as a baseline: $B(s) = V^{\pi_\theta}(s)$.

Advantage function

$A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$, and the policy gradient becomes $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \, A^{\pi_\theta}(s, a)\right]$

Estimating the advantage function

Introduce the TD error: $\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$

Its expectation is $\mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta} \mid s, a] = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s, a)$, so the TD error is an unbiased estimator of the advantage function.

Generalized Advantage Estimator

A single TD error is a biased estimate of the advantage when the value function is only approximate, while summing many TD errors trades that bias for variance, so we introduce the following exponentially weighted combination:

$\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}$

As $\lambda$ decreases towards $0$ the variance decreases (and the bias increases); $\lambda = 1$ gives the high-variance, low-bias Monte Carlo style estimate.
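
To show how the estimator is typically computed over a finite collected segment, here is a minimal sketch (plain Python; the function and variable names are illustrative, and episode termination masks are ignored for brevity):

```python
# Minimal GAE sketch over one collected segment of length T.
# rewards[t] and values[t] hold R_{t+1} and V(s_t); values has one extra entry V(s_T).
def gae(rewards, values, gamma=0.99, lam=0.95):
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
        running = delta + gamma * lam * running                  # recursive GAE form
        advantages[t] = running
    return advantages

# Example with made-up rewards and value predictions.
print(gae([1.0, 0.0, 1.0], [0.5, 0.4, 0.6, 0.0]))
```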

Clipped Surrogate Objective
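
The objective referred to here is the standard clipped surrogate from the PPO paper, written in terms of the probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and the advantage estimate $\hat{A}_t$:

$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta) \, \hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon) \, \hat{A}_t\right)\right]$

Clipping the ratio to $[1-\epsilon,\, 1+\epsilon]$ removes the incentive to move the new policy too far from the old one in a single update.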

PPO (finally)

For each iteration, run the $N$ parallel environments for $T$ timesteps ($T \ll$ episode length), then optimize the loss function with respect to $\theta$ for $K$ epochs with minibatch size $M \leq NT$.
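
As an illustration only, a minimal PyTorch sketch of the clipped surrogate loss that such an update loop would minimize over each minibatch (the function name, tensors, and numbers are assumptions, not the implementation of any particular library):

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negated PPO clipped objective for one minibatch of timesteps.

    new_log_probs: log pi_theta(a_t | s_t) under the current policy
    old_log_probs: log pi_theta_old(a_t | s_t), recorded when the data was collected
    advantages:    advantage estimates, e.g. from GAE
    """
    ratio = torch.exp(new_log_probs - old_log_probs)              # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # minimize the negation

# Made-up minibatch, just to show the call.
loss = clipped_surrogate_loss(
    new_log_probs=torch.tensor([-0.9, -1.1, -0.4]),
    old_log_probs=torch.tensor([-1.0, -1.0, -0.5]),
    advantages=torch.tensor([0.7, -0.2, 1.3]),
)
print(loss)
```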