What is Reinforcement Learning?
Quick informal recap of what RL is, for the audience to get the vibe
Recap
Reminder of MDPs
Remind the audience of the basic structure of an MDP
- S: state space
- A: action space
- P: state transition probability function
- R: reward function
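To make these pieces concrete, here is a minimal sketch of a made-up two-state MDP in Python (states, actions, and numbers are purely illustrative):

# Toy two-state MDP, purely illustrative
S = ["healthy", "broken"]   # state space
A = ["wait", "repair"]      # action space
# P[s][a] is a dict {next_state: probability}
P = {
    "healthy": {"wait":   {"healthy": 0.9, "broken": 0.1},
                "repair": {"healthy": 1.0}},
    "broken":  {"wait":   {"broken": 1.0},
                "repair": {"healthy": 0.8, "broken": 0.2}},
}
# R[s][a] is the expected immediate reward for taking action a in state s
R = {
    "healthy": {"wait": 1.0,  "repair": -0.5},
    "broken":  {"wait": -1.0, "repair": -0.5},
}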
Policy
A policy is a mapping from states to actions; it can be deterministic or it can be stochastic
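A sketch of both flavours on the toy MDP above (probabilities again made up):

import random

# Deterministic policy: a plain lookup table state -> action
det_policy = {"healthy": "wait", "broken": "repair"}
def act_deterministic(s):
    return det_policy[s]

# Stochastic policy: state -> distribution over actions, sampled at decision time
stoch_policy = {"healthy": {"wait": 0.8, "repair": 0.2},
                "broken":  {"wait": 0.1, "repair": 0.9}}
def act_stochastic(s):
    actions, probs = zip(*stoch_policy[s].items())
    return random.choices(actions, weights=probs)[0]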
Value functions
- return: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, where $\gamma \in [0, 1)$ is the discount factor
- state value: $V^{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right]$
- action value: $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a\right]$
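A small sketch of the return, assuming we are handed the reward list of one sampled episode; averaging such returns over many episodes from a fixed start state gives a Monte Carlo estimate of that state's value:

def discounted_return(rewards, gamma=0.99):
    # G = r_1 + gamma * r_2 + gamma^2 * r_3 + ..., accumulated from the back
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 = 2.71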
Value functions visual representation
insert graphic of one step lookahead
Algorithms
Value based
learn a value function, and derive a policy from the values (e.g. $\varepsilon$-greedy; see the sketch after this list)
we will not cover these in detail, only their pros and cons
pros:
- easy to understand
- easy to implement
cons:
- really hard for high-dimensional state and/or action spaces
- hard to learn a stochastic policy
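As referenced above, a minimal sketch of $\varepsilon$-greedy action selection, assuming the learned Q-values sit in a dictionary keyed by (state, action); how that table is learned is out of scope here:

import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    # With probability eps explore uniformly, otherwise take the greedy action
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])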
Policy based
Directly learn a parameterized policy, no need for an intermediate value function (a softmax sketch follows the list below)
pros:
- easily adaptable for high dimensional state and/or action spaces
cons:
- can get stuck in local optima
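As mentioned above, one illustrative way to parameterize a policy directly for discrete actions is a softmax over a table of action preferences theta[s, a] (a tabular stand-in for a neural network):

import numpy as np

def softmax_policy(theta, s):
    # theta: array of shape [n_states, n_actions] holding action preferences
    prefs = theta[s] - theta[s].max()   # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()          # pi(. | s)

def sample_action(theta, s, rng):
    return rng.choice(theta.shape[1], p=softmax_policy(theta, s))

# usage: rng = np.random.default_rng(0); a = sample_action(theta, s, rng)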
Value based vs Policy based
insert infographic
Policy gradient
Define an objective function $J(\theta)$ and update the policy parameters $\theta$ in the direction of steepest ascent of that objective
Some examples of objective functions:
- If the problem is episodic, meaning that it always starts at state $s_0$, terminates eventually, and restarts at $s_0$, then taking the state value of the starting state as the objective makes sense: $J_1(\theta) = V^{\pi_\theta}(s_0)$
- For continuing environments, that is, an environment that does not terminate and restart but keeps on rolling, the average value of the states weighted by their stationary distribution works: $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$, where $d^{\pi_\theta}$ is the stationary distribution of states under $\pi_\theta$
- Still for continuing environments, we can also take the average reward after a single step, weighted by the stationary distribution: $J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, R(s, a)$
Policy gradient theorem
For any differentiable policy $\pi_\theta(a \mid s)$ and any of the policy objective functions $J(\theta)$ above, the policy gradient is the following: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$
where $Q^{\pi_\theta}(s, a)$ is the long-term value of a state-action pair.
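A sketch of a Monte Carlo (REINFORCE-style) estimate of this gradient for the tabular softmax policy sketched earlier, using the sampled return $G_t$ in place of $Q^{\pi_\theta}(s, a)$ (one common choice, not the only one):

import numpy as np

def reinforce_gradient(theta, episode, gamma=0.99):
    # episode: list of (state, action, reward) tuples from one rollout of pi_theta
    grad = np.zeros_like(theta)
    G = 0.0
    for s, a, r in reversed(episode):   # walk backwards to accumulate returns
        G = r + gamma * G
        probs = softmax_policy(theta, s)
        grad_log_pi = -probs            # d/d theta[s, :] of log pi(a | s) for a softmax
        grad_log_pi[a] += 1.0
        grad[s] += grad_log_pi * G      # score function times sampled return
    return grad                         # gradient ascent: theta += lr * grad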
Actor Critic
Approximate $Q^{\pi_\theta}(s, a)$ with a critic $Q_w(s, a)$ in some way and use that approximation to update the parameters of the actor, i.e. the policy $\pi_\theta$
We will not explain how to approximate $Q_w$ in this presentation
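One possible tabular sketch, where the critic is a Q-table updated with a one-step TD (SARSA-style) rule and the actor is the softmax policy from before; this only illustrates the structure, not a recommended critic:

import numpy as np

def actor_critic_step(theta, Q, s, a, r, s_next, a_next,
                      alpha_actor=0.01, alpha_critic=0.1, gamma=0.99):
    # Critic: move Q[s, a] toward the one-step TD target
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha_critic * (td_target - Q[s, a])
    # Actor: policy gradient step using the critic's value in place of Q^pi
    probs = softmax_policy(theta, s)
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    theta[s] += alpha_actor * grad_log_pi * Q[s, a]
    return theta, Q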
Baseline function
The policy gradient theorem still holds if we subtract a baseline function $B(s)$ from $Q^{\pi_\theta}(s, a)$, as long as the baseline depends only on the state
we can take the state value function as a baseline: $B(s) = V^{\pi_\theta}(s)$
Advantage function
$A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$, so the gradient becomes $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A^{\pi_\theta}(s, a)\right]$
Estimating the advantage function
introduce the TD error: $\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$
its expectation is $\mathbb{E}_{\pi_\theta}\left[\delta^{\pi_\theta} \mid s, a\right] = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s, a)$
so the TD error is an unbiased estimator of the advantage function
Generalized Advantage Estimator
In practice we only have an approximate value function, so the single-step TD error is a biased estimate of the advantage; thus we introduce the following exponentially weighted sum of TD errors: $\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$, with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$
as $\lambda$ increases, the bias decreases but the variance increases ($\lambda = 0$ recovers the single-step TD error, $\lambda = 1$ a Monte Carlo-style estimate)
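A sketch of computing this estimator over a finite batch of T steps, truncating the infinite sum and bootstrapping with the value of the state after the last step (assuming no episode boundary inside the batch, for simplicity):

import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    # rewards: shape [T]; values: V(s_0)..V(s_{T-1}); last_value: V(s_T) for bootstrapping
    T = len(rewards)
    values = np.append(values, last_value)
    advantages = np.zeros(T)
    acc = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        acc = delta + gamma * lam * acc                          # discounted sum of TD errors
        advantages[t] = acc
    returns = advantages + values[:-1]   # targets for fitting the value function
    return advantages, returns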
Clipped Surrogate Objective
$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\, \hat{A}_t\right)\right]$, where $r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio and $\epsilon$ is the clip range
clipping removes the incentive to push $r_t(\theta)$ far outside $[1 - \epsilon, 1 + \epsilon]$, which keeps the new policy close to the old one
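A sketch of the clipped objective evaluated on a batch, working from log-probabilities and returning the negated value so that a minimizer performs ascent:

import numpy as np

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    ratio = np.exp(logp_new - logp_old)   # r_t(theta) = pi_theta / pi_theta_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO takes the elementwise minimum; negate so that minimizing ascends the objective
    return -np.mean(np.minimum(unclipped, clipped))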
PPO (finally)
for each iteration, run the $N$ parallel environments for $T$ timesteps ($T$ much shorter than the episode length), compute the advantage estimates, then optimize the loss function with respect to $\theta$ for $K$ epochs with minibatch size $M \le NT$
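A structural sketch of that loop; collect_rollout and update_minibatch are hypothetical placeholders for whatever environment and optimizer code is actually used, and the batch is simplified to a single flat trajectory:

import numpy as np

def ppo_train(n_iterations, n_envs, T, K, minibatch_size):
    for _ in range(n_iterations):
        # 1. Run the N parallel environments for T timesteps with the current policy
        batch = collect_rollout(n_envs, T)                       # hypothetical helper
        # 2. Compute advantages and value targets, e.g. with gae() from above
        batch["adv"], batch["ret"] = gae(batch["rewards"],
                                         batch["values"],
                                         batch["last_value"])
        # 3. Optimize the clipped surrogate for K epochs over shuffled minibatches
        idx = np.arange(n_envs * T)
        for _ in range(K):
            np.random.shuffle(idx)
            for start in range(0, len(idx), minibatch_size):
                mb = idx[start:start + minibatch_size]
                update_minibatch(batch, mb)                      # hypothetical gradient step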