What is Reinforcement Learning?

Quick informal recap of what RL is, for the audience to get the vibe

Recap

Reminder of MDPs

Remind the audience of the basic structure of an MDP

  • S: state space
  • A: action space
  • P: state transition probability function
  • R: reward function

Policy

A policy is a mapping from states to actions; it can be deterministic, $a = \pi(s)$, or stochastic, $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$.
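
A toy sketch of both cases (plain Python; the states, actions, and probabilities are made up for illustration):

```python
import random

# Toy illustration: deterministic vs. stochastic policy on a two-state problem.
deterministic_policy = {"s0": "right", "s1": "left"}   # pi(s) = a

stochastic_policy = {                                  # pi(a | s)
    "s0": {"left": 0.2, "right": 0.8},
    "s1": {"left": 0.6, "right": 0.4},
}

def act(state):
    """Sample an action from the stochastic policy."""
    probs = stochastic_policy[state]
    return random.choices(list(probs), weights=probs.values())[0]

print(deterministic_policy["s0"], act("s0"))
```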

Value functions

  • return (with discount factor $\gamma \in [0, 1]$): $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
  • state value: $V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
  • action value: $Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
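
As a concrete illustration of the return and of the state value as an expectation over returns, here is a minimal sketch (plain Python; the helper names and reward numbers are made up):

```python
# Minimal sketch: discounted return and a Monte Carlo state-value estimate.
# `rewards` is the reward sequence of one sampled episode; names are illustrative.

def discounted_return(rewards, gamma=0.99):
    """G_0 = sum_k gamma^k * R_{k+1}, computed backwards over one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def mc_state_value(episodes, gamma=0.99):
    """V(s0) estimated by averaging returns of episodes starting in s0."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)

# Example: three made-up episodes sampled from the start state.
print(mc_state_value([[1, 0, 0, 1], [0, 1, 1], [1, 1, 0, 0, 1]]))
```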

Value functions visual representation

insert graphic of one step lookahead

Algorithms

Value based

Learn a value function and derive a policy from the values (e.g. $\varepsilon$-greedy; see the sketch below)
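
To illustrate deriving a policy from action values, a minimal $\varepsilon$-greedy sketch (plain Python; the Q-table entries are made up):

```python
import random

# Minimal sketch of epsilon-greedy action selection from a Q-table.
# The table values below are made up purely for illustration.
Q = {("s0", "left"): 0.1, ("s0", "right"): 0.7}
ACTIONS = ["left", "right"]

def epsilon_greedy(state, epsilon=0.1):
    """With probability epsilon explore uniformly, otherwise take argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

print(epsilon_greedy("s0"))
```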

we will not talk about these, only the pros and cons

pros:

  • easy to understand
  • easy to implement

cons:

  • hard to scale to high-dimensional state and/or action spaces
  • hard to learn a stochastic policy

Policy based

Directly learn a policy; no intermediate value function is needed

pros:

  • adapts well to high-dimensional state and/or action spaces

cons:

  • can get stuck in local optima

Value based vs Policy based

insert infographic

Policy gradient

Define an objective function $J(\theta)$ and move the policy parameters $\theta$ in the direction of steepest ascent: $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$

Some examples of objective functions:

  • If the problem is episodic, meaning that it always starts at state $s_0$, terminates eventually, and restarts at $s_0$, then taking the state value of the start state makes sense: $J_{\text{start}}(\theta) = V^{\pi_\theta}(s_0)$
  • For continuing environments, that is, environments that never terminate and restart but keep on rolling, the average value of the states weighted by their stationary distribution $d^{\pi_\theta}$ works: $J_{\text{avV}}(\theta) = \sum_s d^{\pi_\theta}(s) \, V^{\pi_\theta}(s)$
  • Still for continuing environments, we can also take the average reward after a single step, weighted by the stationary distribution: $J_{\text{avR}}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s) \, R(s, a)$

Policy gradient theorem

For any differentiable policy $\pi_\theta(a \mid s)$ and any of the policy objective functions $J(\theta)$ above, the policy gradient is the following:

$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a)\right]$

where $Q^{\pi_\theta}(s, a)$ is the long-term value of a state-action pair.
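
To make the theorem concrete, here is a minimal REINFORCE-style sketch (NumPy; every name and number below is made up for illustration): a softmax policy with linear scores, where the sampled return $G_t$ stands in for $Q^{\pi_\theta}(s, a)$ as an unbiased single-sample estimate.

```python
import numpy as np

# Minimal REINFORCE-style sketch: softmax policy with a linear score per action,
# using the sampled return G_t as an unbiased estimate of Q(s, a).
rng = np.random.default_rng(0)
n_actions, n_features = 2, 4
theta = np.zeros((n_actions, n_features))          # policy parameters

def policy(s):
    """pi_theta(. | s) as a softmax over the linear scores theta @ s."""
    logits = theta @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(s, a):
    """grad_theta log pi_theta(a | s) for the softmax-linear policy."""
    grad = -np.outer(policy(s), s)                  # -pi(b | s) * s for every action b
    grad[a] += s                                    # +s for the action actually taken
    return grad

# One gradient-ascent step from a single made-up trajectory of (state, action, reward).
trajectory = [(rng.normal(size=n_features), int(rng.integers(n_actions)), 1.0)
              for _ in range(5)]
gamma, alpha = 0.99, 0.01
grad_total, G = np.zeros_like(theta), 0.0
for s, a, r in reversed(trajectory):                # accumulate returns backwards
    G = r + gamma * G
    grad_total += G * grad_log_pi(s, a)
theta += alpha * grad_total                         # ascend the policy gradient
```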

Actor Critic

Approximate $Q^{\pi_\theta}(s, a)$ with a learned critic $Q_w(s, a)$ and use that approximation in the policy gradient to update the parameters $\theta$ of the actor.

We will not explain how to approximate $Q_w$ in this presentation.

Baseline function

The policy gradient theorem still holds if we subtract a state-dependent baseline function $B(s)$ from $Q^{\pi_\theta}(s, a)$, since the subtracted term has zero expectation: $\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s) \, B(s)] = 0$.

We can take the state value function as a baseline: $B(s) = V^{\pi_\theta}(s)$.

Advantage function

$A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$, and the policy gradient becomes $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s) \, A^{\pi_\theta}(s, a)\right]$

Estimating the advantage function

Introduce the TD error: $\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$

Its expectation is $\mathbb{E}_{\pi_\theta}[\delta^{\pi_\theta} \mid s, a] = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s) = A^{\pi_\theta}(s, a)$, so the TD error is an unbiased estimator of the advantage function.

Generalized Advantage Estimator

A single TD error is a biased estimate of the advantage when the value function is only approximate, while summing many TD errors trades that bias for variance, so we introduce the following exponentially weighted combination:

$\hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l}$

As $\lambda$ decreases towards $0$ the variance decreases (and the bias increases); $\lambda = 1$ gives the high-variance, low-bias Monte Carlo style estimate.
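
To show how the estimator is typically computed over a finite collected segment, here is a minimal sketch (plain Python; the function and variable names are illustrative, and episode termination masks are ignored for brevity):

```python
# Minimal GAE sketch over one collected segment of length T.
# rewards[t] and values[t] hold R_{t+1} and V(s_t); values has one extra entry V(s_T).
def gae(rewards, values, gamma=0.99, lam=0.95):
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
        running = delta + gamma * lam * running                  # recursive GAE form
        advantages[t] = running
    return advantages

# Example with made-up rewards and value predictions.
print(gae([1.0, 0.0, 1.0], [0.5, 0.4, 0.6, 0.0]))
```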

Clipped Surrogate Objective
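
The objective referred to here is the standard clipped surrogate from the PPO paper, written in terms of the probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ and the advantage estimate $\hat{A}_t$:

$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta) \, \hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon) \, \hat{A}_t\right)\right]$

Clipping the ratio to $[1-\epsilon,\, 1+\epsilon]$ removes the incentive to move the new policy too far from the old one in a single update.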

PPO (finally)

For each iteration, run the $N$ parallel environments for $T$ timesteps ($T \ll$ episode length), then optimize the loss function with respect to $\theta$ for $K$ epochs with minibatch size $M \leq NT$.
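
As an illustration only, a minimal PyTorch sketch of the clipped surrogate loss that such an update loop would minimize over each minibatch (the function name, tensors, and numbers are assumptions, not the implementation of any particular library):

```python
import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negated PPO clipped objective for one minibatch of timesteps.

    new_log_probs: log pi_theta(a_t | s_t) under the current policy
    old_log_probs: log pi_theta_old(a_t | s_t), recorded when the data was collected
    advantages:    advantage estimates, e.g. from GAE
    """
    ratio = torch.exp(new_log_probs - old_log_probs)              # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                  # minimize the negation

# Made-up minibatch, just to show the call.
loss = clipped_surrogate_loss(
    new_log_probs=torch.tensor([-0.9, -1.1, -0.4]),
    old_log_probs=torch.tensor([-1.0, -1.0, -0.5]),
    advantages=torch.tensor([0.7, -0.2, 1.3]),
)
print(loss)
```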