Proximal Policy Optimization

Define the clipped surrogate objective function as follows:

$$L^{CLIP}(θ) = \hat{E}_{t} \left[ \min \left( r_{t}(θ) \hat{A}_{t}, \; \operatorname{clip}(r_{t}(θ), 1 - ε, 1 + ε) \hat{A}_{t} \right) \right]$$

where the ratio function $r_{t}$ is defined as follows:

$$r_{t}(θ) = \frac{π_{θ}(a_{t} \mid s_{t})}{π_{θ_{old}}(a_{t} \mid s_{t})}$$
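As a small illustration of the ratio, the sketch below computes $r_{t}$ from log-probabilities, which is how it is commonly evaluated in practice for numerical stability. The function name `probability_ratio` and the example probabilities are assumptions made for this note, not part of the original definition.

```python
import numpy as np

def probability_ratio(logp_new, logp_old):
    """Return r_t = pi_new(a|s) / pi_old(a|s), computed from log-probabilities.

    Hypothetical helper: exp(log p_new - log p_old) equals p_new / p_old.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    return np.exp(logp_new - logp_old)

# Made-up probabilities of the taken actions under the new and old policy.
ratio = probability_ratio(np.log([0.3, 0.1]), np.log([0.2, 0.2]))
print(ratio)
```

A ratio above $1$ means the action has become more likely under the current policy, and a ratio below $1$ means it has become less likely.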

Explanation of each part:

The ratio function $r_{t}$ tells us how much more likely we are to take action $a_{t}$ at state $s_{t}$ with our current policy than with our old policy.
This is the part we want to clip. If $r_{t}$ is much larger or much smaller than $1$, the policy has changed a lot since the data was collected. We want to avoid such large updates, so the ratio is clipped to the range $[1 - ε, 1 + ε]$.
The advantage estimate $\hat{A}_{t}$ tells us how much better the action taken at timestep $t$ was than what the policy achieves on average from state $s_{t}$.
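The clipping behavior described above can be checked numerically. The following is a minimal sketch, assuming NumPy arrays of ratios and advantage estimates. The name `clipped_surrogate`, the value `eps=0.2`, and the sample numbers are illustrative assumptions.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # The element-wise minimum makes the objective a pessimistic bound.
    return np.minimum(unclipped, clipped)

# Made-up ratios and advantages covering the main cases.
ratios = np.array([0.5, 1.0, 1.5, 0.5, 1.5])
advantages = np.array([1.0, 1.0, 1.0, -1.0, -1.0])

per_sample = clipped_surrogate(ratios, advantages)
objective = per_sample.mean()  # empirical average over timesteps
print(per_sample, objective)
```

In this sketch the clip only takes effect when it lowers the objective: with a positive advantage the term is capped once the ratio exceeds $1 + ε$, and with a negative advantage it is capped once the ratio falls below $1 - ε$. In the other cases the unclipped term is the smaller one and is used as-is.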

The following table demonstrates how and when the objective function is clipped.

The final PPO objective function looks like the following:

$$L_{t}^{CLIP + VF + S}(θ) = \hat{E}_{t} \left[ L_{t}^{CLIP}(θ) - c_{1} L_{t}^{VF}(θ) + c_{2} S[π_{θ}](s_{t}) \right]$$

where $c_{1}$ and $c_{2}$ are hyperparameters that weight the value-loss and entropy terms.
$L_{t}^{VF}$ is the squared-error value loss, $(V_{θ}(s_{t}) - V_{t}^{targ})^{2}$.
$S[π_{θ}]$ is an entropy bonus that encourages sufficient exploration.
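To show how the three terms combine, here is a minimal sketch of the full objective for a discrete action space. The function name `ppo_objective`, the argument names, and the default values `eps=0.2`, `c1=0.5`, and `c2=0.01` are assumptions chosen for illustration, not values prescribed by these notes.

```python
import numpy as np

def ppo_objective(ratio, advantage, value_pred, value_target, action_probs,
                  eps=0.2, c1=0.5, c2=0.01):
    """Empirical mean of L_CLIP - c1 * L_VF + c2 * S (to be maximized)."""
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    value_pred = np.asarray(value_pred, dtype=float)
    value_target = np.asarray(value_target, dtype=float)
    action_probs = np.asarray(action_probs, dtype=float)

    # Clipped surrogate term.
    l_clip = np.minimum(ratio * advantage,
                        np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)
    # Squared-error value loss.
    l_vf = (value_pred - value_target) ** 2
    # Entropy of the action distribution in each state (small constant avoids log(0)).
    entropy = -np.sum(action_probs * np.log(action_probs + 1e-8), axis=-1)
    return np.mean(l_clip - c1 * l_vf + c2 * entropy)

# Two made-up timesteps with two possible actions each.
objective = ppo_objective(
    ratio=[1.0, 1.0],
    advantage=[1.0, -1.0],
    value_pred=[1.0, 2.0],
    value_target=[1.0, 1.0],
    action_probs=[[0.5, 0.5], [0.5, 0.5]],
)
print(objective)
```

Since the objective is maximized, an implementation built on a gradient-descent optimizer would typically minimize its negative.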
