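Define the clipped surrogate objective function as follows:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where the ratio function is defined as follows:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$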
Explanation of each part:
- The ratio function $r_t(\theta)$ tells us how much more likely we are to take action $a_t$ at state $s_t$ with our current policy than with our old policy. This is the part we want to clip. If $r_t(\theta)$ is much bigger than 1 or much smaller than 1, our policy has changed a lot. We don't want that to happen, so we clip the ratio to fall into the range $[1-\epsilon,\ 1+\epsilon]$.
- The advantage function $\hat{A}_t$ tells us how much better the action we took at timestep $t$ was compared to the average action in that state.
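The two pieces described above can be combined in a few lines. The following is a minimal sketch in plain Python, not a reference implementation: the function name `clipped_surrogate`, the use of log-probabilities as inputs, and the default `epsilon=0.2` are assumptions made for illustration.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective over a batch (a quantity to maximize).

    logp_new / logp_old: log-probabilities of the taken actions under the
    current and old policy. epsilon=0.2 is an illustrative default.
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio r_t = pi_new(a_t|s_t) / pi_old(a_t|s_t),
        # computed from log-probabilities for numerical stability.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - epsilon, 1 + epsilon].
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)
```

Working with log-probabilities is a common design choice because policy networks usually output them directly, and subtracting logs avoids dividing two very small probabilities.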
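The following table demonstrates how and when the objective function is clipped. Here $r_t(\theta)$ is the probability ratio, $\hat{A}_t$ the advantage, and $\epsilon$ the clip range.

| Ratio $r_t(\theta)$ | Advantage $\hat{A}_t$ | Value of the min term | Clipped? | Gradient |
|---|---|---|---|---|
| $1-\epsilon \le r_t(\theta) \le 1+\epsilon$ | positive | $r_t(\theta)\hat{A}_t$ | no | nonzero |
| $1-\epsilon \le r_t(\theta) \le 1+\epsilon$ | negative | $r_t(\theta)\hat{A}_t$ | no | nonzero |
| $r_t(\theta) < 1-\epsilon$ | positive | $r_t(\theta)\hat{A}_t$ | no | nonzero |
| $r_t(\theta) < 1-\epsilon$ | negative | $(1-\epsilon)\hat{A}_t$ | yes | zero |
| $r_t(\theta) > 1+\epsilon$ | positive | $(1+\epsilon)\hat{A}_t$ | yes | zero |
| $r_t(\theta) > 1+\epsilon$ | negative | $r_t(\theta)\hat{A}_t$ | no | nonzero |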
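The clipping cases can also be checked numerically. The sketch below is illustrative only: `clip_case` is a hypothetical helper, and the ratios, advantages, and `epsilon=0.2` are arbitrary example values.

```python
def clip_case(ratio, advantage, epsilon=0.2):
    """Return (objective value, True if the clipped branch was selected)."""
    unclipped = ratio * advantage
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon)) * advantage
    # The clipped branch is selected only when it is strictly smaller.
    is_clipped = clipped < unclipped
    return min(unclipped, clipped), is_clipped

# Enumerate a ratio below, inside, and above the clip range
# for a positive and a negative advantage.
for ratio in (0.5, 1.0, 1.5):
    for advantage in (1.0, -1.0):
        value, is_clipped = clip_case(ratio, advantage)
        print(f"r={ratio:.1f} A={advantage:+.1f} -> L={value:+.2f} clipped={is_clipped}")
```

Running it shows that the clip only takes effect when the ratio has already moved in the direction the advantage rewards: above the range for a positive advantage, or below it for a negative one. In the other out-of-range cases the unclipped term is kept, so the update can still pull the policy back.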
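The final PPO objective function looks like the following:

$$L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[L_t^{CLIP}(\theta) - c_1\, L_t^{VF}(\theta) + c_2\, S[\pi_\theta](s_t)\right]$$

where $c_1$ and $c_2$ are hyperparameters that weight each part.
$L_t^{VF}(\theta) = \left(V_\theta(s_t) - V_t^{\text{targ}}\right)^2$ is the squared-error value loss.
$S[\pi_\theta](s_t)$ is the entropy bonus, which ensures sufficient exploration.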
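To make the three terms concrete, here is a minimal sketch of the combined objective in plain Python. It assumes a discrete action space (so the entropy is computed from a list of action probabilities per state), and the name `ppo_objective` as well as the defaults `c1=0.5`, `c2=0.01`, and `epsilon=0.2` are illustrative assumptions rather than prescribed values.

```python
import math

def ppo_objective(logp_new, logp_old, advantages, values, value_targets,
                  action_probs, c1=0.5, c2=0.01, epsilon=0.2):
    """Batch mean of L_CLIP - c1 * L_VF + c2 * S (a quantity to maximize).

    action_probs: for each timestep, the current policy's probabilities over
    all discrete actions (assumption: discrete action space).
    c1, c2, epsilon: illustrative defaults, not values from the text.
    """
    n = len(advantages)
    l_clip = l_vf = entropy = 0.0
    batch = zip(logp_new, logp_old, advantages, values, value_targets, action_probs)
    for lp_new, lp_old, adv, v, v_targ, probs in batch:
        # Clipped surrogate term.
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
        l_clip += min(ratio * adv, clipped_ratio * adv)
        # Squared-error value loss.
        l_vf += (v - v_targ) ** 2
        # Entropy of the action distribution (skipping zero-probability actions).
        entropy += -sum(p * math.log(p) for p in probs if p > 0.0)
    return (l_clip - c1 * l_vf + c2 * entropy) / n
```

Because this quantity is maximized, implementations built on gradient-descent optimizers typically minimize its negative instead.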