In policy-based algorithms:

  • No value function is learned, either during or after training
  • We learn the policy directly, as opposed to value-based algorithms, where the policy is only implicit in a learnt value function

Advantages of policy-based algorithms:

  • Better suited to action spaces that are huge or continuous
    In Q-learning, every step takes a max over all the actions, which is expensive (or outright intractable) when the action space is huge or continuous (see the sketch after this list)
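
To see the difference concretely, here is a minimal sketch (PyTorch; `q_value`, the grid bounds, and the Gaussian parameters are made-up placeholders) contrasting the max over actions that value-based control needs with simply sampling from a parameterized policy:

```python
import torch

# Hypothetical stand-in for a learned Q-network; only here to illustrate the cost of the max.
def q_value(state: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    return -(actions - state.sum()) ** 2

state = torch.tensor([0.3, -0.1])

# Value-based control: the greedy action needs a max over the whole action space.
# For a continuous space we can only approximate it, e.g. with a grid of candidates,
# and the grid size blows up with the action dimension.
candidate_actions = torch.linspace(-2.0, 2.0, steps=10_000)
greedy_action = candidate_actions[q_value(state, candidate_actions).argmax()]

# Policy-based control: sample an action from a parameterized distribution -- no max needed.
mean, log_std = torch.tensor(0.2), torch.tensor(-0.5)  # would come from a policy network
action = torch.distributions.Normal(mean, log_std.exp()).sample()
```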

Disadvantage:

  • Can get stuck in local optima

Policy-based methods frame learning as an optimization problem: we search for the policy that maximizes a given objective function.

In policy gradient methods, the policy is parameterized as $\pi_\theta$ (with parameter vector $\theta$), and we apply gradient ascent on the objective function to find a local maximum.
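
To make "the policy is parameterized" concrete, here is a minimal sketch of such a policy in PyTorch (the architecture and dimensions are arbitrary placeholders): the parameter vector $\theta$ is simply the network's weights.

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """pi_theta(a | s): a small network whose weights are the policy parameters theta."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        # Logits -> categorical distribution over the discrete actions.
        return torch.distributions.Categorical(logits=self.net(state))

policy = SoftmaxPolicy(obs_dim=4, n_actions=2)
dist = policy(torch.randn(4))
action = dist.sample()              # act by sampling
log_prob = dist.log_prob(action)    # differentiable w.r.t. theta -> usable for gradient ascent
```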

Some examples of objective functions:

  • If the problem is episodic, meaning that it always starts at state $s_0$, terminates eventually, and restarts at $s_0$, then taking the value of the start state as the objective makes sense (a Monte Carlo estimate of this objective is sketched after the list):

    $$J_1(\theta) = V^{\pi_\theta}(s_0)$$

  • For continuing environments, that is, environments that do not terminate and restart but keep on rolling, the average value of the states weighted by their stationary distribution $d^{\pi_\theta}(s)$ works:

    $$J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$$

  • Still for continuing environments, we can also take the average reward after a single step, weighted by the stationary distribution:

    $$J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, \mathcal{R}(s, a)$$
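
As a sanity check on what the episodic objective $J_1(\theta)$ means, it can be estimated by Monte Carlo: average the discounted return of episodes started from $s_0$. A rough sketch, assuming a Gymnasium-style environment (reset()/step() with the 5-tuple return) and a policy object like the SoftmaxPolicy sketched above:

```python
import torch

def estimate_start_state_objective(env, policy, episodes: int = 100, gamma: float = 0.99):
    """Monte Carlo estimate of J_1(theta) = V^{pi_theta}(s_0) for an episodic task."""
    returns = []
    for _ in range(episodes):
        state, _ = env.reset()                       # every episode starts at s_0
        done, discount, total = False, 1.0, 0.0
        while not done:
            dist = policy(torch.as_tensor(state, dtype=torch.float32))
            action = dist.sample()
            state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            total += discount * reward
            discount *= gamma
        returns.append(total)
    return sum(returns) / len(returns)               # approximates V^{pi_theta}(s_0)
```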

If the policy is differentiable, we can compute its gradient either analytically or with autograd software (PyTorch, TensorFlow, tinygrad), and then easily apply gradient ascent.
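
For instance, here is a tiny PyTorch sketch (the "policy" is a deliberately trivial two-action softmax; all the numbers are illustrative) that lets autograd compute $\nabla_\theta \log \pi_\theta(a \mid s)$ and then takes one hand-written ascent step:

```python
import torch

theta = torch.zeros(2, requires_grad=True)        # parameters of a toy 2-action softmax policy
state_feature = torch.tensor([1.0, -1.0])

logits = theta * state_feature                    # a deliberately tiny "policy network"
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()

log_prob = dist.log_prob(action)                  # log pi_theta(a | s)
log_prob.backward()                               # autograd fills theta.grad with the score

learning_rate = 0.01
with torch.no_grad():
    # Toy gradient *ascent* step. A real policy gradient update would also weight the
    # score by a return or advantage; see the policy gradient theorem below.
    theta += learning_rate * theta.grad
    theta.grad.zero_()
```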

Note that in the following equations we will see the term $\nabla_\theta \log \pi_\theta(a \mid s)$; before we get confused, let us explain where this part comes from:

$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\,\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$$

As you can see, this is just an algebraic manipulation (often called the likelihood ratio or log-derivative trick).
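
If you want to convince yourself numerically, the identity is easy to check with autograd (a throwaway sketch with an arbitrary three-action softmax policy):

```python
import torch

theta = torch.randn(3, requires_grad=True)
probs = torch.softmax(theta, dim=0)               # pi_theta over 3 actions
a = 1                                             # pick an arbitrary action

grad_pi, = torch.autograd.grad(probs[a], theta, retain_graph=True)    # grad of pi
grad_log_pi, = torch.autograd.grad(torch.log(probs[a]), theta)        # grad of log pi

# grad pi = pi * grad log pi  (elementwise check)
assert torch.allclose(grad_pi, probs[a].detach() * grad_log_pi, atol=1e-6)
```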

Policy gradient theorem
For any differentiable policy $\pi_\theta$ and for any of the policy objective functions $J(\theta)$ above, the policy gradient is the following:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$
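
The expectation on the right-hand side can be estimated from sampled trajectories, which is what REINFORCE does: it substitutes the Monte Carlo return $G_t$ for $Q^{\pi_\theta}(s_t, a_t)$. A minimal sketch, again assuming a Gymnasium-style environment and a policy module like the one sketched earlier:

```python
import torch

def reinforce_update(env, policy, optimizer, gamma: float = 0.99):
    """One REINFORCE update: ascend an empirical estimate of grad J(theta)."""
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        dist = policy(torch.as_tensor(state, dtype=torch.float32))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted return G_t for every step, used in place of Q^{pi_theta}(s_t, a_t).
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    # Maximize sum_t log pi_theta(a_t | s_t) * G_t  ==  minimize its negative.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```

A typical way to drive this sketch would be `optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)` and then calling `reinforce_update` once per episode.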