In policy-based algorithms:

  • No value function is learned, either during or after training
  • We learn the policy directly, as opposed to value-based algorithms, where the policy is only implicit in a learnt value function

Advantages of policy-based algorithms:

  • Better suited to action spaces that are huge or continuous
    In Q-learning, every step takes a max over all the actions, which is expensive (or outright intractable) when the action space is huge or continuous (see the sketch after this list)
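
To see the difference concretely, here is a minimal sketch (PyTorch; `q_value`, the grid bounds, and the Gaussian parameters are made-up placeholders) contrasting the max over actions that value-based control needs with simply sampling from a parameterized policy:

```python
import torch

# Hypothetical stand-in for a learned Q-network; only here to illustrate the cost of the max.
def q_value(state: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    return -(actions - state.sum()) ** 2

state = torch.tensor([0.3, -0.1])

# Value-based control: the greedy action needs a max over the whole action space.
# For a continuous space we can only approximate it, e.g. with a grid of candidates,
# and the grid size blows up with the action dimension.
candidate_actions = torch.linspace(-2.0, 2.0, steps=10_000)
greedy_action = candidate_actions[q_value(state, candidate_actions).argmax()]

# Policy-based control: sample an action from a parameterized distribution -- no max needed.
mean, log_std = torch.tensor(0.2), torch.tensor(-0.5)  # would come from a policy network
action = torch.distributions.Normal(mean, log_std.exp()).sample()
```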

Disadvantage:

  • Can get stuck in local optima

Policy-based methods frame learning as an optimization problem: we search for the policy that maximizes a given objective function.

In policy gradient methods, the policy is parameterized as $\pi_\theta$ (with parameter vector $\theta$), and we apply gradient ascent on the objective function to find a local maximum.
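
To make "the policy is parameterized" concrete, here is a minimal sketch of such a policy in PyTorch (the architecture and dimensions are arbitrary placeholders): the parameter vector $\theta$ is simply the network's weights.

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """pi_theta(a | s): a small network whose weights are the policy parameters theta."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(), nn.Linear(hidden, n_actions)
        )

    def forward(self, state: torch.Tensor) -> torch.distributions.Categorical:
        # Logits -> categorical distribution over the discrete actions.
        return torch.distributions.Categorical(logits=self.net(state))

policy = SoftmaxPolicy(obs_dim=4, n_actions=2)
dist = policy(torch.randn(4))
action = dist.sample()              # act by sampling
log_prob = dist.log_prob(action)    # differentiable w.r.t. theta -> usable for gradient ascent
```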

Some examples of objective functions:

  • If the problem is episodic, meaning that it always starts at state $s_0$, terminates eventually, and restarts at $s_0$, then taking the value of the start state as the objective makes sense (a Monte Carlo estimate of this objective is sketched after the list):

    $$J_1(\theta) = V^{\pi_\theta}(s_0)$$

  • For continuing environments, that is, environments that do not terminate and restart but keep on rolling, the average value of the states weighted by their stationary distribution $d^{\pi_\theta}(s)$ works:

    $$J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s)\, V^{\pi_\theta}(s)$$

  • Still for continuing environments, we can also take the average reward after a single step, weighted by the stationary distribution:

    $$J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s)\, \mathcal{R}(s, a)$$
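
As a sanity check on what the episodic objective $J_1(\theta)$ means, it can be estimated by Monte Carlo: average the discounted return of episodes started from $s_0$. A rough sketch, assuming a Gymnasium-style environment (reset()/step() with the 5-tuple return) and a policy object like the SoftmaxPolicy sketched above:

```python
import torch

def estimate_start_state_objective(env, policy, episodes: int = 100, gamma: float = 0.99):
    """Monte Carlo estimate of J_1(theta) = V^{pi_theta}(s_0) for an episodic task."""
    returns = []
    for _ in range(episodes):
        state, _ = env.reset()                       # every episode starts at s_0
        done, discount, total = False, 1.0, 0.0
        while not done:
            dist = policy(torch.as_tensor(state, dtype=torch.float32))
            action = dist.sample()
            state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            total += discount * reward
            discount *= gamma
        returns.append(total)
    return sum(returns) / len(returns)               # approximates V^{pi_theta}(s_0)
```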

If the policy is differentiable, we can compute its gradient either analytically or with autograd software (PyTorch, TensorFlow, tinygrad), and then easily apply gradient ascent.
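
For instance, here is a tiny PyTorch sketch (the "policy" is a deliberately trivial two-action softmax; all the numbers are illustrative) that lets autograd compute $\nabla_\theta \log \pi_\theta(a \mid s)$ and then takes one hand-written ascent step:

```python
import torch

theta = torch.zeros(2, requires_grad=True)        # parameters of a toy 2-action softmax policy
state_feature = torch.tensor([1.0, -1.0])

logits = theta * state_feature                    # a deliberately tiny "policy network"
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()

log_prob = dist.log_prob(action)                  # log pi_theta(a | s)
log_prob.backward()                               # autograd fills theta.grad with the score

learning_rate = 0.01
with torch.no_grad():
    # Toy gradient *ascent* step. A real policy gradient update would also weight the
    # score by a return or advantage; see the policy gradient theorem below.
    theta += learning_rate * theta.grad
    theta.grad.zero_()
```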

Note that in the following equations we will see the term $\nabla_\theta \log \pi_\theta(a \mid s)$; before we get confused, let us explain where this part comes from:

$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s)\,\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)$$

As you can see, this is just an algebraic manipulation (often called the likelihood ratio or log-derivative trick).
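
If you want to convince yourself numerically, the identity is easy to check with autograd (a throwaway sketch with an arbitrary three-action softmax policy):

```python
import torch

theta = torch.randn(3, requires_grad=True)
probs = torch.softmax(theta, dim=0)               # pi_theta over 3 actions
a = 1                                             # pick an arbitrary action

grad_pi, = torch.autograd.grad(probs[a], theta, retain_graph=True)    # grad of pi
grad_log_pi, = torch.autograd.grad(torch.log(probs[a]), theta)        # grad of log pi

# grad pi = pi * grad log pi  (elementwise check)
assert torch.allclose(grad_pi, probs[a].detach() * grad_log_pi, atol=1e-6)
```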

Policy gradient theorem
For any differentiable policy $\pi_\theta$ and for any of the policy objective functions $J(\theta)$ above, the policy gradient is the following:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\right]$$
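
The expectation on the right-hand side can be estimated from sampled trajectories, which is what REINFORCE does: it substitutes the Monte Carlo return $G_t$ for $Q^{\pi_\theta}(s_t, a_t)$. A minimal sketch, again assuming a Gymnasium-style environment and a policy module like the one sketched earlier:

```python
import torch

def reinforce_update(env, policy, optimizer, gamma: float = 0.99):
    """One REINFORCE update: ascend an empirical estimate of grad J(theta)."""
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        dist = policy(torch.as_tensor(state, dtype=torch.float32))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted return G_t for every step, used in place of Q^{pi_theta}(s_t, a_t).
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)))

    # Maximize sum_t log pi_theta(a_t | s_t) * G_t  ==  minimize its negative.
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```

A typical way to drive this sketch would be `optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)` and then calling `reinforce_update` once per episode.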