A2C is a policy-based reinforcement learning algorithm in which there is an Actor network and a Critic network.
The Actor network takes as input the state of the environment and outputs an action.
The Critic network takes as input a state-action pair and outputs a value describing how ‘good’ it is to take that action in the given state.
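As a concrete illustration, here is a minimal sketch of what the two networks could look like, assuming a PyTorch implementation and a discrete action space; the class names, layer sizes, and action encoding are illustrative assumptions, not a definitive architecture.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a probability distribution over discrete actions;
    sampling from that distribution gives the action."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # Softmax turns the raw scores into action probabilities pi(a | s).
        return torch.softmax(self.net(state), dim=-1)

class Critic(nn.Module):
    """Maps a state-action pair to a scalar estimate of how good that action is in that state."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        # The action is assumed to be encoded as a vector (e.g. one-hot for discrete actions).
        return self.net(torch.cat([state, action], dim=-1))
```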
Let $\pi$ be the Actor network and $Q$ be the Critic network.
Define the output of the Critic network to be the following, where $w$ denotes the weights of the Critic network:

$$Q_w(s, a)$$

Let $\theta$ represent the parameters of the Actor network.
The update of the parameters of the Actor network is given by the following equation:

$$\Delta \theta = \alpha \, \nabla_\theta \big( \log \pi_\theta(a \mid s) \big) \, Q_w(s, a)$$

where $\pi_\theta$ denotes the current policy with parameters $\theta$ and $\alpha$ is the Actor's learning rate.
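In code, this gradient step could be taken roughly as follows, assuming the network sketches above, autograd, and an optimizer whose step size plays the role of $\alpha$; the function and variable names are illustrative.

```python
import torch

def actor_update(actor, actor_optimizer, state, action, q_value):
    """One policy-gradient step: move theta along grad_theta log pi_theta(a|s) * Q_w(s,a)."""
    probs = actor(state)                     # pi_theta(. | s)
    log_prob = torch.log(probs[action])      # log pi_theta(a | s); action is an integer index
    # Minimising the negated objective is the same as ascending the gradient above.
    loss = -log_prob * q_value.detach()      # Q_w(s, a) is treated as a constant for this step
    actor_optimizer.zero_grad()
    loss.backward()
    actor_optimizer.step()                   # the learning rate alpha lives inside the optimizer
```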
The update of the parameters of the Critic network is given by the following equation:

$$\Delta w = \beta \, \delta \, \nabla_w Q_w(s, a), \qquad \delta = r + \gamma \, Q_w(s', a') - Q_w(s, a)$$

where $\delta$ denotes the Temporal Difference (TD) error and represents how much the Critic network is off in its estimate of the value of the state-action pair, $\beta$ is the Critic's learning rate, $\gamma$ is the discount factor, and $(s', a')$ are the next state and action.
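A corresponding sketch of the Critic step, under the same assumptions as the snippets above; here the squared TD error is minimised, whose gradient is proportional to $\delta \, \nabla_w Q_w(s, a)$.

```python
import torch

def critic_update(critic, critic_optimizer, state, action, reward,
                  next_state, next_action, gamma=0.99):
    """One TD step on a single transition (s, a, r, s', a')."""
    q_sa = critic(state, action)
    with torch.no_grad():                              # the bootstrap target is not differentiated
        target = reward + gamma * critic(next_state, next_action)
    delta = target - q_sa                              # TD error: how far off the Critic's estimate is
    loss = delta.pow(2).mean()                         # minimising this moves w along delta * grad_w Q_w(s, a)
    critic_optimizer.zero_grad()
    loss.backward()
    critic_optimizer.step()
```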
Define the Advantage function to be the following:

$$A(s, a) = Q(s, a) - V(s)$$

where $Q(s, a)$ represents the Q-value of action $a$ in state $s$ and $V(s)$ represents the average value of the state, i.e. the expected return obtained from $s$ under the current policy.
In other words, the Advantage function tells us how good action $a$ is in state $s$ compared to the average at state $s$. If $A(s, a) > 0$ then action $a$ is good in state $s$, otherwise it is bad.
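For example, if $Q(s, a) = 5$ and $V(s) = 3$, then $A(s, a) = 2 > 0$ and $a$ is a better-than-average choice in $s$; if instead $Q(s, a) = 1$, then $A(s, a) = -2$ and $a$ is worse than average.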
The problem is that we don't know the Q-value function, otherwise we wouldn't be trying to estimate it with the Critic network. To remedy this problem we approximate the Q-value function in the following way:

$$Q(s, a) \approx r + \gamma \, V(s')$$

where $r$ is the reward received after taking action $a$ in state $s$ and $s'$ is the resulting next state.
Substituting this approximation back into the Advantage function we get the following:

$$A(s, a) \approx r + \gamma \, V(s') - V(s)$$
The right-hand side of the above equation looks just like the previously mentioned TD error, except that before we did not use the state value $V$ but the Critic's action-value estimate $Q_w$. We will call this TD error too.
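Putting the pieces together, here is a rough sketch of a single A2C update built around this TD-error-as-advantage idea. It assumes a PyTorch setup, the Actor sketch from earlier, and a Critic that now outputs the state value $V(s)$ directly; all class, function, and variable names are illustrative rather than a canonical implementation.

```python
import torch
import torch.nn as nn

class ValueCritic(nn.Module):
    """Critic that outputs the state value V(s) instead of Q(s, a)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.net(state)

def a2c_step(actor, critic, actor_opt, critic_opt,
             state, action, reward, next_state, done, gamma=0.99):
    """One A2C update from a single transition (s, a, r, s')."""
    v_s = critic(state)
    with torch.no_grad():
        v_next = torch.zeros_like(v_s) if done else critic(next_state)

    # TD error doubles as the advantage estimate: r + gamma * V(s') - V(s).
    delta = reward + gamma * v_next - v_s.detach()

    # Actor step: ascend grad_theta log pi_theta(a|s) * advantage.
    log_prob = torch.log(actor(state)[action])
    actor_loss = -log_prob * delta
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Critic step: move V(s) toward the bootstrapped target r + gamma * V(s').
    critic_loss = (reward + gamma * v_next - v_s).pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
```

In a full training loop, a step like this would be run on every transition (or on short rollouts), with the action sampled from the Actor's output distribution.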