The point of actor critic is to decouple the gradient update from the Q-function update.

Unlike Q-Learning (value based), which directly attempts to learn the optimal Q-function, actor-critic methods aim to learn the Q-function corresponding to the current parametrized policy , which must obey the equation:

Methods

Definition

Actor-Critic is a hybrid architecture combining value-based and policy-based methods that helps to stabilize the training by reducing the variance using two elements:

  1. actor controls how our agent behaves (policy-based)
  2. critic measures how good the taken action is (value-based)

Needless to say, the two learn in parallel. The actor learns the policy and the critic assists the policy update by measuring the performance of the action that we take .

flow 1
flow 2

The workflow is straightforwards:

Step 1: At each timestep t, we get the current state from the environment and pass it as input through our Actor and Critic the policy takes the state and outputs an action .

Step 2: The Critic takes as input and, together with , it computes the value of taking that action at that state (i.e. the Q-value).

Step 3: The performed action outputs a new state and a reward .

Step 4: The Actor updates it policy parameters using the Q-value.

Step 5: The Critic then updates again its value parameters .

Note that the policy and value have different learning rates (i.e. and )


Improvements — Advantage function

Advantage function

calculates the relative advantage of an taking an action compared to the average value of the state.

  • is the average value of that state.

Therefore is the TD Error.

We use instead of since acts as a baseline, centering the updates around 0, which directly reduces variance without introducing bias.