Actor-Critic Methods

The point of actor-critic methods is to decouple the policy-gradient update (the actor) from the Q-function update (the critic).

Policy-based methods directly optimize the policy but have high variance.
Value-based methods estimate values but don’t give a parametrized policy.

Unlike Q-Learning (value-based), which directly attempts to learn the optimal Q-function, actor-critic methods aim to learn the Q-function corresponding to the current parametrized policy $π_{θ} (a ∣ s)$, which must obey the equation:

$Q^{π} (s, a) = \mathbb{E}_{s^{'}, r \sim p (\cdot ∣ s, a)} \left[ r + γ \, \mathbb{E}_{a^{'} \sim π (\cdot ∣ s^{'})} \left[ Q^{π} (s^{'}, a^{'}) \right] \right]$
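To make the equation concrete, here is a minimal sketch that repeatedly applies it as a backup on a tiny tabular MDP until $Q^{π}$ stops changing. The MDP itself (the arrays `P`, `R`, `pi`) and the function name `evaluate_q` are made up for illustration and are not part of these notes.

```python
import numpy as np

def evaluate_q(P, R, pi, gamma=0.9, iters=500):
    """Iterate the Bellman expectation backup for Q^pi on a tabular MDP.

    P[s, a, s2] : transition probabilities p(s2 | s, a)
    R[s, a]     : expected immediate reward
    pi[s, a]    : policy probabilities pi(a | s)
    """
    Q = np.zeros_like(R)
    for _ in range(iters):
        # Inner expectation: V(s') = E_{a' ~ pi(.|s')} Q(s', a')
        V = (pi * Q).sum(axis=1)
        # Outer expectation: Q(s, a) = R(s, a) + gamma * E_{s'} V(s')
        Q = R + gamma * P @ V
    return Q

# Hypothetical 2-state, 2-action MDP: action a moves deterministically to state a.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])
pi = np.full((2, 2), 0.5)  # uniform random policy

Q = evaluate_q(P, R, pi)
# How far Q is from satisfying the equation (should be close to zero)
residual = np.abs(Q - (R + 0.9 * P @ (pi * Q).sum(axis=1))).max()
print(Q)
print(residual)
```

Note that this evaluates a fixed policy. An actor-critic method has no access to `P` and `R` and instead estimates the same quantity from sampled transitions while the policy keeps changing.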

Methods

A3C
SAC
GAE
PPO
TD3

Definition

Actor-Critic is a hybrid architecture combining value-based and policy-based methods that helps to stabilize the training by reducing the variance using two elements:

actor $\to$ controls how our agent behaves (policy-based)

critic $\to$ measures how good the taken action is (value-based)

The two learn in parallel. The actor learns the policy $π_{θ}$, and the critic assists the policy update by estimating the value of the action we take, $\hat{q}_{w} (s, a)$.

The workflow is straightforward:

Step 1: At each timestep $t$, we get the current state $s_{t}$ from the environment and pass it as input to both the Actor and the Critic. The policy takes the state and outputs an action $a_{t}$.

Step 2: The Critic takes $a_{t}$ as input and, together with $s_{t}$, computes the value of taking that action in that state, $\hat{q}_{w} (s_{t}, a_{t})$ (i.e. the Q-value).

Step 3: Performing the action $a_{t}$ in the environment yields a new state $s_{t + 1}$ and a reward $r_{t + 1}$.

Step 4: The Actor updates its policy parameters using the Q-value.

$Δ θ = α \nabla_{θ} \left( \log π_{θ} (a_{t} ∣ s_{t}) \right) \cdot \hat{q}_{w} (s_{t}, a_{t})$

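Step 5: The Critic then updates its value parameters $w$, typically with a temporal-difference step that uses the next action $a_{t + 1}$ sampled from the policy in $s_{t + 1}$:

$Δ w = β \left( r_{t + 1} + γ \hat{q}_{w} (s_{t + 1}, a_{t + 1}) - \hat{q}_{w} (s_{t}, a_{t}) \right) \nabla_{w} \hat{q}_{w} (s_{t}, a_{t})$

Note that the policy and value updates have different learning rates (i.e. $α$ and $β$).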
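The five steps can be sketched as a single loop. This is only a minimal tabular illustration under stated assumptions: a softmax actor whose parameters are per-state logits, a critic that is a plain table of Q-values updated with a one-step TD rule, and a made-up two-state environment (`env_step`) in which action 1 pays +1 and action 0 pays -1. None of these choices come from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
gamma, alpha, beta = 0.5, 0.05, 0.1  # discount, actor lr, critic lr

theta = np.zeros((n_states, n_actions))  # actor parameters (softmax logits)
w = np.zeros((n_states, n_actions))      # critic parameters (tabular q-values)

def policy(s):
    """Softmax policy pi_theta(. | s)."""
    z = theta[s] - theta[s].max()  # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def env_step(s, a):
    """Hypothetical toy environment: action 1 pays +1, action 0 pays -1,
    and the chosen action becomes the next state."""
    reward = 1.0 if a == 1 else -1.0
    return int(a), reward

s = 0
a = rng.choice(n_actions, p=policy(s))       # Step 1: actor picks a_t
for t in range(5000):
    q_sa = w[s, a]                           # Step 2: critic scores (s_t, a_t)
    s2, r = env_step(s, a)                   # Step 3: environment transition

    # Step 4: actor update. For a softmax policy the gradient of
    # log pi(a | s) with respect to the logits of s is onehot(a) - pi(. | s).
    grad_log_pi = -policy(s)
    grad_log_pi[a] += 1.0
    theta[s] += alpha * grad_log_pi * q_sa

    # Step 5: critic update with a one-step TD target, using the next
    # action sampled from the current policy.
    a2 = rng.choice(n_actions, p=policy(s2))
    td_error = r + gamma * w[s2, a2] - w[s, a]
    w[s, a] += beta * td_error               # gradient of a table entry is 1

    s, a = s2, a2

print(policy(0), policy(1))
print(w)
```

In this toy setup the critic's estimate for action 1 never goes negative and its estimate for action 0 never goes positive, so every actor step moves probability toward action 1. Real environments are not this forgiving, which is one motivation for the advantage-based variant.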

Improvements — Advantage function

Advantage function

It calculates the relative advantage of taking an action compared to the average value of the state:
$A (s, a) = Q (s, a) - V (s)$

$V (s)$ is the average value of that state.

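$Q (s, a)$ can be estimated from a single transition as $Q (s, a) ≈ r + γV (s^{'})$, where $r$ is the observed reward and $s^{'}$ the next state.

Therefore $A (s, a) ≈ r + γV (s^{'}) - V (s)$, which is the TD Error.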
$\text{Gradient Update Direction} = \begin{cases} + \nabla_{θ} \log π_{θ} (a ∣ s) & \text{if } A (s, a) > 0 \text{ (i.e. does better)} \\ - \nabla_{θ} \log π_{θ} (a ∣ s) & \text{if } A (s, a) < 0 \text{ (i.e. does worse)} \end{cases}$
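As a small worked example of the TD-error form of the advantage and of how its sign sets the update direction, consider the sketch below. The numbers and the vector `grad_log_pi` are arbitrary placeholders chosen for illustration, and `td_advantage` is a hypothetical helper name.

```python
import numpy as np

def td_advantage(r, v_s, v_next, gamma):
    """One-step TD-error estimate of A(s, a) = r + gamma * V(s') - V(s)."""
    return r + gamma * v_next - v_s

# Placeholder for grad_theta log pi_theta(a | s); any vector works here.
grad_log_pi = np.array([0.5, -0.5])

# Transition that turned out better than expected: 1 + 0.5 * 2 - 1 = 1.0
adv_better = td_advantage(r=1.0, v_s=1.0, v_next=2.0, gamma=0.5)
# Transition that turned out worse than expected: 0 + 0.5 * 1 - 1 = -0.5
adv_worse = td_advantage(r=0.0, v_s=1.0, v_next=1.0, gamma=0.5)

# The update is the score vector scaled by the advantage, so a positive
# advantage follows +grad_log_pi and a negative one follows -grad_log_pi.
step_better = adv_better * grad_log_pi
step_worse = adv_worse * grad_log_pi
print(adv_better, step_better)
print(adv_worse, step_worse)
```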

We use $A (s, a)$ instead of $Q (s, a)$ since $V (s)$ acts as a baseline, centering the updates around 0, which directly reduces variance without introducing bias.
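The variance claim can be checked exactly on a single state by enumerating every action instead of sampling. The sketch below assumes a softmax policy over three actions with made-up probabilities and Q-values, and compares the per-sample gradient weighted by $Q (s, a)$ against the one weighted by $A (s, a)$.

```python
import numpy as np

# Made-up policy and Q-values for one state with three actions.
pi = np.array([0.2, 0.3, 0.5])
Q = np.array([10.0, 11.0, 12.0])
V = pi @ Q  # V(s) = E_{a ~ pi} Q(s, a)

# For a softmax policy, grad of log pi(a | s) w.r.t. the logits is
# onehot(a) - pi; row a of `score` holds that vector.
score = np.eye(3) - pi

def mean_and_variance(weights):
    """Exact mean and total variance of score[a] * weights[a] under a ~ pi."""
    samples = score * weights[:, None]
    mean = pi @ samples
    var = pi @ ((samples - mean) ** 2).sum(axis=1)
    return mean, var

mean_q, var_q = mean_and_variance(Q)      # weight by Q(s, a)
mean_a, var_a = mean_and_variance(Q - V)  # weight by A(s, a) = Q(s, a) - V(s)

print(mean_q, mean_a)  # same expected gradient
print(var_q, var_a)    # much smaller spread with the baseline
```

The two means agree because the expected score $\mathbb{E}_{a \sim π} [\nabla_{θ} \log π_{θ} (a ∣ s)]$ is zero, so subtracting any quantity that depends only on the state leaves the expected gradient unchanged.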

🚀 Costin Chitic
