Policy Gradient Methods

So far, I covered the Value Based Methods, where we estimate a value function as an intermediate step towards finding an optimal policy.

In Value Based Methods, the policy ( $π$ ) is derived from the value function: it selects the action with the highest value given the current state. For example, in Q-Learning we used an $ϵ$-greedy policy.

In policy-based methods, we directly learn to approximate $π^{*}$ without having to learn a value function.

Classes of policy gradient methods:

Vanilla Policy Gradient
TRPO (Trust Region Policy Optimization)
PPO (Proximal Policy Optimization)
REINFORCE (Monte-Carlo Policy Gradient)

The main idea is to parametrize the policy, typically with a neural network (NN).

This way, the policy will output a probability distribution over actions (stochastic policy).
$π_{θ} (s) = P [a ∣ s; θ]$
To optimize it, we define an objective function $J (θ)$ (the expected cumulative reward) and maximize it using gradient ascent. The parameters $θ$ determine the distribution over actions in each state.
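A minimal sketch of a parametrized stochastic policy, assuming a linear layer followed by a softmax in place of a full neural network. The sizes, the random `theta`, and the random `state` are placeholders chosen only for illustration.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the probabilities are unchanged.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def policy(theta, state):
    """Return pi_theta(. | s): one probability per action."""
    return softmax(theta @ state)

rng = np.random.default_rng(0)
n_actions, n_features = 3, 4                      # hypothetical sizes
theta = rng.normal(size=(n_actions, n_features))  # policy parameters
state = rng.normal(size=n_features)               # placeholder state features

probs = policy(theta, state)
action = rng.choice(n_actions, p=probs)           # sample a ~ pi_theta(. | s)
```

Changing `theta` changes `probs`, which is how the parameters shape the distribution over actions in a state.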

The main advantage is the simplicity of integration $\to$ we can estimate the policy directly without storing additional data (action-values).

Policy Gradient versus Value Based

PG can learn a stochastic policy, while value methods cannot. This has several consequences:

Advantages

no need to implement the exploration/exploitation trade-off by hand. Since we output a probability distribution over actions, the agent explores the state space without always taking the same trajectory.
we don’t face perceptual aliasing $\to$ when two states seem (or are) the same but need different actions.
- an optimal stochastic policy will randomly move left or right in the aliased states, so it will not get stuck and will reach the goal state with high probability.
more effective in high-dimensional action spaces & continuous action spaces.
better convergence properties:
- Value functions $\to$ use an aggressive operator (the max over action values) to change the value function. The action probabilities may change drastically for an arbitrarily small change in the estimated action values if that change results in a different action having the maximal value.
- PG methods $\to$ stochastic policy action preferences (probability of taking action) change smoothly over time.
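The contrast between the two update behaviours can be sketched numerically. The two value vectors below are made-up numbers, and the softmax over preferences stands in for a stochastic policy.

```python
import numpy as np

def greedy_probs(q):
    # Deterministic greedy policy: all probability on the argmax action.
    p = np.zeros_like(q)
    p[np.argmax(q)] = 1.0
    return p

def softmax_probs(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

before = np.array([1.00, 0.99])   # hypothetical estimates
after = np.array([1.00, 1.01])    # a tiny change that flips the argmax

greedy_shift = np.abs(greedy_probs(after) - greedy_probs(before)).max()
softmax_shift = np.abs(softmax_probs(after) - softmax_probs(before)).max()
```

Here `greedy_shift` is a full jump in probability, while `softmax_shift` stays small for the same change in the inputs.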

Disadvantages

typically converge to a local rather than global optimum.
evaluating a policy is typically inefficient (slow) and high variance.
- the high variance problem is partially solved using Actor Critic methods.

In some cases, stochastic policies are the best. Imagine rock-paper-scissors: if your policy was deterministic, your opponent would eventually figure it out, and you would keep losing.
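The rock-paper-scissors argument can be checked with a small payoff matrix. This is only a sketch: the payoff convention (+1 win, 0 draw, -1 loss) and the helper `worst_case` are assumptions made for illustration.

```python
import numpy as np

# Rows: our action, columns: opponent action, order (rock, paper, scissors).
payoff = np.array([
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
], dtype=float)

def worst_case(policy):
    """Expected payoff against an opponent who knows our policy and best-responds."""
    return (policy @ payoff).min()

always_rock = np.array([1.0, 0.0, 0.0])   # deterministic policy
uniform = np.full(3, 1.0 / 3.0)           # stochastic policy

det_value = worst_case(always_rock)
uniform_value = worst_case(uniform)
```

The deterministic policy loses every round once the opponent adapts, while the uniform policy breaks even on average against any opponent.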

Policy-Based versus Policy-Gradient

Policy-gradient methods $\subset$ policy-based methods

The difference is in how they optimize the parameter $θ$ .

policy-based methods $\to$ search directly for the optimal policy $π^{*}$ and optimize the parameters $θ$ indirectly by maximizing a local approximation of the objective function.
policy-gradient methods $\to$ search directly for the optimal policy $π^{*}$ and optimize the parameters $θ$ directly by performing gradient ascent on the objective function $J (θ)$ .

With policy-based methods, the idea is simple: the aim is to increase (or decrease) $P (a ∣ s)$ . If we win the episode, we consider that each action taken was good and must be sampled more often in the future.

But you don’t know how good that policy is. That is exactly why policy gradient methods introduce the objective function.

Mathematics of Policy Gradient, briefly

$J (θ)$ measures the performance of the agent: over the trajectories ( $τ$ ) generated by the policy, it outputs the expected cumulative reward, where $R (τ)$ is the cumulative reward of a single trajectory. Our goal is therefore to maximize this expected return:

$\max_{θ} J (θ) = E_{τ \sim π} [R (τ)]$

the expected return is a weighted average of all possible values that the return $R (τ)$ can take.
- the weights are given by $P (τ; θ)$ .
  - $P (τ; θ)$ : probability of each possible trajectory $τ$ . It depends on $θ$ , since $θ$ defines the policy used to select the actions.
- $R (τ)$ : return from an arbitrary trajectory $\to r_{t + 1} + γ r_{t + 2} + γ^{2} r_{t + 3} + \dots$
- we consider all the possible trajectories, “weighted” by their probabilities, to calculate the expected return.

Therefore, we can write the same objective function as:

$J (θ) = \sum_{τ} P (τ; θ) R (τ)$
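To make the weighted sum concrete, here is a sketch that enumerates every trajectory of a deliberately tiny problem. It assumes a hypothetical task with two actions, a fixed horizon of two steps, a reward of 1 for action 1 and 0 otherwise, and a state-independent policy with a single scalar parameter `theta`.

```python
import itertools
import math

def pi(theta):
    """Action probabilities [P(a=0), P(a=1)] from a single scalar parameter."""
    p1 = 1.0 / (1.0 + math.exp(-theta))
    return [1.0 - p1, p1]

def expected_return(theta, gamma=0.9, horizon=2):
    probs = pi(theta)
    total = 0.0
    # Enumerate every possible trajectory tau = (a_0, ..., a_{horizon-1}).
    for tau in itertools.product([0, 1], repeat=horizon):
        p_tau = math.prod(probs[a] for a in tau)                           # P(tau; theta)
        r_tau = sum(gamma**t * float(a == 1) for t, a in enumerate(tau))   # R(tau)
        total += p_tau * r_tau
    return total

J = expected_return(theta=0.5)
```

Even in this toy setting the loop visits `2**horizon` trajectories, which hints at why the exact sum is out of reach for realistic problems.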

We update our parameters with gradient ascent:

$θ \leftarrow θ + α \cdot \nabla_{θ} J (θ)$

Two problems arise when computing this gradient:

we can’t calculate the true gradient $\to$ it requires calculating the probability of each possible trajectory, which is computationally too expensive.
we can’t differentiate this objective function directly, since it depends on the environment dynamics, which we might not know.

Policy Gradient Theorem

The policy gradient theorem helps us reformulate the objective function into a differentiable function that does not involve the differentiation of the state distribution.
$\nabla_{θ} J (θ) = E_{π_{θ}} [\nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t}) R (τ)]$

$\nabla_{θ} \log π_{θ} (a_{t} ∣ s_{t})$ is the direction of the steepest increase of the (log) probability of selecting action $a_{t}$ from state $s_{t}$ .

$R (τ)$ is just the cumulative reward we discussed earlier.
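For a softmax policy over action preferences, the log-probability gradient has a simple closed form: a one-hot vector for the chosen action minus the action probabilities. The sketch below assumes a single state with three actions and made-up preferences, and compares the closed form with a finite-difference estimate.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    """Gradient of log pi_theta(a) w.r.t. the preferences: onehot(a) - pi_theta."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

theta = np.array([0.2, -0.5, 1.0])   # hypothetical action preferences
a = 1
analytic = grad_log_pi(theta, a)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(theta)
for i in range(len(theta)):
    d = np.zeros_like(theta)
    d[i] = eps
    numeric[i] = (np.log(softmax(theta + d)[a]) - np.log(softmax(theta - d)[a])) / (2 * eps)
```

Taking a small step along `analytic` raises the probability of action `a`, which matches the "steepest increase" reading above.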

As mentioned in the beginning, REINFORCE is a policy-gradient algorithm that uses Monte Carlo sampling: it estimates the return over an entire episode before updating the policy parameters. It repeats the following steps:

use the policy $π_{θ}$ to collect an episode $τ$
use the episode to estimate the gradient $\hat{g} = \nabla_{θ} J (θ)$
update the weights of the policy $θ \leftarrow θ + α \hat{g}$
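The three steps above can be sketched with a tabular softmax policy in NumPy. Everything about the environment is an assumption made for illustration: the state is the time step, episodes last three steps, and action 1 pays a reward of 1 while action 0 pays 0. The learning rate, discount, and episode count are arbitrary choices, and no baseline is used.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def run_episode(theta, rng, horizon=3):
    """Collect one episode with pi_theta in a hypothetical toy task.

    The state is the time step t; action 1 pays reward 1, action 0 pays 0.
    """
    states, actions, rewards = [], [], []
    for t in range(horizon):
        a = rng.choice(2, p=softmax(theta[t]))
        states.append(t)
        actions.append(a)
        rewards.append(float(a == 1))
    return states, actions, rewards

def reinforce(episodes=3000, alpha=0.05, gamma=0.9, horizon=3, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((horizon, 2))   # one row of action preferences per state
    for _ in range(episodes):
        # 1. use the policy to collect an episode
        states, actions, rewards = run_episode(theta, rng, horizon)
        # 2. estimate the gradient: g_hat = sum_t grad log pi(a_t | s_t) * R(tau)
        ret = sum(gamma**t * r for t, r in enumerate(rewards))
        g_hat = np.zeros_like(theta)
        for s, a in zip(states, actions):
            g_hat[s] -= softmax(theta[s])
            g_hat[s, a] += 1.0
        g_hat *= ret
        # 3. gradient ascent step on the policy parameters
        theta += alpha * g_hat
    return theta

theta = reinforce()
p_good = np.array([softmax(row)[1] for row in theta])   # P(action 1) in each state
```

After training, `p_good` should be close to 1 in every state, since action 1 is the only rewarded action. Because each update uses a single sampled episode, the estimate is noisy, which is the high-variance issue noted under Disadvantages.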

🚀 Costin Chitic
