Proximal Policy Optimization (PPO)

Related to Actor-Critic Methods. PPO $\subset$ Actor-Critic $\subset$ Policy-Gradient $\subset$ Policy-Based.

How can we take the biggest possible improvement step on a policy using the data we currently have, without stepping so far that we accidentally cause performance collapse?

This question is the motivation for PPO.

PPO is a family of first-order methods that use a few tricks to keep the new policy close to the old one, which improves the training stability of the policy. To do that, we use a ratio that indicates the difference between the current and the old policy, and clip it to a specific range $[1 - \epsilon, 1 + \epsilon]$.

Empirically, smaller policy updates during training are more likely to converge to an optimal solution. A step that is too large in the wrong direction can take a long time to recover from, or may leave no possibility of recovering at all.

PPO is an on-policy algorithm. It can be used for environments with either discrete or continuous action spaces.

The problem with REINFORCE (vanilla policy gradient), a policy-gradient method, is that the size of each update scales with the advantage and is hard to control: an advantage that is too small results in a slow training process, and an advantage that is too large results in too much variability in training.

$$L^{PG}(\theta) = E\left[\log \pi_{\theta}(a_{t} \mid s_{t}) \, A_{t}\right]$$

$A_{t}$ is the advantage we are talking about.
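As a minimal sketch of the objective above, the expectation can be estimated as a sample mean over a batch. The function name `pg_objective` and the numbers are hypothetical and chosen only for illustration. Because each term is a log-probability multiplied by an advantage, the magnitude of the advantage directly scales the objective and its gradient.

```python
import math


def pg_objective(log_probs, advantages):
    """Sample-mean estimate of L^PG = E[log pi(a_t | s_t) * A_t]."""
    assert len(log_probs) == len(advantages)
    terms = [lp * adv for lp, adv in zip(log_probs, advantages)]
    return sum(terms) / len(terms)


# Hypothetical batch: probabilities the policy assigned to the sampled actions.
probs = [0.5, 0.25, 0.8]
log_probs = [math.log(p) for p in probs]
advantages = [1.0, -2.0, 0.5]

base = pg_objective(log_probs, advantages)
# Multiplying every advantage by 10 multiplies the objective by 10 as well.
scaled = pg_objective(log_probs, [10 * a for a in advantages])
```

Here `scaled` is ten times `base`, which is one way to see why very small or very large advantages translate into very small or very large updates.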

In PPO, the clipped surrogate objective function is given by:

$$L^{CLIP}(\theta) = \hat{E}\left[\min\left(r_{t}(\theta) \hat{A}_{t}, \; \text{clip}(r_{t}(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_{t}\right)\right]$$

$r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{old}}(a_{t} \mid s_{t})}$ is the ratio between the probability of taking action $a_{t}$ in state $s_{t}$ under the current policy and under the old policy. It replaces the log probability used in $L^{PG}$.
- if $r_{t}(\theta) > 1$ ⇒ the action $a_{t}$ at state $s_{t}$ is more likely in the current policy than in the old one
- if $r_{t}(\theta) < 1$ ⇒ the ratio lies in $[0, 1)$ and the action is less likely in the current policy than in the old one.
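The ratio can be sketched in a few lines. Working from log-probabilities (exponentiating their difference) is a common choice in implementations, but it is an assumption here rather than something these notes prescribe. The helper name `probability_ratio` and the probabilities are hypothetical.

```python
import math


def probability_ratio(logp_new, logp_old):
    """r_t = pi_new(a_t | s_t) / pi_old(a_t | s_t), from log-probabilities."""
    return math.exp(logp_new - logp_old)


# The old policy gave the sampled action probability 0.4.
logp_old = math.log(0.4)

# The new policy raises it to 0.6: ratio above 1, action became more likely.
r_up = probability_ratio(math.log(0.6), logp_old)

# The new policy lowers it to 0.2: ratio below 1, action became less likely.
r_down = probability_ratio(math.log(0.2), logp_old)
```

With these numbers `r_up` is about 1.5 and `r_down` is about 0.5.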

The objective function removes the incentive for changes that move the ratio far from 1. We take the minimum of the clipped and unclipped terms, so the final objective is a lower bound (pessimistic bound) on the unclipped objective.
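A minimal pure-Python sketch of the clipped surrogate, to make the lower-bound property concrete. The names `clip`, `clipped_term`, and `clipped_objective` are hypothetical, and `eps=0.2` is an assumed value for illustration, not one fixed by these notes.

```python
def clip(x, lo, hi):
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def clipped_term(ratio, advantage, eps=0.2):
    """Per-sample term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1 - eps, 1 + eps) * advantage
    return min(unclipped, clipped)


def clipped_objective(ratios, advantages, eps=0.2):
    """Sample-mean estimate of L^CLIP over a batch."""
    terms = [clipped_term(r, a, eps) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# The clipped term never exceeds the unclipped term r * A,
# whatever the sign of the advantage.
for adv in (1.0, -1.0):
    for r in (0.5, 0.9, 1.1, 1.5):
        assert clipped_term(r, adv) <= r * adv
```

For example, with a positive advantage of 1.0 and a ratio of 1.5, the term is capped at about 1.2 instead of 1.5, so pushing the ratio further up earns nothing extra.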

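Interestingly, the clip only discards a change in the ratio when that change would improve the objective by too much; when a change makes the objective worse, the unclipped term is kept and counted in full. That is why the bound is called pessimistic (conservative).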

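To be more precise, the clipped term is what makes the bound pessimistic.
If the ratio is $> 1 + \epsilon$ while $A_{t} > 0$, or $< 1 - \epsilon$ while $A_{t} < 0$, the clipped term is the one selected by the minimum and the gradient is equal to 0. (see image explanation)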

Looks like PPO only blocks updates that would push the policy further outside the trust region in the direction it’s already gone.

For example, rows 3 and 4:

- Row 3: action is less likely than before ( $r < 1 - \epsilon$ ) but was good ( $A_{t} > 0$ ) $\to$ we should increase its probability $\to$ gradient allowed
- Row 4: action is less likely than before ( $r < 1 - \epsilon$ ) but was bad ( $A_{t} < 0$ ) $\to$ we’d be pushing probability down even further $\to$ too large a step, so clip it and zero the gradient
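The two cases above, and their mirror images for $r > 1 + \epsilon$, can be checked with a small sketch of the derivative of the per-sample clipped term with respect to the ratio. The function name `clipped_term_grad` and `eps=0.2` are assumptions for illustration. The exact boundary points $r = 1 \pm \epsilon$, where the term is not differentiable, are not treated specially.

```python
def clipped_term_grad(ratio, advantage, eps=0.2):
    """d/dr of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    Returns A when the unclipped term is the active one, and 0.0 when the
    clipped term is strictly smaller (it is then constant in r).
    """
    unclipped = ratio * advantage
    clipped = max(1 - eps, min(ratio, 1 + eps)) * advantage
    if unclipped <= clipped:
        return advantage
    return 0.0


# r < 1 - eps and A > 0: the update would raise the probability again,
# so the gradient passes through.
grad_low_good = clipped_term_grad(0.5, 1.0)

# r < 1 - eps and A < 0: the update would lower the probability even more,
# so the gradient is zeroed.
grad_low_bad = clipped_term_grad(0.5, -1.0)

# Mirror cases for r > 1 + eps.
grad_high_good = clipped_term_grad(1.5, 1.0)   # zeroed
grad_high_bad = clipped_term_grad(1.5, -1.0)   # passes through
```

In this sketch only the two cases that would push the ratio further away from 1 return a zero gradient; the two that pull it back toward 1 keep the gradient.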

🚀 Costin Chitic
