So far, I covered the Value Based Methods, where we estimate a value function as an intermediate step towards finding an optimal policy.

  • In Value Based Methods, the policy () is a function that will select the action with the highest value given a current state. For example, in Q-Learning we used a -greedy policy.

In policy-based methods, we directly learn to approximate without having to learn a value function.

Classes of policy gradient methods:

  • Vanilla Policy Gradient
  • TRPO (Trust Region Policy Optimization)
  • PPO (Proximal Policy Optimization)
  • REINFORCE (Monte-Carlo Policy Gradient)

The main idea is to parametrize the policy.

aka NN.

This way, the policy will output a probability distribution over actions (stochastic policy).

To optimize it, we define an objective function (the expected cumulative reward) and look to maximize it using gradient ascend the parameter will affect the distribution of actions over a state.

The main advantage is the simplicity of integration we can estimate the policy directly without storing additional data (action-values).


Policy Gradient versus Value Based

PG can learn a stochastic policy, while value methods cannot. It introduces some consequences:

Advantages

  • no exploration/exploitation trade-off implementation by hand. Since we output a probability distribution over actions, the agent explores the state space without always taking the same trajectory.
  • we don’t face perceptual aliasing when two states seem (or are) the same but need different actions.
    • an optimal stochastic policy will randomly move left or right, and it will not reach the goal state with a higher probability (what?)
  • more effective in high-dimensional action spaces & continuous actions spaces.
  • better convergence properties:
    • Value functions use an aggressive operator to change the value function. The action probabilities may change drastically for an arbitrarily small change in the estimated action values if that change results in a different action having the maximal value.
    • PG methods stochastic policy action preferences (probability of taking action) change smoothly over time.

Disadvantages

  • typically converge to a local rather than global optimum.
  • evaluating a policy is typically inefficient (slow) and high variance.
    • the high variance problem is partially solved using Actor Critic methods.

In some cases, stochastic policies are the best. Imagine rock-paper-scissors: if your policy was deterministic, your opponent would eventually figure it out, and you would keep losing.


Policy-Based versus Policy-Gradient

The difference is in how they optimize the parameter .

  • policy-based methods search directly for the optimal policy and optimize the parameters indirectly by maximizing the local approximation of the objective function.
  • policy-gradient methods search directly for the optimal policy and optimize the parameters directly by performing the gradient ascend on the objective function .

With policy based methods, this pseudo code represents the idea perfectly. The aim is to increase (or decrease) the . If we win the episode, we consider that each action taken was good and must be more sampled in the future.

But you don’t know how good that policy is. Exactly why policy gradient methods introduced the objective function.


Mathematics of Policy Gradient, briefly

gives us the performance of the agent given a trajectory () and it outputs the expected cumulative reward . Our goal is therefore to maximize this expected return:

  • the expected return will be the weighted average.
    • weights are given by of all possible values that the return can take.
      • — probability of each possible trajectory depends on since it defines the policy that it uses to select the actions.
  • — return from an arbitrary trajectory.
    • we consider all the possible trajectories “weighted” by their probabilities to calculate the expected return

Therefore we can write the same objective functions as:

We update our parameters with gradient-ascend

  • can’t calculate the true gradient it requires calculating the probability of each possible trajectory
  • can’t differentiate this objective function since it’s attached to the environment. The problem is we might not know about it.

Policy Gradient Theorem

helps us reformulate the objective function into a differentiable function that does not involve the differentiation of the state distribution.

  • is the direction of the steepest increase of probability of selecting action from state .
  • is just the cumulative reward we discussed earlier.

As mentioned in the beginning, REINFORCE is a policy-gradient algorithm based on Monte Carlo methods to estimate the return over an entire episode before updating the data. It works according to sequential steps:

  • use the policy to collect an episode
  • use the episode to estimate the gradient
  • update the weights of the policy