Value Based Methods

Related to Reinforcement Learning.

Policy-based methods directly train the policy to select what action to take given a state (or a probability distribution over actions at that state) and the optimal policy ( $π^{*}$ ) is found by training the policy directly.

can be deterministic or stochastic

Definition

Value-based methods learn the value function that maps a state to the expected value of being at that state to have an optimal policy $π^{*}$ .
$π^{*} (s) = ar g a max Q^{*} (s, a)$

v (s) = E [G_{t} ∣ S_{t} = s]

$v (s)$ is the value of a state $s$
$E [G_{t} ∣ S_{t} = s]$ is the expected discounted return of the agent if it starts in that state $s$

State value functions

Two typologies of value-based functions:

state-value function (denoted with $v$ ): it calculates the value of a state $S_{t}$
- $v (s) = E [G_{t} ∣ S_{t} = s] = E [R_{t + 1} + R_{t + 2} + R_{t + 3} + \dots ∣ S_{t} = s]$
- For each state, the state-value function $v (s)$ outputs the expected return $E$ if the agent starts at that state s and then follows the policy $π$ forever afterwards (for all future timesteps, if you prefer).
action-value function (denoted with $q$ ): calculate the value of the state-action pair $(S_{t}, A_{t})$ .
- $q (s, a) = E [G_{t} ∣ S_{t} = s, A_{t} = a] = E [R_{t + 1} + R_{t + 2} + R_{t + 3} + \dots ∣ S_{t} = s, A_{t} = a]$
- for each state and action pair, the action-value function outputs the expected return if the agent starts in that state, takes that action, and then follows the policy forever after.

To calculate EACH value of a state or a state-action pair is redundant.

Bellman Equation simplifies the state value or state-action value calculation.

In this course, we will focus on Epsilon Greedy $ϵ$ -greedy policy that handles the exploration/exploitation trade-off.

There are two strategies on how train our value function or our policy function.

On-policy vs. Off-policy

On-policy = learning from policy’s own demonstrations

Direct experience. Evaluates or improves the policy that is used to make decisions. Less sample efficient
Examples: Value-Based Methods (Sarsa), Policy Gradient Methods (PPO, A2C, A3C)

Off-policy = learning from other policy’s demonstrations. Off-Policy Learning is the idea of evaluating target policy $π (a ∣ s)$ while following behavior policy $μ (a ∣ s)$ .

Evaluates or improves a policy different from that used to generate the data. More sample efficient.
Examples: Value-Based Methods (Q-Learning, DQN), Policy Gradient Methods (DDPG, SAC, TD3)

Q-Learning is off-policy, but we only update states that we visited?

The fundamental difference is that the optimal policy is not followed from that. You learn about the optimal policy using data collected from a different policy.

$ϵ$ -greedy Exploitation/Exploration

The $ϵ$ -greedy strategy is a policy that handles the exploration/exploitation trade-off. With an initial value of $ϵ = 1.0$ :

With _probability $1 - ϵ$ : we do exploitation (aka our agent selects the action with the highest state-action pair value).
With probability $ϵ$ : we do exploration (trying random action).

At the beginning of the training, the probability of doing exploration will be huge since $ϵ$ is very high, mainly exploring. But as the training goes on, Q-table estimates better, $ϵ$ is reduced since we will need less and less exploration and more exploitation.

🚀 Costin Chitic

Recent Notes

Actor-Critic Methods

Deep Q-Learning

Monte Carlo Learning

Proximal Policy Optimization (PPO)

Q-Learning

Value Based Methods

Graph View

Backlinks