Q-Learning

Q-Learning is an off-policy value-based method that uses a TD approach to train its action-value function.

Deep Q-Learning (DQL) handles the continuous (or very large) state case. Instead of a table, it uses a neural network that takes a state as input and approximates the Q-value of each action in that state.
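To make the "state in, one Q-value per action out" idea concrete, here is a minimal sketch of an untrained two-layer network in plain Python. The layer sizes, the ReLU activation, and the names `init_layer`, `matvec`, and `q_values` are assumptions for illustration, not part of any specific DQL implementation.

```python
import random

random.seed(0)
STATE_DIM, HIDDEN, N_ACTIONS = 4, 8, 2  # hypothetical sizes

def init_layer(n_in, n_out):
    """Small random weights, stored as a list of n_in rows with n_out columns."""
    return [[random.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]

W1 = init_layer(STATE_DIM, HIDDEN)
W2 = init_layer(HIDDEN, N_ACTIONS)

def matvec(x, W):
    """Compute x @ W for a vector x and a matrix W stored as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def q_values(state):
    """Map a continuous state vector to one Q-value estimate per action."""
    hidden = [max(0.0, h) for h in matvec(state, W1)]  # ReLU hidden layer
    return matvec(hidden, W2)

state = [0.1, -0.3, 0.05, 0.7]  # a continuous state: there is no table row for it
q = q_values(state)
greedy_action = max(range(N_ACTIONS), key=lambda a: q[a])
```

The weights here are random, so the outputs are meaningless until the network is trained; the sketch only shows the input and output shapes that replace a table lookup.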

Tabular Q-Learning

Consider off-policy learning of action-values $Q(s, a)$
The next action is chosen using the behavior policy $a_{t+1} \sim \mu(\cdot \mid s_{t+1})$
But we consider an alternative successor action $a' \sim \pi(\cdot \mid s_{t+1})$
- We let $a' = \arg\max_{a} Q(s_{t+1}, a)$
And update $Q(s_t, a_t)$ towards the value of the alternative action: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma Q(s_{t+1}, a') - Q(s_t, a_t)]$
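The update rule above can be sketched as a single function over a table stored as a list of lists. The function name `q_learning_update`, the default values of `alpha` and `gamma`, and the example table are assumptions for illustration; terminal states are not handled in this sketch.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update to Q[s][a] in place; return the TD error."""
    # a' = argmax_a Q(s_{t+1}, a): the alternative (greedy) successor action
    a_prime = max(range(len(Q[s_next])), key=lambda b: Q[s_next][b])
    td_error = r + gamma * Q[s_next][a_prime] - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error

# Hypothetical table with 2 states and 2 actions.
Q = [[0.0, 0.0],
     [1.0, 2.0]]
td = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
# TD error is 1.0 + 0.9 * 2.0 - 0.0, so Q[0][1] moves a tenth of the way towards 2.8.
```

Note that the action the behavior policy actually takes in `s_next` never appears in the update; only the greedy action there does.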

Q-Learning Properties (Off-policy learning)

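Q-Learning converges to the optimal action-value function, and hence to an optimal policy, even if the agent is acting suboptimally, provided every state-action pair keeps being visited and the step size is chosen appropriately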
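A small sanity check of this property, under assumptions that are not in the note: a hypothetical two-state deterministic chain (`step`), a uniformly random behavior policy, and a constant step size, which is enough here only because the environment is deterministic.

```python
import random

GAMMA, ALPHA = 0.9, 0.5
TERMINAL = 2

def step(state, action):
    """Hypothetical chain: action 0 returns to state 0, action 1 moves right."""
    if action == 0:
        return 0, 0.0
    if state == 0:
        return 1, 0.0
    return TERMINAL, 1.0  # moving right from state 1 ends the episode with reward 1

def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0], [0.0, 0.0]]  # Q[state][action] for the two non-terminal states
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            a = rng.randrange(2)  # behavior policy: uniformly random, never greedy
            s_next, r = step(s, a)
            best_next = 0.0 if s_next == TERMINAL else max(Q[s_next])
            Q[s][a] += ALPHA * (r + GAMMA * best_next - Q[s][a])
            s = s_next
    return Q

Q = train()
greedy_policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]
```

For this chain the optimal values work out by hand to $Q^*(1, 1) = 1$, $Q^*(0, 1) = 0.9$ and $Q^*(0, 0) = Q^*(1, 0) = 0.81$, so the learned table should approach those numbers and the greedy policy should be "always move right", even though the agent itself only ever acted at random.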

Q-learning generally uses the 1-step Bellman optimality backup

Tabular Q-learning can produce excellent results in relatively small environments, where the state space is discrete and small enough to store and visit every entry of the table. By comparison, the state space of even a simple video game can contain a few billion states, which makes a table impractical
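A rough back-of-the-envelope calculation shows the scaling problem. The state counts, the four actions, and 8 bytes per entry are all hypothetical numbers chosen for illustration.

```python
def q_table_bytes(n_states, n_actions, bytes_per_value=8):
    """Memory needed to store one float per (state, action) pair."""
    return n_states * n_actions * bytes_per_value

# Hypothetical 4x4 grid world with 4 actions.
small = q_table_bytes(n_states=16, n_actions=4)

# Hypothetical game with three billion states and 4 actions.
large = q_table_bytes(n_states=3_000_000_000, n_actions=4)
large_gb = large / 1e9
```

Under these assumptions the small table needs 512 bytes and the large one 96 GB. Memory is only part of the problem: every entry also has to be visited repeatedly before its estimate is useful.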

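Sarsa is the on-policy counterpart of Q-learning: its target uses the value of the action actually taken next, $Q(s_{t+1}, a_{t+1})$, rather than the maximum over actions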
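The difference between the two methods shows up only in the bootstrap target. A minimal sketch, with hypothetical function names and a made-up table:

```python
def sarsa_target(Q, r, s_next, a_next, gamma=0.9):
    """On-policy target: bootstrap from the action the agent actually takes next."""
    return r + gamma * Q[s_next][a_next]

def q_learning_target(Q, r, s_next, gamma=0.9):
    """Off-policy target: bootstrap from the greedy action in the next state."""
    return r + gamma * max(Q[s_next])

Q = [[0.0, 0.0],
     [1.0, 3.0]]

# Suppose the behavior policy explores and picks action 0 in state 1.
sarsa = sarsa_target(Q, r=0.0, s_next=1, a_next=0)   # uses Q[1][0] = 1.0
qlearn = q_learning_target(Q, r=0.0, s_next=1)       # uses max(Q[1]) = 3.0
```

When the next action happens to be the greedy one, the two targets coincide; they differ only on exploratory steps like the one above.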

Off-policy Control with Q-Learning

We now allow both behavior and target policies to improve
The target policy $\pi$ is greedy w.r.t. $Q(s, a)$: $\pi(s_{t+1}) = \arg\max_{a'} Q(s_{t+1}, a')$

Q-Learning control algorithm: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha [r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)]$

The term $\max_{a'} Q(s_{t+1}, a')$ in the target is what makes Q-learning “off-policy”. Instead of using the action the agent actually takes next, the update looks ahead to the next state $s_{t+1}$ and assumes the agent will take the action $a'$ with the highest current Q-value estimate there.
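Putting the pieces together, a full control loop might look like the sketch below. Everything specific in it is an assumption: the hypothetical five-cell corridor (`step`), the reward of -1 per move, the hyperparameters, and the choice of an epsilon-greedy behavior policy, which the note does not specify.

```python
import random

N_STATES, GOAL = 5, 4          # hypothetical corridor: cells 0..4, cell 4 is terminal
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

def step(state, action):
    """Move left (0) or right (1); every move costs -1 until the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    return next_state, -1.0

def greedy(Q, state):
    """Target policy pi: the action with the highest current Q estimate."""
    return max(range(2), key=lambda a: Q[state][a])

def behavior(Q, state, rng):
    """Behavior policy mu: epsilon-greedy w.r.t. Q (an assumption)."""
    return rng.randrange(2) if rng.random() < EPSILON else greedy(Q, state)

def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]  # the terminal row stays at zero
    for _ in range(episodes):
        s = 0
        while s != GOAL:
            a = behavior(Q, s, rng)
            s_next, r = step(s, a)
            target = r + GAMMA * max(Q[s_next])   # max over a' of Q(s_{t+1}, a')
            Q[s][a] += ALPHA * (target - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
policy = [greedy(Q, s) for s in range(GOAL)]
```

Both policies improve as `Q` improves: `behavior` and `greedy` read the same table, but only `greedy` (through the `max`) appears in the update target. In this corridor the learned greedy policy should be to move right in every cell.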

🚀 Costin Chitic
