Related to Value Based Methods.
Q-Learning is an off-policy value-based method that uses a TD approach to train its action-value function.
DQL is for the continuous state case. instead of using a table, uses a Neural Network that takes a state and approximates Q-values for each action based on that state
Tabular Q-Learning
- Consider off-policy learning of action-values
- Next action is chosen using behavior policy
- But we consider alternative successor
- We let
- And update towards value of alternative action
Q-Learning Properties (Off-policy learning)
Q-Learning converges to optimal policy — even if it’s acting suboptimally
Q-learning generally uses the 1-step bellman optimality backup
Q-learning can produce excellent results for relatively small environments because each state space is discrete and small. For comparison, the state space of a simple video game could contain few billion states, making it practically useless
- Sarsa is the on-policy version of Q-learning
Off-policy Control with Q-Learning
- We now allow both behavior and target policies to improve
- The target policy is greedy w.r.t.
Q-Learning control algorithm:
- is the off-policy target aspect: this is what makes Q-learning “off-policy”. Instead of looking at the action the agent actually chooses next, it looks ahead to the next state and assumes it will select the absolute best possible action () available there.