Temporal Difference Learning

Like Monte Carlo (MC) Learning, TD learning is a model-free policy evaluation method.

TD learns from incomplete episodes via bootstrapping. It waits only one step to form a TD target, then updates $V (S_{t})$ using the immediate reward $R_{t + 1}$ and the discounted current estimate of the next state’s value, $γ \cdot V (S_{t + 1})$ .

Bootstrapping

Bootstrapping in RL means updating a value estimate from other estimates rather than from exact, fully observed values. In TD learning this is done through a Bellman update.

Unlike Monte Carlo Learning,

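TD can learn before knowing the final outcome: it updates online after every step, whereas MC must wait until the end of the episode to know the return.
TD can learn without the final outcome: it works with incomplete sequences and in continuing (non-terminating) environments, whereas MC only applies to episodic (terminating) ones.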

TD exploits the MDP structure, that is, the Markov property: the value of the next state summarizes everything that follows it.

Simplest TD learning algorithm TD(0): Update value $V (S_{t})$ towards the TD target $R_{t + 1} + γ \cdot V (S_{t + 1})$ . The update equation becomes:

$$V (S_{t}) \leftarrow V (S_{t}) + α δ_{t}$$

where $α$ is the step size (learning rate) and $δ_{t}$ is the TD error, the difference between the TD target and the current estimate $V (S_{t})$:

$$δ_{t} = R_{t + 1} + γ V (S_{t + 1}) - V (S_{t})$$
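The TD(0) update can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes a tabular value function stored in a dict, and the helper name `td0_update`, the states `"A"` and `"B"`, the reward, and the values of `alpha` and `gamma` are all made up for the example.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to V[s] in place and return the TD error.

    V is a dict mapping state -> current value estimate (an assumption of
    this sketch; any tabular container would do).
    """
    # By convention a terminal state has value 0, so there is nothing to bootstrap from.
    v_next = 0.0 if terminal else V[s_next]
    td_target = r + gamma * v_next      # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V[s]         # delta_t
    V[s] = V[s] + alpha * td_error      # move V(S_t) a step towards the target
    return td_error


# Illustrative numbers only: one observed transition A -> B with reward 1.
V = {"A": 1.0, "B": 2.0}
delta = td0_update(V, "A", r=1.0, s_next="B")
# TD target = 1 + 0.9 * 2 = 2.8, so delta is about 1.8 and V["A"] moves to about 1.18.
```

Only `V["A"]` changes: the update needs a single transition $(S_{t}, R_{t + 1}, S_{t + 1})$ and never looks at the rest of the episode.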

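Notice the similarity to, and the difference from, Monte Carlo Learning: we just replaced the actual return $G_{t}$ in the update equation with an estimate of it, the TD target $R_{t + 1} + γV (S_{t + 1})$ , which is a sampled Bellman Expectation Backup. Because the target bootstraps from the value of the next state, TD relies on the problem being an MDP, that is, on the Markov property. Monte Carlo does not require this.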
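To make the contrast concrete, here is a small Python sketch that computes both kinds of target for the same episode. The function names `mc_returns` and `td_targets`, the rewards, and the value estimates are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def mc_returns(rewards, gamma):
    """Actual discounted return G_t for every step, computed backwards from the episode end."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]


def td_targets(rewards, next_values, gamma):
    """One-step TD target R_{t+1} + gamma * V(S_{t+1}) for every step."""
    return [r + gamma * v for r, v in zip(rewards, next_values)]


gamma = 0.5
rewards = [1.0, 0.0, 2.0]        # R_1, R_2, R_3 of one finished episode
next_values = [4.0, 2.0, 0.0]    # current estimates V(S_1), V(S_2), and 0 for the terminal state

mc = mc_returns(rewards, gamma)               # needs the whole reward sequence
td = td_targets(rewards, next_values, gamma)  # each entry needs one reward and one estimate
```

The MC targets cannot be computed until the episode has ended, while each TD target is available one step after the state is visited. The price is that the TD targets depend on the current, possibly inaccurate, estimates in `next_values`.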

🚀 Costin Chitic
