Related to Value Based Methods.
Together with Monte Carlo (MC) learning, TD learning is a model-free policy evaluation method.
TD learns from incomplete episodes via bootstrapping. It waits only one step to form the TD target $R_{t+1} + \gamma V(S_{t+1})$, then updates using the immediate reward $R_{t+1}$ and the current discounted guess of how good the next state is, $\gamma V(S_{t+1})$.
Bootstrapping
Bootstrapping in RL means updating a value estimate from other estimates rather than from exact values. It is done via the Bellman update.
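A minimal sketch of a single bootstrapped target, to make the idea concrete. The state names, reward, value table, and discount factor here are made-up numbers for illustration, not taken from any particular problem.

```python
def td_target(reward, next_value, gamma):
    """One-step TD target: immediate reward plus the discounted
    current estimate of the next state's value."""
    return reward + gamma * next_value


# Hypothetical value table: these are current guesses, not true values.
V = {"A": 0.0, "B": 2.0}

# Moving from A to B with reward 1.0: the target is built from V["B"],
# which is itself only an estimate. That reliance is the bootstrapping.
target = td_target(reward=1.0, next_value=V["B"], gamma=0.9)
```

The target depends on `V["B"]`, so any error in that estimate leaks into the update for `A`. This is the price paid for not waiting until the episode ends.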
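Unlike Monte Carlo Learning:
- TD can learn before knowing the final outcome (it updates online, after every step)
- TD can learn without the final outcome (it works on incomplete sequences and in continuing, non-terminating environments)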
TD exploits the MDP structure.
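The simplest TD learning algorithm is TD(0): update the value $V(S_t)$ towards the TD target $R_{t+1} + \gamma V(S_{t+1})$. The update equation becomes:
$$V(S_t) \leftarrow V(S_t) + \alpha \, \delta_t$$
where $\alpha$ is the step size and $\delta_t$ is the TD error, the difference between the two estimates of the return:
$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$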
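The TD(0) update can be sketched as tabular policy evaluation on a small random walk. This is an illustrative setup, not something from these notes: five non-terminal states, a uniformly random left/right policy, reward 1 only when exiting on the right, and assumed values for `alpha`, `gamma`, and the number of episodes.

```python
import random


def td0_random_walk(num_episodes=1000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) evaluation of a random policy on a 5-state walk.

    States 1..5 are non-terminal; states 0 and 6 are terminal (value 0).
    Reward is 1.0 on entering state 6 and 0.0 otherwise.
    """
    rng = random.Random(seed)
    n_states = 5
    right_terminal = n_states + 1
    # Initial guesses of 0.5 for non-terminal states, 0 for terminals.
    V = [0.0] + [0.5] * n_states + [0.0]

    for _ in range(num_episodes):
        s = 3  # every episode starts in the middle state
        while s not in (0, right_terminal):
            s_next = s + rng.choice((-1, 1))
            r = 1.0 if s_next == right_terminal else 0.0
            # TD error: one-step target minus the current estimate.
            td_error = r + gamma * V[s_next] - V[s]
            # Update immediately, without waiting for the episode to end.
            V[s] += alpha * td_error
            s = s_next

    return V[1:-1]  # estimates for the non-terminal states only


values = td0_random_walk()
```

For this walk with `gamma=1.0` the true values are 1/6, 2/6, ..., 5/6, so the returned estimates should hover around those numbers, with some noise because the step size is constant.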
Notice the similarity to, and the difference from, Monte Carlo Learning. We just replaced the return $G_t$ in the update equation with the Bellman Expectation Backup, $R_{t+1} + \gamma V(S_{t+1})$. Because of this substitution, the problem has to be an MDP. That is not mandatory for Monte Carlo.
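To see the substitution side by side, here is a small sketch that computes both kinds of target for one recorded episode. The rewards and the value estimates of the successor states are made-up numbers, and the function names are hypothetical helpers.

```python
def mc_targets(rewards, gamma):
    """Monte Carlo targets: the full return G_t for each step.
    Computed backwards, so the whole episode must be available."""
    G = 0.0
    targets = []
    for r in reversed(rewards):
        G = r + gamma * G
        targets.append(G)
    return targets[::-1]


def td_targets(rewards, next_values, gamma):
    """TD(0) targets: R_{t+1} + gamma * V(S_{t+1}) for each step.
    Each one needs only a single transition and the current estimate."""
    return [r + gamma * v for r, v in zip(rewards, next_values)]


# Hypothetical 3-step episode ending in a terminal state (value 0).
rewards = [0.0, 0.0, 1.0]
next_values = [0.5, 0.5, 0.0]  # current estimates V(S_{t+1})

mc = mc_targets(rewards, gamma=1.0)               # [1.0, 1.0, 1.0]
td = td_targets(rewards, next_values, gamma=1.0)  # [0.5, 0.5, 1.0]
```

Both targets plug into the same update form, $V(S_t) \leftarrow V(S_t) + \alpha (\text{target} - V(S_t))$. Only the last step agrees here, because that is the one place where the next state's value is known exactly.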