Related to Value Based Methods.

Like Monte Carlo (MC) Learning, Temporal-Difference (TD) learning is a Model-Free Policy Evaluation method.

TD learns from incomplete episodes via bootstrapping. It waits only one step before forming a TD target, then updates using the immediate reward plus the discounted current estimate of how good the next state is.

Bootstrapping

Bootstrapping in RL means updating a value estimate towards a target that is itself built from other estimates rather than from exact values. The update is done via a Bellman update.

Unlike Monte Carlo Learning,

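  • TD can learn before knowing the final outcome: it updates online after every step.
  • TD can learn without the final outcome: it works from incomplete sequences, including continuing (non-terminating) environments.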
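
To make the second point concrete, here is a minimal sketch of TD(0) running on a process that never terminates, so a Monte Carlo return is never available. The two-state process, its rewards, and the names `td0_continuing`, `next_state_of`, and `reward_of` are hypothetical and chosen only for illustration; the step size 0.1 and discount 0.9 are likewise assumptions.

```python
def td0_continuing(num_steps=5000, alpha=0.1, gamma=0.9):
    """Run TD(0) online on a hypothetical two-state continuing process."""
    # Assumed dynamics: 0 -> 1 with reward 1, then 1 -> 0 with reward 0, forever.
    next_state_of = {0: 1, 1: 0}
    reward_of = {0: 1.0, 1: 0.0}

    V = [0.0, 0.0]  # initial value estimates
    state = 0
    for _ in range(num_steps):
        next_state = next_state_of[state]
        reward = reward_of[state]
        # Update immediately after one step; no episode end is ever needed.
        V[state] += alpha * (reward + gamma * V[next_state] - V[state])
        state = next_state
    return V


V = td0_continuing()

# Exact values for this process, from V0 = 1 + gamma * V1 and V1 = gamma * V0.
true_v0 = 1.0 / (1.0 - 0.9 ** 2)
true_v1 = 0.9 * true_v0
```

Because the transitions here are deterministic, the estimates settle close to `true_v0` and `true_v1`; with stochastic transitions a constant step size would leave some residual noise.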

TD exploits the MDP structure.

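The simplest TD learning algorithm is TD(0): update the value $V(S_t)$ towards the TD target $R_{t+1} + \gamma V(S_{t+1})$. With step size $\alpha$, the update equation becomes:

$$V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right)$$

where the bracketed term is the TD error $\delta_t$, the difference between the two estimated returns:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$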
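
As a rough sketch, a single TD(0) step can be written as a small function: form the TD target from the reward and the current estimate of the next state, take the difference to the current estimate (the TD error), and move the estimate a fraction of the way. The function name `td0_update`, the dictionary-based value table, the `terminal` flag, and the numbers used are assumptions for illustration.

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to V in place and return the TD error."""
    # By convention the value of a terminal state is 0.
    next_value = 0.0 if terminal else V[next_state]
    td_target = reward + gamma * next_value  # R_{t+1} + gamma * V(S_{t+1})
    td_error = td_target - V[state]          # delta_t
    V[state] += alpha * td_error             # V(S_t) <- V(S_t) + alpha * delta_t
    return td_error


# Hypothetical value table and transition A -> B with reward 0.5.
V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", reward=0.5, next_state="B")
# TD target is 0.5 + 0.9 * 1.0, so delta is about 1.4 and V["A"] is about 0.14.
```

Only `V["A"]` changes; the estimate for `"B"` is used as-is, which is exactly the bootstrapping step.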

Notice the similarity/difference to Monte Carlo Learning. We just replaced the full return $G_t$ in the update equation with a one-step sample of the Bellman Expectation Backup, $R_{t+1} + \gamma V(S_{t+1})$. Because this backup relies on the Markov property, the problem has to be an MDP. That is not mandatory for Monte Carlo.
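
The difference between the two targets can be sketched on a single recorded episode. The rewards, the current value estimates, and the helper names `mc_targets` and `td_targets` below are made up for illustration; the discount is set to 1 to keep the arithmetic simple.

```python
def mc_targets(rewards, gamma):
    """Monte Carlo targets: the full discounted return G_t from each time step."""
    targets = [0.0] * len(rewards)
    G = 0.0
    # Work backwards so that G_t = R_{t+1} + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        targets[t] = G
    return targets


def td_targets(rewards, next_values, gamma):
    """TD(0) targets: R_{t+1} + gamma * V(S_{t+1}), using current estimates."""
    return [r + gamma * v for r, v in zip(rewards, next_values)]


# Hypothetical three-step episode; rewards[t] is R_{t+1}.
rewards = [0.0, 0.0, 1.0]
# Current estimates V(S_1), V(S_2), and 0 for the terminal state.
next_values = [0.5, 0.5, 0.0]
gamma = 1.0

mc = mc_targets(rewards, gamma)               # needs the whole episode
td = td_targets(rewards, next_values, gamma)  # each entry needs only one step
```

Here `mc` is `[1.0, 1.0, 1.0]` and `td` is `[0.5, 0.5, 1.0]`: the Monte Carlo targets depend on the actual final outcome, while the TD targets for the first two steps depend on the current guesses for the next states.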