TD3 (Twin Delayed Deep Deterministic Policy Gradient) is an evolution of the DDPG algorithm. It uses an actor-critic framework with twin critics, delayed policy updates, and target policy smoothing.
- off-policy algorithm
- only for envs with continuous action spaces
Why is TD3 better than DDPG?
- In DDPG, the learned Q-function can begin to dramatically overestimate Q-values, and the policy then exploits these errors in the Q-function. TD3 adds three tricks to counter this.
Trick one: Twin Critic Networks: Two Q-function estimators are learned, and the smaller of their two values is used to form the target, which mitigates overestimation.
Trick two: Delayed Policy Updates: The policy is updated less frequently than the Q-functions.
Trick three: Target Policy Smoothing: Clipped noise is added to the target action for robustness, i.e. it becomes harder for the policy to exploit Q-function errors because Q is smoothed out along changes in action.
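
The three tricks can be sketched in a few lines of plain Python. This is a minimal illustration, not a full TD3 implementation: it handles a single transition with a scalar action, and the names `td3_target`, `should_update_policy`, and the toy actor and critics are hypothetical. The default values (`noise_std=0.2`, `noise_clip=0.5`, `policy_delay=2`, `gamma=0.99`) are assumed here as commonly used settings, not taken from these notes.

```python
import random


def td3_target(reward, done, next_state, target_actor, target_q1, target_q2,
               rng, gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Return the bootstrapped TD3 target for one transition (scalar action)."""

    def clip(x, lo, hi):
        return max(lo, min(hi, x))

    # Trick three: perturb the target action with clipped Gaussian noise,
    # then clip the result back into the valid action range.
    noise = clip(rng.gauss(0.0, noise_std), -noise_clip, noise_clip)
    next_action = clip(target_actor(next_state) + noise, -act_limit, act_limit)

    # Trick one: use the smaller of the two target critics' estimates.
    q_next = min(target_q1(next_state, next_action),
                 target_q2(next_state, next_action))

    # Do not bootstrap past a terminal state.
    return reward + gamma * (0.0 if done else 1.0) * q_next


def should_update_policy(step, policy_delay=2):
    """Trick two: update the actor only every `policy_delay` critic updates."""
    return step % policy_delay == 0


# Toy stand-ins (hypothetical) for the target actor and the two target critics.
toy_actor = lambda state: 0.9
toy_q1 = lambda state, action: 2.0 * action
toy_q2 = lambda state, action: 3.0 * action

rng = random.Random(0)
y = td3_target(reward=1.0, done=False, next_state=None,
               target_actor=toy_actor, target_q1=toy_q1, target_q2=toy_q2,
               rng=rng)
print("target:", y)

# The critics are updated on every step, the actor only on delayed steps.
for step in range(1, 5):
    suffix = " + actor" if should_update_policy(step) else ""
    print(step, "update critics" + suffix)
```

In a real agent both critics would be regressed toward the same target `y`, and the target networks would typically be updated on the same delayed schedule as the actor.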