Instead of fitting a value function or optimizing a policy, Decision Transformers reframe RL as a sequence modelling problem using a Transformer architecture.

The key shift: rather than maximizing return, the model generates actions conditioned on a desired return. You tell it what return you want, and it produces the actions to get there.

This is a complete paradigm shift: no Bellman equation, no reward maximization, just trajectory generation.

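Input sequence, with one (return-to-go, state, action) triple per timestep:

$\tau = (\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \dots, \hat{R}_T, s_T, a_T)$

where $\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$ is the return-to-go (desired future return from timestep $t$), $s_t$ is the state, and $a_t$ is the action.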
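To make the return-to-go concrete: it is the sum of rewards from a given timestep to the end of the trajectory, so it can be computed with a single backward pass over the logged rewards. The sketch below assumes undiscounted returns and uses hypothetical helper names (`returns_to_go`, `interleave`) that are illustrative only and not taken from any library.

```python
def returns_to_go(rewards):
    """Undiscounted suffix sums: rtg[t] = rewards[t] + rewards[t+1] + ... + rewards[-1]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the sum already accumulated after it.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(rtg, states, actions):
    """Flatten one trajectory into (return-to-go, state, action) order, timestep by timestep."""
    sequence = []
    for r, s, a in zip(rtg, states, actions):
        sequence.extend([("rtg", r), ("state", s), ("action", a)])
    return sequence


# Toy trajectory (made-up values, for illustration only).
rewards = [1.0, 0.0, 2.0]
states = ["s1", "s2", "s3"]
actions = ["a1", "a2", "a3"]

rtg = returns_to_go(rewards)
tokens = interleave(rtg, states, actions)
```

In a real model each of the three element types would typically be embedded separately before being fed to the transformer; the tagged tuples here only illustrate the ordering.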

How it works:

  1. Represent each trajectory as a sequence of (return-to-go, state, action) tuples
  2. Condition the transformer on a target return
  3. Autoregressively predict the next action given history + return condition
  4. At inference, choose a target return (for example, the highest return seen during training), generate actions step-by-step, and subtract each observed reward from the remaining return-to-go
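The inference procedure can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not a definitive implementation: `rollout`, `toy_policy`, and `toy_env_step` are hypothetical names, the policy is a hand-written stand-in for a trained transformer, and the environment is a made-up five-step task. The part that carries over to a real setup is the bookkeeping: the return-to-go starts at the target return and is reduced by each observed reward.

```python
def rollout(policy, env_step, initial_state, target_return, max_steps):
    """Return-conditioned rollout: act, observe a reward, shrink the return-to-go."""
    rtgs, states, actions = [target_return], [initial_state], []
    total_reward = 0.0
    for _ in range(max_steps):
        # The policy sees the full history plus the current return-to-go.
        action = policy(rtgs, states, actions)
        next_state, reward, done = env_step(states[-1], action)
        actions.append(action)
        total_reward += reward
        if done:
            break
        # The remaining desired return drops by what was just earned.
        rtgs.append(rtgs[-1] - reward)
        states.append(next_state)
    return total_reward, rtgs, actions


def toy_policy(rtgs, states, actions):
    """Stand-in for a trained model: collect reward only while some return is still wanted."""
    return 1 if rtgs[-1] > 0 else 0


def toy_env_step(state, action):
    """Made-up environment: action 1 yields reward 1.0; the episode ends after 5 steps."""
    next_state = state + 1
    return next_state, float(action), next_state >= 5


total, rtgs, actions = rollout(toy_policy, toy_env_step, 0, 3.0, 10)
```

With a target return of 3.0, the stand-in policy collects three units of reward and then stops, which mirrors the intended behaviour of a return-conditioned model: it aims for the requested return rather than the maximum possible one.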

Advantages:

  • Leverages transformer strengths: handles long-horizon and multi-task problems well
  • No reward engineering or value function needed
  • Trajectories are interpretable as sequences

Disadvantages:

  • Sensitive to data quality and trajectory length
  • Computationally expensive
  • Can't discover strategies beyond what's in the training data: unlike online RL, it can't explore