Instead of fitting a value function or optimizing a policy, Decision Transformers reframe RL as a sequence modelling problem using a Transformer architecture.
The key shift: rather than maximizing return, the model generates actions conditioned on a desired return. You tell it what return you want, and it produces the actions to get there.
This is a complete paradigm shift: no Bellman equation, no reward maximization, just trajectory generation.
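Input sequence at each step:
$$\tau = (\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \dots, \hat{R}_T, s_T, a_T)$$
where $\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$ is the return-to-go (desired future return from timestep $t$), $s_t$ is the state, and $a_t$ is the action.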
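To make the return-to-go concrete, here is a minimal sketch that computes it as an undiscounted suffix sum of rewards and flattens a trajectory into return-to-go, state, action order. The helper names `returns_to_go` and `interleave` and the string tags are illustrative assumptions, not part of any library.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Suffix sums: rtg[t] = rewards[t] + rewards[t+1] + ... + rewards[-1]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step adds its own reward to the future total.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(
    rtg: Sequence[float], states: Sequence[Any], actions: Sequence[Any]
) -> List[Tuple[str, Any]]:
    """Flatten a trajectory into (return-to-go, state, action) order.

    The string tags are hypothetical labels for readability; a real model
    would embed each modality with its own learned projection instead.
    """
    assert len(rtg) == len(states) == len(actions)
    tokens: List[Tuple[str, Any]] = []
    for r, s, a in zip(rtg, states, actions):
        tokens.extend([("rtg", r), ("state", s), ("action", a)])
    return tokens


rewards = [1.0, 0.0, 2.0]
rtg = returns_to_go(rewards)
print(rtg)  # [3.0, 2.0, 2.0]

tokens = interleave(rtg, states=["s1", "s2", "s3"], actions=["a1", "a2", "a3"])
print(tokens[:3])  # [('rtg', 3.0), ('state', 's1'), ('action', 'a1')]
```

Note that the return-to-go at the first step equals the total return of the trajectory, which is why it can double as the "desired return" prompt at inference time.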
How it works:
- Represent the RL problem as a sequence of (return-to-go, state, action) tuples
- Condition the transformer on a target return
- Autoregressively predict the next action given history + return condition
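- At inference, set an initial target return (for example, the highest return seen during training), generate actions step by step, and subtract each observed reward from the target to get the next return-to-go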
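The inference procedure can be sketched as a loop that conditions on a target return and lowers it by each reward received. Everything here is a stand-in under stated assumptions: `toy_policy` is a hand-written rule in place of a trained transformer, and `toy_env_step` is a hypothetical one-dimensional environment, so only the control flow reflects the method.

```python
from typing import Callable, List, Tuple

Step = Tuple[float, int, int]  # (return-to-go, state, action)


def rollout(
    policy: Callable[[List[Step], float, int], int],
    env_step: Callable[[int, int], Tuple[int, float, bool]],
    initial_state: int,
    target_return: float,
    max_steps: int,
) -> Tuple[List[Step], float]:
    """Generate actions one step at a time, conditioned on the return-to-go."""
    rtg = target_return
    state = initial_state
    history: List[Step] = []
    total = 0.0
    for _ in range(max_steps):
        # The policy sees the past (rtg, state, action) triples plus the
        # current return-to-go and state, and predicts the next action.
        action = policy(history, rtg, state)
        history.append((rtg, state, action))
        state, reward, done = env_step(state, action)
        total += reward
        rtg -= reward  # the remaining desired return shrinks as reward arrives
        if done:
            break
    return history, total


def toy_policy(history: List[Step], rtg: float, state: int) -> int:
    """Hypothetical stand-in for a trained model: act only while return is owed."""
    return 1 if rtg > 0 else 0


def toy_env_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Hypothetical environment: action 1 moves right and pays reward 1."""
    next_state = state + action
    return next_state, float(action), next_state >= 5


history, total = rollout(toy_policy, toy_env_step, 0, target_return=2.0, max_steps=4)
print(total)                      # 2.0
print([h[0] for h in history])    # [2.0, 1.0, 0.0, 0.0]
```

In this toy setting the achieved return matches the requested one exactly; a trained model gives no such guarantee, and a target far above anything in the training data is an out-of-distribution prompt.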
Advantages:
- Leverages transformer strengths: handles long-horizon and multi-task problems well
- No value function or bootstrapping needed; logged rewards are only used to compute returns-to-go
- Trajectories are interpretable as sequences
Disadvantages:
- Sensitive to data quality and trajectory length
- Computationally expensive: self-attention cost grows quadratically with context length
- Can't discover strategies beyond what's in the training data: unlike online RL, it can't explore