Scalable Offline Reinforcement Learning via Autoregressive Q-Functions

  • Yevgen Chebotar*
  • Quan Vuong*
  • Alex Irpan
  • Karol Hausman
  • Fei Xia
  • Yao Lu
  • Aviral Kumar
  • Tianhe Yu
  • Alexander Herzog
  • Karl Pertsch
  • Keerthana Gopalakrishnan
  • Julian Ibarz
  • Ofir Nachum
  • Sumedh Sontakke
  • Grecia Salazar
  • Huong T Tran
  • Jodilyn Peralta
  • Clayton Tan
  • Deeksha Manjunath
  • Jaspiar Singh
  • Brianna Zitkovich
  • Tomas Jackson
  • Kanishka Rao
  • Chelsea Finn
  • Sergey Levine

  • *equal contribution


In this work, we present a scalable reinforcement learning method for training multi-task policies from large offline datasets that can leverage both human demonstrations and autonomously collected data. Our method uses a Transformer to provide a scalable representation for Q-functions trained via offline temporal difference backups. We therefore refer to the method as Q-Transformer. By discretizing each action dimension and representing the Q-value of each action dimension as separate tokens, we can apply effective high-capacity sequence modeling techniques for Q-learning. We present several design decisions that enable good performance with offline RL training, and show that Q-Transformer outperforms prior offline RL algorithms and imitation learning techniques on a large diverse real-world robotic manipulation task suite.


We first describe how to enable using Transformers for Q-learning by applying discretization and autoregression of the action space. The classical way for learning a Q-function using TD-learning is based on the Bellman update rule:
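The equation itself is not reproduced in this text. A common way to write the update, taking state s_t, action a_t, reward R and discount factor γ as assumed notation, is:

```latex
Q(s_t, a_t) \leftarrow R(s_t, a_t) + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})
```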

We change the Bellman update to be performed for each action dimension by transforming the original MDP of the problem into an MDP where each action dimension is treated as a separate step for Q-learning. In particular, given the action dimensionality dA, the new Bellman update rule is:
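The per-dimension rule is also not reproduced in this text. The following is a reconstruction consistent with the description in the next paragraph, where a_t^i denotes the i-th action dimension at time t and a_t^{1:i} the first i dimensions (assumed notation):

```latex
Q(s_t, a_t^{1:i-1}, a_t^{i}) \leftarrow
\begin{cases}
\max_{a_t^{i+1}} Q(s_t, a_t^{1:i}, a_t^{i+1}) & \text{if } i \in \{1, \dots, d_A - 1\} \\
R(s_t, a_t) + \gamma \max_{a_{t+1}^{1}} Q(s_{t+1}, a_{t+1}^{1}) & \text{if } i = d_A
\end{cases}
```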

This means that for each intermediate action dimension we maximize over the next action dimension given the same state, and for the final action dimension we use the first action dimension from the next state. This decomposition makes sure that the maximization within the Bellman update remains tractable while ensuring that we still solve the original MDP problem.
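As a rough illustration of this decomposition, the sketch below computes one target per action dimension for a single transition. The function name, argument layout, and the choice to apply the reward and discount only at the final dimension are assumptions made for illustration, not the released implementation.

```python
import numpy as np

def per_dimension_td_targets(q_dims, q_next_first_dim, reward, gamma, terminal=False):
    """Hypothetical helper: TD targets for each action dimension of one transition.

    q_dims: array of shape (d_A, num_bins); row i holds the Q-values over the
        discretized bins of dimension i, conditioned on the state and on the
        dataset action for dimensions before i.
    q_next_first_dim: array of shape (num_bins,); Q-values of the first action
        dimension at the next state.
    """
    q_dims = np.asarray(q_dims, dtype=float)
    d_a = q_dims.shape[0]
    targets = np.empty(d_a)
    # Intermediate dimensions: maximize over the bins of the next dimension
    # in the same state (no reward and no discount in this sketch).
    targets[:-1] = q_dims[1:].max(axis=1)
    # Final dimension: reward plus discounted max over the first dimension
    # of the next state (no bootstrap on terminal transitions).
    bootstrap = 0.0 if terminal else gamma * float(np.max(q_next_first_dim))
    targets[-1] = reward + bootstrap
    return targets
```

Each maximization runs over a single dimension's bins, so the cost grows linearly with the number of action dimensions instead of exponentially with the size of the joint discretized action space.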

To account for the distribution shift during offline learning, we introduce a simple regularization technique that pushes the Q-values of unseen actions (in the discretized case, unseen action bins) toward the lowest possible value. To accelerate learning, we also employ Monte-Carlo (MC) returns, which use the original return-to-go from a given episode, and n-step returns, which can skip the per-dimension maximization.
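A minimal sketch of how these two pieces might fit together for one action dimension is given below. The helper names, the squared-error form of both terms, the lowest value of 0, the weight `alpha`, and the use of the larger of the TD target and the MC return as the regression target are all assumptions made for illustration.

```python
import numpy as np

def return_to_go(rewards, gamma):
    """Discounted return-to-go for every step of one episode."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def conservative_q_loss(q_values, action_bin, td_target, mc_return,
                        alpha=1.0, min_q=0.0):
    """Hypothetical per-dimension loss: TD error on the dataset bin plus a
    penalty pulling every other (unseen) bin toward the lowest value."""
    q_values = np.asarray(q_values, dtype=float)
    # Assumption: use the MC return as a lower bound on the TD target.
    target = max(td_target, mc_return)
    td_loss = 0.5 * (q_values[action_bin] - target) ** 2
    # Bins not taken in the dataset are regressed toward min_q.
    unseen = np.delete(q_values, action_bin)
    reg_loss = 0.5 * np.mean((unseen - min_q) ** 2)
    return td_loss + alpha * reg_loss
```

For example, `return_to_go([0.0, 0.0, 1.0], gamma=0.5)` gives the per-step MC returns for a three-step episode with a sparse final reward, which can then be passed as `mc_return` for the corresponding step.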

Results and Videos

In our experiments, we start by evaluating Q-Transformer on a suite of real-world tasks introduced in the RT-1 paper, while limiting the data per task to only 100 human demonstrations. In addition to the demonstrations, we add autonomously collected failed episodes, resulting in a dataset of 38,000 positive examples from demos and 20,000 negative autonomously collected examples.

Compared to baselines such as RT-1, IQL and Decision Transformer (DT), Q-Transformer can effectively utilize autonomous episodes to significantly improve on skills such as picking objects from and placing them into drawers, moving objects near targets, and closing and opening drawers.

We also benchmark our method on a challenging simulated picking task, where only ~8% of the data are positive examples and the rest are noisy negative examples. Q-learning methods, such as QT-Opt, IQL, AW-Opt and our Q-Transformer, generally perform better on this task because they can use negative examples to learn policies through dynamic programming.

Ablating our design choices on this picking task, we find that both the conservative regularizer and MC returns are important for retaining performance. Switching to a Softmax regularizer, which is similar to a CQL regularizer for discrete actions, performs significantly worse because it binds the policy too tightly to the data distribution, showing that our choice of regularizer works better for such tasks.

We also ablate n-step returns and find that, although they introduce bias, they help us achieve the same high performance in far fewer gradient steps, making them an efficient choice in many problems.
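For reference, a generic n-step target can be sketched as follows. This is the textbook form, with a hypothetical function name and argument layout, and it is not necessarily the exact variant used in the experiments. Here n is the number of rewards passed in, and the bootstrap Q-values are assumed to belong to the first action dimension of the state reached after those n steps.

```python
import numpy as np

def n_step_target(rewards, q_bootstrap_first_dim, gamma):
    """Hypothetical helper: n-step TD target with n = len(rewards).

    rewards: the n rewards observed after the current step.
    q_bootstrap_first_dim: Q-values over the bins of the first action
        dimension at the state reached after those n steps.
    """
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    # Discounted sum of the n observed rewards.
    discounts = gamma ** np.arange(n)
    partial_return = float(np.dot(discounts, rewards))
    # Bootstrap from the best first-dimension bin n steps ahead.
    return partial_return + gamma ** n * float(np.max(q_bootstrap_first_dim))
```

Because the target relies on the observed rewards for n steps before bootstrapping, value information propagates faster per gradient step, at the cost of bias when the data was collected by a different policy.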

We also run Q-Transformer on a much larger dataset, scaling up the number of positive examples to 115,000 and the number of negative examples to 185,000, for a total of 300,000 episodes. Q-Transformer is still able to learn from this large dataset and even provides some improvement over the RT-1 BC baseline.

Finally, we use the Q-function trained by Q-Transformer as an affordance model in combination with a language planner, similar to the SayCan work.
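A SayCan-style combination can be sketched as below: each candidate skill gets a planner score and an affordance value, and the skill with the highest product is selected. The function name, the dictionary inputs, and the example numbers are hypothetical. In practice the affordance would come from the learned Q-function evaluated on the current image and the skill's instruction.

```python
def select_skill(llm_scores, affordances):
    """Hypothetical helper: pick the skill maximizing planner score x affordance.

    llm_scores: dict mapping skill description -> language planner score.
    affordances: dict mapping skill description -> affordance value
        (e.g. a Q-value in [0, 1]) for the current observation.
    """
    combined = {
        skill: llm_scores[skill] * affordances[skill] for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Illustrative numbers only: the planner prefers picking up the apple, but
# the affordance says the drawer is the feasible next step.
llm_scores = {"pick up the apple": 0.6, "open the drawer": 0.3, "go to the table": 0.1}
affordances = {"pick up the apple": 0.2, "open the drawer": 0.9, "go to the table": 0.5}
best_skill, scores = select_skill(llm_scores, affordances)
```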

Q-Transformer affordance estimation works better than the previously used Q-functions trained with QT-Opt, especially when combined with relabeling non-sampled tasks as negatives for the current task during training. Because Q-Transformer does not require the sim-to-real training that was used for QT-Opt, it is easier to use when suitable simulations are not available.

To test the full planning + execution system, we use Q-Transformer for both affordance estimation and the actual policy execution, where it outperforms the previous combination of QT-Opt and RT-1.

As can be seen in the examples of task affordance values for a given image, Q-Transformer can provide high-quality affordance values that can be used in downstream plan-and-execute frameworks.

