Off-Belief Learning

International Conference on Machine Learning (ICML)


The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and rely on multi-step counterfactual reasoning based on assumptions about other agents’ actions, and thus fail when paired with humans or independently trained agents. Yet no current method can learn optimal grounded policies, that is, policies that do not rely on counterfactual information from observing other agents’ actions. To address this, we present off-belief learning (OBL): at each time step, OBL agents interpret all past actions as if they were taken by a given, fixed policy π0, but assume that future actions will be taken by an optimal policy. When π0 is uniform random, OBL learns the optimal grounded policy. OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next. This introduces counterfactual reasoning in a controlled manner. Unlike traditional RL, which may converge to any equilibrium policy, OBL converges to a unique policy, making it more suitable for zero-shot coordination. OBL can be scaled to high-dimensional settings with a fictitious transition mechanism, and it shows strong performance in both a toy setting and Hanabi, the benchmark human-AI/zero-shot coordination problem.
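
The idea of interpreting past actions as if they came from a fixed policy π0 can be illustrated with a small Bayesian belief update. The sketch below is not the paper's implementation. It covers only the belief step, not the learning of the optimal future policy. The two-state game, the action names, and the function `belief_given_action` are all hypothetical and chosen purely for illustration. It shows that when π0 is uniform random, an observed action carries no information about the hidden state, so the belief stays at the prior (a grounded belief), whereas assuming a signalling convention makes the same action fully revealing.

```python
def belief_given_action(prior, policy, action):
    """Posterior over hidden states after observing `action`,
    assuming the acting agent followed `policy` (Bayes' rule)."""
    unnormalized = {s: prior[s] * policy[s][action] for s in prior}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}


# Hypothetical toy setup: one hidden state, two possible actions.
prior = {"cat": 0.5, "dog": 0.5}

# pi0: uniform random, so the action is independent of the hidden state.
pi0 = {
    "cat": {"A": 0.5, "B": 0.5},
    "dog": {"A": 0.5, "B": 0.5},
}

# An arbitrary convention of the kind self-play might learn:
# action A signals "cat", action B signals "dog".
signalling = {
    "cat": {"A": 1.0, "B": 0.0},
    "dog": {"A": 0.0, "B": 1.0},
}

# Interpreting the past action under pi0 leaves the belief at the prior.
grounded = belief_given_action(prior, pi0, "A")

# Interpreting the same action under the convention reveals the state.
conventional = belief_given_action(prior, signalling, "A")

print("belief under pi0:       ", grounded)
print("belief under convention:", conventional)
```

In this toy case, an agent acting on the `grounded` belief cannot depend on the partner having adopted the signalling convention, which matches the motivation given above for pairing with humans or independently trained agents.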
