Abstract
Reinforcement learning (RL) algorithms have achieved superhuman performance
on many sequential decision-making tasks, but often struggle in domains with
large, combinatorial action spaces. To address this, we introduce a practical and
stable algorithm for training discrete diffusion models to represent policies in
such environments. We formulate a policy mirror descent algorithm that enhances
training stability by reframing policy optimization as an inference problem, which
naturally aligns with the learning objective of discrete diffusion models. Through
extensive experiments on a suite of challenging benchmark tasks, we demonstrate
that our approach achieves significant improvements over existing methods in both
performance and sample efficiency. This work opens a promising new direction
for applying discrete diffusion models in RL to tackle long-standing challenges in
large-scale combinatorial action spaces.