Diffusion (large) language models (dLLMs) now rival autoregressive language models in downstream performance on many tasks, while promising greater efficiency at inference time. A key design aspect of a dLLM is the sampling procedure that selects which tokens to unmask at each diffusion step. Indeed, recent work has found that heuristic strategies such as confidence thresholding improve both sample quality and token throughput compared to random unmasking. However, such heuristics have drawbacks: they require manual tuning, and their performance has been observed to degrade as block size increases. In this study, we instead propose to train the sampling procedure with reinforcement learning. Specifically, we formalize masked diffusion sampling as a Markov decision process in which the dLLM acts as the environment, and propose a lightweight policy, based on a single-layer transformer, that maps dLLM token confidences to unmasking decisions. Our experiments show that these trained policies match the performance of state-of-the-art heuristics when combined with semi-autoregressive (block) generation and outperform them in the full-diffusion setting.
- * Equal contributor
- † University of Amsterdam
- ‡ Massachusetts Institute of Technology
- ** Work done while at Apple
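
For readers unfamiliar with the confidence-thresholding heuristic mentioned in the abstract, the following is a minimal illustrative sketch, not the authors' implementation. The function name `threshold_unmask`, the default threshold of 0.9, and the fallback rule (unmask the single most confident masked token when none clears the threshold) are assumptions made for illustration only.

```python
from typing import List


def threshold_unmask(
    confidences: List[float],
    masked: List[bool],
    threshold: float = 0.9,  # hypothetical default, normally tuned by hand
) -> List[int]:
    """Return the positions to unmask at one diffusion step.

    confidences[i] is the model's confidence in its top prediction at
    position i; masked[i] is True if position i is still masked.
    """
    # Only positions that are still masked are candidates.
    candidates = [i for i, is_masked in enumerate(masked) if is_masked]
    if not candidates:
        return []

    # Unmask every masked position whose confidence clears the threshold.
    chosen = [i for i in candidates if confidences[i] >= threshold]

    # Assumed fallback: if nothing clears the threshold, unmask the single
    # most confident masked position so that sampling always makes progress.
    if not chosen:
        chosen = [max(candidates, key=lambda i: confidences[i])]
    return chosen
```

The hand-set `threshold` is the kind of manually tuned parameter that a learned policy, as proposed in the abstract, would replace with a mapping from token confidences to unmasking decisions.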
