Policy gradient algorithms have driven many recent advances in language model reasoning. An attractive property is their ability to learn by exploring their own trajectories, a process that is important for fostering diverse and creative solutions. As we show in this paper, many policy gradient algorithms naturally reduce entropy during training, which narrows the diversity of explored trajectories and yields policies with increasingly limited exploration capabilities. We argue that entropy should be actively monitored and controlled throughout training. We formally analyze how key policy gradient objectives contribute to entropy dynamics, identify empirical factors (such as numerical precision) that significantly influence entropy behavior, and propose explicit mechanisms for entropy control. These include REPO, a family of algorithms that adjusts entropy by modifying the advantage function, and ADAPO, an adaptive asymmetric clipping approach. Models trained with our entropy-preserving methods maintain diversity throughout training, yielding final policies that perform better and remain trainable for sequential learning in new environments.
- † Massachusetts Institute of Technology
- ‡ Equal contribution
- ** Work done while at Apple
