Reinforcement Learning with NVIDIA NeMo-RL: Reproducing a DeepScaleR Recipe Using GRPO

Reinforcement learning (RL) is the backbone of interactive AI. It is fundamental to teaching agents to reason and learn from human preferences, enabling multi-turn tool use, and more. This post introduces NVIDIA NeMo-RL, a new open-source post-training library built to support everything from single-GPU prototypes to thousand-GPU training of large models, and to orchestrate multicomponent RL pipelines with ease.

Part of the NVIDIA NeMo framework, NeMo-RL includes native integration with Hugging Face models, optimized training and inference, popular algorithms such as DPO and GRPO, and Ray-based orchestration. The current v0.2.1 release supports models of up to 32 billion parameters, and ongoing development aims to extend support to even larger models.

A key design principle of NeMo-RL is a flexible backend architecture that supports multiple training and rollout backends. For training, the library currently supports Hugging Face models with PyTorch-native parallelism, and a Megatron-Core backend is coming soon to enable larger models with advanced parallelism strategies.

NeMo-RL uses a vLLM backend for generation and is designed to be easily extended with additional generation backends such as NVIDIA TensorRT-LLM and SGLang. The overall design keeps high-level algorithm implementations agnostic to backend details: each backend operates in its own isolated environment and adheres to a standardized training or generation interface. This architecture allows seamless scaling from single-GPU prototypes to thousand-GPU deployments without changing the algorithm code.
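
To make the backend-agnostic idea concrete, here is a minimal Python sketch of algorithm code written against two standardized interfaces. Every name in it (GenerationBackend, TrainingBackend, run_rl_steps, and the toy backends) is hypothetical and chosen for illustration; none of it is the NeMo-RL API, and real backends would run in isolated worker processes rather than in-process.

```python
from typing import Protocol


class GenerationBackend(Protocol):
    """Hypothetical rollout interface (illustrative, not the NeMo-RL API)."""

    def generate(self, prompts: list[str]) -> list[str]: ...


class TrainingBackend(Protocol):
    """Hypothetical training interface (illustrative, not the NeMo-RL API)."""

    def train_step(self, prompts: list[str], responses: list[str]) -> float: ...


def run_rl_steps(
    policy: TrainingBackend,
    rollout: GenerationBackend,
    prompts: list[str],
    num_steps: int,
) -> list[float]:
    """Algorithm code: it sees only the two interfaces, never a concrete backend."""
    losses = []
    for _ in range(num_steps):
        responses = rollout.generate(prompts)
        losses.append(policy.train_step(prompts, responses))
    return losses


class EchoGeneration:
    """Toy stand-in for a generation backend: upper-cases each prompt."""

    def generate(self, prompts: list[str]) -> list[str]:
        return [p.upper() for p in prompts]


class ToyTrainer:
    """Toy stand-in for a training backend: the fake loss halves each step."""

    def __init__(self) -> None:
        self.loss = 1.0

    def train_step(self, prompts: list[str], responses: list[str]) -> float:
        self.loss *= 0.5
        return self.loss


losses = run_rl_steps(ToyTrainer(), EchoGeneration(), ["2+2=?"], num_steps=3)
print(losses)
```

Swapping EchoGeneration for another class with the same generate method would leave run_rl_steps untouched, which is the property the design above is aiming for.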

In this post, we explore how to use NeMo-RL to reproduce a DeepScaleR-1.5B recipe using the Group Relative Policy Optimization (GRPO) reinforcement learning algorithm.
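
For readers new to GRPO, the core idea is that several responses are sampled for the same prompt and each response's reward is normalized against its own group, so no separate value model is needed. The sketch below illustrates only that normalization step. The function name, the epsilon value, and the use of the population standard deviation are assumptions made for illustration, and this is not NeMo-RL's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the rewards of one prompt's sampled responses against the group.

    Assumptions: population standard deviation, and a small eps so that a
    group with identical rewards yields zero advantages instead of dividing
    by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math prompt: 1.0 if correct, 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct answers receive positive advantages and incorrect ones negative advantages, and a group in which every answer earns the same reward contributes no learning signal.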