Reinforcement learning (RL) is the backbone of interactive AI. It is fundamental to teaching agents to reason and learn from human preferences, to enabling multi-turn tool use, and more. This post introduces NVIDIA NeMo-RL, a new open source post-training library built to support everything from single-GPU prototypes to large models on thousands of GPUs, and to orchestrate multicomponent RL pipelines with ease.
Part of the NVIDIA NeMo framework, NeMo-RL includes native integration with Hugging Face models, optimized training and inference, popular algorithms such as DPO and GRPO, and Ray-based orchestration. The current v0.2.1 release supports models of up to 32 billion parameters, and ongoing development aims to extend support to even larger models.
A key design principle of NeMo-RL is a flexible backend architecture that supports multiple training and rollout backends. For training, the library currently supports Hugging Face models with PyTorch native parallelism, and a Megatron-Core backend is coming soon to enable larger models with advanced parallelism strategies.
NeMo-RL uses a vLLM backend for generation and can be easily extended to additional generation backends such as NVIDIA TensorRT-LLM and SGLang. The overall design keeps high-level algorithm implementations agnostic to backend implementation details: each backend operates in its own isolated environment and adheres to standardized training or generation interfaces. This architecture allows seamless scaling from single-GPU prototypes to thousand-GPU deployments without changing the algorithm code.
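The idea of algorithm code that only sees standardized interfaces can be illustrated with a minimal sketch. Everything named here (`GenerationBackend`, `TrainingBackend`, `rl_step`, and the toy backends) is hypothetical and chosen for illustration; it is not NeMo-RL's actual API.

```python
from typing import Protocol


class GenerationBackend(Protocol):
    """Hypothetical rollout interface; not NeMo-RL's actual API."""

    def generate(self, prompts: list[str]) -> list[str]: ...


class TrainingBackend(Protocol):
    """Hypothetical training interface; not NeMo-RL's actual API."""

    def train_step(
        self, prompts: list[str], responses: list[str], rewards: list[float]
    ) -> float: ...


class EchoGenerator:
    """Toy stand-in for a real generation backend."""

    def generate(self, prompts: list[str]) -> list[str]:
        return [p.upper() for p in prompts]


class MeanRewardTrainer:
    """Toy stand-in for a real training backend; reports the mean reward."""

    def train_step(
        self, prompts: list[str], responses: list[str], rewards: list[float]
    ) -> float:
        return sum(rewards) / len(rewards)


def rl_step(
    gen: GenerationBackend, trainer: TrainingBackend, prompts: list[str]
) -> float:
    """Algorithm code: depends only on the two interfaces, never on a concrete backend."""
    responses = gen.generate(prompts)
    # Toy reward: 1.0 for a non-empty response, 0.0 otherwise.
    rewards = [1.0 if r else 0.0 for r in responses]
    return trainer.train_step(prompts, responses, rewards)
```

Swapping `EchoGenerator` for another class with the same `generate` method would leave `rl_step` untouched, which is the property the backend design aims for.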
In this post, we explore how to use NeMo-RL to reproduce the DeepScaleR-1.5B recipe using the Group Relative Policy Optimization (GRPO) reinforcement learning algorithm.
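For readers new to GRPO, its core idea is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization in its commonly described form. The function name, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration; this is not taken from NeMo-RL's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the group sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one math prompt; reward 1.0 means a correct final answer.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Correct responses receive positive advantages and incorrect ones negative advantages. When every response in a group gets the same reward, all advantages are zero and the group contributes no learning signal.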
Training high-performing reasoning models with NeMo-RL
Recently, long chain-of-thought (CoT) reasoning models such as DeepSeek-R1 and OpenAI O1 have gained widespread attention. These models greatly advance the capabilities of language models across a variety of challenging domains. The following sections show how to use NeMo-RL to train such high-performing reasoning models.
We follow the DeepScaleR recipe, which provides a dataset and methodology for training reasoning models on difficult math problems. In particular, we use GRPO to train Qwen-1.5B to OpenAI O1 level on the competitive academic math benchmark AIME24.
Step-by-step training process
Because training long CoT reasoning models can be very slow due to long generation times, DeepScaleR first trains with a shorter maximum sequence length and then gradually increases it. Specifically, DeepScaleR has three training stages: 8K context length, 16K context length, and 24K context length. This approach also helps control the long tail of the rollout sequence length distribution.
Running this training with NeMo-RL is straightforward and takes only three steps.
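To make the long-tail point concrete, the sketch below counts how many rollouts would hit the length cap at each stage. The rollout lengths are invented for illustration, and treating 8K, 16K, and 24K as multiples of 1,024 tokens is an assumption.

```python
# Assumption: "8K" means 8 * 1024 tokens, and so on.
STAGE_MAX_TOKENS = {"8K": 8 * 1024, "16K": 16 * 1024, "24K": 24 * 1024}


def truncated_fraction(rollout_lengths: list[int], max_tokens: int) -> float:
    """Fraction of rollouts that reach the stage's length cap."""
    clipped = sum(1 for n in rollout_lengths if n >= max_tokens)
    return clipped / len(rollout_lengths)


# Hypothetical rollout lengths in tokens, with a long tail.
lengths = [1200, 3000, 5000, 7000, 9000, 15000, 20000, 30000]

for stage, cap in STAGE_MAX_TOKENS.items():
    print(f"{stage}: {truncated_fraction(lengths, cap):.3f} of rollouts reach the cap")
```

In practice the length distribution shifts as the policy trains, so fractions like these would be measured per stage rather than fixed in advance.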
Step 1: Setup
Clone the repo and install the uv Python package. uv lets you quickly create isolated virtual environments despite potentially conflicting dependencies, while also integrating natively with Ray.
git clone git@github.com:NVIDIA-NeMo/RL.git nemo-rl
cd nemo-rl
pip install uv
Step 2: Training
Run the training using DeepSeek-R1-Distill-Qwen-1.5B. Train first with a maximum context length of 8K, then 16K, then 24K. NeMo-RL integrates natively with Hugging Face models, so you can specify your model of choice directly. The configuration file specifies the DeepScaleR dataset and the correct GRPO hyperparameters.
# Stage 1: 8K context length
uv run examples/run_grpo_math.py \
    --config=examples/configs/grpo-deepscaler-1.5b-8K.yaml
# Stage 2: 16K context length, starting from the 8K checkpoint
uv run examples/run_grpo_math.py \
    --config=examples/configs/grpo-deepscaler-1.5b-16K.yaml \
    policy.model_name=/path/to/8K/checkpoint/hf
# Stage 3: 24K context length, starting from the 16K checkpoint
uv run examples/run_grpo_math.py \
    --config=examples/configs/grpo-deepscaler-1.5b-24K.yaml \
    policy.model_name=/path/to/16K/checkpoint/hf
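The three runs form a chain in which each later stage starts from the previous stage's converted checkpoint. The sketch below builds the same command lines programmatically, using only the script path, config names, and override shown in this post. The helper names `stage_command` and `staged_commands` are hypothetical, and the checkpoint paths are the placeholders from the commands, not real locations.

```python
import shlex
from typing import Optional

STAGES = ["8K", "16K", "24K"]


def stage_command(stage: str, init_checkpoint: Optional[str] = None) -> list[str]:
    """Build the training command for one stage (hypothetical helper)."""
    cmd = [
        "uv", "run", "examples/run_grpo_math.py",
        f"--config=examples/configs/grpo-deepscaler-1.5b-{stage}.yaml",
    ]
    if init_checkpoint is not None:
        # Later stages start from the previous stage's Hugging Face checkpoint.
        cmd.append(f"policy.model_name={init_checkpoint}")
    return cmd


def staged_commands(checkpoints: dict[str, str]) -> list[list[str]]:
    """Chain the stages: each stage after the first loads the prior stage's checkpoint."""
    commands = []
    previous = None
    for stage in STAGES:
        init = checkpoints[previous] if previous is not None else None
        commands.append(stage_command(stage, init))
        previous = stage
    return commands


# Placeholder paths, as in the commands above.
placeholders = {"8K": "/path/to/8K/checkpoint/hf", "16K": "/path/to/16K/checkpoint/hf"}
for cmd in staged_commands(placeholders):
    print(shlex.join(cmd))
```

The sketch only prints the commands. Each stage's checkpoint must exist in Hugging Face format before the next stage can start, so the stages cannot simply be launched back to back without the conversion described in Step 3.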
An excerpt of the policy section of a configuration file:
policy:
  # Qwen/Qwen2.5-1.5B has tied weights which are only supported with dtensor policy with tp size 1 (https://github.com/NVIDIA-NeMo/RL/issues/227)
  model_name: "Qwen/Qwen2.5-1.5B"
  tokenizer:
    name: ${policy.model_name} ## specify if you'd like to use a tokenizer different from the model's default
  train_global_batch_size: 512
  train_micro_batch_size: 4
  generation_batch_size: 32 # Only used when generating using HF backend
  logprob_batch_size: 4
  max_total_sequence_length: 512
  precision: "bfloat16"
  fsdp_offload_enabled: false
  activation_checkpointing_enabled: false
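The two training batch sizes in the excerpt are related by simple arithmetic in data-parallel training: the global batch is split into micro-batches that are processed across replicas and accumulated. The sketch below works through that relation. The data-parallel size of 8 is an assumption chosen for illustration, and whether NeMo-RL schedules micro-batches in exactly this way is not confirmed by this post.

```python
def grad_accumulation_steps(
    global_batch: int, micro_batch: int, data_parallel_size: int
) -> int:
    """Micro-batches each replica processes per optimizer step (standard relation)."""
    samples_per_pass = micro_batch * data_parallel_size
    if global_batch % samples_per_pass != 0:
        raise ValueError("global batch must be divisible by micro_batch * data_parallel_size")
    return global_batch // samples_per_pass


# train_global_batch_size: 512 and train_micro_batch_size: 4 come from the excerpt;
# data_parallel_size=8 is a hypothetical cluster size.
steps = grad_accumulation_steps(global_batch=512, micro_batch=4, data_parallel_size=8)
print(steps)
```

Under these assumptions, raising the micro-batch size trades memory for fewer accumulation passes without changing the effective global batch.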
Step 3: Evaluation
Convert the checkpoints to Hugging Face format and evaluate the model. Note that the model is also evaluated continuously throughout training. Specify the model configuration, the model location, and the desired location for the Hugging Face checkpoint, as shown below.
uv run examples/convert_dcp_to_hf.py \
    --config=results/grpo-deepscaler-1.5b-8K/step_xx/config.yaml \
    --dcp-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_xx/policy/weights \
    --hf-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_xx/hf
uv run examples/run_eval.py \
    generation.model_name=results/grpo-deepscaler-1.5b-8K/step_xx/hf
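As a rough picture of what a math benchmark score means, the sketch below computes exact-match accuracy over final answers. This is a simplified illustration with invented answers; it is not the scoring logic of `run_eval.py`, and real evaluation typically needs more careful answer extraction and normalization.

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Share of problems whose predicted final answer equals the reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical final answers for three problems (not real benchmark data).
score = exact_match_accuracy(["204", " 25", "7"], ["204", "25", "8"])
print(f"accuracy: {score:.3f}")
```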
Results
Figure 2 shows the training curve with NeMo-RL. A training reward of 0.65 is reached in just 400 steps.


Figure 3 shows the AIME24 evaluation results over the course of training, which eventually surpass OpenAI O1.


To view the TensorBoard logs and get a sense of what to expect from the DeepScaleR recipe, check out the NeMo-RL DeepScaleR TensorBoard Viewer Google Colab.
Get started with NeMo-RL
NeMo-RL is a scalable post-training library designed to scale from a single GPU to thousands of GPUs. It offers seamless integration with Hugging Face, a modular design for flexibility, and efficient resource management with Ray.
To start your own reinforcement learning experiments with NeMo-RL, explore the open source NVIDIA NeMo-RL GitHub repo, where you can find detailed documentation, sample scripts, and configuration files. You can also try the DeepScaleR and OpenMathInstruct-2 examples to learn more.
