Reinforcement Learning with NVIDIA NeMo-RL: Reproducing a DeepScaleR Recipe Using GRPO



Reinforcement learning (RL) is the backbone of interactive AI. It is fundamental to teaching agents to reason and learn from human preferences, to enabling multi-turn tool use, and more. This post introduces NVIDIA NeMo-RL, a new open source post-training library built to support everything from single-GPU prototypes to large models trained on thousands of GPUs, and to orchestrate multicomponent RL pipelines with ease.

Part of the NVIDIA NeMo framework, NeMo-RL includes native integration with Hugging Face models, optimized training and inference, popular algorithms such as DPO and GRPO, and Ray-based orchestration. The current v0.2.1 release supports models of up to 32 billion parameters, and ongoing development aims to extend support to even larger models.

A key design principle of NeMo-RL is a flexible backend architecture that supports multiple training and rollout backends. For training, the library currently supports Hugging Face models with PyTorch native parallelism, and a Megatron-Core backend is coming soon to enable larger models with advanced parallelism strategies.

For generation, NeMo-RL uses a vLLM backend and can easily be extended to additional generation backends such as NVIDIA TensorRT-LLM and SGLang. The overall design keeps high-level algorithm implementations agnostic to backend details: each backend operates in its own isolated environment and adheres to standardized training or generation interfaces. This architecture allows seamless scaling from single-GPU prototypes to deployments on thousands of GPUs without changing the algorithm code.
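To make the idea of a backend-agnostic algorithm concrete, here is a minimal Python sketch. It is not NeMo-RL's actual API: the names GenerationBackend, EchoBackend, ReverseBackend, and collect_rollouts are hypothetical and exist only to illustrate how algorithm code can depend on a shared interface rather than on a specific engine.

```python
from typing import Protocol


class GenerationBackend(Protocol):
    """Hypothetical generation interface (an assumption, not NeMo-RL's API)."""

    def generate(self, prompts: list[str], max_new_tokens: int) -> list[str]: ...


class EchoBackend:
    """Toy stand-in for a real engine: returns a truncated copy of each prompt."""

    def generate(self, prompts: list[str], max_new_tokens: int) -> list[str]:
        return [p[:max_new_tokens] for p in prompts]


class ReverseBackend:
    """Second toy backend: returns each prompt reversed, then truncated."""

    def generate(self, prompts: list[str], max_new_tokens: int) -> list[str]:
        return [p[::-1][:max_new_tokens] for p in prompts]


def collect_rollouts(
    backend: GenerationBackend, prompts: list[str]
) -> list[tuple[str, str]]:
    """Algorithm-side code: it only knows the interface, not the engine."""
    completions = backend.generate(prompts, max_new_tokens=8)
    return list(zip(prompts, completions))
```

Swapping EchoBackend for ReverseBackend changes the completions but requires no change to collect_rollouts, which is the property the architecture above is aiming for.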

In this post, we explore how to use NeMo-RL to reproduce a DeepScaleR-1.5B recipe using the Group Relative Policy Optimization (GRPO) reinforcement learning algorithm.
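For readers new to GRPO, the core idea is that each prompt gets a group of sampled responses, and each response is scored relative to the other responses in its group instead of against a learned value function. The sketch below shows one common formulation of that group-relative advantage. It is an illustration, not NeMo-RL's implementation: the function name, the eps constant, and the use of the population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against its own group's mean and spread.

    `rewards` holds the scalar rewards of all responses sampled for ONE prompt.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ here
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four responses to one prompt: two judged correct (1.0), two incorrect (0.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Responses that beat their group's average get a positive advantage and the rest get a negative one. When every response in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.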

Training high-performing reasoning models with NeMo-RL

Recently, long chain-of-thought (CoT) reasoning models such as DeepSeek-R1 and OpenAI O1 have gained widespread attention. These models greatly advance the capabilities of language models across a variety of challenging domains. The following sections show how to use NeMo-RL to train such high-performing reasoning models.

We follow the DeepScaleR recipe, which provides a dataset and methodology for training reasoning models on difficult math problems. In particular, we train Qwen-1.5B to OpenAI O1 level on the competitive academic math benchmark AIME24 using GRPO.

Step-by-step training process

Because long-CoT reasoning models can be very slow to train due to long generation times, DeepScaleR starts training with a short maximum sequence length and gradually increases it. Specifically, DeepScaleR has three training stages: 8K context length, 16K context length, and 24K context length. This approach also helps to control the long-tailed distribution of rollout sequence lengths.
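The staged schedule can be sketched as a simple loop in which each stage resumes from the checkpoint produced by the previous one. This is only an illustration: Stage, run_curriculum, and the train_stage callback are hypothetical names, and mapping 8K, 16K, and 24K to 8192, 16384, and 24576 tokens is an assumption rather than something the recipe states.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stage:
    name: str
    max_seq_len: int  # assumed token counts for "8K", "16K", "24K"


STAGES = [Stage("8K", 8192), Stage("16K", 16384), Stage("24K", 24576)]


def run_curriculum(
    stages: list[Stage],
    initial_model: str,
    train_stage: Callable[[str, int], str],
) -> list[tuple[str, str]]:
    """Run stages in order, feeding each stage the previous stage's checkpoint.

    `train_stage(model, max_seq_len)` is a hypothetical callback that trains
    at the given maximum sequence length and returns the new checkpoint path.
    """
    model = initial_model
    history = []
    for stage in stages:
        model = train_stage(model, stage.max_seq_len)
        history.append((stage.name, model))
    return history
```

Keeping the early stages short caps the cost of each rollout while the model is still improving quickly, and only the final stage pays for the longest generations.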

Running this training with NeMo-RL is straightforward and takes only three steps.

Step 1: Setup

Clone the repo and install the uv Python package. uv lets you quickly create isolated virtual environments despite potentially conflicting dependencies, while also allowing native integration with Ray.

git clone git@github.com:NVIDIA-NeMo/RL.git nemo-rl
cd nemo-rl
pip install uv

Step 2: Training

Run training using DeepSeek-R1-Distill-Qwen-1.5B, first with a maximum context length of 8K, then 16K, then 24K. NeMo-RL natively integrates with Hugging Face models, so you can specify your model choice directly. The configuration file specifies the DeepScaleR dataset and the correct GRPO hyperparameters.

uv run examples/run_grpo_math.py \
    --config=examples/configs/grpo-deepscaler-1.5b-8K.yaml

uv run examples/run_grpo_math.py \
    --config=examples/configs/grpo-deepscaler-1.5b-16K.yaml \
    policy.model_name=/path/to/8K/checkpoint/hf

uv run examples/run_grpo_math.py \
    --config=examples/configs/grpo-deepscaler-1.5b-24K.yaml \
    policy.model_name=/path/to/16K/checkpoint/hf

An example of the policy section of a GRPO configuration file:

policy:
  # Qwen/Qwen2.5-1.5B has tied weights which are only supported with dtensor policy with tp size 1 (https://github.com/NVIDIA-NeMo/RL/issues/227)
  model_name: "Qwen/Qwen2.5-1.5B"
  tokenizer:
    name: ${policy.model_name} ## specify if you'd like to use a tokenizer different from the model's default
  train_global_batch_size: 512
  train_micro_batch_size: 4
  generation_batch_size: 32 # Only used when generating using HF backend
  logprob_batch_size: 4
  max_total_sequence_length: 512
  precision: "bfloat16"
  fsdp_offload_enabled: false
  activation_checkpointing_enabled: false
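The 16K and 24K commands above pass policy.model_name=... on the command line, which replaces a nested key from the YAML file. The sketch below shows how a dotted override of this kind can be applied to a nested configuration. It is a simplified illustration, not the parsing code NeMo-RL uses: apply_overrides is a hypothetical helper, and it keeps every overridden value as a plain string.

```python
import copy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with `a.b.c=value` style overrides applied.

    Hypothetical helper for illustration; values are kept as strings and
    missing intermediate sections are created on demand.
    """
    result = copy.deepcopy(config)
    for item in overrides:
        dotted_key, _, value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = result
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return result


if __name__ == "__main__":
    base = {"policy": {"model_name": "Qwen/Qwen2.5-1.5B", "precision": "bfloat16"}}
    stage2 = apply_overrides(base, ["policy.model_name=/path/to/8K/checkpoint/hf"])
    print(stage2["policy"]["model_name"])
```

Overriding only the model path keeps each stage's YAML file unchanged while chaining the stages through their checkpoints.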

Step 3: Evaluation

Convert the checkpoints to Hugging Face format and evaluate the model. Note that the model is also evaluated continuously throughout training. Specify the model config, the model location, and the desired location for the Hugging Face checkpoint, as shown below.

uv run examples/convert_dcp_to_hf.py \
    --config=results/grpo-deepscaler-1.5b-8K/step_xx/config.yaml \
    --dcp-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_xx/policy/weights \
    --hf-ckpt-path=results/grpo-deepscaler-1.5b-8K/step_xx/hf

uv run examples/run_eval.py \
    generation.model_name=results/grpo-deepscaler-1.5b-8K/step_xx/hf
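Math benchmarks such as AIME24 are typically scored by comparing the model's final answer with a reference answer. The sketch below illustrates that idea under loudly stated assumptions: it assumes the final answer is wrapped in \boxed{...} and uses exact string matching. It is not the grader used by run_eval.py, and extract_boxed and accuracy are hypothetical helpers. Real graders usually also check mathematical equivalence.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `text`, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces: treat as no answer


def accuracy(responses: list[str], answers: list[str]) -> float:
    """Fraction of responses whose boxed answer exactly matches the reference."""
    correct = sum(extract_boxed(r) == a for r, a in zip(responses, answers))
    return correct / len(answers)
```

A response with no boxed answer, or with unbalanced braces, counts as incorrect.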

Results

Figure 2 shows the training curve with NeMo-RL. A training reward of 0.65 is reached in only 400 steps.

A chart showing training reward versus training steps for the DeepScaleR Qwen-1.5B recipe using NeMo-RL. The curve shows a consistent reward improvement, reaching an average reward of about 0.65 around step 400.
Figure 2. Training curve using NeMo-RL for the DeepScaleR Qwen-1.5B recipe

Figure 3 shows the AIME24 evaluation results over the course of training, which ultimately surpass OpenAI O1.

A line chart showing AIME24 evaluation scores plotted against training steps for the Qwen-1.5B recipe trained using NeMo-RL. The curve shows a progressive improvement in performance, ultimately surpassing the OpenAI O1 baseline score of 40 on the AIME24 benchmark.
Figure 3. AIME24 evaluation score for the Qwen-1.5B recipe trained using NeMo-RL

To visualize the TensorBoard logs and get a head start on what to expect from the DeepScaleR recipe, check out the NeMo-RL DeepScaleR TensorBoard Viewer Google Colab.

Get started with NeMo-RL

NeMo-RL is a scalable post-training library designed to train models on anything from a single GPU to thousands of GPUs. It includes seamless integration with Hugging Face, a modular design for flexibility, and efficient resource management with Ray.

To start your own reinforcement learning experiments with NeMo-RL, explore the open source NVIDIA NeMo-RL GitHub repo, where you can find detailed documentation, example scripts, and configuration files. You can also try the DeepScaleR and OpenMathInstruct-2 examples to learn more.


