NeMo RL: A Scalable and Efficient Post-Training Library
NeMo RL is a scalable and efficient post-training library designed for models ranging from 1 GPU to thousands, and from tiny models to over 100 billion parameters.
What you can expect:
- Seamless integration with Hugging Face for ease of use, allowing users to leverage a wide range of pre-trained models and tools.
- High-performance implementation with Megatron Core, supporting various parallelism techniques for large models (>100B) and large context lengths.
- Efficient resource management using Ray, enabling scalable and flexible deployment across different hardware configurations.
- Flexibility with a modular design that allows easy integration and customization.
- Comprehensive documentation that is both detailed and user-friendly, with practical examples.
- [5/14/2025] Reproduce DeepScaleR with NeMo RL!
✅ Available now | 🔜 Coming in v0.3
- ✅ Fast Generation - vLLM backend for optimized inference.
- ✅ HuggingFace Integration - Works with 1-32B models (Qwen2.5, Llama).
- ✅ Distributed Training - Fully Sharded Data Parallel (FSDP) support and Ray-based infrastructure.
- ✅ Environment Support - Support for multi-environment training.
- ✅ Learning Algorithms - GRPO (Group Relative Policy Optimization), SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization).
- ✅ Multi-Turn RL - Multi-turn generation and training for RL with tool use, games, etc.
- ✅ Large Model Support - Native PyTorch support for models up to 32B parameters.
- ✅ Advanced Parallelism - PyTorch-native FSDP2, TP, and SP for efficient training.
- ✅ Worker Isolation - Process isolation between RL actors (no worries about global state).
- ✅ Environment Isolation - Dependency isolation between components.
- 🔜 Improved Native Performance - Faster training for native PyTorch models.
- 🔜 (Even) Larger Model Support with Long(er) Sequences - Advanced training parallelism with Megatron Core.
- 🔜 MoE Models - Support for DeepSeek-V3 and Llama 4.
- 🔜 Megatron Inference - Megatron inference for day-0 support of new Megatron models.
Clone NeMo RL.
git clone [email protected]:NVIDIA/NeMo-RL.git nemo-rl
cd nemo-rl
Install uv.
# For faster setup and environment isolation, we use `uv`
pip install uv
# If you cannot install at the system level, you can install for your user with
# pip install --user uv
# Use `uv run` to launch all commands. It handles pip installing implicitly and
# ensures your environment is up to date with our lock file.
# Note: we recommend using `uv run` rather than activating the venv, since it
# ensures consistent environment usage across different shells and sessions.
# Example: uv run python examples/run_grpo_math.py
Important Notes:
- Use `uv run <command>` to execute scripts within the managed environment. This helps maintain consistency across different shells and sessions.
- Ensure you have the necessary CUDA drivers and a PyTorch build compatible with your hardware installed.
- Reminder: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll also need to run `huggingface-cli login` for Llama models. A minimal setup sketch follows this list.
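Here is that sketch (the paths and API key below are placeholders; adjust them for your system):
# Placeholder paths and key -- substitute your own values
export HF_HOME=/path/to/hf_home
export HF_DATASETS_CACHE=/path/to/hf_datasets_cache
export WANDB_API_KEY=your_wandb_api_key
# Log in to Hugging Face (required for gated models such as Llama)
huggingface-cli login
# Optional sanity check that PyTorch can see your GPUs
uv run python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"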
We provide a reference GRPO experiment configuration for math benchmarks using the OpenMathInstruct-2 dataset.
To run GRPO on a single GPU for `Qwen/Qwen2.5-1.5B`:
# Run the GRPO math example using a 1B parameter model
uv run python examples/run_grpo_math.py
By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:
# Run the GRPO math example with a 1B parameter model on 8 GPUs
uv run python examples/run_grpo_math.py \
cluster.gpus_per_node=8
You can override any of the parameters listed in the YAML configuration file. For example:
uv run python examples/run_grpo_math.py \
policy.model_name="meta-llama/Llama-3.2-1B-Instruct" \
checkpointing.checkpoint_dir="results/llama1b_math" \
logger.wandb_enabled=True \
logger.wandb.name="grpo-llama1b_math" \
logger.num_val_samples_to_print=10
To run multi-node GRPO on Slurm, submit the job with `ray.sub`:
# Run from the root of the NeMo RL repo
NUM_ACTOR_NODES=2
# grpo_math_8B uses the Llama-3.1-8B-Instruct model
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml cluster.num_nodes=2 checkpointing.checkpoint_dir='results/llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='grpo-llama8b_math'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
--nodes=${NUM_ACTOR_NODES} \
--account=YOUR_ACCOUNT \
--job-name=YOUR_JOBNAME \
--partition=YOUR_PARTITION \
--time=4:0:0 \
--gres=gpu:8 \
ray.sub
The required `CONTAINER` can be built by following the instructions in the Docker documentation.
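For reference, a hedged sketch of building and selecting such an image, assuming the repository ships a Dockerfile under docker/ (the path, tag, and any build targets here are illustrative; follow the Docker documentation for the supported build):
# Illustrative only: verify the Dockerfile path and build options in the Docker docs
docker build -f docker/Dockerfile -t nemo-rl:latest .
CONTAINER=nemo-rl:latest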
This section outlines how to run GRPO for Qwen2.5-32B with a 16k sequence length.
# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=16
# Download Qwen before the job starts to avoid spending time downloading during the training loop
HF_HOME=/path/to/hf_home huggingface-cli download Qwen/Qwen2.5-32B
# Ensure HF_HOME is included in your MOUNTS
HF_HOME=/path/to/hf_home \
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml policy.model_name='Qwen/Qwen2.5-32B' policy.generation.vllm_cfg.tensor_parallel_size=4 policy.max_total_sequence_length=16384 cluster.num_nodes=${NUM_ACTOR_NODES} policy.dtensor_cfg.enabled=True policy.dtensor_cfg.tensor_parallel_size=8 policy.dtensor_cfg.sequence_parallel=True policy.dtensor_cfg.activation_checkpointing=True checkpointing.checkpoint_dir='results/qwen2.5-32b' logger.wandb_enabled=True logger.wandb.name='qwen2.5-32b'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
--nodes=${NUM_ACTOR_NODES} \
--account=YOUR_ACCOUNT \
--job-name=YOUR_JOBNAME \
--partition=YOUR_PARTITION \
--time=4:0:0 \
--gres=gpu:8 \
ray.sub
We also support multi-turn generation and training (tool use, games, etc.). A reference example that trains a model to play the Sliding Puzzle game:
uv run python examples/run_grpo_sliding_puzzle.py
We provide an example SFT experiment using the SQuAD dataset.
The default SFT configuration is set to run on a single GPU. To start the experiment:
uv run python examples/run_sft.py
This fine-tunes the `Llama3.2-1B` model on the SQuAD dataset using a single GPU.
To use multiple GPUs on a single node, modify the cluster configuration. This also lets you increase the model size and batch size:
uv run python examples/run_sft.py \
policy.model_name="meta-llama/Meta-Llama-3-8B" \
policy.train_global_batch_size=128 \
sft.val_global_batch_size=128 \
cluster.gpus_per_node=8
Refer to `examples/configs/sft.yaml` for a full list of parameters that can be overridden.
To run multi-node SFT on Slurm:
# Run from the root of the NeMo RL repo
NUM_ACTOR_NODES=2
COMMAND="uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
--nodes=${NUM_ACTOR_NODES} \
--account=YOUR_ACCOUNT \
--job-name=YOUR_JOBNAME \
--partition=YOUR_PARTITION \
--time=4:0:0 \
--gres=gpu:8 \
ray.sub
We provide a sample DPO experiment that uses the HelpSteer3 dataset for preference-based training.
The default DPO experiment is configured to run on a single GPU. To launch the experiment:
uv run python examples/run_dpo.py
This trains `Llama3.2-1B-Instruct` on one GPU.
If you have access to more GPUs, you can scale up the experiment. To run on 8 GPUs, update the cluster configuration and switch to the 8B Llama 3.1 Instruct model:
uv run python examples/run_dpo.py \
policy.model_name="meta-llama/Llama-3.1-8B-Instruct" \
policy.train_global_batch_size=256 \
cluster.gpus_per_node=8
Any of the DPO parameters can be customized from the command line. For example:
uv run python examples/run_dpo.py \
dpo.sft_loss_weight=0.1 \
dpo.preference_average_log_probs=True \
checkpointing.checkpoint_dir="results/llama_dpo_sft" \
logger.wandb_enabled=True \
logger.wandb.name="llama-dpo-sft"
Refer to `examples/configs/dpo.yaml` for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the DPO documentation.
For distributed DPO training across multiple nodes, modify the following script for your use case:
# Run from the root of NeMo RL repo
## number of nodes to use for your job
NUM_ACTOR_NODES=2
COMMAND="uv run ./examples/run_dpo.py --config examples/configs/dpo.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 dpo.val_global_batch_size=32 checkpointing.checkpoint_dir='results/dpo_llama81_2nodes' logger.wandb_enabled=True logger.wandb.name='dpo-llama1b'" \
RAY_DEDUP_LOGS=0 \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
--nodes=${NUM_ACTOR_NODES} \
--account=YOUR_ACCOUNT \
--job-name=YOUR_JOBNAME \
--partition=YOUR_PARTITION \
--time=4:0:0 \
--gres=gpu:8 \
ray.sub
We provide evaluation tools to assess model capabilities.
If you have trained a model and saved the checkpoint in the PyTorch DCP format, you first need to convert it to the Hugging Face format before running evaluation:
# Example for a GRPO checkpoint at step 170
uv run python examples/convert_dcp_to_hf.py \
--config results/grpo/step_170/config.yaml \
--dcp-ckpt-path results/grpo/step_170/policy/weights/ \
--hf-ckpt-path results/grpo/hf
Note: Adjust the paths according to your training output directory structure.
For an in-depth explanation of checkpointing, refer to the Checkpointing documentation.
Run evaluation script with converted model:
uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf
Run evaluation script with custom settings:
# Example: Evaluation of DeepScaleR-1.5B-Preview on MATH-500 using 8 GPUs
# Pass@1 accuracy averaged over 16 samples for each problem
uv run python examples/run_eval.py \
generation.model_name=agentica-org/DeepScaleR-1.5B-Preview \
generation.temperature=0.6 \
generation.top_p=0.95 \
generation.vllm_cfg.max_model_len=32768 \
data.dataset_name=HuggingFaceH4/MATH-500 \
data.dataset_key=test \
eval.num_tests_per_prompt=16 \
cluster.gpus_per_node=8
Note: Evaluation results may vary slightly due to various factors, such as sampling parameters, random seed, inference engine version, and inference engine settings.
Refer to `examples/configs/eval.yaml` for a full list of parameters that can be overridden. For an in-depth explanation of evaluation, refer to the Evaluation documentation.
For detailed instructions on how to set up and launch NeMo RL on Slurm or Kubernetes clusters, please refer to the dedicated Cluster Start documentation.
If you use NeMo RL in your research, please cite it using the following BibTeX entry:
@misc{nemo-rl,
title = {NeMo RL: A Scalable and Efficient Post-Training Library},
howpublished = {\url{https://github.com/NVIDIA/NeMo-RL}},
year = {2025},
note = {GitHub repository},
}
We welcome contributions to NeMo RL! Please see our Contributing Guidelines for more information on how to get involved.
NVIDIA NeMo RL is licensed under the Apache License 2.0.