NeMo RL: A Scalable and Efficient Post-Training Library

NeMo RL is a scalable and efficient post-training library that supports training on anything from a single GPU to thousands of GPUs, and models from tiny to over 100 billion parameters.

What you can expect:

  • Seamless integration with Hugging Face for ease of use, allowing users to leverage a wide range of pre-trained models and tools.
  • High-performance implementation with Megatron Core, supporting various parallelism techniques for large models (>100B) and large context lengths.
  • Efficient resource management using Ray, enabling scalable and flexible deployment across different hardware configurations.
  • Flexibility with a modular design that allows easy integration and customization.
  • Comprehensive documentation that is both detailed and user-friendly, with practical examples.

Features

✅ Available now | 🔜 Coming in v0.3

  • ✅ Fast Generation - vLLM backend for optimized inference.

  • ✅ HuggingFace Integration - Works with 1-32B models (Qwen2.5, Llama).

  • ✅ Distributed Training - Fully Sharded Data Parallel (FSDP) support and Ray-based infrastructure.

  • ✅ Environment Support - Support for multi-environment training.

  • ✅ Learning Algorithms - GRPO (Group Relative Policy Optimization), SFT (Supervised Fine-Tuning), and DPO (Direct Preference Optimization).

  • ✅ Multi-Turn RL - Multi-turn generation and training for RL with tool use, games, etc.

  • ✅ Large Model Support - Native PyTorch support for models up to 32B parameters.

  • ✅ Advanced Parallelism - PyTorch native FSDP2, TP, and SP for efficient training.

  • ✅ Worker Isolation - Process isolation between RL Actors (no worries about global state).

  • ✅ Environment Isolation - Dependency isolation between components.

  • 🔜 Improved Native Performance - Improved training time for native PyTorch models.

  • 🔜 (even) Larger Model Support with Long(er) Sequences - Advanced parallelism in training with Megatron Core.

  • 🔜 MoE Models - Support for DeepseekV3 and Llama4.

  • 🔜 Megatron Inference - Megatron-based inference for day-0 support of new Megatron models.

Prerequisites

Clone NeMo RL.

git clone git@github.com:NVIDIA/NeMo-RL.git nemo-rl
cd nemo-rl

Install uv.

# For faster setup and environment isolation, we use `uv`
pip install uv

# If you cannot install at the system level, you can install for your user with
# pip install --user uv

# Use `uv run` to launch all commands. It handles pip installing implicitly and
# ensures your environment is up to date with our lock file.

# Rather than activating the venv, it is recommended to use `uv run`, since it
# ensures consistent environment usage across different shells and sessions.
# Example: uv run python examples/run_grpo_math.py

Important Notes:

  • Use `uv run <command>` to execute scripts within the managed environment. This helps maintain consistency across different shells and sessions.
  • Ensure you have the necessary CUDA drivers installed and a PyTorch build compatible with your hardware.
  • Reminder: Don't forget to set your HF_HOME, WANDB_API_KEY, and HF_DATASETS_CACHE (if needed). You'll also need to run huggingface-cli login to access Llama models; a typical setup is sketched below.
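
A minimal sketch of that setup (the paths and key below are placeholders, not values required by NeMo RL):

# Illustrative environment setup -- substitute your own paths and keys
export HF_HOME=/path/to/hf_home                      # Hugging Face cache location
export HF_DATASETS_CACHE=/path/to/hf_datasets_cache  # optional datasets cache
export WANDB_API_KEY=your_wandb_api_key              # enables Weights & Biases logging

# Log in to Hugging Face (needed for gated models such as Llama)
huggingface-cli login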

GRPO

We provide a reference GRPO experiment configuration for training on math benchmarks using the OpenMathInstruct-2 dataset.

GRPO Single Node

To run GRPO on a single GPU for Qwen/Qwen2.5-1.5B:

# Run the GRPO math example using a 1B parameter model
uv run python examples/run_grpo_math.py

By default, this uses the configuration in examples/configs/grpo_math_1B.yaml. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:

# Run the GRPO math example with a 1B parameter model on 8 GPUs
uv run python examples/run_grpo_math.py \
  cluster.gpus_per_node=8

You can override any of the parameters listed in the YAML configuration file. For example:

uv run python examples/run_grpo_math.py \
  policy.model_name="meta-llama/Llama-3.2-1B-Instruct" \
  checkpointing.checkpoint_dir="results/llama1b_math" \
  logger.wandb_enabled=True \
  logger.wandb.name="grpo-llama1b_math" \
  logger.num_val_samples_to_print=10

GRPO Multi-node

# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=2

# grpo_math_8b uses Llama-3.1-8B-Instruct model
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml cluster.num_nodes=2 checkpointing.checkpoint_dir='results/llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='grpo-llama8b_math'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub

The required CONTAINER can be built by following the instructions in the Docker documentation.
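
As a rough sketch, assuming the repository's Dockerfile is at docker/Dockerfile and your cluster consumes squashfs images via enroot (both assumptions; the Docker documentation is authoritative):

# Build the container image locally
docker build -f docker/Dockerfile -t nemo-rl:latest .

# Optionally convert it to a squashfs image for Slurm clusters that use enroot/pyxis
enroot import -o nemo-rl.sqsh dockerd://nemo-rl:latest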

GRPO Qwen2.5-32B

This section outlines how to run GRPO for Qwen2.5-32B with a 16k sequence length.

# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=16

# Download Qwen before the job starts to avoid spending time downloading during the training loop
HF_HOME=/path/to/hf_home huggingface-cli download Qwen/Qwen2.5-32B

# Ensure HF_HOME is included in your MOUNTS
HF_HOME=/path/to/hf_home \
COMMAND="uv run ./examples/run_grpo_math.py --config examples/configs/grpo_math_8B.yaml policy.model_name='Qwen/Qwen2.5-32B' policy.generation.vllm_cfg.tensor_parallel_size=4 policy.max_total_sequence_length=16384 cluster.num_nodes=${NUM_ACTOR_NODES} policy.dtensor_cfg.enabled=True policy.dtensor_cfg.tensor_parallel_size=8 policy.dtensor_cfg.sequence_parallel=True policy.dtensor_cfg.activation_checkpointing=True checkpointing.checkpoint_dir='results/qwen2.5-32b' logger.wandb_enabled=True logger.wandb.name='qwen2.5-32b'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub

GRPO Multi-Turn

We also support multi-turn generation and training (tool use, games, etc.). A reference example trains a model to play the Sliding Puzzle Game:

uv run python examples/run_grpo_sliding_puzzle.py 
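
The same command-line override syntax used by the other examples should apply here as well; the sketch below assumes the sliding-puzzle config exposes the usual cluster.gpus_per_node key:

# Assumed override, mirroring the other GRPO examples
uv run python examples/run_grpo_sliding_puzzle.py \
  cluster.gpus_per_node=8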

Supervised Fine-Tuning (SFT)

We provide an example SFT experiment using the SQuAD dataset.

SFT Single Node

The default SFT configuration is set to run on a single GPU. To start the experiment:

uv run python examples/run_sft.py

This fine-tunes the Llama3.2-1B model on the SQuAD dataset using a single GPU.

To use multiple GPUs on a single node, you can modify the cluster configuration. This also lets you increase the model size and batch size:

uv run python examples/run_sft.py \
  policy.model_name="meta-llama/Meta-Llama-3-8B" \
  policy.train_global_batch_size=128 \
  sft.val_global_batch_size=128 \
  cluster.gpus_per_node=8

Refer to examples/configs/sft.yaml for a full list of parameters that can be overridden.

SFT Multi-node

# Run from the root of NeMo RL repo
NUM_ACTOR_NODES=2

COMMAND="uv run ./examples/run_sft.py --config examples/configs/sft.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/sft_llama8b_2nodes' logger.wandb_enabled=True logger.wandb.name='sft-llama8b'" \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub

DPO

We provide a sample DPO experiment that uses the HelpSteer3 dataset for preference-based training.

DPO Single Node

The default DPO experiment is configured to run on a single GPU. To launch the experiment:

uv run python examples/run_dpo.py

This trains Llama3.2-1B-Instruct on one GPU.

If you have access to more GPUs, you can scale up the experiment. To run on 8 GPUs, update the cluster configuration and switch to an 8B Llama3.1 Instruct model:

uv run python examples/run_dpo.py \
  policy.model_name="meta-llama/Llama-3.1-8B-Instruct" \
  policy.train_global_batch_size=256 \
  cluster.gpus_per_node=8

Any of the DPO parameters can be customized from the command line. For example:

uv run python examples/run_dpo.py \
  dpo.sft_loss_weight=0.1 \
  dpo.preference_average_log_probs=True \
  checkpointing.checkpoint_dir="results/llama_dpo_sft" \
  logger.wandb_enabled=True \
  logger.wandb.name="llama-dpo-sft"

Refer to examples/configs/dpo.yaml for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the DPO documentation.

DPO Multi-node

For distributed DPO training across multiple nodes, modify the following script for your use case:

# Run from the root of NeMo RL repo
## number of nodes to use for your job
NUM_ACTOR_NODES=2

COMMAND="uv run ./examples/run_dpo.py --config examples/configs/dpo.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 dpo.val_global_batch_size=32 checkpointing.checkpoint_dir='results/dpo_llama81_2nodes' logger.wandb_enabled=True logger.wandb.name='dpo-llama1b'" \
RAY_DEDUP_LOGS=0 \
CONTAINER=YOUR_CONTAINER \
MOUNTS="$PWD:$PWD" \
sbatch \
    --nodes=${NUM_ACTOR_NODES} \
    --account=YOUR_ACCOUNT \
    --job-name=YOUR_JOBNAME \
    --partition=YOUR_PARTITION \
    --time=4:0:0 \
    --gres=gpu:8 \
    ray.sub

Evaluation

We provide evaluation tools to assess model capabilities.

Convert Model Format (Optional)

If you have trained a model and saved the checkpoint in the PyTorch DCP (Distributed Checkpoint) format, you first need to convert it to the Hugging Face format before running evaluation:

# Example for a GRPO checkpoint at step 170
uv run python examples/convert_dcp_to_hf.py \
    --config results/grpo/step_170/config.yaml \
    --dcp-ckpt-path results/grpo/step_170/policy/weights/ \
    --hf-ckpt-path results/grpo/hf

Note: Adjust the paths according to your training output directory structure.

For an in-depth explanation of checkpointing, refer to the Checkpointing documentation.

Run Evaluation

Run the evaluation script with the converted model:

uv run python examples/run_eval.py generation.model_name=$PWD/results/grpo/hf

Run the evaluation script with custom settings:

# Example: Evaluation of DeepScaleR-1.5B-Preview on MATH-500 using 8 GPUs
#          Pass@1 accuracy averaged over 16 samples for each problem
uv run python examples/run_eval.py \
    generation.model_name=agentica-org/DeepScaleR-1.5B-Preview \
    generation.temperature=0.6 \
    generation.top_p=0.95 \
    generation.vllm_cfg.max_model_len=32768 \
    data.dataset_name=HuggingFaceH4/MATH-500 \
    data.dataset_key=test \
    eval.num_tests_per_prompt=16 \
    cluster.gpus_per_node=8

Note: Evaluation results may vary slightly due to various factors, such as sampling parameters, random seed, inference engine version, and inference engine settings.

Refer to examples/configs/eval.yaml for a full list of parameters that can be overridden. For an in-depth explanation of evaluation, refer to the Evaluation documentation.

Set Up Clusters

For detailed instructions on how to set up and launch NeMo RL on Slurm or Kubernetes clusters, please refer to the dedicated Cluster Start documentation.

Citation

If you use NeMo RL in your research, please cite it using the following BibTeX entry:

@misc{nemo-rl,
  title = {NeMo RL: A Scalable and Efficient Post-Training Library},
  howpublished = {\url{https://github.com/NVIDIA/NeMo-RL}},
  year = {2025},
  note = {GitHub repository},
}

Contributing

We welcome contributions to NeMo RL! Please see our Contributing Guidelines for more information on how to get involved.

Licenses

NVIDIA NeMo RL is licensed under the Apache License 2.0.