🤝 Compatibility of the TRL CLI with accelerate arguments #3409

Merged
qgallouedec merged 31 commits into main from compat-cli-with-accelerate-args on May 6, 2025

Conversation

@qgallouedec (Member) commented May 3, 2025

This PR adds two features:

  • Use any accelerate launch argument in the TRL CLI. For example, you can now use:
    trl sft \
      --model_name_or_path Qwen/Qwen2.5-0.5B \
      --dataset_name stanfordnlp/imdb \
      --num_processes 4
    
  • Use a predefined accelerate config, for example:
    trl sft \
      --model_name_or_path Qwen/Qwen2.5-0.5B \
      --dataset_name stanfordnlp/imdb \
      --accelerate_config deepspeed_zero2
    

Both of the above settings remain compatible with a config file:

# sft_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: stanfordnlp/imdb
accelerate_config: deepspeed_zero2  # or path/to/my/accelerate/config.yaml

trl sft --config sft_config.yaml
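
Since both settings are covered, the accelerate launch arguments from the first example should presumably also work from the config file. A sketch, untested:

# sft_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: stanfordnlp/imdb
num_processes: 4  # launcher argument set directly in the config file

trl sft --config sft_config.yaml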

@qgallouedec qgallouedec marked this pull request as ready for review May 5, 2025 00:32
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lewtun (Member) left a comment

Nice QoL improvement! LGTM with some nits

| `deepspeed_zero1` | DeepSpeed ZeRO Stage 1 |
| `deepspeed_zero2` | DeepSpeed ZeRO Stage 2 |
| `deepspeed_zero3` | DeepSpeed ZeRO Stage 3 |
| `fsdp_qlora` | Fully Sharded Data Parallel with QLoRA |

Member

The QLoRA distinction is not really needed IMO and is an artifact from when we tuned L3 405B. I suggest we merge #3317 first and then just have fsdp1 and fsdp2.

@qgallouedec (Member Author)

Sounds good, I just rebased on fix-fsdp2 and am making the required changes.

@qgallouedec (Member Author)

FSDP1/2 added in 2a52ee6
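
For illustration, the new presets would presumably be selected the same way as the DeepSpeed ones (preset name taken from the suggestion above, untested):

trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name stanfordnlp/imdb \
  --accelerate_config fsdp2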

@qgallouedec qgallouedec changed the base branch from main to fix-fsdp2 May 5, 2025 16:20
@shirinyamani shirinyamani self-requested a review May 5, 2025 22:18
@shirinyamani (Member) left a comment

@qgallouedec LGTM, I only think we are missing config support for multi-node (aka --num_machines as an accelerate arg), which would be something like:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0  
main_training_function: main
mixed_precision: 'bf16'
num_machines: 2  
num_processes: 8  
rdzv_backend: static
rdzv_endpoint: "MASTER_NODE_IP:PORT" 
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

but I'm just not sure about:

  1. defining the IP address of the main node and an available port --> rdzv_endpoint
  2. the machine_rank, which will differ per node, i.e. for each machine (0, 1, 2, etc.)

@qgallouedec (Member Author)

You're right @shirinyamani, we don’t have general multi-node examples because you need to provide things like IP addresses manually. However, I think it should be doable by combining, for example, a Zero3 configuration with these arguments directly in the Slurm script. It would look something like this:

# sft_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: stanfordnlp/imdb
accelerate_config: deepspeed_zero3

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8

# Get the list of allocated nodes and set main process IP
NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))
MASTER_ADDR="${NODELIST[0]}"

# Launch distributed training
trl sft --config sft_config.yaml \
    --num_processes $SLURM_NTASKS \
    --num_machines $SLURM_JOB_NUM_NODES \
    --main_process_ip $MASTER_ADDR \
    --machine_rank $SLURM_PROCID \
    --rdzv_backend c10d

I haven't tested it though, but feel free to add this to the doc in a follow-up PR.

@qgallouedec qgallouedec changed the title Compatibility of the TRL CLI with accelerate arguments 🤝 Compatibility of the TRL CLI with accelerate arguments May 6, 2025
Base automatically changed from fix-fsdp2 to main May 6, 2025 06:29
@qgallouedec qgallouedec merged commit 1954c02 into main May 6, 2025
8 of 10 checks passed
@qgallouedec qgallouedec deleted the compat-cli-with-accelerate-args branch May 6, 2025 07:09