🤝 Compatibility of the TRL CLI with accelerate arguments #3409

qgallouedec · 2025-05-03T06:55:37Z

This PR adds two features:

Use any accelerate launch argument in the TRL CLI. For example, you can now use:

trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name stanfordnlp/imdb \
  --num_processes 4

Use a predefined accelerate config. For example:

trl sft \
  --model_name_or_path Qwen/Qwen2.5-0.5B \
  --dataset_name stanfordnlp/imdb \
  --accelerate_config deepspeed_zero2

Both options remain compatible with a config file:

# sft_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: stanfordnlp/imdb
accelerate_config: deepspeed_zero2  # or path/to/my/accelerate/config.yaml

trl sft --config sft_config.yaml
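
One way to picture how launcher flags such as --num_processes can share a command line with training flags is to parse the known launcher flags first and hand everything else on untouched. The sketch below uses only the standard-library argparse module. The function name split_launch_args and the two-flag subset are hypothetical, chosen for illustration; this is not TRL's actual parser.

```python
import argparse


def split_launch_args(argv):
    """Separate a few launcher flags from all other arguments (illustrative only)."""
    launcher = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    # Hypothetical subset of launcher flags; a real launcher accepts many more.
    launcher.add_argument("--num_processes", type=int)
    launcher.add_argument("--num_machines", type=int)
    # parse_known_args returns the parsed namespace plus the unrecognized strings.
    launch_ns, remaining = launcher.parse_known_args(argv)
    launch_args = {k: v for k, v in vars(launch_ns).items() if v is not None}
    return launch_args, remaining


launch_args, training_args = split_launch_args([
    "--model_name_or_path", "Qwen/Qwen2.5-0.5B",
    "--dataset_name", "stanfordnlp/imdb",
    "--num_processes", "4",
])
print(launch_args)    # launcher-side settings
print(training_args)  # everything left for the training script
```

Only flags the launcher parser knows about are consumed, so the training arguments keep their original order.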

HuggingFaceDocBuilderDev · 2025-05-05T00:38:59Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lewtun

Nice QoL improvement! LGTM with some nits

docs/source/clis.md

lewtun · 2025-05-05T15:30:12Z

+| `deepspeed_zero1` | DeepSpeed ZeRO Stage 1                 |
+| `deepspeed_zero2` | DeepSpeed ZeRO Stage 2                 |
+| `deepspeed_zero3` | DeepSpeed ZeRO Stage 3                 |
+| `fsdp_qlora`      | Fully Sharded Data Parallel with QLoRA |


The QLoRA distinction is not really needed IMO; it is an artifact from when we tuned L3 405B. I suggest we merge #3317 first and then just have fsdp1 and fsdp2.

Sounds good. I just rebased on fix-fsdp2 and am making the required changes.

FSDP1/2 added in 2a52ee6

shirinyamani

@qgallouedec LGTM. I think the only thing missing is config support for multi-node training (i.e. --num_machines as an accelerate argument), which would look something like:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: 'bf16'
num_machines: 2
num_processes: 8
rdzv_backend: static
rdzv_endpoint: "MASTER_NODE_IP:PORT"
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

but I'm just not sure about:

Defining the IP address of the main node and an available port (rdzv_endpoint).
Setting machine_rank, which differs on each node (0, 1, 2, etc.).
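
The two per-node values in question can both be derived from an ordered list of node hostnames: the rank is the node's position in the list, and the rendezvous endpoint points at the first node. The sketch below illustrates that idea only. The function name node_settings, the example hostnames, and the default port 29500 are assumptions, not anything defined in this PR.

```python
def node_settings(hostnames, current_host, port=29500):
    """Derive per-node values from an ordered host list (illustrative only).

    The port default is an assumption; any free port on the main node would do.
    """
    if current_host not in hostnames:
        raise ValueError(f"{current_host!r} is not in the host list")
    return {
        # Rank is the node's position in the list, so the first node is rank 0.
        "machine_rank": hostnames.index(current_host),
        # Every node points at the first node as the rendezvous endpoint.
        "rdzv_endpoint": f"{hostnames[0]}:{port}",
    }


hosts = ["node-a", "node-b"]  # hypothetical hostnames
for host in hosts:
    print(host, node_settings(hosts, host))
```

Each node would run the same code with its own hostname, so only machine_rank changes between nodes.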

qgallouedec · 2025-05-05T23:02:59Z

You're right @shirinyamani, we don’t have general multi-node examples because you need to provide things like IP addresses manually. However, I think it should be doable by combining, for example, a Zero3 configuration with these arguments directly in the Slurm script. It would look something like this:

# sft_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: stanfordnlp/imdb
accelerate_config: deepspeed_zero3

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8

# Get the list of allocated nodes and set main process IP
NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))
MASTER_ADDR="${NODELIST[0]}"

# Launch distributed training
trl sft --config sft_config.yaml \
    --num_processes $SLURM_NTASKS \
    --num_machines $SLURM_JOB_NUM_NODES \
    --main_process_ip $MASTER_ADDR \
    --machine_rank $SLURM_PROCID \
    --rdzv_backend c10d

I haven't tested this, though. Feel free to add it to the docs in a follow-up PR.
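
To make the arithmetic behind the launcher flags in the script above explicit, here is a small sketch that builds the same flag list from plain numbers. It assumes one process per GPU and that --num_processes counts processes across all nodes. The function name multinode_launch_flags and the hostname node-0 are hypothetical, and like the script it is untested against a real cluster.

```python
def multinode_launch_flags(num_nodes, gpus_per_node, main_ip, machine_rank):
    """Build the launcher flags from the Slurm example (illustrative only)."""
    if not 0 <= machine_rank < num_nodes:
        raise ValueError("machine_rank must be in [0, num_nodes)")
    # Assumption: one process per GPU, counted across all nodes.
    total_processes = num_nodes * gpus_per_node
    return [
        "--num_processes", str(total_processes),
        "--num_machines", str(num_nodes),
        "--main_process_ip", main_ip,
        "--machine_rank", str(machine_rank),
        "--rdzv_backend", "c10d",
    ]


# 4 nodes with 8 GPUs each, as in the #SBATCH header of the script.
flags = multinode_launch_flags(num_nodes=4, gpus_per_node=8, main_ip="node-0", machine_rank=0)
print(" ".join(flags))
```

With 4 nodes of 8 GPUs each, the total comes to 32 processes, which matches what $SLURM_NTASKS would hold under the header shown in the script.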

lewtun and others added 17 commits April 17, 2025 12:17

Add FSDP configs

a91c73c

Bump accelerate

debdc8e

Update prepare

e7f74f7

update version accelerate in test

499167d

Merge branch 'main' into fix-fsdp2

6f23aa0

Merge branch 'main' into fix-fsdp2

fd70eb2

Add full state dict

ca08043

Revert

91ae801

return_remaining_strings=True

eb3ed1b

TRLParser compat with subparsers

cd1c0b4

test subpaser config handling

9570ff2

allow launch argument in cli args for sft

770d7b5

better comment

17c0c9f

add accelerate configs

63b8d73

rewrite the cli doc

f6cb3e5

accelerate config

db6ab73

further improve the doc

8202f0a

qgallouedec marked this pull request as ready for review May 5, 2025 00:32

qgallouedec and others added 3 commits May 4, 2025 17:32

Merge branch 'main' into compat-cli-with-accelerate-args

bbd186e

rm chatgpt blabla

dc4f49c

simplify

e8fa32f

qgallouedec added 4 commits May 5, 2025 01:02

Is it clearer?

7e42d82

other examples

1fbb56f

even better

6eccaec

detail

69b90d5

qgallouedec requested review from kashif, edbeeching, lewtun and shirinyamani May 5, 2025 01:54

lewtun added 2 commits May 5, 2025 12:27

Merge branch 'main' into fix-fsdp2

9cfde38

Bump min version

f0eabff

lewtun approved these changes May 5, 2025

Update docs/source/clis.md

c584b21

Co-authored-by: lewtun <[email protected]>

qgallouedec changed the base branch from main to fix-fsdp2 May 5, 2025 16:20

qgallouedec and others added 3 commits May 5, 2025 09:20

Merge branch 'fix-fsdp2' into compat-cli-with-accelerate-args

1f0ebcc

deepspeed_zeroN -> zeroN

7a71b9d

remove fsdp qlora and add fsdp1/2

2a52ee6

shirinyamani approved these changes May 5, 2025

shirinyamani self-requested a review May 5, 2025 22:18

shirinyamani approved these changes May 5, 2025

qgallouedec changed the title ~~Compatibility of the TRL CLI with accelerate arguments~~ 🤝 Compatibility of the TRL CLI with accelerate arguments May 6, 2025

Base automatically changed from fix-fsdp2 to main May 6, 2025 06:29

Merge branch 'main' into compat-cli-with-accelerate-args

be268de

qgallouedec merged commit 1954c02 into main May 6, 2025
8 of 10 checks passed

qgallouedec deleted the compat-cli-with-accelerate-args branch May 6, 2025 07:09
