Multi-GPU support for summarization PPO example #571
Comments
Hello @sayan1101! You can check out the following instructions / configs that were used to train this example: https://github.com/CarperAI/trlx/tree/main/examples/summarize_rlhf#training-process

In particular, this example was trained with a config for two 80GB GPUs, so in order to not run out of memory you have to reduce [...]. If you were unsuccessful even after that, or if you still want to use your config, you'd have to make the following changes: [...]

This way, the reward model will be loaded on the 8th GPU and won't occupy the space needed for training the LLM.
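A minimal sketch of one way to do this, assuming the reward model is loaded separately inside trlx_gptj_text_summarization.py and pinned to the last device; the model path and reward_fn below are illustrative placeholders, not the example's actual code:

```python
# Hedged sketch only: pin the reward model to the last GPU so the
# accelerate/DeepSpeed training processes keep the remaining GPUs.
# "path/to/reward_model" and reward_fn are illustrative placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rw_device = torch.device(f"cuda:{torch.cuda.device_count() - 1}")  # e.g. cuda:7 on an 8-GPU node

rw_tokenizer = AutoTokenizer.from_pretrained("path/to/reward_model")
rw_model = AutoModelForSequenceClassification.from_pretrained("path/to/reward_model")
rw_model.eval().half().to(rw_device)

@torch.no_grad()
def reward_fn(samples, **kwargs):
    # Score generated summaries on the dedicated reward GPU and
    # return plain Python floats to the trainer.
    enc = rw_tokenizer(samples, padding=True, truncation=True, return_tensors="pt").to(rw_device)
    return rw_model(**enc).logits.squeeze(-1).float().tolist()
```

With accelerate's num_processes set to one less than the GPU count, the training ranks then never touch the reward device.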
Thanks for taking the time to reply. I tried using a 4 x A100 GPU instance from RunPod. Even after making the changes that you mentioned, I failed to start the training process. I have set num_processes: 3 in the default_accelerate_config.yaml shown above so that training can happen on the other 3 GPUs, but I am getting a runtime error every time: [...]

Please suggest any way around this.
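A minimal sanity check for this kind of split, assuming one GPU is deliberately left out of the accelerate config for the reward model (as described above):

```python
# Hedged sanity check for a "train on N-1 GPUs, reward model on the last GPU" setup.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
rw_device = torch.device(f"cuda:{torch.cuda.device_count() - 1}")

# With num_processes: 3 on a 4-GPU box, each training rank should sit on cuda:0..2.
assert accelerator.num_processes == torch.cuda.device_count() - 1, (
    "accelerate's num_processes should leave one GPU free for the reward model"
)
print(f"rank {accelerator.process_index} trains on {accelerator.device}; "
      f"reward model goes on {rw_device}")
```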
@sayan1101 If you could post the whole stack trace, including the error before the timeouts, that would be very helpful. And just to confirm, you're using A100s with 40GB of memory, is that correct?
🐛 Describe the bug
This is not a bug; I wanted to know how we can run the PPO training for summarization. This is the file I am trying to run: trlx_gptj_text_summarization.py, which is in trlx/examples/summarize_rlhf. I tried to run it with a changed accelerate config:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
I ran it with accelerate launch --config_file configs/default_accelerate_config.yaml trlx_gptj_text_summarization.py, but got CUDA out of memory. I am using 8 x RTX 6000 GPUs, 76 vCPUs, and 400GB RAM.

Do I need to make changes in the trlx_gptj_text_summarization.py file as well? If yes, please tell me what changes are required. Quick resolution will be highly appreciated.
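A minimal sketch of the kind of script-side change that can help with the OOM, assuming trlx's PPO config layout; the config path and the values chosen below are placeholders to adapt, not the maintainers' exact instructions:

```python
# Hedged sketch: shrink PPO batch/rollout sizes so the example fits on smaller
# GPUs. The config path and chosen values are illustrative assumptions.
from trlx.data.configs import TRLConfig

config = TRLConfig.load_yaml("configs/ppo_config_summ_gptj.yml")  # path assumed

config.train.batch_size = 1      # per-process PPO batch size
config.train.seq_length = 512    # shorter sequences cut activation memory
config.method.chunk_size = 1     # rollout chunk size during experience generation
config.method.num_rollouts = 64  # fewer rollouts held in memory at once

# The modified `config` is then passed to trlx.train(...) exactly as the
# example script already does.
```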
Which trlX version are you using?
No response
Additional system and package information
No response