multigpu support for summarization ppo example #571

Open
sayan1101 opened this issue Oct 21, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@sayan1101

🐛 Describe the bug

This is not a bug. I want to know how to run the PPO training for the summarization example on multiple GPUs. The file I am trying to run is trlx_gptj_text_summarization.py, located in trlx/examples/summarize_rlhf. I tried running it with a modified accelerate config:
'''
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
'''

I ran it with accelerate launch --config_file configs/default_accelerate_config.yaml trlx_gptj_text_summarization.py, but got a CUDA out-of-memory error.
I am using 8 x RTX6000 GPUs, 76 vCPUs, and 400GB RAM.

Do I need to make changes to trlx_gptj_text_summarization.py as well? If so, please tell me which changes are required.
A quick resolution would be highly appreciated.
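
For a rough sense of why memory is tight here, the sketch below estimates per-GPU model-state memory under ZeRO stage 2 with mixed-precision Adam. It assumes the commonly cited accounting of 2 bytes per parameter for 16-bit weights, 2 for gradients, and 12 for fp32 optimizer state, and takes GPT-J as roughly 6 billion parameters. The function name and the `offload_optimizer` flag are hypothetical helpers, not part of trlX, accelerate, or DeepSpeed. Activations, the reward model, and any reference model are not counted, so real usage will be higher.

```python
def zero2_model_state_gib(n_params: float, n_gpus: int, offload_optimizer: bool = False) -> float:
    """Back-of-envelope per-GPU model-state memory in GiB (hypothetical helper).

    Assumes mixed-precision Adam under ZeRO stage 2:
      - 16-bit weights are replicated on every GPU (2 bytes per parameter)
      - 16-bit gradients are partitioned across GPUs (2 bytes per parameter / n_gpus)
      - fp32 optimizer state is partitioned across GPUs (12 bytes per parameter / n_gpus),
        or treated as living in host RAM when offload_optimizer is True
    Activations and any additional models are NOT included.
    """
    if n_gpus < 1:
        raise ValueError("n_gpus must be at least 1")
    weight_bytes = 2 * n_params
    grad_bytes = 2 * n_params / n_gpus
    optim_bytes = 0.0 if offload_optimizer else 12 * n_params / n_gpus
    return (weight_bytes + grad_bytes + optim_bytes) / 2**30


if __name__ == "__main__":
    gptj_params = 6e9  # approximate parameter count, an assumption for illustration
    for offload in (False, True):
        gib = zero2_model_state_gib(gptj_params, n_gpus=8, offload_optimizer=offload)
        print(f"offload_optimizer={offload}: ~{gib:.1f} GiB per GPU before activations")
```

Even with the optimizer state offloaded, the replicated weights alone take a large share of each card in this estimate, which leaves little room for activations at the example's default batch size.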

Which trlX version are you using?

No response

Additional system and package information

No response

sayan1101 added the bug label Oct 21, 2023
@maxreciprocate
Collaborator

Hello @sayan1101! You can check out the following instructions / configs that were used to train this example: https://github.com/CarperAI/trlx/tree/main/examples/summarize_rlhf#training-process

In particular, this example was trained with a config for two 80GB GPUs, so to avoid running out of memory you have to reduce batch_size in trlx_gptj_text_summarization.py (a note at the link above also says this).
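
If you lower batch_size but want to keep the same number of samples per optimizer step, one option is to raise gradient_accumulation_steps by the same factor. The sketch below only does that arithmetic. It assumes batch_size is a per-process value, and both function names are hypothetical helpers rather than trlX or accelerate APIs. The numbers in the usage section are illustrative, not the example's actual settings.

```python
def effective_batch_size(per_process_batch_size: int, num_processes: int,
                         gradient_accumulation_steps: int = 1) -> int:
    """Samples consumed per optimizer step across all training processes."""
    return per_process_batch_size * num_processes * gradient_accumulation_steps


def accumulation_steps_to_keep_batch(old_batch_size: int, new_batch_size: int,
                                     old_accumulation_steps: int = 1) -> int:
    """Accumulation steps needed so a smaller per-process batch keeps the same effective batch."""
    if new_batch_size <= 0 or old_batch_size % new_batch_size != 0:
        raise ValueError("new_batch_size must be a positive divisor of old_batch_size")
    return old_accumulation_steps * (old_batch_size // new_batch_size)


if __name__ == "__main__":
    # Illustrative numbers only: shrink the per-process batch from 4 to 1.
    steps = accumulation_steps_to_keep_batch(old_batch_size=4, new_batch_size=1)
    before = effective_batch_size(4, num_processes=7)
    after = effective_batch_size(1, num_processes=7, gradient_accumulation_steps=steps)
    print(steps, before, after)
```

The resulting value would go into gradient_accumulation_steps in the DeepSpeed section of the accelerate config, next to the smaller batch_size in the script.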

If that still doesn't work, or if you want to keep your config, you'd have to make the following changes:

  1. Change the rw_device index from 1 to 7 here:
    rw_device = torch.device("cuda:{}".format(1)) # set reward model device
  2. Change 'num_processes' to '7' in your accelerate config

This way, the reward model will be loaded on the 8th GPU and won't take up memory needed for training the LLM.
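
The two changes above follow one rule: reserve the last GPU for the reward model and give every other GPU to a training process. A minimal sketch of that rule, assuming training processes are assigned to GPUs 0 through num_processes - 1 (plan_gpu_layout is a hypothetical helper, not part of trlX or accelerate):

```python
def plan_gpu_layout(total_gpus: int) -> dict:
    """Reserve the last GPU for the reward model; train on all the others."""
    if total_gpus < 2:
        raise ValueError("need at least 2 GPUs: one for the reward model, one for training")
    reward_index = total_gpus - 1
    return {
        "rw_device": f"cuda:{reward_index}",       # string to pass to torch.device(...)
        "num_processes": total_gpus - 1,           # value for the accelerate config
        "training_gpus": list(range(reward_index)),
    }


if __name__ == "__main__":
    for n in (8, 4):
        print(n, plan_gpu_layout(n))
```

For 8 GPUs this gives cuda:7 and num_processes 7, matching the steps above. For a 4-GPU machine it gives cuda:3 and num_processes 3.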

@sayan1101
Author

sayan1101 commented Oct 24, 2023


Thanks for taking the time to reply. I tried a 4 x A100 GPU instance from RunPod. Even after making the changes you mentioned, I could not start the training process.
These are the changes I made:
rw_device = torch.device("cuda:{}".format(3))
so that the reward model is loaded on the 4th GPU. I also changed the accelerate config to this:
"""
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: configs/ds_config_trlx_gptj_summarize.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_config: {}
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
"""

I set num_processes to 3 in default_accelerate_config.yaml, as shown above, so that training runs on the remaining 3 GPUs.

But I get a runtime error every time:
[Screenshot: runtime error, 2023-10-24 6:34 PM]

Please suggest a workaround.

@maxreciprocate
Collaborator

@sayan1101 If you could post the whole stack trace, including the error before the timeouts, that would be very helpful. And just to confirm, you're using A100s with 40GB of memory, is that correct?
