Multiple GPU low performance #1734
Hello,

I have an issue with multi-GPU performance:

- `lora_finetune_single_device` with the config `mini_lora_single_device.yaml` on one 6000 Ada: ~5 it/s
- `lora_finetune_distributed` with the config `mini_lora.yaml` on 2 x 6000 Ada: ~1.5 s/it

The dataset that I used to fine-tune is HuggingFaceFW/fineweb-edu-score-2.

How can I improve the performance on multiple GPUs?
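(For reference, the two runs above would presumably have been launched with the tune CLI along these lines; this is a sketch, and the config files are the reporter's own local ones.)

```bash
# Single-device run (~5 it/s reported)
tune run lora_finetune_single_device --config mini_lora_single_device.yaml

# Two-GPU distributed run (~1.5 s/it reported); nproc_per_node matches the GPU count
tune run --nproc_per_node 2 lora_finetune_distributed --config mini_lora.yaml
```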
Comments

Hi @jetstudio-io, thanks for the question. The it/s or s/it metric is not a great indicator of performance here. Instead, I would check the logs for tokens per second to do a better comparison.
Or you can see it over time if you log with WandB. Many factors can impact raw seconds/iteration, especially gradient accumulation, but it is not necessarily indicative of training convergence speed. That being said, there are still other ways to improve performance. You can check our documentation page on the memory/perf features you can enable to get some ideas (cc @felipemello1): https://pytorch.org/torchtune/main/tutorials/memory_optimizations.html. A very direct way to improve throughput is to enable packing in your dataset. If you are using the torchtune dataset builder functions, you can simply pass the `packed` flag (see the sketch below).
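As a minimal sketch of both suggestions, assuming the config's dataset is one of the torchtune builders that accepts a `packed` flag and that the WandB logger lives at `torchtune.training.metric_logging.WandBLogger` in this torchtune version (the `metric_logger.project` value is made up):

```bash
# Hypothetical command-line overrides on top of the reporter's config:
# pack samples for higher tokens/sec and log metrics to WandB over time
tune run --nproc_per_node 2 lora_finetune_distributed --config mini_lora.yaml \
  dataset.packed=True \
  metric_logger._component_=torchtune.training.metric_logging.WandBLogger \
  metric_logger.project=torchtune-perf
```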
Like @RdoubleA said, the config has `gradient_accumulation_steps: 16`, which means that one logged step is actually 16 iterations (so 1.5 s/it over 16 accumulated micro-batches works out to roughly 16 / 1.5 ≈ 10.7 micro-batches per second). Maybe try the following:

If you are running out of memory, set `enable_activation_checkpointing: True`. You can watch your memory usage on the Weights & Biases website. Also use the torchtune/PyTorch nightlies for maximum performance: https://github.com/pytorch/torchtune#install-nightly-release
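The concrete settings to try appear to have been cut off in the capture; one hedged reading, expressed as command-line overrides (assuming `gradient_accumulation_steps`, `enable_activation_checkpointing`, and `log_peak_memory_stats` are top-level keys in this recipe's config), would be:

```bash
# Trade gradient accumulation for real optimizer steps, checkpoint activations
# if memory gets tight, and log peak memory so it shows up in WandB
tune run --nproc_per_node 2 lora_finetune_distributed --config mini_lora.yaml \
  gradient_accumulation_steps=1 \
  enable_activation_checkpointing=True \
  log_peak_memory_stats=True
```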
Thanks for your advice, I'll try to measure the tokens/s.