[air/train] Multi GPU training occupies more memory on first GPU #26707
Comments
@krfricke can you try adding a […]
@JiahaoYao FYI, this seems related to the issue you saw with Horovod(?)
@amogkam Confirmed, this removes the double memory allocation. On my first try the run hung forever, though; on the second run it worked. Waiting now for the full benchmark to pass to confirm. Can we do this automatically in the torch trainer?
Nice! Yes, we can do this automatically; will make a quick PR for this.
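For context, the exact one-line change being discussed is truncated in the comment above, so the following is an assumption: it presumably amounts to explicitly binding each training worker to its assigned GPU. A minimal sketch of that pattern, assuming Ray Train's `train.torch.get_device()` helper inside a standard `TorchTrainer` training loop:

```python
# Hedged sketch of the kind of one-line fix discussed above (the exact line is
# truncated in the comment, so this is an assumption): explicitly select the
# worker's assigned GPU so that device-less CUDA calls no longer default to GPU 0.
import torch
from ray import train

def train_loop_per_worker(config):
    device = train.torch.get_device()  # GPU assigned to this Ray Train worker
    torch.cuda.set_device(device)      # make it this process's default CUDA device

    model = torch.nn.Linear(10, 1).to(device)  # toy model for illustration
    # ... rest of the usual training loop
```

With the device set per process, tensors created via bare `.cuda()` calls land on the worker's own GPU instead of `cuda:0`.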
With this change, training takes forever now:
Before: [timings not shown]
After: [timings not shown]
GPU utilization is also at 100% during the whole training run, while it was at ~50% before and still is for vanilla training. I also believe that vanilla training is mostly slowed down by leftover utilization of the GPUs (but I'm not sure about this). Anyway, removing the line speeds everything up again. Does setting the device somehow interfere with the batch size setting?
After removing the line (same cluster): [timings not shown]
Ah, I believe it's because the data loaders are presumably packed onto the GPU.
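A sketch of the pattern this comment alludes to (illustrative names, not the benchmark's actual code): keep the `DataLoader` yielding CPU tensors with pinned memory and move each batch to the worker's device per step, rather than placing the whole dataset on the GPU up front:

```python
# Illustrative sketch: CPU-side DataLoader with pinned memory, per-batch transfer
# to the worker's device. Keeping data loading off the GPU avoids extra allocations
# (and contention) on a single device.
import torch
from torch.utils.data import DataLoader, TensorDataset

def run_epoch(model, optimizer, device):
    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
    loader = DataLoader(dataset, batch_size=128, shuffle=True,
                        pin_memory=True, num_workers=2)
    loss_fn = torch.nn.CrossEntropyLoss()
    for x, y in loader:
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```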
My experience is logged here: Lightning-AI/pytorch-lightning#13665 and ray-project/ray_lightning#181 @krfricke @matthewdeng
What does this mean?
What happened + What you expected to happen
This is a 4-node, 4-GPUs-per-node setup. I'm running a GPU benchmark script comparing training with Ray AIR and vanilla PyTorch.
When training with Ray AIR, the GPU with ID 0 on the head node occupies more memory than expected. It looks like the worker models are also instantiated on GPU 0, and this seems to happen during setup.
This is the output of nvidia-smi for Ray AIR, which shows that the GPU with ID 0 is used by multiple processes: [nvidia-smi output not shown]
For comparison, this is the GPU usage when running with the vanilla training script (only one process per GPU ID): [nvidia-smi output not shown]
When restarting with Ray AIR and monitoring the GPU usage, this comes up: [nvidia-smi output not shown]
A few seconds later: [nvidia-smi output not shown]
It thus seems to be a setup issue.
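To illustrate the suspected mechanism (an assumption based on the symptoms above, not the actual Ray AIR setup code): in PyTorch, `.to("cuda")` without an index means `cuda:0`, so if several worker processes on one node set up their models this way, each of them creates a CUDA context and allocates memory on GPU 0 in addition to the GPU it actually trains on:

```python
# Hypothetical illustration of the setup problem: every worker that moves its model
# to the bare "cuda" device touches cuda:0, even if it later trains on another GPU.
import torch

def setup_buggy(model):
    # "cuda" resolves to cuda:0 -> each worker process allocates on GPU 0
    return model.to("cuda")

def setup_fixed(model, local_rank):
    torch.cuda.set_device(local_rank)      # bind this process to its own GPU first
    return model.to(f"cuda:{local_rank}")  # no stray context on GPU 0
```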
Versions / Dependencies
Latest master
Reproduction script
Run the air_benchmark_torch_mnist_gpu_4x4 release test (preferably manually on anyscale).
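For those without access to the release test, here is a hedged, stripped-down stand-in (not the actual benchmark script; it assumes the Ray AIR API of that era, `ray.air.config.ScalingConfig` and `ray.train.torch.TorchTrainer`) that uses the same layout of 16 GPU workers (4 nodes x 4 GPUs):

```python
# Hedged stand-in for the air_benchmark_torch_mnist_gpu_4x4 release test: a small
# TorchTrainer run with 16 GPU workers. While it runs, watch nvidia-smi on the head
# node; with the bug, extra worker processes show up on GPU 0.
import torch
from ray import train
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    device = train.torch.get_device()
    model = torch.nn.Linear(28 * 28, 10).to(device)   # MNIST-sized toy model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(200):
        x = torch.randn(128, 28 * 28, device=device)
        y = torch.randint(0, 10, (128,), device=device)
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
)
trainer.fit()
```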
Issue Severity
High: It blocks me from completing my task.