I'm training the base u-net using the 'accelerate' command provided in the repo (i.e., 'accelerate launch train.py').
I make sure that the batch size on each GPU is 1. My expectation is that, no matter how many GPUs I use, as long as the per-GPU batch size stays at 1, the memory usage on each GPU should be roughly the same.
However, I find that the more GPUs I use, the higher the memory usage on each GPU, even though the per-GPU batch size is still 1.
For example, when I train the base u-net on a single GPU, the memory usage is:
[0] 19876 / 32510 MB
When I train it with 2 GPUs, the memory usage is:
[0] 23892 / 32510 MB
[1] 23732 / 32510 MB
When I train it with 3 GPUs, the memory usage is:
[0] 25132 / 32510 MB
[1] 24962 / 32510 MB
[2] 24962 / 32510 MB
When I train it with 8 GPUs, the memory usage is:
[0] 31176 / 32510 MB
[1] 31000 / 32510 MB
[2] 30930 / 32510 MB
[3] 30958 / 32510 MB
[4] 30940 / 32510 MB
[5] 30996 / 32510 MB
[6] 31070 / 32510 MB
[7] 30994 / 32510 MB
It would be greatly appreciated if someone could tell me why this is the case.
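For reference, below is a minimal, self-contained sketch of the setup described above. It is not the repo's actual train.py: the tiny Conv2d model and random data are stand-ins for the base u-net and dataset. It uses a per-process batch size of 1 and logs, on each rank, the memory actually allocated by tensors, which can be compared against the nvidia-smi totals above.

```python
# Minimal sketch (assumed setup, not the repo's train.py): per-process batch size 1
# with Accelerate, plus per-rank logging of tensor-allocated CUDA memory.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Stand-ins for the base u-net and its dataset.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1)
)
data = TensorDataset(torch.randn(16, 3, 64, 64))
loader = DataLoader(data, batch_size=1)  # batch size 1 *per process*
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for step, (x,) in enumerate(loader):
    loss = model(x).pow(2).mean()
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    if step == 0:
        # Counts only memory allocated by tensors on this rank; nvidia-smi
        # additionally includes the CUDA context, the caching allocator's
        # reserve, and any communication buffers.
        mb = torch.cuda.max_memory_allocated() / 2**20
        print(f"rank {accelerator.process_index}: {mb:.0f} MB allocated by tensors")
        break
```

Launching this with 'accelerate launch sketch.py' on different numbers of GPUs and comparing the per-rank allocated figure against the nvidia-smi numbers would show how much of the growth comes from the training tensors themselves versus other per-process overhead.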