Training on multi-GPU very slow #237
Comments
Hi! This is quite a hard problem in itself. 2k hours is a lot, and 10 days of training on a single GPU sounds reasonable to me. You have two options:
1. Consider something other than LiGRU (LSTM and GRU are faster thanks to cuDNN, but they also give worse performance).
2. Multi-GPU with DataParallel, which is bottlenecked by Python; the only real fix would be DistributedDataParallel, which I think is impossible to adapt to pytorch-kaldi. So you should just set multi_gpu=True and use batch_size = max_batch_size_for_one_gpu * number_of_gpus. Training time doesn't scale linearly with the number of GPUs, but you can easily go down to about 3 days with 4 GPUs.
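To make option 2 concrete, here is a minimal sketch of the relevant cfg entries. The option names are assumed to follow the [exp] and [batches] sections of cfg/librispeech_liGRU_fmllr.cfg, and the batch-size numbers are placeholders for a hypothetical 4-GPU setup, not recommended values:

```ini
[exp]
# run on GPU and wrap the model in DataParallel across all visible GPUs
use_cuda = True
multi_gpu = True

[batches]
# placeholder: assumed per-GPU maximum of 8 sequences x 4 GPUs = 32
batch_size_train = 32
```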
Thank you. I am using the settings listed below:
It seems training will take about 12 days (my sequences are long). If you think all my settings are reasonable, then I will just wait.
How many GPUs do you have?
4 GPUs.
I am training my ASR model with pytorch-kaldi and notice that training is very slow: processing 10% of one chunk takes 10 minutes. I have 10 chunks and will run 15 epochs, which works out to about 10 days of training (roughly 100 min per chunk × 10 chunks × 15 epochs ≈ 15,000 min).
My dataset has about 2k hours of audio, which I split into 10 chunks. I use multi-GPU, and each GPU has 32 GB of memory. I am following cfg/librispeech_liGRU_fmllr.cfg, except that I use Adam as the optimizer and 4 LiGRU layers (instead of the original 5); a sketch of those two changes is given below.
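For concreteness, a hedged sketch of those two deviations in the architecture section. The option names (ligru_lay, ligru_drop, arch_opt, arch_lr) are assumed to match the LiGRU architecture section of librispeech_liGRU_fmllr.cfg, and the values are placeholders rather than the poster's actual settings:

```ini
[architecture1]
# 4 LiGRU layers instead of the original 5 (550 units per layer is a placeholder);
# the other per-layer lists (dropout, batchnorm, activations, ...) must be
# shortened to 4 entries as well
ligru_lay = 550,550,550,550
ligru_drop = 0.2,0.2,0.2,0.2

# switch the per-architecture optimizer to Adam
arch_opt = adam
arch_lr = 0.0004
```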
I have searched the existing issues and learned that the developers have already optimized the multi-GPU training process. However, my GPU utilization is still only around 30%, which means the GPUs are not fully used. Is there any way I can speed up the training a bit?
Thank you very much!