Trouble on Parallel Training #55
Comments
See the issue below.
Thanks :) Exactly the answer I was looking for!
An update: as pointed out in #48, for batch size 4 it is unnecessary to use parallel training. In my test it actually slowed training down because of the data-transfer overhead between GPUs. For an epoch size of 1000 on a V100 GPU:
Just for reference; a rough way to reproduce such a comparison is sketched below.
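For anyone who wants to reproduce that comparison, here is a rough timing sketch. The toy model, input shape, and iteration count are arbitrary placeholders (not this project's settings); it just contrasts a single-GPU training step with an nn.DataParallel step at batch size 4.

```python
import time

import torch
import torch.nn as nn


def make_model():
    # Arbitrary small conv net standing in for the project's network.
    return nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1),
        nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(64, 10),
    )


def time_training_step(model, n_iters=100, batch_size=4):
    # Average wall-clock time of one forward/backward/update step.
    device = torch.device("cuda:0")
    model = model.to(device)
    x = torch.randn(batch_size, 3, 224, 224, device=device)
    target = torch.randint(0, 10, (batch_size,), device=device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iters):
        optimizer.zero_grad()
        loss = criterion(model(x), target)
        loss.backward()
        optimizer.step()
    torch.cuda.synchronize()
    return (time.time() - start) / n_iters


if __name__ == "__main__":
    single = time_training_step(make_model())
    print(f"single GPU:   {single * 1000:.1f} ms/step")

    if torch.cuda.device_count() > 1:
        # With batch size 4 each replica gets only 1-2 samples, so the
        # scatter/gather and replication overhead can easily outweigh the
        # extra compute, which matches the slowdown described above.
        multi = time_training_step(nn.DataParallel(make_model()))
        print(f"DataParallel: {multi * 1000:.1f} ms/step")
```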
Thanks for your work!
Recently I found that if the machine has more than one GPU, a runtime error occurs: "RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)"
The error disappears if I use
CUDA_VISIBLE_DEVICES=0
to pin training to a single GPU. The torch version I use is 1.5.1 with CUDA 10.0. May I ask whether the data-parallel functionality has been successfully tested with multiple GPUs? Thanks.
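For readers hitting the same error, here is a minimal sketch (with a made-up toy model, not this repository's code) of the device placement nn.DataParallel expects, which is the usual starting point when debugging this kind of mismatch:

```python
import torch
import torch.nn as nn

# Hypothetical toy network standing in for the repository's model.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())

if torch.cuda.device_count() > 1:
    # DataParallel replicates the module onto every visible GPU and splits
    # each input batch along dim 0 during forward().
    model = nn.DataParallel(model)

# The wrapped module (and all of its parameters) should sit on the primary
# device, cuda:0 by default, before the first forward pass.
model = model.to("cuda:0")

x = torch.randn(4, 3, 32, 32, device="cuda:0")
out = model(x)    # scatter -> per-GPU forward -> gather back onto cuda:0
print(out.shape)  # torch.Size([4, 16, 32, 32])
```

Errors such as "device 1 does not equal 0" usually mean that some tensor or submodule used inside forward() is not replicated along with the model, for example because it was created with a hard-coded .cuda() call or is not registered as an nn.Module, Parameter, or buffer, so the replica on GPU 1 ends up convolving its input with a weight that still lives on GPU 0. Setting CUDA_VISIBLE_DEVICES=0, as in the workaround above, simply hides the extra GPUs so no replication happens.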
Below is the full error log for your reference: