GPU memory explodes after 3 steps #5
Comments
@InitialBug I'm testing on a 1080ti and got the same result. The program crashes every time it runs into the self-attention layer. I'm sorry that I don't know why. The original model could run normally, but now it fails. |
@InitialBug Hey, I fixed the issue! The original model couldn't release the computational graph after each epoch, so it consumed enormous memory. I modified CQAttention so that the graph is released automatically. Now the model runs as fast as a rocket! |
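The actual change to CQAttention isn't shown in this thread. One general way a PyTorch computational graph ends up never being released is holding references to graph-attached tensors across iterations; below is a minimal sketch with a hypothetical stand-in model, not the real fix:

```python
import torch

model = torch.nn.Linear(100, 1)                  # hypothetical stand-in model
batches = [torch.randn(32, 100) for _ in range(10)]

# Problematic pattern: `running` stays attached to every iteration's graph,
# so intermediate activations from all previous batches cannot be freed.
running = 0
for x in batches:
    loss = model(x).mean()
    running = running + loss                     # keeps every graph alive

# Safer pattern: keep only a detached Python float for bookkeeping,
# so each iteration's graph can be released after backward().
running = 0.0
for x in batches:
    loss = model(x).mean()
    loss.backward()
    running += loss.item()                       # no graph reference retained
```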
I have two 8 GB M60 cards, but it still fails with a CUDA out-of-memory exception. Does it support multi-GPU for now? |
@deepakkumar1984 I only have one GPU, so I can't test it on multiple GPUs. I'd be very grateful if you contribute the relevant code. |
I tried a V100 GPU with 16 GB of HBM VRAM. Even with that it gives an out-of-memory exception: THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory Traceback (most recent call last): Any idea??? |
@deepakkumar1984 Please set batch_size = 28. I tested it on a 16 GB GPU; that setting runs smoothly. BTW, the current F1/EM is very low. I'm fixing that. |
Thanks my friend! Please let me know when the scores are fixed so that I can initiate a test training for you and share the score. |
@hengruo The reason for the low score may be a problem in your optimizer scheduler. I printed the learning rate of the optimizer, and it doesn't increase from 0 to 0.001; instead it is stuck at 1e-7 after 1000 steps, and the training loss doesn't converge because of the learning rate. After changing the optimizer to Adam without the increasing learning rate, the model converges fast and I've got an F1 score of 53 after 10,000 training steps, though with different hyperparameters. |
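For reference, the warm-up behaviour described in the QANet paper (learning rate rising from 0 to 0.001 over the first 1000 steps) can be written as a LambdaLR schedule in PyTorch. This is only a minimal sketch, not the scheduler from this repository; the log-shaped warm-up and the Adam settings (β1 = 0.8, β2 = 0.999, ε = 1e-7) follow the paper, and the model is a stand-in:

```python
import math
import torch

model = torch.nn.Linear(128, 128)      # stand-in for the QANet model

base_lr = 0.001
warmup_steps = 1000

# Adam hyperparameters as reported in the QANet paper.
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr,
                             betas=(0.8, 0.999), eps=1e-7)

# LambdaLR multiplies base_lr by the returned factor, so the learning rate
# climbs from ~0 to 0.001 over the first `warmup_steps` and then stays flat.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lambda step: min(1.0, math.log(step + 1) / math.log(warmup_steps)))

for step in range(2000):
    # forward pass and loss.backward() would go here
    optimizer.step()
    scheduler.step()
```

If the printed learning rate stays near 1e-7 instead of following this curve, the schedule function itself is the first thing worth checking.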
@InitialBug Wow, that's amazing! Could you please publish your hyperparameters? |
@hengruo I have tried many different hyperparameters, but I think the root problem is the optimizer: you can simply try Adam with a fixed learning rate of 0.001. I think you'll get better results even after 1,000 steps, though I don't know why. This is my experience after 5 days of debugging. I set flags.DEFINE_integer("word_len", 16, "Limit length for one word"); the attention_map_dim refers to d_v/d_k in the multi-head attention layer, and I find that a lower character embedding dimension makes the model converge faster. |
@InitialBug THANK YOU VERY MUCH!!! I'm testing with your settings! |
@hengruo Any good news? I tested many different hyperparameters, but the best F1 score is only 64.3 so far. I wonder if the model has some problem? |
I have the same question.
@InitialBug May I ask what the current best performance you can get is? I am not sure why the fixed LR doesn't work well, as it should be good according to the original paper. Another problem is that we don't seem to have an exponential moving average here.
@BangLiu The best F1 score is around 66, but when I keep training, the model overfits. I have re-implemented part of model.py, but most of the modules are the same as in this repository. I also use the exponential moving average, but it also seems to make the training unstable. I think the only differences between my model and the paper are the hyperparameters and the stochastic depth. |
@InitialBug I uploaded my implementation based on this repository to https://github.com/BangLiu/QANet-PyTorch. You are welcome to test my code. I get a memory explosion with this repository's implementation at batch_size 32, but my implementation doesn't have this problem. However, my performance is currently not good. I implemented EMA in my repository and tested it in QANet_trainer.py (you can check whether my implementation is correct), but the EMA makes the performance quite bad (F1 is less than 10% ...) and I don't know why. @hengruo It would be great if you could also take a look and see what differs between our implementations. I think we should get about 78~80 F1, otherwise something is not correct. |
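Since the repositories implement EMA differently, here is a minimal, generic sketch of parameter EMA as it is usually paired with QANet-style training; the class name and decay value are illustrative and not taken from either repository. The shadow weights are updated after every optimizer step and copied into the model only for evaluation; with a decay of 0.9999 they lag the live weights considerably early in training, so evaluating with them too early can look much worse than the raw model.

```python
import torch
import torch.nn as nn

class EMA:
    """Exponential moving average of model parameters.

    shadow <- decay * shadow + (1 - decay) * param, applied after every
    optimizer step; shadow weights are copied into the model only for eval.
    """
    def __init__(self, model: nn.Module, decay: float = 0.9999):
        self.decay = decay
        self.shadow = {name: p.detach().clone()
                       for name, p in model.named_parameters() if p.requires_grad}

    @torch.no_grad()
    def update(self, model: nn.Module):
        for name, p in model.named_parameters():
            if p.requires_grad:
                self.shadow[name].mul_(self.decay).add_(p, alpha=1 - self.decay)

    @torch.no_grad()
    def copy_to(self, model: nn.Module):
        for name, p in model.named_parameters():
            if p.requires_grad:
                p.copy_(self.shadow[name])

# Usage sketch:
#   ema = EMA(model)
#   after every optimizer.step():  ema.update(model)
#   before evaluation: back up the live weights, then ema.copy_to(model)
```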
@BangLiu I also hit the same problem when testing your code; I use one 1080ti card. |
@haibarasfs Currently the best performance I can get is F1 64 / EM 50 (with a long training time ...) |
@haibarasfs You mean the memory explosion? If you use a 1080ti card, then batch_size may need to be smaller... |
@InitialBug @BangLiu Sorry for my late response. I was traveling, so I hardly made any progress. Thanks for your contributions! I've started working on it again. If I make any progress I'll let you know. BTW, I added EMA. |
Not sure whether you guys have seen the Tensorflow implementation; it gets even better results than the original paper, see https://github.com/NLPLearn/QANet#results. Hope it helps! |
I have implemented a repository, QANet, mostly based on this repository and another Tensorflow implementation, Tensorflow QANet. I can reach F1 75.0 / EM 64.0 in 60,000 steps. You could take a look! |
@andy840314 What do you think may cause the 5.0-point gap between your implementation and that of Tensorflow QANet? I am also trying to reach that performance. |
@BangLiu I'm not sure, but I will try adding EMA first. |
@InitialBug @deepakkumar1984 @Jimmy880 @BangLiu @haibarasfs I contacted one of QANet's authors. They've published their model. Link: https://github.com/tensorflow/tpu/tree/master/models/experimental/qanet |
@andy840314 @hengruo I tested andy's code; it achieves F1 74.128853 / EM 62.707499 in 14 epochs, and F1 70.157651 / EM 58.185461 in 4 epochs. |
@BangLiu I have implemented QANet with EMA, also mostly based on the Tensorflow implementation. The performance is EM 67.317 / F1 76.953 (without EMA) and EM 70.155 / F1 79.432 (with EMA) after 22 epochs (2730 batches per epoch, 60,060 steps). |
@hackiey That's great! I will test it. |
I tested @hackiey's code; it reaches EM 67.515 after 18 epochs and EM 68.09 after 56 epochs, w/o EMA. |
@hengruo I think that in the train() function of your code, clip_grad_norm_ should be placed before optimizer.step()? |
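For reference, a minimal sketch of that ordering with a stand-in model and dummy data: clip_grad_norm_ rescales gradients in place, so it has to run after backward() has produced them and before optimizer.step() applies them.

```python
import torch
from torch.nn.utils import clip_grad_norm_

model = torch.nn.Linear(10, 10)                   # stand-in for the QANet model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()
x, y = torch.randn(4, 10), torch.randn(4, 10)     # dummy batch

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()                                    # gradients now exist
clip_grad_norm_(model.parameters(), max_norm=5.0)  # clip BEFORE the update
optimizer.step()                                   # applies the clipped gradients
```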
I use a Titan X GPU, but the GPU memory grows rapidly, and after 3 batches it runs out of memory.
I have checked your code line by line, and I still don't know what's wrong with it.