RAM usage keeps increasing #71
Comments
I'll try to investigate further, but there is definitely something strange in the loading. I think adding more collectgarbage() calls may help. |
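For what it's worth, a minimal sketch of what "more collectgarbage() calls" could look like inside a training loop; the trainBatch function and the tensor shape below are made-up placeholders, not the project's actual training code:

```lua
require 'torch'

-- Hypothetical stand-in for one training iteration; in the real project this
-- would be the forward/backward pass over a minibatch.
local function trainBatch()
    local batch = torch.randn(32, 1, 161, 400) -- dummy input tensor
    return batch:sum()
end

for i = 1, 100 do
    trainBatch()
    if i % 10 == 0 then
        collectgarbage() -- force a full Lua GC cycle every few batches
    end
end
print(('lua heap after training: %.1f MB'):format(collectgarbage('count') / 1024))
```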
@SeanNaren
Even after the first epoch, the allocated memory is not collected.
I don't think cutorch is the cause, because when I train without the GPU the memory also increases without limit.
I suspect:
* the LMDB lib
* the threads lib
* project-level memory management
Because I am a newbie with the Lua/Torch libraries, it is hard for me to track down the memory leak (or even to conclude that it is normal memory usage). Any suggestions of tools for debugging are welcome!
|
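One cheap way to start narrowing that down is to compare the Lua heap (what collectgarbage sees) with the process's resident set size: if RSS keeps climbing while the Lua heap stays flat, the growth is in native allocations (tensor storage, LMDB, the threads lib) rather than in Lua objects. A rough, Linux-only sketch:

```lua
require 'torch'

-- Prints the Lua heap size next to the process RSS read from /proc/self/status.
local function memoryReport(tag)
    collectgarbage()
    local luaHeapMB = collectgarbage('count') / 1024
    local rssMB = 0
    for line in io.lines('/proc/self/status') do
        local kb = line:match('^VmRSS:%s+(%d+) kB')
        if kb then rssMB = tonumber(kb) / 1024 end
    end
    print(('%s: lua heap %.1f MB, process RSS %.1f MB'):format(tag, luaHeapMB, rssMB))
end

memoryReport('startup')
local t = torch.randn(1000, 1000) -- tensor storage is a native allocation: visible in RSS, barely in the Lua heap
memoryReport('after allocating a tensor')
```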
Sorry for not being active on this; it's a major issue that I ran into myself when training (hard to replicate, but eventually the memory does run out). I will try to see if there is a test I could use to verify this leak. |
I observed that memory usage eventually converges, so I was able to complete training with about 30 GB of swap.
That might help you pinpoint where the memory usage is increasing.
|
I had this issue as well. Is it confirmed that the memory leak happens on the CPU as well? I remember having some memory leaking in CUDA for this project https://github.com/mtanana/torchneuralconvo and the fix was not intuitive... but if the leak is on the CPU, it could be one of those libraries... If anyone has ideas let me know, but I might try to narrow down the source in the next couple of days. Awesome project by the way @SeanNaren |
collectgarbage() isn't doing the trick... I remember this was the case with my bug as well... I'll keep looking |
Wow...no memory leak when I iterate over data with the loader, even when I cuda the input. Might really be a memory leak in the model libraries |
That's concerning; thanks for taking the time to investigate this... A few tests that would help narrow the problem down:
|
Yeah- good call breaking it down that way. I'll let you know what I find |
Wow... it's in the forward step, before adding any of the others... I cuda'd the inputs over many iterations and no issue... |
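To make that kind of isolation check concrete, here is a rough sketch: cuda the inputs and hammer just the forward pass while watching Lua and GPU memory. The little conv model is a placeholder, not the DeepSpeech model.

```lua
require 'cunn'

-- Placeholder model; the real check would use the project's model and loader.
local model = nn.Sequential()
model:add(nn.SpatialConvolution(1, 8, 5, 5))
model:add(nn.ReLU(true))
model:cuda()

for i = 1, 500 do
    local input = torch.randn(8, 1, 64, 64):cuda() -- dummy batch copied to the GPU
    model:forward(input)
    if i % 50 == 0 then
        collectgarbage()
        local freeMem = cutorch.getMemoryUsage(1) -- free bytes on device 1
        print(('iter %d: lua heap %.1f MB, GPU free %.1f MB'):format(
            i, collectgarbage('count') / 1024, freeMem / 1024 / 1024))
    end
end
```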
Pretty sure it has to do with the convolutions inside of the nn.Sequential(). It didn't solve the leak... but I did find some memory savings, and a major speed improvement, by switching the convolutions to cudnn (a short sketch of the swap follows the next comment):
This was based on a post from the Torch nn folks: "It is because of the nn.SpatialConvolution. We compute the convolution using a Toeplitz matrix (https://en.wikipedia.org/wiki/Toeplitz_matrix), so unfolding the input takes quite a bit of extra memory. If you want to keep the memory down, use cudnn.SpatialConvolution from the cudnn package." |
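For anyone following along, a minimal sketch of the nn-to-cudnn swap; the layer sizes here are invented, and cudnn.convert is simply the general-purpose way to do the replacement:

```lua
require 'cunn'
require 'cudnn'

-- Made-up layer sizes; the real ones live in the project's model definition.
local model = nn.Sequential()
model:add(nn.SpatialConvolution(1, 32, 11, 41, 2, 2))
model:add(nn.ReLU(true))
model:cuda()

cudnn.convert(model, cudnn) -- swap supported nn modules for their cudnn equivalents in place
print(model) -- the convolution should now show up as cudnn.SpatialConvolution
```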
Haha... solved!!! (See line 15 of UtilsMultiGPU.) cudnn.fastest is set to true. @SeanNaren I'm thinking maybe I could send a pull request; let me know what you'd like. Man, that bug was getting to me... glad we have it figured out. |
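For context, the flag in question is a one-liner in the cudnn bindings: cudnn.fastest = true asks cudnn to pick the fastest convolution algorithms regardless of how much workspace memory they need, so turning it off trades some speed for a smaller workspace. Whether that trade-off should go into a pull request is exactly the open question above.

```lua
require 'cudnn'

-- With fastest = true, cudnn ignores workspace-size limits when choosing
-- convolution algorithms; false keeps the workspace (and GPU memory use) smaller.
cudnn.fastest = false
```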
@SeanNaren btw... I like the way the async loader works... that's a nice touch... glad that sucker wasn't leaking the memory. |
Never mind... managed to crash it again... I'll keep at it |
@mtanana Thanks for the work :) I didn't think it would be anything GPU related since it's taking down the RAM... But just to clarify: GPU memory usage should increase throughout the epoch (since the time steps get larger and larger), but CPU memory should not! |
Yeah... I think I'm realizing now that the batches are just getting bigger because of how the loader works: as the sequence size increases, the memory increases as well. If I permute the data before running, I get more or less constant memory usage on the GPU (the idea is sketched just below), and CPU memory isn't increasing. For others that ran into this problem, maybe try these steps and see if you still have issues:
I'll keep an eye on this thread. Tag me if you discover anything new. |
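A tiny sketch of the permutation idea, just to make it concrete; a real fix would shuffle inside the project's data loader rather than like this:

```lua
require 'torch'

-- Shuffle the sample order so batch lengths don't grow monotonically through the epoch.
local nSamples = 1000
local byLength = torch.range(1, nSamples) -- pretend these indices are sorted by utterance length
local shuffled = byLength:index(1, torch.randperm(nSamples):long())
print(shuffled:narrow(1, 1, 10)) -- first ten indices after shuffling
```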
Also, I wrote some error-catching code so that if you occasionally have a talk turn that is too long for the CUDA memory, it catches the error instead of killing the training. |
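That kind of error catching is roughly a pcall around the training step; the tiny model and batch below are placeholders, not the commenter's actual code:

```lua
require 'nn'

-- Placeholder model, criterion and batch; the real code would wrap the
-- project's forward/backward step the same way.
local model = nn.Linear(10, 2)
local criterion = nn.MSECriterion()
local inputs, targets = torch.randn(4, 10), torch.randn(4, 2)

local ok, err = pcall(function()
    local output = model:forward(inputs)
    criterion:forward(output, targets)
    model:backward(inputs, criterion:backward(output, targets))
end)

if not ok then
    print('skipping batch, training step failed: ' .. tostring(err))
    collectgarbage() -- give Lua/cutorch a chance to release whatever the failed step allocated
end
```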
I can understand why GPU memory usage increases during a batch, but why does GPU memory not go back down to where it was before? @SeanNaren |
I'm encountering this as well on GPU training. A bit of searching around reveals others are having this issue too: torch/cutorch#379. It sounds like some subtle interaction between how cutorch is implemented, changes in the NVIDIA drivers, and Linux's default memory management. The proposed solutions sound sensible, although I haven't tried them yet (and I know nothing about memory management). The suggested solutions I've seen are:
I've been thinking about another possible approach (and apologies if this sounds stupid, I'm a complete torch newbie):
Compared to pretty much everything else in ML land, this seems like something I could actually understand and implement (with a simple Python script running Torch as a subprocess). I'll try and find the time in the next week or so (although don't wait on me if you think it's a good idea and want to implement it sooner). Interested to hear your thoughts on this. |
I've tested deepspeech on a larger dataset.
The training task was eventually killed by the OS before the first epoch had finished. The cause I confirmed is steadily increasing memory usage (RAM, not GPU). (FYI, my machine has 16 GB of RAM.) The training script exceeds the memory limit, fills up the swap space, and is finally killed by the OS.
I trained on the GPU, but far more memory is used in RAM than in GPU memory.
I don't think this is caused by the sizes of the input source files: when I reverse-sorted the dataset and trained again, RAM usage was similar in both cases.
I think this memory leak (my guess) comes from the data-loading section when there are many files. And judging from the successful trainings of others, the held memory gets collected after each epoch completes.
Could you comment on where this might be occurring?