RAM usage keeps increasing #71

Open
chanil1218 opened this issue Nov 28, 2016 · 19 comments

@chanil1218

I've tested deepspeech on a larger dataset.
I found that the training task was killed by the OS before the first epoch had finished. The cause I confirmed is steadily increasing memory (not GPU, but RAM) usage. (FYI, my RAM size is 16 GB.) The training script exceeds the memory limit, fills up the swap space, and is finally killed by the OS.
I trained on the GPU, but far more memory is used in RAM than in GPU memory.
I don't think this is caused by the input file sizes, because when I reverse-sorted the dataset and trained again, RAM usage was similar in both cases.

I suspect this memory leak comes from the data loading code when there are many files. And judging from the successful trainings reported by others, the held memory seems to be collected only after each epoch has completed.

Could you comment on where this might be occurring?

@SeanNaren
Owner

I'll try to investigate further but there is definitely something strange in the loading. I think adding more collectgarbage() calls may help.
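
Something along these lines, with a heap printout so the growth can be watched (the loop shape and numBatches here are illustrative, not the repository's actual training loop):

    -- Illustrative loop shape only.
    for i = 1, numBatches do
        -- ... load batch, forward/backward, update parameters ...
        if i % 100 == 0 then
            collectgarbage()
            print(string.format('Lua heap after batch %d: %.1f MB', i, collectgarbage('count') / 1024))
        end
    end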

@chanil1218
Author

chanil1218 commented Dec 8, 2016 via email

@SeanNaren
Owner

Sorry for not being active on this; it's a major issue that I ran into myself when training (hard to replicate, but eventually the memory does run out). I'll try to see if there is a test I could use to verify this leak.

@chanil1218
Author

chanil1218 commented Jan 6, 2017 via email

@mtanana

mtanana commented Jan 12, 2017

I had this issue as well. Is it confirmed that the memory leak happens on the CPU as well? I remember having some memory leaking in CUDA for this project, https://github.com/mtanana/torchneuralconvo, and the fix was not intuitive... but if the leak is on the CPU, it could be one of those libraries...

If anyone has ideas let me know, but I might try to narrow down the source in the next couple of days.

Awesome project by the way @SeanNaren

@mtanana

mtanana commented Jan 18, 2017

collectgarbage() isn't doing the trick... I remember this was the case with my bug as well... I'll keep looking.

@mtanana

mtanana commented Jan 18, 2017

Wow... no memory leak when I iterate over the data with the loader, even when I cuda the input. It might really be a memory leak in the model libraries.

@SeanNaren
Owner

That's concerning; thanks for taking the time to investigate this...

A few tests that would help narrow the problem down:

  • Does free memory go down when we just do a forward pass on each batch? If not, then:
  • Does free memory go down when we do a forward and backward pass? If not, then:
  • Does free memory go down when we run CTC as well? If we get here, it might be the criterion.
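
A rough sketch of how those tests could be run in isolation (the loader, model, numBatches, ctcCriterion, and targets names are illustrative, and the free-memory readout via cutorch.getMemoryUsage is just one way to watch it):

    require 'cutorch'

    -- Test 1: forward only. If free memory stays flat, uncomment the backward pass,
    -- then the CTC criterion, to find the step where it starts dropping.
    for i = 1, numBatches do
        local inputs = loader:nextBatch():cuda() -- illustrative loader call
        local output = model:forward(inputs)
        -- model:backward(inputs, output:clone())              -- Test 2
        -- local loss = ctcCriterion:forward(output, targets)  -- Test 3

        if i % 50 == 0 then
            collectgarbage()
            local freeGPU = cutorch.getMemoryUsage(cutorch.getDevice())
            print(string.format('batch %d: free GPU %.1f MB, Lua heap %.1f MB',
                i, freeGPU / 1024 / 1024, collectgarbage('count') / 1024))
        end
    end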

@mtanana

mtanana commented Jan 18, 2017

Yeah, good call breaking it down that way. I'll let you know what I find.

@mtanana

mtanana commented Jan 21, 2017

Wow... it's in the forward step, before adding any of the others...

I cuda'd the inputs over many iterations with no issue...
then called self.model:forward(inputs) and the memory explodes.

@mtanana

mtanana commented Jan 21, 2017

Pretty sure it has to do with the convolutions inside of an nn.Sequential().

Didn't solve it... but I did find some memory savings, and a major speed improvement, by switching the convolutions to cudnn:

    local n = nn
    if opt.nGPU > 0 then n = cudnn end

    -- (nInputPlane, nOutputPlane, kW, kH, [dW], [dH], [padW], [padH]) conv layers
    local conv = nn.Sequential()
    conv:add(n.SpatialConvolution(1, 32, 11, 41, 2, 2))
    conv:add(n.SpatialBatchNormalization(32))
    conv:add(nn.Clamp(0, 20))
    conv:add(n.SpatialConvolution(32, 32, 11, 21, 2, 1))
    conv:add(n.SpatialBatchNormalization(32))
    conv:add(nn.Clamp(0, 20))

    local rnnInputsize = 32 * 41 -- based on the above convolutions and 16kHz audio
    local rnnHiddenSize = opt.hiddenSize -- size of RNN hidden layers
    local nbOfHiddenLayers = opt.nbOfHiddenLayers

    conv:add(nn.View(rnnInputsize, -1):setNumInputDims(3)) -- batch x features x seqLength
    conv:add(nn.Transpose({ 2, 3 }, { 1, 2 })) -- seqLength x batch x features

    local rnns = nn.Sequential()
    local rnnModule = RNNModule(rnnInputsize, rnnHiddenSize, opt)
    rnns:add(rnnModule:clone())
    rnnModule = RNNModule(rnnHiddenSize, rnnHiddenSize, opt)

    for i = 1, nbOfHiddenLayers - 1 do
        rnns:add(nn.Bottle(n.BatchNormalization(rnnHiddenSize), 2))
        rnns:add(rnnModule:clone())
    end

    local fullyConnected = nn.Sequential()
    fullyConnected:add(n.BatchNormalization(rnnHiddenSize))
    fullyConnected:add(nn.Linear(rnnHiddenSize, 29))

    local model = nn.Sequential()
    model:add(conv)
    model:add(rnns)
    model:add(nn.Bottle(fullyConnected, 2))
    model:add(nn.Transpose({ 1, 2 })) -- batch x seqLength x features

This was based on a post from the torch nn folks:

It is because of the nn.SpatialConvolution. We compute the convolution using a Toeplitz matrix. So unfolding the input takes quite a bit of extra memory.

https://en.wikipedia.org/wiki/Toeplitz_matrix

If you want to keep the memory down, use cudnn.SpatialConvolution from the cudnn package:
https://github.com/soumith/cudnn.torch
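
For reference, cudnn.torch can also convert an existing nn model in place instead of rewriting the model definition; a minimal sketch (the toy model here is illustrative):

    require 'nn'
    require 'cunn'
    require 'cudnn'

    -- Toy example; in practice this would be the conv front-end of the model above.
    local model = nn.Sequential()
    model:add(nn.SpatialConvolution(1, 32, 11, 41, 2, 2))
    model:add(nn.SpatialBatchNormalization(32))
    model:cuda()

    cudnn.convert(model, cudnn) -- swaps nn modules for their cudnn counterparts in place
    print(model)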

@mtanana

mtanana commented Jan 21, 2017

Haha... solved!!!
From the cudnn documentation:

by default, cudnn.fastest is set to false. You should set it to true if memory is not an issue, and you want the fastest performance

(See line 15 of UtilsMultiGPU, where cudnn.fastest is set to true.)

@SeanNaren I'm thinking maybe I could send a pull request with

  1. an option to turn 'fastest' on and off (a rough sketch is below), and
  2. some code to always turn on cudnn for the convolutions if there is a GPU.

Let me know what you'd like.

Man that bug was getting to me...glad we have it figured out.
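
A rough sketch of what option 1 might look like (the -cudnnFastest flag name is hypothetical, not an existing option in this repo):

    require 'torch'

    local cmd = torch.CmdLine()
    cmd:option('-nGPU', 1, 'number of GPUs to use')
    -- Hypothetical flag name, illustrative only.
    cmd:option('-cudnnFastest', false, 'trade extra GPU memory for the fastest cudnn kernels')
    local opt = cmd:parse(arg)

    if opt.nGPU > 0 then
        require 'cunn'
        require 'cudnn'
        cudnn.fastest = opt.cudnnFastest -- instead of hard-coding cudnn.fastest = true
        cudnn.benchmark = opt.cudnnFastest
    end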

@mtanana

mtanana commented Jan 21, 2017

@SeanNaren btw... I like the way the async loader works... that's a nice touch... glad that sucker wasn't leaking the memory.

@mtanana

mtanana commented Jan 21, 2017

Never mind... managed to crash it again... I'll keep at it.

@SeanNaren
Owner

@mtanana Thanks for the work :) I didn't think it would be anything GPU-related since it's taking down system RAM... But just to clarify, GPU memory usage should increase throughout the epoch (since the time steps get longer and longer), but CPU memory should not!

@mtanana

mtanana commented Jan 21, 2017

Yeah... I think I'm realizing now that the size of the batches is just increasing because of how the loader works. As the sequence length increases, the memory usage increases as well. If I permute before running, I get more or less constant memory usage on the GPU, and the CPU usage isn't increasing. For others that ran into this problem, maybe try these steps and see if you still have issues:

  1. Comment out the "fastest" line (line 15 of UtilsMultiGPU).
  2. Move the permute line ( if self.permuteBatch then self.indexer:permuteBatchOrder() end ) to right after for i = 1, epochs do and run the training command with -permuteBatch (a sketch of this change is below). If you run out of memory in the first few iterations, your model is just too big for the memory.
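
A sketch of step 2, assuming the surrounding training method has roughly this shape (names are illustrative):

    for i = 1, epochs do
        -- Permute the batch order at the start of every epoch so batches are not
        -- always processed from shortest to longest utterance.
        if self.permuteBatch then self.indexer:permuteBatchOrder() end
        for j = 1, self.numBatches do
            -- load batch, forward/backward, update parameters ...
        end
    end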

I'll keep an eye on this thread. Tag me if you discover anything new.

@mtanana

mtanana commented Jan 21, 2017

Also, I wrote some error-catching code so that if you occasionally have a talk turn that is too long for the CUDA memory, it catches the error instead of killing the training.
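
Something like this (a sketch only; model, criterion, inputs, and targets come from the surrounding training loop):

    -- Skip a batch that blows GPU memory instead of dying.
    local ok, err = pcall(function()
        local output = model:forward(inputs)
        local loss = criterion:forward(output, targets)
        local gradOutput = criterion:backward(output, targets)
        model:backward(inputs, gradOutput)
    end)
    if not ok then
        print('Skipping batch (likely too long for GPU memory): ' .. tostring(err))
        collectgarbage()
    end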

@fanlamda

I can understand why GPU memory usage increases during a batch, but why doesn't GPU memory go back down to as low as it was before? @SeanNaren

@markmuir87

markmuir87 commented Feb 6, 2017

I'm encountering this as well on GPU training. Doing a bit of searching around reveals others are having this issue too: torch/cutorch#379. It sounds like some subtle interaction between how cutorch is implemented, changes in NVIDIA drivers, and Linux's default memory management. The proposed solutions sound sensible, although I haven't tried them yet (and I know nothing about memory management). The suggested solutions I've seen are:

  • Force Torch to use jemalloc for memory management, since it's apparently more aggressive about releasing memory:

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so th Train.lua

  • Zero all variables (i.e. assign nil) when they are no longer needed at the end of each batch iteration, then call collectgarbage() (a sketch is below). The behaviour I've observed seems consistent with bits of memory not being deallocated and carrying over to the next iterations (i.e. memory usage gradually climbs until OOM).
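
A sketch of that second suggestion (illustrative names, not code from this repo):

    for i = 1, numBatches do
        local inputs, targets = loader:nextBatch() -- illustrative loader call
        -- ... forward/backward/update ...

        -- Drop references so the tensors become collectable, then force a GC pass.
        inputs, targets = nil, nil
        collectgarbage()
    end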

I've been thinking about another possible approach (and apologies if this sounds stupid, I'm a complete torch newbie):

  • Chop up your training sets and test sets into manageable sizes. Then just write a script that trains 1 epoch per set, exits, reloads from the saved model, and moves on to the next set.
  • I'm very new to machine learning, but I've been wondering if this might be a way to prevent over-fitting and reduce the probability of the model converging on an 'inescapable' local (but not global) minimum for the loss function. You'd have to have at least two separate (read: uncorrelated) datasets chopped up and mixed in together (and probably randomly shuffled every 'round').

Compared to pretty much everything else in ML land, this seems like something I could actually understand and implement (with a simple Python script running Torch as a subprocess; a rough sketch of the driver loop is below). I'll try and find the time in the next week or so (although don't wait on me if you think it's a good idea and want to implement it sooner).
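
A rough sketch of the driver loop, written in Lua with os.execute for consistency with the rest of this thread (the Train.lua flags and paths here are hypothetical placeholders, not real options):

    -- Driver loop: train one epoch per data chunk, reloading the saved model each time.
    -- The Train.lua flags below are hypothetical placeholders.
    local chunks = { 'data/chunk1', 'data/chunk2', 'data/chunk3' }
    local modelPath = 'deepspeech_checkpoint.t7'
    for round = 1, 10 do
        for _, chunk in ipairs(chunks) do
            local cmd = string.format(
                'th Train.lua -epochs 1 -trainingSetPath %s -loadPath %s -savePath %s',
                chunk, modelPath, modelPath)
            print('Running: ' .. cmd)
            local status = os.execute(cmd)
            -- os.execute returns 0 (Lua 5.1/LuaJIT) or true (Lua 5.2+) on success
            assert(status == 0 or status == true, 'training run failed for ' .. chunk)
        end
    end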

Interested to hear your thoughts on this.
