
Performance drop after loading a pretrained model. #69

Open
maciejkorzepa opened this issue Nov 22, 2016 · 15 comments

@maciejkorzepa

I am training the model on the 1000h dataset and decided to split the training into several runs so that only 10 epochs are processed in each (I need to share the cluster). By doing so I experienced a drop in WER between the last epoch of one run (WER < 15%) and the first epoch of the next run (WER > 17%), where I load the model saved at the end of the previous run. In the second run the WER got down to 12.8%, but when the third run was started, the WER after the first epoch was 14.4%. I tried disabling batch sorting in the first epoch when a pretrained model is loaded, but it didn't help. What could be the cause of this problem?
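For context, the resume step between runs boils down to saving a checkpoint at the end of one run and loading it at the start of the next. A minimal sketch of what that can look like (placeholder names, not the repo's exact Train.lua code), keeping the optimizer state alongside the weights so the next run continues with the same learning rate / momentum:

```lua
-- Minimal sketch of a save/resume step between runs (placeholder names,
-- not the repo's actual checkpoint code).
require 'torch'

-- at the end of a run:
torch.save(('checkpoint_epoch%d.t7'):format(epoch), {
    model      = model:clearState(),  -- drop intermediate buffers before saving
    optimState = optimState,          -- e.g. learningRate, momentum, evalCounter
    epoch      = epoch
})

-- at the start of the next run:
local checkpoint = torch.load('checkpoint_epoch10.t7')
model      = checkpoint.model
optimState = checkpoint.optimState   -- resume annealing/momentum where they left off
local startEpoch = checkpoint.epoch + 1
```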

@nn-learner commented Dec 2, 2016

@maciejkorzepa Have you been able to get the WER down to 15%? If so, do you mind sharing your weights? My current desktop has weaker GPUs, so I have only been able to get as low as 38% after a long time training.

@maciejkorzepa
Author

@nn-learner

@maciejkorzepa thank you! Sorry, but do you also mind sharing the network architecture that you used? I assume you used 1280 hidden nodes, 7 layers, and a traditional bidirectional RNN?

@maciejkorzepa
Author

I didn't make a single change in the architecture, so I am using the default sizes.

@SeanNaren
Owner

Been on a bit of a hiatus but am back now! @maciejkorzepa thanks so much for this, I'm going to download and check the model. Did you just do the LibriSpeech setup and then run th Train.lua without modifying parameters? Did you run into any inf costs? What hardware did you run this on?

@maciejkorzepa
Author

Actually, I changed some parameters and code. I used everything apart from test-clean and test-other for training; the 12% error rate was achieved on test-clean. I got -inf costs in almost every epoch, but as the WER kept going down, I didn't think it was a big problem. As for the parameters, maxNorm was set to 100. I ran on 2x K80 with batch size 64; one epoch took almost 8 hours. The change I made in the code was decreasing the batch size for the last few hundred utterances, as some of them are considerably longer than the rest. So instead of setting a batch size of ~30-40 for the whole training, which would slow it down, I just switched to a smaller batch size (24) at the end of an epoch (this is only needed for the first epoch, when the data is sorted by length); a rough sketch of this is at the end of this comment. In the DS2 paper, Baidu mention how they handle long utterances that lead to out-of-memory errors:

and sometimes very deep networks can exceed the GPU memory capacity when processing long utterances. This can happen unpredictably, especially when the distribution of utterance lengths includes outliers, and it is desirable to avoid a catastrophic failure when this occurs. When a requested memory allocation exceeds available GPU memory, we allocate page-locked GPU-memory-mapped CPU memory using cudaMallocHost instead. This memory can be accessed directly by the GPU by forwarding individual memory transactions over PCIe at reduced bandwidth, and it allows a model to continue to make progress even after encountering an outlier.

This solution allows a much bigger batch size to be used and thus speeds up training. I am wondering whether implementing this in your model would be feasible.

On the other hand, I found that increasing the batch size is not desirable when it comes to WER. The smaller the batch size, the noisier the gradients, which helps in getting out of local minima. I did some tests on LibriSpeech 100h and, as far as I remember, for batch size 75 the WER stabilized at 58%, for batch size 40 at 52%, and for batch size 12 at 42%. I didn't try decreasing the batch size with 1000h, as 8 hours per epoch was already long enough for me :)
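For reference, the batch-size change mentioned above looks roughly like this (hypothetical names such as loader:getBatch and trainBatch, not the actual patch):

```lua
-- Rough sketch of the batch-size reduction for the long tail of the first,
-- length-sorted epoch: the longest utterances sit at the end, so the last
-- few hundred indices get a smaller batch size.
local batchSize      = 64
local smallBatchSize = 24
local tailStart      = numSamples - 500  -- roughly the last few hundred utterances

local i = 1
while i <= numSamples do
    local size = (i >= tailStart) and smallBatchSize or batchSize
    local lastIndex = math.min(i + size - 1, numSamples)
    local inputs, targets = loader:getBatch(i, lastIndex)  -- hypothetical loader call
    trainBatch(inputs, targets)                            -- hypothetical training step
    i = lastIndex + 1
end
```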

@SeanNaren
Owner

@maciejkorzepa you're awesome thanks so much for this!

Honestly, the issue here is that Torch isn't as memory efficient (especially the RNNs) as Baidu's internal code. But as you said, it's still trainable, it just takes forever.

I'll download the model and do some checks / update the documentation. Are you fine with me using this as the pre-trained network for LibriSpeech? I do have an LSTM-based network training which is much smaller, but it's hovering around ~20 WER; a WER of 12 is awesome!

@maciejkorzepa
Author

@SeanNaren Sure, go ahead! I think I might be able to use 4x K80 for the training soon; I might then try reducing the batch size and see if the WER can get any lower...

@nn-learner

@maciejkorzepa Sorry, I have a question. I was trying to load your weights, but they scored a 99% WER on test-clean. Are there specific parameters that you used with Test.lua?

@maciejkorzepa
Author

@nn-learner Maybe your input spectrograms were processed with different parameters? I used:
-windowSize 0.02 -stride 0.01 -sampleRate 16000 -processes 8 -audioExtension flac
I actually didn't manage to run Test.lua due to some error (I don't remember exactly what it was), but my project group tried running Predict.lua with some samples from test-clean and most of the transcriptions were perfect, with only a few having very minor errors (e.g. 'I have' instead of 'I've'), so I assumed that the ~12% WER calculated during validation in Train.lua was realistic.
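For reference, the WER figures quoted here are word-level edit distance divided by the number of words in the reference. A minimal pure-Lua sketch (not the repo's implementation):

```lua
-- Word error rate = (substitutions + insertions + deletions) / reference words.
local function split(s)
    local words = {}
    for w in s:gmatch('%S+') do table.insert(words, w) end
    return words
end

local function editDistance(ref, hyp)
    local d = {}
    for i = 0, #ref do d[i] = {}; d[i][0] = i end
    for j = 0, #hyp do d[0][j] = j end
    for i = 1, #ref do
        for j = 1, #hyp do
            local cost = (ref[i] == hyp[j]) and 0 or 1
            d[i][j] = math.min(d[i-1][j] + 1,        -- deletion
                               d[i][j-1] + 1,        -- insertion
                               d[i-1][j-1] + cost)   -- substitution
        end
    end
    return d[#ref][#hyp]
end

local function wer(reference, hypothesis)
    local ref, hyp = split(reference), split(hypothesis)
    return editDistance(ref, hyp) / #ref
end

print(wer("i have a dream", "i've a dream"))  -- 0.5: one substitution + one deletion over 4 words
```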

@suhaspillai

@maciejkorzepa On what basis did you choose 100 for maxNorm? I went through the gradient clipping paper (http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf); they suggest taking the average norm across many updates and choosing half to ten times that value. So I just wanted to know: did you take the average norm over one complete epoch or over multiple epochs and then settle on 100 as maxNorm (though multiple epochs don't make much sense, because the weights would already be tuned to the data and the norm would be smaller), or did you try different values and 100 worked out?
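For reference, the maxNorm clipping being discussed rescales the whole gradient vector whenever its L2 norm exceeds the threshold. A minimal sketch, assuming gradParameters is the flattened gradient tensor from model:getParameters():

```lua
-- Gradient norm clipping (Pascanu et al.): if the gradient norm exceeds
-- maxNorm, scale the gradient down so its norm equals maxNorm.
local maxNorm = 100

local gradNorm = gradParameters:norm()
if gradNorm > maxNorm then
    gradParameters:mul(maxNorm / gradNorm)
end
```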

@maciejkorzepa
Author

@suhaspillai To be honest, I set it to 100 after reading posts from issue #51 and I haven't tried other values since then.

@suhaspillai

Okay, thanks!

@shantanudev

If I am trying to get this eventually working with live audio recordings, do you think it's a good idea to add noise to the data and train it again?

@SeanNaren
Owner

@shantanudev, it is a good idea to insert noise into your training data and train on that; however, the repo doesn't currently support this.
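For what it's worth, a simple form of this is additive background noise mixed into the waveform before the spectrogram step. A rough sketch (not part of the repo; the clean and noise tensors are assumed to be loaded elsewhere):

```lua
-- Additive-noise augmentation sketch: mix a random snippet of background
-- noise into the clean waveform at a random level. 'clean' and 'noise' are
-- assumed to be 1D sample tensors at the same sample rate, with the noise
-- recording at least as long as the utterance.
require 'torch'

local function addNoise(clean, noise, minScale, maxScale)
    local scale   = minScale + torch.uniform() * (maxScale - minScale)
    local offset  = math.random(1, noise:size(1) - clean:size(1) + 1)
    local snippet = noise:narrow(1, offset, clean:size(1))
    return clean + snippet * scale
end

-- e.g. mix noise in at 5-30% of its original amplitude
local augmented = addNoise(clean, noise, 0.05, 0.30)
```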
