Performance drop after loading a pretrained model. #69
@maciejkorzepa Have you been able to get the WER down to 15%? If so, do you mind sharing your weights? My current desktop has weaker GPUs, so I have only been able to get as low as 38% after a long time training.
Sure. https://drive.google.com/drive/folders/0BwyHNzMwVsM-SXdpdlduTlVTT2c?usp=sharing
@maciejkorzepa thank you! Sorry, but do you also mind sharing the network architecture that you used? I assume you used 1280 hidden nodes, 7 layers, and a traditional bidirectional RNN?
I didn't make a single change in the architecture, so I am using the default sizes.
Been on a bit of a hiatus but am back now! @maciejkorzepa thanks so much for this, I'm going to download and just check the model. Did you just do the LibriSpeech setup, and then hit
Actually, I changed some parameters and code. For training I used everything apart from test-clean and test-other; a 12% error rate was achieved on test-clean. I got -inf costs in almost every epoch, but since the WER kept going down, I didn't think it was a big problem. As for the parameters, maxNorm was set to 100. I ran on 2x K80 with batch size 64, and one epoch took almost 8 hours. The change I made in the code was decreasing the batch size for the last few hundred utterances, as some of them are considerably bigger than the rest. So instead of setting the batch size to ~30-40 for the whole training, which would slow training down, I just switched to a smaller batch size (24) at the end of an epoch (this matters for the first epoch, when the data is sorted by length). In the DS2 paper, Baidu mentions how they handle long utterances leading to out-of-memory errors:
This solution allows using a much bigger batch size and thus speeds up training. I am wondering whether implementing this in your model would be feasible. On the other hand, I found that increasing the batch size is not desirable when it comes to WER. The smaller the batch size, the noisier the gradients, which helps in getting out of local minima. I did some tests on LibriSpeech 100h and, as far as I remember, for batch size 75 the WER stabilized at 58%, for batch size 40 at 52%, and for batch size 12 at 42%. I didn't try decreasing the batch size with 1000h, as 8 hours per epoch was already long enough for me :)
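For reference, here is a minimal sketch of the workaround described above: keep the normal batch size for most of the duration-sorted data and switch to a smaller one only for the long tail. All names here (`makeBatches`, `sortedUtterances`, `tailStart`) are illustrative and not part of the repo.

```lua
-- Illustrative sketch only: group duration-sorted utterances into batches,
-- switching to a smaller batch size for the longest tail to avoid
-- out-of-memory errors without shrinking the batch size globally.
local function makeBatches(sortedUtterances, normalSize, tailSize, tailStart)
    local batches = {}
    local i = 1
    while i <= #sortedUtterances do
        -- Use the smaller batch size once we reach the long-utterance tail.
        local size = (i >= tailStart) and tailSize or normalSize
        local batch = {}
        for j = i, math.min(i + size - 1, #sortedUtterances) do
            table.insert(batch, sortedUtterances[j])
        end
        table.insert(batches, batch)
        i = i + size
    end
    return batches
end
```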
@maciejkorzepa you're awesome, thanks so much for this! Honestly the issue here is that Torch isn't as memory efficient (especially the RNNs) as Baidu's internal code. But as you said, it's still trainable, it just takes forever. I'll download the model and do some checks/update the documentation. Are you fine with me using this as the pre-trained network for LibriSpeech? I do have an LSTM-based network training which is much smaller, but it's hovering around ~20 WER; a WER of 12 is awesome!
@SeanNaren Sure, go ahead! I think I might be able to use 4x K80 for training soon; I might then try reducing the batch size and see if the WER can get any lower...
@maciejkorzepa Sorry, I have a question. I was trying to load your weights, but they scored a 99% WER on test-clean. Are there specific parameters that you used with test.lua?
@nn-learner Maybe your input spectrograms were processed with different parameters? I used:
@maciejkorzepa On what basis did you choose 100 for maxNorm? I went through the max-norm paper (http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf). They mention taking the average norm across many updates and choosing half or ten times that value. So I just wanted to know: did you take the average norm over one complete epoch or over multiple epochs and then settle on 100 as maxNorm (though multiple epochs doesn't make much sense, because the weights would be tuned to the data and the norm would be smaller), or did you try different values for maxNorm and 100 worked out?
@suhaspillai To be honest, I set it to 100 after reading the posts in issue #51, and I haven't tried other values since then.
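For context, clipping the gradient by max norm amounts to rescaling the flattened gradient tensor whenever its L2 norm exceeds the threshold. A minimal Torch sketch, assuming `gradParams` is the flattened gradient tensor returned by `model:getParameters()`:

```lua
-- Minimal sketch of max-norm gradient clipping; maxNorm = 100 is the value
-- discussed above. gradParams is assumed to come from model:getParameters().
local maxNorm = 100
local gradNorm = gradParams:norm()
if gradNorm > maxNorm then
    -- Rescale so the gradient's L2 norm equals maxNorm.
    gradParams:mul(maxNorm / gradNorm)
end
```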
Okay, thanks.
If I am trying to eventually get this working with live audio recordings, do you think it's a good idea to add noise to the data and train it again?
@shantanudev It is a good idea to insert noise into your training data and train on that; however, the repo doesn't currently support this.
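Since the repo doesn't support this out of the box, one rough way to experiment is to mix noise into the waveform before feature extraction. A hedged sketch, assuming the audio is already loaded as a Torch tensor; `addNoise` and `noiseLevel` are made up for illustration and are not part of the repo:

```lua
require 'torch'

-- Illustrative only: add zero-mean Gaussian noise to a waveform tensor.
-- In practice, mixing in real background recordings at varying levels
-- tends to work better than pure white noise.
local function addNoise(waveform, noiseLevel)
    local noise = waveform:clone():normal(0, noiseLevel)
    return waveform + noise
end
```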
I am training the model on the 1k-hour dataset and decided to split the training into several runs, each processing only 10 epochs (I need to share the cluster). By doing so I experienced a jump in WER between the last epoch of one run (WER < 15%) and the first epoch of the next run (WER > 17%), where I load the model from the last epoch of the previous run. In the second run the WER got down to 12.8%, but when the third run was started, the WER after the first epoch was 14.4%. I tried disabling batch sorting in the first epoch when a pretrained model is loaded, but it didn't help. What could be the cause of this problem?
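One thing worth checking, offered only as a guess: if the checkpoint stores just the model weights, anything kept outside the model (such as the annealed learning rate or the optimizer's momentum buffers) restarts from scratch on every run. A hedged sketch of carrying that state across runs with `torch.save`/`torch.load`; the checkpoint layout below is made up for illustration and may differ from the repo's own saving code:

```lua
require 'torch'

-- Illustrative checkpoint layout only. Saving the optimizer state and epoch
-- alongside the model lets a resumed run continue where the last one stopped
-- instead of restarting with fresh optimizer state.
local function saveCheckpoint(path, model, optimState, epoch)
    torch.save(path, { model = model, optimState = optimState, epoch = epoch })
end

local function loadCheckpoint(path)
    local checkpoint = torch.load(path)
    return checkpoint.model, checkpoint.optimState, checkpoint.epoch
end
```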