-
It depends. Do you plan to use Riva eventually, or just NeMo? Do you need the time-stamps or not? Most of the current N-gram LM scripts are based on the deepspeech decoders, while you can still use pyctcdecode with NeMo models. The current wrapper of the deepspeech decoder used in NeMo does not return timestamps, so we used pyctcdecode for that feature (see the sketch below). We may move completely to pyctcdecode for all the scripts in NeMo; in the meantime, you may use both decoders. BTW, the deepspeech decoder also supports ARPA files; we just delete the ARPA file at the end of the training script to save space. The other drawback of this decoder is that the N-gram models are dependent on the tokenizer, so you cannot use the same model if the tokenizers are different.
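To give a rough idea of the timestamp feature, a pyctcdecode call looks something like this. A minimal sketch only: the vocabulary, LM path, and logits here are placeholders, not NeMo defaults, and the exact beam tuple layout may differ between pyctcdecode versions.

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# Hypothetical character vocabulary in the model's output order; pyctcdecode
# treats the extra last logits column as the CTC blank.
labels = [" ", "a", "b", "c"]
decoder = build_ctcdecoder(labels, kenlm_model_path="lm.arpa")  # .arpa or .bin

# Stand-in log-probabilities of shape [time, vocab + blank]; in practice these
# come from the acoustic model.
logits = np.log(np.full((50, len(labels) + 1), 1.0 / (len(labels) + 1)))

# decode_beams carries per-word frame spans, which is where the time-stamps
# come from; the top beam is (text, lm_state, word_frames, logit_score, lm_score).
text, _, word_frames, _, _ = decoder.decode_beams(logits)[0]
for word, (start_frame, end_frame) in word_frames:
    print(word, start_frame, end_frame)  # multiply by frame stride for seconds
```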
-
Of course I would prefer all paths to use the same type of LM, preferably one that is model (tokenizer) independent. And yes, I would like to keep the options open to either use Riva or just NeMo, and time-stamps do come in handy. If I understood correctly, there are currently three paths: the NeMo deepspeech decoder, NeMo pyctcdecode, and Riva. Deepspeech uses tokenizer-dependent LMs and is incompatible with pyctcdecode and Riva. The pyctcdecode and Riva paths are compatible and both use tokenizer-independent LMs, which can be built with standard recipes using outside tools. Furthermore, these two paths can also provide time-stamps, whereas the deepspeech one cannot. If this is the case, I vote for the team to move completely to pyctcdecode for all scripts in NeMo, as there are more gains than drawbacks: a tokenizer-independent LM can be used with any model and does not need to be rebuilt whenever the tokenizer changes, and time-stamps, if not required in the end result, can simply be ignored.
-
When doing decoding it is often mentioned that an LM (either an n-gram or a neural model) will improve accuracy. Browsing through the NeMo codebase and the Riva documentation for n-gram LMs, I am unsure how the LM should be prepared. Riva, for example, accepts ARPA, CARPA, and binary KenLM formats. For Riva 1.8.0 no special requirements are listed, so one can assume that a standard approach to preparing the LM works, and tools like SRILM, POCOLM, and KenLM can be used as long as the format is correct. A standard recipe along these lines is sketched below.
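For reference, this is the kind of standard recipe I have in mind. The n-gram order, paths, and the assumption that KenLM's lmplz/build_binary are on PATH are mine.

```python
# Minimal sketch of a standard word-level LM build with KenLM; corpus.txt is a
# placeholder plain-text corpus, one sentence per line.
import subprocess

# 4-gram ARPA model (the order is an arbitrary choice here).
subprocess.run(
    ["lmplz", "-o", "4", "--text", "corpus.txt", "--arpa", "lm.arpa"],
    check=True,
)

# Optional binarization; the binary loads faster and is smaller than the ARPA.
subprocess.run(["build_binary", "lm.arpa", "lm.bin"], check=True)
```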
The NeMo codebase contains a dedicated scripts section for n-gram LMs. The scripts make use of the OpenSeq2Seq ctc-decoders, for which a separate install shell script is provided. The LM creation scripts produce only binary KenLM models, and always character-based ones: in the case of a BPE model, the corpus is pre-tokenized and the token ids (shifted by a predefined offset) are interpreted as characters, over which the LM is constructed (roughly as in the sketch below). So there is no ARPA support.
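As far as I can tell from those scripts, the encoding works roughly like this. The offset value and the tokenizer interface are my reading of the code, not something I have verified.

```python
# Sketch of the character-encoding trick the NeMo scripts apply before KenLM
# training; 100 is an illustrative offset, the scripts define their own value.
TOKEN_OFFSET = 100

def encode_line(line: str, tokenizer) -> str:
    # tokenizer is assumed to expose text_to_ids(), as NeMo tokenizers do.
    token_ids = tokenizer.text_to_ids(line)
    # Each token id becomes one unicode character, so KenLM sees a
    # "character-level" corpus whose alphabet is the tokenizer vocabulary.
    return "".join(chr(i + TOKEN_OFFSET) for i in token_ids)

# The encoded corpus is then fed to KenLM as usual; the resulting LM is tied
# to this exact tokenizer, which is why it cannot be reused across models.
```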
Going through the codebase one can also spot the recently added decoder_timestamps_utils, which is built around pyctcdecode. pyctcdecode can load an n-gram LM in either KenLM or ARPA format. There is, however, no documentation on how the LM should be prepared. Should the earlier-mentioned scripts be used? What changes when deploying to Riva? Can the same models be used? Based on the pyctcdecode README, I would expect usage roughly like the sketch below.
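This is my guess at the intended usage, adapted from the pyctcdecode README; the model name, LM path, and alpha/beta values are placeholders on my side, not documented NeMo usage.

```python
import nemo.collections.asr as nemo_asr
from pyctcdecode import build_ctcdecoder

# Placeholder pretrained character-based CTC model.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")

# logprobs=True returns the raw [time, vocab+blank] log-probability matrix.
logits = asr_model.transcribe(["sample.wav"], logprobs=True)[0]

decoder = build_ctcdecoder(
    asr_model.decoder.vocabulary,  # character labels in model output order
    kenlm_model_path="lm.bin",     # word-level KenLM binary or ARPA, as above
    alpha=0.5,                     # LM weight, placeholder value
    beta=1.0,                      # word insertion bonus, placeholder value
)
print(decoder.decode(logits))
```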
Could someone please shed some light on how all of this works together, and which approach to preparing and using an n-gram LM is the correct one?