-
It depends. Do you plan to use Riva eventually, or just NeMo? Do you need the time-stamps or not? Most of the current N-gram LM scripts are based on the deepspeech decoders, while you can still use pyctcdecode with NeMo models. The current wrapper of the deepspeech decoder used in NeMo does not return timestamps, so we used pyctcdecode for that feature (see the sketch below). We may move completely to pyctcdecode for all the scripts in NeMo; in the meantime, you may use both decoders. BTW, the deepspeech decoder also supports ARPA files; we just delete the ARPA file at the end of the training script to save space. The other drawback of this decoder is that the N-gram models are dependent on the tokenizer, so you cannot use the same model if the tokenizers are different.
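To give a rough idea of the timestamp feature, a pyctcdecode call looks something like this. A minimal sketch only: the vocabulary, LM path, and logits here are placeholders, not NeMo defaults, and the exact beam tuple layout may differ between pyctcdecode versions.

```python
import numpy as np
from pyctcdecode import build_ctcdecoder

# Hypothetical character vocabulary in the model's output order; pyctcdecode
# treats the extra last logits column as the CTC blank.
labels = [" ", "a", "b", "c"]
decoder = build_ctcdecoder(labels, kenlm_model_path="lm.arpa")  # .arpa or .bin

# Stand-in log-probabilities of shape [time, vocab + blank]; in practice these
# come from the acoustic model.
logits = np.log(np.full((50, len(labels) + 1), 1.0 / (len(labels) + 1)))

# decode_beams carries per-word frame spans, which is where the time-stamps
# come from; the top beam is (text, lm_state, word_frames, logit_score, lm_score).
text, _, word_frames, _, _ = decoder.decode_beams(logits)[0]
for word, (start_frame, end_frame) in word_frames:
    print(word, start_frame, end_frame)  # multiply by frame stride for seconds
```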
-
Of course I would prefer all paths to use the same type of LM, preferably one that is model (tokenizer) independent. And yes, I would like to keep the options open to either use Riva or just NeMo, and time-stamps do come in handy. If I understood correctly, there are currently three paths: the NeMo deepspeech decoder, NeMo pyctcdecode, and Riva. Deepspeech uses tokenizer-dependent LMs and is incompatible with pyctcdecode and Riva. The pyctcdecode and Riva paths are compatible and both use tokenizer-independent LMs, which can be built with standard recipes using outside tools. Furthermore, these two paths can also provide time-stamps, whereas the deepspeech one cannot. If this is the case, I vote for the team to move completely to pyctcdecode for all scripts in NeMo, as there are more gains than drawbacks: a tokenizer-independent LM can be used with any model and does not need to be rebuilt whenever the tokenizer changes, and time-stamps, if not required in the end result, can simply be ignored.
-
When doing decoding it is often mentioned that an LM (either an n-gram or a neural model) will improve accuracy. Browsing through the NeMo codebase and the Riva documentation for n-gram LMs, I am unsure how the LM should be prepared. Riva, for example, accepts ARPA, CARPA, and binary KenLM formats. For Riva 1.8.0 no special requirements are listed, so one can assume that a standard approach to preparing the LM works, and tools like SRILM, POCOLM, and KenLM can be used as long as the format is correct. A standard recipe along these lines is sketched below.
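For reference, this is the kind of standard recipe I have in mind. The n-gram order, paths, and the assumption that KenLM's lmplz/build_binary are on PATH are mine.

```python
# Minimal sketch of a standard word-level LM build with KenLM; corpus.txt is a
# placeholder plain-text corpus, one sentence per line.
import subprocess

# 4-gram ARPA model (the order is an arbitrary choice here).
subprocess.run(
    ["lmplz", "-o", "4", "--text", "corpus.txt", "--arpa", "lm.arpa"],
    check=True,
)

# Optional binarization; the binary loads faster and is smaller than the ARPA.
subprocess.run(["build_binary", "lm.arpa", "lm.bin"], check=True)
```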
The NeMo codebase contains a dedicated scripts section for n-gram LMs. The scripts make use of the OpenSeq2Seq ctc-decoders, for which a separate install shell script is provided. The LM creation scripts produce only binary KenLM models, and always character-based ones: in the case of a BPE model, the corpus is pre-tokenized and the token ids (shifted by a predefined offset) are interpreted as characters, over which the LM is constructed (roughly as in the sketch below). So there is no ARPA support.
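As far as I can tell from those scripts, the encoding works roughly like this. The offset value and the tokenizer interface are my reading of the code, not something I have verified.

```python
# Sketch of the character-encoding trick the NeMo scripts apply before KenLM
# training; 100 is an illustrative offset, the scripts define their own value.
TOKEN_OFFSET = 100

def encode_line(line: str, tokenizer) -> str:
    # tokenizer is assumed to expose text_to_ids(), as NeMo tokenizers do.
    token_ids = tokenizer.text_to_ids(line)
    # Each token id becomes one unicode character, so KenLM sees a
    # "character-level" corpus whose alphabet is the tokenizer vocabulary.
    return "".join(chr(i + TOKEN_OFFSET) for i in token_ids)

# The encoded corpus is then fed to KenLM as usual; the resulting LM is tied
# to this exact tokenizer, which is why it cannot be reused across models.
```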
Going through the codebase one can also spot the recently added decoder_timestamps_utils, which is built around pyctcdecode. pyctcdecode can load an n-gram LM in either KenLM or ARPA format. There is, however, no documentation on how the LM should be prepared. Should the earlier-mentioned scripts be used? What changes when deploying to Riva? Can the same models be used? Based on the pyctcdecode README, I would expect usage roughly like the sketch below.
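This is my guess at the intended usage, adapted from the pyctcdecode README; the model name, LM path, and alpha/beta values are placeholders on my side, not documented NeMo usage.

```python
import nemo.collections.asr as nemo_asr
from pyctcdecode import build_ctcdecoder

# Placeholder pretrained character-based CTC model.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained("QuartzNet15x5Base-En")

# logprobs=True returns the raw [time, vocab+blank] log-probability matrix.
logits = asr_model.transcribe(["sample.wav"], logprobs=True)[0]

decoder = build_ctcdecoder(
    asr_model.decoder.vocabulary,  # character labels in model output order
    kenlm_model_path="lm.bin",     # word-level KenLM binary or ARPA, as above
    alpha=0.5,                     # LM weight, placeholder value
    beta=1.0,                      # word insertion bonus, placeholder value
)
print(decoder.decode(logits))
```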
Could someone please shed some light on how all of this works together, and which approach to preparing and using an n-gram LM is the correct one?