Kneser-Ney Smoothing on Expected Counts: Alignment and Joint n-gram models #24

AdolfVonKleist · 2017-08-14T20:17:49Z

The topic of LM training came up again recently.

The aligner produces weighted alignment lattices. There is some evidence that augmenting the Maximization step in the EM alignment process with the sort of expected-count KN smoothing described in this paper may/should improve the overall quality of the G2P aligner:

https://www3.nd.edu/~dchiang/papers/zhang-acl14.pdf

The same approach may be used to directly train the target joint n-gram model from the resulting alignment lattices. I previously tried the latter using the WB fractional counts implementation in OpenGrm NgramLibrary, but it seemed to have little impact. The Zhang paper notes a similar outcome and that EC-KN appears to be much more performant, even compared to the fractional KN implementation employed in Sequitur.

If I'm going to include some form of LM training after all, maybe this represents the most appropriate choice. There is also reference implementation as a Ghiza add-on:

https://github.com/hznlp/giza-kn

The text was updated successfully, but these errors were encountered:

smilenrhyme · 2020-04-05T13:14:34Z

@AdolfVonKleist thanks a lot for such wonderful library 👍 I am having following questions, please share your thoughts.

Can we pass Kenlm based ARPA file in phonetisaurus-train step ? I am not sure the current mit-lm is trained on what data and how is being used and same for RNNLM.

[Does usage same as defined in Ref 1 HMM where emission probabilities comes from alignment module and transition prob comes from LM trained on phonetic sequences passed in training as word pair <Grapheme /t Phoneme> ? but looks like you are trying to improve alignment module itself using KN smoothing]

Note : Got the overview of this work from these references :

https://www.aclweb.org/anthology/N07-1047.pdf (M2M EM -> HMM)
Improving WFST-based G2P Conversion with Alignment Constraints and
RNNLM N-best Rescoring

Thanks a lot !!

AdolfVonKleist · 2020-04-05T16:53:52Z

You should be able to use kenlm to perform the ARPA training directly. Just use the command line utilities instead of the python wrappers.

$ phonetisaurus-align --input=cmudict.formatted.dict \
  --ofile=cmudict.formatted.corpus --seq1_del=false
# Train an n-gram model (5s-10s):
$ estimate-ngram -o 8 -t cmudict.formatted.corpus \
  -wl cmudict.o8.arpa
# Convert to OpenFst format (10s-20s):
$ phonetisaurus-arpa2wfst --lm=cmudict.o8.arpa --ofile=cmudict.o8.fst

just replace the estimate-ngram call with an equivalent kenlm command. You'll need to output it to ARPA text format though, so that you can still transform it into a WFST for inference.

The mitlm call is trained on the output of the alignment - it just treats the aligned and segmented joint token sequences a 'normal' text corpus.

smilenrhyme · 2020-04-06T07:43:08Z

@AdolfVonKleist Thanks for quick response 👍

Does RNNLM work in current master code. Is this used in same way as mit-lm ? or Is there any goodness achieved with RNNLM over mit-lm ?

And just to clarify, this is what you are saying about last line. [Steps from paper referenced above]

Thanks :)

AdolfVonKleist · 2020-04-08T06:16:58Z

Hi, Yes it should work, however the rnnlm code has not been updated since that earliest release, and is effectively the same as the original Mikolov code from that time. The only novel contribution there is the joint token implementation of the decoder. I did not find it to yield any significant improvement over mitlm as a pure alternative, and the training time, as well as decoding time were significantly slower. The only place where it yielded a modest boost was when used in ensemble with mitlm as described in the paper [but again there is a time penalty]. Whether or not that was/is sufficient reason to use the combined system in a real-world or production setting, as opposed to just the normal joint ngram models, would probably depend on how heavily you prioritize speed versus absolute accuracy. Best, Joe 2020年4月6日(月) 0:43 smilenrhyme <[email protected]>:

…

@AdolfVonKleist <https://github.com/AdolfVonKleist> Thanks for quick response 👍 Does RNNLM work in current master code. Is this used in same way as mit-lm ? or Is there any goodness achieved with RNNLM over mit-lm ? And just to clarify, this is what you are saying about last line. [Steps from paper referenced above] [image: image] <https://user-images.githubusercontent.com/45142420/78534235-b5e4db00-7807-11ea-9463-60374d6df83e.png> Thanks :) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#24 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABVUA5U5WV2D6P2Y7WTESLRLGBZVANCNFSM4DW3MMRA> .

smilenrhyme · 2020-04-10T07:38:20Z

Thanks a lot for detailed perspective 👍

AdolfVonKleist added the enhancement label Aug 14, 2017

AdolfVonKleist self-assigned this Aug 14, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kneser-Ney Smoothing on Expected Counts: Alignment and Joint n-gram models #24

Kneser-Ney Smoothing on Expected Counts: Alignment and Joint n-gram models #24

AdolfVonKleist commented Aug 14, 2017

smilenrhyme commented Apr 5, 2020

AdolfVonKleist commented Apr 5, 2020

smilenrhyme commented Apr 6, 2020

AdolfVonKleist commented Apr 8, 2020 via email

smilenrhyme commented Apr 10, 2020

Kneser-Ney Smoothing on Expected Counts: Alignment and Joint n-gram models #24

Kneser-Ney Smoothing on Expected Counts: Alignment and Joint n-gram models #24

Comments

AdolfVonKleist commented Aug 14, 2017

smilenrhyme commented Apr 5, 2020

AdolfVonKleist commented Apr 5, 2020

smilenrhyme commented Apr 6, 2020

AdolfVonKleist commented Apr 8, 2020 via email

smilenrhyme commented Apr 10, 2020