Kneser-Ney Smoothing on Expected Counts: Alignment and Joint n-gram models #24

Open · AdolfVonKleist opened this issue Aug 14, 2017 · 5 comments

@AdolfVonKleist (Owner)

The topic of LM training came up again recently.

The aligner produces weighted alignment lattices. There is some evidence that augmenting the Maximization step of the EM alignment process with the sort of expected-count KN smoothing described in this paper should improve the overall quality of the G2P aligner:

The same approach could be used to train the target joint n-gram model directly from the resulting alignment lattices. I previously tried the latter using the Witten-Bell fractional-count implementation in the OpenGrm NGram library, but it seemed to have little impact. The Zhang paper notes a similar outcome, and reports that expected-count KN performs much better, even compared to the fractional KN implementation employed in Sequitur.
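Roughly, the idea (a sketch in my own notation, not the paper's exact formulation) is interpolated Kneser-Ney with the integer counts replaced by expected counts accumulated over the weighted alignment lattices:

  p(w \mid u) = \frac{\max\left(E[c(uw)] - D,\ 0\right)}{E[c(u)]} + \lambda(u)\, p(w \mid u')

where u' is the backoff context, E[c(\cdot)] are fractional counts summed over lattice paths weighted by their posteriors, D is the discount, and \lambda(u) redistributes the discounted mass. The subtle part, as I understand the paper, is estimating D and the lower-order continuation counts correctly once the counts-of-counts are no longer integers.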

If I'm going to include some form of LM training after all, this may be the most appropriate choice. There is also a reference implementation available as a GIZA++ add-on:

@smilenrhyme

@AdolfVonKleist thanks a lot for such a wonderful library 👍 I have the following questions; please share your thoughts.

  • Can we pass a KenLM-based ARPA file in the phonetisaurus-train step? I am not sure what data the current mitlm model is trained on, or how it is used, and the same question applies to the RNNLM.

[Is the usage the same as the HMM defined in Ref. 1, where the emission probabilities come from the alignment module and the transition probabilities come from an LM trained on the phonetic sequences passed in at training time as word pairs <Grapheme \t Phoneme>? Though it looks like you are trying to improve the alignment module itself using KN smoothing. See the rough sketch after the reference list below.]

Note: I got an overview of this work from these references:

  1. https://www.aclweb.org/anthology/N07-1047.pdf (M2M EM -> HMM)
  2. Improving WFST-based G2P Conversion with Alignment Constraints and RNNLM N-best Rescoring
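Roughly, the HMM picture I am describing above (as I read Ref. 1) is

  \hat{p}_1^n = \arg\max_{p_1^n} \prod_{i=1}^{n} P(g_i \mid p_i)\, P(p_i \mid p_{i-1})

with the emission terms P(g_i \mid p_i) coming from the alignment model and the transition terms P(p_i \mid p_{i-1}) from an LM over phoneme sequences, whereas (if I understand correctly) Phonetisaurus scores joint grapheme-phoneme tokens with a single n-gram model.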

Thanks a lot !!

@AdolfVonKleist (Owner)

You should be able to use KenLM to perform the ARPA training directly. Just use the command-line utilities instead of the Python wrappers:

# Align the dictionary:
$ phonetisaurus-align --input=cmudict.formatted.dict \
  --ofile=cmudict.formatted.corpus --seq1_del=false
# Train an n-gram model with mitlm (5s-10s):
$ estimate-ngram -o 8 -t cmudict.formatted.corpus \
  -wl cmudict.o8.arpa
# Convert to OpenFst format (10s-20s):
$ phonetisaurus-arpa2wfst --lm=cmudict.o8.arpa --ofile=cmudict.o8.fst

Just replace the estimate-ngram call with an equivalent KenLM command. You'll need to output to ARPA text format, though, so that you can still transform it into a WFST for inference.
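For example, something along these lines should work with KenLM's lmplz (untested here; small joint-token corpora sometimes trip up the default discount estimation, in which case --discount_fallback helps):

# Train an 8-gram model with KenLM instead of mitlm:
$ lmplz -o 8 --discount_fallback < cmudict.formatted.corpus > cmudict.o8.arpa
# Then convert to OpenFst format exactly as before:
$ phonetisaurus-arpa2wfst --lm=cmudict.o8.arpa --ofile=cmudict.o8.fst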

The mitlm model is trained on the output of the alignment - it just treats the aligned, segmented joint-token sequences as a 'normal' text corpus.
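For concreteness, each line of that corpus is just a sequence of joint grapheme}phoneme tokens - roughly something like this for "test" (the exact joiner and skip symbols depend on the alignment options):

t}T e}EH s}S t}T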

@smilenrhyme

@AdolfVonKleist Thanks for the quick response 👍

Does the RNNLM work in the current master code? Is it used the same way as mitlm, or is there any gain from the RNNLM over mitlm?

And just to clarify, is this what you mean by your last line? [Steps from the paper referenced above]

[image: training steps from the referenced paper]

Thanks :)

@AdolfVonKleist (Owner)

AdolfVonKleist commented Apr 8, 2020 via email

@smilenrhyme

Thanks a lot for the detailed perspective 👍
