Improved Backing-Off for M-gram Language Modeling, by Reinhard Kneser and Hermann Ney
https://www-i6.informatik.rwth-aachen.de/publications/download/951/Kneser-ICASSP-1995.pdf
- Abstract
- 1. Introduction
- 2. BACKING-OFF
- 3. MARGINAL DISTRIBUTION AS CONSTRAINT
- 4. LEAVING-ONE-OUT
- 5. Experimental Results
- Conclusion
- Stochastic language modeling: backing-off is a method to cope with the sparse data problem
- Propose distributions optimized for backing-off
- Theoretical derivations lead to distributions different from usual probability distributions
- Experiments show 10% improvement in terms of perplexity and 5% in word error rate
- Stochastic language model: provides probabilities of a given word sequence through conditional probabilities p(w | h)
- M-gram models: treat histories whose last (M - 1) words are identical as equivalent
- Sparse data problem: the number of possible events far exceeds the size of the training data, so most M-grams are never observed and cannot be estimated reliably from relative frequencies alone (a toy count sketch follows this list)
- Smoothing techniques: interpolation and backing-off
- Backing-off uses less specific (coarser) equivalence classes, for which probabilities can be estimated more reliably
- Using the ordinary probability distribution of the coarser model as the backing-off distribution can bias the model towards words that are frequent overall but occur only after a few specific histories
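To make the sparse-data point concrete, here is a minimal sketch (not from the paper; the toy corpus and names are invented) that counts trigram events with a two-word history:

```python
from collections import Counter

# Toy corpus; a real system would use millions of running words.
corpus = "we propose improved backing off for m gram language modeling".split()

# Count trigram events (h, w) with history h = last two words.
trigram_counts = Counter(
    ((corpus[i], corpus[i + 1]), corpus[i + 2]) for i in range(len(corpus) - 2)
)

vocab_size = len(set(corpus))
print(f"observed trigrams: {len(trigram_counts)} of {vocab_size ** 3} possible")
# Almost all trigrams are unseen, so p(w | h) cannot be estimated for them
# from relative frequencies alone -- hence smoothing and backing-off.
```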
Approach:
- Histories are grouped by a specific and a more general equivalence relation (e.g. the last two words vs. the last word of the history)
- Backing-off model:
  p(w | h) = α(w | h)          if N(h, w) > 0
  p(w | h) = γ(h) · β(w | ĥ)   if N(h, w) = 0
  where ĥ is the more general equivalence class of the history h and γ(h) is chosen so that p(· | h) sums to one (a code sketch follows this block)
- Optimize parameters of the distribution function for seen and unseen events
- Two approaches lead to similar solutions with no additional computational overhead
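A minimal code sketch of the backing-off equation above, assuming absolute discounting for α(w|h); the function and variable names are illustrative, not the paper's:

```python
def backoff_prob(w, h, counts, history_counts, beta, d=0.5):
    """p(w|h) for the backing-off model: a discounted relative frequency
    alpha(w|h) for seen events, gamma(h) * beta(w) for unseen events.
    `counts` maps (h, w) -> N(h, w), `history_counts` maps h -> N(h),
    `beta` is the backing-off distribution over the whole vocabulary."""
    seen = {v for (hh, v) in counts if hh == h}
    if counts.get((h, w), 0) > 0:
        # alpha(w|h) with absolute discounting, 0 < d < 1
        return (counts[(h, w)] - d) / history_counts[h]
    # gamma(h): the mass freed by discounting, renormalized over the
    # beta-mass of the words unseen after history h
    freed_mass = d * len(seen) / history_counts[h]
    unseen_beta_mass = sum(p for v, p in beta.items() if v not in seen)
    return freed_mass * beta[w] / unseen_beta_mass
```

Here β can be either the conventional relative frequency of the coarser model or the distribution derived in the next section; only β changes between the two.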
Approach to Modeling Marginal Distributions
- Uses maximum-likelihood (relative-frequency) estimates, e.g. p(w|ĥ) = N(ĥ, w)/N(ĥ) for the marginal distribution and p(h|ĥ) = N(h)/N(ĥ) for the history weights
- Substituting the backing-off model into constraint equation (3) and moving β(w|ĥ) out of the second sum gives
  p(w|ĥ) = Σ_{h: N(h,w)>0} p(h|ĥ)·α(w|h) + β(w|ĥ) · Σ_{h: N(h,w)=0} p(h|ĥ)·γ(h)
  with both sums running over the histories h belonging to the class ĥ
- Solving this for β(w|ĥ) gives a ratio of a numerator and a denominator, written out below
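A worked restatement of this step in display form; ĥ denotes the generalized history as above, and this is a reconstruction of the derivation, not a verbatim quote of the paper's equations:

```latex
% Constraint (3) with the backing-off model substituted in and
% \beta(w|\hat h) pulled out of the sum over unseen histories:
\[
p(w \mid \hat h) = \sum_{h:\, N(h,w) > 0} p(h \mid \hat h)\, \alpha(w \mid h)
  + \beta(w \mid \hat h) \sum_{h:\, N(h,w) = 0} p(h \mid \hat h)\, \gamma(h)
\]
% Solving for the backing-off distribution:
\[
\beta(w \mid \hat h) =
  \frac{p(w \mid \hat h) - \sum_{h:\, N(h,w) > 0} p(h \mid \hat h)\, \alpha(w \mid h)}
       {\sum_{h:\, N(h,w) = 0} p(h \mid \hat h)\, \gamma(h)}
\]
```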
Smoothing Techniques
- Techniques differ in the probability estimate α(w|h) used for seen events
- Turing-Good estimates and linear or absolute discounting are typical examples (sketched below)
- The smoothing (backing-off) distribution itself is conventionally kept fixed at β(w|ĥ) = p(w|ĥ), the relative frequency of the coarser model
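Hedged sketches of two such α estimates, for a seen event with count c_hw under a history observed n_h times; the discount parameters are placeholders, not values from the paper:

```python
def alpha_linear(c_hw, n_h, lam=0.1):
    # Linear discounting: scale every seen relative frequency by (1 - lambda).
    return (1.0 - lam) * c_hw / n_h

def alpha_absolute(c_hw, n_h, d=0.75):
    # Absolute discounting: subtract a constant d < 1 from each seen count.
    return (c_hw - d) / n_h
```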
Proposed Approach
- Leave the parameters of the backing-off distribution β free and optimize them jointly with the other parameters
- Two approaches lead to similar solutions, independent of modeling type
- No additional computational overhead added
First Approximation
- Sum in denominator considered constant with respect to w
- β(w|ĥ) is then proportional to the numerator: β(w|ĥ) ∝ p(w|ĥ) − Σ_{h: N(h,w)>0} p(h|ĥ)·α(w|h)
- Normalizing over all words v ensures that the β(·|ĥ) values sum to unity (written out below)
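Written out under that approximation, with the normalization over all words v made explicit (again a reconstruction in the notation above):

```latex
\[
\beta(w \mid \hat h) \approx
  \frac{p(w \mid \hat h) - \sum_{h:\, N(h,w) > 0} p(h \mid \hat h)\, \alpha(w \mid h)}
       {\sum_{v} \Big[\, p(v \mid \hat h) - \sum_{h:\, N(h,v) > 0} p(h \mid \hat h)\, \alpha(v \mid h) \Big]}
\]
```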
Definitions (using absolute discounting as the specific form of the model):
- α(w|h) = [N(h, w) − d] / N(h) for N(h, w) > 0, with a discount 0 < d < 1
- Substituting these counts, the numerator above reduces to
  p(w|ĥ) − Σ_{h: N(h,w)>0} p(h|ĥ)·α(w|h) = d · N+(·, ĥ, w) / N(ĥ)
- Equation (10): β(w|ĥ) = N+(·, ĥ, w) / Σ_v N+(·, ĥ, v),
  where N+(·, ĥ, w) is the number of distinct histories h within the class ĥ in which w was observed
- This yields a distribution that differs from the ordinary probability distribution p(w|ĥ)
- It only records in how many distinct contexts a word was observed and ignores how often the word occurs overall (see the toy example below)
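A toy illustration of the resulting distribution for the bigram case, where N+(·, w) is the number of distinct predecessor words of w, contrasted with the ordinary unigram; the corpus, variable names, and printed comparison are invented for illustration and are not from the paper:

```python
from collections import Counter, defaultdict

# Toy corpus chosen so that "francisco" is frequent overall but only ever
# follows "san" -- the case where the plain unigram is a poor backing-off
# distribution.
corpus = ("san francisco san francisco san francisco "
          "new york new jersey new deal the deal the city").split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
unigram_counts = Counter(corpus)

# Conventional backing-off distribution: unigram relative frequencies.
total = sum(unigram_counts.values())
p_unigram = {w: c / total for w, c in unigram_counts.items()}

# Proposed distribution (Eq. 10, bigram case): proportional to the number
# of *distinct* predecessor words, N+(., w).
predecessors = defaultdict(set)
for (h, w) in bigram_counts:
    predecessors[w].add(h)
n_plus_total = sum(len(s) for s in predecessors.values())
beta = {w: len(predecessors[w]) / n_plus_total for w in predecessors}

for w in ("francisco", "deal"):
    print(w, round(p_unigram[w], 3), round(beta.get(w, 0.0), 3))
# "francisco" gets a large unigram probability but a small beta, because it
# was only ever observed after a single context.
```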
Leaving-One-Out Technique for Backing-Off Model Estimation
Background:
- Maximum-likelihood estimation assigns zero probability to unseen events, so it cannot estimate them directly
- Cross-validation techniques like leaving-one-out technique are used to overcome this issue
- The leaving-one-out technique removes one event at a time from the training data, trains the model on the remaining events, and tests it on the removed event
- Sum of log probabilities of removed events serves as optimization criterion
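In symbols, a common way to write this criterion (a sketch in the notation above, not a verbatim quote of the paper's equations), where p_{-(h,w)} denotes the model re-estimated with one occurrence of the event (h, w) removed; removing any of the N(h, w) identical occurrences gives the same reduced counts, hence the weighting:

```latex
\[
L_{\mathrm{LOO}} = \sum_{(h,w):\, N(h,w) > 0} N(h,w)\, \log p_{-(h,w)}(w \mid h)
\]
```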
Applying the Leaving-One-Out Technique:
- Remove each event in turn; events observed only once (singletons) become unseen events in the reduced training data
- Train model on remaining data
- Estimate leaving-one-out probability for removed event
- Sum log probabilities of all removed events to obtain leaving-one-out log likelihood
- Use this as optimization criterion
- Final result: relative counts in which only singletons are counted (a code sketch of the criterion follows this list)
- Solutions of both approaches (Eqs. 13, 20) are similar
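A minimal sketch of the leaving-one-out computation for a unigram model with absolute discounting; the closed vocabulary, the toy data, and all names are assumptions made for illustration:

```python
import math
from collections import Counter

def loo_log_likelihood(tokens, d=0.5):
    """Leaving-one-out log-likelihood of a unigram model with absolute
    discounting. Each occurrence of each word is held out once; the model
    is re-estimated from the remaining n - 1 tokens."""
    counts = Counter(tokens)
    n, vocab = len(tokens), len(counts)
    total = 0.0
    for w, c in counts.items():
        if c > 1:
            # Removing one occurrence leaves w seen with count c - 1.
            p = (c - 1 - d) / (n - 1)
        else:
            # A singleton becomes an unseen word in the reduced data; with
            # the vocabulary taken from the full data, w is the only unseen
            # word and receives the whole mass freed by discounting.
            p = d * (vocab - 1) / (n - 1)
        total += c * math.log(p)
    return total

# Pick the discount d that maximizes the criterion on toy data.
tokens = "a a a b b c d d e".split()
best = max((loo_log_likelihood(tokens, d), d) for d in (0.1, 0.3, 0.5, 0.7, 0.9))
print("best discount d:", best[1])
```

The same bookkeeping carries over to the M-gram case, where the held-out singletons are exactly the events that exercise the backing-off distribution.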
- Evaluated on the German Verbmobil corpus and the Wall Street Journal task
- Separate test sets used for evaluation in both tasks
- Used trigram language models with non-linear interpolation smoothing
- Standard, 'singleton', and 'marginal constraint' distributions tested
- All models smoothed to avoid zero probabilities
- Consistent improvement of new models over baseline (up to 10% lower perplexity, 5% lower word error rate)
- Recognition results produced for the Wall Street Journal task
- Compact trigram models built without loss of performance
- Experiments show improvement in terms of perplexity and recognition results
- Comparison with ARPA's official model reveals an improvement of about 9% in perplexity
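For reference, the perplexity figures quoted above are the inverse geometric mean of the per-word test-set probabilities; a one-function sketch (not from the paper):

```python
import math

def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities of a test set."""
    return math.exp(-sum(log_probs) / len(log_probs))
```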
- Special backing-off distributions improve language models by up to 10% in perplexity and 5% in word error rate compared to baseline
- Neither theoretically derived solution depends on the specific type of model, and neither adds extra computational cost.