Better n-gram language models #7
Should this be implemented as a method that returns larger LMs made by combining the LMs of languages that are distance n away in a decision tree?
(note: this enhancement probably shouldn't be attempted until we have a working system using the existing feature definitions)
Essentially yes. What I had in mind was to bisect all the data in some way (e.g., by language families, or by lowest perplexity or cross entropy of individual models), then use the resulting splits to train a language model classifier (e.g., does the language belong in class A or B?), then continue splitting. The end case (i.e., leaf classifiers) would discriminate between individual languages (it may be that A is a specific language model and B is a combined class, which is then further split, etc.). This structure would essentially be a decision tree. The motivation is that we don't have enough language data to do multi-class classification with a single classifier. But this is just my idea, made without much experience in language classification. See if there is literature on n-gram language classification of low-resource languages.
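A minimal sketch of what that tree of binary language-model classifiers could look like (the class and its `score()` method are hypothetical, not existing lgid code): each internal node holds two pooled models and routes a line toward whichever side scores it better, recursing until a leaf remains.

```python
# Hypothetical sketch of the decision-tree arrangement described above.
# Assumes per-group n-gram models exist and expose a score() method
# (e.g. log-probability or negative perplexity); none of these names
# come from the current lgid code.

class LMNode:
    """Internal node: decides between two groups of languages."""

    def __init__(self, left_langs, right_langs, left_model, right_model):
        self.left_langs = left_langs      # e.g. one language family
        self.right_langs = right_langs    # the remaining languages
        self.left_model = left_model      # LM trained on pooled left-group data
        self.right_model = right_model    # LM trained on pooled right-group data
        self.left_child = None            # further splits, or None at a leaf
        self.right_child = None

    def classify(self, line):
        # Route the line toward whichever combined model scores it higher.
        if self.left_model.score(line) >= self.right_model.score(line):
            branch, langs = self.left_child, self.left_langs
        else:
            branch, langs = self.right_child, self.right_langs
        if branch is None:
            # Leaf: only one language (or a small residual set) remains.
            return langs
        return branch.classify(line)
```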
I'll do some reading and try out a few things to extend the language models. I'll try to train and test the lgid system to look at performance improvements, but where should I get more freki files?
Check under
The word and character n-gram models are pretty simple. The feature, as defined, is set to True if some percentage of the tokens on the line exist in the language model for the given language. We could try to build a more typical n-gram model, but data sparsity would be a problem. Consider using a resource like Ethnologue to get language family hierarchies (this data can be extracted from the included Crubadan.csv file), then combining data from related languages to create a more general model. This could help distinguish between language candidates that are radically different. These n-gram language models could be arranged in a decision tree (or similar structure) in order to more finely select the matching language.
Note that spelling differences across even closely related languages could make this method infeasible, but it's worth considering if there's time.
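To make the two ideas concrete, here is a rough sketch (not the actual lgid feature code) of (1) an existing-style feature that fires when enough of a line's tokens appear in a language's word list, and (2) pooling per-language word lists by family using a hierarchy read from Crubadan.csv. The CSV column names and the 50% threshold are assumptions for illustration.

```python
# Illustrative only: the real lgid feature definitions and the actual
# Crubadan.csv layout may differ.

import csv
from collections import defaultdict


def in_lm_feature(line_tokens, lang_word_set, threshold=0.5):
    """True if at least `threshold` of the line's tokens appear in the word LM."""
    if not line_tokens:
        return False
    hits = sum(1 for tok in line_tokens if tok in lang_word_set)
    return hits / len(line_tokens) >= threshold


def build_family_models(crubadan_csv, lang_word_sets):
    """Pool per-language word sets by family to get more general models."""
    family_models = defaultdict(set)
    with open(crubadan_csv, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            code = row['code']        # assumed column name for the language code
            family = row['family']    # assumed column name for the family label
            if code in lang_word_sets:
                family_models[family] |= lang_word_sets[code]
    return family_models
```

A family-level model built this way would back off gracefully when a line matches no single language well, which is the point of arranging the models hierarchically.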