Better n-gram language models #7
Should this be implemented as a method that returns larger LMs made by combining the LMs of languages that are distance n away in a decision tree?
(note: this enhancement probably shouldn't be attempted until we have a working system using the existing feature definitions)
Essentially yes. What I had in mind was to bisect all the data in some way (e.g., by language families, or by lowest perplexity or cross entropy of individual models), then use the resulting splits to train a language model classifier (e.g., does the language belong in class A or B?), then continue splitting. The end case (i.e., leaf classifiers) would discriminate between individual languages (it may be that A is a specific language model and B is a combined class, which is then further split, etc.). This structure would essentially be a decision tree. The motivation is that we don't have enough language data to do multi-class classification with a single classifier. But this is just my idea, made without much experience in language classification. See if there is literature on n-gram language classification of low-resource languages.
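A minimal sketch of what that tree of binary language-model classifiers could look like (the class and its `score()` method are hypothetical, not existing lgid code): each internal node holds two pooled models and routes a line toward whichever side scores it better, recursing until a leaf remains.

```python
# Hypothetical sketch of the decision-tree arrangement described above.
# Assumes per-group n-gram models exist and expose a score() method
# (e.g. log-probability or negative perplexity); none of these names
# come from the current lgid code.

class LMNode:
    """Internal node: decides between two groups of languages."""

    def __init__(self, left_langs, right_langs, left_model, right_model):
        self.left_langs = left_langs      # e.g. one language family
        self.right_langs = right_langs    # the remaining languages
        self.left_model = left_model      # LM trained on pooled left-group data
        self.right_model = right_model    # LM trained on pooled right-group data
        self.left_child = None            # further splits, or None at a leaf
        self.right_child = None

    def classify(self, line):
        # Route the line toward whichever combined model scores it higher.
        if self.left_model.score(line) >= self.right_model.score(line):
            branch, langs = self.left_child, self.left_langs
        else:
            branch, langs = self.right_child, self.right_langs
        if branch is None:
            # Leaf: only one language (or a small residual set) remains.
            return langs
        return branch.classify(line)
```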
I'll do some reading and try out a few things to extend the language models. I'll try to train and test the lgid system to look at performance improvements, but where should I get more freki files?
Check under
The word and character n-gram models are pretty simple. The feature, as defined, is set to True if some percentage of the tokens on the line exist in the language model for the given language. We could try to build a more typical n-gram model, but data sparsity would be a problem. Consider using a resource like Ethnologue to get language family hierarchies (this data can be extracted from the included Crubadan.csv file), then combining data from related languages to create a more general model. This could help distinguish between language candidates that are radically different. These n-gram language models could be arranged in a decision tree (or similar structure) in order to more finely select the matching language.
Note that spelling differences across even closely related languages could make this method infeasible, but it's worth considering if there's time.
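To make the two ideas concrete, here is a rough sketch (not the actual lgid feature code) of (1) an existing-style feature that fires when enough of a line's tokens appear in a language's word list, and (2) pooling per-language word lists by family using a hierarchy read from Crubadan.csv. The CSV column names and the 50% threshold are assumptions for illustration.

```python
# Illustrative only: the real lgid feature definitions and the actual
# Crubadan.csv layout may differ.

import csv
from collections import defaultdict


def in_lm_feature(line_tokens, lang_word_set, threshold=0.5):
    """True if at least `threshold` of the line's tokens appear in the word LM."""
    if not line_tokens:
        return False
    hits = sum(1 for tok in line_tokens if tok in lang_word_set)
    return hits / len(line_tokens) >= threshold


def build_family_models(crubadan_csv, lang_word_sets):
    """Pool per-language word sets by family to get more general models."""
    family_models = defaultdict(set)
    with open(crubadan_csv, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            code = row['code']        # assumed column name for the language code
            family = row['family']    # assumed column name for the family label
            if code in lang_word_sets:
                family_models[family] |= lang_word_sets[code]
    return family_models
```

A family-level model built this way would back off gracefully when a line matches no single language well, which is the point of arranging the models hierarchically.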