This is an implementation of the LIGA algorithm for language identification [1], [2].
n.b.: As far as I know, at time of writing (January 2015), there is no other open-source implementation of this algorithm.
Liga uses erlang.mk, so just `make`. There are no dependencies. The `test/` directory has some eunit and common test suites, which can be run with `make eunit` and `make ct`. All tests should pass.
n.b.: The GitHub continuous integration tests (in `.github/workflows/erlang.yml`) don’t run the common test suites, as those tests require datasets not included in the repo.
The basic idea is that you train a model with a set of labelled strings, and save the model. Then you can classify strings against this model.
The modules in `test/` give examples of usage.
It is possible to create a new, empty, model and add labelled strings to it:
Data = [{en, "is this a test"},
{nl, "is dit een test"}],
Model = lists:foldl(fun({Lab, Str}, Acc) ->
liga_model:import_string(Acc, Str, Lab)
end,
liga_model:new(),
Data).
For a dataset of any significant size, however, this could take a while. The `liga_train` module provides a `build_model/{1,2}` function, which takes a data dictionary and an optional `NumberOfRegions` parameter:
Data = dict:from_list([{en, ["is this a test",
"the cat sat on the mat"]},
{nl, ["is dit een test",
"Hier is nog een Nederlandse zin."]}]),
Model = liga_train:build_model(Data, 27).
`liga_train` parallelises model building in two ways:
- A process is created for each label (en, nl, etc.), for each string, for each trigram list and for each trigram edge list. So for the above example 14 processes would be created. The leaf processes here (i.e. the trigram and trigram edge list processes) broadcast data to region processes.
- In the above example 54 (i.e. 27*2) region processes are created (one set of region processes for trigrams and one for trigram edges). When all the work is finished, a central process compiles the regions into a single liga model.
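The process counts quoted above can be reproduced with a little arithmetic. The following is a minimal sketch, not code from `liga_train`: the module and function names are hypothetical, and it assumes exactly one process per label, one per string, and two leaf processes per string (one for its trigram list and one for its trigram edge list).

```erlang
-module(process_count_sketch).
-export([tree_processes/1, region_processes/1]).

%% Hypothetical helper, not part of liga.
%% Data is a dict of Label -> [String], shaped like the argument to
%% liga_train:build_model/2.
%% Assumption: 1 process per label, plus 3 per string
%% (the string itself, its trigram list, its trigram edge list).
tree_processes(Data) ->
    dict:fold(fun(_Label, Strings, Acc) ->
                      Acc + 1 + 3 * length(Strings)
              end,
              0,
              Data).

%% One set of region processes for trigrams and one for trigram edges.
region_processes(NumberOfRegions) ->
    2 * NumberOfRegions.
```

With the two-label, four-string dictionary from the example, `tree_processes/1` gives 14 and `region_processes(27)` gives 54.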
A trigram is sent to a region based on its co-ordinates (mutatis mutandis for edges):
{X, Y, Z} = Trigram = {123, 234, 345}.
Region = lists:sum([X, Y, Z]) rem NumberOfRegions.
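As a rough illustration of the region assignment, here is a small self-contained sketch. The module name and both function names are hypothetical. The sliding-window trigram extraction is an assumption about how a string becomes trigrams (liga's own extraction may pad or normalise the input differently); only the `rem` step is taken from the snippet above.

```erlang
-module(region_sketch).
-export([trigrams/1, region/2]).

%% Hypothetical helper, not part of liga.
%% Slide a window of three code points over a string (a list of integers).
%% Strings shorter than three characters yield no trigrams.
trigrams([A, B, C | Rest]) ->
    [{A, B, C} | trigrams([B, C | Rest])];
trigrams(_) ->
    [].

%% Region index in the range 0..NumberOfRegions-1, computed as above:
%% the sum of the three co-ordinates modulo the number of regions.
region({X, Y, Z}, NumberOfRegions) ->
    (X + Y + Z) rem NumberOfRegions.
```

For the trigram `{123, 234, 345}` and 27 regions, the sum is 702 and `702 rem 27` is 0, so that trigram would go to region 0.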
The point of all this parallelising is to enable a lot of data to be processed and merged fairly promptly.
Once you’ve imported your training data, you can write the model to a file. The following saves the model (along with some metadata) to `my_new_liga_model.erl`:
ok = liga_writer:write_model(M, my_new_liga_model).
The main use of a liga model is to classify a string (i.e. identify the language):
Result = liga:classify(M, "is dit ook een test").
[{nl,0.5744047619047619}, {en,0.14583333333333331}]
Here, M could be either a liga model data structure, or an atom representing a liga model module. Result is a list of labels (i.e., languages) the model knows about, each with a score. The list is sorted by score, so the “favourite” language is always the first element.
The score is a floating-point number between 0.0 and 1.0.
n.b.: it is possible for `liga:classify/2` to return an empty list if none of the trigrams in the input string is contained in the model (i.e. it has no idea what language the string was written in):
Result = liga:classify(M, "和而不同").
[]
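Because the result can be empty, calling code probably wants to handle that case explicitly. The sketch below is one way to do so, assuming only the result shape shown above (a list of `{Label, Score}` tuples sorted best-first). The module, the `best_label/2` function and the `MinScore` threshold are hypothetical and not part of liga.

```erlang
-module(classify_result_sketch).
-export([best_label/2]).

%% Hypothetical helper, not part of liga.
%% Result is a list as returned by liga:classify/2, sorted best-first.
%% Returns {ok, Label} when the top score reaches MinScore, and
%% unknown when the list is empty or the top score is too low.
best_label([{Label, Score} | _], MinScore) when Score >= MinScore ->
    {ok, Label};
best_label(_Result, _MinScore) ->
    unknown.
```

For example, `best_label([{nl, 0.57}, {en, 0.15}], 0.5)` returns `{ok, nl}`, while `best_label([], 0.5)` returns `unknown`.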
As well as the common_test suite `test/calibration_SUITE.erl`, a replication of the calibration in [2] is described in doc/calibration.org.
I have added a small tweak to the algorithm. The tweak is enabled by default, but can be reverted to the original by changing the value of a macro in `liga_labmap.erl`:
%%-define(VERSION, original).
-define(VERSION, llaisdy).
This table shows the effect of the tweak on calibration test results:
Test | Original | Llaisdy | Change (percentage points) |
---|---|---|---|
{sample_size,5} | 87.333% | 96.333% | +9 |
{sample_size,10} | 97.500% | 98.667% | +1.167 |
{sample_size,25} | 98.000% | 99.000% | +1 |
{sample_size,50} | 99.167% | 99.333% | +0.166 |
spec | 93.333% | 98.333% | +5 |
gen | 91.667% | 85.185% | -6.482 |
{holdouts,1} | 95.833% | 100.000% | +4.167 |
The effects are mostly positive, especially when the training data set is small. I don’t know (yet) why there is a negative effect in the generalisation test.
The thinking behind the tweak is that a node (trigram) or edge (trigram pair) which appears in several languages has less significance per language than a node or edge which appears in fewer languages. For example, the grapheme “ŵ” occurs in Welsh but in few other languages, so the presence of “ŵ” in a text is a strong clue that the text is in Welsh. Consequently the weighting (which is used as a divisor) is multiplied by the number of languages the node/edge appears in. See `liga_labmap:score/3` and `liga_labmap:versioned_weights/3`.
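To make the effect of the tweak concrete, here is a simplified sketch. It is not the code in `liga_labmap`: the module, the `contribution/3` function and the shape of its arguments are hypothetical. It assumes that a node's or edge's contribution to a label is that label's count divided by a weight, and that the list of counts contains only labels in which the node or edge actually occurs.

```erlang
-module(weight_sketch).
-export([contribution/3]).

%% Hypothetical illustration, not the liga_labmap implementation.
%% Counts is a list of {Label, Count} pairs for a single node or edge;
%% Weight is the divisor used by the original algorithm.
contribution(original, Counts, Weight) ->
    [{Label, Count / Weight} || {Label, Count} <- Counts];
%% Tweak: multiply the divisor by the number of labels in Counts,
%% so nodes or edges shared between languages count for less.
contribution(llaisdy, Counts, Weight) ->
    contribution(original, Counts, Weight * length(Counts)).
```

Under these assumptions, a trigram seen only in Welsh keeps its full contribution, since the divisor is multiplied by 1. A trigram shared by two languages, say `[{en, 4}, {nl, 6}]` with weight 10, contributes 0.4 and 0.6 in the original version but 0.2 and 0.3 with the tweak.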
[1] “Mining Social Media”, 2011, http://www.win.tue.nl/~mpechen/projects/smm/
[2] Tromp, E. & Pechenizkiy, M., 2011, “Graph-Based N-gram Language Identification on Short Texts”, http://www.win.tue.nl/~mpechen/publications/pubs/TrompPechenizkiy_LIGA_Benelearn11.pdf
[3] Vogel, J. & Tresner-Kirsch, D., 2012, “Robust Language Identification in Short, Noisy Texts: Improvements to LIGA”, https://www.mitre.org/sites/default/files/pdf/12_2971.pdf
- document liga_util:get_likely/1
- document tests with languages/dialects that use the Arabic alphabet
- test with languages/dialects that use Hanzi
- implement in some other languages: e.g., OCaml, Rust, C
- consider implementing [3]