This is a project with the aims of:
- making a free/libre morphological transducer for every language of the world;
- making free/libre machine translators for translating between all related languages of a particular language group (e.g. Turkic, Slavic, Indo-Iranian, etc.);
- updating Apertium 3.0's official documentation so that it reflects changes which have happened since the time of its publication, especially as a result of Google Summer of Code projects;
- reducing the code which developers have to maintain in a typical Apertium 3.0 monolingual or bilingual package.
The project, in its spirit and aims, is similar to the Apertium project and can be considered its spin-off, with two additions (hence the two pluses in the name of the project):
- for each language, we will maintain a corpus of shareable texts, parallel or comparable to the corpora for other languages;
- machine translators will have speech recognition and speech synthesis front-ends (based on Mozilla's Common Voice and DeepSpeech projects).
To be frank, if we come up with some useful technology, we hope to see it merged into Apertium's code base (maybe in the form of different campaigns, such as "Help us create a morphological transducer for every human language!").
Since one of our declared goals is being able to handle speech (not just the written word), and since we want to re-evaluate some of the design decisions made in the Apertium project (while not at all sure that such re-evaluations will turn out to be wise), we decided to make a separate project out of this effort.
Apertium is a great project with quite a few contributors, but here is a list of things deemed less than optimal in the current Apertium setup:
- bilingual dictionaries support only one-to-one mappings, which leads to a situation where monolingual dictionaries, as a workaround, unnecessarily include multiwords;
- transfer rules are unidirectional, and currently there is no way to mark a transfer rule as 'bidirectional' and get a transfer rule for the opposite direction for free;
- there can be, and usually are, several entries in a .lexc or .dix file with the same left-hand side (and there are no compile-time checks against mistakenly adding the same word twice, possibly once with the right and once with a wrong category);
- compile cycles are rather long, and translators are not extensible programmatically;
- it is not possible to re-use components in a programmatic way (i.e. without having to copy-paste manually, which sooner or later will lead to out-of-date chunks). What we want instead is a way to say something like this:

    from apertium-tat-rus import transfer-rule-x
    from apertium-symbols import n, v, adj, np
    parameterize(transfer-rule-x, lemma="foo")
    etc.
We suppose that most of the problems listed are due to the fact that the domain-specific languages used in Apertium (and there are a handful of them: one seen in monolingual and bilingual .dix files, one seen in chunking, interchunk and postchunk transfer rules, lexc, twol and others) are so-called external DSLs: they are parsed and compiled. When writing in these DSLs, you don't have access to a full-fledged programming language which would allow you to extend the DSL in question easily and to simplify and automate things. You also don't have access to useful features of a full-fledged programming language, such as a proper module system, which would allow you to re-use code across monolingual and bilingual packages reliably.
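For contrast, here is a minimal sketch of what module-level re-use could look like if transfer rules were first-class values in an internal DSL hosted in Racket. All names here (`transfer-rule-x`, `apertium-tat-rus`, the `#:lemma` parameter) are hypothetical illustrations, not part of any existing package:

```racket
;; Hypothetical module: apertium-tat-rus/rules.rkt
;; A transfer rule as an ordinary, parameterizable Racket value.
#lang racket/base

(provide transfer-rule-x)

;; transfer-rule-x : (#:lemma string) -> ((listof token) -> (listof token))
;; Returns a rule specialized to the given lemma (logic elided).
(define ((transfer-rule-x #:lemma [lemma "foo"]) tokens)
  ;; ... real transfer logic would go here ...
  tokens)
```

A bilingual package could then re-use and re-parameterize the rule with an ordinary `require`, e.g. `(require apertium-tat-rus/rules)` followed by `(transfer-rule-x #:lemma "bar")`, instead of copy-pasting it.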
To be fair, Apertium was designed for translating between closely-related languages, and for that it works sufficiently well.
- Every Apertium program is a valid Apertium++! program.
At the core of this project is a Racket library called apertiumpp. The documentation of the library can be found at https://taruen.com/apertiumpp/apertiumpp/.
The apertiumpp library has an interface to various textual data which can be used to test Apertium packages. Once apertiumpp is installed, you should be able to do the following.
raco apertiumpp corpus -l <ISO 639-3 code>
run in the terminal will list the corpora we have for that language:
Example:
> (rash "raco apertiumpp corpus -l tat")
"bible.com"
Adding the name of a corpus lists the files it contains:
Example:
> (rash "raco apertiumpp corpus -l tat bible.com")
"1502-ttrbbl-izge-yazma.csv"
If the full “path” to a corpus is given, it will be output to stdout:
Example:
> (display
(rash "raco apertiumpp corpus -l tat bible.com
1502-ttrbbl-izge-yazma.csv | head"))
GEN.1.1 Әүвәл
GEN.1.2 Җир йөзе әле сурәтсез, буп-буш – төпсез упкын, караңгылык эченә
чумган; Аллаһының рухы-сулышы сулар өстендә гизә иде.
GEN.1.3 Аллаһы:
GEN.1.3 – Яктылык булсын! – дип боерды, һәм яктылык булды.
GEN.1.4 Аллаһы яктылыкның яхшы булуын күрде һәм аны караңгылыктан аерды.
GEN.1.5 Яктылыкны «көн» дип, караңгылыкны «төн» дип атады. Кич булды,
иртә булды – бер көн кичте.
GEN.1.6 Аллаһы:
GEN.1.6 – Суларның уртасында бер гөмбәз булсын, суларны бер-берсеннән
аерсын! – дип боерды.
GEN.1.7 Нәкъ шулай булды: Аллаһы гөмбәзне яратты. Гөмбәз астындагы
суларны гөмбәз өстендәге сулардан аерды.
GEN.1.8 Аллаһы гөмбәзне «күк» дип атады. Кич булды, иртә булды – икенче
көн кичте.
For that to work, the Apertium package should have an "info.rkt" file and a "main.rkt" file with the functionality you want to export and test. See apertium-kaz and apertium-kaz-tat for an example.
To include non-Racket files into the resulting Racket package, you'll need to use the define-runtime-path macro, as explained e.g. in this blog post.
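As a rough illustration, a "main.rkt" might use define-runtime-path (from Racket's racket/runtime-path library) along these lines. The file name "corpora/sample.txt" and the exported function are assumptions for the sketch, not taken from apertium-kaz:

```racket
;; main.rkt -- sketch of a module that ships a non-Racket data file
#lang racket/base

(require racket/runtime-path)

;; Resolve "corpora/sample.txt" relative to this module's source
;; location, so the path still works after `raco pkg install`
;; and byte-compilation.
(define-runtime-path SAMPLE "corpora/sample.txt")

(provide sample-lines)

;; sample-lines : -> (listof string)
;; Reads the bundled data file line by line.
(define (sample-lines)
  (with-input-from-file SAMPLE
    (lambda () (for/list ([l (in-lines)]) l))))
```

A plain relative path would instead be resolved against the current working directory of whoever runs the code, which is why the macro is needed for packaged data files.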