Portage Statistical Machine Translation software suite — suite de logiciels de traduction automatique statistique
nrc-cnrc/Portage-SMT-TAS
Portage-SMT-TAS

Traitement multilingue de textes / Multilingual Text Processing
Centre de recherche en technologies numériques / Digital Technologies Research Centre
Conseil national de recherches Canada / National Research Council Canada

Copyright 2004-2021, Sa Majesté la Reine du Chef du Canada
Copyright 2004-2021, Her Majesty in Right of Canada

MIT License - see LICENSE
See NOTICE for the Copyright notices of 3rd party libraries.

French version: LISEZMOI

Description

Portage is a project to explore new techniques in Statistical Machine Translation (SMT), and to develop state-of-the-art SMT technology that can be used commercially, as well as to support other projects at NRC and elsewhere.

Various SMT systems were developed under the Portage project. PORTAGEshared was first released to Canadian universities to build and train a community of users. Commercial releases of PORTAGEshared 1.1 to 1.3 were made, followed by Portage 1.4.

PortageII marked a leap forward in NRC's SMT technology. With version 1.0, we brought in significant improvements to the translation engine itself that resulted in better translations and contributed to our success at the NIST Open Machine Translation 2012 Evaluation (OpenMT12). With version 2.0, we added two important features to help translators: the transfer of markup tags from the source sentence to the machine translation output, and handling of the increasingly common XLIFF file format. Version 2.1 was a maintenance release, while 2.2 brought in the fixed-terms module.

PortageII 3.0 marked another important step in the core SMT technology, bringing in deep learning (with Neural Network Joint Models, or NNJMs), an improved reordering model based on sparse features, coarse LMs and other improvements. The NRC put significant effort into optimizing the default training parameters to improve the quality of the translations produced by PortageII 3.0 out of the box, and we expect users to see and appreciate the difference.

PortageII 4.0 brought in incremental document adaptation and training NNJMs on your own data, with fully updated generic models.

Portage-SMT-TAS is the name we chose for the GitHub repo, now that we are releasing the code base as open source under the MIT License. Despite the rise of Neural Machine Translation (NMT), the Portage code base still includes a number of tools that can be useful to the world of Machine Translation and parallel text processing.

All the names above might be found in various parts of the documentation and code, and can be construed to refer to the same system, or one of its versions. When you see PortageII_cur, that's a placeholder that our script for creating an official release replaces with the actual version name and number.

Note that some of the Portage tools have been split out into other GitHub repos:
- https://github.com/nrc-cnrc/PortageTextProcessing contains text processing tools that are relevant for both NMT and SMT work.
- https://github.com/nrc-cnrc/PortageClusterUtils contains the multi-core and high-performance cluster parallelization tools that were developed for Portage, but can be used for other projects as well.

Portage-SMT-TAS includes a complete suite of software tools required for phrase-based statistical machine translation, including:
- preprocessing of French, English, Spanish, Danish (tokenization, detokenization, lowercasing, aligning), Chinese (handling of dates and numbers), and Arabic (integration with MADA tokenizer);
- training lexical translation models (IBM models 1 and 2, HMM, plus integration with MGiza++ for IBM4) from aligned corpora;
- training phrase-based translation models from aligned corpora and lexical models;
- training lexicalized distortion models, and their hierarchical variant;
- training sparse features, including discriminative hierarchical distortion models;
- using language models in text and binary format for various purposes;
- training, fine-tuning and using Neural Network Joint Models (NNJM) to improve translation quality;
- filtering language and translation models to retain only information needed for a particular text to be translated;
- adapting language and translation models to reflect the characteristics of the data currently being translated;
- decoding (doing the actual translation, producing n-best lists or lattices);
- rescoring (choosing the best hypothesis from an n-best list) using sources of information that are external to the decoder; a collection of rescoring features;
- optimizing weights, both for decoding and for rescoring, using a variety of tuning algorithms, including N-Best MERT, Lattice and N-Best MIRA and a few more;
- truecasing (restoring upper case letters in the output of the translation);
- evaluation using BLEU, WER or PER;
- a tightly packed representation of various model files (LMs, TMs, (H)LDMs, suffix arrays) for instant loading and fast access via memory-mapped I/O;
- a confidence estimation module to compute a per-sentence quality prediction/estimation of the translations produced by Portage;
- an API to incorporate the decoder into other applications using SOAP or CGI; a web-ready demo;
- PortageLive -- tools to set up a runtime SMT server using Portage, accessible via a SOAP API, via CGI scripts or a command line interface;
- incremental document model adaptation;
- code to handle translation projects in the TMX and XLIFF formats, with optional transfer of formatting tags from the source text to the output;
- an experimental framework -- a collection of tools to automate running all the components required to train a full machine translation system (see https://github.com/nrc-cnrc/PortageTrainingFramework);
- generic en->fr and fr->en models to supplement in-domain training corpora, for improved translation quality, especially for smaller corpora;
- lots of general utility code needed for the above.

Contents of this directory

- bin/: executable code (once compiled)
- doc/: documentation (once generated)
- framework/: framework for training PortageII and doing experiments [TODO - this will be a separate GitHub repo]
- generic-model/: location to install PortageGenericModel 2.0 [TODO - where will this be?]
- lib/: run-time libraries (once compiled)
- logs/: place-holder for runtime logs
- PortageLive/: files needed to set up a runtime translation server
- src/: source code
- test-suite/: various test suites
- third-party/: non-NRC software distributed with PortageII for convenience
- tmx-prepro/: tools to extract and preprocess training corpora from TMX files [TODO - this will be a separate GitHub repo]
- INSTALL: installation instructions
- NOTICE: third party Copyright notices
- README: this file
- RELEASES: release history with change notes
- SETUP.bash: config file to activate Portage-SMT-TAS

Getting started

The framework subdirectory [TODO - fix] provides a PortageII training and experiment framework. We recommend you use this framework as your starting point for real experiments, and as your default configuration for training PortageII systems for use with PortageLive.

A tutorial is included showing how to run a toy experiment with this framework: tutorial.pdf. Even if you have used PortageII 2.2 or earlier, we recommend you read through this document, as it highlights the currently recommended procedures for using PortageII.

Please do not use our old toy example as a starting point: start from the framework instead.

For further documentation, look at the contents of doc/.

Upgrading from a previous version

If you have models that were created using a version of PortageII older than 3.0, we strongly recommend that you retrain new models with the current version, to take advantage of the new features.

Between PortageII 2.x and PortageII 3.0, we conducted extensive experiments to optimize the training procedures, and updated the framework defaults to reflect the configuration we now recommend. If you have customized or created templates from the framework's Makefile.params file, please update them to reflect the changes in PortageII 3.0. If you use custom plugins, please review the new default plugins, which might meet your needs out of the box.

We now recommend (and enable by default) the following settings:
- Use the new sparse features.
- Use the new Coarse LM models.
- Use IBM4 word alignments, along with IBM2 and HMM ones (requires MGiza++).
- If the target language is French or English, use the generic LM as background model in a MixLM with your main language model (note: this will not introduce out-of-domain terminology in your translations; rather, it will help choose among competing alternatives from your own in-domain data).

For the language model, the framework now has a PRIMARY_LM variable that you can use instead of TRAIN_LM or MIXLM. When you use it, the framework automatically uses the generic LM if you're translating into French or English, and it automatically uses MixLMs instead of regular ones as soon as there is more than one component LM or the generic LM applies.

A matrix showing the evolution of the SOAP API, plugins and training features is now provided: review doc/PortageAPIComparison.pdf when upgrading.

The PortageLive SOAP API also changed significantly with version 3.0. Review the updated PortageLiveAPI.wsdl, doc/PortageAPIComparison.pdf and the PHP scripts in PortageLive/www/soap when updating your applications to work with PortageII 3.0. When installing the updated API, remove the compiled WSDL files cached by Apache in /tmp/wsdl-*.
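
The PRIMARY_LM behaviour described above can be summarized as a small decision rule. The following Python sketch is only an illustration of that rule as stated in this README, not the framework's actual implementation (the framework is configured through Makefile.params). The function name `plan_language_model`, the `LMPlan` type, the language codes "en" and "fr", and the label "generic-lm" are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List

# Assumption: target languages are identified by the codes "en" and "fr".
# The README only says the generic LM applies when translating into
# French or English.
GENERIC_LM_TARGETS = {"en", "fr"}


@dataclass
class LMPlan:
    """Hypothetical summary of which LMs get combined, and how."""
    components: List[str]  # component LMs, in order
    use_mixlm: bool        # True if a MixLM is used instead of a regular LM


def plan_language_model(primary_lms: List[str], target_lang: str) -> LMPlan:
    """Sketch of the PRIMARY_LM rule described in the README.

    - The generic LM is added when the target language is French or English.
    - A MixLM is used as soon as there is more than one component LM,
      or whenever the generic LM applies.
    """
    use_generic = target_lang.lower() in GENERIC_LM_TARGETS
    components = list(primary_lms)
    if use_generic:
        components.append("generic-lm")  # hypothetical label for the generic LM
    use_mixlm = use_generic or len(primary_lms) > 1
    return LMPlan(components=components, use_mixlm=use_mixlm)


if __name__ == "__main__":
    # One in-domain LM, translating into French: generic LM applies, so MixLM.
    print(plan_language_model(["in-domain"], "fr"))
    # One in-domain LM, translating into Spanish: a single regular LM.
    print(plan_language_model(["in-domain"], "es"))
    # Two in-domain LMs, translating into Spanish: MixLM without the generic LM.
    print(plan_language_model(["corpus-a", "corpus-b"], "es"))
```

In this sketch the generic LM is appended as an extra component, matching the README's description of it as a background model mixed with your main language model.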