Portage-SMT-TAS
Traitement multilingue de textes / Multilingual Text Processing
Centre de recherche en technologies numériques / Digital Technologies Research Centre
Conseil national de recherches Canada / National Research Council Canada
Copyright 2004-2021, Sa Majesté la Reine du Chef du Canada
Copyright 2004-2021, Her Majesty in Right of Canada
MIT License - see LICENSE
See NOTICE for the Copyright notices of 3rd party libraries.
Release History
(with a summary of important changes)
Portage-SMT-TAS is the name we use for the open-source release of the Portage
Statistical Machine Translation code base, although it does not constitute a
"Release" as understood in this file. The latest release is still PortageII 4.0.
PortageII is the second generation of SMT software released by the NRC.
log4j patch 2022-01-??
We have analyzed the Portage/PortageII code base to assess its vulnerability
to the log4shell CVE reported in December 2021. We estimate that the risk of
exploiting the log4j vulnerabilities on a machine with Portage/PortageII
is minimal, because the code that uses log4j is not exposed on a PortageLive
run-time server: it is used only during the training of Portage models, and
only to print statistics produced by the model-training software itself,
never user data.
Furthermore, Portage uses the older version of log4j, 1.2.14, which also has
known vulnerabilities, though they are much less severe than the log4shell
CVE found in log4j 2.* (before 2.17).
We therefore do not consider it essential to patch any deployed Portage
systems.
However, as of January 2022, the current main branch is patched for log4j
vulnerabilities.
The relevant changes are:
- Java >= 1.8 is now required, since log4j 2.17+ is not available for earlier
versions.
- structpred.jar has been updated to use log4j 2.17.1 instead of 1.2.14.
(The pre-compiled code for decoder weight optimization is found in
structpred.jar, located in src/rescoring/ and/or installed in bin/.)
Patching instructions:
- if you have PortageII-3.0 or PortageII-4.0 installed, replace the file
structpred.jar, normally installed in $PORTAGE/bin/, with the version in the
current main branch under src/rescoring/.
- if the machine is using java 1.6 or java 1.7, update java to 1.8. On CentOS
7, this can be done by running:
sudo yum install java-1.8.0-openjdk
or
sudo yum install java-1.8.0-openjdk-headless
PortageII 4.0 2018-07-19
This release adds two significant new modules to PortageII:
- Training of Neural Network Joint Models (NNJMs) on your own data
- Incremental document adaptation
Major changes:
- PortageII 3.0 introduced NNJMs, but only supported models that were trained
at the NRC. PortageII 4.0 now provides the training software so you can
train these models on your own data and on your own GPU-enabled PortageII
training server.
Warning: this will not work with the Python you installed for previous
versions of PortageII; it requires Python installed via Miniconda2, as
documented in INSTALL and on the TheanoInstallation page of the user manual.
- Added the incremental document adaptation module, and augmented the API to
support it. With this module, when a translator post-edits a document, they
can push the changes back to the PortageLive server to create a small
document-specific model. Subsequent translation requests for the same
document will benefit from the post-edited sentence pairs that were
previously pushed.
Note: this is not incremental retraining of the global model, only
incremental updates of small, document-specific models.
Warning: requires PHP version 5.4 or newer.
See doc/user-manual/IncrementalDocumentAdaptation.html for details.
- Release of the Portage Generic Model 2.1: ready to deploy, fully trained
systems for French and English in both directions, trained on 43.7 million
sentence pairs. The LM, TM and NNJM from these systems are available to use
as updated pretrained models to combine with your own data to create
en<->fr systems.
Minor changes:
- The new demo page is much improved: written in JavaScript, it communicates
with the server via the SOAP API.
- A new REST API is available, which you can use instead of the SOAP API if
you prefer, especially for incremental document adaptation. However,
it is not feature complete: it supports translating sentences and
paragraphs, but not whole documents at once. (A usage sketch follows this
list.)
- The user manual is now generated using asciidoctor.
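For illustration, here is a minimal Python sketch of how a client might call
such a REST API. The endpoint path, parameter names and response format are
assumptions made for this example only, not the documented interface; consult
the API documentation for the real one.

    import requests  # third-party HTTP client

    # Hypothetical base URL, endpoint and parameter names.
    BASE = "http://localhost/PortageLiveREST"

    # Translate one sentence within a given trained context.
    resp = requests.get(BASE + "/translate",
                        params={"context": "generic.en-fr",
                                "q": "This is a test."})
    print(resp.text)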
Major bug fixes:
- Identified and fixed software vulnerabilities in the PortageLive web pages.
Minor bug fixes:
- Release 3.0.1 did not make the SOAP API as backwards compatible as we
intended: the 3.0.1 API worked with 2.2 and 3.0, but not 2.1 or earlier.
PortageII 3.0.1 2016-10-07
This maintenance release makes the SOAP API backwards compatible, so that
clients written against the SOAP API from Portage 1.4.3 or PortageII 1.0,
2.0, 2.1, 2.2 or 3.0 will work with the PortageII 3.0.1 API.
PortageII 3.0 2016-07-26
This release incorporates significant improvements from the research world
into the core machine translation engine. Although end users will not see
visible feature changes, they should appreciate the significant improvements
to the quality of the translations produced by PortageII 3.0.
Significant changes were made to the training procedure, to the plugins and to
the SOAP API, however. Maintainers and developers should carefully review
doc/PortageAPIComparison.pdf, which shows charts comparing the SOAP API, the
plugin architecture, and the training parameters and recommendations between
versions 1.0, 2.0, 2.1, 2.2 and 3.0.
Note: to take advantage of all these improvements, we recommend that you
retrain all models trained with previous versions of PortageII.
Please consult README for more details about upgrading to 3.0.
Major changes:
- Since the previous version of PortageII, we have conducted extensive
experiments to update our recommended use of all the options available.
Framework defaults have been updated to reflect our new recommendations.
- NNJM decoder feature: added support for NRC's implementation of the Neural
Network Joint Model, the ground-breaking deep learning approach of Devlin
et al (ACL 2014). (A schematic sketch of the model's input follows this
list.)
- New sparse features, including the discriminative reordering model, a
significant improvement over previous reordering models.
- New coarse BiLM features take into account source word classes in the
context and give good empirical results, improving translation quality.
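As background on the NNJM: Devlin et al's model scores each target word given
a few previous target words plus a window of source words centred on the
"affiliated" source position. The Python sketch below only shows how such an
input context can be assembled; it is a schematic illustration of the
published model, with illustrative names and window sizes, not NRC's actual
implementation.

    def nnjm_context(tgt_hist, src, affil, src_window=11, tgt_order=4):
        # Collect tgt_order-1 previous target words and src_window source
        # words centred on the affiliated source position, padding at
        # sentence boundaries (schematic only).
        half = src_window // 2
        window = [src[i] if 0 <= i < len(src) else "<pad>"
                  for i in range(affil - half, affil + half + 1)]
        history = (["<s>"] * (tgt_order - 1) + tgt_hist)[-(tgt_order - 1):]
        return history + window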
Major bug fix:
- A vulnerability was found in plive.cgi that could allow a carefully crafted
URL to execute arbitrary code on the PortageLive server as user Apache.
This has been fixed for PortageII 3.0, and released as a security patch for
2.0, 2.1 and 2.2.
Minor changes:
- Use of generic models is optimized on PortageLive servers, via the new
plive-optimized-pretrained.sh script.
- Arabic is now supported as a source language.
- Added support for SRI format for word alignment files (for fastalign)
- align-words: added GDF and GDFA as self-documenting aliases for
IBMOchAligner 3 and 4.
- General clean up of the eval module, with addition of per-sentence BLEU
and NIST's 2009 BLEU definition, as published in mteval-v13a.pl. (A
sentence-BLEU sketch follows this list.)
- New Zens pruning of phrase tables is quick and simple, and quite effective.
- Added a number of experimental phrase table smoothers
- Sentence aligner, ssal, now runs faster, using Moore's diagonal beam
approach.
- Added -s option to dmcount to sort its output.
- Many new options and features in the canoe decoder, including:
- canoe can now prune its lattices before outputting them, for faster
handling by the lattice MIRA tuning algorithm. (-lattice-* options)
- -filter-features option to use some distortion models as hard
constraints instead of soft ones.
- New walls and zones features allow imposing specific reordering
constraints during decoding. Can be treated as soft constraints (via
-distortion-model) or as hard constraints (via -filter-features).
- The decoder's phrase table limit can now take into account all features,
including the LM heuristic ("-ttable-prune-type full" decoder option).
- Implemented LM context minimization, for faster decoding, following Li &
Khudanpur (2008).
- -nosent option to decode sentence fragments.
- -describe-model-only option describes the model in a canoe.ini file.
- -hierarchy option to create the output files in a hierarchical way when
there are too many to hold in one directory.
- canoe-parallel.sh now supports sentence-level load balancing, helping
parallel decoding finish faster.
- new coarse LM models.
- new RestCostLMs give better LM heuristic during decoding.
- Many design improvements to the decoder (canoe module), with clean up of
code in many places, more flexibility for future extensions, etc.
To mention just a few examples:
- Save the phrase partial score with the phrase info, so it does not need
to be recomputed each time the phrase is used.
- Annotation lists allow phrase pairs to be augmented with arbitrary
annotations within the decoder, making for more flexible and faster
decoder features.
- Singleton pointer to the global vocabulary so it's accessible everywhere
it's needed without needing to be passed all over the place.
- Removed obsolete reversing of phrase tables via #REVERSED.
- Input reader significantly streamlined, making it much easier to use.
- PhraseInfo and ForwardBackwardPhraseInfo classes merged.
- configtool:
- "configtool memmap" now accounts for all models, including BiLMs.
- many new commands to support sparse features and other changes
- filter_models:
- support for "combined" and "full" phrase table pruning types
- -plp switch to have filtered (H)LDM names preserve path info
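As a concrete reference for the per-sentence BLEU mentioned above, here is a
generic smoothed sentence-level BLEU in Python. It uses simple add-one
smoothing of the n-gram precisions; this is an illustrative formulation, not
necessarily the exact definition implemented in the eval module.

    import math
    from collections import Counter

    def sentence_bleu(hyp, ref, n=4):
        # hyp, ref: token lists. Add-one-smoothed n-gram precisions,
        # geometric mean, and the usual brevity penalty.
        log_prec = 0.0
        for k in range(1, n + 1):
            h = Counter(tuple(hyp[i:i + k]) for i in range(len(hyp) - k + 1))
            r = Counter(tuple(ref[i:i + k]) for i in range(len(ref) - k + 1))
            match = sum(min(c, r[g]) for g, c in h.items())
            total = max(sum(h.values()), 1)
            log_prec += math.log((match + 1.0) / (total + 1.0)) / n
        bp = min(1.0, math.exp(1.0 - float(len(ref)) / max(len(hyp), 1)))
        return bp * math.exp(log_prec)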
Minor bug fixes:
- When writing LM files in ARPA format, explicitly write the 0-backoff
weights where they are required; binlm2arpalm can also be used to add them
back to a file where they are missing.
- Python scripts within PortageII now token-split on spaces and tabs only,
like the rest of PortageII does; this fixes a bug found in truecasing.
- Bug in markup_canoe_output: in rare circumstances, a seg fault in
markup_canoe_output made the truecasing module crash, preventing
PortageLive from producing any output. This has been resolved in 3.0 and is
available as an optional patch to 2.2.
PortageII 2.2 2014-12-01
This is a feature release, adding fixed terms handling.
Changes:
- PortageLiveAPI:
- Extended the API with updateFixedTerms to populate a context with fixed
terms.
- Extended the API with getFixedTerms to retrieve a context's fixed terms
list.
- soap.php was augmented to facilitate testing updateFixedTerms and
getFixedTerms.
- Added fixed_term2tm.pl to create a phrase table from a list of fixed
terms. (A sketch of the idea follows this list.)
- framework:
- Updated to prepare new systems to handle fixed terms.
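Conceptually, turning a list of fixed terms into a small phrase table just
means emitting one entry per term pair. The Python sketch below illustrates
the idea only: the actual columns and score values that fixed_term2tm.pl
produces may differ, and the scores shown here are placeholders.

    def fixed_terms_to_phrase_table(pairs, path):
        # pairs: iterable of (source_term, target_term) strings.
        # Write one entry per pair in the conventional
        # "src ||| tgt ||| scores" phrase table layout.
        with open(path, "w") as f:
            for src, tgt in pairs:
                f.write("%s ||| %s ||| 1 1\n" % (src, tgt))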
PortageII 2.1 2014-10-20
This is a maintenance release, rolling in all the small patches that have been
released since PortageII 2.0 was published.
Minor changes:
- PortageLiveAPI:
- Now supports handling markup tags in getTranslation(), the method to
translate one or a few sentences of plain text at a time in a
synchronous way.
- Now supports three interpretations of newline characters, which can be
selected by the user: paragraph end, sentence end, or just plain
whitespace. The output returned follows the same interpretation for
newline characters. (A call sketch follows this list.)
- Support submitting PortageLive translation requests with more than one
thread.
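A SOAP client call might look like the following Python sketch, using the
third-party zeep library. The WSDL location, argument order and newline-mode
value shown here are assumptions for illustration, not the documented
signature; check the PortageLiveAPI documentation for the real interface.

    from zeep import Client  # third-party SOAP client

    # Hypothetical WSDL location and argument order.
    client = Client("http://localhost/PortageLiveAPI.wsdl")

    # Translate two lines, asking (hypothetically) that newlines be
    # treated as sentence ends.
    result = client.service.getTranslation(
        "First sentence.\nSecond sentence.", "context", "newline=s")
    print(result)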
Bug fixes:
- Binary distro creation, as well as PortageLive installation scripts, now
include previously omitted required libraries.
- Fixed bug in PortageLive web layout installation script.
- Detokenizer patch for correct handling of smart-quote style apostrophes for
contractions like 's in English and qu' in French.
- SDLXLIFF handling patched to avoid generating illegal empty sdl:seg-defs
entities.
PortageII 2.0 2013-02-15
This release provides significant user improvements for translators through
the handling of markup tags and the XML Localization Interchange File Format
(xliff) often used to package translation projects.
Major changes:
- Support for the transfer of tags found in the source sentence by applying
them to the corresponding text in the target sentence.
- To enable this functionality, set xtags to true when calling
translateTMXCE() or translateSDLXLIFFCE() in the SOAP PortageLive API.
- To support this functionality, word alignments are now stored in phrase
tables (regular and tightly packed) and included in the decoder's output.
(A sketch of alignment-based tag transfer follows this list.)
- As a side effect, the transfer of source case information in truecase.pl
is improved when the word-alignment is available.
- Support for the xliff file format via the PortageLive API.
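The idea behind tag transfer is to project each source-side tag onto the
target words that the tagged source words align to. The Python sketch below
is a simplified illustration of that projection only; the actual algorithm
must also handle unaligned words, tag nesting and tag pairs.

    def transfer_tags(src_tags, alignment):
        # src_tags: dict mapping a source word index to its tag.
        # alignment: set of (source_index, target_index) word links.
        # Attach each source tag to every target position its word
        # aligns to (simplified: ignores nesting and unaligned words).
        tgt_tags = {}
        for i, tag in src_tags.items():
            for s, t in alignment:
                if s == i:
                    tgt_tags.setdefault(t, []).append(tag)
        return tgt_tags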
Minor changes:
- Support for Danish tokenization and detokenization.
- Updated software for weight tuning is more stable.
- New Huffman encoding class with memory-mapped IO implementation
- Improved the parallelism configuration for sigprune.sh
- joint2cond_phrase_tables, joint2multi_cpt and train_ibm now overwrite
existing files when found, instead of aborting with an error message.
- Support for training and using sentence-level LM mixture adaptation.
- Minor bug fixes.
- Clean up of documentation.
PortageII 1.0 2012-09-04
PortageII marks a significant leap forward in NRC's statistical machine
translation technology. With version 1.0, described here, we bring in
significant improvements to the translation engine itself that result in better
translations, and contributed to our success at the NIST Open Machine
Translation 2012 Evaluation (OpenMT12), as well as a number of other
improvements we have made to the system.
Major changes:
- Added the key features of our NIST 2012 Chinese system:
- Tuning decoder weights using Batch Lattice MIRA (new tune.py script), as
well as other tuning algorithms. We have done extensive testing of
various tuning methods, and Lattice MIRA works best, yielding significant
gains over N-best MERT, our previous default, and reliably beating other
methods. (For details, see Cherry and Foster, NAACL 2012.)
- Tuning rescoring weights using N-Best MIRA (-a mira option to rat.sh).
- HLDM: Hierarchical lexicalized distortion models, following Galley and
Manning (EMNLP 2008) and Cherry, Moore and Quick (WMT 2012). Includes an
ITG parser used during decoding, which also provides a new distortion
limit implementation, -dist-limit-itg.
- Framework support for lattice mira and HLDMs, both enabled by default.
- Framework support for the use of IBM4 alignments produced by an external
tool: we obtain our best results when we combine phrase pair counts and
HLDM counts obtained from IBM4, HMM and IBM2 alignments, rather than
applying only the single best alignment method.
- Alignment indicator features tell the decoder which alignment(s) provided
each phrase pair, allowing the tuning algorithm to weigh the aligners'
opinions automatically.
- MixTM: linear mixture of phrase tables for domain adaptation, with proper
support in the framework, as well as improved framework support for MixLMs.
In both cases, framework support for combining externally trained generic
LMs and/or TMs with in-domain data, to get improved results, especially
for smaller in-domain training corpora.
- Framework and PortageLive support for Chinese->English pipeline.
- For French->English and English->French: added generic models that can be
combined with in-domain corpora to supplement them and improve the
translation quality. This can help for all domains, but especially for
those with smaller amounts of training data: the generic models will help
the system handle the general constructs of the language, while the
in-domain material will provide the domain-specific vocabulary and
constructions.
Changes of intermediate importance:
- New -filter-singletons switch to train_ibm reduces the memory bubble at the
beginning of the IBM1 model training procedure for large corpora.
- Speed up significance pruning by roughly an order of magnitude.
- Reduce by a factor of two the amount of memory required for the creation of
Tightly Packed Suffix Arrays (TPSAs), and therefore also for significance
pruning.
- Phrase table filtering and loading now takes linear time in the size of the
phrase table, and no longer has a quadratic term in the number of target
phrases per source phrase.
- New decoder features and options:
- -dist-limit-simple option, which helps at least for Chinese.
- minimum diversity criterion added to coverage pruning
- BiLM (following Niehues et al, WMT-2011)
- carry joint count information in the phrase table (for future use)
- carry alignment information in the phrase table, and optionally include
it in the decoder output.
- unal features count the number of words left unaligned in phrase pairs
(following Guzman et al, MT Summit 2009); see the counting sketch after
this list.
- distortion limit based on ITG constraints
- LeftDistance experimental distortion model
- -maxlen
- -diversity and -diversity-stack-increment for regular stack decoding
- Forced decoding is now done by canoe itself, using all available models.
This functionality replaces the obsolete phrase_tm_align program, which was
much more limited in functionality.
- TPPTs can now store fourth column (adirectional) scores and joint counts.
- More phrase table smoothers, enhanced phrase smoother library, and the
ability to generate adirectional scores via joint2multi_cpt.
- Framework support for optionally tuning and testing using multiple tuning
variants, which are typically 90% sample subsets of the main tuning set.
This is helpful for assessing quality when using less stable tuning methods
such as N-best MERT, but the current default, Lattice MIRA, is quite stable.
- Support for pre-loading models into memory in PortageLive using a new
priming script (prime.sh).
- PortageLive support in the framework for the new model features (MixLMs,
MixTMs, HLDMs, etc.).
- The global word-alignment model option helps MixTM handle small components
better.
- An alternative alignment symmetrization method, grow-diag-final-and, which
is becoming the standard in SMT and yields better results, has been
implemented in PortageII and is now enabled by default.
- When PortageLive processes a TMX file, as well as when tmx2lfl.pl extracts
training data from one, handle Trados and MS Word style non-breaking and
optional hyphens better: replace the non-breaking hyphen by a regular one,
and remove the optional one.
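To make the unal features above concrete: given the word links retained
inside an extracted phrase pair, the unaligned counts are simply the source
and target positions that no link touches. A minimal sketch (illustrative
only; the actual features may be split by side or position):

    def unal_counts(src_len, tgt_len, links):
        # links: set of (source_index, target_index) word-alignment
        # pairs within one extracted phrase pair.
        src_aligned = {s for s, _ in links}
        tgt_aligned = {t for _, t in links}
        # Positions on each side that carry no alignment link.
        return (src_len - len(src_aligned), tgt_len - len(tgt_aligned))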
Minor changes:
- New -w switch to joint2multi_cpt.
- New -sort switch to ibmcat.
- -w switch to translate.pl
- filter_models is smarter about not doing or redoing unnecessary work.
- tokenizer: new -pretok option for text that is already tokenized; better
handling of sequences of periods; and make sure -notok -ss and -pretok -ss
don't modify tokens, but only split sentences.
- MagicStream now always opens gzipped files using the gzip library in boost
instead of a pipe: this has been shown empirically to be faster, and it
prevents crashes due to the standard implementation of fork() when working
close to memory limits.
- Improved stability for run-parallel.sh.
- Significance pruning calculates the significance level with better precision,
thanks to boost's high-precision implementation of lgamma().
- align-words -H now documents word-alignment output formats.
- Removed some obsolete scripts and source code files.
- Various optimizations, code clean-up, resolve Klocwork warnings, etc.
- New python coding conventions and utility library portage_utils.py.
- New "-s trimmed", "-s max", -t switches to summarize-canoe-results.py, as
well as support for current framework structure.
- Framework can calculate OOV rates for test translations.
- PortageLive now supports specifying the language and country codes for the
TMX files separately for each context.
- PortageLive now has proper support for counting words in Chinese text:
following standard text processing software, we count each Chinese
character as a word (see the sketch below).
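A minimal Python sketch of this counting rule. The CJK character range used
here is an assumption for illustration; the exact ranges PortageLive
recognizes are not specified in these notes.

    import re

    CJK = u"[\u4e00-\u9fff]"  # basic CJK block, illustrative only

    def count_words(text):
        # Each CJK character counts as one word; everything else is
        # counted as whitespace-separated tokens.
        cjk_chars = re.findall(CJK, text)
        other_tokens = re.sub(CJK, " ", text).split()
        return len(cjk_chars) + len(other_tokens)

    print(count_words(u"Hello 世界"))  # 3: one token plus two characters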
Minor bug fixes:
- TPSAs, significance pruning and other parts of the tpt/ module now respect
the convention that tokens in tokenized text are separated by one or more
space or tab characters.
- Make sure the future score of a complete decoder hypothesis is 0.
- rescore_train in PER and WER mode: the stopping criterion was broken and has
been fixed.
- In forced decoding (previously phrase_tm_align), under some situations the
target sentence might not have been fully covered; this has now been fixed.
- In PortageLive setup, pre-linking of external libraries must be undone when
making the server copy.
Portage 1.x is the first generation of SMT software released by the NRC. We
keep its history here to document how our SMT technology evolved over the
years.
Portage 1.4.3 2011-10-03
This release is primarily intended to distribute our improved truecasing module,
and its integration in the experimental framework. It also includes
improvements we've made in various parts of the system since the last release.
Major changes:
- Improved truecasing workflow, which takes into account casing information
from the source sentence.
- The framework is adjusted to use the new truecasing workflow by default, or
optionally the old one.
- Significance pruning is now available (following Johnson et al, EMNLP 2007),
and integrated in the framework.
- The framework now supports using a phrase table merged from the IBM2 and HMM
ones, instead of using them separately. This is now the recommended
procedure, and the default. When you move to the 1.4.3 framework, you will
notice that PT_TYPES is now set to "merged_cpt" by default, instead of
"ibm2_cpt hmm3_cpt", which would restore the old behaviour.
Changes of intermediate importance:
- The tokenizer and detokenizer (utf-8 version only) now support Spanish.
- TPPT (Tightly Packed Phrase Table) and TPLM (Tightly Packed Language Model)
generation is now significantly faster and requires less memory.
- Added OS X (Darwin) support.
Minor changes:
- When PortageLive processes a TMX with confidence estimation enabled, the CE
score is included in every translation unit as a property of type "Txt::CE".
- PortageLive now allows overridden programs to exist in each model's /bin and
/lib directory, thus allowing one to have model-specific versions of some
programs.
- The boost and zlib libraries are now linked dynamically.
- Linking with TCMalloc, a fast memory allocator, is now supported.
- gen_phrase_tables has better memory and performance behaviour on large
training corpora.
- TPSA (Tightly Packed Suffix Array) generation now requires less memory.
- lm_eval now optionally outputs per-sentence log-prob or perplexity.
- gen_phrase_tables/joint2cond_phrase_tables now support -prune1w:
length-dependent phrase table pruning.
- parallelize.pl supports a new striped mode with fewer temporary files.
- A few more test suites were added.
- New time-mem-tally.pl script makes "make time-mem" faster in the framework.
- New merge_multi_column_counts program can merge Lexicalized Distortion Model
count files.
- New script summarize-canoe-results.py makes it easier to see and compare the
result of multiple experiments and/or system training runs.
- Better dependency and sanity checking at installation time.
- lzma is supported but no longer required.
- Support multiple training file pairs in gen-jpt-parallel.sh.
- wc_stats is now about 2.5 times faster.
- Removed obsolete single prob phrase table format (gen_phrase_tables -tmtext
and canoe -ttable-file*).
- canoe -h and -options output improved.
- canoe now has a -use-ftm switch to enable forward translation models with
default weights, instead of having to use -ftm with the right number of 1's.
Minor bug fixes:
- run-parallel.sh is now more stable (it would sometimes hang if the file
system was very busy: this is fixed).
- Fixed problem with TPLMs where, under rare boundary conditions, a parameter
of the language model might be lost.
- When PortageLive received text that collided with the "Magic number" of
some MIME types, the file was not recognized as plain text; this is now
fixed.
- The tpt module did not allow UNK to appear as a real token; now it does.
- cow.sh is no longer limited to dev sets with fewer than 10000 sentences.
- canoe -lattice output is now about 8 times more compact: recombined nodes
were previously unnecessarily expanded; now they are correctly kept as a
single node in the lattice output.
- On rare occasions, the presence of ||| in the training corpus would cause an
invalid phrase table to be generated; this is now fixed.
Portage 1.4.2 2010-10-01
This is a maintenance release.
Minor changes:
- The detokenizer for French no longer glues % to numbers.
- The new script fix-en-fr-numbers.pl patches numbers copied from English
to French by reformatting them using French conventions. Intended for use
in postprocess_plugin.
- Support for gcc 4.5.1.
- The HMM aligner now splits very long sentences into chunks of <= 200 words,
to avoid running an excessively long time on very long sentences, which are
mostly useless for this model anyway.
- When inserting Portage translations back into a TMX file, ce_tmx.pl now sets
TU attribute usagecount to 0 and deletes BPT, EPT, IT and PH elements
(native formatting codes) unless -keeptags is set.
- Made n-best list management in cow.sh more efficient.
Minor bug fixes:
- Added missing implementation of vocab-filtering of binary TTables.
- Fixed a rare crash situation in textpt2tppt.sh.
- Made arpalm2tplm.sh robust to extra white space in the ARPA LM files.
- Fixed cow.sh to remove unintended 10000 line limit in dev set.
Portage 1.4.1 2010-07-30
This is a maintenance release, fixing several issues in v1.4.0.
Minor changes:
- PortageLive now supports installing multiple contexts on the same server,
via both the CGI and SOAP interfaces. The SOAP interface has been enhanced
to let the user specify which context to use, whether confidence estimation
is required, and to handle TMX files. The CGI interface has several
improvements.
- In PortageLive, the duplicate copy of the SOAP code for secure servers has
been removed, and replaced with a mechanism that generates it automatically.
- When packaging models for a PortageLive context, tmtext-apply-weights.pl is
no longer used by default, since it is unstable and the gain is not that
significant.
- Dependencies on bash extensions are now made explicit, so that Portage can
run on Ubuntu even when /bin/sh is dash.
- Portage is now compatible with g++ 4.4.4 and 4.5.0, boost 1.43 and bash 4.
- New script filter-long-lines.pl can be used to filter excessively long lines
from a parallel corpus.
- plog.pl now outputs statistics in a clearer format.
- Improved installation instructions, in particular regarding dependencies.
Bug fixes:
- Fixed crash in gen_phrase_tables and a few other programs when compilation
with ICU is disabled and the user locale is *.utf-8.
- Fixed issue where TPTs might be corrupted when the file was larger than 4GB.
- Fixed several more minor issues.
Portage 1.4.0 2010-05-31
This update incorporates improvements intended to help the performance of
Portage as an online translation service, as well as scientific progress we've
made in the last year.
PORTAGEshared has been renamed "Portage" with a version number, i.e., this
package is now known as "Portage 1.4". References to PORTAGEshared within the
documentation or the code are considered to refer to Portage 1.4, or to
PORTAGEshared 1.0 to 1.3 if the context refers to older versions.
Note: with 1.4.2, the original 1.4 was renamed 1.4.0, as it should have been
named in the first place. Now, 1.4 refers to any of the 1.4.x updates.
Major changes:
- Tightly Packed Tries (TPT) (see Germann, Joanis and Larkin, SETQA-NLP 2009)
use memory mapped IO for optimized access to highly compact representations
of models. When used together, TPLMs (Tightly Packed Language Models),
TPPTs (Tightly Packed Phrase Tables) and TPLDMs (Tightly Packed Lexicalized
Distortion Models) reduce the decoder start time to nearly nothing, while
maintaining good decoder speeds. Furthermore, when a translation server
uses this technology, the file caching mechanism of the operating system
holds in memory what was read by a previous instance of the decoder, so that
once the server has translated a few sentences, a significant speed gain can
be observed as disk access becomes less and less necessary. This technology
is ideal for a live translation server, whether it is delivered as a web
service or otherwise. Tightly packed models are integrated in the decoder,
the rescoring module, the confidence estimation module, the truecasing
module, i.e., everywhere these models can be used.
- PortageLive. Portage now comes ready to deploy as a translation server, via
a web service using SOAP, via a web page, or connecting to the translation
server via ssh. Documentation is included on how to do so as a Virtual
Appliance that can run on any virtual machine architecture (from local
infrastructure to cloud computing), or on a dedicated translation server.
The TPT technology mentioned earlier significantly reduces the memory
requirements for such deployment. See Paul et al (MT Summit 2009) for
details (paper accompanying the Technology Showcase, available at
http://www.mt-archive.info/MTS-2009-TOC.htm).
- The peak memory required to train a phrase table has been reduced by about
50%: instead of invoking only gen_phrase_tables, one can parallelize the
counting process (the first half of what gen_phrase_tables does) with
gen-jpt-parallel.sh and invoke joint2cond_phrase_tables -reduce-mem to run
the estimation process with a small memory footprint.
- Experimental framework:
- The framework was enhanced in many ways, reflecting the changes in the
software, incorporating new modules, and following the evolution of the
recommended procedures.
- Resource monitoring: the framework now keeps track of resources used at
all stages of processing, to identify peak memory usage more readily, and
to be able to see where the time is spent. The major scripts in Portage
track memory usage and CPU time: cow.sh, rat.sh, cat.sh, run-parallel.sh,
canoe-parallel.sh, etc. The utility script time-mem is used to do the
same for other programs and program suites, and can be used outside
Portage as well.
- Tutorial: the framework-toy.pdf document has been revised to reflect
current code, and improved as a general tutorial for Portage.
- New decoder features:
- Adirectional scores for phrase pairs are now supported as decoder
features. Previously, Portage only supported forward and backward scores,
intended to model P(t|s) and P(s|t), respectively. Adirectional features
allow the use of arbitrary functions f(s,t) associated with each phrase
pair, without any implied semantics. The association features of Chen,
Foster and Kuhn (MT Summit 2009) are an example of such features.
The adirectional scores are stored as the "4th" column in multi-prob
phrase tables. See the user manual for details.
- Lexicalized distortion models (see Koehn et al, ACL-2007)
- Levenshtein or N-gram distance from a reference, useful especially for the
fuzzy mode in phrase_tm_align. By default, phrase_tm_align only returns
an alignment if phrase pairs exist to exactly cover the two sentences to
align. In fuzzy mode, phrase_tm_align is allowed to consider other
translations of the source that are close to the target, where "close" is
defined as having low Levenshtein or N-gram distance. This distance is
used as a feature in the log-linear model, and optimized jointly with the
rest of the model by the decoder. (A Levenshtein sketch follows this list.)
- New script tmx2lfl.pl designed to extract parallel corpora in plain text
from TMX (Translation Memory eXchange) files.
- New script tmtext-apply-weights.pl pre-applies log-linear weights learned
using cow.sh to create more compact models for use by a translation server.
- The phrase extraction process (gen_phrase_tables) has been modified to
require that a phrase pair have at least one actually linked word pair.
(Previously, unaligned words were allowed to be considered a phrase pair.)
- The new program joint2multi_cpt solves the "phrase hole" problem. When
multiple phrase tables are used together, phrase pairs that appear in one
but not all tables are penalized too aggressively by default, yielding what
we call the "phrase hole" problem. This problem is especially severe when
one table is much smaller than the other; often, cow.sh learned to strongly
discount that table's opinion because it was not appropriately smoothed.
joint2multi_cpt solves this problem by smoothing phrase tables considered as
a set, giving reasonable smoothed estimates in each table for all phrase
pairs, including those appearing only in other tables.
- MERT (cow.sh): improved stability of the search for optimal parameters.
See Foster and Kuhn (WMT 2009) for details.
- Confidence Estimation. Portage now comes with a module that produces
confidence estimates accompanying the decoder output. See Simard and
Isabelle (MT Summit 2009) for details.
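For reference, the Levenshtein distance used by the fuzzy mode above is the
standard dynamic-programming edit distance, shown here over token lists.
This is a textbook sketch, not Portage's implementation.

    def levenshtein(a, b):
        # Edit distance between token lists a and b, computed row by row.
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1,               # deletion
                               cur[j - 1] + 1,            # insertion
                               prev[j - 1] + (x != y)))   # substitution
            prev = cur
        return prev[-1]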
Minor changes:
- Word alignment:
- New "sri" alignment reader (for word_align_tool, eval_word_alignment).
- New "gale" and "uli" alignnment writers.
- New word aligners: IBMDiagAligner, HybridPostAligner, ExternalAligner.
- Bug fix in the HMM alignment model: the end-distribution semantics is a
bit cleaner, and p0/up0 can now be arbitrarily high, because the effective
p0 is capped at .999. (High p0/up0 have been found to be good for the
phrase extraction process, so this is important.)
- Sentence alignment (ssal):
- added support for IBM1 models and documented a multi-pass procedure for
producing improved sentence alignments.
- new hard boundaries within the input text allow handling a collection of
distinct documents collected together in a single file pair, as well as
the use of external sources of information such as section or paragraph
boundaries.
- New phrase smoothers in gen_phrase_tables and joint2cond_phrase_tables:
JointFreqs, alpha-smoothing option to RFSmoother and IndicatorSmoother
- canoe decoder:
- The forward TM scores are now included in future score calculation.
- Now supports the -bind PID option, exiting automatically when the
master process disappears.
- Faster 1-best decoding by discarding recombined states on the fly.
- All boolean options now have a -no variant so that something turned on in
the canoe.ini can be turned off on the command line.
- Rescoring / MERT:
- New features: Levenshtein distance and N-gram distance to a specified
reference.
- Various improvements to the module in general.
- Tokenizer:
- Now reliably fast, taking linear time regardless of paragraph length (it
used to be quadratic in the length of each paragraph).
- Supports a -notok switch to perform sentence splitting only.
- Many improvements to run-parallel.sh.
- parallelize.sh: new -w switch to determine the number of blocks from a
minimum number of lines per block instead of a fixed number of blocks, with
the -n switch capping the number of blocks used.
- Adaptation is no longer dependent on SRILM programs.
- New length-dependent phrase table pruning option to filter_models yields
better results than the fixed ttable-limit decoder parameter.
- Miscellaneous new programs:
- binlm2arpalm converts an LM file from our binary format to the standard
ARPA format.
- ibmcat displays binary word-alignment model files in plain text.
- wc_stat is like wc but displays more statistics.
- al-diff.py compares different sentence alignments for a given text.
- Save dependency information per source file during compilation instead of
re-processing each file to generate Makefile.depends.
- More unit-testing test suites.
- Some code clean-up using Klocwork Insight, fixing potential future problems.
- Some code documentation clean-up.
- Logs now go consistently to STDERR, even for programs without a primary
output on STDOUT.
- MagicStream: our library that handles reading and writing compressed files
on the fly uses gzip by default, but now falls back to using zlib (via the
boost::iostreams library) when gzip fails to start due to memory limits.
- New script canoe-timing-stats.pl helps track time spent loading models
versus time spent actually doing translation; cow-timing.pl summarizes the
time spent in the various parts of cow.sh.
- New module textutils/ groups together utility scripts and programs for basic
text manipulation, making them easier to find.
- Truecaser: fixed issues for handling different encodings.
- Quieted the copyright notices that were printed much too often.
- The obsolete src/api/ directory was deleted; PortageLive supersedes it.
PORTAGEshared 1.3 2009-01-21
This update to PORTAGEshared is primarily intended to incorporate the new HMM
word alignment module, and related functionality. We have also taken the
opportunity to migrate many improvements from Portage, add a new experimental
framework, and improve the documentation.
Major changes:
- HMM word alignment models, including a number of variants. We have
implemented the base model described in Och and Ney (CL, 2003), a class
based variant also based on Och's work (though our implementation is based
on the baseline system description in He, ACL/WMT-2007), as well as the
variants described by Liang, Taskar and Klein (HLT-2006), including their
symmetrization method, and He's (WMT-2007) lexicalized MAP (Bayesian) model.
gen_phrase_tables also has a new PosteriorAligner based on Liang et al
(HLT-2006).
The HMM word alignment models are trained using train_ibm, they can be used
for word alignment directly via align-words, and gen_phrase_tables can use
them to generate phrase tables. In our current state of the art, we
typically perform word alignment using IBM2 and HMM models separately, then
use the resulting phrase tables with cow.sh for maximum BLEU training,
either separately or merged together into one table.
- Added a generic HMM toolkit, used by the HMM word alignment models,
supporting state or arc emitting HMMs, and implementing the Viterbi and
Baum-Welch algorithms. (Optimized for densely connected HMMs; a Viterbi
sketch follows this list.)
- Parallelized training of IBM1/2/HMM word alignment models via the new cat.sh
script, and a binary format for TTables and all intermediate count files,
for fast reading and writing of these model and count files.
- An experimental framework is now included with PORTAGEshared, as a potential
starting point for your experiments. Besides demonstrating how to use
PORTAGEshared, this framework embeds option choices which we think are
reasonable defaults.
Previously, the only full usage examples we provided were not suitable for
this use. The toy example was designed to run fast at all costs, regardless
of the quality of the output, while the small-vocabulary regression test
suite was mostly intended to exercise the code. The new framework is
specifically designed to be both a tutorial and a reasonable starting point.
Of course, you will still need to experiment in order to optimize
performance for your setting.
Even if you used PORTAGEshared before, we recommend you read
framework-toy.pdf in the framework directory, as it includes a full
description of how to use PORTAGEshared, including important features which
are not all highlighted elsewhere. If you have built your own experimental
framework, you may find useful suggestions when following the toy example
described in this document.
- We've now included our truecasing module, which no longer requires external
software at truecasing time. The Perl script truecase.pl performs
truecasing using canoe. The program compile_truecase_map compiles the
truecase map for truecase.pl. Training the Language Model itself requires
an external language modelling toolkit, e.g., SRILM (if your licensing
requirements permit it) or IRSTLM.
- Many improvements to run-parallel.sh, useful if you're working on a cluster:
- uses Perl sockets instead of netcat, resulting in a significant reduction
of overhead, from seconds down to hundredths of seconds per job, and one
fewer dependency on external software;
- now more stable, with more coherent behaviour in case of errors, which can
be controlled by the user via the new -on-error switch;
- more thorough clean up at exit time or in case of errors;
- all temporary files are hidden away in a workdir instead of polluting the
directory run-parallel.sh is invoked from (most scripts using temporary
files now do the same too);
- new -c switch to run a single command via psub/qsub, acting as a blocking
qsub for clusters that don't support blocking qsub - this is useful to
have a Makefile run commands on a cluster via psub/qsub;
- number of CPUs requested by the master job is propagated to the workers;
- on clusters running Torque, take advantage of the job array feature to
speed up and reduce the overhead of worker submission via psub/qsub.
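For reference, the Viterbi algorithm implemented by the toolkit finds the
most probable state sequence for an observation sequence. Below is a
textbook Python sketch, dictionary-based for clarity; the toolkit itself is
optimized for densely connected HMMs and also implements Baum-Welch.

    def viterbi(obs, states, start, trans, emit):
        # start[s]: initial probability of state s;
        # trans[q][s]: transition probability q -> s;
        # emit[s][o]: probability of emitting observation o in state s.
        V = [{s: start[s] * emit[s][obs[0]] for s in states}]
        back = []
        for o in obs[1:]:
            row, ptr = {}, {}
            for s in states:
                p, q = max((V[-1][r] * trans[r][s], r) for r in states)
                row[s], ptr[s] = p * emit[s][o], q
            V.append(row)
            back.append(ptr)
        best = max(states, key=lambda s: V[-1][s])
        path = [best]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        path.reverse()
        return path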
Minor changes:
- User configuration is now centralized in src/Makefile.user-conf.
- We've added some unit testing and a unit testing framework, using CxxTest,
run automatically by doing "make test" in src/.
- We've moved our legacy test programs into subdirectories of the source code,
run automatically by doing "make test" in src/.
- Make the code compile with g++ 4.3 without any warnings.
- Streamlined C++ includes, in part to speed up compilation.
- Use tr1::unordered_map instead of the soon to be deprecated
__gnu_ext::hash_map.
- Various code refactorings to make maintenance and documentation easier,
including improvements to the compilation mechanism, removal of doxygen
errors, and many more small details.
- New section in the documentation with the usage info from all programs.
- A few programs now have a -final-cleanup switch: for efficiency, we often
don't delete models just before exiting, since the OS does so immediately
after exiting. In programs that support it, the -final-cleanup switch
deletes all models; useful for memory leak detection and other debugging.
- Removed obsolete champollion.breakparts.pl, unsplit-sentence.pl,
merge_ttables, maxphrase.pl, CalculateHypothesisProb.pl, and
find_sentence_phrases.
- Utils modules (src/utils):
- New utf8 casemapping functionality (requires ICU)
- Program utf8_filter performs strict validation of utf8 input.
- diff-round.pl now supports compressed files automatically, and has new
-sort, -q and -min options.
- New template class BiVector models a vector with positive and negative
indices.
- Fixed memory leak in short array allocator ArrayMemPool.
- Added support for .lzma files in all C++ programs, via MagicStreams.
- Preprocessing module (src/preprocessing):
- Significantly improved the tokenizing and detokenizing for French;
slightly improved for English as well. Better lists of abbreviations in
both languages. Smart quotes and other characters from the cp-1252
repertoire are now recognized, and optionally replaced by the closest
iso-8859-1 characters.
- New udetokenize.pl script performs detokenization on French and English
text encoded in utf8.
- Language Modelling module (src/lm):
- New caching mechanism for LM queries, intended for use with expensive LM
classes. Currently only enabled for LMMix (Dynamic LM mixture model).
- Minor improvements to ngram-count-big.sh: reports errors more carefully and
removes the merge tree since we noticed a single multi-way merge is faster.
- Refactored the LM classes to make adding new ones easier; added a new
Factory Method creator object for each class.
- New LM class: LMDynMap. Used for dynamic mapping of case or numbers.
- Tally statistics over LM queries, used by canoe in particular.
- Renamed lmtext2binlm to the more precise name arpalm2binlm.
- New script lm_sort_filter.sh sorts an ARPA-format LM in a way that
typically increases its compression ratio with gzip.
- New script lm-order.pl determines the order of an ARPA-format LM file.
- Translation Modelling and Word Alignment module (src/tm):
- New -prune1 option to gen_phrase_tables and joint2cond_phrase_tables
prunes long tails before calculating probabilities. Especially useful for
phrase tables trained on noisy corpora.
- New program word_align_tool allows manipulation and conversion of word
alignment files.
- Support for more word alignment formats via a generic module,
word_align_io, which is easily extensible to support further formats.
- Use smoothing for OOVs consistently in all alignment model queries.
- New file handling-unaligned-words.txt explains how unaligned words are
handled in the phrase extraction process.
- New program eval_word_alignment calculates F-measure with respect to a
reference alignment, as suggested by Fraser and Marcu (CL, Sept 2008).
- New merge_counts program does fast merging of counts; used by
gen-jpt-parallel.sh to remove large memory requirement at the end.
- Eval module (src/eval):
- Refactored PER, WER and BLEU calculations to support PER and WER
optimisation in rescore_train.
- bleucompare can now perform PER or WER calculations instead.
- Support for NIST style BLEU computation by setting the environment
variable PORTAGE_NIST_STYLE_BLEU (has different brevity penalty
definition).
- Decoding module (src/canoe):
- canoe-parallel.sh now supports the use of load balancing in -append mode.
- New preprocessing script canoe-escapes.pl adds or removes escapes expected
by canoe as needed.
- Bug fix in -soft-limit mode for filter_models keeps phrase pairs that
were incorrectly deleted before. This sometimes results in higher memory
requirements in canoe, which can be addressed via gen_phrase_tables's new
-prune1 switch.
- New filter_models options: -ttable-limit overrides the limit in the
canoe.ini file, -no-per-sent disables per-sentence LM filtering, using
the less effective global-vocabulary LM filtering instead.
- New rule decoder feature allows one to set weights for canoe markups via
-rule-weights, and tune them in cow.sh, instead of using hard-coded
weights. Supports multiple classes of rules with their separate weights.
- Robustness fix: canoe now accepts non-finite numbers in phrase tables,
issuing a warning and treating them as if they had been 0.
- Rescoring module (src/rescoring):
- New rescoring features: RatioFF, HMMTgtGivenSrc, HMMSrcGivenTgt,
HMMVitTgtGivenSrc, HMMVitSrcGivenTgt, WerPostedit, PerPostedit,
BleuPostedit, BackwardLM.
- Generic rescoring feature SCRIPT invokes any script of your choice, for
easy creation and prototyping of new features, as well as integration
of features not part of PORTAGEshared.
- New program uniq_nbest removes duplicates in an n-best list.
- cow.sh and rescore_train now optionally optimize WER or PER instead of
BLEU.
- New micro tuning mode for cow.sh looks for per-sentence optimal weights
for a few iterations before looking for globally optimal weights.
- Other new cow.sh options: -rescore-options, -no-lb, -s.
- rescore_train -l saves a log of Powell runs for tracing the optimisation
process; -rf randomizes the order in which features are considered by
Powell's algorithm.
- New script cowpie.py extracts useful statistics from a cow.sh log.
- rescore_translate now supports Minimum Bayes Risk rescoring.
- Rat.sh now uses hard phrase table filtering before translation, thus
reducing memory requirements in canoe (can be disabled with -no-filt).
- New rat.sh options: -rescore-opts, -per, -wer, -dep, -no-filt.
- gen-features-parallel.pl converted to Perl (from bash) and made more
stable.
PORTAGEshared 1.2 2008-01-28
This is a significant update to PORTAGEshared, incorporating most of the
changes we have made to Portage since the initial release of PORTAGEshared.
Major changes:
- Soft TM filtering: joint filtering of several phrase tables in such a way
that, no matter what weights are used, the top L hypotheses will have been
kept, i.e., discards entries that can never make it to the top L, under any
set of non-negative weights. Also described in Badr et al (CORES-2007).
(A filtering sketch follows this list.)
- Used by cow.sh when the -filt option is specified (recommended).
- LM filtering based on per-sentence-vocabulary, as described in Badr et al
(CORES-2007) (see the annotated bibliography in the user manual for all
paper references). In short, keeps an n-gram only if all the words it
contains can occur together in the translation of at least one source
sentence. Typical LM filtering uses one global vocabulary; this technique
efficiently keeps track of a separate vocabulary for each input sentence to
translate. In decoding, this approach can save some 25% of the memory
required for large LMs, or as much as 50% when combined with soft TM
filtering. In lm_eval and the LM rescoring function, significantly higher
savings are possible. Works with both the text (ARPA) and our BinLM file
formats.
- Automatically used by canoe while loading language models;
- automatically used by the rescoring module for the NgramFF feature;
- used by lm_eval when the -per-sent-limit option is specified.
- Implemented Huang and Chiang's (ACL-2007) cube pruning algorithm. Can yield an
order-of-magnitude speed up in decoding in most circumstances. Requires
careful re-tuning of some decoding parameters, however, especially S (stack
size) since its meaning is not the same as with regular decoding. Run canoe
-h for details on enabling cube pruning.
- New module implementing George Foster's LM and TM adaptation work, with
integration of the resulting mixture models in the decoder - details in
Foster and Kuhn (WMT-2007).
- Optimization of various programs throughout the Portage suite, including
canoe.
- Significantly optimized string splitting routines (src/utils/str_utils.h)
and consequently the loading of many types of input and data files.
In particular, the new Voc::addConverter functor directly converts a
sentence or a phrase from a string to a vector<Uint> with no intermediate
storage, yielding a noticeable speed up in several programs.
- Implemented new language model heuristics for decoding, including
"incremental", the default in several other MT systems, and now also the
default in PORTAGEshared.
- Monotonic decoding with phrase swaps can now be done using canoe's
"-distortion-limit 0 -dist-phrase-swap" combination of options, optionally
using the new PhraseDisplacement distortion model instead of, or in
combination with, the standard distortion penalty (WordDisplacement).
- New IBM1Forward decoder feature.
- The main rescoring script, rat.sh, was overhauled to be easier to use. The
model is now specified with the same syntax as for rescore_train (documented
in rescore_train -H): rat.sh transparently handles generating the features,
managing temporary files (now all tucked away in a working sub-directory)
and giving rescore_train an appropriately transformed model file. See
rat.sh -h for details and test-suite/regress-small-voc/28_rat_train.pl for
an example rescoring model in this simplified syntax.
- New rescoring features (run rescore_train -H for the full list):
- IBM1DocTgtGivenSrc calculates p(tgt-sent|src-doc), using a file of docids
to determine what parts of the source file constitute documents (the
docids file should have one line for each line in the source text,
containing an ID in any format (no whitespace allowed); lines with
identical IDs are considered to come from the same document).
- nbest*Post* - posterior probability features for confidence estimation
rescoring - see Ueffing and Ney (HLT-EMNLP 2005), Zens and Ney (WMT-2006).
These papers not included in our annotated bibliography provide even more
background and depth:
- Blatz et al. (2003). Confidence Estimation for Machine Translation.
JHU/CLSP Summer Workshop.
- Ueffing (2006). Word Confidence Measures for Machine Translation.
Ph.D. thesis.
- Consensus and ConsensusWin - WER-based consensus over N-best list (very
expensive features - not recommended for general use) - features based on
Mangu et al (1999).
- BLEUrisk - Minimum Bayes Risk using BLEU loss function - see Kumar and
Byrne (HLT-NAACL 2004).
- ParMismatch and QuotMismatch - count mismatched parentheses and quotes.
- CacheLM - cache LM over docs defined in docid files (see above) - see
Kuhn and De Mori (1990-2).
- Overhauled the regress-small-voc test suite:
- exercises more aspects of the code;
- includes two top-level scripts, one to run a minimal end-to-end suite, and
a second one that also runs various extensions;
- renumbered scripts so that they can be run in numerical sequence, as was
originally the intention.
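To illustrate the soft TM filtering criterion described at the top of this
list: an entry for a given source phrase can be discarded only if at least L
other entries are at least as good on every feature (and strictly better on
some), since then no non-negative weight vector can place it in the top L.
Below is a simplified Python sketch of that test, for one source phrase and
one table; the real implementation works jointly over several phrase tables
and is far more efficient.

    def soft_filter(entries, L):
        # entries: list of per-entry score tuples for one source phrase,
        # where higher is better on every feature (e.g. log probs).
        kept = []
        for i, e in enumerate(entries):
            dominators = sum(
                1 for j, f in enumerate(entries)
                if j != i
                and all(fv >= ev for fv, ev in zip(f, e))
                and any(fv > ev for fv, ev in zip(f, e)))
            if dominators < L:
                kept.append(e)
        return kept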
Major bug fixes: