Portage-SMT-TAS
Traitement multilingue de textes / Multilingual Text Processing
Centre de recherche en technologies numériques / Digital Technologies Research Centre
Conseil national de recherches Canada / National Research Council Canada
Copyright 2004-2021, Sa Majesté la Reine du Chef du Canada
Copyright 2004-2021, Her Majesty in Right of Canada
MIT License - see LICENSE
See NOTICE for the Copyright notices of 3rd party libraries.
Release History
(with a summary of important changes)
Portage-SMT-TAS is the name we use for the open-source release of the Portage
Statistical Machine Translation code base, although it does not constitute a
"Release" as understood in this file. The latest release is still PortageII 4.0.
PortageII is the second generation of SMT software released by the NRC.
log4j patch 2022-01-??
We have analyzed the Portage/PortageII code base to assess its vulnerability
to the log4shell CVE reported in December 2021. We estimate that the risk of
exploiting the log4j vulnerabilities on a machine with Portage/PortageII
is minimal, because the code that uses log4j is not exposed on a PortageLive
run-time server: it is used only during the training of Portage models, and
only to print statistics produced by the model-training software itself,
never user data.
Furthermore, Portage uses the older version of log4j, 1.2.14, which also has
known vulnerabilities, though they are much less severe than the log4shell
CVE found in log4j 2.* (before 2.17).
We therefore do not consider it essential to patch any deployed Portage
systems.
However, as of January 2022, the current main branch is patched for log4j
vulnerabilities.
The relevant changes are:
- Java >= 1.8 is now required, since log4j 2.17+ is not available for earlier
versions.
- structpred.jar has been updated to use log4j 2.17.1 instead of 1.2.14.
(The pre-compiled code for decoder weight optimization is found in
structpred.jar, located in src/rescoring/ and/or installed in bin/.)
Patching instructions:
- if you have PortageII-3.0 or PortageII-4.0 installed, replace the file
structpred.jar, normally installed in $PORTAGE/bin/, with the version in the
current main branch under src/rescoring/.
- if the machine is using java 1.6 or java 1.7, update java to 1.8. On CentOS
7, this can be done by running:
sudo yum install java-1.8.0-openjdk
or
sudo yum install java-1.8.0-openjdk-headless
PortageII 4.0 2018-07-19
This release adds two significant new modules to PortageII:
- Training of Neural Network Joint Models (NNJMs) on your own data
- Incremental document adaptation
Major changes:
- PortageII 3.0 introduced NNJMs, but only supported models that were trained
at the NRC. PortageII 4.0 now provides the training software so you can
train these models on your own data and on your own GPU-enabled PortageII
training server.
Warning: this will not work with the Python you installed for previous
versions of PortageII; it requires Python installed via Miniconda2, as
documented in INSTALL and on the TheanoInstallation page of the user manual.
- Added the incremental document adaptation module, and augmented the API to
support it. With this module, when a translator post-edits a document, they
can push the changes back to the PortageLive server to create a small
document-specific model. Subsequent translation requests for the same
document will benefit from the post-edited sentence pairs that were
previously pushed.
Note: this is not incremental retraining of the global model, only
incremental updates of small, document-specific models.
Warning: requires PHP version 5.4 or newer.
See doc/user-manual/IncrementalDocumentAdaptation.html for details.
- Release of the Portage Generic Model 2.1: ready to deploy, fully trained
systems for French and English in both directions, trained on 43.7 million
sentence pairs. The LM, TM and NNJM from these systems are available to use
as updated pretrained models to combine with your own data to create
en<->fr systems.
Minor changes:
- The new demo page is much improved: written in JavaScript, it communicates
with the server via the SOAP API.
- A new REST API is available, which you can use instead of the SOAP API if
you prefer, especially for incremental document adaptation. However,
it is not feature complete: it supports translating sentences and
paragraphs, but not whole documents at once. (A usage sketch follows this
list.)
- The user manual is now generated using asciidoctor.
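For illustration, here is a minimal Python sketch of how a client might call
such a REST API. The endpoint path, parameter names and response format are
assumptions made for this example only, not the documented interface; consult
the API documentation for the real one.

    import requests  # third-party HTTP client

    # Hypothetical base URL, endpoint and parameter names.
    BASE = "http://localhost/PortageLiveREST"

    # Translate one sentence within a given trained context.
    resp = requests.get(BASE + "/translate",
                        params={"context": "generic.en-fr",
                                "q": "This is a test."})
    print(resp.text)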
Major bug fixes:
- Identified and fixed software vulnerabilities in the PortageLive web pages.
Minor bug fixes:
- Release 3.0.1 did not make the SOAP API as backwards compatible as we
intended: the 3.0.1 API worked with 2.2 and 3.0, but not 2.1 or earlier.
PortageII 3.0.1 2016-10-07
This maintenance release makes the SOAP API backwards compatible, so that
clients written against the SOAP API from Portage 1.4.3 or PortageII 1.0,
2.0, 2.1, 2.2 or 3.0 will work with the PortageII 3.0.1 API.
PortageII 3.0 2016-07-26
This release incorporates significant improvements from the research world
into the core machine translation engine. Although end users will not see
visible feature changes, they should appreciate the significant improvements
to the quality of the translations produced by PortageII 3.0.
Significant changes were made to the training procedure, to the plugins and to
the SOAP API, however. Maintainers and developers should carefully review
doc/PortageAPIComparison.pdf, which shows charts comparing the SOAP API, the
plugin architecture, and the training parameters and recommendations between
versions 1.0, 2.0, 2.1, 2.2 and 3.0.
Note: to take advantage of all these improvements, we recommend that you
retrain all models trained with previous versions of PortageII.
Please consult README for more details about upgrading to 3.0.
Major changes:
- Since the previous version of PortageII, we have conducted extensive
experiments to update our recommended use of all the options available.
Framework defaults have been updated to reflect our new recommendations.
- NNJM decoder feature: added support for NRC's implementation of the Neural
Network Joint Model, the ground-breaking deep learning approach of Devlin
et al (ACL 2014). (A schematic sketch of the model's input follows this
list.)
- New sparse features, including the discriminative reordering model, a
significant improvement over previous reordering models.
- New coarse BiLM features take into account source word classes in the
context and give good empirical results, improving translation quality.
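As background on the NNJM: Devlin et al's model scores each target word given
a few previous target words plus a window of source words centred on the
"affiliated" source position. The Python sketch below only shows how such an
input context can be assembled; it is a schematic illustration of the
published model, with illustrative names and window sizes, not NRC's actual
implementation.

    def nnjm_context(tgt_hist, src, affil, src_window=11, tgt_order=4):
        # Collect tgt_order-1 previous target words and src_window source
        # words centred on the affiliated source position, padding at
        # sentence boundaries (schematic only).
        half = src_window // 2
        window = [src[i] if 0 <= i < len(src) else "<pad>"
                  for i in range(affil - half, affil + half + 1)]
        history = (["<s>"] * (tgt_order - 1) + tgt_hist)[-(tgt_order - 1):]
        return history + window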
Major bug fix:
- A vulnerability was found in plive.cgi that could allow a carefully crafted
URL to execute arbitrary code on the PortageLive server as user Apache.
This has been fixed for PortageII 3.0, and released as a security patch for
2.0, 2.1 and 2.2.
Minor changes:
- Use of generic models is optimized on PortageLive servers, via the new
plive-optimized-pretrained.sh script.
- Arabic is now supported as a source language.
- Added support for SRI format for word alignment files (for fastalign)
- align-words: added GDF and GDFA as self-documenting aliases for
IBMOchAligner 3 and 4.
- General clean up of the eval module, with addition of per-sentence BLEU
and NIST's 2009 BLEU definition, as published in mteval-v13a.pl. (A
sentence-BLEU sketch follows this list.)
- New Zens pruning of phrase tables is quick and simple, and quite effective.
- Added a number of experimental phrase table smoothers
- Sentence aligner, ssal, now runs faster, using Moore's diagonal beam
approach.
- Added -s option to dmcount to sort its output.
- Many new options and features in the canoe decoder, including:
- canoe can now prune its lattices before outputting them, for faster
handling by the lattice MIRA tuning algorithm. (-lattice-* options)
- -filter-features option to use some distortion models as hard
constraints instead of soft ones.
- New walls and zones features allow imposing specific reordering
constraints during decoding. Can be treated as soft constraints (via
-distortion-model) or as hard constraints (via -filter-features).
- The decoder's phrase table limit can now take into account all features,
including the LM heuristic ("-ttable-prune-type full" decoder option).
- Implemented LM context minimization, for faster decoding, following Li &
Khudanpur (2008).
- -nosent option to decode sentence fragments.
- -describe-model-only option describes the model in a canoe.ini file.
- -hierarchy option to create the output files in a hierarchical way when
there are too many to hold in one directory.
- canoe-parallel.sh now supports sentence-level load balancing, helping
parallel decoding finish faster.
- new coarse LM models.
- new RestCostLMs give better LM heuristic during decoding.
- Many design improvements to the decoder (canoe module), with clean up of
code in many places, more flexibility for future extensions, etc.
To mention just a few examples:
- Save the phrase partial score with the phrase info, so it does not need
to be recomputed each time the phrase is used.
- Annotation lists allow phrase pairs to be augmented with arbitrary
annotations within the decoder, making for more flexible and faster
decoder features.
- Singleton pointer to the global vocabulary so it's accessible everywhere
it's needed without needing to be passed all over the place.
- Removed obsolete reversing of phrase tables via #REVERSED.
- Input reader significantly streamlined, making it much easier to use.
- PhraseInfo and ForwardBackwardPhraseInfo classes merged.
- configtool:
- "configtool memmap" now accounts for all models, including BiLMs.
- many new commands to support sparse features and other changes
- filter_models:
- support for "combined" and "full" phrase table pruning types
- -plp switch to have filtered (H)LDM names preserve path info
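As a concrete reference for the per-sentence BLEU mentioned above, here is a
generic smoothed sentence-level BLEU in Python. It uses simple add-one
smoothing of the n-gram precisions; this is an illustrative formulation, not
necessarily the exact definition implemented in the eval module.

    import math
    from collections import Counter

    def sentence_bleu(hyp, ref, n=4):
        # hyp, ref: token lists. Add-one-smoothed n-gram precisions,
        # geometric mean, and the usual brevity penalty.
        log_prec = 0.0
        for k in range(1, n + 1):
            h = Counter(tuple(hyp[i:i + k]) for i in range(len(hyp) - k + 1))
            r = Counter(tuple(ref[i:i + k]) for i in range(len(ref) - k + 1))
            match = sum(min(c, r[g]) for g, c in h.items())
            total = max(sum(h.values()), 1)
            log_prec += math.log((match + 1.0) / (total + 1.0)) / n
        bp = min(1.0, math.exp(1.0 - float(len(ref)) / max(len(hyp), 1)))
        return bp * math.exp(log_prec)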
Minor bug fixes:
- When writing LM files in ARPA format, explicitly write the 0-backoff
weights where they are required; binlm2arpalm can also be used to add them
back to a file where they are missing.
- Python scripts within PortageII now token-split on spaces and tabs only,
like the rest of PortageII does; this fixes a bug found in truecasing.
- Bug in markup_canoe_output: in rare circumstances, a seg fault in
markup_canoe_output made the truecasing module crash, preventing
PortageLive from producing any output. This has been resolved in 3.0 and is
available as an optional patch to 2.2.
PortageII 2.2 2014-12-01
This is a feature release, adding fixed terms handling.
Changes:
- PortageLiveAPI:
- Extended the API with updateFixedTerms to populate a context with fixed
terms.
- Extended the API with getFixedTerms to retrieve a context's fixed terms
list.
- soap.php was augmented to facilitate testing updateFixedTerms and
getFixedTerms.
- Added fixed_term2tm.pl to create a phrase table from a list of fixed
terms. (A sketch of the idea follows this list.)
- framework:
- Updated to prepare new systems to handle fixed terms.
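Conceptually, turning a list of fixed terms into a small phrase table just
means emitting one entry per term pair. The Python sketch below illustrates
the idea only: the actual columns and score values that fixed_term2tm.pl
produces may differ, and the scores shown here are placeholders.

    def fixed_terms_to_phrase_table(pairs, path):
        # pairs: iterable of (source_term, target_term) strings.
        # Write one entry per pair in the conventional
        # "src ||| tgt ||| scores" phrase table layout.
        with open(path, "w") as f:
            for src, tgt in pairs:
                f.write("%s ||| %s ||| 1 1\n" % (src, tgt))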
PortageII 2.1 2014-10-20
This is a maintenance release, rolling in all the small patches that have been
released since PortageII 2.0 was published.
Minor changes:
- PortageLiveAPI:
- Now supports handling markup tags in getTranslation(), the method to
translate one or a few sentences of plain text at a time in a
synchronous way.
- Now supports three interpretations of newline characters, which can be
selected by the user: paragraph end, sentence end, or just plain
whitespace. The output returned follows the same interpretation for
newline characters. (A call sketch follows this list.)
- Support submitting PortageLive translation requests with more than one
thread.
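A SOAP client call might look like the following Python sketch, using the
third-party zeep library. The WSDL location, argument order and newline-mode
value shown here are assumptions for illustration, not the documented
signature; check the PortageLiveAPI documentation for the real interface.

    from zeep import Client  # third-party SOAP client

    # Hypothetical WSDL location and argument order.
    client = Client("http://localhost/PortageLiveAPI.wsdl")

    # Translate two lines, asking (hypothetically) that newlines be
    # treated as sentence ends.
    result = client.service.getTranslation(
        "First sentence.\nSecond sentence.", "context", "newline=s")
    print(result)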
Bug fixes:
- Binary distro creation, as well as PortageLive installation scripts, now
include previously omitted required libraries.
- Fixed bug in PortageLive web layout installation script.
- Detokenizer patch for correct handling of smart-quote style apostrophes for
contractions like 's in English and qu' in French.
- SDLXLIFF handling patched to avoid generating illegal empty sdl:seg-defs
entities.
PortageII 2.0 2013-02-15
This release provides significant user improvements for translators through
the handling of markup tags and the XML Localization Interchange File Format
(xliff) often used to package translation projects.
Major changes:
- Support for the transfer of tags found in the source sentence by applying
them to the corresponding text in the target sentence.
- To enable this functionality, set xtags to true when calling
translateTMXCE() or translateSDLXLIFFCE() in the SOAP PortageLive API.
- To support this functionality, word alignments are now stored in phrase
tables (regular and tightly packed) and included in the decoder's output.
(A sketch of alignment-based tag transfer follows this list.)
- As a side effect, the transfer of source case information in truecase.pl
is improved when the word-alignment is available.
- Support for the xliff file format via the PortageLive API.
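The idea behind tag transfer is to project each source-side tag onto the
target words that the tagged source words align to. The Python sketch below
is a simplified illustration of that projection only; the actual algorithm
must also handle unaligned words, tag nesting and tag pairs.

    def transfer_tags(src_tags, alignment):
        # src_tags: dict mapping a source word index to its tag.
        # alignment: set of (source_index, target_index) word links.
        # Attach each source tag to every target position its word
        # aligns to (simplified: ignores nesting and unaligned words).
        tgt_tags = {}
        for i, tag in src_tags.items():
            for s, t in alignment:
                if s == i:
                    tgt_tags.setdefault(t, []).append(tag)
        return tgt_tags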
Minor changes:
- Support for Danish tokenization and detokenization.
- Updated software for weight tuning is more stable.
- New Huffman encoding class with memory-mapped IO implementation
- Improved the parallelism configuration for sigprune.sh
- joint2cond_phrase_tables, joint2multi_cpt and train_ibm now overwrite
existing files when found, instead of aborting with an error message.
- Support for training and using sentence-level LM mixture adaptation.
- Minor bug fixes.
- Clean up of documentation.
PortageII 1.0 2012-09-04
PortageII marks a significant leap forward in NRC's statistical machine
translation technology. With version 1.0, described here, we bring in
significant improvements to the translation engine itself that result in better
translations, and contributed to our success at the NIST Open Machine
Translation 2012 Evaluation (OpenMT12), as well as a number of other
improvements we have made to the system.
Major changes:
- Added the key features of our NIST 2012 Chinese system:
- Tuning decoder weights using Batch Lattice MIRA (new tune.py script), as
well as other tuning algorithms. We have done extensive testing of
various tuning methods, and Lattice MIRA works best, yielding significant
gains over N-best MERT, our previous default, and reliably beating other
methods. (For details, see Cherry and Foster, NAACL 2012.)
- Tuning rescoring weights using N-Best MIRA (-a mira option to rat.sh).
- HLDM: Hierarchical lexicalized distortion models, following Galley and
Manning (EMNLP 2008) and Cherry, Moore and Quick (WMT 2012). Includes an
ITG parser used during decoding, which also provides a new distortion
limit implementation, -dist-limit-itg.
- Framework support for lattice mira and HLDMs, both enabled by default.
- Framework support for the use of IBM4 alignments produced by an external
tool: we obtain our best results when we combine phrase pair counts and
HLDM counts obtained from IBM4, HMM and IBM2 alignments, rather than
applying only the single best alignment method.
- Alignment indicator features tell the decoder which alignment(s) provided
each phrase pair, allowing the tuning algorithm to weigh the aligners'
opinions automatically.
- MixTM: linear mixture of phrase tables for domain adaptation, with proper
support in the framework, as well as improved framework support for MixLMs.
In both cases, framework support for combining externally trained generic
LMs and/or TMs with in-domain data, to get improved results, especially
for smaller in-domain training corpora.
- Framework and PortageLive support for Chinese->English pipeline.
- For French->English and English->French: added generic models that can be
combined with in-domain corpora to supplement them and improve the
translation quality. This can help for all domains, but especially for
those with smaller amounts of training data: the generic models will help
the system handle the general constructs of the language, while the
in-domain material will provide the domain-specific vocabulary and
constructions.
Changes of intermediate importance:
- New -filter-singletons switch to train_ibm reduces the memory bubble at the
beginning of the IBM1 model training procedure for large corpora.
- Speed up significance pruning by roughly an order of magnitude.
- Reduce by a factor of two the amount of memory required for the creation of
Tightly Packed Suffix Arrays (TPSAs), and therefore also for significance
pruning.
- Phrase table filtering and loading now takes linear time in the size of the
phrase table, and no longer has a quadratic term in the number of target
phrases per source phrase.
- New decoder features and options:
- -dist-limit-simple option, which helps at least for Chinese.
- minimum diversity criterion added to coverage pruning
- BiLM (following Niehues et al, WMT-2011)
- carry joint count information in the phrase table (for future use)
- carry alignment information in the phrase table, and optionally include
it in the decoder output.
- unal features count the number of words left unaligned in phrase pairs
(following Guzman et al, MT Summit 2009); see the counting sketch after
this list.
- distortion limit based on ITG constraints
- LeftDistance experimental distortion model
- -maxlen
- -diversity and -diversity-stack-increment for regular stack decoding
- Forced decoding is now done by canoe itself, using all available models.
This functionality replaces the obsolete phrase_tm_align program, which was
much more limited in functionality.
- TPPTs can now store fourth column (adirectional) scores and joint counts.
- More phrase table smoothers, enhanced phrase smoother library, and the
ability to generate adirectional scores via joint2multi_cpt.
- Framework support for optionally tuning and testing using multiple tuning
variants, which are typically 90% sample subsets of the main tuning set.
This is helpful for assessing quality when using less stable tuning methods
such as N-best MERT, but the current default, Lattice MIRA, is quite stable.
- Support for pre-loading models into memory in PortageLive using a new
priming script (prime.sh).
- PortageLive support in the framework for the new model features (MixLMs,
MixTMs, HLDMs, etc.).
- The global word-alignment model option helps MixTM handle small components
better.
- An alternative alignment symmetrization method, grow-diag-final-and, which
is becoming the standard in SMT and yields better results, has been
implemented in PortageII and is now enabled by default.
- When PortageLive processes a TMX file, as well as when tmx2lfl.pl extracts
training data from one, handle Trados and MS Word style non-breaking and
optional hyphens better: replace the non-breaking hyphen by a regular one,
and remove the optional one.
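To make the unal features above concrete: given the word links retained
inside an extracted phrase pair, the unaligned counts are simply the source
and target positions that no link touches. A minimal sketch (illustrative
only; the actual features may be split by side or position):

    def unal_counts(src_len, tgt_len, links):
        # links: set of (source_index, target_index) word-alignment
        # pairs within one extracted phrase pair.
        src_aligned = {s for s, _ in links}
        tgt_aligned = {t for _, t in links}
        # Positions on each side that carry no alignment link.
        return (src_len - len(src_aligned), tgt_len - len(tgt_aligned))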
Minor changes:
- New -w switch to joint2multi_cpt.
- New -sort switch to ibmcat.
- -w switch to translate.pl
- filter_models is smarter about not doing or redoing unnecessary work.
- tokenizer: new -pretok option for text that is already tokenized; better
handling of sequences of periods; and make sure -notok -ss and -pretok -ss
don't modify tokens, but only split sentences.
- MagicStream now always opens gzipped files using the gzip library in boost
instead of a pipe: this has been shown empirically to be faster, and it
prevents crashes due to the standard implementation of fork() when working
close to memory limits.
- Improved stability for run-parallel.sh.
- Significance pruning calculates the significance level with better precision,
thanks to boost's high-precision implementation of lgamma().
- align-words -H now documents word-alignment output formats.
- Removed some obsolete scripts and source code files.
- Various optimizations, code clean-up, resolve Klocwork warnings, etc.
- New python coding conventions and utility library portage_utils.py.
- New "-s trimmed", "-s max", -t switches to summarize-canoe-results.py, as
well as support for current framework structure.
- Framework can calculate OOV rates for test translations.
- PortageLive now supports specifying the language and country codes for the
TMX files separately for each context.
- PortageLive now has proper support for counting words in Chinese text:
following standard text processing software, we count each Chinese
character as a word (see the sketch below).
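A minimal Python sketch of this counting rule. The CJK character range used
here is an assumption for illustration; the exact ranges PortageLive
recognizes are not specified in these notes.

    import re

    CJK = u"[\u4e00-\u9fff]"  # basic CJK block, illustrative only

    def count_words(text):
        # Each CJK character counts as one word; everything else is
        # counted as whitespace-separated tokens.
        cjk_chars = re.findall(CJK, text)
        other_tokens = re.sub(CJK, " ", text).split()
        return len(cjk_chars) + len(other_tokens)

    print(count_words(u"Hello 世界"))  # 3: one token plus two characters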
Minor bug fixes:
- TPSAs, significance pruning and other parts of the tpt/ module now respect
the convention that tokens in tokenized text are separated by one or more
space or tab characters.
- Make sure the future score of a complete decoder hypothesis is 0.
- rescore_train in PER and WER mode: the stopping criterion was broken and has
been fixed.
- In forced decoding (previously phrase_tm_align), under some situations the
target sentence might not have been fully covered; this has now been fixed.
- In PortageLive setup, pre-linking of external libraries must be undone when
making the server copy.
Portage 1.x is the first generation of SMT software released by the NRC. We
keep its history here to document how our SMT technology evolved over the
years.
Portage 1.4.3 2011-10-03
This release is primarily intended to distribute our improved truecasing module,
and its integration in the experimental framework. It also includes
improvements we've made in various parts of the system since the last release.
Major changes:
- Improved truecasing workflow, which takes into account casing information
from the source sentence.
- The framework is adjusted to use the new truecasing workflow by default, or
optionally the old one.
- Significance pruning is now available (following Johnson et al, EMNLP 2007),
and integrated in the framework.
- The framework now supports using a phrase table merged from the IBM2 and HMM
ones, instead of using them separately. This is now the recommended
procedure, and the default. When you move to the 1.4.3 framework, you will
notice that PT_TYPES is now set to "merged_cpt" by default, instead of
"ibm2_cpt hmm3_cpt", which would restore the old behaviour.
Changes of intermediate importance:
- The tokenizer and detokenizer (utf-8 version only) now support Spanish.
- TPPT (Tightly Packed Phrase Table) and TPLM (Tightly Packed Language Model)
generation is now significantly faster and requires less memory.
- Added OS X (Darwin) support.
Minor changes:
- When PortageLive processes a TMX with confidence estimation enabled, the CE
score is included in every translation unit as a property of type "Txt::CE".
- PortageLive now allows overridden programs to exist in each model's /bin and
/lib directory, thus allowing one to have model-specific versions of some
programs.
- The boost and zlib libraries are now linked dynamically.
- Linking with TCMalloc, a fast memory allocator, is now supported.
- gen_phrase_tables has better memory and performance behaviour on large
training corpora.
- TPSA (Tightly Packed Suffix Array) generation now requires less memory.
- lm_eval now optionally outputs per-sentence log-prob or perplexity.
- gen_phrase_tables/joint2cond_phrase_tables now support -prune1w:
length-dependent phrase table pruning.
- parallelize.pl supports a new striped mode with fewer temporary files.
- A few more test suites were added.
- New time-mem-tally.pl script makes "make time-mem" faster in the framework.
- New merge_multi_column_counts program can merge Lexicalized Distortion Model
count files.
- New script summarize-canoe-results.py makes it easier to see and compare the
result of multiple experiments and/or system training runs.
- Better dependency and sanity checking at installation time.
- lzma is supported but no longer required.
- Support multiple training file pairs in gen-jpt-parallel.sh.
- wc_stats is now about 2.5 times faster.
- Removed obsolete single prob phrase table format (gen_phrase_tables -tmtext
and canoe -ttable-file*).
- canoe -h and -options output improved.
- canoe now has a -use-ftm switch to enable forward translation models with
default weights, instead of having to use -ftm with the right number of 1's.
Minor bug fixes:
- run-parallel.sh is now more stable (it would sometimes hang if the file
system was very busy: this is fixed).
- Fixed problem with TPLMs where, under rare boundary conditions, a parameter
of the language model might be lost.
- When PortageLive received text that collided with the "Magic number" of
some MIME types, the file was not recognized as plain text; this is now
fixed.
- The tpt module did not allow UNK to appear as a real token; now it does.
- cow.sh is no longer limited to dev sets with fewer than 10000 sentences.
- canoe -lattice output is now about 8 times more compact: recombined nodes
were previously unnecessarily expanded; now they are correctly kept as a
single node in the lattice output.
- On rare occasions, the presence of ||| in the training corpus would cause an
invalid phrase table to be generated; this is now fixed.
Portage 1.4.2 2010-10-01
This is a maintenance release.
Minor changes:
- The detokenizer for French no longer glues % to numbers.
- The new script fix-en-fr-numbers.pl patches numbers copied from English
to French by reformatting them using French conventions. Intended for use
in postprocess_plugin.
- Support for gcc 4.5.1.
- The HMM aligner now splits very long sentences into chunks of <= 200 words,
to avoid running an excessively long time on very long sentences, which are
mostly useless for this model anyway.
- When inserting Portage translations back into a TMX file, ce_tmx.pl now sets
TU attribute usagecount to 0 and deletes BPT, EPT, IT and PH elements
(native formatting codes) unless -keeptags is set.
- Made n-best list management in cow.sh more efficient.
Minor bug fixes:
- Added missing implementation of vocab-filtering of binary TTables.
- Fixed a rare crash situation in textpt2tppt.sh.
- Made arpalm2tplm.sh robust to extra white space in the ARPA LM files.
- Fixed cow.sh to remove unintended 10000 line limit in dev set.
Portage 1.4.1 2010-07-30
This is a maintenance release, fixing several issues in v1.4.0.
Minor changes:
- PortageLive now supports installing multiple contexts on the same server,
via both the CGI and SOAP interfaces. The SOAP interface has been enhanced
to let the user specify which context to use, whether confidence estimation
is required, and to handle TMX files. The CGI interface has several
improvements.
- In PortageLive, the duplicate copy of the SOAP code for secure servers has
been removed, and replaced with a mechanism that generates it automatically.
- When packaging models for a PortageLive context, tmtext-apply-weights.pl is
no longer used by default, since it is unstable and the gain is not that
significant.
- Dependencies on bash extensions are now made explicit, so that Portage can
run on Ubuntu even when /bin/sh is dash.
- Portage is now compatible with g++ 4.4.4 and 4.5.0, boost 1.43 and bash 4.
- New script filter-long-lines.pl can be used to filter excessively long lines
from a parallel corpus.
- plog.pl now outputs statistics in a clearer format.
- Improved installation instructions, in particular regarding dependencies.
Bug fixes:
- Fixed crash in gen_phrase_tables and a few other programs when compilation
with ICU is disabled and the user locale is *.utf-8.
- Fixed issue where TPTs might be corrupted when the file was larger than 4GB.
- Fixed several more minor issues.
Portage 1.4.0 2010-05-31
This update incorporates improvements intended to help the performance of
Portage as an online translation service, as well as scientific progress we've
made in the last year.
PORTAGEshared has been renamed "Portage" with a version number, i.e., this
package is now known as "Portage 1.4". References to PORTAGEshared within the
documentation or the code are considered to refer to Portage 1.4, or to
PORTAGEshared 1.0 to 1.3 if the context refers to older versions.
Note: with 1.4.2, the original 1.4 was renamed 1.4.0, as it should have been
named in the first place. Now, 1.4 refers to any of the 1.4.x updates.
Major changes:
- Tightly Packed Tries (TPT) (see Germann, Joanis and Larkin, SETQA-NLP 2009)
use memory mapped IO for optimized access to highly compact representations
of models. When used together, TPLMs (Tightly Packed Language Models),
TPPTs (Tightly Packed Phrase Tables) and TPLDMs (Tightly Packed Lexicalized
Distortion Models) reduce the decoder start time to nearly nothing, while
maintaining good decoder speeds. Furthermore, when a translation server
uses this technology, the file caching mechanism of the operating system
holds in memory what was read by a previous instance of the decoder, so that
once the server has translated a few sentences, a significant speed gain can
be observed as disk access becomes less and less necessary. This technology
is ideal for a live translation server, whether it is delivered as a web
service or otherwise. Tightly packed models are integrated in the decoder,
the rescoring module, the confidence estimation module, the truecasing
module, i.e., everywhere these models can be used.
- PortageLive. Portage now comes ready to deploy as a translation server, via
a web service using SOAP, via a web page, or connecting to the translation
server via ssh. Documentation is included on how to do so as a Virtual
Appliance that can run on any virtual machine architecture (from local
infrastructure to cloud computing), or on a dedicated translation server.
The TPT technology mentioned earlier significantly reduces the memory
requirements for such deployment. See Paul et al (MT Summit 2009) for
details (paper accompanying the Technology Showcase, available at
http://www.mt-archive.info/MTS-2009-TOC.htm).
- The peak memory required to train a phrase table has been reduced by about
50%: instead of invoking only gen_phrase_tables, one can parallelize the
counting process (the first half of what gen_phrase_tables does) with
gen-jpt-parallel.sh and invoke joint2cond_phrase_tables -reduce-mem to run
the estimation process with a small memory footprint.
- Experimental framework:
- The framework was enhanced in many ways, reflecting the changes in the
software, incorporating new modules, and following the evolution of the
recommended procedures.
- Resource monitoring: the framework now keeps track of resources used at
all stages of processing, to identify peak memory usage more readily, and
to be able to see where the time is spent. The major scripts in Portage
track memory usage and CPU time: cow.sh, rat.sh, cat.sh, run-parallel.sh,
canoe-parallel.sh, etc. The utility script time-mem is used to do the
same for other programs and program suites, and can be used outside
Portage as well.
- Tutorial: the framework-toy.pdf document has been revised to reflect
current code, and improved as a general tutorial for Portage.
- New decoder features:
- Adirectional scores for phrase pairs are now supported as decoder
features. Previously, Portage only supported forward and backward scores,
intended to model P(t|s) and P(s|t), respectively. Adirectional features
allow the use of arbitrary functions f(s,t) associated with each phrase
pair, without any implied semantics. The association features of Chen,
Foster and Kuhn (MT Summit 2009) are an example of such features.
The adirectional scores are stored as the "4th" column in multi-prob
phrase tables. See the user manual for details.
- Lexicalized distortion models (see Koehn et al, ACL-2007)
- Levenshtein or N-gram distance from a reference, useful especially for the
fuzzy mode in phrase_tm_align. By default, phrase_tm_align only returns
an alignment if phrase pairs exist to exactly cover the two sentences to
align. In fuzzy mode, phrase_tm_align is allowed to consider other
translations of the source that are close to the target, where "close" is
defined as having low Levenshtein or N-gram distance. This distance is
used as a feature in the log-linear model, and optimized jointly with the
rest of the model by the decoder. (A Levenshtein sketch follows this list.)
- New script tmx2lfl.pl designed to extract parallel corpora in plain text
from TMX (Translation Memory eXchange) files.
- New script tmtext-apply-weights.pl pre-applies log-linear weights learned
using cow.sh to create more compact models for use by a translation server.
- The phrase extraction process (gen_phrase_tables) has been modified to
require that a phrase pair have at least one actually linked word pair.
(Previously, unaligned words were allowed to be considered a phrase pair.)
- The new program joint2multi_cpt solves the "phrase hole" problem. When
multiple phrase tables are used together, phrase pairs that appear in one
but not all tables are penalized too aggressively by default, yielding what
we call the "phrase hole" problem. This problem is especially severe when
one table is much smaller than the other; often, cow.sh learned to strongly
discount that table's opinion because it was not appropriately smoothed.
joint2multi_cpt solves this problem by smoothing phrase tables considered as
a set, giving reasonable smoothed estimates in each table for all phrase
pairs, including those appearing only in other tables.
- MERT (cow.sh): improved stability of the search for optimal parameters.
See Foster and Kuhn (WMT 2009) for details.
- Confidence Estimation. Portage now comes with a module that produces
confidence estimates accompanying the decoder output. See Simard and
Isabelle (MT Summit 2009) for details.
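For reference, the Levenshtein distance used by the fuzzy mode above is the
standard dynamic-programming edit distance, shown here over token lists.
This is a textbook sketch, not Portage's implementation.

    def levenshtein(a, b):
        # Edit distance between token lists a and b, computed row by row.
        prev = list(range(len(b) + 1))
        for i, x in enumerate(a, 1):
            cur = [i]
            for j, y in enumerate(b, 1):
                cur.append(min(prev[j] + 1,               # deletion
                               cur[j - 1] + 1,            # insertion
                               prev[j - 1] + (x != y)))   # substitution
            prev = cur
        return prev[-1]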
Minor changes:
- Word alignment:
- New "sri" alignment reader (for word_align_tool, eval_word_alignment).
- New "gale" and "uli" alignnment writers.
- New word aligners: IBMDiagAligner, HybridPostAligner, ExternalAligner.
- Bug fix in the HMM alignment model: the end-distribution semantics is a
bit cleaner, and p0/up0 can now be arbitrarily high, because the effective
p0 is capped at .999. (High p0/up0 have been found to be good for the
phrase extraction process, so this is important.)
- Sentence alignment (ssal):
- added support for IBM1 models and documented a multi-pass procedure for
producing improved sentence alignments.
- new hard boundaries within the input text allow handling a collection of
distinct documents collected together in a single file pair, as well as
the use of external sources of information such as section or paragraph
boundaries.
- New phrase smoothers in gen_phrase_tables and joint2cond_phrase_tables:
JointFreqs, alpha-smoothing option to RFSmoother and IndicatorSmoother
- canoe decoder:
- The forward TM scores are now included in future score calculation.
- Now supports the -bind PID option, exiting automatically when the
master process disappears.
- Faster 1-best decoding by discarding recombined states on the fly.
- All boolean options now have a -no variant so that something turned on in
the canoe.ini can be turned off on the command line.
- Rescoring / MERT:
- New features: Levenshtein distance and N-gram distance to a specified
reference.
- Various improvements to the module in general.
- Tokenizer:
- Now reliably fast, taking linear time regardless of paragraph length (it
used to be quadratic in the length of each paragraph).
- Supports a -notok switch to perform sentence splitting only.
- Many improvements to run-parallel.sh.
- parallelize.sh: new -w switch to determine the number of blocks from a
minimum number of lines per block instead of a fixed number of blocks, with
the -n switch capping the number of blocks used.
- Adaptation is no longer dependent on SRILM programs.
- New length-dependent phrase table pruning option to filter_models yields
better results than the fixed ttable-limit decoder parameter.
- Miscellaneous new programs:
- binlm2arpalm converts an LM file from our binary format to the standard
ARPA format.
- ibmcat displays binary word-alignment model files in plain text.
- wc_stat is like wc but displays more statistics.
- al-diff.py compares different sentence alignments for a given text.
- Save dependency information per source file during compilation instead of
re-processing each file to generate Makefile.depends.
- More unit-testing test suites.
- Some code clean-up using Klocwork Insight, fixing potential future problems.
- Some code documentation clean-up.
- Logs now go consistently to STDERR, even for programs without a primary
output on STDOUT.
- MagicStream: our library that handles reading and writing compressed files
on the fly uses gzip by default, but now falls back to using zlib (via the
boost::iostreams library) when gzip fails to start due to memory limits.
- New script canoe-timing-stats.pl helps track time spent loading models
versus time spent actually doing translation; cow-timing.pl summarizes the
time spent in the various parts of cow.sh.
- New module textutils/ groups together utility scripts and programs for basic
text manipulation, making them easier to find.
- Truecaser: fixed issues for handling different encodings.
- Quieted the copyright notices that were printed much too often.
- The obsolete src/api/ directory was deleted; PortageLive supersedes it.
PORTAGEshared 1.3 2009-01-21
This update to PORTAGEshared is primarily intended to incorporate the new HMM
word alignment module, and related functionality. We have also taken the
opportunity to migrate many improvements from Portage, add a new experimental
framework, and improve the documentation.
Major changes:
- HMM word alignment models, including a number of variants. We have
implemented the base model described in Och and Ney (CL, 2003), a class
based variant also based on Och's work (though our implementation is based
on the baseline system description in He, ACL/WMT-2007), as well as the
variants described by Liang, Taskar and Klein (HLT-2006), including their
symmetrization method, and He's (WMT-2007) lexicalized MAP (Bayesian) model.
gen_phrase_tables also has a new PosteriorAligner based on Liang et al
(HLT-2006).
The HMM word alignment models are trained using train_ibm, they can be used
for word alignment directly via align-words, and gen_phrase_tables can use
them to generate phrase tables. In our current state of the art, we
typically perform word alignment using IBM2 and HMM models separately, then
use the resulting phrase tables with cow.sh for maximum BLEU training,
either separately or merged together into one table.
- Added a generic HMM toolkit, used by the HMM word alignment models,
supporting state or arc emitting HMMs, and implementing the Viterbi and
Baum-Welch algorithms. (Optimized for densely connected HMMs; a Viterbi
sketch follows this list.)
- Parallelized training of IBM1/2/HMM word alignment models via the new cat.sh
script, and a binary format for TTables and all intermediate count files,
for fast reading and writing of these model and count files.
- An experimental framework is now included with PORTAGEshared, as a potential
starting point for your experiments. Besides demonstrating how to use
PORTAGEshared, this framework embeds option choices which we think are
reasonable defaults.
Previously, the only full usage examples we provided were not suitable for
this use. The toy example was designed to run fast at all costs, regardless
of the quality of the output, while the small-vocabulary regression test
suite was mostly intended to exercise the code. The new framework is
specifically designed to be both a tutorial and a reasonable starting point.
Of course, you will still need to experiment in order to optimize
performance for your setting.
Even if you used PORTAGEshared before, we recommend you read
framework-toy.pdf in the framework directory, as it includes a full
description of how to use PORTAGEshared, including important features which
are not all highlighted elsewhere. If you have built your own experimental
framework, you may find useful suggestions when following the toy example
described in this document.
- We've now included our truecasing module, which no longer requires external
software at truecasing time. The Perl script truecase.pl performs
truecasing using canoe. The program compile_truecase_map compiles the
truecase map for truecase.pl. Training the Language Model itself requires
an external language modelling toolkit, e.g., SRILM (if your licensing
requirements permit it) or IRSTLM.
- Many improvements to run-parallel.sh, useful if you're working on a cluster:
- uses Perl sockets instead of netcat, resulting in a significant reduction
of overhead, from seconds down to hundredths of seconds per job, and one
fewer dependency on external software;
- now more stable, with more coherent behaviour in case of errors, which can
be controlled by the user via the new -on-error switch;
- more thorough clean up at exit time or in case of errors;
- all temporary files are hidden away in a workdir instead of polluting the
directory run-parallel.sh is invoked from (most scripts using temporary
files now do the same too);
- new -c switch to run a single command via psub/qsub, acting as a blocking
qsub for clusters that don't support blocking qsub - this is useful to
have a Makefile run commands on a cluster via psub/qsub;
- number of CPUs requested by the master job is propagated to the workers;
- on clusters running Torque, take advantage of the job array feature to
speed up and reduce the overhead of worker submission via psub/qsub.
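For reference, the Viterbi algorithm implemented by the toolkit finds the
most probable state sequence for an observation sequence. Below is a
textbook Python sketch, dictionary-based for clarity; the toolkit itself is
optimized for densely connected HMMs and also implements Baum-Welch.

    def viterbi(obs, states, start, trans, emit):
        # start[s]: initial probability of state s;
        # trans[q][s]: transition probability q -> s;
        # emit[s][o]: probability of emitting observation o in state s.
        V = [{s: start[s] * emit[s][obs[0]] for s in states}]
        back = []
        for o in obs[1:]:
            row, ptr = {}, {}
            for s in states:
                p, q = max((V[-1][r] * trans[r][s], r) for r in states)
                row[s], ptr[s] = p * emit[s][o], q
            V.append(row)
            back.append(ptr)
        best = max(states, key=lambda s: V[-1][s])
        path = [best]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        path.reverse()
        return path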
Minor changes:
- User configuration is now centralized in src/Makefile.user-conf.
- We've added some unit testing and a unit testing framework, using CxxTest,
run automatically by doing "make test" in src/.
- We've moved our legacy test programs into subdirectories of the source code,
run automatically by doing "make test" in src/.
- Make the code compile with g++ 4.3 without any warnings.
- Streamlined C++ includes, in part to speed up compilation.
- Use tr1::unordered_map instead of the soon to be deprecated
__gnu_ext::hash_map.
- Various code refactorings to make maintenance and documentation easier,
including improvements to the compilation mechanism, removal of doxygen
errors, and many more small details.
- New section in the documentation with the usage info from all programs.
- A few programs now have a -final-cleanup switch: for efficiency, we often
don't delete models just before exiting, since the OS does so immediately
after exiting. In programs that support it, the -final-cleanup switch
deletes all models; useful for memory leak detection and other debugging.
- Removed obsolete champollion.breakparts.pl, unsplit-sentence.pl,
merge_ttables, maxphrase.pl, CalculateHypothesisProb.pl, and
find_sentence_phrases.
- Utils modules (src/utils):
- New utf8 casemapping functionality (requires ICU)
- Program utf8_filter performs strict validation of utf8 input.
- diff-round.pl now supports compressed files automatically, and has new
-sort, -q and -min options.
- New template class BiVector models a vector with positive and negative
indices.
- Fixed memory leak in short array allocator ArrayMemPool.
- Added support for .lzma files in all C++ programs, via MagicStreams.
- Preprocessing module (src/preprocessing):
- Significantly improved the tokenizing and detokenizing for French;
slightly improved for English as well. Better lists of abbreviations in
both languages. Smart quotes and other characters from the cp-1252
repertoire are now recognized, and optionally replaced by the closest
iso-8859-1 characters.
- New udetokenize.pl script performs detokenization on French and English
text encoded in utf8.
- Language Modelling module (src/lm):
- New caching mechanism for LM queries, intended for use with expensive LM
classes. Currently only enabled for LMMix (Dynamic LM mixture model).
- Minor improvements to ngram-count-big.sh: reports errors more carefully and
removes the merge tree since we noticed a single multi-way merge is faster.
- Refactored the LM classes to make adding new ones easier; added a new
Factory Method creator object for each class.
- New LM class: LMDynMap. Used for dynamic mapping of case or numbers.
- Tally statistics over LM queries, used by canoe in particular.
- Renamed lmtext2binlm to the more precise name arpalm2binlm.
- New script lm_sort_filter.sh sorts an ARPA-format LM in a way that
typically increases its compression ratio with gzip.
- New script lm-order.pl determines the order of an ARPA-format LM file.
- Translation Modelling and Word Alignment module (src/tm):
- New -prune1 option to gen_phrase_tables and joint2cond_phrase_tables
prunes long tails before calculating probabilities. Especially useful for
phrase tables trained on noisy corpora.
- New program word_align_tool allows manipulation and conversion of word
alignment files.
- Support for more word alignment formats via a generic module,
word_align_io, which is easily extensible to support further formats.
- Use smoothing for OOVs consistently in all alignment model queries.
- New file handling-unaligned-words.txt explains how unaligned words are
handled in the phrase extraction process.
- New program eval_word_alignment calculates F-measure with respect to a
reference alignment, as suggested by Fraser and Marcu (CL, Sept 2008).
- New merge_counts program does fast merging of counts; used by
gen-jpt-parallel.sh to remove large memory requirement at the end.
- Eval module (src/eval):
- Refactored PER, WER and BLEU calculations to support PER and WER
optimisation in rescore_train.
- bleucompare can now perform PER or WER calculations instead.
- Support for NIST style BLEU computation by setting the environment
variable PORTAGE_NIST_STYLE_BLEU (has different brevity penalty
definition).
- Decoding module (src/canoe):
- canoe-parallel.sh now supports the use of load balancing in -append mode.
- New preprocessing script canoe-escapes.pl adds or removes escapes expected
by canoe as needed.
- Bug fix in -soft-limit mode for filter_models keeps phrase pairs that
were incorrectly deleted before. This sometimes results in higher memory
requirements in canoe, which can be addressed via gen_phrase_tables's new
-prune1 switch.
- New filter_models options: -ttable-limit overrides the limit in the
canoe.ini file, -no-per-sent disables per-sentence LM filtering, using
the less effective global-vocabulary LM filtering instead.
- New rule decoder feature allows one to set weights for canoe markups via
-rule-weights, and tune them in cow.sh, instead of using hard-coded
weights. Supports multiple classes of rules with their separate weights.
- Robustness fix: canoe now accepts non-finite numbers in phrase tables,
issuing a warning and treating them as if they had been 0.
- Rescoring module (src/rescoring):
- New rescoring features: RatioFF, HMMTgtGivenSrc, HMMSrcGivenTgt,
HMMVitTgtGivenSrc, HMMVitSrcGivenTgt, WerPostedit, PerPostedit,
BleuPostedit, BackwardLM.
- Generic rescoring feature SCRIPT invokes any script of your choice, for
easy creation and prototyping of new features, as well as integration
of features not part of PORTAGEshared.
- New program uniq_nbest removes duplicates in an n-best list.
- cow.sh and rescore_train now optionally optimize WER or PER instead of
BLEU.
- New micro tuning mode for cow.sh looks for per-sentence optimal weights
for a few iterations before looking for globally optimal weights.
- Other new cow.sh options: -rescore-options, -no-lb, -s.
- rescore_train -l saves a log of Powell runs for tracing the optimisation
process; -rf randomizes the order in which features are considered by
Powell's algorithm.
- New script cowpie.py extracts useful statistics from a cow.sh log.
- rescore_translate now supports Minimum Bayes Risk rescoring.
- Rat.sh now uses hard phrase table filtering before translation, thus
reducing memory requirements in canoe (can be disabled with -no-filt).
- New rat.sh options: -rescore-opts, -per, -wer, -dep, -no-filt.
- gen-features-parallel.pl converted to Perl (from bash) and made more
stable.
PORTAGEshared 1.2 2008-01-28
This is a significant update to PORTAGEshared, incorporating most of the
changes we have made to Portage since the initial release of PORTAGEshared.
Major changes:
- Soft TM filtering: joint filtering of several phrase tables in such a way
that, no matter what weights are used, the top L hypotheses will have been
kept, i.e., discards entries that can never make it to the top L, under any
set of non-negative weights. Also described in Badr et al (CORES-2007).
(A filtering sketch follows this list.)
- Used by cow.sh when the -filt option is specified (recommended).
- LM filtering based on per-sentence-vocabulary, as described in Badr et al
(CORES-2007) (see the annotated bibliography in the user manual for all
paper references). In short, keeps an n-gram only if all the words it
contains can occur together in the translation of at least one source
sentence. Typical LM filtering uses one global vocabulary; this technique
efficiently keeps track of a separate vocabulary for each input sentence to
translate. In decoding, this approach can save some 25% of the memory
required for large LMs, or as much as 50% when combined with soft TM
filtering. In lm_eval and the LM rescoring function, significantly higher
savings are possible. Works with both the text (ARPA) and our BinLM file
formats.
- Automatically used by canoe while loading language models;
- automatically used by the rescoring module for the NgramFF feature;
- used by lm_eval when the -per-sent-limit option is specified.
- Implemented Huang and Chiang's (ACL-2007) cube pruning algorithm. Can yield an
order-of-magnitude speed up in decoding in most circumstances. Requires
careful re-tuning of some decoding parameters, however, especially S (stack
size) since its meaning is not the same as with regular decoding. Run canoe
-h for details on enabling cube pruning.
- New module implementing George Foster's LM and TM adaptation work, with
integration of the resulting mixture models in the decoder - details in
Foster and Kuhn (WMT-2007).
- Optimization of various programs throughout the Portage suite, including
canoe.
- Significantly optimized string splitting routines (src/utils/str_utils.h)
and consequently the loading of many types of input and data files.
In particular, the new Voc::addConverter functor directly converts a
sentence or a phrase from a string to a vector<Uint> with no intermediate
storage, yielding a noticeable speed up in several programs.
- Implemented new language model heuristics for decoding, including
"incremental", the default in several other MT systems, and now also the
default in PORTAGEshared.
- Monotonic decoding with phrase swaps can now be done using canoe's
"-distortion-limit 0 -dist-phrase-swap" combination of options, optionally
using the new PhraseDisplacement distortion model instead of, or in
combination with, the standard distortion penalty (WordDisplacement).
- New IBM1Forward decoder feature.
- The main rescoring script, rat.sh, was overhauled to be easier to use. The
model is now specified with the same syntax as for rescore_train (documented
in rescore_train -H): rat.sh transparently handles generating the features,
managing temporary files (now all tucked away in a working sub-directory)
and giving rescore_train an appropriately transformed model file. See
rat.sh -h for details and test-suite/regress-small-voc/28_rat_train.pl for
an example rescoring model in this simplified syntax.
- New rescoring features (run rescore_train -H for the full list):
- IBM1DocTgtGivenSrc calculates p(tgt-sent|src-doc), using a file of docids
to determine what parts of the source file constitute documents (the
docids file should have one line for each line in the source text,
containing an ID in any format (no whitespace allowed); lines with
identical IDs are considered to come from the same document).
- nbest*Post* - posterior probability features for confidence estimation
rescoring - see Ueffing and Ney (HLT-EMNLP 2005), Zens and Ney (WMT-2006).
These papers not included in our annotated bibliography provide even more
background and depth:
- Blatz et al. (2003). Confidence Estimation for Machine Translation.
JHU/CLSP Summer Workshop.
- Ueffing (2006). Word Confidence Measures for Machine Translation.
Ph.D. thesis.
- Consensus and ConsensusWin - WER-based consensus over N-best list (very
expensive features - not recommended for general use) - features based on
Mangu et al (1999).
- BLEUrisk - Minimum Bayes Risk using BLEU loss function - see Kumar and
Byrne (HLT-NAACL 2004).
- ParMismatch and QuotMismatch - count mismatched parentheses and quotes.
- CacheLM - cache LM over docs defined in docid files (see above) - see
Kuhn and De Mori (1990-2).
- Overhauled the regress-small-voc test suite:
- exercises more aspects of the code;
- includes two top-level scripts, one to run a minimal end-to-end suite, and
a second one that also runs various extensions;
- renumbered scripts so that they can be run in numerical sequence, as was
originally the intention.
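To illustrate the soft TM filtering criterion described at the top of this
list: an entry for a given source phrase can be discarded only if at least L
other entries are at least as good on every feature (and strictly better on
some), since then no non-negative weight vector can place it in the top L.
Below is a simplified Python sketch of that test, for one source phrase and
one table; the real implementation works jointly over several phrase tables
and is far more efficient.

    def soft_filter(entries, L):
        # entries: list of per-entry score tuples for one source phrase,
        # where higher is better on every feature (e.g. log probs).
        kept = []
        for i, e in enumerate(entries):
            dominators = sum(
                1 for j, f in enumerate(entries)
                if j != i
                and all(fv >= ev for fv, ev in zip(f, e))
                and any(fv > ev for fv, ev in zip(f, e)))
            if dominators < L:
                kept.append(e)
        return kept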
Major bug fixes: