Version History
0.90 29 Jun 95 first working code, n-gram models only
0.91 02 Aug 95 snapshot for fosler@icsi, minor bug fixes
0.92 13 Aug 95 added BayesMix, VarNgram LMs
0.93 27 Aug 95 included all LM95 code
0.94 13 Oct 95
* new directory structure mirroring DECIPHER layout.
* man pages added
* added support for Decipher N-best list rescoring
* added Null LM
* added new utility scripts
* bug fixes
0.95 08 Sep 96 as of WS96
* added Trellis class, disambig program
* added support for pause tokens (-pau-) in sentences
(these are ignored for sentence prob computation)
* added -tolower mapping
* added word reversal
* made Ngram model reading much faster (optimized floating point parsing)
* added template class for ngram count tries (to use either integer or
float count value)
* added optional noise tag skipping
* added SkipNgram model
* added Witten-Bell backoff
* ported to native Sun and SGI C++ compilers (see doc/c++porting-notes),
* suppress log10(0.0) warnings
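The Witten-Bell backoff added above reserves probability mass for unseen words in proportion to the number of distinct word types observed after each history. A minimal Python sketch for bigrams (illustrative only; function names are mine, not SRILM's API):

```python
from collections import Counter

def witten_bell(bigram_counts):
    """Witten-Bell discounting for bigrams: each history h reserves
    mass T(h)/(c(h)+T(h)) for backoff, where T(h) is the number of
    distinct word types seen after h and c(h) the total count."""
    hist_total = Counter()   # c(h)
    hist_types = Counter()   # T(h)
    for (h, w), c in bigram_counts.items():
        hist_total[h] += c
        hist_types[h] += 1
    probs = {(h, w): c / (hist_total[h] + hist_types[h])
             for (h, w), c in bigram_counts.items()}
    reserved = {h: hist_types[h] / (hist_total[h] + hist_types[h])
                for h in hist_total}
    return probs, reserved
```

The seen-bigram probabilities plus the reserved mass sum to one for each history, which is what makes the scheme a proper discounting method.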
0.96 05 Jun 97
* Honor -gtNmin parameter even when discounting of higher counts
is effectively disabled. (Allows building maximum likelihood LMs
smoothed only by low-count ngram elimination.)
* Ignore pauses and noise in nbest-lattice alignments (also added
-noise option).
* ngram now supports mixtures of up to 6 ngram models.
* added HiddenSNgram LM.
* warn about multiple uses of '-' file for input or output
* zio now handles incomplete reading of compressed file without error
* Fixed interaction between deletion and iterations
* Fixed handling of OOVs in cache model
* Fixed decipher N-best rescoring: we now duplicate even the
roundoff errors incurred by bytelogs. Also added -decipher flag
to ngram to allow replication of recognizer LM scores.
Also, takes into account that Decipher (incorrectly) applies WTW
even to pauses.
* Enhanced decipher-rescore script to deal with NBestList2.0 format,
with -bytelog and -nodecipherlm options.
* Added tools to convert bigram and trigram backoff LMs into
Decipher PFSG format (pfsg-from-ngram).
* Enable DecipherNgram models order higher than bigram
(ngram -decipher-order flag). Default is still bigram.
* Fixed bug that caused float command line arguments to be parsed
incorrectly on SunOS4 systems (missing declaration in system header).
0.97 30 Aug 97 as of WS97
* New programs: segment and segment-nbest (moved here from
development code).
* Made low-level NgramLM access functions public
(findProb, findBOW, insertProb, insertBOW).
* Fixed nbest-lattice to use normalized posterior word
probabilities in lattice.
* NBest, nbest-lattice: added N-best error computation.
* WordLattice, nbest-lattice: added lattice error computation.
* WordLattice: base all alignments on edit distance costs defined
in WordAlign.h.
* contextID() now also returns length of context used.
Added contextID() implementations for NullLM and BayesMix.
* Fixed contextID() for Ngram: don't truncate context if BOW = 1.
* Fixed SArray, LHash to avoid assignment operator on remove().
* Fixed add-ppls, subtract-ppls to handle -ppl -debug 2 output.
* Lots of memory management fixes.
* SArrayIter and LHashIter now work even while underlying object is
being moved (as when containing data structure is enlarged).
* Added HTK Lattice tool interface (htk/ directory).
* Made Trellis into a template class.
* Allow arbitrary n-gram orders with disambig(1).
* Added forward-backward decoding and posterior probability computation
to disambig(1).
* Added disambig -lmw and -mapw options.
* Added HMMofNGrams model (ngram -hmm option).
* VocabMap reader now warns about duplicate entries
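The N-best and lattice error computations above rest on edit-distance word alignment (the costs defined in WordAlign.h). A standard dynamic-programming sketch of that distance, with adjustable substitution/insertion/deletion costs (a simplified illustration, not SRILM's code):

```python
def word_error(ref, hyp, sub_cost=1, ins_cost=1, del_cost=1):
    """Minimum edit distance between a reference and a hypothesis
    word string, computed by dynamic programming."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * del_cost          # delete all ref words
    for j in range(1, m + 1):
        d[0][j] = j * ins_cost          # insert all hyp words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (0 if ref[i - 1] == hyp[j - 1] else sub_cost)
            d[i][j] = min(sub, d[i - 1][j] + del_cost, d[i][j - 1] + ins_cost)
    return d[n][m]
```

Backtracking through the same table yields the alignment itself, which is how per-word error labels (and later, deletion locations) are recovered.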
0.98 18 April 98
* Allow ngram to disable Decipher LM backoff hack, for rescoring
new exact lattices (ngram -decipher-nobackoff).
* N-best list vocabulary is now always expanded dynamically
(no more OOVs in N-best lists).
* Added wrapper script for nbest-lattice to compute N-best error rate
(nbest-error).
* Skip ngrams exceeding model order when reading.
* Fixed memory bug in generateSentence().
* Changed libmisc to work with Tcl version > 7.
* Compute word error correctly for empty N-best list.
* Added ngram pruning based on model perplexity change
(ngram-count -prune and ngram -prune).
* Old ngram -prune option renamed -varprune.
* New lattice word error minimization (nbest-lattice -lattice-wer).
* Fixed ngram -gen bug due to omissions in SunOS4 header files.
* merge-batch-counts removes merged source files
* Added ngram -prune-lowprobs function to do the work of
remove-lowprob-ngrams, but much faster and using less memory.
* Added support for new Decipher NBestList2.0 format.
* Added word error count and posterior probability fields to NBestHyp
structure.
* Added optional factor argument to countSentence() (convenient
to compute fractional sufficient statistics for alternative
training methods).
* Don't make special symbols (<s>, </s>, <unk>) member of SubVocab
by default.
* Ported to gcc 2.8.1.
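The perplexity-based pruning added in this release keeps an n-gram only if dropping it (and falling back to the lower-order estimate) would change the model too much. The following toy sketch conveys the intuition only; it is not SRILM's exact entropy-based criterion, and all names are mine:

```python
import math

def prune_candidates(ngram_probs, backoff_probs, ngram_freq, threshold):
    """Toy pruning rule: keep an n-gram if its frequency-weighted
    log-probability gain over the backed-off estimate meets the
    threshold; otherwise drop it and rely on the lower-order model."""
    keep = {}
    for ng, p in ngram_probs.items():
        gain = ngram_freq[ng] * (math.log(p) - math.log(backoff_probs[ng]))
        if gain >= threshold:
            keep[ng] = p
    return keep
```

The real criterion weights each n-gram by an estimate of its marginal probability and accounts for the renormalization of backoff weights, but the keep/drop trade-off is the same.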
0.99 31 July 1999
* Added hidden-ngram (word-boundary tagger).
* Removed line length limit for File object.
* Added disambig -continuous flag.
* Fixed backward computation in disambig (again).
* Generalized compute-best-mix to N > 2 models
* Added AdaptiveMix LM class
* Added nbest-mix utility (interpolation of N-best posteriors)
* Added ngram -unk flag to handle open-class LMs
* Added disambig and hidden-ngram -text-map option
* Script enhancements:
- New script to convert nbest-lattice word graphs to PFSG
(wlat-to-pfsg)
- Added switches to include probabilities in wlat-to-dot and pfsg-to-dot
output.
- Conversion to/from AT&T FSM format: fsm-to-pfsg and pfsg-to-fsm
* ngram -rescore and associated scripts no longer set a hyp
probability to zero if it contains OOVs. Instead, the probability
is computed ignoring those words (more useful in practice).
A warning is output as always.
* Added ngram-count -float-counts option.
* Added build support for Linux/i686 platform.
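The generalization of compute-best-mix to more than two models uses the standard EM update for interpolation weights: each model's weight becomes its average posterior responsibility on held-out data. A self-contained sketch of that update (illustrative; names are mine):

```python
def best_mix(prob_streams, iters=50):
    """EM for linear interpolation weights over N models.
    prob_streams[m][i] is model m's probability of held-out word i."""
    M, N = len(prob_streams), len(prob_streams[0])
    lam = [1.0 / M] * M
    for _ in range(iters):
        post_sum = [0.0] * M
        for i in range(N):
            mix = sum(lam[m] * prob_streams[m][i] for m in range(M))
            for m in range(M):
                # posterior responsibility of model m for word i
                post_sum[m] += lam[m] * prob_streams[m][i] / mix
        lam = [s / N for s in post_sum]   # M-step: average responsibility
    return lam
```

Because each iteration cannot decrease the held-out likelihood, the weights converge to a local (here, global, since the objective is concave in the weights) optimum.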
1.00 8 June 2000
* Added ClassNgram class and ngram -classes option.
* Capability to convert class ngrams into word ngrams.
* New program ngram-class for automatic word class induction.
* Fixed interaction of ngram -mix-lm -bayes with non-standard n-grams:
can now build an interpolation of the non-standard (hidden-event,
class-based, etc.) n-gram with the additional, standard n-grams.
* Replaced LM.noiseTag with LM.noiseVocab (list of noise tags to
be ignored). Tools now take -noise-vocab option (as well as -noise
for backward compatibility).
* Made ngram -counts work for non-n-gram models.
* Added nbest-lattice -posterior-{amw,lmw,wtw} options to compute
word posteriors with different weightings from the one used in
hypothesis ranking. Also added -deletion-bias flag for explicit
control of del/ins errors (-use-mesh mode only).
* NBest rescoring methods now have optional acoustic model weight
(defaulting to 1 as before).
* New class RefList (list of reference transcripts).
* New class NBestSet (set of N-Best lists).
* NBest, NBestSet, and nbest-lattice optionally split multiwords into
their components on reading (-multiwords option).
* New nbest-optimize tool for finding near-optimal score combination
weights for word error minimizing N-best rescoring.
* New anti-ngram program, for computing posterior-weighted N-gram
counts from N-best lists.
* New nbest-rover script allows ROVER-style combination of hypotheses
from multiple N-best lists.
* New rescore-decipher -norescore option, to reformat N-best lists
without LM rescoring.
* Fixed bugs related to missing <s> and </s> in change-lm-vocab and
make-ngram-pfsg.
* Significant speedups in LMs involving dynamic programming
(HiddenNgram, DFNgram, HMMofNgrams) when interpolating with other
models or running in "ngram -debug 2" mode.
* Allow absolute discounting on fractional counts, for more
effective construction of models from fractional counts.
* Added ngram-merge -float-counts option, and allow "-" (stdin) as
input file.
* ngram-count ensures <s> unigram (with prob 0) is defined to avoid
breaking other programs.
* Added make-abs-discount script to compute absolute discounting
constants from Good-Turing statistics.
* compute-sclite and compare-sclite now take -multiwords option to
split compound words prior to scoring.
* Changed option handling so that unsigned option arguments are forced
to be non-negative.
* Added Map2 (2D Map) class to libdstruct.
* Much better string hash function (borrowed from Tcl).
* New man pages: training-scripts(1), lm-scripts(1), ppl-scripts(1),
pfsg-scripts(1), nbest-scripts(1), lm-format(5), classes-format(5),
pfsg-format(5), nbest-format(5).
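The ClassNgram support added in 1.00 factors word probabilities through class labels: the n-gram model predicts the next class from the class history, and a membership distribution emits the word from its class. A minimal bigram-level sketch of the idea (assuming each word belongs to at most one class; names are mine):

```python
def class_bigram_prob(words, word2class, class_bigram, emission):
    """p(w_i | w_{i-1}) = p(class(w_i) | class(w_{i-1})) * p(w_i | class(w_i)).
    Words outside any class act as their own singleton class with
    emission probability 1."""
    p = 1.0
    prev_c = "<s>"
    for w in words:
        c = word2class.get(w, w)
        p *= class_bigram[(prev_c, c)] * emission.get((w, c), 1.0)
        prev_c = c
    return p
```

Expanding such a model into a word n-gram (as the new conversion capability does) amounts to multiplying out these factors for every word sequence the classes admit.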
1.0.1 12 July 2000
Functionality:
* wordError() and nbest-lattice -dump-errors now also output the
location of deletions in the alignment (NOTE: possible code
incompatibility).
* New reverse-ngram-counts script.
Bug fixes:
* Workarounds for shortcomings in Linux gcc, math library, and linker.
* make-ngram-pfsg: don't ignore bigram states with zero BOW (bugfix).
* nbest-rover: fixed problem with handling of + lines.
1.1 21 May 2001
Functionality:
* HiddenNgram class generalized to deal with disfluency-type events
that manipulate the N-gram context.
* rescore-reweight script now accepts additional score directories
(and associated score weights) for combination of an arbitrary number
of knowledge sources.
* Enhanced rescore-decipher functionality:
- Option -lm-only to produce output containing LM scores only
- Option -pretty to perform word mapping on the fly.
- Warn about and handle LM scores that are NaN.
* New class VocabMultiMap, implementing dictionary-style mappings of
words to strings from another vocabulary.
* Added support for pronunciation-based word alignments in
WordMesh and nbest-lattice -use-mesh.
* Added nbest-lattice -keep-noise option to preserve pauses and noises
in alignments.
* Support for multiwords:
- make-multiword-pfsg expands PFSGs to use multiwords
(using AT&T FSM tools).
- multi-ngram expands N-gram LM to include multiwords.
* Added support for Decipher Intlog scaled log probabilities.
* Added ngram -seed option to initialize random sentence generation
(contributed by Eric Fosler).
* New add-pauses-to-pfsg pause= and version= options to allow
generation of Nuance-compatible PFSGs (see man page for details).
* The NBest class and scripts handle NBestList2.0 format containing
phone and/or state backtraces (by ignoring them).
* Added Amoeba search option to nbest-optimize (contributed by
Dimitra Vergyri).
* Added standard 1-best optimization mode to nbest-optimize.
* wlat-to-pfsg script now also processes confusion networks output by
nbest-lattice -use-mesh.
Bug fixes:
* ngram -decipher-nobackoff now applies to the -lm ngram as well if
option -decipher is also specified.
* ngram -expand-classes no longer dumps core when handling
"context-free" class expansions (though those aren't supported).
* gawk path in scripts is now adjusted prior to installation
(/usr/bin/gawk for Linux, /usr/local/bin/gawk elsewhere).
* Fixed numerical problems in nbest-rover/nbest-posteriors.
* ngram-count -float-counts behaved differently from equivalent
integer-count estimation; both integer and float counts now use
the same estimation code.
* Reduced memory requirements of nbest-optimize by about 25%.
* Minor changes for gcc-2.95.3.
1.1.1 20 July 2001
Functionality:
* WordMesh: new interface to record reference word string in alignment.
* nbest-lattice: confusion networks can now record reference words
if specified with -reference, and are preserved by -write/-read.
* replace-words-with-classes now has option to process ngram count
files (have_counts=1).
* merge-nbest: new utility to merge N-best hyps from multiple lists.
* wlat-stats: new utility to compute statistics of word posterior
lattices.
Bug fixes:
* GT discounting: fixed anomaly due to different floating point
precision on x86 platforms.
* anti-ngram(1): documented options previously omitted.
* WordMesh: reading/writing of confusion networks now preserves
total posterior mass.
* Changed the hypothesis alignment order in nbest-optimize to be
more compatible with decoding in nbest-lattice: first align nbest
hyps in order of decreasing (initial) scores, then align reference.
nbest-optimize -no-reorder keeps the old behavior (with references
anchoring the alignment). All scores and initial lambdas are now
used to compute initial posterior hyp probabilities to guide the
hypothesis alignment; thus, it now makes sense to restart an
optimization with partially optimized weights to revise the
alignments.
* nbest-optimize now warns about missing or incomplete score files.
* Fixed a memory access error in nbest-optimize -1best.
* Fixed weight normalization in nbest-optimize when first element is 0.
* Miscellaneous fixes for compile under RH Linux 7.0.
1.2 20 November 2001
Functionality:
* nbest-lattice -dictionary allows word alignments to be guided by
dictionary pronunciations.
* nbest-lattice -use-mesh -record-hyps records the rank of N-best hyps
contributing to each word hypothesis in the confusion network.
* nbest-lattice -no-rescore and -decipher-format options make it
more convenient as an N-best format conversion tool.
* VocabDistance: new class and subclasses to represent distance metrics
(e.g., phonetic distance) over vocabularies.
* WordMesh: output word hyps in order of decreasing posteriors.
* WordMesh: reading/writing of confusion networks now includes hyp IDs
from alignment.
* NBest/MultiAlign/WordMesh: support for keeping extra word-level
information (NBestWordInfo).
* nbest-lattice: unified single and multiple file processing.
New option -write-dir to write multiple output lattices.
New option -refs to supply multiple references.
Options -nbest-errors and -lattice-errors are replaced by
switches -nbest-error/-lattice-error, in conjunction with
-references/-refs. Outputs are now prefixed by utterance IDs
when processing multiple files.
* nbest-lattice -nbest-backtrace enables processing of backtrace
information from N-best lists; combined with -use-mesh this produces
sausages that contain word-level scores and alignment information,
as well as phone backtraces (see new wlat-format(5) man page).
* wlat-stats script now also computes error statistics when processing
confusion networks with references.
* nbest-rover now handles N-best lists in Decipher format.
* hidden-ngram and disambig: new option -fw-only to use only forward
probabilities for posterior computation.
* rescore-decipher -filter option to apply textual rewriting filters
to hypotheses before rescoring.
* segment-nbest -write-nbest-dir option for dumping rescored N-best
lists to a directory instead of to stdout.
* segment-nbest -start-tag and -end-tag options to insert tags at
margins of N-best hyps.
Bug fixes:
* WordMesh: computation of deletion costs using a dictionary distance
was completely bogus (only affected undocumented nbest-lattice
-dictionary option).
* nbest-lattice: correctly process -nbest-files using -dictionary in
alignment.
* nbest-rover: fixed to work on Linux
* hidden-ngram: don't abort when an event posterior is 0.
* hidden-ngram: avoid abort when *noevent* occurs in -hidden-vocab list.
* segment-nbest: now correctly uses ngram contexts longer than trigram.
* segment-nbest: optimized -bias 0 case by disallowing sentence
boundary states altogether.
* multi-ngram -prune-unseen-ngrams prevents insertion of multiword
N-grams whose component N-grams were not in the original model.
* ngram: fixed computation of mixture lambda for second LM when three
or more models are interpolated.
* nbest-posterior (and thus nbest-rover) no longer split multiwords by
themselves. To split multiwords with nbest-rover, append the
-multiwords option to the argument list, which is passed on to
nbest-lattice to achieve the desired effect.
* ngram -renorm now applies BEFORE class expansion or pruning of
model (in case input model is unnormalized).
* make-nbest-pfsg bug involving transition into final node fixed.
* Minor script changes to avoid warnings with gawk 3.1.0.
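The posterior machinery used throughout this release (nbest-posteriors, hidden-ngram -fw-only, sausage construction) combines each hypothesis's acoustic, language-model, and word-count scores with tunable weights and normalizes in the log domain. A hedged sketch (scores assumed to be log10; the weight values are illustrative, not SRILM defaults):

```python
def nbest_posteriors(hyps, amw=1.0, lmw=8.0, wtw=0.0):
    """Posterior of each N-best hypothesis from its (AM, LM, #words)
    scores. Normalization factors out the maximum combined score
    first, to avoid underflow when exponentiating log10 values."""
    totals = [amw * am + lmw * lm + wtw * nw for (am, lm, nw) in hyps]
    mx = max(totals)
    exps = [10 ** (t - mx) for t in totals]
    z = sum(exps)
    return [e / z for e in exps]
```

Options like -posterior-lmw simply supply a different weight vector here than the one used for ranking, which is why the two can disagree.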
1.3 11 February 2002
Functionality:
* Trellis class, disambig and hidden-ngram tools: added support for
N-best decoding (contributed by Anand Venkataraman).
* MultiwordLM wrapper LM class as a convenient way to split multiwords
prior to LM evaluation.
* New MultiwordVocab class to support MultiwordLM.
* Added ngram -multiwords option (based on MultiwordLM wrapper).
* Added support for Chen & Goodman's Modified Kneser-Ney smoothing
and interpolated backoff estimates. See ngram-count options
-kndiscount[1-6], -kn[1-6], and -interpolate[1-6].
* New library and tool for lattice manipulation: lattice-tool.
* New nbest-mix -set-am-scores and -set-lm-scores options. These allow
setting either the AM or the LM scores in the N-best output to simulate
the combined posteriors, while preserving the other scores.
* Added some regression tests (test/ subdirectory).
* Support for Windows via CYGWIN porting layer (MACHINE_TYPE=cygwin).
See doc/README.windows for details.
Bug fixes:
* Trellis: deallocate old trellis nodes on demand in init(), rather
than preemptively in clear(). Greatly speeds up forward computation
for trellis-based LMs (e.g., ClassNgram).
* Textstats: fix to handle zero denominator in ppl computation.
* disambig: fixed off-by-one error indexing into trellis.
* Miscellaneous small fixes for compilation and operation under Windows
(using the CYGWIN environment).
Warning: See doc/README.x86 about a gcc compiler bug that might
affect you on Intel platforms.
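The modified Kneser-Ney smoothing added in 1.3 uses three discount constants per n-gram order, estimated from counts-of-counts as in Chen & Goodman. A sketch of that estimation step (the formulas are Chen & Goodman's; the function wrapper is mine):

```python
def modified_kn_discounts(n1, n2, n3, n4):
    """Modified Kneser-Ney discount constants from counts-of-counts
    n1..n4 (the number of n-grams occurring exactly 1..4 times).
    D1 applies to count-1 n-grams, D2 to count-2, D3plus to count>=3."""
    Y = n1 / (n1 + 2.0 * n2)
    D1 = 1 - 2.0 * Y * n2 / n1
    D2 = 2 - 3.0 * Y * n3 / n2
    D3plus = 3 - 4.0 * Y * n4 / n3
    return D1, D2, D3plus
```

This is also why make-big-lm needs the counts-of-counts (meta-counts) preserved when KN discounting is requested: the constants cannot be recovered from the discounted counts alone.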
1.3.1 25 June 2002
Functionality:
* nbest-optimize -write-rover-control option conveniently dumps a
control file for nbest-rover that encodes the optimized parameters.
* New regression tests for nbest-rover (i.e., nbest-lattice) and
nbest-optimize.
* nbest-posteriors, combine-acoustic-scores now all handle and
preserve Decipher N-best formats. This allows nbest-rover to
generate sausages with backtrace information if input N-best lists
contain it (using -nbest-backtrace option).
* New tool nbest-pron-score for computing pronunciation and pause LM
scores from N-best hypotheses.
* Added disambig -totals option to compute total string probabilities
(same as in hidden-ngram).
* reverse-lm: simple filter to reverse a bigram backoff LM.
* lattice-tool -collapse-same-words reduces lattices by merging all
nodes with identical words (but also creates new paths in lattice).
* nbest-lattice -prime-with-refs option uses reference strings
to improve sausage alignment.
* compute-best-sentence-mix: new script to optimize sentence-level
interpolation of LMs.
* nbest-lattice -lattice-files option to align multiple word lattices;
currently only works with -use-mesh (sausages).
* hidden-ngram now supports mixture and class N-gram LMs.
* New class SimpleClassNgram, a more efficient implementation of
ClassNgram's where each word is assumed to belong to at most one
class and class expansions are exactly one word long.
Enabled by -simple-classes switch in ngram, lattice-tool, and
hidden-ngram.
* ngram -counts now handles escaped input lines and LM state change
directives embedded in the input.
* NgramStats::parseNgram() new function to parse N-gram counts from
a character string.
* LM::pplCountsFile() new function to evaluate LM on counts read from
a file.
Bug fixes:
* make-ngram-pfsg is no longer limited to trigram models.
* Avoid NaN values in disambig and hidden-ngram, in cases where lmw or
mapw are zero and the corresponding log probabilities are -Infinity.
* Avoid numerical problems in N-best posterior computation by using
AddLogP() to compute normalizer.
* anti-ngram no longer requires -refs argument with -all-ngrams.
* Fixed bug removing noise from N-best lists with backtrace.
* Code fixes for clean compiles with gcc 3.x.
* nbest-rover more efficient by using a single invocation of
nbest-lattice for all input N-best lists.
* ClassNgram: fixed handling of words that appear as members of a class
with zero probability, or have zero membership probability.
* nbest-lattice -record-hyps now outputs hyp ids according to the
original N-best order, rather than the sorted one.
* make-hiddens-lm now gives proper unigram probability to hidden-S tag.
* Compute acoustic scores in Decipher N-best-2 format by subtracting
token LM scores from total score. This deals correctly with cases where
the total scores have been adjusted by summing merged hyps, and are no
longer the sum of all AC and LM word scores.
* Gawk scripts that test for alphabetic or lowercase characters are
more portable and handle non-ascii and multibyte characters.
The package now includes a paper on SRILM, to appear in ICSLP-2002,
that gives an overview of the software and its design (doc/paper.ps).
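The AddLogP fix above addresses a classic pitfall: summing probabilities that are stored as logs. Exponentiating and adding underflows for very small values, so the larger term is factored out first. A sketch in the log10 domain (matching the convention used for LM scores; the function name mirrors SRILM's but the code is illustrative):

```python
import math

def add_log_p(x, y):
    """log10(10**x + 10**y), computed without leaving the log domain:
    factor out the larger term so the exponentiation stays in range."""
    if x < y:
        x, y = y, x
    if y == float("-inf"):          # adding probability zero
        return x
    return x + math.log10(1 + 10 ** (y - x))
```

Using this to accumulate the normalizer over an N-best list keeps posterior computation stable even when individual hypothesis scores differ by hundreds of orders of magnitude.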
1.3.2 3 September 2002
New functionality:
* Added ngram and ngram-count -nonevents option to specify a
subset of words that are to be non-events, i.e., tokens that can only
occur in contexts (such as <s>).
* Extended ngram-count discounting options for up to 9-grams.
* Added support in Vocab and Ngram classes for processing meta-counts
(counts-of-counts).
* Added ngram-count -meta-tag and -kn-counts-modified options to
support make-big-lm.
* Added ngram-count -read-with-mincounts flag to suppress counts
below cutoff thresholds at reading time. This dramatically lowers
memory consumption, and speeds up make-big-lm operation (which used
to use a gawk script for the same purpose).
* Added option to specify vocabulary to add-pauses-to-pfsg for cases
where heuristics fail.
* lattice-tool can now handle arbitrary order LMs for expanding
lattices. The old trigram expansion algorithm is still available
with -old-expansion; the compact trigram algorithm is unchanged with
-compact-expansion.
* To better support lattice expansion, two new functions have been
added to the LM interface: contextID() takes an optional word
argument, to compute the context needed to predict a specific word,
and contextBOW() is a new interface to compute the backoff weight
associated with truncating a history.
* Added makefile support to generate executable versions that use
"compact" data structures. See item 9 in INSTALL for details, and
doc/time-space-tradeoff for a simple benchmark result.
Bug fixes:
* Convert pseudo-log(0) value (-99) in DARPA backoff models back to
true log(0) on reading. This ensures that non-event words in the
input are treated as zeroprobs (by the perplexity computation and
otherwise).
* Avoid NaN floating point results in N-best rescoring and
nbest-optimize, by handling 0 * log(0) more carefully.
* Handle -Inf AM and LM scores in SRILM N-best format.
* make-big-lm was reworked to support KN in addition to GT discounting.
Warning: the modified lower-order counts for KN are created using
merge-batch-counts and can get almost as big as the original counts.
Beware of the additional disk space and run time requirement!
* Clear out old parameters before reading or estimating N-gram models.
* Reading in new class definitions into ClassNgram object now deletes
old definitions (unless classes file is empty).
* Destructors for Ngram and ClassNgram now free N-gram and class
definition memory.
* nbest-pron-score: avoid core dump when pronunciation information is
missing from N-best list.
* make-ngram-pfsg: fixed generation of unigram PFSGs.
* Avoid use of toupper() in add-pauses-to-pfsg.
* Handle ngram-count -order 0 and print warning.
* Avoid using zcat in scripts since it behaves differently on different
systems and depending on PATH setting.
* nbest-lattice and nbest-optimize no longer strip a filename part
following '.' to derive utterance ids; only known file suffixes
are removed.
* Fixed bugs in member declarations that were preventing TaggedVocab,
TaggedNgramStats, and StopNgramStats from working correctly.
* compute-sclite now ignores utterances with a reference of
"ignore_time_segment_in_scoring", consistent with NIST STM scoring.
* Vocab.h now defines SArray_compareKey() for strings over VocabIndex,
allowing use as keys in sorted arrays.
* ClassNgram now uses the processed words as the context after an OOV.
This works better when the input contains context cue tags.
* i386-solaris platform was not being detected by machine-type script.
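The contextBOW() interface added in 1.3.2 computes the backoff weight attached to truncating a history. In a backoff n-gram that weight renormalizes the lower-order distribution over the words not explicitly listed for the history. A minimal sketch of the standard formula (illustrative; names are mine):

```python
def backoff_weight(higher_probs, lower_probs):
    """Backoff weight alpha(h) for a history h:
    higher_probs: explicit p(w|h) for each word w listed under h
    lower_probs:  the lower-order p(w|h') for those same words.
    The leftover mass under h is spread over the unseen words in
    proportion to their lower-order probabilities."""
    num = 1.0 - sum(higher_probs)   # mass not claimed by explicit n-grams
    den = 1.0 - sum(lower_probs)    # lower-order mass of the unseen words
    return num / den
```

When the denominator approaches zero (all lower-order mass already covered by explicit n-grams), normalization has to be handled specially, which is the situation behind the later ngram-count renormalization fix.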
1.3.3 2 March 2003
New functionality:
* Increased maximum number of interpolated LMs in ngram, hidden-ngram,
and lattice-tool to 10.
* ngram now computes static interpolation (N-gram merging) of up to 10
input LMs (consistent with handling of dynamic interpolation).
* ngram and lattice-tool -limit-vocab option limits LM reading to
those parameters that pertain to words specified by -vocab.
The LM:read() function got an optional second argument for this
purpose.
ngram -limit-vocab -renorm now effectively does the same as the
change-lm-vocab script. However, the main purpose of -limit-vocab
is to save memory by discarding N-grams that are not relevant to a
test set.
* rescore-decipher -limit-vocab precomputes the vocabulary used by
N-best lists and invokes ngram -limit-vocab to allow rescoring with
very large models on machines with little memory.
* Ngram::mixProbs() now has version that destructively merges an Ngram
into an existing model. ngram -mix-lm now uses this version, instead
of the old, non-destructive one, thereby achieving considerable time
and space savings (only two models, rather than 3, have to be kept in
memory at a time).
* ngram-count and ngram -map-unk option, to change the "unknown" word
token string.
* compute-sclite, compare-sclite now understand multiple -S options to
specify intersections of several utterance subsets for scoring.
* make-batch-counts now ignores lines in input file list that start
with # (allowing comments in the file list).
* Added replace-words-with-classes partial=1 option to prevent
multi-word replacements that include multiple whitespace characters
(i.e., "a b" is only replaced with a single space between the words).
* New LM script: sort-lm, reorders N-grams lexicographically, as
required by some other software (e.g., Sphinx3, pointed out by
Mikko Kurimo <[email protected]>).
* New training script: reverse-text, reverses word order in text file.
* New pfsg script: pfsg-vocab, extracts vocabulary used in PFSGs.
Bug fixes:
* disambig and hidden-ngram -keep-unk now also causes LM to be
treated as open-vocabulary.
* HiddenNgram class (debug level 2) was omitting the event after
the last word from the Viterbi backtrace.
* ngram -expand-classes was including -pau- word in expanded LM.
* Made backoff computation in Ngram:wordProbBO() more efficient,
avoiding multiple lookups in the context trie. Gives about a 30%
speedup in ngram -debug 3 -ppl.
* ngram -lm reading is faster by about 8% due to a code optimization.
* ngram-count -order 2 -kndiscount3 no longer aborts with an error.
The -order option effectively limits the discounting parameters
computed, so that the model order can be changed without having to
adjust the smoothing options.
* make-big-lm -trust-totals option is ignored with KN discounting,
since they don't work well together.
* make-big-lm now checks that input counts files are not stdin.
* Reading N-best lists in Decipher format now sets the number-of-words
score, so that weight rescoring, optimization etc. can use them.
* ngram-count normalizes the N-gram probabilities for a context to 1
if the backoff distribution for that context has probability mass 0.
The latter can happen e.g. if all N-grams for a context have been
observed and received discounted probabilities. The fix ensures that
the overall distribution is normalized in this case.
* rescore-reweight now accepts Decipher N-best lists.
* nbest-posteriors and nbest-rover now handle Decipher version 2
N-best lists better (allowing LM and WT weights to be applied).
* Initialize locale in all top-level programs. disambig, hidden-ngram,
segment, and segment-nbest were missing it, causing potential problems
with non-ASCII characters.
* nbest-lattice -write-vocab option to find vocabulary used in N-best
list.
* nbest-pron-score now uses idFromFilename() function to avoid
over-truncating filenames when inferring sentence ids.
* Added more strippable filename suffixes in idFromFilename() function.
* NBest: correctly read in phone backtraces that are time-reversed.
* compute-oov-rate ignores -pau- tokens.
* Various N-best scripts now process input directories containing links
(rather than plain files) correctly.
* Lattice class takes care to limit range of intlog transition
probabilities in PFSG output, so as to avoid overflow when converting
to bytelog scale.
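The range limiting involved can be illustrated roughly as follows. The scale factors assumed here (10000 for intlog, 1024/10000 for bytelog) and the clamping bound are illustrative assumptions, not necessarily SRILM's exact constants; the point is that the intlog value is clamped so the intermediate product in the integer bytelog conversion cannot overflow:

```python
import math

INT_MAX = 2**31 - 1

def prob_to_intlog(p, scale=10000):
    """Hypothetical intlog: natural log scaled to an integer (assumed scale)."""
    if p <= 0.0:
        raw = -INT_MAX
    else:
        raw = int(round(math.log(p) * scale))
    # Clamp so that the bytelog conversion below (raw * 1024 // 10000,
    # done in 32-bit integer arithmetic in C) cannot overflow the
    # intermediate product raw * 1024.
    limit = INT_MAX // 1024
    return max(-limit, min(limit, raw))

def intlog_to_bytelog(intlog):
    # Safe because intlog was range-limited above.
    return intlog * 1024 // 10000
```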
* make-ngram-pfsg removes temporary file (now placed in /tmp) even
when killed by signal.
* Hidden-event and DF N-gram models are documented in detail in ngram
man page.
* Test suite result comparisons against reference output now use a
script that ignores small numerical discrepancies, so as to produce
fewer false alarms.
Portability:
* Compiles under MacOS X (MACHINE_TYPE=macosx), thanks to help from
1.4 14 February 2004
New functionality:
* Added support for factored language models, developed by Katrin
Kirchhoff and Jeff Bilmes, and implemented by Jeff Bilmes.
A new library, libflm.a, and two new tools, fngram-count and fngram
are built in the flm/ directory. A conference paper and a technical
report are included as documentation in flm/doc/. Questions and bug
reports should be directed to [email protected].
FLM support has also been integrated into some of the standard
tools (ngram and hidden-ngram) and is enabled by the -factored option.
* Added support in lattice-tool to read/write and rescore HTK lattices.
See lattice-tool man page for details.
* The lattice expansion algorithm for general LMs now preserves
pause and null nodes. Consequently, lattice-tool no longer eliminates
pause and null nodes prior to applying this algorithm, unless
-no-pause or -compact-pause was specified.
* Implemented a new algorithm to build word meshes (confusion networks,
sausages) from lattices, that is faster than the original Mangu et al.
method. lattice-tool -posterior-decode uses this to extract 1-best
word hypotheses, and lattice-tool -write-mesh allows writing of
sausages to file.
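Conceptually, -posterior-decode then reads the 1-best hypothesis off the sausage by taking the highest-posterior word in each confusion set. A minimal sketch (the deletion pseudo-word name "*DELETE*" is an assumption here, not necessarily SRILM's token):

```python
def posterior_decode(mesh):
    """Pick the highest-posterior word in each confusion set (sausage slot).

    mesh: list of dicts, each mapping word -> posterior; the pseudo-word
    "*DELETE*" (name assumed here) stands for a deletion in that slot.
    """
    hyp = []
    for slot in mesh:
        best = max(slot, key=slot.get)
        if best != "*DELETE*":
            hyp.append(best)
    return hyp
```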
* The "compact" lattice expansion algorithm that uses backoff nodes
(described in Weng et al. 1998) has been generalized to handle
LMs of arbitrary order. As before, this algorithm is triggered by
lattice-tool -compact-expansion. (To get the old version, which
handles only trigrams and produces non-identical results, use
lattice-tool -compact-expansion -old-expansion.)
* lattice-tool -density allows pruning of lattices to a specified
density (in addition to the posterior threshold).
* lattice-tool -multi-char option allows designating characters other
than underscore as multiword delimiters.
* Added a "LatticeLM" class that emulates a language model using the
transition probabilities in a lattice. This is useful for debugging
and comparing the probabilities assigned by lattices to corresponding
LM probabilities. A new option lattice-tool -ppl makes use of this
class (analogous to ngram -ppl).
* lattice-tool lattice algebra operations (or, concatenate) can now
be applied to multiple input lattices, always using the same lattice
as second operand.
* ngram has enhanced N-best rescoring functionality, allowing
multiple input lists to be rescored (-nbest-files, -write-nbest-dir,
-decipher-nbest, -no-reorder, -split-multiwords).
* rescore-decipher -fast enables a faster rescoring mode that uses
only the built-in functions of ngram, thus running much faster.
* New option ngram -rescore-ngram to recompute the probabilities in
an N-gram model using an arbitrary other LM.
* Added original (unmodified) Kneser-Ney discounting (ngram-count
-ukndiscountN options). Contributed by Jeff Bilmes.
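Original (unmodified) Kneser-Ney uses a single absolute discount per N-gram order, estimated from the counts-of-counts. A sketch of the standard estimator D = n1 / (n1 + 2*n2), with a hypothetical helper name:

```python
def kn_discount(counts_of_counts):
    """Original Kneser-Ney absolute discount D = n1 / (n1 + 2*n2),
    where n1 and n2 are the numbers of N-grams occurring exactly
    once and twice.  Illustrative sketch, not SRILM's actual code."""
    n1 = counts_of_counts.get(1, 0)
    n2 = counts_of_counts.get(2, 0)
    if n1 == 0:
        raise ValueError("no singleton counts; cannot estimate discount")
    return n1 / (n1 + 2.0 * n2)
```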
* New disambig -classes option to read vocabulary maps in
classes-format(5).
* New disambig -write-counts option to output word/class substitution
bigram counts (useful to reestimate class membership probabilities).
* nbest-pron-score -pause-score-weight creates weighted combination
of pronunciation and pause LM scores.
* compute-sclite -noperiods option to delete periods from hyps
for scoring purposes.
* New script empty-sentence-lm to modify existing LM to allow
the empty sentence with a given probability.
* compute-sclite handles CTM files in RT-03 format.
* ngram-class -debug 2 prints the initial word-to-class assignments,
so that the entire class tree can be reconstructed from the output.
* RefList class has option to read and look up reference words without
associated ID strings (indexed by integers).
* Enhanced WordMesh and WordLattice classes to have an optional
"name" field, used to record utterance ids.
* New select-vocab command to implement likelihood-optimizing
vocabulary selection from multiple corpora. Contributed by
Anand Venkataraman and Wen Wang. See man page for details.
Bug fixes:
* ngram avoids reading classes file multiple times if -limit-vocab
is not being used (otherwise it is unavoidable, and will lead to
errors if the reading is from stdin).
* Fixed some bugs in compare-sclite and compute-sclite.
* Modified ngram and compute-best-mix so that the latter works
with ngram -counts output. ngram -counts now outputs the count
values != 1 for each N-gram so that compute-best-mix can take them
into account in the optimization.
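The interaction can be sketched as follows: compute-best-mix runs EM over per-N-gram model probabilities, and the count values emitted by ngram -counts weight each term's contribution. This is an assumed, minimal version of that EM loop, not the actual script:

```python
def best_mix(prob_streams, counts, iters=100):
    """EM for interpolation weights of several LMs (illustrative sketch).

    prob_streams: list of lists; prob_streams[i][t] is model i's probability
    for the t-th N-gram.  counts[t] weights each N-gram (as in ngram -counts
    output, where count values != 1 must be taken into account).
    """
    k = len(prob_streams)
    lambdas = [1.0 / k] * k
    for _ in range(iters):
        post_sums = [0.0] * k
        total = 0.0
        for t, c in enumerate(counts):
            mix = sum(lambdas[i] * prob_streams[i][t] for i in range(k))
            for i in range(k):
                # expected posterior count of model i for this N-gram
                post_sums[i] += c * lambdas[i] * prob_streams[i][t] / mix
            total += c
        lambdas = [s / total for s in post_sums]   # M-step: renormalize
    return lambdas
```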
* rescore-reweight and nbest-rover were not handling Decipher N-best
lists correctly when additional score directories are given.
* nbest-rover -wer disables use of nbest-lattice -use-mesh option,
so nbest-rover can be used for old-style word error minimization
(or even 1-best rescoring, by also specifying -max-rescore 1).
* lattice-tool -ref-file and -ref-list were being ignored when
processing only a single input lattice. Fixed so that lattice error
can now be computed with either -input-lattice or -input-lattice-list.
* Enhanced MultiwordLM class with new contextID() and contextBOW()
versions that better reflect the backoff behavior of the wrapped LM
class. Makes it much more efficient to use the lattice-tool -multiword
option, i.e., expand a multiword lattice with a non-multiword LM.
* rescore-decipher -pretty had a bug that caused mapping to be applied
to the score fields as well, potentially corrupting the format.
* Fixed bugs in mixture lambda computation (ngram, hidden-ngram,
lattice-tool), triggered by more than one lambda being zero, or using
more than 5 mixtures.
* lattice-tool algebra operations used to crash if operand lattices
contained NULL nodes.
* Non-compressed files ending in .gz can now be read successfully.
* Catch a possible 0/0 problem in the Good-Turing discount estimator.
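For reference, the textbook Good-Turing discount factor is d_r = (r+1) n_{r+1} / (r n_r), which degenerates to 0/0 when n_r is zero. A sketch of a guarded version; the fallback of leaving the count undiscounted is an assumption here, not necessarily what SRILM does:

```python
def gt_discount(r, counts_of_counts):
    """Good-Turing discount factor d_r: a count r is discounted to
    r' = d_r * r with d_r = (r+1) * n_{r+1} / (r * n_r).

    Guards the 0/0 case (n_r == 0) by leaving the count undiscounted
    (d_r = 1), a conservative fallback assumed for illustration.
    """
    n_r = counts_of_counts.get(r, 0)
    n_r1 = counts_of_counts.get(r + 1, 0)
    if n_r == 0:
        return 1.0
    return (r + 1) * n_r1 / (r * n_r)
```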
* Fixed memory management for strings returned by TaggedVocab::getWord()
thereby avoiding garbled results.
* lattice-tool -pre-reduce-iterate and -post-reduce-iterate arguments
were not being used to control the number of lattice reduction iterations.
* Fixed an uninitialized memory bug that could produce random results
in posterior probability computation (and hence in lattice pruning).
* Fixed a bug in lattice pruning triggered by unnormalized posteriors
greater than 1.
Portability:
* Fixed some problems compiling with gcc-3.2.2; eliminated compile-
time warnings about division by zero in constant definitions.
* Rewrote some code to work around limitations and warnings in the
Intel C++ compiler. (In return, got compiled code that runs 10-20%
faster!) For processor-specific optimizations, use
make MACHINE_TYPE=i686-p4 .
* Fixed some script problems that surfaced in latest gawk version.
* Fixed some problems compiling with Tcl/Tk-8.4.1.
* FreeBSD support (contributed by Zhang Le <[email protected]>).
* Updated Nuance-related features in PFSG scripts and man page.
* Note: Integration of FLM support required some changes to the
Vocab and Ngram class interface. In particular, several member
variables (e.g., Boolean Vocab::unkIndex) have been replaced by virtual
member functions that return references to the variables (e.g.,
Boolean &Vocab::unkIndex()). This requires changes, albeit trivial
ones, to any client code that accesses these variables.
1.4.1 9 May 2004
Functionality:
* New option lattice-tool -htk-quotes to enable the HTK quoting
mechanism that allows whitespace and non-printable characters to be
used in word labels. (This is disabled by default since other SRILM
tools don't allow such word strings.)
* New option lattice-tool -add-refs to add a path corresponding to
the reference word string to each lattice.
* New option ngram -counts-entropy to compute entropy (log probabilties
weighted by joint N-gram probability) from counts.
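The computation described can be sketched as: each conditional log probability is weighted by the N-gram's empirical joint probability (its count over the total). A minimal illustration, not the tool's actual implementation:

```python
def counts_entropy(counts, logprobs):
    """Entropy over observed N-grams: each log probability (log base 2
    assumed here) is weighted by the N-gram's joint empirical probability
    count/total.  Illustrative helper, not ngram's actual code."""
    total = sum(counts.values())
    return -sum((c / total) * logprobs[ng] for ng, c in counts.items())
```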
Bugs fixed:
* nbest-lattice could core dump if references were not supplied.
* FLM/ProductVocab: fixed problems with mapping of <s> and </s> to
factored form.
* Lattice algebra operations (or, concatenate) now preserve HTK link
information and lattice names.
* Fixed LM::contextProb() handling of <s> and other non-event tokens.
This also allowed Ngram::computeContextProb() to be eliminated.
* LatticeFollowIter iterator no longer takes lookahead parameter --
lookahead is unlimited and cycles are avoided by keeping a table of
visited nodes. This also greatly speeds up lattice expansion in
some cases.
* Detect negative discounts in modified Kneser-Ney method, arising
from non-monotonic counts-of-counts.
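The modified Kneser-Ney discounts (Chen & Goodman) are D1 = 1 - 2Y*n2/n1, D2 = 2 - 3Y*n3/n2, D3+ = 3 - 4Y*n4/n3 with Y = n1/(n1 + 2*n2); when counts-of-counts are non-monotonic (e.g., n3 much larger than n2), D2 or D3+ can go negative. A sketch of the check, with a hypothetical helper name:

```python
def mod_kn_discounts(n1, n2, n3, n4):
    """Modified Kneser-Ney discounts D1, D2, D3+ (Chen & Goodman).
    Non-monotonic counts-of-counts can make these negative, which is
    the error condition the tools now detect.  Illustrative sketch."""
    Y = n1 / (n1 + 2.0 * n2)
    d = (1 - 2 * Y * n2 / n1,
         2 - 3 * Y * n3 / n2,
         3 - 4 * Y * n4 / n3)
    if any(x < 0 for x in d):
        raise ValueError("negative discount: counts-of-counts non-monotonic")
    return d
```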
* Fixed various debugging output messages in the Lattice class.
Portability:
* Matthias Thomae <[email protected]> found that make-ngram-pfsg
(and probably other gawk scripts) may not work correctly with recent
versions of gawk unless the environment is set to LC_NUMERIC=C.
1.4.2 19 October 2004
Functionality:
* lattice-tool -factored option to handle factored LMs (analogous
to ngram and hidden-ngram).
* lattice-tool -nbest-decode generates N-best lists from lattices
(contributed by Dustin Hillard, University of Washington).
* lattice-tool -output-ctm option to generate CTM-formatted 1-best
output, either with -viterbi-decode or with -posterior-decode.
Of course this requires HTK input lattices containing timemarks.
* Added version of WordMesh::minimizeWordError() that returns acoustic
information in a NBestWordInfo array, to support the above.
* lattice-tool -insert-pause option to insert optional pause nodes in
lattices.
* lattice-tool -unk will map unknown words to <unk> instead of
automatically augmenting the vocabulary (the -map-unk option allows
the mapping of unknown words to be customized).
* lattice-tool -acoustic-mesh records word times, scores, and phone
alignments when confusion networks are built.
* lattice-tool -ignore-vocab option to define the set of words that
are ignored in LM processing (like pause nodes).
* lattice-tool -write-ngrams option to compute expected N-gram counts
from lattices.
* HTK lattices now support up to three "extra" score fields (x1..x3),
which can be used to rescore hypotheses with arbitrary non-standard
knowledge sources.
* Added support for the "s" key in HTK lattices (used to encode
state alignment info).
* anti-ngram -min-count option to prune N-grams with expected frequency
below specified threshold.
* ngram -adapt-marginals and related options to trigger use of
unigram marginals adaptation, following Kneser et al. (Eurospeech 97).
* New LM class AdaptMarginals to support the above.
* nbest-lattice and lattice-tool -hidden-vocab option allows specifying
a subvocabulary that should not be aligned with regular words when
building confusion networks.
* New VocabDistance subclass SubvocabDistance, to support the above.
* nbest-optimize -combine-linear and -non-negative options, useful to
optimize linear combinations of posterior probability scores.
Bugs fixed:
* lattice-tool: Avoid disconnecting lattice in density pruning.
* Utility script installation was not working for Cygwin hosts.
* ProductNgram::contextID() now returns hash code of context used,
instead of zero, and limits context-used length to order-1.
* HTK lattice output was omitting wdpenalty value.
* Improved collision-prone hash function for VocabIndex arrays.
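As an illustration of the kind of fix involved (the multiplier and mixing here are assumptions, not SRILM's actual function): a multiplicative hash over index arrays distinguishes permutations and shared prefixes, which a naive additive hash would conflate:

```python
def index_array_hash(indices, table_size):
    """Multiplicative hash over an array of word indices.  Mixing each
    element in order makes permutations and common prefixes collide far
    less than a plain sum would.  Illustrative constants only."""
    h = 0
    for i in indices:
        h = (h * 31 + i) & 0xFFFFFFFF   # 31: a common odd multiplier
    return h % table_size
```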
* Documented order of operations in lattice-tool(1).
* Fixed excessive /tmp space usage in nbest-rover script, so as to
avoid frequent incomplete output with large N-best data as a result
of running out of disk space.
* Fixed bug in compute-sclite that would garble STM references without
the optional 6th field.
* Fixed bug in Trie::insert(), which would always set foundP = true,
even if a new entry was created.
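The corrected contract can be sketched with a minimal trie (illustrative Python, not the actual C++ template): foundP should be true only when the key already existed, and false when the insertion created a new entry:

```python
class Trie:
    """Minimal trie sketch illustrating the corrected insert() contract."""
    def __init__(self):
        self.children = {}
        self.value = None

    def insert(self, key):
        """Returns (node, found_p); found_p is True only if the key
        already existed (the bug was that it was always set True,
        even when a new entry was created)."""
        node = self
        found_p = True
        for k in key:
            if k not in node.children:
                node.children[k] = Trie()
                found_p = False          # a new entry was created
            node = node.children[k]
        return node, found_p
```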
* Preserve Lattice:limitIntlogs flags in lattice algebra operations.
* Use sorted node map iteration in lattice-tool expansion algorithms,
so that results are not subject to pseudo-random hash table ordering.
* HTK lattice output no longer has more nodes/links than input
(provided -no-htk-nulls, -htk-scores-on-nodes, or -htk-words-on-nodes
are NOT used).
* Take default lattice name from input filename, rather than output
filename (which may not be defined), however:
* The embedded names of output lattices from binary lattice operations
are derived from the output file name.
* Fixed bug in reading of word meshes (confusion networks) introduced
in release 1.4.
* Fixed a bug in alignments of multiple confusion networks, affecting
cases where the inputs have posterior masses != 1.
1.4.3 3 December 2004
Functionality:
* Increased the number of extra scores supported in HTK lattices
(x1, x2, ... x9).
* lattice-tool -nbest-viterbi option to use Viterbi N-best algorithm,
which uses less memory (contributed by Jing Zheng).
* Added nbest-lattice -output-ctm, analogous to lattice-tool.
* Make -output-ctm output word posteriors in the confidence field.
* Extend the meaning of the nbest-lattice -max-rescore option so that,
in lattice mode, it limits the number of hypotheses that are aligned.
(The meaning of -max-rescore was previously only defined in N-best
rescoring mode).
* Added -version option to all top-level programs.
Bug fixes:
* Improved efficiency and duplicate elimination in A-star N-best
generation (contributed by Jing Zheng).
* Worked around a problem with gawk scripts in Linux handling of
/dev/stderr device which can cause a file to be truncated if stderr is
redirected to it.
* MultiAlign::addWords() was not preserving NBestWordInfo.
Other:
* Various small code changes for compilation with gcc 3.4.3.
* Maintenance scripts moved to $SRILM/sbin/.
* Support for commercial releases excluding third-party code
contributions.
1.4.4 6 May 2005
Functionality:
* ngram-count now allows use of -wbdiscount, -kndiscount, etc.,
without a specified N-gram order, to set the default discounting
method for all N-gram orders. As before, this can be overridden by
-wbdiscount[1-9], -kndiscount[1-9], etc., for specific N-gram
lengths (suggested by Anand).
* lattice-tool -keep-pause has additional side-effects if used with
-nonevents and -ignore-vocab (making pauses behave like regular words).
* lattice-tool -dictionary-align option triggers use of dictionary
pronunciations for word mesh alignment (contributed by Dustin Hillard).
* New option lattice-tool -nbest-duplicates allows control over the
number of duplicate word hypotheses to output (from Dustin Hillard).
* Update to the FLM tools from Kevin Duh, to make fngram-count use the
-vocab option to limit the vocabulary of the estimated model.
* Added nbest-optimize -hidden-vocab option to constrain the alignment
of a subvocabulary (analogous to nbest-lattice -hidden-vocab).
* wlat-stats computes the posterior expected number of words in the
input lattice.
Bug fixes:
* ngram -unk maps unknown words in N-best hyps to <unk> instead of
adding them to the vocabulary.
* lattice-tool: Don't punt when encountering a NULL word node with
pronunciation, output a warning instead.
* lattice-tool -nbest-decode now uses a double-ended heap data
structure, and -nbest-max-stack drops hypotheses from the bottom
of the heap instead of the top (contributed by Dustin Hillard).
* lattice-tool -nbest-decode now does more thorough duplicate removal
(not just adjacent duplicates are removed).
* lattice-tool no longer gives an error if input lattice has posteriors
specified on nodes (even though they are effectively ignored).
* select-vocab: miscellaneous bug fixes from Anand.
* nbest-lattice: fixed various bugs with -nbest-backtrace option.
* compute-sclite: work around bug in csrfilt.sh -dh affecting waveform
names containing hyphens.
* Minor tweaks for MacOSX build.
1.4.5 28 August 2005
Functionality:
* ngram -debug 0 -ppl now outputs statistics for each input section
delimited by escape lines, in addition to overall results (based on
a modification by Dustin Hillard). ngram -debug 1 and higher behave as
before.
* ngram -loglinear-mix implements log-linear mixture LMs.
* LoglinearMix: new class to support the above.
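A log-linear mixture combines component LMs as P(w|h) proportional to the product of P_i(w|h)^lambda_i, renormalized over the vocabulary (unlike a linear mixture, the weights need not sum to 1). A minimal sketch over explicit distributions, not the LoglinearMix class itself:

```python
import math

def loglinear_mix(dists, weights):
    """Log-linear mixture: P(w) ~ prod_i P_i(w)**lambda_i, normalized
    over the vocabulary.  dists: list of dicts word -> prob, all over
    the same vocabulary.  Illustrative sketch only."""
    vocab = dists[0].keys()
    scores = {w: math.exp(sum(l * math.log(d[w])
                              for d, l in zip(dists, weights)))
              for w in vocab}
    Z = sum(scores.values())                 # normalization constant
    return {w: s / Z for w, s in scores.items()}
```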
* VocabMap: added remove(.) method to remove all entries for given
source word.
* WordMesh: added wordColumn() function to return confusion set at
given position (contributed by Dustin).
* Lattice: added readMesh() function to read in confusion networks
(from Dustin).
* lattice-tool -read-mesh allows handling in confusion network format
(from Dustin).
* nbest-optimize -1best-first implements a heuristic strategy whereby
the relative score weights are first optimized in -1best mode, followed
by full optimization together with posterior scale.
* nbest-optimize -max-time forces search to time out if new best
weights aren't found within a certain number of seconds.
* New script combine-rover-controls to merge multiple nbest-rover
control files for system combination.
Bug fixes:
* disambig clears old map entries when encountering a duplicate
definition for a source word.
* nbest-optimize: posterior scaling of fixed weights was broken.
* WordMesh, nbest-lattice: do better error checking on reading
confusion network files, handle numalign and posterior specs out of
order.
* lattice-tool had a bug in the handling of HTK format lattices that
do not contain an explicit specification of initial/final nodes.
* Added proper copy constructors and assignment operators for
Array, SArray, and LHash classes. This in turn makes the copy
constructor for NgramLM and other classes work properly.
(Assignment still doesn't work for some higher-level classes because
of reference (&) variable members.)
* Fixed minor bug in the ngram -skipoovs implementation, found by
Alexandre Patry.
Portability:
* Port to win32-mingw platform (by Jing Zheng). Doesn't support
compressed file i/o, or the -max-time options in nbest-optimize and
lattice-tool.
* Minor tweaks for compilation with gcc-4.0.1.
* Renamed HTKLink class to HTKWordInfo, which is more appropriate and
avoids a naming conflict with SRI's Decipher software.