n gram model format
epico edited this page Jun 29, 2011
·
6 revisions
Proposal for a new n-gram model format
- a formal textual n-gram model format for exchange, as required by the policies of Fedora, Debian, etc.
- extensible: the format stays the same even when new smoothing methods are added to the n-gram model or the model is pruned.
- simple: a parser should be very easy to write, and tools will be provided to make this easier.
- line type: appears at the beginning of every line, with a leading “\”. It declares which kind of data the current line holds.
- <…>: used for special tokens such as <english>, <unknown>, <start>, <end>, …, which represent an English word, an unknown word, the start of the line, and the end of the line.
- normal word: follows the line type; the number of normal words is dictated by the line type.
- tagname tagvalue: these tokens follow the normal words and always appear as a pair; the pairs are treated as a hash. (This is extensible.)
- "…": quotes can be used in a special token, normal word, tagname, or tagvalue, but only when you need to enter an escaped string sequence.
- every line is an entity and is described by its line type.
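The token kinds listed above could be split apart roughly as follows. This is only a sketch: the names `tokenize`, `is_line_type`, and `is_special` are hypothetical, quotes are assumed to be plain ASCII double quotes, and escape sequences inside quotes are not handled.

```python
import re

# A token is either a double-quoted string (which may contain spaces) or a
# run of non-whitespace characters. Escapes inside quotes are NOT handled.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def tokenize(line):
    """Split one line into tokens, dropping the quotes around quoted tokens."""
    tokens = []
    for match in _TOKEN.finditer(line):
        quoted, bare = match.groups()
        tokens.append(bare if quoted is None else quoted)
    return tokens


def is_line_type(token):
    """A line type is a token with a leading backslash, e.g. \\item."""
    return len(token) > 1 and token.startswith("\\")


def is_special(token):
    """A special token is wrapped in angle brackets, e.g. <start>."""
    return len(token) > 2 and token.startswith("<") and token.endswith(">")


if __name__ == "__main__":
    print(tokenize(r'\data model "k mixture model" count 1000'))
    print(tokenize(r"\item <start> count 66"))
```

A regular expression is used instead of `shlex` so that the leading backslash of the line type is kept as an ordinary character.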
The following are all possible line types.
- \data, the beginning of the data.
  Allowed tag: model (interpolation/back-off)
- \end, the end of the data.
- \<n>-gram, the beginning of the n-grams of order n, where n is taken from <n>:
  \<n>-gram {tagname tagvalue}+
  A possible tag is count, the number of items.
  This affects the following line types: \<n>-param, \item.
  The allowed orders for \<n>-param run from 0 to n-1, where n comes from \<n>-gram.
- \<n>-param, describes additional parameters for the n-gram, such as the bow value in a back-off model.
  \<n>-param is followed by n normal words; when n is zero, no normal words follow.
  Additional tagnames and tagvalues are possible.
- \item, a single item in the n-gram, followed by n normal words, where n is given by the enclosing \<n>-gram. Additional tagnames and tagvalues are possible.
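The rules for each line type might be checked line by line along these lines. This is a sketch under assumptions: `parse_line` and its return shape are hypothetical, tokens are split on whitespace only (quoted tokens are ignored), and each line type is assumed to start with a single backslash.

```python
import re

_ORDER_TYPE = re.compile(r"\\(\d+)-(gram|param)")


def parse_line(line, current_order=None):
    """Return (kind, order, words, tags) for one line.

    current_order is the n of the enclosing \\<n>-gram section, or None
    when no section has been opened yet.
    """
    tokens = line.split()
    if not tokens or not tokens[0].startswith("\\"):
        raise ValueError(f"line must start with a line type: {line!r}")
    head, rest = tokens[0], tokens[1:]

    if head in ("\\data", "\\end"):
        kind, order, n_words = head[1:], None, 0
    elif head == "\\item":
        if current_order is None:
            raise ValueError("\\item outside of an n-gram section")
        kind, order, n_words = "item", current_order, current_order
    else:
        match = _ORDER_TYPE.fullmatch(head)
        if match is None:
            raise ValueError(f"unknown line type: {head}")
        order, kind = int(match.group(1)), match.group(2)
        if kind == "gram":
            n_words = 0
        else:
            # \<n>-param is only allowed for 0 <= n <= (section order - 1).
            if current_order is None or not 0 <= order < current_order:
                raise ValueError(f"{head} not allowed in this section")
            n_words = order

    words, tag_tokens = rest[:n_words], rest[n_words:]
    if len(words) < n_words or len(tag_tokens) % 2 != 0:
        raise ValueError(f"malformed line: {line!r}")
    # Tag names and values always come in pairs and form a hash.
    tags = dict(zip(tag_tokens[0::2], tag_tokens[1::2]))
    return kind, order, words, tags
```

A caller would update `current_order` each time a `gram` line is returned, then pass it in for the following `\<n>-param` and `\item` lines.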
a. interpolation
\data model interpolation
\1-gram count 100
\0-param lambda-interpolation 0.6711
\item <start> count 66
...
\2-gram count 2000
\item 中国 人 count 100
...
\end
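A whole file in this shape could be read into a nested dictionary roughly as sketched below. The layout of the resulting dictionary and the name `parse_model` are assumptions, not part of the proposal; the sample leaves out the elided "..." lines, uses a single leading backslash, and does not handle quoted tokens.

```python
import re

SAMPLE = r"""
\data model interpolation
\1-gram count 100
\0-param lambda-interpolation 0.6711
\item <start> count 66
\2-gram count 2000
\item 中国 人 count 100
\end
"""


def pairs(tokens):
    """Turn [name, value, name, value, ...] into a dict."""
    if len(tokens) % 2 != 0:
        raise ValueError(f"unpaired tag in {tokens!r}")
    return dict(zip(tokens[0::2], tokens[1::2]))


def parse_model(text):
    """Parse the textual format into {'tags': ..., 'grams': {n: section}}."""
    model = {"tags": {}, "grams": {}}
    order = None  # n of the section currently being read
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue
        head, rest = tokens[0], tokens[1:]
        if head == "\\data":
            model["tags"] = pairs(rest)
        elif head == "\\end":
            break
        elif head == "\\item":
            if order is None:
                raise ValueError("\\item before any \\<n>-gram line")
            model["grams"][order]["items"][tuple(rest[:order])] = pairs(rest[order:])
        else:
            match = re.fullmatch(r"\\(\d+)-(gram|param)", head)
            if match is None:
                raise ValueError(f"unknown line type: {head}")
            n, kind = int(match.group(1)), match.group(2)
            if kind == "gram":
                order = n
                model["grams"][n] = {"tags": pairs(rest), "params": {}, "items": {}}
            elif order is None or n >= order:
                raise ValueError(f"{head} not allowed here")
            else:
                model["grams"][order]["params"][tuple(rest[:n])] = pairs(rest[n:])
    return model


if __name__ == "__main__":
    print(parse_model(SAMPLE)["grams"][2]["items"])
```

Keying items and params by a tuple of words keeps the zero-word `\0-param` case uniform: its key is simply the empty tuple.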
b. back-off
\data model back-off
\1-gram count 100
\item <start> freq 0.066 bow 0.1 back-off-level 0 back-off-index 0
...
\2-gram count 2000
\item 中国 人 freq 0.1 bow 0.2 back-off-level 1 back-off-index 3355
...
\3-gram count 10000
\item <start> 中国 人 freq 0.2 back-off-level 2 back-off-index 2000
\end
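Tag values arrive as text, so an importer for the back-off example would probably convert them to numbers. The sketch below assumes value types that the proposal does not actually define (`TAG_TYPES` is an illustrative guess based on the sample values), and `parse_backoff_item` is a hypothetical name.

```python
# Assumed value types for the tags seen in the back-off example. The
# proposal does not specify them; this table is illustrative only.
TAG_TYPES = {
    "count": int,
    "freq": float,
    "bow": float,
    "back-off-level": int,
    "back-off-index": int,
}


def parse_backoff_item(line, order):
    """Parse one \\item line of an order-n section into (words, typed tags)."""
    tokens = line.split()
    if not tokens or tokens[0] != "\\item":
        raise ValueError(f"not an item line: {line!r}")
    words = tokens[1:1 + order]
    tag_tokens = tokens[1 + order:]
    if len(words) != order or len(tag_tokens) % 2 != 0:
        raise ValueError(f"malformed item line: {line!r}")
    tags = {}
    for name, value in zip(tag_tokens[0::2], tag_tokens[1::2]):
        # Unknown tags are kept as strings so the format stays extensible.
        tags[name] = TAG_TYPES.get(name, str)(value)
    return tuple(words), tags


if __name__ == "__main__":
    line = r"\item 中国 人 freq 0.1 bow 0.2 back-off-level 1 back-off-index 3355"
    print(parse_backoff_item(line, 2))
```

Note that the 3-gram item in the example carries no bow tag, so a reader should treat bow as optional rather than assume it is always present.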
c. k mixture model
\data model "k mixture model" count 1000 N 10 total_freq 1100
\1-gram
\item <start> count 50 freq 51
...
\2-gram
\item 你好 啊 count 3 T 3 N_n_0 2 n_1 1 Mr 2
...
\end
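The quoted model name in this example suggests that a writer has to quote any value containing whitespace. A minimal sketch of such a writer follows; `quote` and `format_line` are hypothetical helpers, and embedded double quotes are rejected rather than escaped because the proposal does not define an escape syntax.

```python
def quote(token):
    """Wrap a token in double quotes if it is empty or contains whitespace."""
    token = str(token)
    if '"' in token:
        raise ValueError("embedded double quotes are not handled in this sketch")
    if token == "" or any(ch.isspace() for ch in token):
        return f'"{token}"'
    return token


def format_line(line_type, words=(), tags=None):
    """Build one line: \\<line_type> word ... tagname tagvalue ..."""
    parts = ["\\" + line_type]
    parts.extend(quote(word) for word in words)
    for name, value in (tags or {}).items():
        parts.append(quote(name))
        parts.append(quote(value))
    return " ".join(parts)


if __name__ == "__main__":
    header = {"model": "k mixture model", "count": 1000, "N": 10, "total_freq": 1100}
    print(format_line("data", tags=header))
    print(format_line("2-gram"))
    print(format_line("item", ["你好", "啊"], {"count": 3, "T": 3}))
    print(format_line("end"))
```

Tags are written in dictionary insertion order, so the caller controls the order in which they appear on the line.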
4. tools:
1. export tools for exporting from interpolation and back-off models.
2. import tools for importing into various models; they report an error when a required tagname or tagvalue is missing.
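The error check described for the import tools might look like the sketch below. The table of required tags is an assumption inferred from the examples above, not something the proposal specifies, and `check_item` and `MissingTagError` are hypothetical names.

```python
# Hypothetical required tags per model type, guessed from the examples.
# A real import tool would define its own requirements.
REQUIRED_ITEM_TAGS = {
    "interpolation": {"count"},
    "back-off": {"freq", "back-off-level", "back-off-index"},
}


class MissingTagError(ValueError):
    """Raised when an item lacks a tag the target model needs."""


def check_item(model, words, tags):
    """Raise MissingTagError if tags lacks a tag required by this model."""
    required = REQUIRED_ITEM_TAGS.get(model)
    if required is None:
        raise ValueError(f"unsupported model: {model!r}")
    missing = sorted(required - set(tags))
    if missing:
        item = " ".join(words)
        names = ", ".join(missing)
        raise MissingTagError(f"item '{item}': missing tag(s) {names}")


if __name__ == "__main__":
    check_item("interpolation", ["<start>"], {"count": "66"})  # passes silently
    try:
        check_item("back-off", ["中国", "人"], {"freq": "0.1"})
    except MissingTagError as exc:
        print(exc)
```

Extra tags are accepted without complaint, which keeps the check compatible with the extensible tagname/tagvalue design.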
Reference URL:
http://www-speech.sri.com/projects/srilm/manpages/ngram-format.5.html