Skip to content
epico edited this page Nov 30, 2010 · 1 revision

libpinyin Internal

1. the introduction of the components.

a. Pinyin Table

Functionality:
1. provide pinyin sequence to word token id look up;
2. compress the consecutive token sequences to reduce the binary search in bi-gram, or tri-gram. (see PhraseIndexRange)

b. Phrase Table

Functionality:
1. provide UCS2 character sequence to word id look up.

c. Phrase Index

Functionality:
1. given the word token id, we can look up the following information in phrase index:
 a. the length of the phrase;
    b. the number of pronunciations;
    c. the frequency of the phrase in uni-gram;
    d. all pronunciations;
    e. the phrase in the UCS2 format.

d. Bi-gram

Functionality:
1. provide the bi-gram possibility of two word sequence, eg P(W2|W1).

e. Tri-gram

To be written...

2. data structues of each component.

a. Pinyin Table

The pinyin table is a two-level index table.
The first level is indexed by the first Pinyin Key (represented by a short16, consist of initial key, final key and tone) of each phrase. When search in the first index level, fuzzy pinyin match is considered.
The second level is indexed by the length of the phrase. (Actually it is the length of the phrase by one, as the first pinyin key is in the first index level)
The final item array is a sorted pinyin keys and token id pairs by pinyin key order. From the benchmark, binary search is fast enough, and space is compact.
One more thing, after the look up step, the pinyin table will compress the consecutive token ids into a token range, which will greatly fast the bi-gram search. As we can issue a token range binary search, instead of one binary search for each token id.

b. Phrase Table

The phrase table uses very similar data structure like the pinyin table. But as phrase table don’t need to deal with fuzzy pinyin, the phrase table is simpler than the pinyin table. All characters is stored in UCS2 format.
The first level is indexed by the first UCS2 character.
The second level is indexed by the length of the phrase. (It is also the length of the phrase by one, as the first UCS2 character is in the first index level.)
The final item array is a sorted UCS2 characters and token id pairs by UCS2 characters order.
By the way, as one string only corresponds to one phrase, so every search only returns one token id, not as the pinyin table returns many token ids.

c. Phrase Index

The phrase index is separated by sub phrase index, as we need to handle multiple user dictionary.
When you look up  a token id, 4-bit of the token id indicates the sub phrase index. Using 24-bit of token id to find the offset in the content chunk from the index chunk.

d. Bi-gram

The bi-gram contains P(W2|W1) information, W1 can be the special token -- sentence start.
To look up the P(W2|W1) information:
1. Load the entire information about P(*|W1) from Berkeley db (Hash table indexed).
2. Use the SingleGram to find the P(W2|W1) information.
Note:
bool search(/* in */ PhraseIndexRange * range, /* out */ BigramPhraseArray array);
This is designed specially for pinyin look up, as pinyin table can return compressed token id as range (for characters with multiple pronunciations, the consecutive token ids occurs very frequently).

3. summary

The current implementation is more clear than the first version, and much simpler. As this project merges sunpinyin and novel-pinyin, the code contribution is very welcome from either sides or others.

PS: Frankly speaking, I don’t want to put too much restrictions on the component interface, as libpinyin is still under heavily development. And when new language model is invented, and new storage component is needed, hope there is enough freedom for the new component writer to fit the need of the new language model and algorithms.