New Word Recognizer
-
The word should occur multiple times, estimated by frequency;
-
The word should follow and precede many different words, estimated by information entropy;
-
create the initial sqlite3 databases with the schema "CREATE TABLE ngram (words TEXT NOT NULL, freq INTEGER NOT NULL);"
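A minimal sketch of this step using Python's standard `sqlite3` module. Only the `CREATE TABLE` statement comes from the text above; the helper names `create_ngram_db` and `add_count`, and the way several words are packed into the `words` column, are assumptions made for illustration.

```python
import sqlite3

# Schema quoted from the step above.
SCHEMA = "CREATE TABLE ngram (words TEXT NOT NULL, freq INTEGER NOT NULL);"


def create_ngram_db(path=":memory:"):
    """Open a database and create the empty ngram table (hypothetical helper)."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.commit()
    return conn


def add_count(conn, words, count=1):
    """Add `count` to the frequency of `words`, inserting the row if it is new."""
    cur = conn.execute(
        "UPDATE ngram SET freq = freq + ? WHERE words = ?", (count, words)
    )
    if cur.rowcount == 0:
        # No existing row was updated, so this is the first occurrence.
        conn.execute(
            "INSERT INTO ngram (words, freq) VALUES (?, ?)", (words, count)
        )
```

One database per n-gram order (1-gram, 2-gram, and so on) would fit the later merge step, but the text does not say how the databases are split, so treat that layout as a guess.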
-
estimate the frequency threshold;
-
get the frequencies of all existing words;
-
take the frequency of the word at the 60% position as the threshold.
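The threshold estimate can be sketched as picking one element from a sorted list. The text does not say whether the list is sorted ascending or descending, so the descending order used here is an assumption, and `position_threshold` is a hypothetical name.

```python
def position_threshold(values, percent):
    """Return the value found `percent` of the way through the sorted values."""
    if not values:
        raise ValueError("no values to estimate a threshold from")
    # Assumption: highest value first; flip `reverse` if the original
    # tool sorts in ascending order.
    ordered = sorted(values, reverse=True)
    index = min(len(ordered) * percent // 100, len(ordered) - 1)
    return ordered[index]


# Made-up frequencies of existing words, for illustration only.
existing_word_freqs = [50, 40, 30, 20, 10, 5, 4, 3, 2, 1]
freq_threshold = position_threshold(existing_word_freqs, 60)
```

Because the function only sorts and indexes, it could also be reused for other position-based thresholds by passing a different `percent`.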
-
get all partial words;
-
get all word pairs whose frequency is above the frequency threshold;
-
recursively merge the word pairs in all sqlite3 databases.
-
merge from higher-order to lower-order n-grams: n-gram ⇒ (n-1)-gram, …, 3-gram ⇒ 2-gram, 2-gram ⇒ 1-gram;
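The selection half of this step maps directly onto a SQL query against the `ngram` table. The sketch below shows only that part; the merge itself is not shown because the text does not specify how merged entries are keyed or counted. The function name `frequent_entries` and the sample rows are made up for illustration.

```python
import sqlite3


def frequent_entries(conn, threshold):
    """Return (words, freq) rows whose frequency is strictly above `threshold`."""
    rows = conn.execute(
        "SELECT words, freq FROM ngram WHERE freq > ? ORDER BY freq DESC, words",
        (threshold,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ngram (words TEXT NOT NULL, freq INTEGER NOT NULL);")
conn.executemany(
    "INSERT INTO ngram (words, freq) VALUES (?, ?)",
    [("a b", 12), ("b c", 3), ("c d", 7)],  # made-up sample counts
)
candidates = frequent_entries(conn, 5)
```

Running the same query against each database, starting with the highest-order one, would follow the n-gram ⇒ (n-1)-gram direction described above.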
-
estimate the information entropy threshold;
-
get the prefix information entropy of all existing words;
-
take the prefix information entropy of the word at the 69% position as the prefix threshold.
-
get the postfix information entropy of all existing words;
-
take the postfix information entropy of the word at the 69% position as the postfix threshold.
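For reference, a sketch of how the two entropies might be computed from word-pair counts. It assumes that "prefix" entropy is the Shannon entropy of the words seen immediately before a word and "postfix" entropy that of the words seen immediately after it; the text does not define either term, and the function names and the use of the natural logarithm are illustrative choices.

```python
import math
from collections import Counter


def entropy(neighbor_counts):
    """Shannon entropy (in nats) of a neighbor-frequency distribution."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log(c / total)
        for c in neighbor_counts.values()
        if c > 0
    )


def neighbor_entropies(bigrams):
    """Map each word to (prefix entropy, postfix entropy).

    `bigrams` maps (left, right) word pairs to their frequency.
    """
    before = {}  # word -> Counter of words seen immediately before it
    after = {}   # word -> Counter of words seen immediately after it
    for (left, right), freq in bigrams.items():
        before.setdefault(right, Counter())[left] += freq
        after.setdefault(left, Counter())[right] += freq
    words = set(before) | set(after)
    return {
        w: (entropy(before.get(w, Counter())), entropy(after.get(w, Counter())))
        for w in words
    }
```

A word that always appears next to the same neighbor gets an entropy of zero on that side, which is why a low value suggests the word is only a fragment of a longer unit.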
-
filter all partial words to get new words;
-
keep only the words whose prefix and postfix information entropies are both above their respective thresholds.
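The final filter can be sketched as a conjunction of two comparisons. The function name, the argument layout, and the sample values are hypothetical; `entropies` is assumed to map each partial word to a `(prefix entropy, postfix entropy)` pair.

```python
def filter_new_words(candidates, entropies, prefix_threshold, postfix_threshold):
    """Return the candidates whose two entropies both exceed their thresholds."""
    kept = []
    for word in candidates:
        prefix_h, postfix_h = entropies[word]
        # A candidate must pass on both sides to count as a new word.
        if prefix_h > prefix_threshold and postfix_h > postfix_threshold:
            kept.append(word)
    return kept


# Made-up entropies for three partial words.
sample = {"ab": (1.2, 0.9), "bc": (0.1, 1.5), "cd": (0.8, 0.2)}
new_words = filter_new_words(["ab", "bc", "cd"], sample, 0.5, 0.5)
```

Requiring both sides to pass rejects fragments that vary freely on one side but are pinned to a fixed neighbor on the other.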