HebMorph flowchart

Here is a flowchart of HebMorph, illustrating how it works in its current state.

[Figure: HebMorph flowchart]

Following the numbers in the chart:

1. When MorphAnalyzer is initialized, it loads morphology data from hspell. Most of the data lives in the hspell data files provided with the project, and some is compiled into the project. Support for a single versioned dictionary file is planned.
2. A consumer calls HebMorph in one of three ways:
- As a tokenizer, to split a long string known to contain Hebrew words into tokens. The Hebrew tokenizer provided with HebMorph handles Hebrew better than general-purpose tokenizers that are not designed for Hebrew.
- As a lemmatizer, to get back a lemma and additional morphological information for a given word.
- As a stream lemmatizer: given a string known to contain Hebrew words, StreamLemmatizer tokenizes it and then provides morphological analysis for each Hebrew token. Non-Hebrew tokens are returned as well.
3. When a lemmatizer is used, lookup can be either exact or tolerant, using the lookup tolerator functions. This is an extensible interface.
4. Tokenizer, Lemmatizer and StreamLemmatizer return Token objects. A Hebrew token is returned as a HebrewToken instance, which derives from Token and contains morphological data.
5. Lemma filters are available and can be plugged into the process. Filtering can be based on score (i.e. confidence), length, a stop-list lookup, and so on.
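
The steps above can be sketched as a toy pipeline in Python. This is not HebMorph's actual API. The function names (tokenize, lemmatize, stream_lemmatize), the small in-memory dictionary standing in for the hspell data, the yod/vav-dropping tolerator, the 0.5 score penalty for tolerant matches, and the two filters are all assumptions made for illustration. Only the class names Token and HebrewToken are borrowed from the description above.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Lemma = Tuple[str, float]  # (lemma text, confidence score)

# Toy stand-in for the hspell-derived morphology data (illustrative entries only).
TOY_DICTIONARY: Dict[str, List[Lemma]] = {
    "ספרים": [("ספר", 1.0)],
    "דבר": [("דבר", 1.0)],
    "של": [("של", 1.0)],
}

HEBREW_WORD = re.compile(r"[\u05D0-\u05EA]+")  # Hebrew letters alef..tav
ANY_WORD = re.compile(r"[\u05D0-\u05EA]+|[A-Za-z0-9]+")


@dataclass
class Token:
    text: str


@dataclass
class HebrewToken(Token):
    # Morphological data; here reduced to candidate lemmas with scores.
    lemmas: List[Lemma] = field(default_factory=list)


def tokenize(text: str) -> List[str]:
    """Split text into Hebrew words and Latin/digit words."""
    return ANY_WORD.findall(text)


def drop_one_vowel_letter(word: str) -> Iterator[str]:
    """Hypothetical tolerator: variants of word with one yod or vav removed."""
    for i, ch in enumerate(word):
        if ch in "\u05D9\u05D5":  # yod, vav
            yield word[:i] + word[i + 1:]


def lemmatize(word: str,
              tolerators: Iterable[Callable[[str], Iterable[str]]] = ()) -> HebrewToken:
    """Exact lookup first; fall back to tolerant lookup if nothing matches."""
    exact = TOY_DICTIONARY.get(word)
    if exact:
        return HebrewToken(word, list(exact))
    found: List[Lemma] = []
    for tolerate in tolerators:
        for variant in tolerate(word):
            for lemma, score in TOY_DICTIONARY.get(variant, []):
                found.append((lemma, score * 0.5))  # assumed penalty for tolerant hits
    return HebrewToken(word, found)


def stream_lemmatize(text: str,
                     tolerators: Iterable[Callable[[str], Iterable[str]]] = (),
                     filters: Iterable[Callable[[Lemma], bool]] = ()) -> Iterator[Token]:
    """Tokenize, analyze Hebrew tokens, apply lemma filters, pass other tokens through."""
    tolerators, filters = list(tolerators), list(filters)
    for piece in tokenize(text):
        if HEBREW_WORD.fullmatch(piece):
            token = lemmatize(piece, tolerators)
            for keep in filters:
                token.lemmas = [lemma for lemma in token.lemmas if keep(lemma)]
            yield token
        else:
            yield Token(piece)


STOP_LEMMAS = {"של"}


def not_stop_word(lemma: Lemma) -> bool:
    """Filter: drop lemmas that appear on the stop list."""
    return lemma[0] not in STOP_LEMMAS


def min_score(threshold: float) -> Callable[[Lemma], bool]:
    """Filter factory: keep lemmas whose score is at least threshold."""
    return lambda lemma: lemma[1] >= threshold
```

Calling stream_lemmatize with tolerators=[drop_one_vowel_letter] and filters=[not_stop_word, min_score(0.4)] yields a HebrewToken for each Hebrew word and a plain Token for everything else. Keeping tolerators and filters as plain callables mirrors the pluggable design described in steps 3 and 5, though the real interfaces may look quite different.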
