A very simple Python tokenizer for Hebrew text.
No batteries included - No dependencies needed!
Added support for alphanumeric non-English characters.
Added support for Hebrew diacritics (niqqud, ניקוד).
Added support for accented letters - great work, Daniel 👊
Added support for Python 3.8+.
The code looks like shit right now, BUT it's working :smile:
In the near future I will wrap it up nicely.
- via pip
pip install hebrew_tokenizer
- from source
- Download / Clone
python setup.py install
import hebrew_tokenizer as ht
hebrew_text = "אתמול, 8.6.2018, בשעה 17:00 הלכתי עם אמא למכולת"
tokens = ht.tokenize(hebrew_text) # tokenize returns a generator!
for grp, token, token_num, (start_index, end_index) in tokens:
    print('{}, {}'.format(grp, token))
>>> HEBREW, 'אתמול'
>>> PUNCTUATION, ','
>>> DATE, '8.6.2018'
>>> PUNCTUATION, ','
>>> HEBREW, 'בשעה'
>>> HOUR, '17:00'
>>> HEBREW, 'הלכתי'
>>> HEBREW, 'עם'
>>> HEBREW, 'אמא'
>>> HEBREW, 'למכולת'
# by default it doesn't return whitespaces but it can be done easily
tokens = ht.tokenize(hebrew_text, with_whitespaces=True) # notice the with_whitespaces flag
for grp, token, token_num, (start_index, end_index) in tokens:
    print('{}, {}'.format(grp, token))
>>> HEBREW, 'אתמול'
>>> WHITESPACE, ''
>>> PUNCTUATION, ','
>>> WHITESPACE, ''
>>> DATE, '8.6.2018'
>>> PUNCTUATION, ','
>>> WHITESPACE, ''
>>> HEBREW, 'בשעה'
>>> WHITESPACE, ''
>>> HOUR, '17:00'
>>> WHITESPACE, ''
>>> HEBREW, 'הלכתי'
>>> WHITESPACE, ''
>>> HEBREW, 'עם'
>>> WHITESPACE, ''
>>> HEBREW, 'אמא'
>>> WHITESPACE, ''
>>> HEBREW, 'למכולת'
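Since `tokenize` returns a generator, it can only be iterated once, so it is often handy to collect the tokens into a plain structure right away. The helper below is a minimal sketch, not part of the library: `group_tokens` is a hypothetical name, and the sample tuples are hand-written in the `(grp, token, token_num, (start_index, end_index))` shape shown above, with illustrative offsets.

```python
from collections import defaultdict


def group_tokens(tokens):
    """Bucket tokens by their group label, consuming the iterable once.

    `tokens` is any iterable of (grp, token, token_num, (start, end)) tuples.
    str(grp) is used as the key, matching the label printed in the examples.
    """
    buckets = defaultdict(list)
    for grp, token, token_num, (start, end) in tokens:
        buckets[str(grp)].append((token, start, end))
    return dict(buckets)


# Hand-written sample tuples (assumed shape; offsets are illustrative only)
sample = [
    ("HEBREW", "אתמול", 0, (0, 5)),
    ("PUNCTUATION", ",", 1, (5, 6)),
    ("DATE", "8.6.2018", 2, (7, 15)),
]

grouped = group_tokens(sample)
for label, items in grouped.items():
    print(label, items)
```

With the real tokenizer, the same helper should accept the generator directly, e.g. `group_tokens(ht.tokenize(hebrew_text))`.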
This is NOT a POS tagger.
If that is what you are looking for - check out yap.
Found a special case where the tokenizer fails?
- Try to fix it on your own by improving the regex patterns the tokenizer is based on.
- Make sure that your improvement doesn't break the other scenarios in test.py
- Add your special case to test.py
- Submit a pull request.
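To get a feel for what "improving the regex patterns" means before diving into the source, here is a rough sketch of a regex-driven tokenizer built on named groups. The patterns, `TOKEN_SPEC`, and `sketch_tokenize` are all made up for illustration and are NOT the library's actual regexes or API. The main thing to notice is that pattern order matters: more specific patterns (dates, hours) have to come before more general ones.

```python
import re

# Hypothetical patterns for illustration only; the real tokenizer's
# patterns are more thorough. Order matters: specific patterns first.
TOKEN_SPEC = [
    ("DATE", r"\d{1,2}\.\d{1,2}\.\d{2,4}"),
    ("HOUR", r"\d{1,2}:\d{2}"),
    ("HEBREW", r"[\u05D0-\u05EA]+"),  # Hebrew letters alef..tav
    ("PUNCTUATION", r"[,.;:!?]"),
    ("WHITESPACE", r"\s+"),
]

# One alternation of named groups; match.lastgroup names the branch that hit.
MASTER = re.compile("|".join("(?P<{}>{})".format(n, p) for n, p in TOKEN_SPEC))


def sketch_tokenize(text, with_whitespaces=False):
    """Yield (grp, token, token_num, (start, end)) tuples for `text`."""
    token_num = 0
    for match in MASTER.finditer(text):
        grp = match.lastgroup
        if grp == "WHITESPACE" and not with_whitespaces:
            continue  # skip whitespace unless explicitly requested
        yield grp, match.group(), token_num, match.span()
        token_num += 1


for grp, token, token_num, span in sketch_tokenize("אתמול, 8.6.2018, בשעה 17:00"):
    print("{}, {!r}".format(grp, token))
```

A new special case would then boil down to adding or adjusting one pattern and checking that the existing scenarios still produce the same groups.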
For other great Hebrew resources, check out NLPH.