Hebrew Tokenizer

A very simple Python tokenizer for Hebrew text.
No batteries included - no dependencies needed!

UPDATES

13/11/21

Added support for alphanumeric non-English characters.

23/10/21

Added support for Hebrew vowel points and accents (niqqud, ניקוד).

28/04/21

Added support for accented letters - great work, Daniel 👊

23/10/2020

Added support for Python 3.8+.
The code looks like shit now BUT it's working :smile:.
In the near future I will wrap it up nicely.

Installation

Via pip:
1. pip install hebrew_tokenizer
From source:
1. Download / clone the repository
2. python setup.py install

Usage

import hebrew_tokenizer as ht
hebrew_text = "אתמול, 8.6.2018, בשעה 17:00 הלכתי עם אמא למכולת"
tokens = ht.tokenize(hebrew_text)  # tokenize returns a generator!
for grp, token, token_num, (start_index, end_index) in tokens:
    print('{}, {}'.format(grp, token))

>>> HEBREW, 'אתמול'
>>> PUNCTUATION, ',' 
>>> DATE, '8.6.2018'
>>> PUNCTUATION, ',' 
>>> HEBREW, 'בשעה'
>>> HOUR, '17:00'
>>> HEBREW, 'הלכתי'
>>> HEBREW, 'עם'
>>> HEBREW, 'אמא'
>>> HEBREW, 'למכולת'

# by default whitespace tokens are not returned, but they can be enabled easily
tokens = ht.tokenize(hebrew_text, with_whitespaces=True)  # notice the with_whitespaces flag
for grp, token, token_num, (start_index, end_index) in tokens:
    print('{}, {}'.format(grp, token))
  
>>> HEBREW, 'אתמול'
>>> WHITESPACE, ''
>>> PUNCTUATION, ',' 
>>> WHITESPACE, ''
>>> DATE, '8.6.2018'
>>> PUNCTUATION, ','
>>> WHITESPACE, ''
>>> HEBREW, 'בשעה'
>>> WHITESPACE, ''
>>> HOUR, '17:00'
>>> WHITESPACE, ''
>>> HEBREW, 'הלכתי'
>>> WHITESPACE, ''
>>> HEBREW, 'עם'
>>> WHITESPACE, ''
>>> HEBREW, 'אמא'
>>> WHITESPACE, ''
>>> HEBREW, 'למכולת'
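
Since each item yielded by tokenize is a 4-part tuple, a common next step is to collect tokens by group. The sketch below is only an illustration: `group_tokens` is a hypothetical helper, not part of the library, and `sample` is a hand-written stand-in for the generator's output, so the token numbers and index pairs in it are placeholders rather than values produced by the tokenizer.

```python
from collections import defaultdict


def group_tokens(tokens):
    """Collect token strings by group name.

    Expects an iterable of (grp, token, token_num, (start_index, end_index))
    tuples, the shape unpacked in the usage example above.
    """
    grouped = defaultdict(list)
    for grp, token, _token_num, (_start, _end) in tokens:
        grouped[grp].append(token)
    return dict(grouped)


# Hand-written stand-in for ht.tokenize(hebrew_text); the token numbers and
# index pairs are illustrative placeholders, not real library output.
sample = [
    ("HEBREW", "אתמול", 0, (0, 5)),
    ("PUNCTUATION", ",", 1, (5, 6)),
    ("DATE", "8.6.2018", 2, (7, 15)),
    ("PUNCTUATION", ",", 3, (15, 16)),
    ("HEBREW", "בשעה", 4, (17, 21)),
    ("HOUR", "17:00", 5, (22, 27)),
]

grouped = group_tokens(sample)
print(grouped["DATE"])  # ['8.6.2018']
print(grouped["HOUR"])  # ['17:00']
```

With the package installed, the same helper should accept the generator directly, for example `group_tokens(ht.tokenize(hebrew_text))`, assuming the group values print as the names shown above.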

Disclaimer

This is NOT a POS tagger.
If that is what you are looking for - check out yap.

Contribute

Found a special case where the tokenizer fails?

1. Try to fix it on your own by improving the regex patterns the tokenizer is based on.
2. Make sure that your improvement doesn't break the other scenarios in test.py.
3. Add your special case to test.py.
4. Open a pull request.
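
For contributors who have not worked with a regex-based tokenizer before, the general idea can be sketched with the standard `re` module. This is a toy illustration only: the patterns, the `TOKEN_SPEC` table, and `toy_tokenize` are assumptions made for this example and are not the patterns or functions used in hebrew_tokenizer. It shows why pattern order matters (DATE is tried before HOUR and PUNCTUATION) and how a special case can be checked with a plain assertion.

```python
import re

# Hypothetical patterns for illustration; the library's real patterns differ.
# Order matters: earlier alternatives win when several could match.
TOKEN_SPEC = [
    ("DATE", r"\d{1,2}\.\d{1,2}\.\d{2,4}"),
    ("HOUR", r"\d{1,2}:\d{2}"),
    ("HEBREW", r"[\u05D0-\u05EA]+"),  # Hebrew letters alef through tav
    ("PUNCTUATION", r"[.,!?;:]"),
    ("WHITESPACE", r"\s+"),
]

# One master pattern with a named group per token type.
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def toy_tokenize(text):
    """Yield (group, token) pairs, skipping whitespace matches."""
    for match in MASTER.finditer(text):
        if match.lastgroup != "WHITESPACE":
            yield match.lastgroup, match.group()


# A special case written as a test: the date must not be split at its dots.
result = list(toy_tokenize("8.6.2018, 17:00"))
assert result == [
    ("DATE", "8.6.2018"),
    ("PUNCTUATION", ","),
    ("HOUR", "17:00"),
]
```

A new special case would follow the same shape: add an input that currently fails, adjust the pattern, and confirm that the existing assertions still pass.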

NLPH

For other great Hebrew resources check out NLPH

YontiLevin/Hebrew-Tokenizer