Hebrew Tokenizer

A very simple python tokenizer for Hebrew text.
No batteries included - No dependencies needed!

Installation

via pip
1. pip install hebrew_tokenizer
from source
1. Download / Clone
2. python setup.py install

Usage

import hebrew_tokenizer as ht
hebrew_text = "אתמול, 8.6.2018, בשעה 17:00 הלכתי עם אמא למכולת"
tokens = ht.tokenize(hebrew_text)  # tokenize returns a generator!
for grp, token, token_num, (start_index, end_index) in tokens:
    print('{}, {}'.format(grp, token))

>>> Groups.HEBREW, 'אתמול'
>>> Groups.PUNCTUATION, ',' 
>>> Groups.DATE, '8.6.2018'
>>> Groups.PUNCTUATION, ',' 
>>> Groups.HEBREW, 'בשעה'
>>> Groups.HOUR, '17:00'
>>> Groups.HEBREW, 'הלכתי'
>>> Groups.HEBREW, 'עם'
>>> Groups.HEBREW, 'אמא'
>>> Groups.HEBREW, 'למכולת'

# by default it doesn't return whitespaces but it can be done easily
tokens = ht.tokenize(hebrew_text, with_whitespaces=True)  # notice the with_whitespace flag
for grp, token, token_num, (start_index, end_index) in tokens:
    print('{}, {}'.format(grp, token))
  
>>> Groups.HEBREW, 'אתמול'
>>> Groups.WHITESPACE, ''
>>> Groups.PUNCTUATION, ',' 
>>> Groups.WHITESPACE, ''
>>> Groups.DATE, '8.6.2018'
>>> Groups.PUNCTUATION, ','
>>> Groups.WHITESPACE, ''
>>> Groups.HEBREW, 'בשעה'
>>> Groups.WHITESPACE, ''
>>> Groups.HOUR, '17:00'
>>> Groups.WHITESPACE, ''
>>> Groups.HEBREW, 'הלכתי'
>>> Groups.WHITESPACE, ''
>>> Groups.HEBREW, 'עם'
>>> Groups.WHITESPACE, ''
>>> Groups.HEBREW, 'אמא'
>>> Groups.WHITESPACE, ''
>>> Groups.HEBREW, 'למכולת'

Disclaimer

This is NOT a POS tagger.
If that is what you are looking for - check out yap.

Contribute

Found a special case where the tokenizer fails?

Try to fix it on your own by improving the regex patterns the tokenizer is based on.
Make sure that your improvement doesn't break the other scenarios in test.py
Add you special case to test.py
Commit a pull request.

NLPH

For other great Hebrew resources check out NLPH

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
hebrew_tokenizer		hebrew_tokenizer
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
setup.cfg		setup.cfg
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Hebrew Tokenizer

Installation

Usage

Disclaimer

Contribute

NLPH

About

Releases

Packages

Languages

License

maorkavod/Hebrew-Tokenizer

Folders and files

Latest commit

History

Repository files navigation

Hebrew Tokenizer

Installation

Usage

Disclaimer

Contribute

NLPH

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages