- Refactor tokenizer
- Inline pipelines
- Lazy predicates
- Refactor interpretation
- Reimplement rule relations
See natasha#48 for more
Major update required to support recursive grammar.
- Library API changed, including grammar DSL and parser API
- Components updated: tokenizer, pipelines, parser, interpretation
- Initial object interpretation support
- Replaced custom tree-like struct used in `combinator.resolve_matches` method with `IntervalTree`
- `number_match` and `case_match` labels now accept the same arguments as the `gnc_match` label
- Support `match_all_disambiguation_forms` argument in `gnc_match` label
- New token types - `ROMN` for roman numbers, like `XXI`, `EMAIL` for emails and `PHONE` for phone numbers
- New labels - `and_` & `or_`
- Implemented `get_normalized_text` function that returns normalized text from parsed tokens
- Partial morphology disambiguation solving support (`gnc_match` label now accepts an optional boolean argument `solve_disambiguation`, which, when True, reduces the number of token forms in the resulting match)
- Rewrote labels, now they're function-based
- Rewrote tokenizer's `transform` function to make it easier to extend
- Tokenizer now adds different types of grammemes for different types of quotes (e.g. `L-QUOTE` for `«` quote)
- Implemented DAWG-based pipeline, which shows better performance than the dictionary-based pipeline
- Reimplemented `resolve_matches` method in `Combinator`
- [fix] Fixed error when parsing float range with comma as delimiter
- [fix] Additional checks for terminal rule in grammar's `reduce` method
- [fix] Fixed requirements in setup.py
- [fix] Tokenizer now correctly understands range values on Python 2.x and PyPy platforms
- [fix] Create new grammars with terminal rule instead of appending it to original one
- Replaced shift-reduce parser with GLR parser, because it performs much better on multiple grammars (linear time with GLR vs. exponential time with shift-reduce parser).
- Pipelines support
- Added support for Python 2.7 & PyPy
- Implemented `gnc-match`, `in` labels
- Implemented `gender-match`, `number-match`, `case-match` labels
- Replaced `is-title` label with `is-capitalized` label, due to http://bugs.python.org/issue7008
- Tokenizer now understands integer and float ranges (actually, two numbers separated by dash)
- Initial release
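
The range entries above describe a range as "two numbers separated by dash", with a later fix for floats that use a comma as delimiter. The snippet below is a minimal sketch of that idea only. It is not the library's tokenizer: the regular expression and the names `RANGE_RE`, `parse_range` and `_to_number` are assumptions made for illustration.

```python
import re

# Hypothetical pattern: an integer or a float, where either a dot
# or a comma may act as the decimal delimiter.
NUMBER = r"\d+(?:[.,]\d+)?"

# Two such numbers separated by a dash, with optional whitespace around it.
RANGE_RE = re.compile(r"^(%s)\s*-\s*(%s)$" % (NUMBER, NUMBER))


def _to_number(raw):
    """Convert a matched number string to int or float."""
    # Normalize a decimal comma to a dot before conversion.
    raw = raw.replace(",", ".")
    return float(raw) if "." in raw else int(raw)


def parse_range(text):
    """Return (start, end) for a range string, or None if it is not a range."""
    match = RANGE_RE.match(text.strip())
    if match is None:
        return None
    return tuple(_to_number(group) for group in match.groups())
```

With this sketch, `parse_range("1-5")` yields a pair of integers and `parse_range("1,5-2,5")` yields a pair of floats, while text that is not a range yields `None`. It runs unchanged on Python 2.7 and Python 3.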