- (models/fro) Updated the model
- (deps) Moved to PaPie
- (models/fro) Fixed regression for apostrophes
- (models/fro) Added a `[REF:...]` excluder
- (models/Freem) Updated the model to handle morph
- (models/Freem+Fr) Added reference excluder
- (models/Freem+Fr) Excluders updated to use `CharRegistry`
- (pipeline/excluders) Reworked `ApostropheExcluder` and the like to use `CharRegistry`
- (CLI) Allows specifying `--max-tokens` at the CLI level
- (model/lasla) Updated model to use multi-part pie model
- (model/grc) Added a
[REF:.*]
excluder - (Tagger) Passed the argument for lower to the main Tagger (might not change a thing)
- (requirements) Upgraded pie version because I (@ponteineptique) messed up.
- (requirements) Upgraded pie version
- (models/grc) Added morphology tags and updated to 0.0.2
- (requirements) Upgraded pie version
- (models/lasla) Support ignoring character tokens through `[IGN:char]`
- (pipeline/excluders) Made sure excluders use the same replacement character through a `CharRegistry` dictionary
- (models/lasla) Use model LASLA+ from 0.0.5b trained on PyTorch 1.3.1
- Added a `max_tokens` per sentence limit in DataIterators.
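
  A minimal sketch of the idea only, not pie-extended's actual DataIterator (the helper name is hypothetical): one way such a limit can be applied is to cut overly long sentences into smaller chunks.

  ```python
  from typing import Iterable, List

  def split_sentence(tokens: List[str], max_tokens: int) -> Iterable[List[str]]:
      """Yield chunks of at most ``max_tokens`` tokens from a single sentence.

      Illustration only: the real DataIterator may enforce the limit differently.
      """
      for start in range(0, len(tokens), max_tokens):
          yield tokens[start:start + max_tokens]

  # A 7-token sentence with max_tokens=3 yields chunks of 3, 3 and 1 tokens
  print(list(split_sentence("in nova fert animus mutatas dicere formas".split(), 3)))
  ```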
- (models/fro) Updated model fro to 0.3.0 using multiple tasks
- (models/dum) Added a new model with Middle Dutch thanks to Mike Kestemont
- (tokenizers) Added a `SimpleTokenizer` based on length
- (models/lasla) Applied unidecode
- (models/lasla) Use model LASLA+ from 0.0.5alpha trained on PyTorch 1.3.1
- (models/lasla) Updated the abbreviation list
- (CI) Added Github Actions
- (Documentation) Added a warning about supported python versions
- (Documentation) Fixed the example
- (pipeline) Created `AbbreviationsRemoverExcluder`
- (dependencies) Cleaned the version requirements due to pip update
- (models/LASLA) Fixed a bug where clitics were not split correctly after nouns
- Fixed multiple typos in CHANGES.md in version numbers
- New Latin model which handles capitalized input, entities and better disambiguation.
- (Latin Model) Fixed a long-standing bug where Latin would not tag Gender because I forgot it in the `GlueProcessor`... Big facepalm
- Fixed the way the DataIterator deals with documents ending with a sentence formed of excluded tokens only.
- Fixed a typo in an import pattern
- (Latin Model) Dealt with some weird Unicode numerals which unexpectedly broke our `.isnumeric()` usage (e.g. ↀ)
- Added a way to tag texts where words are already tokenized: new lines are word separators, double new lines are sentence separators
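
  A minimal sketch of reading that format, assuming plain-text input (the helper name is illustrative, not part of the pie-extended API):

  ```python
  from typing import List

  def read_pretokenized(text: str) -> List[List[str]]:
      """Split pre-tokenized input: one token per line, blank line between sentences."""
      sentences = []
      for block in text.strip().split("\n\n"):
          tokens = [line.strip() for line in block.split("\n") if line.strip()]
          if tokens:
              sentences.append(tokens)
      return sentences

  sample = "Gallia\nest\nomnis\ndivisa\n\nin\npartes\ntres"
  print(read_pretokenized(sample))
  # [['Gallia', 'est', 'omnis', 'divisa'], ['in', 'partes', 'tres']]
  ```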
- Reworked the way preprocessing of special characters is done before and after sentence tokenization: created the `Excluder` class (`pie_extended.pipeline.tokenizers.utils.excluder`).
  - Allows for more code sharing across models.
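
  A rough sketch of the pattern described here, not the actual `Excluder` API (class and method names are invented for illustration): a regex match is masked with a placeholder character before sentence tokenization and restored afterwards.

  ```python
  import re

  class SketchExcluder:
      """Illustrative excluder: mask a pattern before tokenization, restore it after."""

      def __init__(self, pattern: str, placeholder: str):
          self.pattern = re.compile(pattern)
          self.placeholder = placeholder  # in pie-extended this would come from a shared registry
          self._saved = []

      def before_tokenization(self, text: str) -> str:
          def mask(match: re.Match) -> str:
              self._saved.append(match.group(0))
              return self.placeholder
          return self.pattern.sub(mask, text)

      def after_tokenization(self, token: str) -> str:
          if token == self.placeholder:
              return self._saved.pop(0)
          return token

  excluder = SketchExcluder(r"\[REF:[^\]]*\]", "\uf8ff")  # private-use placeholder char
  masked = excluder.before_tokenization("uixit annos [REF:CIL 6.123] XX")
  print([excluder.after_tokenization(tok) for tok in masked.split()])
  # ['uixit', 'annos', '[REF:CIL 6.123]', 'XX']
  ```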
- Fixed a typo that would prevent tagging with FREEM (and nobody saw that! ;) )
- Fixed Early Modern French Model (reusing processor and tokenizer of FR model)
- Added Ancient Greek Model (very basic addition, probably needs more work)
- Added Early Modern French Model (reusing processor and tokenizer of FR model)
- Hotfixed column order in the TSV output
- Hotfixed lowercasing for the Latin model
Unfilled TODO
Unfilled TODO
- `PIE_EXTENDED_DOWNLOADS` environment variable can be used to set up a non-default directory for models and linked data.
  - e.g. `PIE_EXTENDED_DOWNLOADS=~/PieData pie-extended download fr`
- (Breaking) Postprocessors now must return a list of dicts instead of a dict with `.get_dict()` methods (#c8be021)
- Added a better tokenizer for Classical French:
  - Keeps `aujourd'hui` intact
  - Keeps the union dash with pronouns, e.g. `-le` in sentences such as `mange-le`
  - Keeps the euphonic `-t` together with the pronoun: `mange-t-il` becomes `mange` and `-t-il`
    - It is removed from lemmatization until a new model is trained (the old model had `-t` on the verb)
  - Elision works as intended for non-euphonic cases such as `Va-t'en` -> `va`, `-t'`, `en`
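
  The splits stated above, summarised as data for quick reference (copied from these entries, not generated by the tokenizer):

  ```python
  # Expected token splits taken from the changelog entries above (illustration only)
  examples = {
      "aujourd'hui": ["aujourd'hui"],     # kept intact
      "mange-t-il": ["mange", "-t-il"],   # euphonic -t stays with the pronoun
      "Va-t'en": ["va", "-t'", "en"],     # elision handled in the non-euphonic case
  }
  for text, tokens in examples.items():
      print(f"{text!r} -> {tokens}")
  ```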
- Updated Classical French models (#15)
- Added a post-processor to split tokens (#17)