Include some pre-packaged NLP tools #95

amir-zeldes · 2018-09-19T14:54:01Z

e.g., make a built-in tokenizer available directly, rather than only through an external REST API

lgessler · 2018-09-19T20:08:11Z

NLTK has several tokenizers that we could allow users to choose from using a line in the config
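A rough sketch of what that config-driven choice could look like. The `[nlp]` section, the `tokenizer` key, the option names, and `load_tokenizer` are all hypothetical and not existing GitDox settings; the NLTK classes are only imported when one of them is actually selected, so the plain `split` fallback works without NLTK installed.

```python
import configparser


def _nltk_tokenizer(class_name):
    """Return a factory that builds an NLTK tokenizer on demand."""
    def factory():
        # Imported lazily so NLTK is only required if the config asks for it.
        import nltk.tokenize as nltk_tokenize
        return getattr(nltk_tokenize, class_name)().tokenize
    return factory


# Hypothetical option names mapped to tokenizer factories.
TOKENIZERS = {
    "split": lambda: str.split,  # stdlib fallback, no NLTK needed
    "whitespace": _nltk_tokenizer("WhitespaceTokenizer"),
    "wordpunct": _nltk_tokenizer("WordPunctTokenizer"),
    "treebank": _nltk_tokenizer("TreebankWordTokenizer"),
}


def load_tokenizer(config_text):
    """Read the (assumed) `tokenizer` line from `[nlp]` and return a callable."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    name = parser.get("nlp", "tokenizer", fallback="split")
    if name not in TOKENIZERS:
        raise ValueError(f"Unknown tokenizer in config: {name!r}")
    return TOKENIZERS[name]()


if __name__ == "__main__":
    tokenize = load_tokenizer("[nlp]\ntokenizer = split\n")
    print(tokenize("NLTK has several tokenizers"))
```

Adding another tokenizer would then only mean adding one entry to the lookup table.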

amir-zeldes · 2018-09-20T03:05:27Z

One issue with NLTK is that its tokenizers don't preserve XML: if users need to be able to transform data to spreadsheet mode, we need a tokenizer that produces TT-SGML (or we need to offer other ways of transforming to spreadsheets). The TreeTagger tokenizer does this, but it is written in Perl (GU GitDox currently uses it via a service call). I recently ported this tokenizer to Python, though:

https://github.com/amir-zeldes/HebPipe/blob/master/lib/whitespace_tokenize.py

This could be a candidate for a generic tokenizer that preserves XML, outputs TT format, and lets you plug in different abbreviation files so that language-specific abbreviations are not split.
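To illustrate the idea (this is not the code in `whitespace_tokenize.py`, just a minimal sketch under simplifying assumptions): XML tags pass through untouched on their own lines, text is emitted one token per line, and a caller-supplied set of abbreviations is protected from splitting. The function names, the punctuation sets, and the tag regex are all assumptions; the regex will misfire on attribute values containing `>`.

```python
import re

# Naive tag matcher: assumes no '>' inside attribute values.
TAG_RE = re.compile(r"(<[^>]+>)")

# Illustrative punctuation sets; a real tokenizer would be language-specific.
LEADING = "([{\"'"
TRAILING = ")]}\"'.,;:!?"


def _split_word(word, abbreviations):
    """Peel punctuation off both ends of a word, leaving abbreviations whole."""
    if word in abbreviations:
        return [word]
    leading, trailing = [], []
    while len(word) > 1 and word[0] in LEADING:
        leading.append(word[0])
        word = word[1:]
    while len(word) > 1 and word[-1] in TRAILING and word not in abbreviations:
        trailing.insert(0, word[-1])
        word = word[:-1]
    return leading + [word] + trailing


def tokenize_preserving_xml(text, abbreviations=frozenset()):
    """Return TT-style output: one token or one XML tag per line."""
    lines = []
    for chunk in TAG_RE.split(text):
        if TAG_RE.fullmatch(chunk):
            lines.append(chunk)  # keep the tag exactly as written
            continue
        for word in chunk.split():
            lines.extend(_split_word(word, abbreviations))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = '<p n="1">Dr. Smith said (quietly): hello.</p>'
    print(tokenize_preserving_xml(sample, abbreviations={"Dr."}))
```

In practice the abbreviation set would be loaded from a per-language file, one abbreviation per line.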
