scripts/preprocess.py should sort tokens lexicographically #43

AlekzNet · 2016-03-22T23:43:16Z

Currently, indexes are assigned to tokens in order of first occurrence. If the text changes (think of fixing a typo, or continuing to train a pre-trained model on a different corpus), the indexes may be reassigned, which will break subsequent training initialized from a checkpoint.
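To illustrate the problem, here is a minimal sketch comparing the two assignment strategies. The function names and the 1-based indexing are assumptions made for the example; this is not the actual code in scripts/preprocess.py.

```python
def first_occurrence_vocab(text):
    """Assign indexes in the order tokens are first seen (hypothetical helper)."""
    token_to_idx = {}
    for ch in text:
        if ch not in token_to_idx:
            # 1-based indexes are an assumption for this sketch
            token_to_idx[ch] = len(token_to_idx) + 1
    return token_to_idx


def sorted_vocab(text):
    """Assign indexes by lexicographic order of the unique tokens (hypothetical helper)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)), start=1)}


with_typo = "teh cat"
fixed = "the cat"

# Same set of characters, but "e" and "h" swap their first-occurrence order.
order_dependent_changed = first_occurrence_vocab(with_typo) != first_occurrence_vocab(fixed)
sorted_is_stable = sorted_vocab(with_typo) == sorted_vocab(fixed)
```

Note that sorting only makes the mapping depend on the set of tokens rather than on their order. If the edited text introduces or drops a token, the sorted indexes can still shift.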

dgcrouse · 2017-04-27T04:56:45Z

Duplicate of #12. We are addressing this by creating a script that encodes new data using an existing JSON schema. This is faster and more space-friendly, esp. for non-text datasets.
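A rough sketch of what encoding new data against an existing vocabulary could look like. The `token_to_idx` key in the JSON, the `encode_with_vocab` name, and the choice to reject unknown tokens are all assumptions for illustration; the actual script may differ.

```python
import json


def encode_with_vocab(text, token_to_idx):
    """Encode text using an existing token-to-index mapping (hypothetical helper).

    Fails loudly on unseen tokens instead of assigning new indexes,
    so indexes used by an existing checkpoint are never reassigned.
    """
    missing = sorted(set(text) - set(token_to_idx))
    if missing:
        raise ValueError(f"tokens not in existing vocabulary: {missing}")
    return [token_to_idx[ch] for ch in text]


# Stand-in for the JSON file written when the original data was preprocessed.
vocab_json = '{"token_to_idx": {"t": 1, "e": 2, "h": 3, " ": 4, "c": 5, "a": 6}}'
token_to_idx = json.loads(vocab_json)["token_to_idx"]

encoded = encode_with_vocab("the cat", token_to_idx)
```

Because the mapping is loaded rather than rebuilt, the corrected text gets the same indexes the checkpoint was trained with, regardless of token order.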

AlekzNet mentioned this issue Apr 2, 2016

Option for non HDF5 #12

Open

dgcrouse added the duplicate label Apr 27, 2017

dgcrouse closed this as completed Apr 27, 2017
