scripts/preprocess.py should sort tokens lexicographically #43

Closed
AlekzNet opened this issue Mar 22, 2016 · 1 comment


@AlekzNet

Currently, indexes are assigned to tokens in order of first occurrence. If the text changes (for example, after fixing a typo, or when continuing to train a pre-trained model on a different corpus), the indexes may be reassigned, which will break subsequent training initialized from a checkpoint.
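
To illustrate the problem, here is a minimal sketch and not the actual code from scripts/preprocess.py. It assumes character-level tokens and 1-based indexes; the function names `first_occurrence_vocab` and `sorted_vocab` are made up for this example.

```python
def first_occurrence_vocab(text):
    """Assign indexes in the order tokens first appear (the behaviour described above)."""
    token_to_idx = {}
    for ch in text:
        if ch not in token_to_idx:
            token_to_idx[ch] = len(token_to_idx) + 1  # 1-based index is an assumption
    return token_to_idx


def sorted_vocab(text):
    """Assign indexes by lexicographic order of the distinct tokens."""
    return {ch: i for i, ch in enumerate(sorted(set(text)), start=1)}


# Two corpora with the same set of tokens, in a different order.
original = "cab"
edited = "abc"

# First-occurrence numbering depends on token order in the text.
print(first_occurrence_vocab(original) == first_occurrence_vocab(edited))  # False

# Sorted numbering depends only on which tokens are present.
print(sorted_vocab(original) == sorted_vocab(edited))  # True
```

Note that sorting only keeps the mapping stable while the set of tokens stays the same. If an edit adds or removes a token, the indexes of the tokens that sort after it will still shift.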

@dgcrouse

Duplicate of #12. We are addressing this by creating a script that encodes new data using an existing JSON schema. This is faster and more space-friendly, especially for non-text datasets.
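
A rough sketch of what encoding new data against an existing mapping could look like. This is not the script mentioned above. The JSON layout (a `token_to_idx` object), the helper name `encode_with_vocab`, and the choice to reject unknown tokens are all assumptions made for illustration.

```python
import json


def encode_with_vocab(text, token_to_idx):
    """Encode text with an existing token-to-index mapping, without renumbering."""
    missing = sorted(set(text) - set(token_to_idx))
    if missing:
        # Assumption: unknown tokens are an error rather than being appended.
        raise ValueError("tokens not in existing vocabulary: %r" % missing)
    return [token_to_idx[ch] for ch in text]


# Stand-in for a JSON file written by an earlier preprocessing run
# (hypothetical layout).
schema_json = json.dumps({"token_to_idx": {"a": 1, "b": 2, "c": 3}})
token_to_idx = json.loads(schema_json)["token_to_idx"]

encoded = encode_with_vocab("cab", token_to_idx)
print(encoded)  # [3, 1, 2]
```

Because the mapping is loaded rather than rebuilt, the indexes seen by a checkpointed model stay the same no matter how the new text is ordered.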
