If the training data is too large to fit into memory, you can most likely subsample random sentences, and this won't significantly affect quality.
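For concreteness, here is a minimal sketch of that subsampling step using reservoir sampling, so the source file never has to fit in memory. The file paths and sample size are placeholders, and the training call mentioned in the comment assumes a YouTokenToMe-style API (`yttm.BPE.train`), which may differ from your setup.

```python
import random

def subsample_corpus(src_path, dst_path, n_samples, seed=0):
    """Reservoir-sample n_samples lines from src_path into dst_path.

    Only n_samples lines are held in memory at once, so the source
    file can be arbitrarily large.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(src_path, encoding="utf-8") as src:
        for i, line in enumerate(src):
            if i < n_samples:
                reservoir.append(line)
            else:
                # Each earlier line survives with probability n_samples / (i + 1).
                j = rng.randint(0, i)
                if j < n_samples:
                    reservoir[j] = line
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(reservoir)

# The subsampled file can then be passed to training as usual,
# e.g. yttm.BPE.train(data="corpus_sample.txt", ...) in a
# YouTokenToMe-style API (assumption, not confirmed by this thread).
subsample_corpus("corpus.txt", "corpus_sample.txt", n_samples=1_000_000)
```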
Are you going to add encoding of a dataset directly from a file? Right now bpe.encode on an in-memory list takes longer than bpe.train on a file; isn't that odd? And bpe.train used less memory than bpe.encode with the full list loaded.
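Until file-based encoding exists, one workaround is to encode in bounded-size chunks rather than loading the full list. This is only a sketch: it assumes a YouTokenToMe-style Python API (`yttm.BPE`, `bpe.encode`, `yttm.OutputType.ID`) and a trained model at `bpe.model`, all placeholders.

```python
import itertools

import youtokentome as yttm  # assumed library; adjust to your tokenizer

BATCH = 100_000  # lines per chunk; tune to available memory

bpe = yttm.BPE(model="bpe.model")

with open("corpus.txt", encoding="utf-8") as src, \
     open("corpus_ids.txt", "w", encoding="utf-8") as dst:
    while True:
        # Read at most BATCH lines so memory stays bounded.
        chunk = list(itertools.islice(src, BATCH))
        if not chunk:
            break
        ids = bpe.encode(chunk, output_type=yttm.OutputType.ID)
        for sent_ids in ids:
            dst.write(" ".join(map(str, sent_ids)) + "\n")
```

Note this only bounds memory; it doesn't explain why encode runs slower than train, which looks like a separate performance question.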
Right now the tokenizer loads the whole corpus into memory, which becomes an issue for large files.
Is it possible to read the corpus file line by line, or split it in some other way (while still training on the corpus as a whole)?
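To make the "split it in some other way" part concrete, here is a minimal sketch (paths and shard size are placeholders) that shards a corpus while reading only one line at a time. One caveat: BPE training needs global token statistics, so training on shards separately is not equivalent to training on the whole corpus; sharding mainly helps for encoding, which is why subsampling is suggested above for training.

```python
def split_corpus(src_path, shard_prefix, lines_per_shard=1_000_000):
    """Split a large corpus into shards, reading one line at a time."""
    shard_idx, count, dst = 0, 0, None
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            if dst is None:
                dst = open(f"{shard_prefix}.{shard_idx:05d}", "w",
                           encoding="utf-8")
            dst.write(line)
            count += 1
            if count == lines_per_shard:
                dst.close()
                dst, count, shard_idx = None, 0, shard_idx + 1
    if dst is not None:
        dst.close()

# Produces shard.00000, shard.00001, ... with bounded memory use.
split_corpus("corpus.txt", "shard")
```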