Loading data from disk partially #8

Open
kathir-ks opened this issue Feb 17, 2024 · 2 comments

Comments

@kathir-ks

Training the tokenizer is memory intensive: it can need hundreds of GB of RAM. What about using memmap to load only the required portion of the data from disk? Since the access is mostly sequential, it would be significantly faster than random access. This would significantly reduce the memory required, at the cost of longer training time.
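To make the memmap idea concrete, here is a minimal sketch of counting adjacent byte pairs over a memory-mapped file, one chunk at a time. It uses Python's standard-library `mmap` rather than `numpy.memmap`, and the function name `count_pairs_mmap` and the `chunk_size` parameter are hypothetical, not part of this repository.

```python
import mmap
from collections import Counter

def count_pairs_mmap(path, chunk_size=1 << 20):
    """Count adjacent byte pairs in a file without reading it all into RAM.

    Hypothetical helper for illustration; not part of the repository.
    Note: mmap raises ValueError on an empty file.
    """
    counts = Counter()
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            n = len(mm)
            for start in range(0, n, chunk_size):
                # Read one extra byte so a pair spanning the chunk
                # boundary is counted exactly once.
                end = min(start + chunk_size + 1, n)
                chunk = mm[start:end]  # copies only this slice into memory
                counts.update(zip(chunk, chunk[1:]))
    return counts
```

Only one chunk is resident as a Python object at a time, so peak memory is governed by `chunk_size` rather than by the file size. The scan is strictly sequential, which matches the access pattern described above.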

@kathir-ks changed the title from "Loading partially from Disk" to "Loading data from disk partially" on Feb 17, 2024
@karpathy
Owner

Yeah definitely, an optimized version of the code (that does not yet exist) would absolutely have to worry about this.

@kathir-ks
Author

The approach would be to load a part of the text file at a time (sized according to the RAM available), write the merged pairs to another file, and then replace the earlier version with it.
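A rough sketch of one such pass is below. It assumes the token ids are already stored on disk as fixed-width unsigned integers (the `array` type code `"I"`), which is an assumption made for illustration and not something the repository does. The name `merge_pair_on_disk` and its parameters are hypothetical. It applies a single merge; training would repeat it once per merge.

```python
import os
from array import array

def merge_pair_on_disk(path, pair, new_id, chunk_tokens=1 << 20):
    """Replace each occurrence of `pair` with `new_id`, chunk by chunk.

    Hypothetical helper: assumes `path` holds token ids written with
    array("I").tofile(), so its size is a multiple of the item size.
    """
    tmp_path = path + ".tmp"
    itemsize = array("I").itemsize
    carry = []  # at most one token held back across a chunk boundary
    with open(path, "rb") as src, open(tmp_path, "wb") as dst:
        while True:
            data = src.read(chunk_tokens * itemsize)
            if not data:
                break
            ids = array("I")
            ids.frombytes(data)
            tokens = carry + ids.tolist()
            out, i, n = array("I"), 0, len(tokens)
            while i < n - 1:
                if tokens[i] == pair[0] and tokens[i + 1] == pair[1]:
                    out.append(new_id)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            carry = []
            if i == n - 1:  # the last token was not consumed by a merge
                if tokens[i] == pair[0]:
                    # It may pair with the first token of the next chunk.
                    carry = [tokens[i]]
                else:
                    out.append(tokens[i])
            out.tofile(dst)
        array("I", carry).tofile(dst)  # flush any token still held back
    # Swap the merged file in place of the earlier version.
    os.replace(tmp_path, path)
```

Holding back a trailing `pair[0]` token keeps the result identical to a single left-to-right pass over the whole sequence, even when a pair straddles two chunks. The trade-off is one full read and one full write of the file per merge, which is where the extra training time comes from.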
