Training the tokenizer is memory-intensive: it can need hundreds of GBs of RAM. What about using memmap to load only the required portion of the data from disk? Since access is mostly sequential, it would be significantly faster than random access. This would significantly reduce the memory required, at the cost of longer training time.
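To make the idea concrete, here is a minimal sketch of counting adjacent byte pairs through a memory map, one window at a time. It uses Python's standard `mmap` module rather than `numpy.memmap`, and the function name `count_byte_pairs` and the `chunk_size` parameter are hypothetical, not part of this repository. It only illustrates the sequential access pattern and is not a full training loop.

```python
import mmap
from collections import Counter


def count_byte_pairs(path, chunk_size=1 << 20):
    """Count adjacent byte pairs in a file while holding only one window in RAM.

    Hypothetical helper for illustration; assumes a non-empty file
    (mmap refuses to map an empty one).
    """
    counts = Counter()
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        n = len(mm)
        for start in range(0, n - 1, chunk_size):
            # Take one extra byte so the pair that straddles the
            # window boundary is counted exactly once.
            stop = min(start + chunk_size + 1, n)
            chunk = mm[start:stop]  # copies only this window into memory
            counts.update(zip(chunk, chunk[1:]))
    return counts
```

The windows are read front to back, so the operating system can prefetch pages sequentially. Peak memory is governed by `chunk_size` plus the size of the pair table, not by the size of the file.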
kathir-ks changed the title from "Loading partially from Disk" to "Loading data from disk partially" on Feb 17, 2024
The approach would be to load part of the txt file at a time (depending on the RAM available), write the merged pairs to another file, and then replace the earlier version with it.
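A rough sketch of one such pass, under some assumptions of mine: the text has already been converted to a flat binary file of token ids stored as native unsigned ints (`array("I")`), and the names `apply_merge_on_disk`, `chunk_tokens` and the `.tmp` suffix are hypothetical. It streams the file in chunks, writes the merged sequence to a second file, and then swaps it in for the original.

```python
import os
from array import array


def apply_merge_on_disk(path, pair, new_id, chunk_tokens=1 << 20):
    """Replace each occurrence of `pair` with `new_id`, one chunk at a time.

    Assumes `path` holds token ids as native unsigned ints (an assumption
    made for this sketch, not a format defined by the project).
    """
    a, b = pair
    tmp_path = path + ".tmp"
    item = array("I").itemsize
    pending = None  # an `a` held back until we see whether `b` follows it
    with open(path, "rb") as src, open(tmp_path, "wb") as dst:
        while True:
            buf = src.read(chunk_tokens * item)
            if not buf:
                break
            tokens = array("I")
            tokens.frombytes(buf)
            out = array("I")
            for t in tokens:
                if pending is not None:
                    if t == b:
                        out.append(new_id)  # pair found, possibly across chunks
                        pending = None
                        continue
                    out.append(pending)  # no match: emit the held-back token
                    pending = None
                if t == a:
                    pending = t
                else:
                    out.append(t)
            out.tofile(dst)
        if pending is not None:
            array("I", [pending]).tofile(dst)
    # Swap the merged file in for the earlier version.
    os.replace(tmp_path, path)
```

Because `pending` survives from one chunk to the next, a pair split across a chunk boundary is still merged. The cost is one full read and one full write of the data per merge, which is the training-time trade-off mentioned in the issue.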