Loading data from disk partially #8

kathir-ks · 2024-02-17T17:20:49Z

Training the tokenizer is memory intensive: it can need hundreds of GBs of RAM. What about using memmap to load only the required portion of the data from disk? Since the access is mostly sequential, it would be significantly faster than random access. This would significantly reduce the memory required, at the cost of longer training time.
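To make the memmap idea concrete, here is a minimal sketch that counts adjacent byte pairs while reading the file one window at a time. It uses Python's standard-library `mmap` module as a stand-in for whichever memmap implementation is meant. The function name `count_pairs_mmap` and the `chunk_size` parameter are hypothetical and not part of the existing code, and the sketch assumes training on raw bytes with no regex pre-splitting.

```python
import mmap
from collections import Counter

def count_pairs_mmap(path, chunk_size=1 << 20):
    """Count adjacent byte pairs without reading the whole file into RAM.

    Hypothetical helper for illustration only. Note that mmap raises
    ValueError on an empty file, so callers should handle that case.
    """
    counts = Counter()
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        n = len(mm)
        for start in range(0, n - 1, chunk_size):
            # Take one extra byte so the pair that straddles the chunk
            # boundary is counted exactly once.
            chunk = mm[start:min(start + chunk_size + 1, n)]
            # Iterating over bytes yields ints, so each key is an (int, int) pair.
            counts.update(zip(chunk, chunk[1:]))
    return counts
```

Only one window of roughly `chunk_size` bytes is copied into memory at a time, and the scan is strictly sequential. The pair counter itself still lives in RAM, but it is bounded by the number of distinct pairs rather than by the file size.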

karpathy · 2024-02-17T17:21:46Z

Yeah definitely, an optimized version of the code (that does not yet exist) would absolutely have to worry about this.

kathir-ks · 2024-02-17T17:29:41Z

The approach would be to load part of the txt file (depending on the RAM available), write the merged pairs to another file, and then replace the earlier version with it.
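A rough sketch of one such pass follows, under some assumptions that are not in the thread: the text has already been converted to a file of little-endian uint32 token ids (merged ids exceed 255, so single bytes no longer suffice), and the file size is a multiple of 4. The names `merge_pass` and `chunk_tokens` are hypothetical.

```python
import os
import struct

def merge_pass(path, pair, new_id, chunk_tokens=1 << 18):
    """Apply one merge (pair -> new_id) to a uint32 token file, chunk by chunk.

    Hypothetical helper for illustration only. The result is written to a
    temporary file, which then replaces the earlier version.
    """
    tmp_path = path + ".tmp"
    carry = None  # trailing token held back because it may start a pair
    with open(path, "rb") as src, open(tmp_path, "wb") as dst:
        while True:
            raw = src.read(4 * chunk_tokens)
            if not raw:
                break
            ids = list(struct.unpack(f"<{len(raw) // 4}I", raw))
            if carry is not None:
                ids.insert(0, carry)
                carry = None
            out = []
            i = 0
            while i < len(ids):
                if i + 1 < len(ids) and ids[i] == pair[0] and ids[i + 1] == pair[1]:
                    out.append(new_id)  # merge the pair into the new token
                    i += 2
                elif i + 1 == len(ids) and ids[i] == pair[0]:
                    carry = ids[i]  # might pair with the next chunk's first token
                    i += 1
                else:
                    out.append(ids[i])
                    i += 1
            dst.write(struct.pack(f"<{len(out)}I", *out))
        if carry is not None:
            dst.write(struct.pack("<I", carry))
    os.replace(tmp_path, path)  # swap the merged file in for the old one
```

Holding back a trailing token that matches `pair[0]` keeps the output identical to a single left-to-right pass over the whole file, regardless of where the chunk boundaries fall. The cost is one full read and write of the file per merge, which is the memory-for-time trade-off described above.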

kathir-ks changed the title ~~Loading partially from Disk~~ Loading data from disk partially Feb 17, 2024
