Hi, thank you for creating such a great wrapper for Hugging Face. It truly makes it very easy to get started and is simple to use.
I am running into a problem while training a text classification model ("roberta-base") on a fairly large dataset (over 20 million text paragraph blocks; the CSV file is about 1 GB on disk). My workstation has a hefty 256 GB of RAM, so I can generally load most datasets into memory, and this hasn't been an issue in the time I've worked with this library. But when I run

happy_tc.train()

the RAM usage blows up during the "Preprocessing dataset..." stage, memory eventually runs out, and the kernel crashes.
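For reference, my setup looks roughly like the following (the CSV path and number of labels are placeholders for my real data):

```python
from happytransformer import HappyTextClassification

# Rough sketch of my setup; "train.csv" and num_labels=2 are placeholders.
happy_tc = HappyTextClassification("ROBERTA", "roberta-base", num_labels=2)

# RAM climbs steadily during "Preprocessing dataset..." until the kernel dies.
happy_tc.train("train.csv")
```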
I'm not entirely sure why this happens, since the dataset is only about 1 GB on disk; even allowing a substantial 100x expansion factor during preprocessing, everything should still fit in memory.
Regardless, I think the problem is ultimately that the Hugging Face "load_dataset" function loads everything into memory by default. But you can pass the "streaming=True" parameter to avoid this: https://huggingface.co/learn/nlp-course/chapter5/4?fw=pt
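For example, something along these lines avoids materializing the whole file (the CSV path is just a placeholder):

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset that reads the CSV lazily
# instead of loading all ~20M rows into RAM at once.
dataset = load_dataset("csv", data_files="train.csv", split="train", streaming=True)

# Peek at a few rows without loading the full dataset.
for example in dataset.take(3):
    print(example)
```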
Is there a way to pass this parameter with the current configuration? If not, would you consider enabling it via the training args?
Otherwise, do you have any other ideas about what might be causing this problem?
Thanks