MemoryError while creating dataset from generator #7513
Comments
Upd: created a PR that can probably solve the problem: #7514
Hi! We need to take the generator into account for the cache. The generator is hashed to make the dataset fingerprint used by the cache; this way you can reload the Dataset from the cache without regenerating it in subsequent runs. Maybe instead of removing the generator from the hasher input, we can let users pass their own Dataset fingerprint to `from_generator`.
Upd: I successfully generated a dataset from my large geospatial data. Maybe letting users pass their own fingerprint to skip hashing can be a great solution to that issue!
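For illustration, a rough sketch of how that idea could look from the caller's side. The `fingerprint` argument below is an assumption made for the example, not a documented parameter of `Dataset.from_generator` in `datasets` 3.5.0:

```python
# Hypothetical sketch: a user-supplied fingerprint identifies the dataset in the cache
# without hashing the generator itself. The `fingerprint` parameter is assumed here,
# not an existing argument of Dataset.from_generator in datasets 3.5.0.
from datasets import Dataset


def gen():
    for i in range(3):
        yield {"value": i}


ds = Dataset.from_generator(gen, fingerprint="my-dataset-v1")
```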
Describe the bug

TL;DR

The `Dataset.from_generator` function passes all of its arguments to `BuilderConfig.create_config_id`, including the `generator` function itself. `BuilderConfig.create_config_id` tries to hash all of the args, which can take a large amount of time or even cause a MemoryError if the dataset processed in the generator function is large enough. Maybe we should pop `generator` from `config_kwargs_to_add_to_suffix` before hashing to avoid it.
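For reference, a minimal sketch of what that change might look like, assuming it is applied where `create_config_id` assembles the kwargs to hash. This mirrors the suggestion above; it is not necessarily the final upstream fix:

```python
# Sketch of the suggested change inside BuilderConfig.create_config_id (datasets/builder.py):
# drop the generator callable before the remaining kwargs are hashed into the config id.
config_kwargs_to_add_to_suffix = config_kwargs.copy()
config_kwargs_to_add_to_suffix.pop("generator", None)  # don't hash the (potentially huge) generator
```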
Full description
I have a pretty large spatial imagery dataset that is generated from two `xbatcher.BatchGenerator`s via a custom `dataset_generator` function that looks like this if simplified:
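(The snippet below is an illustrative reconstruction, not the original code; the batch generator names `x_gen`/`y_gen` and the field names are assumptions.)

```python
# Illustrative reconstruction of the simplified dataset_generator: x_gen and y_gen are
# assumed to be xbatcher.BatchGenerator instances yielding aligned input/target batches.
import numpy as np


def dataset_generator(x_gen, y_gen):
    for x_batch, y_batch in zip(x_gen, y_gen):
        yield {
            "image": np.asarray(x_batch, dtype=np.float32),
            "label": np.asarray(y_batch, dtype=np.float32),
        }
```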
Then I use `datasets.Dataset.from_generator` to generate the dataset itself. It works nicely with pretty small data, but if the dataset is huge and barely fits in memory, it crashes with a memory error:
Full stack trace
A memory error is an expected type of error in such a case, but when I started digging, I found that it occurs in a rather unexpected place: the `create_config_id` function. It tries to hash `config_kwargs_to_add_to_suffix`, including the generator function itself.

I modified the `BuilderConfig.create_config_id` code (roughly along the lines sketched below) to check which values are hashed and how long hashing them takes, and ran it on a toy dataset:
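(The exact patch isn't reproduced here; a sketch of this kind of instrumentation, timing the library's `Hasher` on each kwarg, could look like this.)

```python
# Hypothetical instrumentation inside BuilderConfig.create_config_id: time how long the
# datasets Hasher spends on each entry of config_kwargs_to_add_to_suffix.
import time

from datasets.fingerprint import Hasher

for key, value in config_kwargs_to_add_to_suffix.items():
    start = time.perf_counter()
    Hasher.hash(value)
    print(f"hashed {key!r} in {time.perf_counter() - start:.4f} s")
```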
In my case, `config_kwargs_to_add_to_suffix` contained, among other things, the `generator` function itself. I also noticed that hashing took a significant amount of time: 43.1482 seconds, while the overall function execution (with data loading, batching, and saving the dataset) took 2 min 45 s. The output of `create_config_id` is just a dataset id, so that is an inappropriately large amount of time.

But when I added `config_kwargs_to_add_to_suffix.pop("generator", None)`, hashing took only 0.0060 seconds.

Maybe we shouldn't hash the generator function, as it can be really computationally and memory expensive.
Steps to reproduce the bug
This is a simplified example of the workflow I used to generate the dataset (see the sketch below), but I think you can use almost any workflow to reproduce this bug.
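(A minimal illustrative reconstruction; the array shape, helper name `make_generator`, and column names are made up for the example.)

```python
# Minimal illustrative reproduction: the generator closes over a large in-memory array,
# so hashing the generator in BuilderConfig.create_config_id becomes slow and can run
# out of memory for sufficiently large data.
import numpy as np
from datasets import Dataset


def make_generator():
    big_array = np.random.rand(2_000, 1_000, 100)  # enlarge to approach a MemoryError

    def dataset_generator():
        for row in big_array:
            yield {"values": row.ravel().astype(np.float32)}

    return dataset_generator


ds = Dataset.from_generator(make_generator())
```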
Please try adding `config_kwargs_to_add_to_suffix.pop("generator", None)` to `BuilderConfig.create_config_id` and then measure how much time it takes to run the code block above, with and without `config_kwargs_to_add_to_suffix.pop("generator", None)`. In my case the difference was 3.3828 seconds without popping the generator function and 0.0010 seconds with popping.
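One simple way to compare the two runs (a sketch that reuses the hypothetical `make_generator` helper from the reproduction above; a distinct cache directory per run is assumed so caching doesn't skew the numbers):

```python
# Time Dataset.from_generator with and without the pop applied to create_config_id.
import time

from datasets import Dataset

start = time.perf_counter()
ds = Dataset.from_generator(make_generator(), cache_dir="./cache_run")
print(f"Dataset.from_generator took {time.perf_counter() - start:.4f} s")
```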
Expected behavior
Much faster hashing and no MemoryErrors
Environment info
- `datasets` version: 3.5.0
- `huggingface_hub` version: 0.30.1
- `fsspec` version: 2024.12.0