Skip to content

Add custom fingerprint support to from_generator #7533

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

simonreise
Copy link

This PR adds dataset_id_suffix parameter to 'Dataset.from_generator' function.

Dataset.from_generator function passes all of its arguments to BuilderConfig.create_config_id, including generator function itself. BuilderConfig.create_config_id function tries to hash all the args, which can take a large amount of time or even cause MemoryError if the dataset processed in a generator function is large enough.

This PR allows user to pass a custom fingerprint (dataset_id_suffix) to be used as a suffix in a dataset name instead of the one generated by hashing the args.

This PR is a possible solution of #7513

@lhoestq
Copy link
Member

lhoestq commented Apr 24, 2025

This is great !

What do you think of passing config_id= directly to the builder instead of just the suffix ? This would be a power user argument though, or for internal use. And in from_generator the new argument can be fingerprint= as in Dataset.__init__()

The config_id can be defined using something like config_id = "default-fingerprint=" + fingerprint

I feel ike this could make the Dataset API more coherent if we avoid introducing a new argument while we can juste use fingerprint=

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants