Add custom fingerprint support to from_generator
#7533
+49
−36
This PR adds a `dataset_id_suffix` parameter to `Dataset.from_generator`.

`Dataset.from_generator` passes all of its arguments, including the generator function itself, to `BuilderConfig.create_config_id`. `BuilderConfig.create_config_id` tries to hash all of these arguments, which can take a long time or even raise a `MemoryError` if the dataset processed in the generator function is large enough.

This PR lets the user pass a custom fingerprint (`dataset_id_suffix`) that is used as the suffix of the dataset name instead of the one generated by hashing the arguments.

This PR is a possible solution to #7513.
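
To illustrate the idea, here is a minimal, self-contained sketch of the two code paths. It is not the library's implementation: `make_config_id` and its `suffix` argument are hypothetical stand-ins for the config-id step and the proposed `dataset_id_suffix`, and the use of `pickle` plus SHA-256 is only an assumption chosen to make the cost of hashing the arguments visible.

```python
import hashlib
import pickle
from typing import Any, Dict, Optional


def make_config_id(
    name: str,
    config_kwargs: Dict[str, Any],
    suffix: Optional[str] = None,
) -> str:
    """Hypothetical stand-in for a config-id builder (not the library's code)."""
    if suffix is None:
        # Default path: serialize and hash every argument. The cost grows
        # with the size of whatever the arguments reference.
        payload = pickle.dumps(sorted(config_kwargs.items()))
        suffix = hashlib.sha256(payload).hexdigest()[:16]
    # Custom path: the caller-supplied suffix is used as-is; nothing is hashed.
    return f"{name}-{suffix}"


# Arguments that reference a large in-memory object.
big_args = {"gen_kwargs": {"rows": list(range(100_000))}}

hashed_id = make_config_id("generator", big_args)  # serializes all rows
custom_id = make_config_id("generator", big_args, suffix="my-dataset-v1")
```

With a caller-supplied suffix the large arguments are never serialized, so the id is produced in constant time. The trade-off is that the caller becomes responsible for choosing a suffix that changes whenever the generated data changes.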