Proposal for Streaming HuggingFace Datasets to Optimize Workflow #41

Open
vishesh9131 opened this issue Jul 14, 2024 · 0 comments

I hope this message finds you well. I would like to propose adjusting the codebase so that datasets can be streamed directly from HuggingFace instead of being downloaded in full first. This would streamline the workflow and reduce local storage requirements, which matters most for users with limited disk space or in environments where download speed is a bottleneck.

Dataset streaming can be implemented with HuggingFace's datasets library, which supports on-the-fly data access: calling load_dataset with streaming=True returns an iterable dataset that fetches examples as they are read. The change would integrate this into the existing data handling pipeline while keeping it compatible with how current users load data.

The high-level steps include:

  1. Updating the data loading functions to utilize HuggingFace's load_dataset with streaming enabled.
  2. Ensuring all downstream processes can handle data in a streamed format without requiring local storage.
  3. Conducting thorough testing to verify the integrity and performance of the streamed data pipeline.
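
As a rough illustration of steps 1 and 2, here is a minimal sketch. It is not taken from this project's code: the helper names open_stream and batched are hypothetical, and the only library call assumed is load_dataset(..., streaming=True) from HuggingFace's datasets package. The batching helper relies on plain iteration only, since a streamed dataset cannot be indexed or measured with len().

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def open_stream(dataset_name: str, split: str = "train") -> Iterable[Dict[str, Any]]:
    """Hypothetical helper: return an iterable over examples without a full download."""
    # Imported lazily so the rest of the module works without `datasets` installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset; examples are fetched as you iterate.
    return load_dataset(dataset_name, split=split, streaming=True)


def batched(examples: Iterable[Dict[str, Any]], batch_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Hypothetical helper: group a stream into lists of at most batch_size examples.

    Uses only iteration, so it works for streamed and in-memory data alike.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

A downstream loop could then look like `for batch in batched(open_stream("some/dataset"), 32): ...`, where "some/dataset" is a placeholder rather than a dataset this project uses. Because batched accepts any iterable, the same code path can be exercised in tests with a small in-memory list, which helps with step 3.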

If you are interested, I can raise a pull request with the proposed changes for your review. This would allow us to collaboratively refine and integrate this feature into the project.

Looking forward to your thoughts on this.
Best regards,

Vishesh Yadav
