Description
Feature request
Implement a centralized, reusable utility module for binary file operations that can be shared across modules. This utility should:
- Standardize reading and writing headers, binary data, and indices.
- Support modular integration with existing components like EmbeddedStreamData and others.
- Reduce code duplication while improving readability and maintainability.
Motivation
Currently, there is duplicated code for reading and writing binary files across multiple modules and functions, including:
- EmbeddedStreamData
- PackedDataGenerator
- LargeFileLinesReader
- shuffle_tokenized_data()
This redundancy increases maintenance overhead and the risk of inconsistencies. For example, reading headers, writing index data, and handling binary streams are repeated in different forms, leading to potential bugs and inefficiencies.