Collected Datasets
For storing collected data, this project uses HDF5.
The main HDF5 file has two attributes: the `topk` it was collected with, and the `vocab_family` of the teachers. These are used to support sideloaded datasets.
Each collected sample creates a new group in the main file, named `convo_{id}`, where `id` is the original id of the sample in your text dataset. Two attributes are added to that group:
- `content_sha` - used for synchronizing the HDF5 dataset with your text dataset, by making sure the samples under corresponding ids have the correct content shas
- `sample` - contains the sample this logit data was collected from
Under this group, datasets for the logits (`distributions`) and, optionally, for the indices (if topK is enabled) are created and populated with data.
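If you prefer to inspect a collected file programmatically, a few lines of h5py are enough. The sketch below assumes the layout described above; the exact dataset names inside each group (here `distributions` and `indices`) are assumptions and may differ depending on your collection settings.

```python
# Minimal sketch for inspecting a collected file with h5py.
# File name comes from this page; per-group dataset names are assumptions.
import h5py

with h5py.File("distributions.hdf5", "r") as f:
    # File-level attributes described above
    print("topk:", f.attrs["topk"])
    print("vocab_family:", f.attrs["vocab_family"])

    # Walk the per-sample groups (convo_{id})
    for name, group in f.items():
        if not name.startswith("convo_"):
            continue
        print(name)
        print("  content_sha:", group.attrs["content_sha"])
        print("  sample:", group.attrs["sample"])

        # Datasets inside the group (e.g. logits and, with topK, indices)
        for ds_name, ds in group.items():
            print(f"  {ds_name}: shape={ds.shape}, dtype={ds.dtype}")
```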
If you want to share your collected datasets, you only need to upload the `distributions.hdf5` file plus the input text dataset you used for collection.
Don't forget to also mention somewhere which LLMs you used as teachers, so other people know which student models will work with that collected data.
If you want to check what exactly your dataset contains, HDFView will help with that.
Note: newer versions of HDFView can't open fp16 datasets on Windows.
If you can't open the `distributions` dataset in your HDFView, this is likely the reason, not a sign that the file is corrupted.
Using HDFView 2.x fixes the issue; 2.14 can be found here.