
Collected Datasets

golololologol edited this page Jan 24, 2025 · 1 revision

For storing collected data, this project uses HDF5.

Dataset structure

The main HDF5 file has two attributes: the topk value it was collected with, and the vocab_family of the teachers. These attributes make it possible to use sideloaded datasets.
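
As a minimal sketch of how these root attributes could be inspected, the snippet below uses the h5py library. Only the attribute names topk and vocab_family come from the description above; the helper name read_root_attrs is hypothetical and is not part of this project.

```python
import h5py


def read_root_attrs(path):
    """Return the root-level attributes of an HDF5 file as a plain dict."""
    with h5py.File(path, "r") as f:
        # f.attrs behaves like a mapping of attribute name -> value.
        return {key: f.attrs[key] for key in f.attrs}
```

Calling read_root_attrs on your collected file and looking at the "topk" and "vocab_family" entries should be enough to tell whether a sideloaded dataset matches the teachers you expect.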

Each collected sample creates a new group in the main file, named convo_{id}, where id is the original id of the sample in your text dataset.
Two attributes are added to that group:

  • content_sha - used to keep the HDF5 dataset synchronized with your text dataset, by checking that the sample under each id has the expected content SHA
  • sample - contains the sample this logit data was collected from.
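
The synchronization idea behind content_sha can be sketched roughly as follows. This is an illustration only: the exact hash function and input encoding used by the project are not stated here, so SHA-256 over UTF-8 text is an assumption, and both function names are hypothetical.

```python
import hashlib


def content_sha(text):
    """Hash a sample's text. SHA-256 over UTF-8 bytes is an assumption."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_mismatched_ids(stored_shas, samples):
    """Return ids whose stored SHA is missing or differs from the current text.

    stored_shas: dict mapping sample id -> SHA string saved with the collected data
    samples:     dict mapping sample id -> current text of that sample
    """
    return sorted(
        sample_id
        for sample_id, text in samples.items()
        if stored_shas.get(sample_id) != content_sha(text)
    )
```

Any id returned by find_mismatched_ids would correspond to a sample whose text changed, or was never collected, since the HDF5 dataset was written.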


Under this group, datasets for the logits (distributions) and, if topK is enabled, for the indices are created and populated with data.
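
To see this layout in an actual file, something like the following h5py sketch could be used. It assumes only the convo_ group prefix described above; the names of the datasets inside each group are not specified here, so the code lists whatever it finds. The function name summarize_groups is hypothetical.

```python
import h5py


def summarize_groups(path):
    """Map each convo_* group to its attribute names and dataset shapes/dtypes."""
    summary = {}
    with h5py.File(path, "r") as f:
        for name, group in f.items():
            # Skip anything that is not a per-sample group.
            if not name.startswith("convo_") or not isinstance(group, h5py.Group):
                continue
            summary[name] = {
                "attrs": sorted(group.attrs.keys()),
                "datasets": {
                    key: (item.shape, str(item.dtype))
                    for key, item in group.items()
                    if isinstance(item, h5py.Dataset)
                },
            }
    return summary
```

The returned dict shows, per sample, which attributes are present and the shape and dtype of each stored dataset, which is a quick way to confirm whether indices were collected alongside the distributions.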

Dataset Sharing

If you want to share your collected datasets, you only need to upload the distributions.hdf5 file and the input text dataset you used for collection.
Also mention which LLMs you used as teachers, so that others know which student models will work with the collected data.

HDFView

If you want to check exactly what your dataset contains, HDFView can help.

Note:
Newer versions of HDFView can't open fp16 datasets on Windows.
If you can't open the distributions dataset in HDFView, this is likely the reason, rather than the file being corrupted.
Using HDFView 2.x fixes the issue; 2.14 can be found here.