Model Training

golololologol edited this page Jun 22, 2024 · 1 revision

The core of this project is the collect_and_finetune.py script. Its main function orchestrates the entire pipeline, managing data loading, model compatibility checks, dataset preparation, and the actual training loop. Key steps include:

  1. Initializing paths and models, and converting any model weights not already in .safetensors format.

  2. Ensuring compatibility between teacher and student models using their vocabulary families.

  3. Preparing datasets for training and validation.

  4. Setting model parameters.

  5. Synchronizing the h5 datasets with the current text dataset by remapping, deleting, and collecting samples as needed.

  6. Training the student model, falling back to chunked data collection and training when the estimated size of the h5 dataset exceeds the specified max cache size, and running validation at regular intervals.
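Step 1's format check can be sketched roughly as below. The helper name and the ".bin but no .safetensors" heuristic are assumptions for illustration, not the script's actual code:

```python
from pathlib import Path

def needs_conversion(model_dir: Path) -> bool:
    """Hypothetical helper: a model folder needs converting when it
    ships .bin weight shards but no .safetensors files yet."""
    has_safetensors = any(model_dir.glob("*.safetensors"))
    has_bin = any(model_dir.glob("*.bin"))
    return has_bin and not has_safetensors
```

Models that pass this check would then be re-saved in .safetensors format before training begins.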
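One plausible way to implement step 2's vocabulary-family check is to compare how many teacher tokens map to the same ids in the student's vocabulary; the function and threshold below are assumptions, not the script's real logic:

```python
def same_vocab_family(teacher_vocab: dict, student_vocab: dict,
                      overlap: float = 0.95) -> bool:
    """Hypothetical check: two models belong to the same vocabulary
    family when nearly all teacher tokens share ids with the student."""
    if not teacher_vocab:
        return False
    matching = sum(1 for tok, idx in teacher_vocab.items()
                   if student_vocab.get(tok) == idx)
    return matching / len(teacher_vocab) >= overlap
```

If the families differ, distillation targets collected from the teacher would not line up with the student's logits, so the run should abort early.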
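Step 5's synchronization amounts to a set comparison between the sample ids already cached in h5 and those in the current text dataset. A minimal sketch (the `sync_plan` helper is assumed, not taken from the script):

```python
def sync_plan(cached_ids: set, current_ids: set):
    """Hypothetical planner: decide which cached samples to keep,
    which to delete, and which new ones still need collecting."""
    to_keep = cached_ids & current_ids     # already collected, still wanted
    to_delete = cached_ids - current_ids   # stale cache entries
    to_collect = current_ids - cached_ids  # missing from the cache
    return to_keep, to_delete, to_collect
```

The kept samples would then be remapped to their new positions, the stale ones dropped, and only the missing ones collected from the teacher.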
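The chunking decision in step 6 can be sketched as a size estimate against the cache cap; the helper below is a simplified assumption of how such a split might work, not the script's implementation:

```python
def plan_chunks(num_samples: int, bytes_per_sample: int,
                max_cache_bytes: int) -> list:
    """Hypothetical sketch: collect everything in one pass if the
    estimated h5 size fits the cache cap, otherwise split the sample
    indices into chunks that each stay under the cap."""
    total = num_samples * bytes_per_sample
    if total <= max_cache_bytes:
        return [range(0, num_samples)]
    per_chunk = max(1, max_cache_bytes // bytes_per_sample)
    return [range(i, min(i + per_chunk, num_samples))
            for i in range(0, num_samples, per_chunk)]
```

Each chunk would then be collected from the teacher and trained on before the next chunk replaces it in the cache, keeping disk usage bounded.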
