Model Training
The core of this project is the `collect_and_finetune.py` script. The `main` function orchestrates this whole madness, managing data loading, model compatibility checks, dataset preparation, and the actual training loop. Key steps include:
- Initializing paths and models, converting any model that is not in `.safetensors` format to it.
- Ensuring compatibility between the teacher and student models using their vocabulary families.
- Preparing datasets for training and validation.
- Setting model parameters.
- Synchronizing the h5 datasets with the current text dataset by remapping, deleting, and collecting the necessary samples.
- Training the student model, optionally with chunked data collection and training if the calculated size of the h5 dataset exceeds the specified maximum cache size. Validation runs periodically during training.
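
Two of the steps above, the vocabulary-family compatibility check and the decision to chunk data collection, can be illustrated with a minimal sketch. This is not the code from `collect_and_finetune.py`: the names `VOCAB_FAMILIES`, `models_compatible`, `estimate_cache_bytes`, and `plan_chunks` are hypothetical, the model and family names are placeholders, and the size estimate assumes the h5 cache stores one value per token per vocabulary entry, which may not match how the script actually calculates it.

```python
from typing import List

# Hypothetical mapping from model name to vocabulary family. The real
# script may determine the family differently; these entries are placeholders.
VOCAB_FAMILIES = {
    "teacher-a": "family-x",
    "student-a": "family-x",
    "student-b": "family-y",
}


def models_compatible(teacher: str, student: str) -> bool:
    """Return True when both models are known and share a vocabulary family."""
    teacher_family = VOCAB_FAMILIES.get(teacher)
    student_family = VOCAB_FAMILIES.get(student)
    return teacher_family is not None and teacher_family == student_family


def estimate_cache_bytes(num_samples: int, context_len: int,
                         vocab_size: int, bytes_per_value: int = 2) -> int:
    """Estimate cache size, ASSUMING one stored value per token per vocab entry."""
    return num_samples * context_len * vocab_size * bytes_per_value


def plan_chunks(num_samples: int, context_len: int, vocab_size: int,
                max_cache_bytes: int, bytes_per_value: int = 2) -> List[range]:
    """Split sample indices into chunks whose estimated size fits the cache limit.

    When the whole dataset fits, the result is a single chunk, which
    corresponds to the non-chunked path.
    """
    per_sample = estimate_cache_bytes(1, context_len, vocab_size, bytes_per_value)
    if per_sample > max_cache_bytes:
        raise ValueError("a single sample exceeds the maximum cache size")
    samples_per_chunk = max_cache_bytes // per_sample
    return [
        range(start, min(start + samples_per_chunk, num_samples))
        for start in range(0, num_samples, samples_per_chunk)
    ]
```

In this sketch, a training driver would call `models_compatible` before doing any work, then loop over the ranges returned by `plan_chunks`, collecting teacher outputs for one chunk, training on it, and clearing the cache before moving on to the next.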