This repository contains the code for the workshop paper "A Model Zoo of Vision Transformers" for the ICLR 2025 Workshop on Neural Network Weights as a New Data Modality. The paper is available on arXiv: https://arxiv.org/abs/2504.10231. We are currently procuring storage for the dataset and will update this repository as soon as it is available.
The availability of large, structured populations of neural networks — called “model zoos” — has led to the development of a multitude of downstream tasks, ranging from model analysis to representation learning on model weights and generative modeling of neural network parameters. However, existing model zoos are limited in size and architecture and neglect the transformer, which is currently among the most successful neural network architectures. We address this gap by introducing the first model zoo of vision transformers (ViT). To better represent recent training approaches, we develop a new blueprint for model zoo generation that encompasses both pre-training and fine-tuning steps, and publish 250 unique models. They are carefully generated with a large span of generating factors, and their diversity is validated using a thorough choice of weight-space and behavioral metrics. To further motivate the utility of our proposed dataset, we suggest multiple possible applications grounded in both extensive exploratory experiments and a number of examples from the existing literature. By extending previous lines of similar work, our model zoo allows researchers to push their model population-based methods from the small model regime to state-of-the-art architectures.
For this project we use Python 3.10.6. The packages required to run the code are listed in the requirements.txt file. You can install them using the following command:
pip install -r requirements.txt
The dataset is structured as follows:
- dataset
  - pretraining
    - model 1
      - checkpoint_0000XY
      - config.json
      - result.json
    - ...
  - finetuning
    - pretrained model 1
      - model 1
        - checkpoint_0000XY
        - config.json
        - result.json
      - ...
The pretrained models are stored in the pretraining folder; their names include the generating factors used to create them. The finetuned models are stored in the finetuning folder and are grouped by the pretrained model they were finetuned from. Each model directory contains a config.json file with the model configuration and a result.json file with the model performance metrics. Moreover, it includes multiple checkpoints of the model saved during training in the format checkpoint_0000XY/checkpoints.
To instantiate a model, simply instantiate the NNModule class with the config.json file of the model you want to load. To load a checkpoint, load its state dictionary into the model. The following code snippet demonstrates how to load a model and a checkpoint:
import torch
import json
from experiment_runner.def_net import NNModule
# Load the model configuration
config = json.load(open('path/to/config.json'))
model = NNModule(config)
# Load the model checkpoint
checkpoint = torch.load('path/to/checkpoint_0049/checkpoints')
model.model.load_state_dict(checkpoint)
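As a quick sanity check that the weights were loaded, the restored model can be run on a dummy batch. This is a minimal sketch: it assumes the underlying ViT is exposed as model.model and that the input resolution is 32x32 (CIFAR-style); adjust the shape to whatever the loaded configuration specifies.

# Sanity check: forward pass on a dummy batch (the input shape is an assumption).
model.model.eval()
with torch.no_grad():
    dummy = torch.randn(1, 3, 32, 32)  # hypothetical resolution; adapt to the model config
    logits = model.model(dummy)
print(logits.shape)  # expected: (1, model::o_dim)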
To train a model, data preprocessing is required, which is detailed in the section below. After preprocessing the data, models can be trained using Ray Tune. The process is similar to before, but now includes an experiment runner that handles the training process. We include sample scripts and configurations for both pre-training and finetuning of the language and vision models; see the experiments folder.
The data preprocessing is different for the language and vision models. For the language models, the data is tokenized and split into training and validation sets. For the vision models, the data is loaded from the dataset and split into training and validation sets.
We include sample scripts for the data preprocessing in the data_preparation folder. These scripts are used to create dataset dumps that can be loaded into the training scripts. Once a dataset dump is created, it can be loaded automatically by providing the path to the dump in the configuration:
config['dataset::dump'] = 'path/to/dataset/dump'
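For illustration, a dump could be created roughly as follows. This is a minimal sketch, not the data_preparation scripts themselves: it assumes the dump is simply a torch.save of a dictionary with trainset and testset entries, matching how the dump is loaded in the training script further below, and it omits the augmentation and normalization transforms the actual scripts apply.

import torch
from torchvision import transforms
from torchvision.datasets import CIFAR100

# Minimal, hypothetical dump creation; the real preprocessing lives in data_preparation.
transform = transforms.ToTensor()
trainset = CIFAR100(root="data", train=True, download=True, transform=transform)
testset = CIFAR100(root="data", train=False, download=True, transform=transform)

# Assumed dump format: a dict with "trainset" and "testset" keys saved via torch.save.
torch.save({"trainset": trainset, "testset": testset}, "path/to/dataset/dump")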
It is also possible to load the dataset directly in the experiment file and pass a reference to it via Ray's object store. The following code snippet demonstrates how to load the dataset directly in the experiment file:
import ray
from torchvision.datasets import CIFAR100

# Requires an initialized Ray session (ray.init()) before calling ray.put.
trainset = CIFAR100(root="data", train=True, download=True)
testset = CIFAR100(root="data", train=False, download=True)

dataset = {
    "trainset": trainset,
    "testset": testset,
}
dataset_ref = ray.put(dataset)
config["dataset::ref"] = dataset_ref
To train a model from scratch, simply instantiate a configuration file and pass it to the experiment runner. We include sample configurations for the language and vision models in the example_configs folder as well as corresponding experiment runners in the experiments folder.
The following code snippet demonstrates how to train a model from scratch:
import logging
from experiment_runner.def_nn_experiment import NN_tune_trainable
import ray
import json
import torch
from pathlib import Path

PATH_ROOT = Path(".")

def main():
    cpus = 12
    gpus = 1
    cpu_per_trial = 12
    gpu_fraction = ((gpus * 100) // (cpus / cpu_per_trial)) / 100
    resources_per_trial = {"cpu": cpu_per_trial, "gpu": gpu_fraction}

    config = json.load(open("example_configs/config_vit_finetuning.json", "r"))
    config["cuda"] = True if gpus > 0 and torch.cuda.is_available() else False
    config["dataset::dump"] = "path/to/dataset/dump"

    experiment_name = "VIT_FINETUNING"
    net_dir = PATH_ROOT.joinpath(experiment_name)
    try:
        net_dir.mkdir(parents=True, exist_ok=False)
    except FileExistsError:
        pass
    print(f"Saving models to {net_dir.absolute()}")

    context = ray.init(
        num_cpus=cpus,
        num_gpus=gpus,
    )
    assert ray.is_initialized()

    if config.get("dataset::dump", None) is not None:
        print(f"loading dataset from {config['dataset::dump']}")
        dataset = torch.load(config["dataset::dump"])
        dataset_ref = ray.put(dataset)
        config["dataset::ref"] = dataset_ref

    experiment = ray.tune.Experiment(
        name=experiment_name,
        run=NN_tune_trainable,
        stop={
            "training_iteration": config["training::epochs_train"]
        },
        checkpoint_config=ray.air.CheckpointConfig(
            num_to_keep=None,
            checkpoint_frequency=config.get("training::output_epoch", 1),
            checkpoint_at_end=True,
        ),
        config=config,
        local_dir=net_dir.absolute(),
        resources_per_trial=resources_per_trial,
    )
    ray.tune.run_experiments(
        experiments=experiment,
        resume=False,
        reuse_actors=False,
        verbose=3,
    )

    ray.shutdown()

if __name__ == "__main__":
    main()
To finetune from a pretrained model, the path to the pretrained model checkpoint must be provided in the configuration. The following code snippet demonstrates how to finetune a model:
config['pretrained::model::path'] = 'path/to/pretrained/model/checkpoint_0000XY/checkpoints'
This will automatically load the pretrained model checkpoint into the model before training and replace the classification head for finetuning. The training process is the same as before, but now the model is loaded with the pretrained weights.
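For example, finetuning from one of the published pretrained models could be configured as follows. The path is a placeholder: concrete directory names in the zoo encode the generating factors, and checkpoint_0000XY stands for a specific checkpoint.

# Hypothetical path into the downloaded zoo; replace with an actual model directory.
config["pretrained::model::path"] = "dataset/pretraining/model_1/checkpoint_0000XY/checkpoints"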
To generate populations of models, a grid of hyperparameters must be defined. The grid is defined in the configuration using tune.grid_search(). The following code snippet demonstrates how to define a grid:
from ray import tune

config['optim::lr'] = tune.grid_search([3e-3, 1e-3, 1e-4])
config['optim::weight_decay'] = tune.grid_search([1e-3, 0])
Ray Tune will generate a model for each combination of hyperparameters in the grid. The population can be generated by running the experiment runner with the grid configuration.
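For instance, extending the grid above with the seed key (documented in the configuration reference below) yields a larger population; the sketch below would produce 3 x 2 x 3 = 18 models:

config['optim::lr'] = tune.grid_search([3e-3, 1e-3, 1e-4])
config['optim::weight_decay'] = tune.grid_search([1e-3, 0])
config['seed'] = tune.grid_search([1, 2, 3])  # one model per seed and hyperparameter combination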
Please refer to the example_configs folder for sample configurations for vision models. The configurations are used to define the hyperparameters for training and evaluation of the models.
Example configurations for the vision models are defined in the config_cl_pretraining.json, config_sl_pretraining.json, and config_vit_finetuning.json files. The configuration includes the following parameters:
- model::name: The name of the model to be trained.
- model::o_dim: The output dimension of the model.
- model::type: The type of the model, either vit_s_16 for Vision Transformer or bert for BERT.
- dataset::name: The name of the dataset to be used for training.
- dataset::dump: The path to the dataset dump.
- optim::optimizer: The optimizer to use for training.
- optim::lr: The learning rate for training.
- optim::weight_decay: The weight decay for training.
- validation::loss: The loss function to use for validation, if different from the training loss.
- training::loss: The loss function to use for training.
- training::batchsize: The batch size for training.
- training::epochs_train: The number of epochs to train the model.
- training::output_epoch: The frequency of saving checkpoints during training.
- training::mode: The training mode, either cl for contrastive learning or sl for supervised learning.
- training::mixup: Mixup ratio to use for training, None if not used.
- training::cutmix: Cutmix ratio to use for training, None if not used.
- training::randerase: Random erasing ratio to use for training, None if not used.
- trainloader::num_workers: The number of workers for data loading.
- cuda: Whether to use CUDA for training.
- seed: The random seed for training.
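For orientation, a pretraining-style configuration built from these keys might look roughly like the following after json.load(). All values are purely illustrative assumptions; the files in example_configs are the authoritative reference for the exact names and value formats.

# Illustrative only: key names follow the list above, values are assumptions.
config = {
    "model::name": "vit_s_16",
    "model::type": "vit_s_16",
    "model::o_dim": 100,
    "dataset::name": "CIFAR100",
    "dataset::dump": "path/to/dataset/dump",
    "optim::optimizer": "adamw",
    "optim::lr": 1e-4,
    "optim::weight_decay": 1e-3,
    "training::loss": "CrossEntropyLoss",
    "validation::loss": None,
    "training::batchsize": 128,
    "training::epochs_train": 50,
    "training::output_epoch": 5,
    "training::mode": "sl",
    "training::mixup": None,
    "training::cutmix": None,
    "training::randerase": None,
    "trainloader::num_workers": 4,
    "cuda": True,
    "seed": 42,
}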
For finetuning, the configuration is defined in the config_vit_finetuning.json file. The configuration includes the same parameters as the pretraining configuration. If the model is finetuned from a pretrained model, the path to the pretrained model checkpoint must be provided in the configuration:
- pretrained::model::path: The path to the pretrained model checkpoint.
If you use the code or model zoo, please consider citing our paper:
@misc{falk2025modelzoovisiontransformers,
title={A Model Zoo of Vision Transformers},
author={Damian Falk and Léo Meynent and Florence Pfammatter and Konstantin Schürholt and Damian Borth},
year={2025},
eprint={2504.10231},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.10231},
}