
ScaleKD: Strong Vision Transformers Could Be Excellent Teachers


By Jiawei Fan, Chao Li, Xiaolong Liu and Anbang Yao.

This repository is the official PyTorch implementation of ScaleKD: Strong Vision Transformers Could Be Excellent Teachers published in NeurIPS 2024.

Figure: Overview of the three core components of ScaleKD: (a) cross attention projector, (b) dual-view feature mimicking, and (c) teacher parameter perception. Note that the teacher model is frozen during distillation, and the student model requires no modification at inference.
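To make component (a) concrete, below is a minimal PyTorch sketch of a cross attention projector: student features act as queries over the frozen teacher's tokens, so the two can be compared in the teacher's representation space. The class name, dimensions, and residual design are illustrative assumptions, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionProjector(nn.Module):
    """Illustrative sketch: project student features into the teacher's
    feature space with cross attention (queries from student tokens,
    keys/values from frozen teacher tokens)."""

    def __init__(self, student_dim: int, teacher_dim: int, num_heads: int = 8):
        super().__init__()
        self.to_teacher_dim = nn.Linear(student_dim, teacher_dim)
        self.attn = nn.MultiheadAttention(teacher_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(teacher_dim)

    def forward(self, student_tokens, teacher_tokens):
        # student_tokens: (B, N_s, student_dim); teacher_tokens: (B, N_t, teacher_dim)
        q = self.to_teacher_dim(student_tokens)
        out, _ = self.attn(query=q, key=teacher_tokens, value=teacher_tokens)
        return self.norm(out + q)  # residual keeps the student's own signal

# Toy usage: align flattened ResNet-50 stage features with Swin-L tokens.
proj = CrossAttentionProjector(student_dim=2048, teacher_dim=1536)
s = torch.randn(2, 49, 2048)  # e.g., a 7x7 CNN feature map, flattened
t = torch.randn(2, 49, 1536)  # teacher tokens at a matching stage
aligned = proj(s, t)          # (2, 49, 1536)
```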

Update News

Stay tuned: we are preparing to release the training code for downstream tasks.

  • 2025/01/24: We release the training scripts and more distilled models on ImageNet-1K.
  • 2024/11/11: We release two distilled models, ViT-B/16 and ResNet-50.
  • 2024/11/10: We release the project of ScaleKD, containing our very basic training and evaluation code.

Introduction

In this paper, we question whether well pre-trained vision transformer (ViT) models could be used as teachers that exhibit scalable properties to advance cross-architecture knowledge distillation research, in the context of adopting mainstream large-scale visual recognition datasets for evaluation. To make this possible, our analysis underlines the importance of seeking effective strategies to align (1) feature computing paradigm differences, (2) model scale differences, and (3) knowledge density differences. By combining three closely coupled components, namely the cross attention projector, dual-view feature mimicking, and teacher parameter perception, tailored to address the alignment problems stated above, we present a simple and effective knowledge distillation method, called ScaleKD.

Our method can train student backbones that span a variety of convolutional neural network (CNN), multi-layer perceptron (MLP), and ViT architectures on image classification datasets, achieving state-of-the-art knowledge distillation performance. For instance, taking a well pre-trained Swin-L as the teacher model, our method attains 75.15%|82.03%|84.16%|78.63%|81.96%|83.93%|83.80%|85.53% top-1 accuracies for MobileNet-V1|ResNet-50|ConvNeXt-T|Mixer-S/16|Mixer-B/16|ViT-S/16|Swin-T|ViT-B/16 models trained on the ImageNet-1K dataset from scratch, showing 3.05%|3.39%|2.02%|4.61%|5.52%|4.03%|2.62%|3.73% absolute gains over the individually trained counterparts.

Intriguingly, when scaling up the size of teacher models or their pre-training datasets, our method showcases the desired scalable properties, bringing increasingly larger gains to student models. We also empirically show that the student backbones trained by our method transfer well to the downstream MS-COCO and ADE20K datasets. More importantly, if a strong pre-trained ViT is available, our method could be used as a more efficient alternative to the time-intensive pre-training paradigm for any target student model on large-scale datasets, reducing the number of viewed training samples by up to 195×.
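As a rough illustration of component (b), the sketch below matches student and teacher features in two views of the same tensor: the original feature space and a second, transformed space. The frequency-domain view (FFT magnitude) and the alpha weight used here are illustrative stand-ins, not the paper's exact formulation; the student features are assumed to be already projected to the teacher's dimension.

```python
import torch
import torch.nn.functional as F

def dual_view_mimicking_loss(student_feat, teacher_feat, alpha=1.0):
    """Illustrative dual-view feature mimicking: combine a direct
    feature-matching term with a second term computed in a transformed
    (here, frequency-domain) view of the same features."""
    # View 1: direct feature mimicking in the original space.
    loss_direct = F.mse_loss(student_feat, teacher_feat)
    # View 2: match the features under a frequency-domain transform.
    s_freq = torch.fft.rfft(student_feat, dim=-1).abs()
    t_freq = torch.fft.rfft(teacher_feat, dim=-1).abs()
    loss_alt = F.mse_loss(s_freq, t_freq)
    return loss_direct + alpha * loss_alt
```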

Table of Contents

Requirement and Dataset

Environment

  • Python 3.8 (Anaconda is recommended)
  • CUDA 11.1
  • PyTorch 1.10.1
  • Torchvision 0.11.2
# create conda environment
conda create -n openmmlab python=3.8
# enter the environment
conda activate openmmlab
# install packages
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/cu111/torch_stable.html
pip install -r requirements.txt

Note that using PyTorch builds with higher CUDA versions may result in slower training. A quick check of the installed versions is sketched below.
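As a sanity check that the installed stack matches the versions above, a few lines of Python suffice (illustrative; any equivalent check works):

```python
import torch
import torchvision

print(torch.__version__)          # expect 1.10.1+cu111
print(torchvision.__version__)    # expect 0.11.2+cu111
print(torch.version.cuda)         # expect 11.1
print(torch.cuda.is_available())  # should be True on a GPU machine
```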

Prepare datasets
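The configs in this repository follow the standard mmpretrain (mmcls) ImageNet-1K layout. As a rough guide (the directory and file names below follow the usual mmpretrain convention and are assumptions; adjust the paths to match the configs you use):

```
data/imagenet/
├── train/          # one sub-folder of images per class, e.g. n01440764/
├── val/
└── meta/
    ├── train.txt   # "relative/image/path class_index" per line
    └── val.txt
```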

How to apply ScaleKD to various teacher-student network pairs

Basically, we perform our experiments with two different training strategies.

Training with traditional training strategy

  • The experiments based on the traditional training strategy are performed on 8 GPUs on a single node.

  • Training configurations for various teacher-student network pairs are in the folder configs/distillers/traditional_training_strategy/

  • Run distillation with the following command:

      bash tools/dist_train.sh $CONFIG_PATH $NUM_GPU
    
  • For example, to run swin-s_distill_res50_img_s3_s4.py on 8 GPUs:

    bash tools/dist_train.sh configs/distillers/traditional_training_strategy/swin-s_distill_res50_img_s3_s4.py 8
    

Training with advanced training strategy

  • The experiments based on the advanced training strategy are performed on 32 GPUs across 4 nodes (8 GPUs per node).
  • Training configurations for various teacher-student network pairs are in the folder configs/distillers/advanced_training_strategy/
  • Run distillation with the following command:
      bash run.sh $CONFIG_PATH $NUM_GPU $NODE_RANK

  • For example, to run swin-l_distill_res50_img_s3_s4.py on 32 GPUs across 4 nodes (8 GPUs per node):
    # Node 1
    bash run.sh configs/distillers/advanced_training_strategy/swin-l_distill_res50_img_s3_s4.py 8 0
    # Node 2
    bash run.sh configs/distillers/advanced_training_strategy/swin-l_distill_res50_img_s3_s4.py 8 1
    # Node 3
    bash run.sh configs/distillers/advanced_training_strategy/swin-l_distill_res50_img_s3_s4.py 8 2
    # Node 4
    bash run.sh configs/distillers/advanced_training_strategy/swin-l_distill_res50_img_s3_s4.py 8 3
    
  • If you want to run these experiments on a single node, please adjust the batch size or learning rate accordingly (see the scaling sketch below), and then use a command similar to the one above:

      bash tools/dist_train.sh $CONFIG_PATH $NUM_GPU
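When porting the 32-GPU recipes to fewer GPUs, a common heuristic is the linear scaling rule: keep the per-GPU batch size and scale the learning rate in proportion to the total batch size. The base numbers below are hypothetical placeholders; take the real values from the config you are adapting.

```python
# Linear scaling rule sketch (hypothetical numbers; read the actual base
# values from the config file you are adapting).
base_gpus, base_lr = 32, 4e-3      # setup the recipe was tuned for
target_gpus = 8                    # e.g., a single 8-GPU node
scaled_lr = base_lr * target_gpus / base_gpus
print(f"scaled learning rate: {scaled_lr}")  # 0.001 in this toy example
```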

Model zoo of distilled models

We also provide several state-of-the-art models trained with our ScaleKD.

| Model | Teacher | Distillation Configuration | Epochs | Top-1 (%) | Weight |
|---|---|---|---|---|---|
| Swin-T | Swin-L | configs/distillers/advanced_training_strategy/swin-l_distill_swin-t_img_s3_s4.py | 300 | 83.80 | Google Drive |
| ViT-S/16 | Swin-L | configs/distillers/advanced_training_strategy/swin-l_distill_deit-s_img_s3_s4.py | 300 | 83.93 | Google Drive |
| ViT-B/16 | Swin-L | configs/distillers/advanced_training_strategy/swin-l_distill_deit-b_img_s3_s4.py | 300 | 85.53 | Google Drive |
| ResNet-50 | Swin-L | configs/distillers/advanced_training_strategy/swin-l_distill_res50_img_s3_s4.py | 300/600 | 82.03/82.55 | Google Drive |
| ConvNeXt-T | Swin-L | configs/distillers/advanced_training_strategy/swin-l_distill_convnext-t_img_s3_s4.py | 300 | 84.16 | Google Drive |
| Mixer-S/16 | Swin-L | configs/distillers/advanced_training_strategy/swin-l_distill_mixer-s_img_s3_s4.py | 300 | 78.63 | Google Drive |
| Mixer-B/16 | Swin-L | configs/distillers/advanced_training_strategy/swin-l_distill_mixer-b_img_s3_s4.py | 300 | 81.96 | Google Drive |
| ViT-B/16 | BEiT-L | configs/distillers/advanced_training_strategy/beit-l_distill_deit-b_img_s3_s4.py | 300 | 86.43 | Google Drive |
| ResNet-50 | BEiT-L | configs/distillers/advanced_training_strategy/beit-l_distill_res50_img_s3_s4.py | 300 | 82.34 | Google Drive |
| Mixer-B/16 | BEiT-L | configs/distillers/advanced_training_strategy/beit-l_distill_mixer-b_img_s3_s4.py | 300 | 82.89 | Google Drive |

Testing the distilled models

  • Please use the following command to test the performance of the distilled models:
     bash tools/dist_test.sh $CONFIG_PATH $CKPT_PATH 8 --metrics accuracy
    
  • If you wish to test the originally saved distillation checkpoint, please use the same configuration as in training. If your checkpoint has already been converted to the student format, please use the corresponding baseline (student-only) configuration. A converted checkpoint can also be loaded directly for inference, as sketched below.
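Once a checkpoint has been converted to the student format (see the next section), it can be loaded for standalone inference through mmcls's Python API. The config and file paths in this sketch are illustrative placeholders:

```python
from mmcls.apis import inference_model, init_model

# Both paths below are illustrative placeholders.
config = 'configs/resnet/resnet50_8xb32_in1k.py'  # baseline (student-only) config
checkpoint = 'work_dirs/res50_student.pth'        # converted student weights

model = init_model(config, checkpoint, device='cuda:0')
result = inference_model(model, 'demo/demo.JPEG')
print(result['pred_class'], result['pred_score'])
```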

Obtaining the student weight

Convert the distillation checkpoint into an mmcls (mmpretrain) model:

python pth_transfer.py --dis_path $CKPT_PATH --output_path $NEW_CKPT_PATH
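Under the hood, a conversion like this typically just extracts the student sub-network's weights from the distiller checkpoint. The sketch below shows the general idea; the 'student.' key prefix and file names are assumptions, so check pth_transfer.py for the exact behavior:

```python
import torch

# Hypothetical sketch of a distiller-to-student checkpoint conversion.
ckpt = torch.load('distiller_checkpoint.pth', map_location='cpu')
state = ckpt.get('state_dict', ckpt)

prefix = 'student.'  # assumed key prefix; inspect state.keys() to confirm
student_state = {k[len(prefix):]: v for k, v in state.items()
                 if k.startswith(prefix)}

torch.save({'state_dict': student_state}, 'student_only.pth')
```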

Main Results

Results on ImageNet-1K

Transfer learning on Downstream tasks

Object Detection and Instance Segmentation on MS-COCO

Semantic Segmentation on ADE20K

Citation

@inproceedings{fan2024scalekd,
  title={ScaleKD: Strong Vision Transformers Could Be Excellent Teachers},
  author={Fan, Jiawei and Li, Chao and Liu, Xiaolong and Yao, Anbang},
  booktitle={Thirty-eighth Conference on Neural Information Processing Systems},
  year={2024}
}

License

ScaleKD is released under the Apache license. We encourage use for both research and commercial purposes, as long as proper attribution is given.

Acknowledgement

This repository is built on the mmpretrain and cls_KD repositories. We thank the authors of both repositories for releasing their code.
