- (arXiv 2022.04) DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers, [Paper]
- (arXiv 2022.05) Knowledge Distillation via the Target-aware Transformer, [Paper]
- (arXiv 2022.05) Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation, [Paper], [Code]
- (arXiv 2022.07) Self-Distilled Vision Transformer for Domain Generalization, [Paper], [Code]
- (arXiv 2022.09) ViTKD: Practical Guidelines for ViT feature knowledge distillation, [Paper], [Code]
- (arXiv 2022.10) Self-Distillation for Further Pre-training of Transformers, [Paper]
- (arXiv 2022.11) Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling, [Paper]
- (arXiv 2022.11) D3ETR: Decoder Distillation for Detection Transformer, [Paper]
- (arXiv 2022.11) DETRDistill: A Universal Knowledge Distillation Framework for DETR-families, [Paper]
- (arXiv 2022.12) Autoencoders as Cross-Modal Teachers: Can Pretrained 2D Image Transformers Help 3D Representation Learning?, [Paper], [Code]
- (arXiv 2022.12) OVO: One-shot Vision Transformer Search with Online distillation, [Paper]
- (arXiv 2023.02) Knowledge Distillation in Vision Transformers: A Critical Review, [Paper]
- (arXiv 2023.02) MaskedKD: Efficient Distillation of Vision Transformers with Masked Images, [Paper]
- (arXiv 2023.03) Multi-view knowledge distillation transformer for human action recognition, [Paper]
- (arXiv 2023.03) Supervised Masked Knowledge Distillation for Few-Shot Transformers, [Paper], [Code]
- (arXiv 2023.05) Vision Transformers for Small Histological Datasets Learned through Knowledge Distillation, [Paper]
- (arXiv 2023.05) Are Large Kernels Better Teachers than Transformers for ConvNets?, [Paper], [Code]
- (arXiv 2023.07) Cumulative Spatial Knowledge Distillation for Vision Transformers, [Paper]
- (arXiv 2023.10) CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction, [Paper], [Code]
- (arXiv 2023.10) Distilling Efficient Vision Transformers from CNNs for Semantic Segmentation, [Paper], [Code]
- (arXiv 2023.10) One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation, [Paper], [Code]
- (arXiv 2023.11) Learning Contrastive Self-Distillation for Ultra-Fine-Grained Visual Categorization Targeting Limited Samples, [Paper]
- (arXiv 2023.12) GIST: Improving Parameter Efficient Fine Tuning via Knowledge Interaction, [Paper]
- (arXiv 2024.02) m2mKD: Module-to-Module Knowledge Distillation for Modular Transformers, [Paper], [Code]
- (arXiv 2024.04) Correlation-Decoupled Knowledge Distillation for Multimodal Sentiment Analysis with Incomplete Modalities, [Paper]
- (arXiv 2024.07) Towards Optimal Trade-offs in Knowledge Distillation for CNNs and Vision Transformers at the Edge, [Paper]
- (arXiv 2024.07) Continual Distillation Learning, [Paper], [Code]
- (arXiv 2024.08) Optimizing Vision Transformers with Data-Free Knowledge Transfer, [Paper]
- (arXiv 2024.08) Adaptive Knowledge Distillation for Classification of Hand Images using Explainable Vision Transformers, [Paper]
- (arXiv 2024.11) ScaleKD: Strong Vision Transformers Could Be Excellent Teachers, [Paper], [Code]
- (arXiv 2025.02) Optimizing Knowledge Distillation in Transformers: Enabling Multi-Head Attention without Alignment Barriers, [Paper]