- A Survey on Model Compression for Large Language Models. Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang. [Paper]
- The Efficiency Spectrum of Large Language Models: An Algorithmic Survey. Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang. [Paper][Github]
- Efficient Large Language Models: A Survey. Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, Mi Zhang. [Paper][Github]
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems. Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia. [Paper]
- Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models. Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao. [Paper][Github]
- A Survey of Resource-efficient LLM and Multimodal Foundation Models. Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu. [Paper][Github]
- A Survey on Hardware Accelerators for Large Language Models. Christoforos Kachris. [Paper]
- Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security. Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, Rui Kong, Yile Wang, Hanfei Geng, Jian Luan, Xuefeng Jin, Zilong Ye, Guanjing Xiong, Fan Zhang, Xiang Li, Mengwei Xu, Zhijun Li, Peng Li, Yang Liu, Ya-Qin Zhang, Yunxin Liu. [Paper][Github]
- A Comprehensive Survey of Compression Algorithms for Language Models. Seungcheol Park, Jaehyeon Choi, Sojin Lee, U Kang. [Paper]
- A Survey on Transformer Compression. Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao. [Paper]
- Model Compression and Efficient Inference for Large Language Models: A Survey. Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He. [Paper]
- A Survey on Knowledge Distillation of Large Language Models. Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, Tianyi Zhou. [Paper][Github]
- Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding. Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui. [Paper][Github][Blog]
- Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward. Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta. [Paper][Github]
- Efficient Prompting Methods for Large Language Models: A Survey. Kaiyan Chang, Songcheng Xu, Chenglong Wang, Yingfeng Luo, Tong Xiao, Jingbo Zhu. [Paper]
- A Survey on Efficient Inference for Large Language Models. Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang. [Paper]
- A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models. Mahsa Khoshnoodi, Vinija Jain, Mingye Gao, Malavika Srikanth, Aman Chadha. [Paper]
- Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference. Christopher Wolters, Xiaoxuan Yang, Ulf Schlichtmann, Toyotaro Suzumura. [Paper]
- Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application. Chuanpeng Yang, Wang Lu, Yao Zhu, Yidong Wang, Qian Chen, Chenlong Gao, Bingjie Yan, Yiqiang Chen. [Paper]
- Inference Optimization of Foundation Models on AI Accelerators. Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis. [Paper]
- A Survey on Symbolic Knowledge Distillation of Large Language Models. Kamal Acharya, Alvaro Velasquez, Houbing Herbert Song. [Paper]
- Hardware Acceleration of LLMs: A Comprehensive Survey and Comparison. Nikoletta Koilia, Christoforos Kachris. [Paper]
- Art and Science of Quantizing Large-Scale Models: A Comprehensive Overview. Yanshu Wang, Tong Yang, Xiyan Liang, Guoan Wang, Hanning Lu, Xu Zhe, Yaoming Li, Li Weitao. [Paper]
- Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey. Sourav Verma. [Paper][Github]
- A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms. Ruihao Gong, Yifu Ding, Zining Wang, Chengtao Lv, Xingyu Zheng, Jinyang Du, Haotong Qin, Jinyang Guo, Michele Magno, Xianglong Liu. [Paper]
- Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective. Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Guohao Dai. [Paper]
- Prompt Compression for Large Language Models: A Survey. Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier. [Paper][Github]
- LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators. Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus et al. [Paper][Github]
- Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding. Hyun Ryu, Eric Kim. [Paper]