
An Adaptive Parallel Layer-Skipping Framework for Large Language Model Inference Speedup With Speculative Decoding

ZHE WEN, LIANG XU, MEIQI WANG

Integrated Circuits and Systems ›› 2025, Vol. 2 ›› Issue (2) : 58-66. DOI: 10.23919/ICS.2025.3575371
Co-Optimization for Large Language Models: Advances in Algorithm and Hardware


Author information

ZHE WEN (ORCID: 0009-0009-0973-1372), LIANG XU, and MEIQI WANG (ORCID: 0000-0001-9553-3640)
School of Integrated Circuits, Sun Yat-sen University, Shenzhen 518107, China
MEIQI WANG (e-mail: wangmq53@mail.sysu.edu.cn).
Zhe Wen and Liang Xu contributed equally to this work.

History

Received: 2024-12-31; Revised: 2025-03-31; Accepted: 2025-05-23; First published: 2025-10-22


Abstract

In recent years, the exponential growth in Large Language Model (LLM) parameter sizes has significantly increased computational complexity, with inference latency emerging as a prominent challenge. The primary bottleneck lies in the token-by-token prediction process during autoregressive decoding, resulting in substantial delays. Therefore, enhancing decoding efficiency while maintaining accuracy has become a critical research objective. This paper proposes an Adaptive Parallel Layer-Skipping Speculative Decoding (APLS) method, which leverages speculative decoding techniques by employing a Small-Scale Model (SSM) for preliminary inference and validating the predictions using the original LLM. This approach effectively balances the high precision of LLMs with the efficiency of SSMs. Notably, our SSM does not require additional training but is instead derived through a simplification of the original large-scale model. By incorporating parallelization and a layer-skipping structure, the inference process dynamically bypasses certain redundant transformation layers, significantly improving GPU utilization and inference speed without compromising performance. Furthermore, to address challenges such as window size limitations and memory fragmentation in long-text processing, this paper introduces progressive layer reduction and key-value cache deletion techniques to further optimize the performance of SSMs. Experimental results demonstrate that the proposed method achieves a 2.51× improvement in efficiency during autoregressive decoding. As this approach eliminates the need for additional training of the SSM, it offers a significant competitive advantage in high-cost model compression environments.
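
For readers unfamiliar with the draft-and-verify pattern the abstract builds on, a minimal greedy-decoding sketch follows. It is illustrative only, not the paper's APLS implementation: draft_model, target_model, gamma, and max_new_tokens are hypothetical stand-ins, and in the paper's setting the draft model would be a layer-skipped simplification of the target LLM itself rather than a separately trained model.

# Minimal sketch of the greedy draft-and-verify loop underlying
# speculative decoding (illustrative only; NOT the APLS implementation).
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # token prefix -> greedy next token id

def speculative_decode(prefix: List[int], draft_model: NextToken,
                       target_model: NextToken, gamma: int = 4,
                       max_new_tokens: int = 64) -> List[int]:
    tokens = list(prefix)
    while len(tokens) - len(prefix) < max_new_tokens:
        # 1) Draft: the cheap model proposes gamma tokens autoregressively.
        draft = []
        for _ in range(gamma):
            draft.append(draft_model(tokens + draft))
        # 2) Verify: the large model checks each drafted position. In a real
        #    system all gamma positions are scored in one parallel forward
        #    pass, which is where the speedup over token-by-token decoding
        #    comes from.
        accepted_all = True
        for i in range(gamma):
            target_tok = target_model(tokens + draft[:i])
            if target_tok != draft[i]:
                # 3) First mismatch: keep the target model's token, discard
                #    the rest of the draft, and start a new draft round.
                tokens.append(target_tok)
                accepted_all = False
                break
            tokens.append(draft[i])
        if accepted_all:
            # All gamma draft tokens accepted; the verification pass also
            # yields one extra target-model token for free.
            tokens.append(target_model(tokens))
    return tokens[: len(prefix) + max_new_tokens]

Under greedy decoding this loop reproduces the target model's output exactly, token for token; with sampling, acceptance instead follows the probabilistic rejection rule of [14], [15], but the control flow is unchanged.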

Key words

fuzzy inference / inference decoding / KV Cache optimization / LLM optimization acceleration


Cite this article

ZHE WEN, LIANG XU, MEIQI WANG. An Adaptive Parallel Layer-Skipping Framework for Large Language Model Inference Speedup With Speculative Decoding[J]. Integrated Circuits and Systems, 2025, 2(2): 58-66. https://doi.org/10.23919/ICS.2025.3575371

References

[1] H. Touvron, L. Martin, K. Stone, et al., "Llama 2: Open foundation and fine-tuned chat models," Jul. 2023.
[2] W.-L. Chiang et al., "Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality," Mar. 2023.
[3] OpenAI et al., "GPT-4 technical report," 2024.
[4] X. Liang et al., "Controllable text generation for large language models: A survey," 2024.
[5] Z. Feng et al., "TEaR: Improving LLM-based machine translation with systematic self-refinement," in Proc. Findings Assoc. Comput. Linguist.: NAACL 2025, L. Chiruzzo, A. Ritter, and L. Wang, Eds., Albuquerque, New Mexico, Apr. 2025, pp. 3922-3938.
[6] D. Allemang and J. Sequeda, "Increasing the accuracy of LLM question-answering systems with ontologies," in Proc. Semantic Web - ISWC 2024: 23rd Int. Semantic Web Conf., Baltimore, MD, USA, Nov. 11-15, 2024, pp. 324-339.
[7] X. Ma, G. Fang, and X. Wang, "LLM-Pruner: On the structural pruning of large language models," in Proc. Adv. Neural Inf. Process. Syst., A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., 2023, vol. 36, pp. 21702-21720.
[8] Y. Yang, Z. Cao, and H. Zhao, "LaCo: Large language model pruning via layer collapse," Feb. 2024.
[9] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, "LLM.int8(): 8-bit matrix multiplication for transformers at scale," in Proc. 36th Int. Conf. Neural Inf. Process. Syst., Red Hook, NY, USA, Nov. 2022, pp. 30318-30332.
[10] Z. Liu et al., "LLM-QAT: Data-free quantization aware training for large language models," May 2023.
[11] H. Wang et al., "BitNet: Scaling 1-bit transformers for large language models," Oct. 2023.
[12] X. Xu et al., "A survey on knowledge distillation of large language models," Oct. 2024.
[13] S. Sun, W. Ren, J. Li, R. Wang, and X. Cao, "Logit standardization in knowledge distillation," in Proc. 2024 IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 15731-15740.
[14] C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, "Accelerating large language model decoding with speculative sampling," 2023.
[15] Y. Leviathan, M. Kalman, and Y. Matias, "Fast inference from transformers via speculative decoding," in Proc. Int. Conf. Mach. Learn., 2023, pp. 19274-19286.
[16] X. Miao et al., "Accelerating generative large language model serving with speculative inference and token tree verification," in Proc. Int. Conf. Architectural Support Program. Lang. Operat. Syst. (ASPLOS), 2024, vol. 3, pp. 932-949.
[17] Y. Zhou et al., "DistillSpec: Improving speculative decoding via knowledge distillation," in Proc. 12th Int. Conf. Learn. Representations, 2024.
[18] T. Cai et al., "MEDUSA: Simple LLM inference acceleration framework with multiple decoding heads," in Proc. 41st Int. Conf. Mach. Learn., 2024.
[19] J. Zhang et al., "Draft & Verify: Lossless large language model acceleration via self-speculative decoding," in Proc. 62nd Annu. Meeting Assoc. Comput. Linguist., L.-W. Ku, A. Martins, and V. Srikumar, Eds., Bangkok, Thailand, Aug. 2024, vol. 1, pp. 11263-11282.
[20] Z. He, Z. Zhong, T. Cai, J. D. Lee, and D. He, "REST: Retrieval-based speculative decoding," in Proc. 2024 Conf. North Amer. Chapt. Assoc. Comput. Linguist.: Hum. Lang. Technol., K. Duh, H. Gomez, and S. Bethard, Eds., Mexico City, Mexico, Jun. 2024, vol. 1, pp. 1582-1595.
[21] Y. Fu, P. Bailis, I. Stoica, and H. Zhang, "Break the sequential dependency of LLM inference using lookahead decoding," in Proc. Int. Conf. Mach. Learn., 2024.
[22] Y. Tang, F. Yu, W. Pedrycz, X. Yang, J. Wang, and S. Liu, "Building trend fuzzy granulation-based LSTM recurrent neural network for long-term time-series forecasting," IEEE Trans. Fuzzy Syst., vol. 30, no. 6, pp. 1599-1613, 2021.
[23] T. Lei, X. Jia, Y. Zhang, S. Liu, H. Meng, and A. K. Nandi, "Superpixel-based fast fuzzy c-means clustering for color image segmentation," IEEE Trans. Fuzzy Syst., vol. 27, no. 9, pp. 1753-1766, 2018.
[24] J.-S. Jang, "ANFIS: Adaptive-network-based fuzzy inference system," IEEE Trans. Syst., Man, Cybern., vol. 23, no. 3, pp. 665-685, 1993.
[25] G. Selvachandran et al., "A new design of Mamdani complex fuzzy inference system for multiattribute decision making problems," IEEE Trans. Fuzzy Syst., vol. 29, no. 4, pp. 716-730, 2019.
[26] S. Ge, Y. Zhang, L. Liu, M. Zhang, J. Han, and J. Gao, "Model tells you what to discard: Adaptive KV cache compression for LLMs," 2023, arXiv:2310.01801.
[27] Z. Zhang et al., "H2O: Heavy-hitter oracle for efficient generative inference of large language models," in Proc. Adv. Neural Inf. Process. Syst., 2023, vol. 36, pp. 34661-34710.
[28] L. Shi, H. Zhang, Y. Yao, Z. Li, and H. Zhao, "Keep the cost down: A review on methods to optimize LLM's KV-cache consumption," 2024, arXiv:2407.18003.
[29] Z. Liu et al., "Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time," in Proc. Adv. Neural Inf. Process. Syst., 2024, vol. 36.
[30] X. Miao et al., "Accelerating generative large language model serving with tree-based speculative inference and verification," in Proc. Int. Conf. Architectural Support Program. Lang. Operat. Syst. (ASPLOS), 2024, vol. 3, pp. 932-949.
[31] C. Gulcehre et al., "Reinforced self-training (ReST) for language modeling," 2023, arXiv:2308.08998.

Funding

National Natural Science Foundation of China under Grant 62404256; Jiangsu Provincial Science and Technology Major Special Project under Grant BG2024032; Key Project of Shenzhen Basic Research Program under Grant JCYJ20241206180301003; High-performance Computing Public Platform (Shenzhen Campus) of Sun Yat-sen University.
