"}],"affList_en":["School of Integrated Circuits, Sun Yat-sen University, Shenzhen 518107, China"],"article":{"juan":"2","endNoteUrl_en":"https://www.qk.sjtu.edu.cn/ics/EN/article/getTxtFile.do?fileType=EndNote&id=50726","bibtexUrl_cn":"//www.sghhindu.com/www.qk/ics/CN/article/getTxtFile.do?fileType=BibTeX&id=50726","articleType":"A","abstractUrl_en":"https://www.qk.sjtu.edu.cn/ics/EN/10.23919/ICS.2025.3575371","qi":"2","id":50726,"nian":2025,"bianHao":"1761142703382-1781039779","juanUrl_en":"https://www.qk.sjtu.edu.cn/ics/EN/Y2025","shouCiFaBuRiQi":"2025-10-22","qiShiYe":"58","accepted":"2025-05-23","received":"2024-12-31","qiUrl_cn":"https://www.qk.sjtu.edu.cn/ics/CN/Y2025/V2/I2","pdfSize":"3032KB","risUrl_cn":"//www.sghhindu.com/www.qk/ics/CN/article/getTxtFile.do?fileType=Ris&id=50726","doi":"10.23919/ICS.2025.3575371","jieShuYe":"66","keywordList_en":["fuzzy inference","inference decoding","KV Cache optimization","LLM optimization accelerates"],"endNoteUrl_cn":"//www.sghhindu.com/www.qk/ics/CN/article/getTxtFile.do?fileType=EndNote&id=50726","zhaiyao_en":"In recent years, the exponential growth in Large Language Model (LLM) parameter sizes has significantly increased computational complexity, with inference latency emerging as a prominent challenge. The primary bottleneck lies in the token-by-token prediction process during autoregressive decoding, resulting in substantial delays. Therefore, enhancing decoding efficiency while maintaining accuracy has become a critical research objective. This paper proposes an Adaptive Parallel Layer-Skipping Speculative Decoding (APLS) method, which leverages speculative decoding techniques by employing a Small-Scale Model (SSM) for preliminary inference and validating the predictions using the original LLM. This approach effectively balances the high precision of LLMs with the efficiency of SSMs. Notably, our SSM does not require additional training but is instead derived through a simplification of the original large-scale model. By incorporating parallelization and a layer-skipping structure, the inference process dynamically bypasses certain redundant transformation layers, significantly improving GPU utilization and inference speed without compromising performance. Furthermore, to address challenges such as window size limitations and memory fragmentation in long-text processing, this paper introduces progressive layer reduction and key-value cache deletion techniques to further optimize the performance of SSMs. Experimental results demonstrate that the proposed method achieves a 2.51 × improvement in efficiency during autoregressive decoding. As this approach eliminates the need for additional training of SSM, it offers a significant competitive advantage in high-cost model compression environments.
","bibtexUrl_en":"https://www.qk.sjtu.edu.cn/ics/EN/article/getTxtFile.do?fileType=BibTeX&id=50726","abstractUrl_cn":"https://www.qk.sjtu.edu.cn/ics/CN/10.23919/ICS.2025.3575371","juanUrl_cn":"https://www.qk.sjtu.edu.cn/ics/CN/Y2025","lanMu_en":"Co-Optimization for Large Language Models: Advances in Algorithm and Hardware","qiUrl_en":"//www.sghhindu.com/www.qk/ics/EN/Y2025/V2/I2","risUrl_en":"https://www.qk.sjtu.edu.cn/ics/EN/article/getTxtFile.do?fileType=Ris&id=50726","title_en":"An Adaptive Parallel Layer-Skipping Framework for Large Language Model Inference Speedup With Speculative Decoding","revised":"2025-03-31","hasPdf":"true"},"authorList_en":[{"deceased":false,"orcid":"0009-0009-0973-1372","name_cn":"ZHE WEN","xref_en":"","name_en":"ZHE WEN"},{"deceased":false,"name_cn":"LIANG XU","xref_en":"","name_en":"LIANG XU"},{"deceased":false,"orcid":"0000-0001-9553-3640","name_cn":"MEIQIWANG","name_en":"MEIQIWANG"}],"authorList_cn":[{"deceased":false,"orcid":"0009-0009-0973-1372","name_cn":"ZHE WEN","xref_en":"","name_en":"ZHE WEN"},{"deceased":false,"name_cn":"LIANG XU","xref_en":"","name_en":"LIANG XU"},{"deceased":false,"orcid":"0000-0001-9553-3640","name_cn":"MEIQIWANG","name_en":"MEIQIWANG"}],"journal":{"issn":"2995-1968","qiKanWangZhi":"//www.sghhindu.com/www.qk/ics","qiKanMingCheng_CN":"Integrated Circuits and Systems","id":22,"qiKanMingCheng_EN":"Integrated Circuits and Systems"},"authorList":[{"deceased":false,"orcid":"0009-0009-0973-1372","name_cn":"ZHE WEN","xref_en":"","name_en":"ZHE WEN"},{"deceased":false,"name_cn":"LIANG XU","xref_en":"","name_en":"LIANG XU"},{"deceased":false,"orcid":"0000-0001-9553-3640","name_cn":"MEIQIWANG","name_en":"MEIQIWANG"}],"authorNotes_en":["MEIQI WANG (e-mail: wangmq53@mail.sysu.edu.cn).","Zhe Wen and Liang Xu contributed equally to this work.
"],"authorNotesCommon_en":["Zhe Wen and Liang Xu contributed equally to this work.
"],"fundList_en":["National Natural Science Foundation of China under Grant(62404256)","Jiangsu Provincial Science and Technology Major Special Project under Grant(BG2024032)","Key Project of Shenzhen Basic Research Program under Grant(JCYJ20241206180301003)","High-performance Computing Public Platform (Shenzhen Campus) of Sun Yat-sen University"]}">
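The abstract also credits part of the long-text gains to key-value cache deletion. The paper's exact eviction policy is not given on this page, so the sketch below assumes a simple bounded-window policy (keep the earliest n_sink entries plus the most recent n_recent) purely to illustrate how deletion caps cache memory; KVCache and both parameters are hypothetical names, not the paper's data structure.

import torch

class KVCache:
    """Toy per-head KV cache with window-based deletion (assumed policy)."""
    def __init__(self, n_sink=4, n_recent=64):
        self.n_sink, self.n_recent = n_sink, n_recent
        self.k = torch.empty(0, 32)       # [seq_len, head_dim], toy head_dim=32
        self.v = torch.empty(0, 32)

    def append(self, k_new, v_new):
        # Add this step's keys/values, then delete the middle of the sequence
        # so memory stays bounded at n_sink + n_recent entries.
        self.k = torch.cat([self.k, k_new])
        self.v = torch.cat([self.v, v_new])
        if len(self.k) > self.n_sink + self.n_recent:
            keep = torch.cat([
                torch.arange(self.n_sink),                               # earliest tokens
                torch.arange(len(self.k) - self.n_recent, len(self.k)),  # latest tokens
            ])
            self.k, self.v = self.k[keep], self.v[keep]

cache = KVCache(n_sink=2, n_recent=4)
for step in range(10):                    # simulate 10 decode steps
    cache.append(torch.randn(1, 32), torch.randn(1, 32))
print(len(cache.k))                       # -> 6: size stays at n_sink + n_recent

Whatever the deletion rule, the point the abstract makes carries over: bounding the cache avoids the unbounded growth and memory fragmentation that otherwise limit long-text decoding.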