"}],"affList_en":["School of Integrated Circuits, Sun Yat-sen University, Shenzhen 518107, China"],"article":{"juan":"2","endNoteUrl_en":"https://www.qk.sjtu.edu.cn/ics/EN/article/getTxtFile.do?fileType=EndNote&id=50726","bibtexUrl_cn":"//www.sghhindu.com/www.qk/ics/CN/article/getTxtFile.do?fileType=BibTeX&id=50726","articleType":"A","abstractUrl_en":"https://www.qk.sjtu.edu.cn/ics/EN/10.23919/ICS.2025.3575371","qi":"2","id":50726,"nian":2025,"bianHao":"1761142703382-1781039779","juanUrl_en":"https://www.qk.sjtu.edu.cn/ics/EN/Y2025","shouCiFaBuRiQi":"2025-10-22","qiShiYe":"58","accepted":"2025-05-23","received":"2024-12-31","qiUrl_cn":"https://www.qk.sjtu.edu.cn/ics/CN/Y2025/V2/I2","pdfSize":"3032KB","risUrl_cn":"//www.sghhindu.com/www.qk/ics/CN/article/getTxtFile.do?fileType=Ris&id=50726","doi":"10.23919/ICS.2025.3575371","jieShuYe":"66","keywordList_en":["fuzzy inference","inference decoding","KV Cache optimization","LLM optimization accelerates"],"endNoteUrl_cn":"//www.sghhindu.com/www.qk/ics/CN/article/getTxtFile.do?fileType=EndNote&id=50726","zhaiyao_en":"In recent years, the exponential growth in Large Language Model (LLM) parameter sizes has significantly increased computational complexity, with inference latency emerging as a prominent challenge. The primary bottleneck lies in the token-by-token prediction process during autoregressive decoding, resulting in substantial delays. Therefore, enhancing decoding efficiency while maintaining accuracy has become a critical research objective. This paper proposes an Adaptive Parallel Layer-Skipping Speculative Decoding (APLS) method, which leverages speculative decoding techniques by employing a Small-Scale Model (SSM) for preliminary inference and validating the predictions using the original LLM. This approach effectively balances the high precision of LLMs with the efficiency of SSMs. Notably, our SSM does not require additional training but is instead derived through a simplification of the original large-scale model. By incorporating parallelization and a layer-skipping structure, the inference process dynamically bypasses certain redundant transformation layers, significantly improving GPU utilization and inference speed without compromising performance. Furthermore, to address challenges such as window size limitations and memory fragmentation in long-text processing, this paper introduces progressive layer reduction and key-value cache deletion techniques to further optimize the performance of SSMs. Experimental results demonstrate that the proposed method achieves a 2.51 × improvement in efficiency during autoregressive decoding. As this approach eliminates the need for additional training of SSM, it offers a significant competitive advantage in high-cost model compression environments.
","bibtexUrl_en":"https://www.qk.sjtu.edu.cn/ics/EN/article/getTxtFile.do?fileType=BibTeX&id=50726","abstractUrl_cn":"https://www.qk.sjtu.edu.cn/ics/CN/10.23919/ICS.2025.3575371","juanUrl_cn":"https://www.qk.sjtu.edu.cn/ics/CN/Y2025","lanMu_en":"Co-Optimization for Large Language Models: Advances in Algorithm and Hardware","qiUrl_en":"//www.sghhindu.com/www.qk/ics/EN/Y2025/V2/I2","risUrl_en":"https://www.qk.sjtu.edu.cn/ics/EN/article/getTxtFile.do?fileType=Ris&id=50726","title_en":"An Adaptive Parallel Layer-Skipping Framework for Large Language Model Inference Speedup With Speculative Decoding","revised":"2025-03-31","hasPdf":"true"},"authorList_en":[{"deceased":false,"orcid":"0009-0009-0973-1372","name_cn":"ZHE WEN","xref_en":"","name_en":"ZHE WEN"},{"deceased":false,"name_cn":"LIANG XU","xref_en":"","name_en":"LIANG XU"},{"deceased":false,"orcid":"0000-0001-9553-3640","name_cn":"MEIQIWANG","name_en":"MEIQIWANG"}],"authorList_cn":[{"deceased":false,"orcid":"0009-0009-0973-1372","name_cn":"ZHE WEN","xref_en":"","name_en":"ZHE WEN"},{"deceased":false,"name_cn":"LIANG XU","xref_en":"","name_en":"LIANG XU"},{"deceased":false,"orcid":"0000-0001-9553-3640","name_cn":"MEIQIWANG","name_en":"MEIQIWANG"}],"journal":{"issn":"2995-1968","qiKanWangZhi":"//www.sghhindu.com/www.qk/ics","qiKanMingCheng_CN":"Integrated Circuits and Systems","id":22,"qiKanMingCheng_EN":"Integrated Circuits and Systems"},"authorList":[{"deceased":false,"orcid":"0009-0009-0973-1372","name_cn":"ZHE WEN","xref_en":"","name_en":"ZHE WEN"},{"deceased":false,"name_cn":"LIANG XU","xref_en":"","name_en":"LIANG XU"},{"deceased":false,"orcid":"0000-0001-9553-3640","name_cn":"MEIQIWANG","name_en":"MEIQIWANG"}],"authorNotes_en":["MEIQI WANG (e-mail: wangmq53@mail.sysu.edu.cn).","Zhe Wen and Liang Xu contributed equally to this work.
"],"authorNotesCommon_en":["Zhe Wen and Liang Xu contributed equally to this work.
"],"fundList_en":["National Natural Science Foundation of China under Grant(62404256)","Jiangsu Provincial Science and Technology Major Special Project under Grant(BG2024032)","Key Project of Shenzhen Basic Research Program under Grant(JCYJ20241206180301003)","High-performance Computing Public Platform (Shenzhen Campus) of Sun Yat-sen University"]}">
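The abstract also credits part of the long-text gains to key-value cache deletion. The paper's exact eviction policy is not given on this page, so the sketch below assumes a simple bounded-window policy (keep the earliest n_sink entries plus the most recent n_recent) purely to illustrate how deletion caps cache memory; KVCache and both parameters are hypothetical names, not the paper's data structure.

import torch

class KVCache:
    """Toy per-head KV cache with window-based deletion (assumed policy)."""
    def __init__(self, n_sink=4, n_recent=64):
        self.n_sink, self.n_recent = n_sink, n_recent
        self.k = torch.empty(0, 32)       # [seq_len, head_dim], toy head_dim=32
        self.v = torch.empty(0, 32)

    def append(self, k_new, v_new):
        # Add this step's keys/values, then delete the middle of the sequence
        # so memory stays bounded at n_sink + n_recent entries.
        self.k = torch.cat([self.k, k_new])
        self.v = torch.cat([self.v, v_new])
        if len(self.k) > self.n_sink + self.n_recent:
            keep = torch.cat([
                torch.arange(self.n_sink),                               # earliest tokens
                torch.arange(len(self.k) - self.n_recent, len(self.k)),  # latest tokens
            ])
            self.k, self.v = self.k[keep], self.v[keep]

cache = KVCache(n_sink=2, n_recent=4)
for step in range(10):                    # simulate 10 decode steps
    cache.append(torch.randn(1, 32), torch.randn(1, 32))
print(len(cache.k))                       # -> 6: size stays at n_sink + n_recent

Whatever the deletion rule, the point the abstract makes carries over: bounding the cache avoids the unbounded growth and memory fragmentation that otherwise limit long-text decoding.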