Abstract

Transformer-based large language models (LLMs) have made significant strides in the field of artificial intelligence (AI). However, training these LLMs imposes immense demands on computational power and bandwidth for hardware systems. Wafer-scale chips (WSCs) offer a promising solution, yet they struggle with limited on-chip memory and complex tensor partitioning. To fully harness the high-bandwidth, low-latency on-chip interconnect benefits of WSCs and to alleviate the on-chip memory limitations, a specialized mapping and architecture co-exploration method is essential. Despite existing efforts in memory optimization and mapping, current approaches fall short for WSC scenarios.

To bridge this gap, we introduce TMAC, an architecture-mapping co-exploration framework that integrates recomputation into the design space, fully exploiting optimization opportunities overlooked by existing works. Further, TMAC takes advantage of the superior on-chip interconnect performance of WSCs by incorporating a more flexible tensor partition scheme. TMAC then introduces a novel operator-centric encoding scheme (OCES) designed to comprehensively describe the mapping space for training LLMs. Unlike previous studies that focus solely on communication volume analysis based on mapping, TMAC explores the design space by evaluating the combined impact of mapping and architecture on training performance. However, fully accounting for these untapped optimization opportunities increases the complexity of the design space. To address this, we streamline the simulation process, reducing the time needed for exploration. Compared to AccPar, DeepSpeed, and Megatron, TMAC delivers 3.1×, 2.9×, and 1.6× performance gains, respectively. In terms of memory usage, TMAC requires 3.6× and 3.1× less memory than AccPar and DeepSpeed, respectively, and is comparable to Megatron's full recomputation method.

Keywords: Large language models, recomputation, tensor partition, training, wafer-scale chips

Affiliations:
1 School of Integrated Circuits, Tsinghua University, Beijing 100084, China
2 Shanghai AI Laboratory, Shanghai 200003, China
3 International Innovation Center of Tsinghua University, Shanghai 200003, China

Author affiliations: Xu Dai, Liang Liu, and Shenfei Jiang are with affiliation 2; Shouyi Yin is with affiliations 1 and 3; all other authors are with affiliation 1.

Corresponding author: Yang Hu (e-mail: hu_yang@tsinghua.edu.cn).
(Huizheng Wang and Qize Yang contributed equally to this work.)

Article type: Original article
DOI: 10.23919/ICS.2024.3515003
ISSN: 2995-1968
Revised: 2024-12-05; Accepted: 2024-12-05; First published: 2025-01-09
TMAC: Training-Targeted Mapping and Architecture Co-Exploration for Wafer-Scale Chips
HUIZHENG WANG, QIZE YANG, TAIQUAN WEI, XINGMAO YU, CHENGRAN LI, JIAHAO FANG, GUANGYANG LU, XU DAI, LIANG LIU, SHENFEI JIANG, YANG HU, SHOUYI YIN, SHAOJUN WEI
Integrated Circuits and Systems, 2024, Vol. 1, Issue 4: 178-195.