TMAC: Training-Targeted Mapping and Architecture Co-Exploration for Wafer-Scale Chips

Huizheng Wang¹, Qize Yang¹, Taiquan Wei¹, Xingmao Yu¹, Chengran Li¹, Jiahao Fang¹, Guangyang Lu¹, Xu Dai², Liang Liu², Shenfei Jiang², Yang Hu¹, Shouyi Yin¹,³, and Shaojun Wei¹

¹School of Integrated Circuits, Tsinghua University, Beijing 100084, China
²Shanghai AI Laboratory, Shanghai 200003, China
³International Innovation Center of Tsinghua University, Shanghai 200003, China

Corresponding author: Yang Hu (e-mail: hu_yang@tsinghua.edu.cn).
Huizheng Wang and Qize Yang contributed equally to this work.

Integrated Circuits and Systems, vol. 1, no. 4, pp. 178–195, 2024. DOI: 10.23919/ICS.2024.3515003.

Abstract: Transformer-based large language models (LLMs) have made significant strides in the field of artificial intelligence (AI). However, training these LLMs imposes immense demands on the computational power and bandwidth of hardware systems. Wafer-scale chips (WSCs) offer a promising solution, yet they struggle with limited on-chip memory and complex tensor partitioning. To fully harness the high-bandwidth, low-latency on-chip interconnects of WSCs and to alleviate their on-chip memory limitations, a specialized mapping and architecture co-exploration method is essential. Despite existing efforts in memory optimization and mapping, current approaches fall short in WSC scenarios. To bridge this gap, we introduce TMAC, an architecture-mapping co-exploration framework that integrates recomputation into the design space, exploiting optimization opportunities overlooked by existing works. Further, TMAC takes advantage of the superior on-chip interconnect performance of WSCs by incorporating a more flexible tensor-partitioning scheme. TMAC then introduces a novel operator-centric encoding scheme (OCES) designed to comprehensively describe the mapping space for training LLMs. Unlike previous studies that focus solely on analyzing the communication volume implied by a mapping, TMAC explores the design space by evaluating the combined impact of mapping and architecture on training performance. Fully accounting for these untapped optimization opportunities, however, enlarges the design space; to keep exploration tractable, we streamline the simulation process, reducing the time needed for exploration. Compared with AccPar, DeepSpeed, and Megatron, TMAC delivers performance gains of 3.1×, 2.9×, and 1.6×, respectively. In terms of memory usage, TMAC requires 3.6× and 3.1× less memory than AccPar and DeepSpeed, respectively, and is comparable to Megatron's full-recomputation method.

Index Terms: Large language models, recomputation, tensor partition, training, wafer-scale chips.
","bibtexUrl_en":"https://www.qk.sjtu.edu.cn/ics/EN/article/getTxtFile.do?fileType=BibTeX&id=49229","abstractUrl_cn":"https://www.qk.sjtu.edu.cn/ics/CN/10.23919/ICS.2024.3515003","juanUrl_cn":"https://www.qk.sjtu.edu.cn/ics/CN/Y2024","lanMu_en":"Original article","qiUrl_en":"//www.sghhindu.com/www.qk/ics/EN/Y2024/V1/I4","risUrl_en":"https://www.qk.sjtu.edu.cn/ics/EN/article/getTxtFile.do?fileType=Ris&id=49229","title_en":"TMAC: Training-Targeted Mapping and Architecture Co-Exploration for Wafer-Scale Chips","revised":"2024-12-05","hasPdf":"true"},"authorList_en":[{"deceased":false,"xref":"1","orcid":"0000-0002-9763-8208","name_cn":"HUIZHENG WANG","xref_en":"1","name_en":"HUIZHENG WANG"},{"deceased":false,"xref":"1","orcid":"0009-0006-2221-4706","name_cn":"QIZE YANG","xref_en":"1","name_en":"QIZE YANG"},{"deceased":false,"xref":"1","name_cn":"TAIQUAN WEI","xref_en":"1","name_en":"TAIQUAN WEI"},{"deceased":false,"xref":"1","name_cn":"XINGMAO YU","xref_en":"1","name_en":"XINGMAO YU"},{"deceased":false,"xref":"1","name_cn":"CHENGRAN LI","xref_en":"1","name_en":"CHENGRAN LI"},{"deceased":false,"xref":"1","name_cn":"JIAHAO FANG","xref_en":"1","name_en":"JIAHAO FANG"},{"deceased":false,"xref":"1","name_cn":"GUANGYANG LU","xref_en":"1","name_en":"GUANGYANG LU"},{"deceased":false,"xref":"2","name_cn":"XU DAI","xref_en":"2","name_en":"XU DAI"},{"deceased":false,"xref":"2","name_cn":"LIANG LIU","xref_en":"2","name_en":"LIANG LIU"},{"deceased":false,"xref":"2","name_cn":"SHENFEI JIANG","xref_en":"2","name_en":"SHENFEI JIANG"},{"deceased":false,"xref":"1","orcid":"0000-0001-6942-4395","name_cn":"YANG HU","email":"hu_yang@tsinghua.edu.cn","xref_en":"1","name_en":"YANG HU"},{"deceased":false,"xref":"1, 3","orcid":"0000-0003-2309-572X","name_cn":"SHOUYI YIN","xref_en":"1, 3","name_en":"SHOUYI YIN"},{"deceased":false,"xref":"1","orcid":"0000-0001-5117-7920","name_cn":"SHAOJUN WEI","xref_en":"1","name_en":"SHAOJUN WEI"}],"authorList_cn":[{"deceased":false,"xref":"1","orcid":"0000-0002-9763-8208","name_cn":"HUIZHENG WANG","xref_en":"1","name_en":"HUIZHENG WANG"},{"deceased":false,"xref":"1","orcid":"0009-0006-2221-4706","name_cn":"QIZE YANG","xref_en":"1","name_en":"QIZE YANG"},{"deceased":false,"xref":"1","name_cn":"TAIQUAN WEI","xref_en":"1","name_en":"TAIQUAN WEI"},{"deceased":false,"xref":"1","name_cn":"XINGMAO YU","xref_en":"1","name_en":"XINGMAO YU"},{"deceased":false,"xref":"1","name_cn":"CHENGRAN LI","xref_en":"1","name_en":"CHENGRAN LI"},{"deceased":false,"xref":"1","name_cn":"JIAHAO FANG","xref_en":"1","name_en":"JIAHAO FANG"},{"deceased":false,"xref":"1","name_cn":"GUANGYANG LU","xref_en":"1","name_en":"GUANGYANG LU"},{"deceased":false,"xref":"2","name_cn":"XU DAI","xref_en":"2","name_en":"XU DAI"},{"deceased":false,"xref":"2","name_cn":"LIANG LIU","xref_en":"2","name_en":"LIANG LIU"},{"deceased":false,"xref":"2","name_cn":"SHENFEI JIANG","xref_en":"2","name_en":"SHENFEI JIANG"},{"deceased":false,"xref":"1","orcid":"0000-0001-6942-4395","name_cn":"YANG HU","email":"hu_yang@tsinghua.edu.cn","xref_en":"1","name_en":"YANG HU"},{"deceased":false,"xref":"1, 3","orcid":"0000-0003-2309-572X","name_cn":"SHOUYI YIN","xref_en":"1, 3","name_en":"SHOUYI YIN"},{"deceased":false,"xref":"1","orcid":"0000-0001-5117-7920","name_cn":"SHAOJUN WEI","xref_en":"1","name_en":"SHAOJUN WEI"}],"journal":{"issn":"2995-1968","qiKanWangZhi":"//www.sghhindu.com/www.qk/ics","qiKanMingCheng_CN":"Integrated Circuits and Systems","id":22,"qiKanMingCheng_EN":"Integrated Circuits and 
Systems"},"authorList":[{"deceased":false,"xref":"1","orcid":"0000-0002-9763-8208","name_cn":"HUIZHENG WANG","xref_en":"1","name_en":"HUIZHENG WANG"},{"deceased":false,"xref":"1","orcid":"0009-0006-2221-4706","name_cn":"QIZE YANG","xref_en":"1","name_en":"QIZE YANG"},{"deceased":false,"xref":"1","name_cn":"TAIQUAN WEI","xref_en":"1","name_en":"TAIQUAN WEI"},{"deceased":false,"xref":"1","name_cn":"XINGMAO YU","xref_en":"1","name_en":"XINGMAO YU"},{"deceased":false,"xref":"1","name_cn":"CHENGRAN LI","xref_en":"1","name_en":"CHENGRAN LI"},{"deceased":false,"xref":"1","name_cn":"JIAHAO FANG","xref_en":"1","name_en":"JIAHAO FANG"},{"deceased":false,"xref":"1","name_cn":"GUANGYANG LU","xref_en":"1","name_en":"GUANGYANG LU"},{"deceased":false,"xref":"2","name_cn":"XU DAI","xref_en":"2","name_en":"XU DAI"},{"deceased":false,"xref":"2","name_cn":"LIANG LIU","xref_en":"2","name_en":"LIANG LIU"},{"deceased":false,"xref":"2","name_cn":"SHENFEI JIANG","xref_en":"2","name_en":"SHENFEI JIANG"},{"deceased":false,"xref":"1","orcid":"0000-0001-6942-4395","name_cn":"YANG HU","email":"hu_yang@tsinghua.edu.cn","xref_en":"1","name_en":"YANG HU"},{"deceased":false,"xref":"1, 3","orcid":"0000-0003-2309-572X","name_cn":"SHOUYI YIN","xref_en":"1, 3","name_en":"SHOUYI YIN"},{"deceased":false,"xref":"1","orcid":"0000-0001-5117-7920","name_cn":"SHAOJUN WEI","xref_en":"1","name_en":"SHAOJUN WEI"}],"authorNotes_en":["+ CORRESPONDING AUTHOR: YANG HU (e-mail: hu_yang@tsinghua.edu.cn).","

(Huizheng Wang and Qize Yang contributed equally to this work.)

"],"authorNotesCommon_en":["

(Huizheng Wang and Qize Yang contributed equally to this work.)

"],"backFnGroupList":[{}]}">

TMAC: Training-Targeted Mapping and Architecture Co-Exploration for Wafer-Scale Chips

HUIZHENG WANG, QIZE YANG, TAIQUAN WEI, XINGMAO YU, CHENGRAN LI, JIAHAO FANG, GUANGYANG LU, XU DAI, LIANG LIU, SHENFEI JIANG, YANG HU, SHOUYI YIN, SHAOJUN WEI

Integrated Circuits and Systems ›› 2024, Vol. 1 ›› Issue (4): 178-195. DOI: 10.23919/ICS.2024.3515003
Original article


Author information

1 School of Integrated Circuits, Tsinghua University, Beijing 100084, China
2 Shanghai AI Laboratory, Shanghai 200003, China
3 International Innovation Center of Tsinghua University, Shanghai 200003, China

Corresponding author: Yang Hu (e-mail: hu_yang@tsinghua.edu.cn).
(Huizheng Wang and Qize Yang contributed equally to this work.)

History

Revised: 2024-12-05. Accepted: 2024-12-05. First published: 2025-01-09.


Abstract

Transformer-based large language models (LLMs) have made significant strides in the field of artificial intelligence (AI). However, training these LLMs imposes immense demands on computational power and bandwidth for hardware systems. Wafer-scale chips (WSCs) offer a promising solution, yet they struggle with limited on-chip memory and complex tensor partitioning. To fully harness the high-bandwidth, low-latency on-chip interconnect benefits of WSCs and to alleviate the on-chip memory limitations, a specialized mapping and architecture co-exploration method is essential. Despite existing efforts in memory optimization and mapping, current approaches fall short for WSC scenarios. To bridge this gap, we introduce TMAC, an architecture-mapping co-exploration framework that integrates recomputation into the design space, fully exploiting optimization opportunities overlooked by existing works. Further, TMAC takes advantage of the superior on-chip interconnect performance of WSCs by incorporating a more flexible tensor partition scheme. TMAC then introduces a novel operator-centric encoding scheme (OCES) designed to comprehensively describe the mapping space for training LLMs. Unlike previous studies that focus solely on communication volume analysis based on mapping, TMAC explores the design space by evaluating the combined impact of mapping and architecture on training performance. However, fully accounting for these untapped optimization opportunities increases the complexity of the design space. To address this, we streamline the simulation process, reducing the time needed for exploration. Compared to AccPar, DeepSpeed, and Megatron, TMAC delivers performance gains of 3.1×, 2.9×, and 1.6×, respectively. In terms of memory usage, TMAC requires 3.6× and 3.1× less memory than AccPar and DeepSpeed, respectively, and is comparable to Megatron's full-recomputation method.
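
The recomputation trade-off that TMAC folds into its design space can be made concrete with a toy cost model. The sketch below is illustrative only and is not taken from the paper: it assumes a deliberately crude uniform-layer model in which full recomputation keeps one layer's activations live and replays every forward pass during backpropagation, and all names and numbers are hypothetical.

# Illustrative only: a toy cost model for activation recomputation
# (gradient checkpointing). The uniform-layer assumptions and numbers
# below are hypothetical, not values from the paper.

def training_step_costs(n_layers, act_bytes_per_layer, fwd_flops_per_layer,
                        recompute):
    """Return (peak activation memory in bytes, total forward FLOPs)
    for one training step. Backward FLOPs are identical in both cases
    and therefore omitted."""
    if recompute:
        # Full recomputation: keep roughly one layer's activations
        # live and replay each forward pass during backpropagation.
        mem = act_bytes_per_layer
        flops = 2 * n_layers * fwd_flops_per_layer  # forward + replay
    else:
        # No recomputation: every layer's activations stay resident
        # until its backward pass runs.
        mem = n_layers * act_bytes_per_layer
        flops = n_layers * fwd_flops_per_layer
    return mem, flops

# Example: 48 layers, 0.5 GiB of activations and 1 TFLOP forward per layer.
for flag in (False, True):
    mem, flops = training_step_costs(48, 0.5 * 2**30, 1e12, recompute=flag)
    print(f"recompute={flag}: peak activation memory = {mem / 2**30:.1f} GiB, "
          f"forward FLOPs = {flops:.2e}")

Under these assumptions, recomputation cuts peak activation memory by roughly a factor of the layer count at the cost of a second forward pass, which is why treating it as a searchable knob rather than an all-or-nothing switch exposes optimization opportunities.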

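Likewise, the paper's operator-centric encoding scheme (OCES) is not reproduced on this page, so the following is only a generic, hypothetical illustration of what encoding per-operator mapping choices (data-parallel and tensor-partition degrees plus a recomputation flag) might look like in a mapping search; the field names and degrees are invented for the example.

# Illustrative only: OCES itself is defined in the paper, not here; this
# is a generic, hypothetical encoding of per-operator mapping choices.

from dataclasses import dataclass

@dataclass(frozen=True)
class OperatorMapping:
    op_name: str     # e.g., "attn_qkv_matmul" (hypothetical label)
    dp_degree: int   # data-parallel replication factor
    tp_row: int      # tensor-partition degree along one matrix dimension
    tp_col: int      # tensor-partition degree along the other dimension
    recompute: bool  # discard this operator's activations and replay?

    def shards(self) -> int:
        """Total number of shards this operator is split into."""
        return self.dp_degree * self.tp_row * self.tp_col

# A candidate mapping for a whole model is a tuple of such choices, one
# per operator; a search loop can enumerate or mutate these tuples and
# score each candidate with a performance/memory model, which is the role
# a streamlined simulator plays in this kind of co-exploration.
candidate = (
    OperatorMapping("attn_qkv_matmul", dp_degree=2, tp_row=4, tp_col=2,
                    recompute=False),
    OperatorMapping("ffn_matmul_1", dp_degree=2, tp_row=2, tp_col=4,
                    recompute=True),
)
assert all(m.shards() == 16 for m in candidate)  # 16 shards per operator here
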
Key words

Large language models, recomputation, tensor partition, training, wafer-scale chips

Cite this article

HUIZHENG WANG, QIZE YANG, TAIQUAN WEI, XINGMAO YU, CHENGRAN LI, JIAHAO FANG, GUANGYANG LU, XU DAI, LIANG LIU, SHENFEI JIANG, YANG HU, SHOUYI YIN, SHAOJUN WEI. TMAC: Training-Targeted Mapping and Architecture Co-Exploration for Wafer-Scale Chips[J]. Integrated Circuits and Systems, 2024, 1(4): 178-195. https://doi.org/10.23919/ICS.2024.3515003

