Title: Software-Hardware Optimizations for Efficient Collective Communications in Distributed Machine Learning Platforms
Date: Monday, September 23, 2024
Time: 9:00 AM – 11:00 AM ET
Location: Klaus 1212 (hybrid): https://gatech.zoom.us/j/94843770067?pwd=1kRevvLZLDTxm0N59mBoW70EdL1fbw.1
William Jonghoon Won
Ph.D. Student
School of Computer Science
College of Computing
Georgia Institute of Technology
Committee:
Dr. Tushar Krishna (advisor) - School of Electrical and Computer Engineering & School of Computer Science, Georgia Institute of Technology
Dr. Yingyan (Celine) Lin - School of Computer Science, Georgia Institute of Technology
Dr. Divya Mahajan - School of Computer Science & School of Electrical and Computer Engineering, Georgia Institute of Technology
Dr. Manya Ghobadi - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology
Dr. Bradford Beckmann - Research and Advanced Development, Advanced Micro Devices
Abstract:
The advancement of large-scale Machine Learning (ML) models and their massive resource requirements have driven the development of specialized, distributed High-Performance Computing (HPC) platforms tailored to ML workloads. These platforms integrate multiple Neural Processing Units (NPUs) interconnected through custom network fabrics. Because ML models and training data are distributed across NPUs, activations and gradients must be synchronized among them frequently. This synchronization is a major bottleneck in distributed ML, making efficient collective communication a pivotal research challenge.
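As a rough point of reference (the standard cost model for ring-based All-Reduce, not a result of this work): synchronizing S bytes of gradients across p NPUs over links of bandwidth B moves 2(p-1)/p * S bytes through each NPU, so the bandwidth-bound time is approximately

    T_ring ≈ 2 * (p - 1) * S / (p * B)

With S reaching tens of gigabytes per iteration for today's largest models, this term alone can rival compute time, which is why collective communication merits dedicated optimization.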
Given the tightly coupled co-design space of distributed ML, judicious software-hardware optimization approaches are essential. To this end, I first present (i) ASTRA-sim2.0, an end-to-end simulation and modeling framework that facilitates design space exploration of the distributed ML stack. Next, I present (ii) LIBRA, an analytical modeling framework that captures the end-to-end execution time of distributed ML on multi-dimensional networks; integrated with optimizers, LIBRA identifies optimal multi-dimensional network design points. Finally, I introduce (iii) TACOS, an autonomous topology-aware collective algorithm synthesizer that leverages a time-expanded network representation and link-chunk matching algorithms to automatically generate optimized collective algorithms for arbitrary target topologies.
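To give a flavor of the time-expanded network representation that TACOS builds on, below is a minimal Python sketch of the general idea; the function and names are hypothetical illustrations, not the TACOS implementation. The physical topology is unrolled over discrete time steps, so each node appears once per step and each link becomes an edge from one step to the next; synthesizing a collective then amounts to routing chunks through this unrolled graph.

    # Minimal illustration of a time-expanded network (TEN): a hypothetical
    # sketch of the general idea, not the TACOS implementation. Each physical
    # node v is replicated once per time step t; a physical link (u, v)
    # becomes an edge ((u, t), (v, t + 1)), meaning "a chunk sent on (u, v)
    # at step t arrives at v by step t + 1". Self-edges model chunks that
    # simply stay in place for a step.

    from collections import defaultdict

    def time_expand(links, num_nodes, num_steps):
        """links: iterable of directed physical links (u, v)."""
        ten = defaultdict(list)  # (node, t) -> list of reachable (node, t+1)
        for t in range(num_steps):
            for v in range(num_nodes):
                ten[(v, t)].append((v, t + 1))   # chunk stays at v
            for (u, v) in links:
                ten[(u, t)].append((v, t + 1))   # chunk traverses link (u, v)
        return ten

    # Example: a 4-NPU unidirectional ring, unrolled for 3 steps.
    ring = [(i, (i + 1) % 4) for i in range(4)]
    ten = time_expand(ring, num_nodes=4, num_steps=3)
    print(ten[(0, 0)])  # [(0, 1), (1, 1)]: NPU 0 can hold or forward its chunk

On such a graph, greedily matching chunks to link-edges step by step (the spirit of link-chunk matching) yields a concrete, topology-aware collective schedule for an arbitrary target topology.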