Title: Software-Hardware Optimizations for Efficient Collective Communications in Distributed Machine Learning Platforms

 

Date: Monday, September 23, 2024

Time: 9:00 AM – 11:00 AM ET

Location: Klaus 1212 (hybrid): https://gatech.zoom.us/j/94843770067?pwd=1kRevvLZLDTxm0N59mBoW70EdL1fbw.1

 

William Jonghoon Won

Ph.D. Student

School of Computer Science

College of Computing

Georgia Institute of Technology

 

Committee:

Dr. Tushar Krishna (advisor) - School of Electrical and Computer Engineering & School of Computer Science, Georgia Institute of Technology

Dr. Yingyan (Celine) Lin - School of Computer Science, Georgia Institute of Technology

Dr. Divya Mahajan - School of Computer Science & School of Electrical and Computer Engineering, Georgia Institute of Technology

Dr. Manya Ghobadi - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology

Dr. Bradford Beckmann - Research and Advanced Development, Advanced Micro Devices

 

Abstract:

The advancement of large-scale Machine Learning (ML) models and their massive resource requirements have driven the development of specialized, distributed High-Performance Computing (HPC) platforms tailored to ML workloads. These platforms integrate multiple Neural Processing Units (NPUs) interconnected through custom network fabrics. Because ML models and data are distributed across NPUs, activations and gradients must be synchronized frequently. This synchronization is a major bottleneck in distributed ML, making efficient collective communication a pivotal research challenge.
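
To make the bottleneck concrete, the sketch below (my illustration, not material from the talk) estimates the cost of a ring All-Reduce, the collective most commonly used to synchronize gradients, under the standard alpha-beta cost model; the function name, parameters, and example numbers are assumptions chosen for illustration.

    def ring_allreduce_time(num_npus, message_bytes, alpha_s, beta_s_per_byte):
        """Estimate ring All-Reduce time: 2*(p-1) steps, each moving M/p bytes."""
        p = num_npus
        step_bytes = message_bytes / p
        steps = 2 * (p - 1)  # reduce-scatter phase + all-gather phase
        return steps * (alpha_s + step_bytes * beta_s_per_byte)

    # Example (illustrative numbers): synchronizing 1 GB of gradients across
    # 64 NPUs over links with 2 us latency and 100 GB/s bandwidth.
    t = ring_allreduce_time(64, 1e9, alpha_s=2e-6, beta_s_per_byte=1 / 100e9)
    print(f"All-Reduce time: {t * 1e3:.1f} ms")  # roughly 20 ms per iteration

A cost of this magnitude is paid at every training iteration, which is why collective communication efficiency so strongly shapes end-to-end distributed ML performance.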

 

Given the tightly coupled co-design space of distributed ML, judicious software-hardware optimization approaches are essential. To address this, I first present (i) ASTRA-sim2.0, an end-to-end simulation and modeling framework that facilitates design-space exploration of the distributed ML stack. Next, I present (ii) LIBRA, an analytical modeling framework that captures the end-to-end execution time of distributed ML on multi-dimensional networks; coupled with optimizers, LIBRA identifies optimal multi-dimensional network design points. Finally, I introduce (iii) TACOS, an autonomous topology-aware collective-algorithm synthesizer that leverages a time-expanded network representation and a link-chunk matching algorithm to automatically generate optimized collective algorithms for arbitrary target topologies.
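
As a rough intuition for the time-expanded network representation that TACOS builds on (a hedged sketch of the general idea, not TACOS's actual implementation), the physical topology is unrolled over discrete time steps: each physical link (u, v) becomes an edge from node (u, t) to node (v, t+1), so matching chunks to these edges turns collective-algorithm synthesis into a scheduling problem on a graph. The helper name and ring example below are my own.

    def time_expanded_edges(links, num_steps):
        """Unroll directed physical links (u, v) into edges ((u, t), (v, t+1))."""
        edges = []
        for t in range(num_steps):
            for (u, v) in links:
                edges.append(((u, t), (v, t + 1)))  # a send over (u, v) during step t
        return edges

    # Example: a 3-NPU unidirectional ring unrolled for two time steps.
    ring = [(0, 1), (1, 2), (2, 0)]
    for edge in time_expanded_edges(ring, 2):
        print(edge)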