Title: Optimizing Sparsity in Distributed Machine Learning Training


Date: Wednesday, August 14th, 2024

Time: 10:00 AM - 11:30 AM EDT

Location: KACB 3126

Virtual: https://gatech.zoom.us/j/95651978956?pwd=cNa98ds2wGysW5SlajJDBAkouSiAvH.1


Cheng Wan

School of Computer Science

College of Computing

Georgia Institute of Technology


Committee Members:

Dr. Yingyan (Celine) Lin (Advisor) – School of Computer Science, Georgia Institute of Technology

Dr. Pan Li – School of Computational Science and Engineering, Georgia Institute of Technology

Dr. Alexey Tumanov – School of Computer Science, Georgia Institute of Technology

Dr. Anand Iyer – School of Computer Science, Georgia Institute of Technology


Abstract:

As machine learning models and datasets continue to grow, distributed machine learning has become essential for meeting the computational demands of large-scale training. While recent frameworks have improved scalability and throughput, the optimization of sparsity in distributed deep neural networks remains under-explored. Sparsity poses a particular challenge in advanced models such as graph neural networks and mixture-of-experts architectures.


This thesis proposal systematically investigates three types of sparsity in distributed machine learning training: sparse data access, sparse operations, and sparse workflows. We introduce both system-level and algorithm-level innovations to address the inefficiencies these forms of sparsity cause. Through theoretical analysis and practical implementations, we demonstrate that these optimizations significantly reduce communication overhead and enhance scalability, improving the efficiency of distributed machine learning systems. This work advances the understanding of sparsity optimization and lays the groundwork for more efficient distributed training architectures.