Title: Designing and Automating Asynchronous, Localized, Multi-Level Fault-Tolerance at the Application Level

 

Date: Thursday, August 15, 2024

Time: 3pm - 5pm EST

Location: Klaus 3402

Remote: via Teams

 

Matthew Whitlock

Ph.D. Candidate in Computer Science

School of Computer Science

College of Computing

Georgia Institute of Technology

 

Committee:

Dr. Vivek Sarkar (Advisor) - School of Computer Science, Georgia Institute of Technology

Dr. Keita Teranishi - Advanced Computing Systems Research Section, Oak Ridge National Laboratory

Dr. Ada Gavrilovska - School of Computer Science, Georgia Institute of Technology

Dr. Umakishore Ramachandran - School of Computer Science, Georgia Institute of Technology

Dr. Tom Conte - School of Computer Science, Georgia Institute of Technology

 

Abstract

Moore's law is dead or dying, but Rock's law of doubling costs for semiconductor fabrication is still going strong. It is becoming more expensive to meet ever-growing compute demands, and the general public is expressing growing concerns about the environmental impact of extreme-scale computing. Consequently, researchers in fields like machine learning and embedded computing are exploring reduced-reliability computing.

 

Supercomputing facilities, however, are struggling to maintain high-reliability hardware to support the inefficient and unscalable global checkpoint/restart (C/R) mechanisms that most scientific computing applications continue to rely on. The performance cost of C/R is rising faster than the performance of leading supercomputers. For HPC to continue scaling while reining in its environmental footprint, applications' fault tolerance must scale with higher parallelism and reduced hardware reliability. To avoid the exponential growth of C/R overheads, applications must localize the cost of hardware faults. Further, fault tolerance must remain flexible enough to accommodate application-specific refinements while accounting for application developers' reluctance to implement complex resilience code.

 

We describe a layer-based resilience taxonomy that exposes the configurability mechanisms needed for fault-tolerance tools to combine flexibly and provide general application- and platform-tailored fault recovery. We demonstrate this by extending contemporary resilience tools to enable flexible, easy-to-implement online recovery in applications through a multi-layered approach. Next, we define the key requirements of localized recovery by creating a general analytical model for local recovery. We show that recovery can be localized using modern User-Level Failure Mitigation (ULFM) MPI features despite ULFM's collective recovery constraints. Finally, we show that asynchrony via task-based parallelism can mitigate the non-local costs of recovery for applications that cannot strictly meet the requirements for localized recovery.

 

Together, these works chart a path for HPC to maintain environmental accountability, meet growing compute demands, and benefit from emerging hardware trends.