Title: Designing and Automating Asynchronous, Localized, Multi-Level Fault-Tolerance at the Application Level
Date: Thursday, August 15, 2024
Time: 3pm - 5pm ET
Location: Klaus 3402
Remote: via Teams
Matthew Whitlock
Ph.D. Candidate in Computer Science
School of Computer Science
College of Computing
Georgia Institute of Technology
Committee:
Dr. Vivek Sarkar (Advisor) - School of Computer Science, Georgia Institute of Technology
Dr. Keita Teranishi - Advanced Computing Systems Research Section, Oak Ridge National Laboratory
Dr. Ada Gavrilovska - School of Computer Science, Georgia Institute of Technology
Dr. Umakishore Ramachandran - School of Computer Science, Georgia Institute of Technology
Dr. Tom Conte - School of Computer Science, Georgia Institute of Technology
Abstract
Moore's law is dead or dying, but Rock's law of doubling costs for semiconductor fabrication is still going strong. It is becoming more expensive to meet ever-growing compute demands, and the general public is expressing growing concerns about the environmental impact of extreme-scale computing. Consequently, researchers in fields like machine learning and embedded computing are exploring reduced-reliability computing.
Supercomputing facilities, however, are struggling to maintain high-reliability hardware to support the inefficient and unscalable global checkpoint/restart (C/R) mechanisms that most scientific computing applications continue to rely on. The performance cost of C/R is rising faster than the performance of leading supercomputers. Applications' fault tolerance must scale against higher parallelism and reduced hardware reliability for HPC to continue scaling while reining in its environmental footprint. To avoid the exponential growth of C/R overheads, applications must localize the cost of hardware faults. Further, fault tolerance must be flexible to application-specific refinements while managing application developers' reluctance to implement complex resilience code.
We describe a layer-based resilience taxonomy that exposes the essential configurability mechanisms needed to build fault-tolerance tools that combine flexibly to provide general, application-tailored, and platform-tailored fault recovery. We demonstrate this by extending contemporary resilience tools to bring flexible, easy-to-implement online recovery to applications through a multi-layered approach. Next, we define the key requirements of localized recovery by creating a general analytical model for local recovery. We prove that recovery can be localized using modern User-Level Failure Mitigation (ULFM) MPI features despite ULFM's collective recovery constraints. Finally, we prove that asynchrony via task-based parallelism can mitigate the non-local costs of recovery for applications which cannot strictly meet the requirements for localized recovery.
These works build the path for HPC to maintain environmental accountability, meet growing compute demands, and benefit from novel upcoming hardware trends.