Title: Optimizing HPC I/O Performance Over New Memory and Storage Hierarchies: A Data-Driven Approach

Date: September 20th, 2024
Time: 2:00 PM - 4:00 PM EDT
Location: Klaus Advanced Computing Building, Conference Room 3126
Virtual meeting: https://gatech.zoom.us/j/98262397839?pwd=fc5E5vZzIEwgH2nMHN5oa9ci1z8Q3t.1

Ranjan Sarpangala Venkatesh
School of Computer Science
College of Computing
Georgia Institute of Technology

Committee
Dr. Ada Gavrilovska (advisor) - School of Computer Science, Georgia Institute of Technology
Dr. Greg Eisenhauer - School of Computer Science, Georgia Institute of Technology
Dr. Santosh Pande - School of Computer Science, Georgia Institute of Technology
Dr. Richard Vuduc - School of Computational Science and Engineering, Georgia Institute of Technology

Abstract

Multi-component HPC workflows face growing bottlenecks in data and metadata I/O due to rapid data growth. The efficiency of data movement in these workflows relies on I/O performance throughout the memory and storage hierarchy. While new memory technologies and I/O stacks offer opportunities for improvement, their distinct APIs complicate optimization, making empirical approaches necessary to balance trade-offs across components. This thesis proposes data-driven methods to improve I/O performance in next-generation HPC systems.

The first part of this work evaluated various workflow configurations on systems with heterogeneous memory, showing that careful scheduling and data allocation can enhance end-to-end performance by up to 1.6x. By analyzing workflow characteristics, key elements impacting performance variability were identified, resulting in a framework for future workflow schedulers to optimize in situ workflows.

The second part focused on metadata I/O, a growing issue in large-scale workflows. Using the WarpX application and ADIOS (Adaptable I/O System) middleware, which is widely used for data management in scientific applications, it was shown that metadata I/O could account for up to 25% of total I/O time at scale. To address this, the design space of the DAOS (Distributed Asynchronous Object Storage) system was explored, specifically focusing on DAOS Key-Value and Array objects for transferring ADIOS metadata. A newly developed DAOS-based engine for ADIOS metadata I/O improved performance by 2.3x compared to the DAOS POSIX interface, effectively reducing metadata scaling bottlenecks. For the WarpX application, this reduced metadata I/O time by more than 4x, lowering overhead from 20% to just 5% of total I/O time.
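The WarpX figures above can be sanity-checked with a little arithmetic. The sketch below assumes, and this is not stated in the abstract, that only metadata I/O time changes while the remaining data I/O time stays fixed. The function names are illustrative and not part of any tool described here.

```python
def metadata_fraction_after(fraction_before: float, speedup: float) -> float:
    """Metadata share of total I/O time after metadata I/O becomes `speedup`x faster.

    Assumption: non-metadata (data) I/O time is unchanged.
    """
    data_time = 1.0 - fraction_before            # total I/O time normalized to 1
    metadata_time = fraction_before / speedup    # reduced metadata time
    return metadata_time / (data_time + metadata_time)


def speedup_needed(fraction_before: float, fraction_after: float) -> float:
    """Metadata speedup required to move from one share of total I/O time to another."""
    data_time = 1.0 - fraction_before
    # Solve m / (data_time + m) = fraction_after for the new metadata time m.
    metadata_time_after = fraction_after * data_time / (1.0 - fraction_after)
    return fraction_before / metadata_time_after


if __name__ == "__main__":
    print(f"{metadata_fraction_after(0.20, 4.0):.3f}")  # prints 0.059
    print(f"{speedup_needed(0.20, 0.05):.2f}")          # prints 4.75
```

Under this assumption, a reduction of exactly 4x would leave metadata at about 5.9% of total I/O time, and reaching 5% requires a 4.75x reduction, which is consistent with the reported "more than 4x".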

As part of the proposed work, MetaBench is introduced: a suite of benchmarks that will evaluate the DAOS Key-Value and Array interfaces for ADIOS metadata transfer on a given HPC setup. MetaBench will analyze trade-offs across parameters such as metadata size and the number of ranks to identify the optimal DAOS configuration for that setup. It will evaluate real applications and data patterns, providing a practical template for managing metadata transfer across HPC middleware, including ADIOS, HDF5, and PnetCDF.
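To make the shape of such a parameter sweep concrete, here is a minimal sketch of a benchmark loop over metadata size and rank count. It is not MetaBench and does not call DAOS: the two backends are hypothetical in-memory stand-ins (a dict for a key-value layout, a bytearray for an array layout), so the timings it produces say nothing about real DAOS performance. All names are assumptions made for illustration.

```python
import itertools
import time


def write_kv(store: dict, rank: int, payload: bytes) -> None:
    # Hypothetical key-value layout: one entry per rank.
    store[f"rank-{rank}"] = payload


def write_array(buf: bytearray, rank: int, payload: bytes) -> None:
    # Hypothetical array layout: each rank writes at a fixed offset.
    offset = rank * len(payload)
    buf[offset:offset + len(payload)] = payload


def run_sweep(metadata_sizes, rank_counts):
    """Time both stand-in layouts for every (size, ranks) combination."""
    results = []
    for size, nranks in itertools.product(metadata_sizes, rank_counts):
        payload = bytes(size)  # zero-filled dummy metadata block

        kv_store = {}
        start = time.perf_counter()
        for rank in range(nranks):
            write_kv(kv_store, rank, payload)
        kv_seconds = time.perf_counter() - start

        array_buf = bytearray(size * nranks)
        start = time.perf_counter()
        for rank in range(nranks):
            write_array(array_buf, rank, payload)
        array_seconds = time.perf_counter() - start

        results.append({
            "size": size,
            "ranks": nranks,
            "kv_seconds": kv_seconds,
            "array_seconds": array_seconds,
        })
    return results


if __name__ == "__main__":
    for row in run_sweep(metadata_sizes=[256, 4096], rank_counts=[4, 64]):
        print(row)
```

A real harness would replace the two write functions with calls into the storage interface under test and would run across actual MPI ranks; the loop structure over the parameter grid is the only part this sketch is meant to convey.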

These insights will contribute to the development of tools and support the HPC community by integrating DAOS engines into widely used middleware.