Fault Resilient Exascale Supercomputing

With the increase in the complexity and number of nodes in large-scale high performance computing (HPC) systems, the probability of applications experiencing failures has increased significantly. As the computational demands of applications that execute on HPC systems increase, projections indicate that applications executing on exascale-sized systems are likely to operate with a mean time between failures (MTBF) of as little as a few minutes. A number of strategies for enabling fault resilience in systems of extreme sizes have been proposed in recent years. However, few studies provide performance comparisons for these resilience techniques.
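The link between node count and a system-level MTBF of a few minutes can be illustrated with a back-of-the-envelope calculation. The sketch below assumes that node failures are independent and exponentially distributed, so that the system MTBF is approximately the per-node MTBF divided by the number of nodes. The function name, the 5-year per-node MTBF, and the node counts are hypothetical values chosen for illustration, not figures from this project.

```python
HOURS_PER_YEAR = 24 * 365


def system_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Approximate system MTBF, assuming independent exponential node failures.

    Under that assumption the failure rates of the nodes add up, so the
    system MTBF is the per-node MTBF divided by the node count.
    """
    if node_mtbf_hours <= 0 or num_nodes <= 0:
        raise ValueError("node_mtbf_hours and num_nodes must be positive")
    return node_mtbf_hours / num_nodes


if __name__ == "__main__":
    # Hypothetical per-node MTBF of 5 years (an assumption for illustration).
    node_mtbf = 5 * HOURS_PER_YEAR
    for nodes in (1_000, 10_000, 100_000, 1_000_000):
        minutes = system_mtbf_hours(node_mtbf, nodes) * 60
        print(f"{nodes:>9} nodes: system MTBF ~ {minutes:,.1f} minutes")
```

Under these assumed numbers, a machine with a million nodes would see a failure every few minutes on average, which is the regime the projections above describe. Real systems have correlated failures and heterogeneous components, so this is only a first-order estimate.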

The research objective of this project is to analyze existing state-of-the-art HPC resilience techniques that are being considered for use in exascale systems. The goal is to explore the behavior of each resilience technique for a diverse set of applications that vary in communication behavior and memory use, and to design new resilience techniques with better scalability. We aim to examine how resilience techniques behave as application size scales from what is considered large today up to exascale-sized applications. We further propose to study the performance degradation that a large-scale system experiences from the overhead associated with each resilience technique, as well as from the application computation needed to continue execution when a failure occurs. We will also examine how application performance on exascale systems can be improved by allowing the system to select the best-suited resilience technique for each application, depending on that application’s execution characteristics.

Selected Publications

D. Dauwe, S. Pasricha, A. A. Maciejewski, H. J. Siegel, “Resilience-Aware Resource Management for Exascale Computing Systems,” IEEE Transactions on Sustainable Computing (TSUSC), Vol. 3, No. 4, Oct-Dec 2018.

D. Dauwe, E. Jonardi, R. Friese, S. Pasricha, A. A. Maciejewski, D. Bader, H. J. Siegel, “HPC Node Performance and Energy Modeling Under the Uncertainty of Application Co-Location,” Journal of Supercomputing, Vol. 72, No. 12, pp. 4771-4809, Nov. 2016.

D. Dauwe, S. Pasricha, A. A. Maciejewski, H. J. Siegel, “An Analysis of Multilevel Checkpoint Performance Models,” 20th Workshop on Advances in Parallel and Distributed Computational Models (APDCM), co-organized with IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vancouver, BC, Canada, May 2018.

D. Dauwe, S. Pasricha, A. A. Maciejewski, H. J. Siegel, “An Exploration of Fault Resilience Protocols for Large-Scale Application Execution on Exascale Computing Platforms,” 5th Exascale Applications and Software Conference (EASC), Edinburgh, Scotland, 2018.

D. Dauwe, R. Jhaveri, S. Pasricha, A. A. Maciejewski, H. J. Siegel, “Optimizing Checkpoint Intervals for Reduced Energy Use in Exascale Systems,” IEEE Workshop on Energy-efficient Networks of Computers (E2NC): from the Chip to the Cloud, Orlando, FL, USA, Oct 2017.

S. Pasricha, J. Doppa, K. Chakrabarty, S. Tiku, D. Dauwe, S. Jin, P. Pande, “Data Analytics Enables Energy-Efficiency and Robustness: From Mobile to Manycores, Datacenters, and Networks,” ACM/IEEE International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Oct 2017.

D. Dauwe, S. Pasricha, A. A. Maciejewski, H. J. Siegel, “An Analysis of Resilience Techniques for Exascale Computing Platforms,” 19th Workshop on Advances in Parallel and Distributed Computational Models (APDCM), co-organized with IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017.

D. Dauwe, S. Pasricha, A. A. Maciejewski, H. J. Siegel, “A Performance and Energy Comparison of Fault Tolerance Techniques for Exascale Computing Systems,” 6th IEEE International Symposium on Cloud and Service Computing (SC-2), Dec 2016.

D. Dauwe, E. Jonardi, R. Friese, S. Pasricha, A. A. Maciejewski, D. Bader, H. J. Siegel, “A Methodology for Co-Location Aware Application Performance Modeling in Multicore Computing,” 17th Workshop on Advances in Parallel and Distributed Computational Models (APDCM), May 2015.

D. Dauwe, R. Friese, S. Pasricha, A. A. Maciejewski, G. A. Koenig, H. J. Siegel, “Modeling the Effects on Power and Performance from Memory Interference of Co-located Applications in Multicore Systems,” International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), 2014.