Fault Resilient Exascale Supercomputing

The EPIC Lab’s research on fault‑resilient exascale supercomputing develops cross‑layer strategies that ensure reliable, energy‑efficient, and high‑performance execution at unprecedented system scales. This work spans resilience‑aware resource management frameworks, multilevel checkpointing models, and optimized checkpoint intervals that reduce energy overhead while maintaining strong fault tolerance in massively parallel systems. The lab has advanced methodologies for modeling performance and power under application co‑location, analyzing memory‑interference effects, and evaluating the trade‑offs among diverse resilience protocols for large‑scale scientific workloads. Complementary contributions leverage data analytics to improve robustness and efficiency across heterogeneous HPC nodes, capturing uncertainty in workload behavior and system interactions. Collectively, this research establishes a comprehensive foundation for dependable exascale platforms capable of sustaining performance in the face of frequent faults, extreme concurrency, and complex resource dynamics.

Selected Publications

D. Dauwe, S. Pasricha, A. A. Maciejewski, H. J. Siegel, “Resilience-Aware Resource Management for Exascale Computing Systems,” IEEE Transactions on Sustainable Computing (TSUSC), Vol. 3, No. 4, Oct-Dec 2018.

D. Dauwe, E. Jonardi, R. Friese, S. Pasricha, A. A. Maciejewski, D. Bader, H. J. Siegel, “HPC Node Performance and Energy Modeling Under the Uncertainty of Application Co-Location,” Journal of Supercomputing, Vol. 72, No. 12, pp. 4771-4809, Nov. 2016.

D. Dauwe, S. Pasricha, A. A. Maciejewski, H. J. Siegel, “An Analysis of Multilevel Checkpoint Performance Models,” 20th Workshop on Advances in Parallel and Distributed Computational Models (APDCM), co-organized with IEEE International Parallel and Distributed Processing Symposium (IPDPS), Vancouver, BC, Canada, May 2018.

D. Dauwe, S. Pasricha, A. A. Maciejewski, H. J. Siegel, “An Exploration of Fault Resilience Protocols for Large-Scale Application Execution on Exascale Computing Platforms,” 5th Exascale Applications and Software Conference (EASC), Edinburgh, Scotland, 2018. (Extended Abstract)

D. Dauwe, R. Jhaveri, S. Pasricha, A. A. Maciejewski, H. J. Siegel, “Optimizing Checkpoint Intervals for Reduced Energy Use in Exascale Systems,” IEEE Workshop on Energy-efficient Networks of Computers (E2NC): from the Chip to the Cloud, Orlando, FL, USA, Oct 2017.

S. Pasricha, J. Doppa, K. Chakrabarty, S. Tiku, D. Dauwe, S. Jin, P. Pande, “Data Analytics Enables Energy-Efficiency and Robustness: From Mobile to Manycores, Datacenters, and Networks,” ACM/IEEE International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Oct 2017.

D. Dauwe, S. Pasricha, A. A. Maciejewski, H. J. Siegel, “An Analysis of Resilience Techniques for Exascale Computing Platforms,” 19th Workshop on Advances in Parallel and Distributed Computational Models (APDCM), co-organized with IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2017.

D. Dauwe, S. Pasricha, A. A. Maciejewski, H. J. Siegel, “A Performance and Energy Comparison of Fault Tolerance Techniques for Exascale Computing Systems,” 6th IEEE International Symposium on Cloud and Service Computing (SC-2), Dec 2016.

D. Dauwe, E. Jonardi, R. Friese, S. Pasricha, A. A. Maciejewski, D. Bader, H. J. Siegel, “A Methodology for Co-Location Aware Application Performance Modeling in Multicore Computing,” 17th Workshop on Advances in Parallel and Distributed Computational Models (APDCM), May 2015.

D. Dauwe, R. Friese, S. Pasricha, A. A. Maciejewski, G. A. Koenig, H. J. Siegel, “Modeling the Effects on Power and Performance from Memory Interference of Co-located Applications in Multicore Systems,” International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA), 2014.