Graduate Exam Abstract

Daniel Dauwe

Ph.D. Final
April 30, 2018, 2:00 pm - 4:00 pm
EDUC 11 (Education building)
Resource Management for Extreme Scale High Performance Computing Systems in the Presence of Failures

Abstract: High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale applications across tens or hundreds of thousands of multicore processors. Unfortunately, as HPC systems continue to grow toward exascale, they experience an exponential growth in the number of failures occurring in the system. These failures reduce performance and increase energy use, reducing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer from decreased performance and increased energy use because they are forced to share resources; in particular, contention among multiple application threads sharing the last-level cache degrades performance. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems.

To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through (a) the optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for prediction. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention. We investigate how to better characterize and model the negative effects of system failures and application co-location on large-scale HPC systems. Our analysis of application and system behavior also investigates the interrelated effects of the network usage of applications and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of various promising strategies for fault resilience in exascale-sized systems.

Adviser: Dr. Sudeep Pasricha
Co-Adviser: Dr. H.J. Siegel
Non-ECE Member: Dr. Patrick Burns, VP for Information Technologies / Dean of Libraries
Member 3: Dr. Anthony A. Maciejewski
Additional Members: N/A

Publications:
[1] D. Dauwe, R. Friese, S. Pasricha, A. A. Maciejewski, G. A. Koenig, and H. J. Siegel, "Modeling the effects on power and performance from memory interference of co-located applications in multicore systems," The 2014 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2014), pp. 3-9, Las Vegas, NV, July 2014.

[2] D. Dauwe, E. Jonardi, R. D. Friese, S. Pasricha, A. A. Maciejewski, D. A. Bader, and H. J. Siegel, "A methodology for co-location aware application performance modeling in multicore computing," The Workshop on Advances on Parallel and Distributed Computing Models (APDCM 2015), pp. 434-443, May 2015.

[3] D. Dauwe, E. Jonardi, R. D. Friese, S. Pasricha, A. A. Maciejewski, D. A. Bader, and H. J. Siegel, "HPC node performance and energy modeling with the co-location of applications," The Journal of Supercomputing, Vol. 72, No. 12, pp. 4771-4809, Nov. 2016.

[4] D. Dauwe, S. Pasricha, A. A. Maciejewski, and H. J. Siegel, "A performance and energy comparison of fault tolerance techniques for exascale computing systems," The 6th IEEE International Symposium on Cloud and Service Computing (SC2-2016), pp. 436-443, Dec. 2016.

[5] D. Dauwe, S. Pasricha, A. A. Maciejewski, and H. J. Siegel, "An Analysis of Resilience Techniques for Exascale Computing Platforms," The Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2017), pp. 914-923, May 2017.

[6] S. Pasricha, J. R. Doppa, K. Chakrabarty, S. Tiku, D. Dauwe, S. Jin, and P. P. Pande, "Special session paper: data analytics enables energy-efficiency and robustness: from mobile to manycores, datacenters, and networks," 2017 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), 10 pp., Oct. 2017.

[7] D. Dauwe, R. Jhaveri, S. Pasricha, A. A. Maciejewski, and H. J. Siegel, "Optimizing Checkpoint Intervals for Reduced Energy Use in Exascale Systems," The Workshop on Energy-efficient Networks of Computers (E2NC): from the Chip to the Cloud, 8 pp., Oct. 2017.

[8] D. Dauwe, S. Pasricha, A. A. Maciejewski and H. J. Siegel, "Resilience-Aware Resource Management for Exascale Computing Systems," in IEEE Transactions on Sustainable Computing, 14 pp., accepted 2018, to appear.

[9] D. Dauwe, S. Pasricha, A. A. Maciejewski, and H. J. Siegel, "An Analysis of Multilevel Checkpoint Performance Models," The Workshop on Advances in Parallel and Distributed Computational Models (APDCM 2018), 10 pp., accepted 2018, to appear.

Program of Study:
ECE 561
ECE 554
ECE 661
ECE 666
ECE 514
CS 540
CS 545
Others: ECE 520, ECE 795, CS 420, GRAD 510, GRAD

Colorado State University

Electrical and Computer Engineering
