Graduate Exam Abstract

Daniel Dauwe

Ph.D. Preliminary

December 2, 2016, 2:00 pm - 4:00 pm

ECE Conference Room C101 B

Resource Management for Extreme Scale High Performance Computing Systems in the Presence of Failures

Abstract: High performance computing (HPC) systems such as datacenters and supercomputers coordinate the execution of large-scale applications across tens or hundreds of thousands of multicore processors. Unfortunately, as HPC systems continue to grow toward exascale, they experience exponential growth in the number of failures. These failures reduce performance and increase energy use, lowering the efficiency and effectiveness of emerging extreme-scale HPC systems. To address this challenge, we propose techniques that intelligently manage system resources and handle distributed failures in order to mitigate their negative effects. Our resource management techniques employ information obtained from historical or predictive analysis of system performance, energy, and temperature behavior when operating under the uncertainty of failures. We investigate both how to better characterize and model the negative effects that system failures have on large-scale computing systems and how to develop new techniques for intelligently utilizing system resources through optimal scheduling of parallel applications onto HPC nodes.

Adviser: Prof. Sudeep Pasricha
Co-Adviser: Prof. Howard Jay Siegel
Non-ECE Member: Prof. Patrick Burns
Member 3: Prof. Anthony A. Maciejewski
Additional Members: N/A

Daniel Dauwe, Eric Jonardi, Ryan D. Friese, Sudeep Pasricha, Anthony A. Maciejewski, David A. Bader, and Howard Jay Siegel, “HPC Node Performance and Energy Modeling with the Co-Location of Applications,” The Journal of Supercomputing, accepted 2016, to appear.
Daniel Dauwe, Ryan Friese, Sudeep Pasricha, Anthony A. Maciejewski, Gregory A. Koenig, and Howard Jay Siegel, “Modeling the Effects on Power and Performance from Memory Interference of Co-located Applications in Multicore Systems,” The 2014 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2014), cosponsors: World Academy of Science and Computer Science Research, Education, and Applications (CSREA), pp. 3-9, Las Vegas, NV, July 2014.
Daniel Dauwe, Eric Jonardi, Ryan Friese, Sudeep Pasricha, Anthony A. Maciejewski, David A. Bader, and Howard Jay Siegel, “A Methodology for Co-Location Aware Application Performance Modeling in Multicore Computing,” The Workshop on Advances on Parallel and Distributed Computing Models (APDCM 2015), sponsor: IEEE Computer Society, in the proceedings of 2015 International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2015), Hyderabad, India, pp. 434-443, May 2015.
Daniel Dauwe, Sudeep Pasricha, Anthony A. Maciejewski, and Howard Jay Siegel, “A Performance and Energy Comparison of Fault Tolerance Techniques for Exascale Computing Systems,” The 6th IEEE International Symposium on Cloud and Service Computing (SC2-2016), sponsor: IEEE Computer Society, Nadi, Fiji, Dec. 2016, to appear.

Program of Study: