Abstract: Today's high-performance computing (HPC) systems face the issue of balancing electricity (energy) use and performance. Rising energy costs are forcing system operators to either operate within an energy budget or to reduce energy use as much as possible while still maintaining performance-based service agreements. Energy-aware resource management is one method for solving such problems. Resource management in the context of high-performance computing refers to the process of assigning and scheduling workloads to resources (e.g., compute nodes). Because the cooling systems in HPC facilities also consume a considerable amount of energy, it is important to consider the computer room air conditioning (CRAC) units as a controllable resource and to study the relationship (and energy consumption impact) between the computing and cooling systems. In this thesis, we present four primary contributing studies with differing environments and novel techniques designed for each of those environments. Each study proposes new ideas in the field of energy- and thermal-aware resource management for heterogeneous high-performance computing systems.
Our first contribution explores the problem of assigning a collection of independent tasks (bag-of-tasks) to a heterogeneous HPC system in an energy-aware manner, where task execution times vary. We propose two new measures that consider these uncertainties with respect to makespan and energy: makespan-robustness and energy-robustness. We design resource management heuristics to either: (a) maximize makespan-robustness within an energy-robustness constraint, or (b) maximize energy-robustness within a makespan-robustness constraint.
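To make the two measures concrete, the following minimal sketch assumes that makespan-robustness is the probability that the makespan meets a deadline and that energy-robustness is the probability that total energy stays within a budget. Those definitions are not given in this abstract and are an assumption here, as are the names `estimate_robustness`, `sample_time`, and `power`, the constant per-machine power draw, and the Monte Carlo estimation. It illustrates the idea and is not the method used in the thesis.

```python
import random

def estimate_robustness(assignment, sample_time, power, deadline,
                        energy_budget, trials=10000, seed=0):
    """Estimate (makespan-robustness, energy-robustness) by sampling.

    assignment:  dict mapping machine -> list of task ids (hypothetical layout)
    sample_time: callable (task, machine, rng) -> one sampled execution time
    power:       dict mapping machine -> power draw while busy (assumed constant)
    """
    rng = random.Random(seed)
    makespan_hits = 0
    energy_hits = 0
    for _ in range(trials):
        makespan = 0.0
        energy = 0.0
        for machine, tasks in assignment.items():
            # Tasks on one machine run back to back, so busy time is the sum.
            busy = sum(sample_time(t, machine, rng) for t in tasks)
            makespan = max(makespan, busy)
            energy += busy * power[machine]
        makespan_hits += makespan <= deadline
        energy_hits += energy <= energy_budget
    return makespan_hits / trials, energy_hits / trials
```

A heuristic of type (a) would then compare candidate assignments by the first returned value while rejecting any whose second value falls below the required energy-robustness level; type (b) swaps the roles.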
Our next contribution studies a rate-based environment where task execution rates are assigned to compute cores within the HPC facility. The performance measure in this study is the reward rate earned for executing tasks. We analyze the impact that co-location interference (i.e., the performance degradation experienced when tasks are simultaneously executing on cores that share memory resources) has on the reward rate. We design novel heuristics that maximize the reward rate under power and thermal constraints, considering the interactions between the computing and cooling systems.
As part of the third contribution, we design new techniques for a geographical load distribution problem. That is, our proposed techniques intelligently distribute the workload to data centers located in different geographical regions that have varying energy prices and amounts of available renewable energy. The novel techniques we propose use knowledge of co-location interference, thermal models, varying energy prices, and available renewable energy at each data center to minimize monetary energy costs while ensuring all tasks in the workload are completed.
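The cost trade-off in this problem can be illustrated with a deliberately simplified greedy sketch. It assumes the workload is a divisible energy demand in kWh, that renewable energy is free, and that grid energy is billed at a flat per-center price. It ignores the co-location interference and thermal models that the proposed techniques use. The function name `distribute` and the dictionary fields are hypothetical.

```python
def distribute(workload_kwh, centers):
    """Greedy split of an energy demand across data centers.

    centers: dict mapping name -> {"capacity": kWh, "renewable": kWh,
                                   "price": $ per grid kWh}  (assumed fields)
    Returns (load per center in kWh, total grid cost in $).
    """
    load = {name: 0.0 for name in centers}
    remaining = workload_kwh

    # Pass 1: use renewable energy everywhere, since it is assumed to be free.
    for name, c in centers.items():
        take = min(remaining, c["renewable"], c["capacity"])
        load[name] += take
        remaining -= take

    # Pass 2: buy grid energy from the cheapest centers first.
    for name in sorted(centers, key=lambda n: centers[n]["price"]):
        take = min(remaining, centers[name]["capacity"] - load[name])
        load[name] += take
        remaining -= take

    if remaining > 1e-9:
        raise ValueError("workload exceeds total capacity")

    cost = sum(c["price"] * max(0.0, load[name] - c["renewable"])
               for name, c in centers.items())
    return load, cost
```

The capacity check mirrors the requirement that all tasks in the workload are completed: the sketch raises an error instead of silently dropping work.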
Our final contribution is a new energy- and thermal-aware runtime framework designed to maximize the reward earned from completing individual tasks by their deadlines within energy and thermal constraints. Thermal-aware resource management strategies often consult thermal models to intelligently determine which cores in the HPC facility should be assigned workloads. However, the time required to perform the thermal model calculations can be prohibitive in a runtime environment. Therefore, we propose a novel offline-assisted online resource management technique in which the online resource manager uses information obtained from offline-generated solutions to help in its thermal-aware decision making.
Adviser: HJ Siegel
Co-Adviser: Sudeep Pasricha
Non-ECE Member: Darrell Whitley
Member 3: Anthony A. Maciejewski
Additional Members: N/A
Publications: Mark A. Oxley, Sudeep Pasricha, Howard Jay Siegel, and Anthony A. Maciejewski, “Energy and Deadline Constrained Robust Stochastic Static Resource Allocation,” 1st Workshop on Power and Energy Aspects of Computation (PEAC 2013), pp. 761-771, Warsaw, Poland, Sep. 2013.
Mark A. Oxley, Eric Jonardi, Sudeep Pasricha, Anthony A. Maciejewski, Gregory A. Koenig, and Howard Jay Siegel, “Thermal, Power, and Co-location Aware Resource Allocation in Heterogeneous High Performance Computing Systems,” 5th International Green Computing Conference (IGCC 2014), 10 pp., Dallas, TX, Nov. 2014.
Mark A. Oxley, Sudeep Pasricha, Anthony A. Maciejewski, Howard Jay Siegel, Jonathan Apodaca, Dalton Young, Luis Diego Briceño, Jay Smith, Shirish Bahirat, Bhavesh Khemka, Adrian Ramirez, and Yong Zou, “Makespan and Energy Robust Stochastic Static Resource Allocation of a Bag-of-Tasks to a Heterogeneous Computing System,” IEEE Transactions on Parallel and Distributed Systems, Vol. 26, No. 10, pp. 2791-2805, Oct. 2015.
Program of Study: ECE520 ECE554 ECE514 ECE561 MATH 510 CS545 CS555 CS645