ECE Seminar Series
Speaker: Justin Li
Affiliation: Department of Electrical and Computer Engineering, University of British Columbia
Day: Monday, March 5, 2018
Time: 11:00 am - 12:00 pm
Location: Clark A102
Abstract: Hardware errors are projected to increase drastically in modern computer systems due to shrinking feature sizes and increasing manufacturing variations. The impact of hardware faults on programs can be catastrophic and can lead to substantial financial and societal consequences. Error propagation is often the leading cause of catastrophic system failures and hence must be mitigated. Traditional hardware-only techniques for avoiding error propagation are energy hungry and hence not suitable for commodity systems. Researchers have proposed selective software-based protection techniques to prevent error propagation at lower cost. However, these techniques rely on expensive fault injection simulations to determine which parts of a program must be protected. A fault injection simulation artificially introduces a fault into a program's execution and observes failures (if any) once the execution completes. Thousands of such simulations must be performed to achieve statistical significance. This is time-consuming, as even a single execution of a High-Performance Computing (HPC) application may take a long time. In this talk, I propose both empirical and analytical approaches to identifying and mitigating error propagation without expensive fault injections. The key observation underlying my research is that only a small fraction of states are responsible for almost all error propagation in programs, and that the propagation falls into identifiable patterns that can be modeled. As a result, my proposed techniques are nearly as accurate as fault injection approaches in measuring the failure rates of programs, while being orders of magnitude faster. This allows developers to build low-cost fault-tolerant applications in an extremely efficient manner.
Bio: Guanpeng (Justin) Li is a Ph.D. candidate in the Department of Electrical and Computer Engineering at the University of British Columbia (UBC). He received his B.A.Sc. in Electrical and Computer Engineering from UBC in 2014. His research interests are in the area of computer dependability. He focuses on building cost-effective fault-tolerant applications and has published in several reputable peer-reviewed computer science conferences and journals, including the IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), the International Conference for High-Performance Computing, Networking, Storage and Analysis (SC), and ACM Transactions on Embedded Computing Systems (TECS).