Characterization of the Impact of Soft Errors on Iterative Methods
Authors: Burcu Mutlu (Pacific Northwest National Laboratory, Polytechnic University of Catalonia)
Abstract: Soft errors caused by transient bit flips can lead to silent data corruption, which can significantly alter an application’s behavior. When a transient bit flip affects a hardware component, the application is said to be impacted by a soft error. When a soft error escapes hardware detection and reaches the application state, it can produce incorrect results or significantly affect execution time. Recent architectural trends, such as near-threshold voltage operation and constrained power budgets, exacerbate the frequency and impact of soft errors. This has motivated the design of numerous soft error detection and correction techniques, which focus on localizing application vulnerabilities and then correcting the resulting errors using micro-architectural, architectural, compilation-based, or application-level techniques.
A broad array of techniques has been designed to understand application behavior under soft errors and to detect, isolate, and correct application state affected by them. The first step toward tolerating soft errors is understanding how an application behaves when they occur, which helps clarify where error detection and correction techniques are needed.
In this study, we designed a deterministic application-level error injection method that allows us to explore the error injection space across several iterative solvers. We focus on iterative methods because they are a crucial component of scientific applications and consume a significant fraction of supercomputing time. We used real-world datasets to evaluate and characterize the vulnerabilities and strengths of each solver under identical settings.
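The idea of deterministic, application-level injection can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not name the solvers studied or describe the injection mechanism, so the Jacobi iteration, the 3x3 test system, and the names `flip_bit`, `jacobi`, and `fault` are hypothetical. The sketch flips one chosen bit of one chosen solution entry at one chosen iteration, so each point in the injection space (iteration, index, bit) can be replayed exactly.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding flipped."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 63]")
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def jacobi(A, b, tol=1e-10, max_iter=5000, fault=None):
    """Jacobi iteration for A x = b (illustrative solver, not from the study).

    `fault` is None or a hypothetical (iteration, index, bit) triple: at the
    start of that iteration, the given bit of x[index] is flipped.
    Returns (x, iterations_used, converged).
    """
    n = len(b)
    x = [0.0] * n
    for k in range(max_iter):
        if fault is not None and fault[0] == k:
            _, idx, bit = fault
            x[idx] = flip_bit(x[idx], bit)  # deterministic injection point
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        diff = max(abs(x_new[i] - x[i]) for i in range(n))
        x = x_new
        if diff < tol:
            return x, k + 1, True
    return x, max_iter, False


# Small diagonally dominant system, chosen only for illustration.
A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0]

baseline = jacobi(A, b)
print("fault-free iterations:", baseline[1])

# Sweep a few bit positions at a fixed iteration and index.
for bit in (0, 30, 51, 62, 63):
    x, iters, converged = jacobi(A, b, fault=(5, 0, bit))
    print(f"bit {bit:2d}: iterations={iters}, converged={converged}")
```

Because the fault site is an explicit argument rather than a random draw, two runs with the same triple produce identical results, which makes it possible to compare solvers under identical settings. A fuller sweep would iterate over all iterations, indices, and bit positions and record the change in iteration count or final error for each.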