PRISM: Predicting Resilience of GPU Applications Using Statistical Methods (SC18 Proceedings)

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Authors: Charu Kalra (Northeastern University), Fritz Previlon (Northeastern University), Xiangyu Li (Northeastern University), Norman Rubin (Nvidia Corporation), David Kaeli (Northeastern University)

Abstract: As Graphics Processing Units (GPUs) become more pervasive in HPC and safety-critical domains, ensuring that GPU applications can be protected from data corruption grows in importance. Despite prior efforts to mitigate errors, we still lack a clear understanding of how resilient these applications are in the presence of transient faults. Due to the random nature of these faults, predicting whether they will alter the program output is a challenging problem. In this paper, we build a framework named PRISM, which uses a systematic approach to predict failures in GPU programs. PRISM extracts micro-architecture agnostic features to characterize program resiliency, which serve as predictors in our statistical model. PRISM enables us to predict failures in applications without running exhaustive fault-injection campaigns on a GPU, thereby reducing the error estimation effort. PRISM can also be used to gain insight into potential architectural support required to improve the reliability of GPU applications.
