Abstract: As Graphics Processing Units (GPUs) become more pervasive in HPC and safety-critical domains, ensuring that GPU applications can be protected from data corruption grows in importance. Despite prior efforts to mitigate errors, we still lack a clear understanding of how resilient these applications are in the presence of transient faults. Due to the random nature of these faults, predicting whether they will alter the program output is a challenging problem. In this paper, we build a framework named PRISM, which uses a systematic approach to predict failures in GPU programs. PRISM extracts micro-architecture agnostic features to characterize program resiliency, which serve as predictors in our statistical model. PRISM enables us to predict failures in applications without running exhaustive fault-injection campaigns on a GPU, thereby reducing the error estimation effort. PRISM can also be used to gain insight into potential architectural support required to improve the reliability of GPU applications.
Back to Technical Papers Archive Listing