Paper: PRISM: Predicting Resilience of GPU Applications Using Statistical Methods
Event Type
Paper
Registration Categories
TP
Tags
Algorithms
Architectures
GPUs
Linear Algebra
Networks
Resiliency
Time
Thursday, November 15th, 2:30pm - 3pm
Location
C141/143/149
Description
As Graphics Processing Units (GPUs) become more pervasive in HPC and safety-critical domains, ensuring that GPU applications can be protected from data corruption grows in importance. Despite prior efforts to mitigate errors, we still lack a clear understanding of how resilient these applications are in the presence of transient faults. Due to the random nature of these faults, predicting whether they will alter the program output is a challenging problem. In this paper, we build a framework named PRISM, which uses a systematic approach to predict failures in GPU programs. PRISM extracts micro-architecture agnostic features to characterize program resiliency, which serve as predictors in our statistical model. PRISM enables us to predict failures in applications without running exhaustive fault-injection campaigns on a GPU, thereby reducing the error estimation effort. PRISM can also be used to gain insight into potential architectural support required to improve the reliability of GPU applications.