SC18 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS)

Fault Tolerant Cholesky Factorization on GPUs

Authors: Parameswaran Ramanathan (University of Wisconsin)

Abstract: Direct Cholesky-based solvers are typically used to solve large linear systems where the coefficient matrix is symmetric positive definite. These solvers solve such systems faster than more general alternatives such as LU and QR solvers. In recent years, graphics processing units (GPUs) have become a popular platform for scientific computing applications, and are increasingly being used as major computational units in supercomputers. However, GPUs are susceptible to transient faults caused by events such as alpha particle strikes and power fluctuations. As a result, the probability of an error increases as more GPU computing nodes are used. In this paper, we introduce two efficient fault tolerance schemes for the Cholesky factorization method, and study their performance using a direct Cholesky solver in the presence of faults. Using a transient fault injection mechanism for NVIDIA GPUs, we compare our schemes with a traditional checksum fault tolerance technique and show that our proposed schemes have superior performance, good error coverage, and low overhead.
