SC18 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS)

Extending and Evaluating Fault-Tolerant Preconditioned Conjugate Gradient Methods

Authors: Carlos Pachajoa (University of Vienna)

Abstract: We compare and refine exact and heuristic fault-tolerance extensions for the preconditioned conjugate gradient (PCG) and the split preconditioner conjugate gradient (SPCG) methods for recovering from failures of compute nodes of large-scale parallel computers. In the exact state reconstruction (ESR) approach, which is based on a method proposed by Chen (2011), the solver keeps extra information from previous search directions of the (S)PCG solver, so that its state can be fully reconstructed if a node fails unexpectedly. ESR does not make use of checkpointing or external storage for saving dynamic solver data and has only negligible computation and communication overhead compared to the failure-free situation. In exact arithmetic, the reconstruction is exact, but in finite-precision computations, the number of iterations until convergence can differ slightly from the failure-free case due to rounding effects. We perform experiments to investigate the behavior of ESR in floating-point arithmetic and compare it to the heuristic linear interpolation (LI) approach by Langou et al. (2007) and Agullo et al. (2016), which does not have to keep extra information and thus has lower memory requirements. Our experiments illustrate that ESR, on average, has essentially zero overhead in terms of additional iterations until convergence, whereas the LI approach incurs much larger overheads.
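The linear interpolation (LI) recovery mentioned in the abstract can be illustrated with a small serial sketch. It is not the authors' implementation and does not cover ESR. It assumes a dense symmetric positive definite matrix stored as nested lists, a Jacobi preconditioner, and a "node failure" simulated by discarding a block of entries of the iterate. The lost block x_I is rebuilt from the surviving entries x_J by solving A_II x_I = b_I - A_IJ x_J, after which PCG is restarted from the repaired iterate. All function names (`pcg`, `li_recover`, `laplacian_1d`) are hypothetical.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(A, b, x0, tol=1e-10, max_iter=1000):
    """Jacobi-preconditioned CG (assumed preconditioner); returns (x, iterations)."""
    n = len(b)
    x = list(x0)
    r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
    z = [r[i] / A[i][i] for i in range(n)]
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for k in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, k
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

def li_recover(A, b, x, lost):
    """Rebuild the entries `lost` of x by solving A_II x_I = b_I - A_IJ x_J."""
    lost_set = set(lost)
    surviving = [j for j in range(len(b)) if j not in lost_set]
    A_II = [[A[i][j] for j in lost] for i in lost]
    rhs = [b[i] - sum(A[i][j] * x[j] for j in surviving) for i in lost]
    # A_II is a principal submatrix of an SPD matrix, so it is SPD as well
    # and the same PCG routine can solve the local system.
    x_I, _ = pcg(A_II, rhs, [0.0] * len(lost))
    repaired = list(x)
    for i, value in zip(lost, x_I):
        repaired[i] = value
    return repaired

def laplacian_1d(n):
    """Hypothetical test matrix: tridiagonal (-1, 2, -1), which is SPD."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]

# Simulated scenario: run a few iterations, lose a block, repair, restart.
n = 12
A = laplacian_1d(n)
b = [1.0] * n
x_partial, _ = pcg(A, b, [0.0] * n, max_iter=3)

lost = [4, 5, 6, 7]              # indices owned by the "failed node"
x_damaged = list(x_partial)
for i in lost:
    x_damaged[i] = 0.0           # the data on the failed node is gone

x_repaired = li_recover(A, b, x_damaged, lost)
x_final, restart_iters = pcg(A, b, x_repaired)
print("iterations after restart:", restart_iters)
```

Because the restart discards the previous search direction, the solver generally needs extra iterations compared to an undisturbed run. This is the kind of overhead that the abstract contrasts with ESR, which keeps additional information so that the full solver state can be reconstructed.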
