Workshop: Extending and Evaluating Fault-Tolerant Preconditioned Conjugate Gradient Methods
Event Type: Workshop
Registration Categories: W
Tags: Resiliency, Scientific Computing
Time: Friday, November 16th, 10:50am - 11:10am
Location: D174
Description: We compare and refine exact and heuristic fault-tolerance extensions for the preconditioned conjugate gradient (PCG) and the split preconditioner conjugate gradient (SPCG) methods for recovering from compute-node failures on large-scale parallel computers. In the exact state reconstruction (ESR) approach, which is based on a method proposed by Chen (2011), the solver keeps extra information from previous search directions of the (S)PCG solver, so that its state can be fully reconstructed if a node fails unexpectedly. ESR does not use checkpointing or external storage for saving dynamic solver data and has only negligible computation and communication overhead compared to the failure-free situation. In exact arithmetic, the reconstruction is exact, but in finite-precision computations, the number of iterations until convergence can differ slightly from the failure-free case due to rounding effects. We perform experiments to investigate the behavior of ESR in floating-point arithmetic and compare it to the heuristic linear interpolation (LI) approach by Langou et al. (2007) and Agullo et al. (2016), which does not have to keep extra information and thus has lower memory requirements. Our experiments illustrate that ESR, on average, has essentially zero overhead in terms of additional iterations until convergence, whereas the LI approach incurs much larger overheads.
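To make the heuristic LI idea concrete, the following is a minimal single-process sketch, not the authors' implementation. It assumes a Jacobi preconditioner, a small dense model matrix, and a full restart of PCG from the interpolated iterate after the failure; the names `pcg`, `li_recover`, `fail_at`, and `lost` are hypothetical. The lost block of the iterate is rebuilt by solving the local system A[lost, lost] x[lost] = b[lost] - A[lost, surviving] x[surviving]. ESR, which additionally retains data from previous search directions, is not shown here.

```python
import numpy as np

def li_recover(A, b, x, lost):
    """Rebuild the lost entries of x by linear interpolation (LI).

    Solves A[lost, lost] x[lost] = b[lost] - A[lost, surviving] x[surviving].
    The old values of x[lost] are never read.
    """
    surviving = np.setdiff1d(np.arange(len(b)), lost)
    rhs = b[lost] - A[np.ix_(lost, surviving)] @ x[surviving]
    x_new = x.copy()
    x_new[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
    return x_new

def pcg(A, b, x0, m_inv, tol=1e-10, max_iter=1000, fail_at=None, lost=None):
    """Jacobi-preconditioned CG; m_inv holds the inverse diagonal of A.

    If fail_at is set, the entries x[lost] are discarded after that
    iteration (simulated node failure), rebuilt with LI, and the
    solver is restarted from the interpolated iterate.
    """
    x = x0.copy()
    r = b - A @ x
    z = m_inv * r
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * b_norm:
            return x, k
        if fail_at == k:
            x[lost] = 0.0                  # simulated loss of one node's block
            x = li_recover(A, b, x, lost)  # heuristic recovery
            r = b - A @ x                  # restart: recompute all solver state
            z = m_inv * r
            p = z.copy()
            rz = r @ z
            continue
        z = m_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Hypothetical test problem: shifted 1D Laplacian (symmetric positive definite).
n = 200
A = 2.1 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
m_inv = 1.0 / np.diag(A)
lost = np.arange(50, 100)  # indices owned by the "failed node"

x_ok, it_ok = pcg(A, b, np.zeros(n), m_inv)
x_li, it_li = pcg(A, b, np.zeros(n), m_inv, fail_at=10, lost=lost)
print("iterations without failure:", it_ok)
print("iterations with failure and LI recovery:", it_li)
```

Because the restart discards the accumulated search-direction information, a sketch like this can need extra iterations after the failure; how many depends on the matrix, the lost block, and when the failure occurs.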