SC18 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS)

Toward Ad Hoc Recovery For Soft Errors

Authors: Leonardo Bautista-Gomez (Barcelona Supercomputing Center)

Abstract: The coming exascale era is a great opportunity for high performance computing (HPC) applications. However, the high failure rates expected on these systems will jeopardize the successful completion of application runs. Bit-flip errors in dynamic random access memory (DRAM) account for a noticeable share of the failures in supercomputers. Hardware mechanisms such as error-correcting codes (ECC) can detect and correct single-bit errors and can detect some multi-bit errors, while others go undetected. Unfortunately, a detected multi-bit error will usually force the application to terminate and lead to a global restart. Thus, software-level strategies are needed to tolerate these types of faults more efficiently and to avoid a global restart. In this work, we extend the FTI checkpointing library to facilitate the implementation of custom recovery strategies for MPI applications, minimizing the overhead introduced when coping with soft errors. The new functionalities are evaluated by implementing local forward recovery on three HPC benchmarks with different reliability requirements. Our results demonstrate a reduction in recovery times of up to 14%.

