SC18 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Evaluating and Accelerating High-Fidelity Error Injection for HPC


Authors: Chun-Kai Chang (University of Texas), Sangkug Lym (University of Texas), Nicholas Kelly (University of Texas), Michael B. Sullivan (Nvidia Corporation), Mattan Erez (University of Texas)

Abstract: We address two important concerns in the analysis of the behavior of applications in the presence of hardware errors: (1) when is it important to model how hardware faults lead to erroneous values (instruction-level errors) with high fidelity, as opposed to using simple bit-flipping models, and (2) how to enable fast high-fidelity error injection campaigns, in particular when error detectors are employed. We present and verify a new nested Monte Carlo methodology for evaluating high-fidelity gate-level fault models and error-detector coverage, which is orders of magnitude faster than current approaches. We use that methodology to demonstrate that, without detectors, simple error models suffice for evaluating errors in 9 HPC benchmarks.
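For readers unfamiliar with the "simple bit-flipping models" the abstract contrasts with gate-level fault models, the following is a minimal sketch of a single-bit-flip injection campaign. It is a generic illustration, not the authors' tool or their nested Monte Carlo methodology: the function names `flip_bit` and `run_campaign`, the summation standing in for an application kernel, and the tolerance used to classify an output as corrupted are all assumptions made for this example.

```python
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with bit `bit` (0 = least significant) of its
    IEEE 754 double-precision encoding inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit index must be in [0, 64)")
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return corrupted


def run_campaign(data, trials, seed=0, tol=1e-9):
    """Estimate the fraction of single-bit flips that corrupt the output.

    Hypothetical setup: the "application" is a plain sum over `data`
    (assumed to have a nonzero sum), and a run counts as corrupted when
    its result differs from the fault-free result by more than a
    relative tolerance `tol`.
    """
    rng = random.Random(seed)
    golden = sum(data)  # fault-free reference output
    corrupted_runs = 0
    for _ in range(trials):
        faulty = list(data)
        index = rng.randrange(len(faulty))  # which operand to corrupt
        faulty[index] = flip_bit(faulty[index], rng.randrange(64))
        result = sum(faulty)
        # NaN and infinite results fail this comparison and so count as corrupted.
        if not (abs(result - golden) <= tol * abs(golden)):
            corrupted_runs += 1
    return corrupted_runs / trials


if __name__ == "__main__":
    rate = run_campaign([1.0, 2.0, 3.0], trials=10_000)
    print(f"estimated corruption rate: {rate:.3f}")
```

A model of this kind picks the corrupted bit uniformly at random, whereas a gate-level model derives the erroneous value from a fault inside the hardware unit that produced it; the abstract's first question is when that difference matters.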



