Abstract: Soft errors, such as silent data corruptions (SDCs), hinder the correctness of large-scale scientific applications. Ghost replication (GR) is proposed herein as the first SDC detector that relies on the fast error propagation inherent to applications that employ the smoothed particle hydrodynamics (SPH) method. GR follows a two-step selective replication scheme. First, one algorithm selects which particles to replicate on a different process. Then, a second algorithm detects SDCs by comparing the data of the selected particles with the data of their ghosts. The overhead and scalability of the proposed approach are assessed through a set of strong-scaling experiments conducted on a large HPC system under error-free conditions, using upwards of 3,000 cores. The results show that GR achieves recall and precision similar to those of full replication methods at only a fraction of the cost, with detection rates of 91–99.9%, no false positives, and an overhead of 1–10%.
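The two-step scheme described in the abstract can be illustrated with a minimal, single-process sketch. It is not the authors' implementation. The abstract does not specify the selection policy, the comparison rule, or any API, so the random sampling, the relative tolerance, and the names `select_particles` and `detect_sdc` are all assumptions made for illustration. Ghost copies are held in a plain dictionary rather than on a different process.

```python
import math
import random


def select_particles(particle_ids, fraction, seed=0):
    """Step 1 (assumed policy): randomly pick a fraction of particle IDs to replicate."""
    ids = list(particle_ids)
    rng = random.Random(seed)
    k = max(1, int(len(ids) * fraction))
    return sorted(rng.sample(ids, k))


def detect_sdc(primary, ghosts, rel_tol=1e-12):
    """Step 2 (assumed rule): flag particles whose value differs from their ghost copy."""
    suspects = []
    for pid, ghost_value in ghosts.items():
        if not math.isclose(primary[pid], ghost_value, rel_tol=rel_tol, abs_tol=0.0):
            suspects.append(pid)
    return sorted(suspects)


# Hypothetical per-particle data (for example, a density value per particle).
particles = {pid: 1.0 + 0.01 * pid for pid in range(100)}

# Replicate 10% of the particles as ghosts; here the "other process" is just a dict.
selected = select_particles(particles.keys(), fraction=0.1)
ghosts = {pid: particles[pid] for pid in selected}

clean_report = detect_sdc(particles, ghosts)

# Inject a corruption into one replicated particle and compare again.
victim = selected[0]
particles[victim] *= 2.0
corrupt_report = detect_sdc(particles, ghosts)

print("clean run flagged:", clean_report)
print("corrupted run flagged victim:", corrupt_report == [victim])
```

The sketch only compares replicated particles directly. It does not model the error propagation through neighboring particles that, according to the abstract, lets GR catch corruptions while replicating only a subset.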
Best Poster Finalist (BP): no
Poster summary: PDF
Reproducibility Description Appendix: PDF