SC18 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

SpotSDC: an Information Visualization System to Analyze Silent Data Corruption

Authors: Zhimin Li (University of Utah), Harshitha Menon (Lawrence Livermore National Laboratory), Yarden Livnat (University of Utah), Kathryn Mohror (Lawrence Livermore National Laboratory), Valerio Pascucci (University of Utah)

Abstract: Aggressive technology scaling trends are expected to make the hardware of HPC systems more susceptible to transient faults. Transient faults in hardware may be masked without affecting the program output, cause a program to crash, or lead to silent data corruptions (SDC). While fault injection studies can give an overall resiliency profile for an application, without a good visualization tool it is difficult to summarize and highlight critical information obtained. In this work, we design SpotSDC, a visualization system to analyze a program's resilience characteristics to SDC. SpotSDC provides an overview of the SDC impact on an application by highlighting regions of code that are most susceptible to SDC and will have a high impact on the program's output. SpotSDC also enables users to visualize the propagation of error through an application execution.
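The three fault outcomes named in the abstract (masked, crash, SDC) can be illustrated with a toy fault injection campaign. The sketch below is not SpotSDC or the authors' fault injector: the names `flip_bit`, `toy_kernel`, `classify`, and `campaign`, the single-bit-flip fault model on input values, and the relative tolerance used to decide whether an output counts as corrupted are all assumptions made for illustration.

```python
import math
import struct
from collections import Counter


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0..63) of the IEEE 754 double encoding of `value`."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def toy_kernel(data):
    """Hypothetical stand-in for an application: sum of square roots."""
    return sum(math.sqrt(x) for x in data)


def classify(data, index, bit, rel_tol=1e-9):
    """Inject one bit flip into data[index] and classify the outcome."""
    golden = toy_kernel(data)  # fault-free reference output
    faulty = list(data)
    faulty[index] = flip_bit(faulty[index], bit)
    try:
        result = toy_kernel(faulty)
    except ValueError:
        # math.sqrt rejects negative input: the run fails visibly.
        return "crash"
    # Assumed criterion: outputs within rel_tol of the reference are masked;
    # anything else (including inf or NaN) is a silent data corruption.
    return "masked" if math.isclose(result, golden, rel_tol=rel_tol) else "sdc"


def campaign(data):
    """Exhaustively flip every bit of every input value and count outcomes."""
    return Counter(
        classify(data, i, b) for i in range(len(data)) for b in range(64)
    )


if __name__ == "__main__":
    print(campaign([1.0, 4.0, 9.0]))
```

In this toy setting, flips in the low mantissa bits tend to be masked, sign-bit flips crash the run, and flips in the exponent or high mantissa bits silently change the result. Per-location counts of this kind are the sort of data a resilience visualization would summarize.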

Best Poster Finalist (BP): no

