Holistic Root Cause Analysis of Node Failures in Production HPC

SC18 Proceedings

Student: Anwesha Das (North Carolina State University)
Supervisor: Frank Mueller (North Carolina State University)

Abstract: Production HPC clusters suffer failures that waste computation and resources. Despite the availability of various failure detection and prediction schemes, sustained resilience requires a comprehensive understanding of how nodes fail across the components and layers of the system. This work performs a holistic root cause diagnosis of node failures, using a measurement-driven approach on contemporary system logs, that can help vendors and system administrators support exascale resilience.

Our work shows that lead times can be increased by at least a factor of 5 if external subsystem correlations are considered, as opposed to considering the events of a specific node in isolation. Moreover, when sensor measurement outliers and interconnect-related failures are detected, triggering automated recovery events can exacerbate the situation if the recovery is unsuccessful.

ACM-SRC Semi-Finalist: no

Poster: PDF
Poster Summary: PDF
Reproducibility Description Appendix: PDF
