Student:
Supervisor: Frank Mueller (North Carolina State University)
Abstract: Production HPC clusters endure failures incurring computation and resource wastage. Despite the presence of various failure detection and prediction schemes, a comprehensive understanding of how nodes fail considering various components and layers of the system is required for sustained resilience. This work performs a holistic root cause diagnosis of node failures using a measurement-driven approach on contemporary system logs that can help vendors and system administrators support exascale resilience.
Our work shows that lead times can be increased by at least 5 times if external subsystem correlations are considered as opposed to considering the events of a specific node in isolation. Moreover, when detecting sensor measurement outliers and interconnect related failures, triggering automated recovery events can exacerbate the situation if recovery is unsuccessful.
ACM-SRC Semi-Finalist: no
Poster: PDF
Poster Summary: pdf
Reproducibility Description Appendix: PDF
Back to Poster Archive Listing