ACM Student Research Competition Poster: Holistic Root Cause Analysis of Node Failures in Production HPC
Event Type: ACM Student Research Competition, Poster
Registration Categories: TP, EX
Time: Tuesday, November 13th, 8:30am - 5pm
Description: Production HPC clusters endure failures that waste computation and resources. Despite the availability of various failure detection and prediction schemes, sustained resilience requires a comprehensive understanding of how nodes fail across the components and layers of the system. This work performs a holistic root cause diagnosis of node failures using a measurement-driven approach on contemporary system logs, which can help vendors and system administrators support exascale resilience.

Our work shows that lead times can be increased at least fivefold when external subsystem correlations are considered, rather than considering the events of a specific node in isolation. Moreover, when sensor measurement outliers and interconnect-related failures are detected, triggering automated recovery events can exacerbate the situation if recovery is unsuccessful.