search-icon
Paper
:
Lessons Learned from Memory Errors Observed Over the Lifetime of Cielo
Event Type
Paper
Registration Categories
TP
Tags
Performance
Resiliency
Tools
TimeWednesday, November 14th1:30pm - 2pm
LocationC146
DescriptionMaintaining the performance of high-performance computing (HPC) applications as failures increase is a major challenge for next-generation extreme-scale systems. Recent research demonstrates that hardware failures are expected to become more common due to increased component counts, reduced device-feature sizes, and tightly-constrained power budgets. Few existing studies, however, have examined failures in the context of the entire lifetime of a single platform. In this paper, we analyze failure data collected over the entire lifetime of Cielo, a leadership-class HPC system. Our analysis reveals several key findings, including: (i) Cielo’s memory (DRAM and SRAM) exhibited no discernible aging effects; (ii) correctable memory faults are not predictive of future uncorrectable memory faults; (iii) developing more comprehensive logging facilities will improve failure analysis on future machines; (iv) continued advances will be required to ensure current failure mitigation techniques remain a viable option for future platforms.
Archive
Back To Top Button