Authors:
Abstract: Maintaining the performance of high-performance computing (HPC) applications as failures increase is a major challenge for next-generation extreme-scale systems. Recent research indicates that hardware failures will become more common due to increased component counts, reduced device feature sizes, and tightly constrained power budgets. Few existing studies, however, have examined failures over the entire lifetime of a single platform. In this paper, we analyze failure data collected over the entire lifetime of Cielo, a leadership-class HPC system. Our analysis reveals several key findings, including: (i) Cielo’s memory (DRAM and SRAM) exhibited no discernible aging effects; (ii) correctable memory faults are not predictive of future uncorrectable memory faults; (iii) developing more comprehensive logging facilities will improve failure analysis on future machines; and (iv) continued advances will be required to ensure that current failure mitigation techniques remain viable for future platforms.