Lessons Learned from Memory Errors Observed Over the Lifetime of Cielo

SC18 Proceedings

Authors: Scott Levy (Sandia National Laboratories), Kurt B. Ferreira (Sandia National Laboratories), Nathan DeBardeleben (Los Alamos National Laboratory), Taniya Siddiqua (Advanced Micro Devices Inc), Vilas Sridharan (Advanced Micro Devices Inc), Elisabeth Baseman (Los Alamos National Laboratory)

Abstract: Maintaining the performance of high-performance computing (HPC) applications as failures increase is a major challenge for next-generation extreme-scale systems. Recent research demonstrates that hardware failures are expected to become more common due to increased component counts, reduced device-feature sizes, and tightly constrained power budgets. Few existing studies, however, have examined failures in the context of the entire lifetime of a single platform. In this paper, we analyze failure data collected over the entire lifetime of Cielo, a leadership-class HPC system. Our analysis reveals several key findings, including: (i) Cielo’s memory (DRAM and SRAM) exhibited no discernible aging effects; (ii) correctable memory faults are not predictive of future uncorrectable memory faults; (iii) developing more comprehensive logging facilities will improve failure analysis on future machines; and (iv) continued advances will be required to ensure current failure mitigation techniques remain a viable option for future platforms.
