CPU Overheating Characterization in HPC Systems: a Case Study
Authors: Marc Platini (University of Grenoble, Atos)
Abstract: With the increase in size of supercomputers, the number of abnormal events also increases. Some of these events might lead to an application failure. Others might simply impact the system efficiency. CPU overheating is one such event that decreases the system efficiency: when a CPU overheats, it reduces its frequency. This paper studies the problem of CPU overheating in supercomputers. In the first part, we analyze data collected over one year on a supercomputer of the Top500 list to understand under which conditions CPU overheating occurs. Our analysis shows that overheating events are due to some specific applications. In the second part, we evaluate the impact of such overheating events on the performance of MPI applications. Using 6 representative HPC benchmarks, we show that for a majority of the applications, a frequency drop on one CPU impacts the execution time of distributed runs proportionally to the duration and to the extent of the frequency drop.
Workshop: Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS)