Presentation
GPU Age-Aware Scheduling to Improve the Reliability of Leadership Jobs on Titan
Session: Resilience
Event Type
Paper
TP
GPUs
Resiliency
State of the Practice
System Software
Time: Tuesday, November 13th, 10:30am - 11am
Location: C141/143/149
Description: The increasing rate of failures on the Oak Ridge Leadership Computing Facility's (OLCF) Titan supercomputer resulted in the replacement of 50% of its GPUs between 2015 and 2017. The largest jobs, also known as "leadership jobs," continued to experience increased application failures. These jobs contained significant numbers of both low-failure-rate and high-failure-rate GPUs. The impact of these failures was felt more by leadership jobs because of their longer wait times, longer runtimes, and higher charge rates. In this work, we designed techniques to increase the use of low-failure GPUs in leadership jobs through targeted resource allocation. We employed two complementary techniques, updating both the system ordering and the allocation mechanisms. In simulation, applying these techniques resulted in a 33% increase in low-failure GPU hours assigned to leadership jobs. Our GPU Age-Aware Scheduling has been used in production on Titan since July 2017.
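The two complementary ideas in the description (a node ordering plus an allocation rule) can be illustrated with a minimal sketch. This is not the scheduler used on Titan: the `Node` type, the `low_failure` flag, and the rule that leadership jobs take nodes from the front of the ordering while smaller jobs take them from the back are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int
    low_failure: bool  # hypothetical flag: True if the node holds a low-failure GPU


def order_nodes(nodes):
    """Assumed ordering: low-failure nodes first (the sort is stable within each class)."""
    return sorted(nodes, key=lambda n: not n.low_failure)


def allocate(free_nodes, size, leadership):
    """Assumed allocation rule: leadership jobs draw from the low-failure end
    of the ordering, all other jobs draw from the opposite end."""
    if size > len(free_nodes):
        raise ValueError("not enough free nodes")
    ordered = order_nodes(free_nodes)
    if leadership:
        chosen = ordered[:size]
    else:
        chosen = ordered[len(ordered) - size:]
    chosen_ids = {n.node_id for n in chosen}
    remaining = [n for n in ordered if n.node_id not in chosen_ids]
    return chosen, remaining


# Usage: eight free nodes, every other one flagged as low-failure.
nodes = [Node(i, low_failure=(i % 2 == 0)) for i in range(8)]
big_job, free = allocate(nodes, 4, leadership=True)
small_job, free = allocate(free, 2, leadership=False)
```

Under these assumptions, the leadership job receives only low-failure nodes as long as enough of them are free, and small jobs consume the high-failure end first, which keeps low-failure nodes available for later leadership jobs.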