Enabling HPC and Deep Learning Workloads at Extreme Scale in the Cloud
Clouds and Distributed Computing
Time: Tuesday, November 13th, 2pm - 2:30pm
Description: Independent research (Reuther et al., J. Parallel Distrib. Comput., 111, 2018, 76–92) underscores the importance of efficient workload management: “For both supercomputers and big data systems, the efficiency of the job scheduler represents a fundamental limit on the efficiency of the system.” However, enabling efficiency at extreme scale in the cloud, for workload management or other purposes, requires sophisticated integration and automation that also scales. By integrating deeply with AWS-specific APIs, Navops Launch fully leverages the capabilities of this public-cloud provider in a highly automated fashion. As a compelling proof point, Navops Launch makes routine the scaling of a compute cluster to more than 1,000,000 cores across 55,000 heterogeneous spot instances spanning three availability zones. As a consequence, even in demanding policy-based launching of cloud instances, heroics are no longer required to scale HPC and Deep Learning workloads to the extreme.