Analytic Based Monitoring of High Performance Computing Applications
Time: Tuesday, November 13th, 4pm - 4:30pm
Description: The complexity of High Performance Computing (HPC) systems and the innate premium on system efficiency necessitate the use of automated tools to monitor not only system-level health and status, but also job performance. Current vendor-provided and third-party monitoring tools, such as Nagios or Ganglia, enable system-level monitoring using features that reflect the state of system resources. None of these tools, however, is designed to determine the health and status of a user’s application, or job, as it executes.
This presentation introduces the concept of job-level, analytics-based monitoring using system features external to the job, like those reported by Ganglia. Preliminary results show that these features carry enough information to characterize key behaviors of an executing job when incorporated into a job-specific, application-neutral analytic model. Transitions between computational phases, the onset of a load imbalance, and anomalous activity on a compute node may each be detected using this approach.
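To make the idea of detecting anomalous activity on a compute node concrete, the following is a minimal sketch and not the presenters' model. It assumes a single system-level metric (a hypothetical one-minute load average per node, of the kind Ganglia reports) and applies a generic robust outlier test, the modified z-score based on the median and the median absolute deviation. The function name, node names, sample values, and threshold are all illustrative assumptions.

```python
from statistics import median


def flag_anomalous_nodes(load_by_node, threshold=3.5):
    """Return the nodes whose metric deviates strongly from the job's other nodes.

    Illustrative only: uses the modified z-score (median and median absolute
    deviation), a generic outlier test, not the analytic model from the talk.
    """
    values = list(load_by_node.values())
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # At least half the nodes report the same value, so the score is
        # undefined; this sketch flags nothing in that case.
        return []
    return sorted(
        node
        for node, value in load_by_node.items()
        if 0.6745 * abs(value - center) / mad > threshold
    )


# Hypothetical one-minute load averages for the nodes of a single job.
sample = {
    "node01": 15.8,
    "node02": 16.1,
    "node03": 16.0,
    "node04": 3.2,
    "node05": 15.9,
}
print(flag_anomalous_nodes(sample))
```

Comparing each node against the other nodes of the same job, rather than against a fixed system-wide limit, is one simple way to keep such a check job-specific without knowing anything about the application itself.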