Monitoring Large-Scale HPC Systems: Extracting and Presenting Meaningful System and Application Insights

Authors: Ann Gentile (Sandia National Laboratories), Jim Brandt (Sandia National Laboratories), Martin Schulz (Technical University Munich), Mike Showerman (National Center for Supercomputing Applications), Joe Greenseid (Cray Inc), Ayse Coskun (Boston University)

Abstract: We explore opportunities and challenges in extracting and presenting meaningful insights into HPC System and Application behavior via monitoring. Panelists from large data/platform sites will interact with the audience on: Data stores to support performant analyses, Exploration of data to discover meaningful relationships, Machine learning, and more. Further, we discuss results from two multi-site reports on the "state of the practice" in HPC monitoring.

We invite system administrators, analysis and visualization developers, and application users and developers to facilitate community progress by identifying tools, techniques, gaps, and requirements for exploratory and production scenarios. Results will be posted at: https://sites.google.com/site/monitoringlargescalehpcsystems.

Long Description: This BoF consists of a deep-dive on a critical topic in HPC monitoring, followed by a state of the practice discussion and questionnaire.

I. Deep-dive:

To ensure efficient operation, HPC sites need more data on the performance of their systems and applications, and more insights from that data, than ever before. For example, users seeking to understand application performance need to capture their applications' resource demands and the impact of contention on shared resources (e.g., network, filesystem), while system administrators need to understand conditions in the system and how those conditions affect application performance, system throughput, and system stability.

Although progress is being made on collecting data, progress on extracting meaning from that data has been lagging. Reasons include: a) ineffective analytical and visual methods for obtaining the information sought from raw data, b) lack of semantics for the data on particular architectures, as well as for desired behavior, c) analytical and architectural difficulties in handling temporally distributed, multi-dimensional data, and d) data gaps due to access and security constraints.

We explore techniques, tools, requirements, and gaps in essential architectural, analytic, and user-facing capabilities needed for gaining meaningful insights from HPC Monitoring. We begin with short presentations in key areas:

1) Mike Showerman (NCSA) - Data stores to support performant analysis,

2) Jim Brandt (SNL) - Exploration of data to discover meaningful relationships,

3) Emre Ates (BU) - Machine learning for understanding application behavior,

4) Joe Greenseid (Cray) - Requirements for understanding in production systems.

Panelists were chosen based on demonstrated production expertise. Discussion with the audience will follow.

II. State of the Practice:

Andre Brinkmann (JGU Mainz) will present and lead discussion on results of two reports:

1) Survey of tools used and practices at HPC sites across Germany,

2) Monitoring needs and requirements based on approaches and development at ten international sites.

Discussion will include potential for standardization of formats, tools, and analyses.

The BoF will include an audience questionnaire on monitoring practices and implementations.

Notes:

Most time will be spent engaging with the audience, sharing experiences to identify useful tools, techniques, and gaps to drive requirements and solutions for both production and exploratory scenarios. An artifact report detailing outcomes and opinions from the discussions and questionnaire results will be posted at the community web site: https://sites.google.com/site/monitoringlargescalehpcsystems.

We encourage attendance of a cross-section of disciplines including: a) system administrators seeking to understand system state and performance and to improve site implementations, b) users seeking to understand application performance issues, c) monitoring architecture designers seeking requirements, d) data analysts who can share techniques and learn our analysis needs, and e) visualization developers who can inform us on representations and learn our visualization needs.

BoF leaders are active in promoting community building, including organizing: the community web site and mailing list; the Cray System Monitoring Working Group; the HPCMASPA workshop at IEEE Cluster; SIAM minisymposia on monitoring analysis; and multiple successful BoFs at SC, CUG, etc.

This BoF series started at SC14 and has explored various aspects of HPC monitoring. SC17 attendance exceeded 100. We request 1.5 hours.

URL: https://sites.google.com/site/monitoringlargescalehpcsystems/
