SC18 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

A Year in the Life of a Parallel File System


Authors: Glenn K. Lockwood (Lawrence Berkeley National Laboratory), Shane Snyder (Argonne National Laboratory), Teng Wang (Lawrence Berkeley National Laboratory), Suren Byna (Lawrence Berkeley National Laboratory), Philip Carns (Argonne National Laboratory), Nicholas J. Wright (Lawrence Berkeley National Laboratory)

Abstract: I/O performance is a critical aspect of data-intensive scientific computing. We seek to advance the state of the practice in understanding and diagnosing I/O performance issues through investigation of a comprehensive I/O performance data set that captures a full year of production storage activity at two leadership-scale computing facilities. We demonstrate techniques to identify regions of interest, perform focused investigations of both long-term trends and transient anomalies, and uncover the contributing factors that lead to performance fluctuation.

We find that a year in the life of a parallel file system comprises distinct regions of long-term performance variation as well as short-term performance transients. We demonstrate how systematic identification of these performance regions, combined with comprehensive analysis, allows us to isolate the factors contributing to different performance maladies at different time scales. From this, we present specific lessons learned and important considerations for HPC storage practitioners.




