Multi-Client DeepIO for Large-Scale Deep Learning on HPC Systems | SC18 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Multi-Client DeepIO for Large-Scale Deep Learning on HPC Systems

Authors: Yue Zhu (Florida State University), Fahim Chowdhury (Florida State University), Huansong Fu (Florida State University), Adam Moody (Lawrence Livermore National Laboratory), Kathryn Mohror (Lawrence Livermore National Laboratory), Kento Sato (Lawrence Livermore National Laboratory), Weikuan Yu (Florida State University)

Abstract: With the growth of computational power, leadership High-Performance Computing (HPC) systems can train deep neural networks (DNNs) on larger datasets more efficiently. On HPC systems, a training dataset resides on a parallel file system or on node-local storage devices. However, not all HPC clusters have node-local storage, and large mini-batch sizes stress the read performance of parallel file systems because large datasets cannot fit in file system caches. Training DNNs with large datasets on HPC systems therefore remains a challenge.

In prior work, we proposed DeepIO to mitigate this I/O pressure. DeepIO is designed to assist the mini-batch generation of TensorFlow. However, DeepIO does not support multiple training workers on a single compute node. We address this gap by modifying the DeepIO framework; we then evaluate multi-client DeepIO performance against state-of-the-art in-memory file systems, compare DeepIO with TensorFlow's data loading API, and explore the potential of DeepIO in DNN training.
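As background, the shuffled mini-batch access pattern that stresses read performance can be sketched in plain Python. This is an illustrative sketch, not DeepIO's or TensorFlow's actual API: each epoch reads every sample exactly once in a freshly shuffled order, so a dataset larger than memory gets little benefit from file system caching.

```python
import random

def minibatch_indices(num_samples, batch_size, seed=0):
    """Yield shuffled mini-batches of sample indices for one epoch.

    Mimics the random-read access pattern of DNN input pipelines:
    every sample is touched once per epoch, in a new random order,
    which defeats caching when the dataset exceeds memory.
    """
    rng = random.Random(seed)
    order = list(range(num_samples))
    rng.shuffle(order)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]

# Example: 10 samples, batch size 4 -> three batches (sizes 4, 4, 2).
batches = list(minibatch_indices(10, 4))
```

With multiple training workers per node, each worker would draw its own stream of such batches, multiplying the concurrent random reads that the storage layer must serve.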

Best Poster Finalist (BP): no

