With the growth of computational power, leadership High-Performance Computing (HPC) systems can train deep neural networks (DNNs) on larger datasets more efficiently. On HPC systems, a training dataset typically resides on a parallel file system or on node-local storage devices. However, not all HPC clusters have node-local storage, and large mini-batch sizes stress the read performance of parallel file systems because large datasets cannot fit in file system caches. Thus, training DNNs with large datasets on HPC systems remains a challenge.
In prior work, we proposed DeepIO to mitigate this I/O pressure. DeepIO is designed to assist the mini-batch generation of TensorFlow. However, DeepIO does not support multiple training workers on a single compute node. We address this gap by modifying the DeepIO framework. We then evaluate multi-client DeepIO performance against state-of-the-art in-memory file systems, compare DeepIO with the TensorFlow data loading API, and explore the potential of DeepIO in DNN training.
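For context on what "mini-batch generation" entails, the following is a purely illustrative, stdlib-only Python sketch of the pattern a data loader such as DeepIO or the TensorFlow data loading API implements: shuffling sample indices each epoch and yielding fixed-size mini-batches to a training worker. It is not DeepIO's actual interface; the function and dataset names are hypothetical.

```python
import random

def minibatch_stream(samples, batch_size, seed=0):
    """Shuffle sample indices once per epoch and yield fixed-size mini-batches.

    Any trailing partial batch is dropped, as is common in DNN training loops.
    This is an illustrative sketch, not DeepIO's real API.
    """
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)
    for start in range(0, len(order) - batch_size + 1, batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]

# Hypothetical tiny dataset of 10 samples, consumed in batches of 4.
data = [f"sample-{i}" for i in range(10)]
batches = list(minibatch_stream(data, batch_size=4))
print(len(batches))       # 2 full batches; the partial batch of 2 is dropped
print(len(batches[0]))    # 4 samples per batch
```

In a real HPC setting, each training worker on a node would draw batches from a shared in-memory staging layer rather than re-reading the parallel file system, which is the read-amplification problem the abstract describes.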