
The International Conference for High Performance Computing, Networking, Storage, and Analysis

Exploring Flexible Communications for Streamlining DNN Ensemble Training Pipelines


Authors: Randall Pittman (North Carolina State University), Hui Guan (North Carolina State University), Xipeng Shen (North Carolina State University), Seung-Hwan Lim (Oak Ridge National Laboratory), Robert M. Patton (Oak Ridge National Laboratory)

Abstract: Parallel training of a Deep Neural Network (DNN) ensemble on a cluster of nodes is a common practice for training multiple models in order to construct a model with higher prediction accuracy. Existing ensemble training pipelines can perform many redundant operations, resulting in unnecessary CPU usage and even poor pipeline performance. Removing these redundancies requires pipelines with more communication flexibility than existing DNN frameworks provide.

This project investigates a series of designs to improve pipeline flexibility and adaptivity while also increasing performance. We implement our designs using TensorFlow with Horovod and test them on several large DNNs. Our results show that the CPU time spent during training is reduced by 2-11X. Furthermore, our implementation can achieve up to 10X speedups when CPU core limits are imposed. Our best pipeline also reduces the average power draw of the ensemble training process by 5-16%.
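To make the problem setting concrete, the sketch below shows a conventional baseline for DNN ensemble training with TensorFlow and Horovod, in which each Horovod rank trains one ensemble member and independently loads and preprocesses the same input data. This is a generic illustration of the redundancy the abstract describes, not the authors' pipeline; the candidate architectures, dataset, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumed setup, not the paper's implementation):
# one Horovod rank per ensemble member, each with its own input pipeline.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process (rank) per ensemble member

# Pin each process to a single GPU, if GPUs are available.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank() % len(gpus)], "GPU")

# Each rank picks a different architecture from a hypothetical candidate list.
candidates = [
    tf.keras.applications.ResNet50,
    tf.keras.applications.MobileNetV2,
    tf.keras.applications.DenseNet121,
]
build_model = candidates[hvd.rank() % len(candidates)]
model = build_model(weights=None, classes=10, input_shape=(32, 32, 3))
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Every rank redundantly loads and preprocesses the same data -- the kind of
# duplicated CPU work that a more flexible pipeline could eliminate.
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
model.fit(x_train / 255.0, y_train, batch_size=128, epochs=1,
          verbose=1 if hvd.rank() == 0 else 0)
```

Launched with, e.g., `horovodrun -np 3 python ensemble_baseline.py`, each rank trains its member in isolation; the redundant per-rank data loading is the CPU overhead that the flexible-communication designs in this work target.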

