Communication-Efficient Parallelization Strategy for Deep Convolutional Neural Network Training
Time: Monday, November 12th, 10:30am - 11am
Description: Training modern Convolutional Neural Network (CNN) models is extremely time-consuming, and efficient parallelization plays a key role in finishing the training in a reasonable amount of time. The well-known parallel synchronous Stochastic Gradient Descent (SGD) algorithm suffers from high costs of inter-process communication and synchronization. To address these problems, the asynchronous SGD algorithm employs a master-slave model for parameter updates. However, it can result in a poor convergence rate due to gradient staleness. In addition, the master-slave model does not scale when running on a large number of compute nodes. In this paper, we present a communication-efficient gradient averaging algorithm for synchronous SGD, which adopts a few design strategies to maximize the degree of overlap between computation and communication. The time complexity analysis shows that our algorithm outperforms the traditional algorithms that use MPI allreduce-based communication. By training two popular deep CNN models, VGG-16 and ResNet-50, on the ImageNet dataset, our experiments on Cori Phase-I, a Cray XC40 supercomputer at NERSC, show that our algorithm can achieve up to 2516.36x speedup for VGG-16 and 2734.25x speedup for ResNet-50 when running on up to 8192 cores.
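The general idea of overlapping gradient averaging with the backward pass can be illustrated with a small single-process sketch. This is not the paper's algorithm: the toy loss, the function names (layer_gradient, average_across_workers, sync_sgd_step), and the use of a background thread in place of real inter-process communication are all assumptions made for illustration. As soon as one layer's gradients are available they are handed to a communication thread for averaging, while the main thread moves on to the next layer.

```python
from concurrent.futures import ThreadPoolExecutor


def layer_gradient(layer_weights, layer_target):
    # Gradient of the toy loss 0.5 * ||w - t||^2 with respect to w (assumed loss).
    return [w - t for w, t in zip(layer_weights, layer_target)]


def average_across_workers(grads):
    # Stand-in for an allreduce-style sum followed by division by the worker count.
    n = len(grads)
    return [sum(vals) / n for vals in zip(*grads)]


def sync_sgd_step(weights, worker_targets, lr):
    # weights: list of layers, each a list of floats.
    # worker_targets: one per-layer target list per simulated worker.
    num_layers = len(weights)
    pending = {}
    with ThreadPoolExecutor(max_workers=1) as comm:
        # The backward pass visits layers from last to first.
        for layer in reversed(range(num_layers)):
            grads = [layer_gradient(weights[layer], t[layer]) for t in worker_targets]
            # Hand this layer to the communication thread and continue
            # with the next layer without waiting for the average.
            pending[layer] = comm.submit(average_across_workers, grads)
        # Synchronize once, after all layers have been submitted.
        averaged = {layer: fut.result() for layer, fut in pending.items()}
    return [
        [w - lr * g for w, g in zip(weights[layer], averaged[layer])]
        for layer in range(num_layers)
    ]


weights = [[1.0, 2.0], [3.0]]
worker_targets = [
    [[0.0, 0.0], [1.0]],  # simulated worker 0
    [[2.0, 4.0], [3.0]],  # simulated worker 1
]
new_weights = sync_sgd_step(weights, worker_targets, lr=0.5)
print(new_weights)  # [[1.0, 2.0], [2.5]]
```

In a real distributed run, the background thread would be replaced by non-blocking collective communication across processes, and CPython threads would not provide true compute parallelism; the sketch only shows the ordering of work that makes overlap possible.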