SC18 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Large-Message Size Allreduce at Wire Speed for Distributed Deep Learning


Authors: Kenji Tanaka, Yuki Arikawa, Kenji Kawai, Junichi Kato, Tsuyoshi Ito, Huy Cu Ngo, Kazutaka Morita, Fumiaki Miura, Takeshi Sakamoto, Satoshi Shigematsu (Japan Telegraph and Telephone Corporation)

Abstract: In large-scale distributed deep learning, the Allreduce operation for large messages (100 KB or more) is critical for gathering gradients from multiple worker nodes and broadcasting the sum of the gradients back to them. When the message is large, the latency of the Allreduce operation makes it difficult to take advantage of large-scale distributed deep learning. To reduce this latency, we devised a dataflow architecture with an Allreduce-specific hardware accelerator that performs data aggregation and reduction while the data is being transferred. The accelerator is designed to start the Allreduce operation immediately, before an entire message is received. Furthermore, Allreduce can be performed at wire speed by vectorizing the gradients and summing them in parallel. Experimental results reveal that the proposed architecture performs Allreduce at 96% of wire speed for a large message. Moreover, the latency of Allreduce is reduced by 65% compared with a state-of-the-art Allreduce method when applied to ResNet-50.
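The abstract's key idea, reducing gradients chunk by chunk while the message is still arriving rather than after it is fully received, can be illustrated with a small software sketch. The following Python snippet is only an analogy of that pipelined, vectorized reduction, not the authors' hardware accelerator or dataflow design; all names and sizes (CHUNK_ELEMS, worker_stream, etc.) are illustrative assumptions.

```python
# Minimal software analogy (not the authors' FPGA design) of chunk-wise,
# pipelined reduction: summation starts as soon as the first chunk of each
# worker's gradient "arrives", instead of waiting for the whole message.
import numpy as np

CHUNK_ELEMS = 4096          # elements reduced per pipeline step (assumed size)
NUM_WORKERS = 4
MESSAGE_ELEMS = 1 << 20     # ~1M gradient values per worker (a "large message")

def worker_stream(seed: int):
    """Yield one worker's gradient message chunk by chunk, as if it were
    arriving over the network."""
    rng = np.random.default_rng(seed)
    gradient = rng.standard_normal(MESSAGE_ELEMS).astype(np.float32)
    for start in range(0, MESSAGE_ELEMS, CHUNK_ELEMS):
        yield gradient[start:start + CHUNK_ELEMS]

def streaming_allreduce(streams):
    """Reduce corresponding chunks from all workers as they arrive; each
    vectorized np.add accumulates one chunk across workers, so reduction
    overlaps with (simulated) transfer instead of following it."""
    reduced_chunks = []
    for chunks in zip(*streams):          # one chunk from every worker
        acc = np.zeros_like(chunks[0])
        for chunk in chunks:
            np.add(acc, chunk, out=acc)   # element-wise, vectorized sum
        reduced_chunks.append(acc)        # ready to broadcast immediately
    return np.concatenate(reduced_chunks)

if __name__ == "__main__":
    streams = [worker_stream(seed) for seed in range(NUM_WORKERS)]
    total = streaming_allreduce(streams)
    print(total.shape, total.dtype)
```

In the poster's setting, the per-chunk accumulation would be done by the Allreduce-specific accelerator on data in flight, which is what allows the operation to approach wire speed for large messages.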

Best Poster Finalist (BP): no

