DescriptionIn large-scale distributed deep learning, the Allreduce operation for large messages (100 KB or more) is critical for gathering gradients from multiple worker nodes and broadcasting the sum of the gradients to them. When the message is large, the latency in Allreduce operation would make it difficult to take advantage of large-scale distributed deep learning. To reduce the latency, we devised a dataflow architecture with an Allreduce-specific hardware accelerator that performs data aggregation and reduction while data is being transferred. The accelerator is designed to immediately start Allreduce operation before an entire message is recived. Furthermore, Allreduce can be operated at wire speed by vectorizing the gradients and summing them in parallel. Experimental results reveal that the proposed architecture performs Allreduce at 96% of wire speed for a large message. Moreover, the latency of Allreduce is reduced by 65% compared with a state-of-the-art Allreduce method when applied for ResNet-50.