Improving MPI Reduction Performance for Manycore Architectures with OpenMP and Data Compression
Abstract: MPI reductions are widely used in many scientific applications and often become the scaling performance bottleneck. When performing reductions on vectors, different algorithms have been developed to balance messaging overhead and bandwidth. However, most implementations have ignored the effect of single-thread performance not scaling as fast as aggregate network bandwidth. In this work, we propose, implement, and evaluate two approaches (threading and exploitation of sparsity) to accelerate MPI reductions on large vectors when running on manycore-based supercomputers. Our benchmark results show that our new techniques improve the MPI_Reduce performance up to 4x and improve BIGSTICK application performance by up to 2.6x.
Back to The 9th International Workshop on Performance Modeling, Benchmarking, and Simulation of High-Performance Computer Systems (PMBS18) Archive Listing
Back to Full Workshop Archive Listing