Collective operations are used in MPI programs to express common communication patterns, collective computations, and synchronization. In many collectives, such as barrier or allreduce, the intra-node component lies on the critical path, because inter-node communication cannot start until the intra-node component has completed. Thus, as the number of cores per node increases, optimizations that leverage intra-node shared memory become increasingly important.
In this paper, we focus on the performance benefit of optimizing intra-node collectives using shared memory. We optimize several collectives, reusing the primitives from broadcast and reduce as building blocks for the other collectives. A comparison of our implementation, built on top of MPICH, shows significant speedups over the original MPICH implementation, MVAPICH, and OpenMPI, among others.
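
To illustrate the dependency described above, the following is a minimal sketch of how an allreduce can be composed from an intra-node reduce, an inter-node combination among node leaders, and an intra-node broadcast. It is a pure-Python model of the data flow only, not the implementation evaluated in this paper and not MPI code: the function names (`intra_node_reduce`, `intra_node_broadcast`, `hierarchical_allreduce`) and the representation of a node as a list of per-rank values are assumptions made for illustration.

```python
from functools import reduce as fold
from operator import add


def intra_node_reduce(values, op):
    """Combine the values of all ranks on one node into one value (hypothetical helper)."""
    return fold(op, values)


def intra_node_broadcast(value, count):
    """Give each of the `count` ranks on a node a copy of `value` (hypothetical helper)."""
    return [value] * count


def hierarchical_allreduce(nodes, op=add):
    """Model an allreduce over `nodes`, a list of per-node lists of rank values."""
    # Step 1 (intra-node): every node reduces its local ranks to a single partial result.
    leader_values = [intra_node_reduce(local, op) for local in nodes]
    # Step 2 (inter-node): the partial results are combined across nodes.
    # This step needs every partial result from step 1, so step 1 is on the critical path.
    global_value = fold(op, leader_values)
    # Step 3 (intra-node): every node distributes the global result to its local ranks.
    return [intra_node_broadcast(global_value, len(local)) for local in nodes]


if __name__ == "__main__":
    # Two nodes with four ranks each; each rank contributes one integer.
    print(hierarchical_allreduce([[1, 2, 3, 4], [5, 6, 7, 8]]))
```

In this model, steps 1 and 3 are the parts that a shared-memory optimization would target, since they involve only ranks on the same node.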