This study investigates variations of the preconditioned conjugate gradient (PCG) method designed to reduce communication costs by decreasing the number of allreduce operations and by overlapping communication with computation using a non-blocking allreduce. Experiments show that these scalable PCG methods can outperform standard PCG at scale and demonstrate their robustness.
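As an illustrative sketch only (not the authors' implementation), the following pure-Python code shows one common way to reduce the number of allreduces: a Chronopoulos–Gear-style CG variant that computes the two dot products of each iteration back to back, so a distributed version could fuse them into a single allreduce per iteration instead of two. The `matvec` (a 1-D Laplacian test matrix), the unpreconditioned setting, and all names here are our own assumptions.

```python
def matvec(x):
    # 1-D Laplacian tridiag(-1, 2, -1): a small SPD test matrix (our choice).
    n = len(x)
    y = [2.0 * x[i] for i in range(n)]
    for i in range(1, n):
        y[i] -= x[i - 1]
        y[i - 1] -= x[i]
    return y

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def axpy(a, x, y):
    # Returns a*x + y elementwise.
    return [a * xi + yi for xi, yi in zip(x, y)]

def cg_single_reduce(b, tol=1e-10, maxit=200):
    # CG recurrences rearranged so both dot products of an iteration are
    # adjacent; in a parallel code they would share one fused allreduce.
    n = len(b)
    x = [0.0] * n
    r = b[:]                 # r0 = b - A*0
    u = r[:]                 # identity preconditioner: u = M^{-1} r
    w = matvec(u)
    gamma = dot(r, u)
    alpha = gamma / dot(w, u)
    beta = 0.0
    p = [0.0] * n
    s = [0.0] * n
    for _ in range(maxit):
        p = axpy(beta, p, u)       # p = u + beta*p
        s = axpy(beta, s, w)       # s = w + beta*s  (s tracks A*p)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, s, r)
        u = r[:]
        w = matvec(u)
        # The two reductions below would be combined into ONE allreduce
        # on the pair (gamma_new, delta) in a distributed-memory code.
        gamma_new = dot(r, u)
        delta = dot(w, u)
        if gamma_new ** 0.5 < tol:
            gamma = gamma_new
            break
        beta = gamma_new / gamma
        alpha = gamma_new / (delta - beta * gamma_new / alpha)
        gamma = gamma_new
    return x, gamma ** 0.5
```

A pipelined variant would go one step further and issue the fused reduction as a non-blocking allreduce (e.g. `MPI_Iallreduce`), overlapping it with the next matrix-vector product.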
To develop the best-performing Krylov methods, we need a clear understanding of the factors that limit performance at scale. Detailed timings and network counters are used to measure the performance of these methods more thoroughly. Performance models with penalty terms are developed that provide reasonable explanations of observed performance and guide the development of optimizations. The effectiveness of scalable PCG methods and of these performance analysis tools is demonstrated using QUDA and Nek5000, two HPC applications seeking improved performance at scale.
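To make the modeling idea concrete, here is a toy per-iteration cost model, entirely of our own construction and not the paper's model: local compute time plus a logarithmic latency term per allreduce, with an overlap fraction for the pipelined variant and an additive penalty term standing in for effects (noise, imbalance) not captured by the ideal model. All parameter values are hypothetical.

```python
import math

def pcg_iter_time(p, flops=1.0e6, rate=1.0e9, lat=2.0e-6,
                  n_allreduce=2, overlap=0.0, penalty=0.0):
    """Toy per-iteration time model (illustrative assumptions, not the paper's).

    compute : local flops / flop rate, work split across p ranks
    comm    : latency * log2(p) per allreduce, a common first-order model
    overlap : fraction of allreduce time hidden behind computation
    penalty : additive term for unmodeled effects (noise, imbalance)
    """
    compute = flops / (p * rate)
    comm = n_allreduce * lat * max(1.0, math.log2(p))
    return compute + (1.0 - overlap) * comm + penalty

# Standard PCG: two blocking allreduces per iteration.
# Pipelined PCG: one allreduce, partially hidden by the matvec.
def compare(p):
    t_std = pcg_iter_time(p, n_allreduce=2)
    t_pipe = pcg_iter_time(p, n_allreduce=1, overlap=0.8)
    return t_std, t_pipe
```

Fitting the penalty term to measured timings, as the study does with real timings and network counters, is what lets such a model explain gaps between ideal predictions and observed performance.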