Performance Portability of an Unstructured Hydrodynamics Mini-Application
Abstract: In this work we study the parallel performance portability of BookLeaf, a recent 2D unstructured hydrodynamics mini-application. BookLeaf aims to provide a self-contained, representative testbed for exploring the design space of modern hydrodynamics applications.
We present a previously unpublished reference C++11 implementation of BookLeaf parallelised with MPI, alongside hybrid MPI+OpenMP and MPI+CUDA versions, and two implementations using C++11 performance portability frameworks: Kokkos and RAJA, which both target a variety of parallel back-ends. We assess the scalability of our implementations on the ARCHER Cray XC30 up to 4096 nodes (98,304 cores) and on the Ray EA system at Lawrence Livermore National Laboratory up to 16 nodes (64 Tesla P100 GPUs), with a particular focus on the overheads introduced by Kokkos and RAJA relative to our handwritten OpenMP and CUDA implementations. We quantify the performance portability achieved by our Kokkos and RAJA implementations across five modern architectures using a metric previously introduced by Pennycook et al.
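For readers unfamiliar with the metric, Pennycook et al. define performance portability as the harmonic mean of an application's performance efficiency over a set of platforms, with a score of zero if the application fails to run on any platform in the set. The Python sketch below illustrates that definition only. The function name, platform labels, and efficiency values are hypothetical and are not measurements from this work.

```python
def performance_portability(efficiencies):
    """Harmonic mean of per-platform efficiencies (each in (0, 1]).

    `efficiencies` maps a platform label to an efficiency value, or to
    None if the application does not run on that platform. Following
    the metric's definition, an unsupported platform gives a score of 0.
    """
    if not efficiencies:
        raise ValueError("at least one platform is required")
    values = list(efficiencies.values())
    if any(e is None or e <= 0.0 for e in values):
        return 0.0
    return len(values) / sum(1.0 / e for e in values)


# Hypothetical efficiencies, for illustration only.
example = {"cpu_a": 1.0, "gpu_b": 0.5}
score = performance_portability(example)

# One unsupported platform drives the whole score to zero.
unsupported = performance_portability({"cpu_a": 0.9, "gpu_b": None})
```

Because the harmonic mean is dominated by the smallest values, a single poorly performing platform lowers the score more than an arithmetic mean would.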
We find that all of our BookLeaf implementations scale well, in particular the hybrid configurations (the MPI+OpenMP variant achieves a parallel efficiency above 0.8 on 49,152 cores). The Kokkos and RAJA variants exhibit competitive performance in all experiments; however, their CPU performance is best in memory-bound situations, where the overhead introduced by the frameworks is partially hidden by the time spent waiting for data. The overheads seen in the GPU experiments are extremely low. We observe overall performance portability scores of 0.928 for Kokkos and 0.876 for RAJA.
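As a reminder of what a parallel efficiency figure expresses, the sketch below computes efficiency for a fixed-size (strong scaling) problem relative to a baseline run. This is an assumption made for illustration: the abstract does not state which scaling regime or baseline underlies the reported value, and the function name and timings are hypothetical.

```python
def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Parallel efficiency of a run on `n` cores relative to a baseline.

    t_base, n_base: runtime and core count of the baseline run.
    t_n, n:         runtime and core count of the scaled run.
    A value of 1.0 means ideal speedup for a fixed problem size.
    """
    if min(t_base, n_base, t_n, n) <= 0:
        raise ValueError("runtimes and core counts must be positive")
    return (t_base * n_base) / (t_n * n)


# Hypothetical timings: 10x more cores gives an 8x speedup.
eff = strong_scaling_efficiency(t_base=100.0, n_base=1, t_n=12.5, n=10)
```

With these made-up numbers the efficiency is 0.8, meaning the scaled run achieves 80% of the ideal speedup.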
Back to International Workshop on Performance, Portability, and Productivity in HPC (P3HPC) Archive Listing