Heterogeneous CPU-GPU Execution of Stencil Applications
Abstract: Heterogeneous computer architectures are now ubiquitous in high performance computing; the top 7 supercomputers are all built with CPUs and accelerators. Portability across different CPUs and GPUs is becoming paramount, and heterogeneous scheduling of computations is also of increasing interest to make full use of these systems. In this paper, we present research on the hybrid CPU-GPU execution of an important class of applications: structured mesh stencil codes. Our work broadens the performance portability capabilities of the Oxford Parallel library for Structured meshes (OPS), which allows a science code written once at a high level to be automatically parallelised for a range of different architectures. We explore the traditional per-loop load balancing approach used by others, and highlighting its shortcomings, we develop an algorithm that relies on polyhedral analysis and transformations in OPS to allow load balancing on the level of larger computational stages, reducing data transfer requirements and synchronisation points.
We evaluate our algorithms on a simple heat equation benchmark, as well as a substantially more complex code, the CloverLeaf hydrodynamics mini-app. To demonstrate performance portability, we study Intel and IBM systems equipped with NVIDIA Kepler, Pascal, and Volta GPUs, evaluating CPU-only, GPU-only and hybrid CPU-GPU performance. We demonstrate a 1.05-1.2x speedup on CloverLeaf. Our results highlight the ability of the OPS domain specific language to deliver effortless performance portability for its users across a number of platforms.
Back to International Workshop on Performance, Portability, and Productivity in HPC (P3HPC) Archive Listing
Back to Full Workshop Archive Listing