Description: Code_Saturne is a widely used computational fluid dynamics software package that uses finite-volume methods to simulate a range of flows and is designed to handle multi-billion-cell unstructured mesh simulations. Codes of this class have proven challenging to accelerate on GPUs because they consist of many kernels with regular inter-process communication in between. In this poster we show how template pack expansion with CUDA can combine multiple kernels into a single one, reducing launch latency, and how, together with the specification of data environments, it helps reduce host-device communication. We tested these techniques on the ORNL Summit supercomputer, which is based on the OpenPOWER platform, and obtained almost 3x speedup over CPU-only runs on 256 nodes. We also show how the latest-generation NVLINK(TM) interconnect available in POWER9(TM) improves scaling efficiency, enabling consistent GPU acceleration with just 100K cells per process.
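
The kernel-combining idea can be illustrated with a minimal host-side C++17 sketch. It is not the Code_Saturne implementation: the functors `Scale`, `Shift`, and `Square`, the `fused_launch` helper, and the `launch_count` counter are all hypothetical names chosen for illustration, and a plain loop stands in for the GPU grid. In CUDA, the body of `fused_launch` would presumably live in a single `__global__` kernel, so that one launch covers the whole pack of operations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical element-wise operations standing in for small CFD kernels.
struct Scale {
    double factor;
    void operator()(double* x, std::size_t i) const { x[i] *= factor; }
};
struct Shift {
    double offset;
    void operator()(double* x, std::size_t i) const { x[i] += offset; }
};
struct Square {
    void operator()(double* x, std::size_t i) const { x[i] *= x[i]; }
};

// Counts simulated launches so the fused and unfused paths can be compared.
int launch_count = 0;

// One "launch" applies every operation in the pack to each element, in order.
// The loop is a stand-in for the thread grid of a single GPU kernel.
template <typename... Ops>
void fused_launch(double* x, std::size_t n, Ops... ops) {
    assert(x != nullptr || n == 0);
    ++launch_count;
    for (std::size_t i = 0; i < n; ++i) {
        (ops(x, i), ...);  // C++17 fold expression expands the pack
    }
}

// Fused path: three operations, one launch.
std::vector<double> run_fused(std::vector<double> v) {
    fused_launch(v.data(), v.size(), Scale{2.0}, Shift{1.0}, Square{});
    return v;
}

// Unfused path: the same three operations, one launch each.
std::vector<double> run_unfused(std::vector<double> v) {
    fused_launch(v.data(), v.size(), Scale{2.0});
    fused_launch(v.data(), v.size(), Shift{1.0});
    fused_launch(v.data(), v.size(), Square{});
    return v;
}
```

Both paths produce the same values, but the fused path pays the launch cost once instead of three times. The sketch assumes each operation touches only element `i`; operations with cross-element dependencies could not be fused this simply.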