Performance Evaluation of the NVIDIA Tesla V100: Block Level Pipelining vs. Kernel Level Pipelining
Time: Thursday, November 15th, 8:30am - 5pm
Description: As accelerators become more common, expressive, and performant, the interfaces to them become ever more important. Programming models such as OpenMP offer simple-to-use yet powerful directive-based offload mechanisms. By default, these models naively copy data to or from the device without overlapping the transfers with computation, so achieving good performance can require extensive hand-tuning to apply optimizations such as pipelining. To pipeline a task, users must manually partition it into multiple chunks and then launch multiple sub-kernels. This approach can suffer from high kernel launch overhead, and its hyperparameters must be carefully tuned to achieve optimal performance. To address these issues, we propose a block-level pipelining approach that overlaps data transfers and computation within a single kernel, with the two handled by different streaming multiprocessors on the GPU. Our results show that, without exhaustive tuning, our approach delivers stable performance between 95% and 108% of the best-tuned traditional kernel-level pipelining results on NVIDIA V100 GPUs.
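The tuning problem described above can be illustrated with a toy analytical model (an illustration, not taken from the paper): splitting a task with host-to-device copy, compute, and device-to-host copy stages into more chunks improves overlap, but each extra sub-kernel adds launch overhead, so the chunk count is a hyperparameter with a sweet spot. All stage times and the overhead value below are made-up example numbers.

```python
# Toy model (hypothetical numbers) of kernel-level pipelining:
# a task with H2D copy, compute, and D2H copy stages is split into
# n equal chunks processed in an idealized 3-stage pipeline, and
# each sub-kernel launch adds a fixed overhead.

def pipelined_time(t_h2d, t_kernel, t_d2h, n_chunks, launch_overhead):
    """Makespan of an idealized 3-stage pipeline with n equal chunks."""
    # The first chunk must traverse all three stages ("pipeline fill").
    fill = (t_h2d + t_kernel + t_d2h) / n_chunks
    # After the fill, one chunk completes per slowest-stage interval.
    stage = max(t_h2d, t_kernel, t_d2h) / n_chunks
    # Every chunk also pays a fixed kernel-launch overhead.
    return fill + (n_chunks - 1) * stage + n_chunks * launch_overhead

# Sweep chunk counts: too few chunks -> little overlap; too many ->
# launch overhead dominates, so the hyperparameter must be tuned.
times = {n: pipelined_time(10.0, 12.0, 10.0, n, 0.05)
         for n in (1, 2, 4, 8, 16, 64)}
best = min(times, key=times.get)
print(best, round(times[best], 2))  # the intermediate chunk count wins
```

Under this simple model, performance degrades on both ends of the sweep, which is the motivation the abstract gives for a block-level scheme that avoids per-chunk kernel launches altogether.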