SC18 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Performance Evaluation of the NVIDIA Tesla V100: Block Level Pipelining vs. Kernel Level Pipelining

Authors: Xuewen Cui (Virginia Tech), Thomas R. W. Scogland (Lawrence Livermore National Laboratory), Bronis R. de Supinski (Lawrence Livermore National Laboratory), Wu Feng (Virginia Tech)

Abstract: As accelerators become more common, expressive, and performant, interfaces for them become ever more important. Programming models like OpenMP offer simple-to-use yet powerful directive-based offload mechanisms. By default, these models naively copy data to or from the device without overlapping communication with computation. Achieving good performance can require extensive hand-tuning to apply optimizations such as pipelining. To pipeline a task, users must manually partition it into multiple chunks and then launch multiple sub-kernels. This approach can suffer from high kernel-launch overhead, and its hyperparameters must be carefully tuned to achieve optimal performance. To ameliorate these issues, we propose a block-level pipelining approach that overlaps data transfers and computation within a single kernel, with the work handled by different streaming multiprocessors on the GPU. Our results show that, without exhaustive tuning, our approach delivers stable performance at 95% to 108% of the best-tuned results with traditional kernel-level pipelining on NVIDIA V100 GPUs.

Best Poster Finalist (BP): no

