Abstract: Achieving portable performance across different parallel architectures and varying problem sizes is hard: e.g., a program optimized for multi-core CPUs on large input sizes can differ significantly from the same program optimized for Graphics Processing Units (GPUs) on small sizes.
We propose an approach to ensuring performance portability by relying on multi-dimensional homomorphisms (MDHs) -- a class of parallelizable functions that cover important application areas including linear algebra routines (BLAS) and stencil computations. We develop an extended OpenCL implementation schema for MDHs that is generic in the performance-critical parameters of the OpenCL model, and we achieve performance portability by automatically optimizing these parameters for different target architectures and input sizes via auto-tuning.
Our results demonstrate competitive and often even significantly better performance than state-of-the-art approaches for BLAS and Stencil as used in the important application area of deep learning.
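To illustrate the MDH idea informally: a homomorphism is a function that can be computed independently on parts of its input and then merged with a combine operator, which is what makes it parallelizable. The sketch below shows a one-dimensional instance (dot product) of this pattern; the names `md_hom`, `f`, and `combine` are illustrative only and do not reflect the authors' actual API or implementation.

```python
# Hedged sketch of the homomorphism pattern underlying MDHs (1D case).
# A homomorphism h satisfies h(a ++ b) = combine(h(a), h(b)), so parts
# of the input can be processed in parallel and merged afterwards.
# All names here are hypothetical, for illustration only.

from functools import reduce
import operator

def md_hom(f, combine):
    """Build a homomorphism from a per-element function `f`
    and an associative binary combine operator."""
    def h(xs):
        return reduce(combine, (f(x) for x in xs))
    return h

# Dot product as a homomorphism: multiply element pairs, combine with +.
dot = md_hom(lambda xy: xy[0] * xy[1], operator.add)

a = [1, 2, 3]
b = [4, 5, 6]
pairs = list(zip(a, b))

# The homomorphism property allows splitting the input arbitrarily:
# computing on the halves and combining gives the same result.
whole = dot(pairs)
split = dot(pairs[:2]) + dot(pairs[2:])
print(whole, split)  # 32 32
```

In the full MDH formalism this generalizes to multiple dimensions, with a (possibly different) combine operator per dimension, e.g., concatenation in the output dimensions and addition in the reduction dimension of matrix multiplication.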
Best Poster Finalist (BP): no