High Performance Implementation of Reproducible BLAS Routines with Tunable Accuracy Using Ozaki Scheme
Time: Sunday, November 11th, 2:51pm - 3:11pm
Description: This study presents a high-performance implementation of Basic Linear Algebra Subprograms (BLAS) routines that support reproducibility and tunable accuracy using the Ozaki scheme, an accurate matrix-multiplication method proposed by Ozaki et al. in 2011. The method ensures reproducibility and achieves tunable accuracy, including correct rounding, by eliminating the effect of rounding error in the computation. Its main advantage is that it can be built on level-3 BLAS. We show the implementation of three routines, one from each of BLAS levels 1 to 3: inner product (DOT), matrix-vector multiplication (GEMV), and matrix-matrix multiplication (GEMM), together with several optimization techniques that reduce memory consumption and improve performance. A performance evaluation on a Titan V GPU demonstrates that our implementation achieves more than 73% of the expected peak performance.
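To illustrate how rounding error can be eliminated from the computation, the following is a minimal pure-Python sketch of the splitting idea applied to DOT. It is not the authors' GPU implementation. The function names (`split`, `ozaki_partials`, `ozaki_dot`) and the term "slices" are illustrative assumptions, and the sketch assumes IEEE 754 double precision with no overflow or underflow. Each input vector is split without error into slices with few enough significant bits that every slice-by-slice inner product is exact in ordinary floating-point arithmetic; the exact partial results are then summed with `math.fsum`.

```python
import math


def split(x):
    """Error-free split: return slices whose element-wise sum is exactly x.

    Each slice element is a multiple of a common power of two and has few
    enough significant bits that inner products of slices incur no rounding.
    Assumes IEEE 754 doubles (53-bit significand), no overflow or underflow.
    """
    n = len(x)
    # Bits to drop per slice so that n products of two slices sum exactly.
    rho = math.ceil((53 + math.log2(n + 1)) / 2)
    slices = []
    rest = list(x)
    while any(v != 0.0 for v in rest):
        mu = max(abs(v) for v in rest)
        tau = math.frexp(mu)[1]             # mu < 2**tau
        sigma = math.ldexp(1.0, rho + tau)  # sigma = 2**(rho + tau)
        # Adding and subtracting sigma keeps only the leading bits of v.
        hi = [(v + sigma) - sigma for v in rest]
        rest = [v - h for v, h in zip(rest, hi)]  # exact remainder
        slices.append(hi)
    return slices


def ozaki_partials(x, y):
    """Inner products of every slice pair; each one is computed exactly."""
    assert len(x) == len(y)
    return [
        sum(a * b for a, b in zip(xi, yj))
        for xi in split(x)
        for yj in split(y)
    ]


def ozaki_dot(x, y):
    """Dot product obtained by accurately summing the exact partials."""
    return math.fsum(ozaki_partials(x, y))
```

Because every partial inner product is exact, the final result does not depend on the order in which the slice products are evaluated, which is the source of reproducibility. In this sketch all slices are used; truncating the number of slices would be one way to trade accuracy for speed, in the spirit of the tunable accuracy described above. In the matrix setting, the slice-by-slice products would be delegated to a standard GEMM routine.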