D221
20181111T145100
20181111T151100
High Performance Implementation of Reproducible BLAS Routines with
Tunable Accuracy Using Ozaki Scheme
Mukunoki, Ogita, Ozaki

This study presents a high-performance implementation of Basic Linear
Algebra Subprograms (BLAS) routines that support reproducibility and
tunable accuracy using the Ozaki scheme. The method ensures
reproducibility and realizes tunable accuracy, including correct rounding,
by eliminating the effect of rounding errors in the computation. The main
advantage of the method is that it can be built on level-3 BLAS. We
implement three routines, one from each of BLAS levels 1 to 3: inner
product (DOT), matrix-vector multiplication (GEMV), and matrix-matrix
multiplication (GEMM), with several optimizations that reduce memory
consumption and improve performance. A performance evaluation on a Titan V
GPU shows that our implementation achieves more than 73% of the expected
peak performance.
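
As a rough illustration of how rounding errors can be eliminated from an inner product, the following is a minimal pure-Python sketch of the slicing idea behind the Ozaki scheme. It is not the authors' GPU implementation. The names split_slices and ozaki_dot are hypothetical, the slice-width bound rho is one that was derived for this sketch, and math.fsum stands in for an accurate final summation. It assumes IEEE 754 binary64 with round-to-nearest, vectors of equal length, and inputs of moderate magnitude (no overflow or underflow).

```python
import math


def split_slices(x):
    """Split x into slices that add up to x exactly (hypothetical helper)."""
    n = len(x)
    # Low-order bits dropped per slice, chosen so that a sum of n products
    # of slice entries still fits in the 53-bit binary64 significand.
    rho = math.ceil((53 + math.log2(n + 1)) / 2)
    rest = list(x)
    slices = []
    mu = max((abs(v) for v in rest), default=0.0)
    while mu != 0.0:
        tau = math.frexp(mu)[1]              # guarantees mu < 2**tau
        sigma = 2.0 ** (rho + tau)
        # Adding then subtracting sigma rounds every entry to a common grid.
        s = [(v + sigma) - sigma for v in rest]
        # The remainder is computed without rounding error.
        rest = [v - t for v, t in zip(rest, s)]
        slices.append(s)
        mu = max((abs(v) for v in rest), default=0.0)
    return slices


def ozaki_dot(x, y):
    """Inner product assembled from exact slice-by-slice inner products."""
    partials = []
    for xp in split_slices(x):
        for yq in split_slices(y):
            acc = 0.0
            for a, b in zip(xp, yq):
                acc += a * b                 # exact under the stated assumptions
            partials.append(acc)
    # Accurate summation of the exact partial results.
    return math.fsum(partials)


# Ill-conditioned example: the exact inner product is 1.
print(ozaki_dot([1e16, 1.0, -1e16], [1.0, 1.0, 1.0]))  # 1.0
```

Because every slice-by-slice product is exact, the result does not depend on the order in which those products are evaluated, which is what makes the approach reproducible. For matrices, each slice-by-slice product becomes an ordinary matrix multiplication, which is why the method can be built on level-3 BLAS.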
