Description

The two-dimensional Fourier transform is a widely used computational kernel in many HPC applications. The popular NVIDIA cuFFT library provides a simple interface for computing 2D FFTs on GPUs, but it has yet to exploit recent hardware advances in half-precision floating-point arithmetic. In this poster, we propose a mixed-precision method that accelerates the 2D FFT by exploiting the FP16 matrix-multiply-and-accumulate units on the newest GPU architecture, known as tensor cores. We balance speed and accuracy by dynamically splitting the single-precision input data into two half-precision operands and performing FFTs on them separately. We present a CUDA-based implementation that achieves roughly three more digits of accuracy than half-precision cuFFT. We also demonstrate the stability and scalability of our approach and conclude that it attains high accuracy with tolerable splitting overhead.
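The splitting step relies on the linearity of the Fourier transform: if the FP32 input is decomposed as hi + lo/s with both parts representable in FP16, then FFT(x) = FFT(hi) + FFT(lo)/s. A minimal NumPy sketch of this idea follows; the fixed scale factor `2048.0` and the function name `split_fft2` are illustrative assumptions, not the poster's actual (dynamic) splitting scheme, and NumPy performs the transforms in high precision rather than on tensor cores.

```python
import numpy as np

def split_fft2(x, scale=2048.0):
    # Two-term decomposition: x ~= hi + lo/scale, both parts stored in FP16.
    # (The scale is an illustrative fixed choice; the poster splits dynamically.)
    hi = x.astype(np.float16)
    residual = x - hi.astype(np.float32)
    lo = (residual * scale).astype(np.float16)
    # By linearity of the DFT: FFT(x) = FFT(hi) + FFT(lo)/scale.
    return (np.fft.fft2(hi.astype(np.float32))
            + np.fft.fft2(lo.astype(np.float32)) / scale)

rng = np.random.default_rng(0)
x = rng.random((64, 64)).astype(np.float32)

ref = np.fft.fft2(x)                                   # FP32 reference
split = split_fft2(x)                                  # two-term FP16 split
naive = np.fft.fft2(x.astype(np.float16).astype(np.float32))  # plain FP16 cast

err_split = np.linalg.norm(split - ref) / np.linalg.norm(ref)
err_naive = np.linalg.norm(naive - ref) / np.linalg.norm(ref)
print(err_split, err_naive)  # the split result is far closer to the reference
```

The second FP16 operand recovers the mantissa bits lost in the first cast, which is where the extra digits of accuracy over plain half-precision come from.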