Description

We have implemented our “kinaco” numerical ocean model on the University of Tokyo’s Reedbush supercomputer, which uses the latest NVIDIA Pascal P100 GPUs with GPUDirect technology. We have also optimized the model’s Poisson/Helmholtz solver by adjusting the global memory alignment and thread block configuration, introducing shuffle functions to accelerate the creation of coarse grids, and merging small kernels in the multigrid preconditioner. In addition, we use GPUDirect RDMA transfers to improve MPI communication efficiency. By exploiting these GPU capabilities, the GPU implementation is now twice as fast as the CPU version, and it shows good weak scalability across multiple GPUs. Most of the GPU kernels are accelerated, and the velocity diagnosis functions in particular are now approximately seven times faster. The performance of inter-node data transfers using a CUDA-aware MPI library with GPUDirect RDMA is comparable to that on CPUs.
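To illustrate how warp shuffle functions can help build a coarse grid, the following is a minimal sketch, not the kinaco implementation. It assumes a simplified one-dimensional restriction in which each coarse value is the average of two adjacent fine values, with an even number of fine points and a block size that is a multiple of 32. The names `restrict_pairs` and `count_restriction_mismatches` are hypothetical. The sketch uses `__shfl_xor_sync`, which requires CUDA 9 or later; earlier toolkits provided `__shfl_xor` without the mask argument.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical 1D restriction: coarse[i] = 0.5 * (fine[2i] + fine[2i+1]).
// Each thread loads one fine value; neighbouring lanes exchange values
// through registers with a warp shuffle instead of a second global load.
__global__ void restrict_pairs(const float* fine, float* coarse, int n_fine)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n_fine) ? fine[i] : 0.0f;

    // Every lane of the warp takes part in the shuffle (no early return
    // above), so the full mask is valid. Lane L receives the value of L^1.
    float partner = __shfl_xor_sync(0xffffffffu, v, 1);

    // Even threads write the average of their own value and the partner's.
    if ((threadIdx.x & 1) == 0 && i + 1 < n_fine)
        coarse[i / 2] = 0.5f * (v + partner);
}

// Runs the kernel and compares against a host reference.
// Returns the number of mismatching coarse values, or -1 on a CUDA error.
// Assumes n_fine is even.
int count_restriction_mismatches(int n_fine)
{
    int n_coarse = n_fine / 2;
    std::vector<float> h_fine(n_fine), h_coarse(n_coarse);
    for (int i = 0; i < n_fine; ++i) h_fine[i] = float(i % 17);

    float* d_fine = nullptr;
    float* d_coarse = nullptr;
    if (cudaMalloc(&d_fine, n_fine * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_coarse, n_coarse * sizeof(float)) != cudaSuccess) {
        cudaFree(d_fine);
        return -1;
    }
    cudaMemcpy(d_fine, h_fine.data(), n_fine * sizeof(float),
               cudaMemcpyHostToDevice);

    int threads = 256;  // multiple of the warp size, so lane pairs align
    int blocks = (n_fine + threads - 1) / threads;
    restrict_pairs<<<blocks, threads>>>(d_fine, d_coarse, n_fine);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(h_coarse.data(), d_coarse, n_coarse * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_fine);
    cudaFree(d_coarse);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n_coarse; ++i) {
        float expected = 0.5f * (h_fine[2 * i] + h_fine[2 * i + 1]);
        if (std::fabs(h_coarse[i] - expected) > 1e-6f) ++bad;
    }
    return bad;
}

int main()
{
    int bad = count_restriction_mismatches(1 << 12);
    std::printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

In a real three-dimensional multigrid restriction, more neighbours contribute to each coarse cell, so several shuffles (or shuffles combined with shared memory) would likely be needed. The pairwise case only shows the basic register-exchange pattern.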