SC18 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

High-Performance Dense Tucker Decomposition on GPU Clusters


Authors: Jee Choi (IBM), Xing Liu (Intel Corporation), Venkatesan Chakaravarthy (IBM)

Abstract: The Tucker decomposition method is one of the most popular algorithms for analyzing and compressing data with multi-way relationships. Its execution time is typically dominated by dense matrix multiplication, which makes it well-suited for GPU acceleration. State-of-the-art distributed dense Tucker implementations for CPU clusters adopt multi-dimensional partitioning that optimizes for storage and communication. This, however, leads to smaller matrix dimensions, which in turn under-utilize the GPU.
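To illustrate why dense matrix multiplication dominates the running time, here is a minimal single-node sketch of a textbook HOSVD-style Tucker decomposition in NumPy. It is not the authors' GPU or distributed implementation; the function names (`unfold`, `mode_product`, `hosvd`, `reconstruct`) are hypothetical and chosen only for illustration. For each mode, the tensor is matricized, a Gram matrix is formed with one large dense product, and the leading eigenvectors of that Gram matrix become the factor matrix.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization: rows index `mode`, columns index all other modes."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_product(tensor, matrix, mode):
    """Tensor-times-matrix: multiply `tensor` by `matrix` along `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    out = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def hosvd(tensor, ranks):
    """Textbook HOSVD (an assumption, not the paper's exact algorithm)."""
    factors = []
    for mode, rank in enumerate(ranks):
        unfolded = unfold(tensor, mode)
        # Dense matrix multiplication: typically the dominant cost.
        gram = unfolded @ unfolded.T
        # eigh returns eigenvalues in ascending order, so reverse the columns.
        _, eigvecs = np.linalg.eigh(gram)
        factors.append(eigvecs[:, ::-1][:, :rank])
    core = tensor
    for mode, factor in enumerate(factors):
        core = mode_product(core, factor.T, mode)
    return core, factors

def reconstruct(core, factors):
    """Expand a Tucker core and factor matrices back into a full tensor."""
    out = core
    for mode, factor in enumerate(factors):
        out = mode_product(out, factor, mode)
    return out
```

Note that the Gram matrix has shape (mode size) x (mode size), while the unfolded matrix has as many columns as the product of the remaining dimensions. Partitioning the tensor along several modes shrinks those column counts per process, which is one plausible reading of why the resulting products are too small to keep a GPU busy.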

In this paper, we present our optimized implementation and performance analysis of dense Tucker decomposition on a multi-GPU cluster. We propose three optimizations: a new partitioning strategy that improves GPU performance, a new tensor matricization layout that halves the number of communication/matricization steps, and a variation of the randomized SVD algorithm to overcome the eigenvalue bottleneck that arises from the high speedups gained from GPU acceleration. Our GPU implementation employing all three optimizations achieves up to 11.8x speedup on 64 nodes over state-of-the-art TuckerMPI.
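The abstract does not spell out the authors' variation of randomized SVD, so the following is only a sketch of the generic randomized range-finder approach, written in NumPy on a single node. The function name `randomized_left_singular_vectors` and the parameters `oversample` and `seed` are assumptions made for illustration. The idea is to replace an expensive eigenvalue or SVD computation on a large matrix with dense products against a random test matrix, followed by a decomposition of a much smaller matrix.

```python
import numpy as np

def randomized_left_singular_vectors(matrix, rank, oversample=5, seed=0):
    """Approximate the leading `rank` left singular vectors and singular values.

    Generic randomized range finder; not the paper's specific variant.
    """
    rng = np.random.default_rng(seed)
    sketch_cols = min(rank + oversample, matrix.shape[1])
    # Random Gaussian test matrix.
    omega = rng.standard_normal((matrix.shape[1], sketch_cols))
    # Dense matmul: samples the column space of `matrix`.
    sample = matrix @ omega
    # Orthonormal basis for the sampled column space.
    q, _ = np.linalg.qr(sample)
    # Project onto the basis; `small` has only a few rows.
    small = q.T @ matrix
    u_small, singular_values, _ = np.linalg.svd(small, full_matrices=False)
    return (q @ u_small)[:, :rank], singular_values[:rank]
```

In a Tucker setting, `matrix` would be a mode-n unfolding of the tensor and `rank` the target rank for that mode. Most of the work then sits in the two dense products, which is the kind of operation GPUs handle well.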




