Abstract: The Tucker decomposition is one of the most popular methods for analyzing and compressing data with multi-way relationships. Its execution time is typically dominated by dense matrix multiplication, which makes it well-suited for GPU acceleration. State-of-the-art distributed dense Tucker implementations for CPU clusters adopt multi-dimensional partitioning that optimizes for storage and communication. This, however, leads to smaller matrix dimensions, which in turn under-utilize the GPU.
In this paper, we present our optimized implementation and performance analysis of dense Tucker decomposition on a multi-GPU cluster. We propose three optimizations: a new partitioning strategy that improves GPU performance, a new tensor matricization layout that halves the number of communication/matricization steps, and a variation of the randomized SVD algorithm to overcome the eigenvalue bottleneck that arises from the high speedups gained from GPU acceleration. Our GPU implementation employing all three optimizations achieves up to 11.8x speedup on 64 nodes over state-of-the-art TuckerMPI.
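For readers unfamiliar with the computation, the following is a minimal single-node NumPy sketch of one common way to compute a dense Tucker decomposition: a sequentially truncated higher-order SVD that forms a Gram matrix per mode and takes its leading eigenvectors. It is an illustration only, not the paper's GPU implementation, partitioning strategy, matricization layout, or randomized SVD variant; the function names (`unfold`, `ttm`, `st_hosvd`, `reconstruct`) and the tensor sizes are assumptions made for this example. It shows where the dense matrix multiplications and the eigenvalue step mentioned above arise.

```python
import numpy as np

def unfold(X, mode):
    """Mode-`mode` matricization: rows index the chosen mode."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def ttm(X, U, mode):
    """Tensor-times-matrix: apply U (shape r x I_mode) along `mode`."""
    Y = np.tensordot(U, X, axes=(1, mode))  # contracted mode becomes axis 0
    return np.moveaxis(Y, 0, mode)

def st_hosvd(X, ranks):
    """Sequentially truncated HOSVD using Gram matrices (illustrative)."""
    core, factors = X, []
    for mode, r in enumerate(ranks):
        Xn = unfold(core, mode)
        gram = Xn @ Xn.T                 # dense matrix multiplication
        _, V = np.linalg.eigh(gram)      # eigenvalues in ascending order
        U = V[:, ::-1][:, :r]            # keep the r leading eigenvectors
        factors.append(U)
        core = ttm(core, U.T, mode)      # shrink this mode before the next
    return core, factors

def reconstruct(core, factors):
    """Multiply the core by every factor matrix to rebuild the tensor."""
    X = core
    for mode, U in enumerate(factors):
        X = ttm(X, U, mode)
    return X

# Build a tensor with exact multilinear rank (2, 3, 2) and decompose it.
rng = np.random.default_rng(0)
true_core = rng.standard_normal((2, 3, 2))
true_factors = [np.linalg.qr(rng.standard_normal((n, r)))[0]
                for n, r in zip((6, 7, 5), (2, 3, 2))]
X = reconstruct(true_core, true_factors)

core, factors = st_hosvd(X, ranks=(2, 3, 2))
X_hat = reconstruct(core, factors)
print("relative error:", np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```

In this sketch the Gram product and the tensor-times-matrix step are plain matrix multiplications, while `np.linalg.eigh` is the eigenvalue step; once the multiplications are accelerated, that eigenvalue step can take a larger share of the runtime, which is the kind of bottleneck a randomized SVD is meant to relieve.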