SC18 Proceedings

The International Conference for High Performance Computing, Networking, Storage, and Analysis

Tensorfolding: Improving Convolutional Neural Network Performance with Fused Microkernels

Authors: Michael Anderson (Intel Corporation), Evangelos Georganas (Intel Corporation), Sasikanth Avancha (Intel Corporation), Alexander Heinecke (Intel Corporation)

Abstract: Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs), which provide state-of-the-art results for tasks like image recognition, neural machine translation, and speech recognition. In the recent past, several techniques have been developed to improve the generalization capabilities of neural networks; the most prominent and successful is batch normalization. In deep neural network training, the batch normalization layer consists of a memory-bandwidth-bound kernel. On the latest Intel Skylake-based Xeon processors, a significant portion of execution time is spent in this kernel. By leveraging the CPU's large caches and its latency-optimized execution model, we reduce this kernel's time to a bare minimum, improving forward-pass layer runtimes by 21% compared to an unfused implementation and by 2% compared to a fused implementation.
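The full technique is only summarized in the abstract; as a rough, hypothetical illustration (not the authors' implementation), the bandwidth argument is that an unfused batch normalization sweeps the whole activation tensor through memory again after the convolution, while a fused variant normalizes each output tile while it is still cache-resident:

```python
import numpy as np

# Illustrative sketch only: names, shapes, and tiling are assumptions,
# not the paper's microkernels.

def batchnorm_unfused(x, eps=1e-5):
    # Pass 1: stream x from memory to compute per-channel statistics.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    # Pass 2: stream x again to normalize -- a second full sweep over DRAM.
    return (x - mean) / np.sqrt(var + eps)

def batchnorm_fused_tiles(tiles, mean, var, eps=1e-5):
    # Once statistics are available, normalization can be folded into the
    # producer loop: each tile is normalized right after the preceding
    # kernel writes it, while it is still in cache, avoiding the extra
    # round trip to main memory.
    inv_std = 1.0 / np.sqrt(var + eps)
    return [(t - mean) * inv_std for t in tiles]

x = np.random.randn(64, 16).astype(np.float32)   # batch of 64, 16 channels
ref = batchnorm_unfused(x)

tiles = np.split(x, 4, axis=0)                   # stand-in for cache tiles
fused = np.concatenate(
    batchnorm_fused_tiles(tiles, x.mean(axis=0), x.var(axis=0)))

assert np.allclose(ref, fused, atol=1e-5)        # same math, fewer sweeps
```

Both paths compute the same normalization; the difference the poster targets is purely how many times the tensor crosses the memory hierarchy.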

Best Poster Finalist (BP): no

