Authors:
Abstract: Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs), which provide state-of-the-art results for tasks like image recognition, neural machine translation, and speech recognition. In recent years, several techniques have been developed to improve the generalization capabilities of neural networks; the most prominent and successful is batch normalization. In deep neural network training, the batch normalization layer is a memory-bandwidth-bound kernel, and on the latest Intel Skylake-based Xeon processors a significant portion of execution time is spent in it. By leveraging the CPU's large caches and its latency-optimized execution model, we reduce this kernel's time to a bare minimum, improving forward-pass layer runtimes by 21% compared to an unfused implementation and by 2% compared to a fused implementation.
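For context on why this kernel is memory-bandwidth bound: the batch normalization forward pass performs only a few arithmetic operations per element while streaming the entire activation tensor through memory twice, once to compute statistics and once to normalize. The NumPy sketch below is illustrative only and is not the poster's implementation; the function name batchnorm_forward and the parameter names gamma, beta, and eps are our own.

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        # Per-channel batch normalization over an NCHW activation tensor.
        # Pass 1 over memory: per-channel mean and variance.
        mean = x.mean(axis=(0, 2, 3), keepdims=True)  # shape (1, C, 1, 1)
        var = x.var(axis=(0, 2, 3), keepdims=True)    # shape (1, C, 1, 1)
        # Pass 2 over memory: normalize, then scale and shift.
        x_hat = (x - mean) / np.sqrt(var + eps)
        return gamma * x_hat + beta

    # Example: N=8 images, C=64 channels, 56x56 spatial resolution.
    x = np.random.randn(8, 64, 56, 56).astype(np.float32)
    gamma = np.ones((1, 64, 1, 1), dtype=np.float32)
    beta = np.zeros((1, 64, 1, 1), dtype=np.float32)
    y = batchnorm_forward(x, gamma, beta)

The low arithmetic intensity of these two passes is what makes the kernel bandwidth bound; fusing it with an adjacent layer, as the poster proposes, lets the data stay resident in the CPU's caches instead of being re-read from main memory.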
Best Poster Finalist (BP): no
Poster: PDF
Poster summary: PDF
Reproducibility Description Appendix: PDF