Tensorfolding: Improving Convolutional Neural Network Performance with Fused Microkernels
Time: Thursday, November 15th, 8:30am - 5pm
Description: Convolution layers are prevalent in many classes of deep neural networks, including Convolutional Neural Networks (CNNs), which provide state-of-the-art results for tasks such as image recognition, neural machine translation, and speech recognition. In recent years, several techniques have been developed to improve the generalization capabilities of neural networks; the most prominent and successful is batch normalization. In deep neural network training, the batch normalization layer is a memory-bandwidth-bound kernel, and on the latest Intel Skylake-based Xeon processors a significant portion of execution time is spent in it. By leveraging the CPU's large caches and its latency-optimized execution model, we reduce this kernel's cost to a bare minimum, improving forward-pass layer runtimes by 21% compared to an unfused implementation and by 2% compared to a fused implementation.
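To see why batch normalization is memory-bandwidth bound, consider a minimal NumPy sketch of its forward pass (an illustration only, not the talk's actual fused-microkernel implementation; the function name and shapes here are assumptions):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: activations of shape (N, C, H, W); statistics are per channel.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Every element of x is streamed through memory multiple times
    # (mean pass, variance pass, normalize pass) with only a few
    # floating-point operations per byte, so the kernel's runtime is
    # dominated by memory bandwidth rather than arithmetic.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 16, 32, 32).astype(np.float32)
y = batchnorm_forward(x, np.ones(16, np.float32), np.zeros(16, np.float32))
```

Fusing this computation into the adjacent convolution kernel, as the talk describes, lets the normalization reuse activations while they are still resident in cache instead of making extra round trips to memory.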