Introduction

In comparison to time-consuming training of convolutional neural networks (CNNs), inference is a relatively lightweight operation making it amenable to execution on mobile devices. Nevertheless, compute and energy efficiency are crucial to allow for complex models and prolonged battery life. Addressing the aforementioned challenges, we propose FeatherCNN - a fast inference library for ARM CPUs - targeting the performance ceiling of mobile devices. FeatherCNN tries to improve the efficiency of inference computation on ARM-based multi-core and many-core architectures using a highly efficient TensorGEMM (generalized matrix multiplication) routine to accelerate Winograd convolution on ARM CPUs. Experimental results reveal that, forward computation for VGG-16 on a server with 64 ARM A72 cores, FeatherCNN can scale up to 64 cores with an parallel efficiency of 27%, and achieve 48x, 14x and 12x speedup over Caffe/OpenBlas, Caffe2/OpenIgum and Caffe2-NPAPACK, respectively. The source code of FeatherCNN library is publicly available at https://github.com/tencent/feathercdn.

Methods

We first introduce the symbol definitions of Convolutional Neural Networks(CNN) and refresh the basics for the Winograd algorithm, followed by three methods on optimizing fast inference computation on ARM architecture. Finally, FeatherCNN is developed with these technologies and experimental result confirms its superiorities to current state-of-the-art libraries.

1. Convolutional Neural Networks and Winograd Algorithm

A convolutional network layer receives a batch of N image data with C channels of size HW, a bank of K kernels with C channels of size R*S, and outputs K channels of size ExF. The computation of CNN is defined as:

\[ S_{b,i,j,k} = \sum_{r,s=1}^{R,S} D_{b,i+r,j+s,k} \]  

where \( D \in \mathbb{R}^N \times C \times K, X = [H \times W] \times C, \) and \( K \) is the kernel size. Winograd algorithm[1] is proposed by Andrew Lavin in 2016. There are several 2D algorithms, without loss of generality, we specifically address the optimization problem for \( F(2x2, 3x3) \).

The general 2D formula for \( F(2x2, 3x3) \) is as follows:

\[ a = \begin{bmatrix} A^{(2x2)} \end{bmatrix} \circ \begin{bmatrix} b \end{bmatrix} = A^{(2x2)} \circ b = A^{(2x2)} \circ \begin{bmatrix} b \end{bmatrix} \]  

Where \( B, G, A \) are matrices with fixed values, \( g \) is kernel, and \( d \) is the 4x4 blocks extracted from input images. Figure 1 (a) illustrates the basic workflow of convolutional Winograd algorithm. The algorithm first slide over the image to compute the input transform, and then perform element-wise multiplication with the transformed kernel. The results goes through the output transform to produce final 2x2 output. The HW input image requires \((H-2)/2 \times (W-2)/2 \) transforms. Denote \( d \) as the index of each 4x4 block, where \( 0 \leq d < (H-2)/2 \times (W-2)/2 \). Then formula 1 can be rewritten as:

\[ S_{b,i,j,k} = \sum_{r,s=1}^{R,S} D_{b,i+r,j+s,k} \]  

Let \( M_d = \sum_{c=1}^{C} A_{c} \circ V_{c,d} \), and \( A_{c} \) \( \sim \) \( V_{c,d} \) are 4x4 matrices. Then

\[ M_d = \sum_{c=1}^{C} A_{c} \circ V_{c,d} = \begin{bmatrix} c \end{bmatrix} \circ \begin{bmatrix} d \end{bmatrix} \]  

That means, \( M_d \) can be computed with 16 matrix multiplications.

2. Optimization methods on ARM architecture

We propose a reformulated Winograd algorithm by embedding a specialized variant of inner tensor product computation – namely TensorGEMM – at its core. Our approach is based on the following three key points:

1. Embedding of TensorGEMM: Winograd algorithm is reformulated by embedding TensorGEMM at its core - a memory-aware linear algebra primitive for the efficient computation of single precision tensor-valued inner products.

2. Reduction of Memory Movement: The tensors are agglomerate before TensorGEMM is performed instead of scanning the data block generated by the input transform to 8 matrices. This approach reduces the data movement stride by a factor of 16 regardless of the specific processor instructions. Note that the same strategy can be applied to data reorganization before the output transform.

3. Improvement of TensorGEMM Efficiency: Our TensorGEMM routine is optimized with register blocking to achieve the maximal compute-to-memory access ratio. Internal and external packing is further used to minimize the overhead of data placement incurred during TensorGEMM.

2.1 Embedding Winograd Convolutions with TensorGEMM Primitives

Tensors are higher-order rank generalizations of linear maps storing data over a regular grid of arbitrary dimension. As an example, rank zero tensors are scalars, rank one tensors vectors, rank two tensors matrices. Inner products of two tensors \( A \) and \( B \) with compatible shape can be interpreted as their pointwise Hadamard product \( \odot \) and subsequent sum-reduction over a set of shared rank identifiers.

TensorGEMM shall be defined as inner product of two rank three tensors \( A_{ijkl} \) and \( B_{imnl} \) with point-wise multiplication of entries can be accomplished after (virtually) broadcasting the tensors to their corresponding ranks \( A_{ijkl} \odot B_{imnl} \) for all \( i \) and \( k \).\( A_{ijkl} \odot B_{imnl} \) for all \( i \) and \( k \) and are the same in order to guarantee comparable shapes. As shown before, TensorGEMM can rewrite formula 2 as a cascade of rank two inner products (batched ordinary matrix multiplications):

\[ C_{ijkl} = (A_{ijkl} \odot B_{imnl}) = \sum_{p=1}^{P} A_{ijkl}^{p} \odot B_{imnl}^{p} = \sum_{p=1}^{P} \left( \sum_{l,m,n} A_{ijkl}^{p} B_{imnl}^{p} \right) \]  

The winograd index is defined as \( 0 \leq i \leq 5 \) in case of \( F(2x2,3x3) \) with \( L \) lines in vector format. Hence, our contribution rewrites inner operations over \( C \) input color channels can be accumulated in an inner loop over \( c \) in a vector register: \( C_{ijkl}^c = C_{ijkl}^{c-1} + A_{ijkl} \odot B_{imnl}^{c} \), where \( P = (0,...,1,...) \) is a sequence of lane identifiers. If \( L=1 \) we have to perform multiple no-warp-up passes.

2.2 Reduction of Memory Movement on Input/Output Transformations:

As mentioned before, we perform \( p \times 4 \) no-warp-up passes to the fixed-length subsequence on the ARM architecture to propagate a convolutional layer of type \( F(2x2,3x3) \). Each pass computes a part of these tensors, which are denoted in same color in Figure 2. We use 4 vectors to store a 4x4 tile each holding a row. The vectors are subsequently serialized into an intermediate buffer for repetitive use in the tensor-valued inner product computations. We have carefully designed its memory layout which agglomerates the tensors in order to minimize data movements, while also ensure high access efficiency in the subsequent TensorGEMM.

In this progress, input tensor and TensorGEMM can be equivalently optimized with memory layout packing strategies to avoid data scanning and minimize interleaved distance in Tensors. These two memory layout packing strategies are denoted as internal packing and external packing. The efficiency of the two packing approaches depends on the size of the input images. For large images, the internal packing method is based on caching blocks in superior. The external packing approach causes significantly more cache misses as it writes a row-major buffer by column. In case input images are small enough to fit into cache, the writing overhead of the external packing approach increases only marginally, and therefore achieves better performance.

Conclusions

FeatherCNN is a fast inference computation library developed with the above optimization methods. An experiment is conducted to evaluate its performance with state-of-the-art libraries on VGG-16, the test platform is equipped with dual Huawei 32-core Cortex-A72 processor and 256GB memory. The 64-thread results in Figure 4(a) reveal that FeatherCNN achieves 48x, 14x and 12x speedup over Caffe/OpenBlas, Caffe2/OpenIgum and Caffe2-NPAPACK, respectively. Strong scale test evaluates thread scalability as well as efficiency for FeatherCNN on the VGG-16 neural network. Figure 4 (b) illustrated that a percentage of time usage on PoolingLayer and FC/Layer after 8 cores increase dramatically because the two layers are pretty memory-intensive, and thus less scalable. To conclude, FeatherCNN run scale to 64 cores with an efficiency of 27%

References: