Description

We propose a single-FPGA accelerator for ultra-low-latency inference on ImageNet. The design completes inference of a binarized AlexNet within 29 µs, with accuracy comparable to other BNN implementations. We achieve this performance through the following contributions:

1. We completely remove floating-point arithmetic from the network through layer fusion.

2. By using model parallelism rather than data parallelism, we can configure all layers and their control-flow graphs simultaneously. The design is also flexible enough to achieve nearly perfect load balancing, leading to extremely high resource utilization.

3. All convolution layers are fused and processed in parallel through inter-layer pipelining. Once the pipeline is full, the incremental latency is therefore just the delay of a single convolution layer plus that of the FC layers. Note that the dependency pattern of the FC layers prevents them from being integrated into the current pipeline.
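As a rough illustration of the latency argument in contribution 3, the sketch below compares a sequential execution model against a full inter-layer pipeline in steady state. The per-stage delays are hypothetical placeholders, not measured numbers from the design:

```python
# Toy latency model for inter-layer pipelining.
# Stage delays are hypothetical (arbitrary time units), not measured values.

conv_delays = [5, 5, 5, 5, 5]   # delay of each fused convolution stage
fc_delay = 3                    # delay of the FC layers (outside the pipeline)

# Without pipelining, each image traverses every stage in turn.
sequential_latency = sum(conv_delays) + fc_delay

# With inter-layer pipelining and balanced stages, once the pipeline is full
# a new result leaves the convolution pipeline every max(conv_delays) time
# units, so the incremental latency per image is one conv-stage delay plus
# the FC layers that cannot join the pipeline.
steady_state_latency = max(conv_delays) + fc_delay

print(sequential_latency)    # 28
print(steady_state_latency)  # 8
```

With nearly perfect load balancing, `max(conv_delays)` approaches the average stage delay, which is what makes the single-stage-plus-FC latency figure achievable.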