Abstract
Deep learning revolutionized the field of computer vision when convolutional neural networks (CNNs) solved complex vision problems, spurring promising developments in artificial intelligence (AI) research. This progress has drawn the hardware community toward accommodating the growing computational demands of state-of-the-art deep CNNs; coupled with the diminishing performance gains of general-purpose architectures, this has fueled the need for specialized, scalable hardware accelerator designs for deep CNNs. Moreover, Deep Separable Convolutional Neural Networks (DSCNNs) have become an emerging paradigm in computer vision, offering modular networks with structural sparsity that achieve higher accuracy with relatively fewer operations and parameters. However, there is a lack of customized architectures that provide flexible solutions fitting the sparsity of DSCNNs. Domain-specific accelerators must satisfy two requirements: (1) execution of DSCNN models with low latency, high throughput, and high efficiency, and (2) flexibility to accommodate evolving state-of-the-art models, such as the EfficientNet family, without costly silicon updates. On this front, state-of-the-art GPUs tend to be too power-hungry, and ASICs are too inflexible. This is where FPGAs shine: their architectural reconfigurability, support for custom datatypes, ability to exploit irregular parallelism, power efficiency, and low latency extend their usability to real-time applications. This work proposes DeepDive, a fully functional, vertical hardware-software co-design architecture for the power-efficient implementation of DSCNNs on both edge and cloud FPGA platforms.
DeepDive applies two different architectural principles to its edge and cloud implementations: the former follows a latency-oriented design, whereas the latter follows a throughput-oriented design, and each is built to fully support DSCNNs with various convolutional operators interconnected with structural sparsity. Both accelerator designs introduce parameterized, configurable, and scalable compute units that can be tuned to user-specific requirements: the target hardware, the degree of parallelism required, and the DSCNN family chosen for inference. The accelerators were implemented using the Xilinx Vitis HLS 2019.2 tool. Execution results for the DeepDive-Edge accelerator on the Xilinx ZCU102 edge FPGA demonstrate 233.3 FPS/Watt for a compact version of EfficientNet as the state-of-the-art DSCNN, improving FPS/Watt by 2.2x and 1.51x over the Jetson Nano high- and low-power modes, respectively. The DeepDive-Cloud accelerator achieves 87 FPS on the Xilinx Alveo U50 with a power efficiency of 7.25 FPS/Watt for the baseline version of EfficientNet.
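The reduction in operations and parameters that DSCNNs exploit can be sketched with a simple cost model: a depthwise separable layer factors a standard k x k convolution into a per-channel depthwise convolution followed by a 1x1 pointwise convolution. The minimal sketch below illustrates this arithmetic only; the function names and the example layer sizes are hypothetical and are not taken from DeepDive or EfficientNet.

```python
# Illustrative cost model: parameter and multiply-accumulate (MAC) counts
# for a standard convolution vs. a depthwise separable convolution.
# All names and sizes here are hypothetical examples.

def standard_conv_costs(k, c_in, c_out, h_out, w_out):
    """Costs of a standard k x k convolution over a h_out x w_out output."""
    params = k * k * c_in * c_out
    macs = params * h_out * w_out
    return params, macs

def separable_conv_costs(k, c_in, c_out, h_out, w_out):
    """Costs of a depthwise k x k conv followed by a 1x1 pointwise conv."""
    dw_params = k * k * c_in   # one k x k filter per input channel
    pw_params = c_in * c_out   # 1x1 convolution mixing channels
    macs = (dw_params + pw_params) * h_out * w_out
    return dw_params + pw_params, macs

# Example: a 3x3 layer mapping 32 -> 64 channels on a 56x56 feature map.
std_p, std_m = standard_conv_costs(3, 32, 64, 56, 56)
sep_p, sep_m = separable_conv_costs(3, 32, 64, 56, 56)
print(f"standard:  {std_p} params, {std_m} MACs")   # 18432 params
print(f"separable: {sep_p} params, {sep_m} MACs")   # 2336 params
print(f"parameter reduction: {std_p / sep_p:.1f}x") # about 7.9x
```

For this example layer, the separable form needs roughly 7.9x fewer parameters and MACs, which is the structural sparsity the DeepDive compute units are designed around.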