Abstract
The development of accelerators for deep learning has been gaining popularity due to the sheer amount of computation performed by deep learning algorithms. Since the breakdown of Moore's law, it has become difficult to improve the performance of general-purpose processors, and computer architects are therefore inclining towards heterogeneous solutions for accelerating deep learning applications efficiently. Moreover, the computation performed in deep learning applications is repetitive and predictable, which naturally leads to three choices: \emph{ASICs}, \emph{GPUs}, and \emph{FPGAs}. \emph{FPGAs}, owing to their configurability, deep pipelining abilities, and high performance per watt, have been one of the favorite devices for accelerator architecture research. However, \emph{FPGAs} are difficult to program, and there has thus been a rise in the development of reusable accelerator templates that can be instantiated even by software developers.

Memory has always been the main bottleneck, even for architectures with the most efficient compute datapath. This problem is further compounded because \emph{FPGAs} have a low on-chip memory footprint (in the form of BRAMs), whereas most deep learning applications have a very large model size (e.g., AlexNet has a model size of over 100 MB). Thus, accelerating deep learning applications requires memory systems developed to support them. Conventional accelerators try to mitigate this issue by accelerating a single layer at a time, sequentially, which has its own implications such as bandwidth wastage and power consumption. Since the presented work serves streaming accelerators, a separate strategy has to be developed.

This work presents the development of such a memory management system, called NURO-RAM. NURO-RAM uses minimally sized prefetch buffers and a static weight scheduler to support the deep learning accelerator AWARE-DNN. This work implements three different networks, AlexNet, Shallow MobileNet, and Tiny Darknet, to show the versatility of NURO-RAM in serving streaming accelerators like AWARE-DNN.

We then compared the NURO-AWARE solution (AWARE-DNN implemented with the support of the NURO-RAM memory system) against Chai DNN, an HLS-based deep learning accelerator library, and the NVIDIA Xavier mobile \emph{GPU}. The proposed solution consumes less power: 4.5 W (NURO-AWARE) vs. 10 W (Chai DNN); against the \emph{GPU} the power consumption is comparable, with the NVIDIA Xavier consuming 5.7 W. The presented work also consumes less BRAM, both for AlexNet (75% in NURO-AWARE vs. 88% in Chai DNN) and for Tiny Darknet (48% in NURO-AWARE vs. 88% in Chai DNN). The lower BRAM utilization than the state-of-the-art architecture, together with the distinct utilization for each of the three networks, shows that the NURO-AWARE architecture is resource-aware as well as application-aware. The lower power consumption and resource utilization of the presented work can be attributed to the custom datapath used by the AWARE-DNN accelerator and to the custom per-layer memory access path developed by the presented solution, which reduces off-chip memory accesses. On the performance metric, the presented work beats Chai DNN, which can sustain 10.21 FPS with fully connected layers, whereas NURO-AWARE can sustain 30 FPS, and even the Xavier \emph{GPU}, which sustains 2 FPS; this is because the presented solution uses its architectural knobs to satisfy the real-time frame-rate requirement. Overall, the presented solution beats the \emph{GPU} as well as the \emph{FPGA}-based state-of-the-art solution in performance per watt.
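As a back-of-the-envelope check, the performance-per-watt claim follows directly from the figures quoted above (assuming the FPS and power numbers refer to the same workload):

\[
\underbrace{\frac{30\ \text{FPS}}{4.5\ \text{W}} \approx 6.7\ \tfrac{\text{FPS}}{\text{W}}}_{\text{NURO-AWARE}}
\;>\;
\underbrace{\frac{10.21\ \text{FPS}}{10\ \text{W}} \approx 1.0\ \tfrac{\text{FPS}}{\text{W}}}_{\text{Chai DNN}}
\;>\;
\underbrace{\frac{2\ \text{FPS}}{5.7\ \text{W}} \approx 0.35\ \tfrac{\text{FPS}}{\text{W}}}_{\text{NVIDIA Xavier}}.
\]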
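The abstract does not detail NURO-RAM's internals; purely as a minimal illustrative sketch of what minimally sized prefetch buffers in front of a streaming datapath typically look like, the C++ fragment below shows a generic double-buffered (ping-pong) weight prefetch loop. All identifiers (\texttt{TILE}, \texttt{fetch\_tile}, \texttt{compute\_tile}) are hypothetical and are not part of NURO-RAM or AWARE-DNN.

\begin{verbatim}
// Illustrative sketch only: a generic double-buffered ("ping-pong")
// weight prefetch; all names here are hypothetical, not NURO-RAM's API.
#include <cstddef>
#include <cstring>

constexpr std::size_t TILE = 256;            // assumed words per prefetch

// Stand-in for a burst read from off-chip DRAM into an on-chip buffer.
void fetch_tile(const float* dram, std::size_t t, float* bram) {
    std::memcpy(bram, dram + t * TILE, TILE * sizeof(float));
}

// Stand-in for the streaming datapath consuming one tile of weights.
float compute_tile(const float* bram, float acc) {
    for (std::size_t i = 0; i < TILE; ++i) acc += bram[i];
    return acc;
}

float run_layer(const float* weights_dram, std::size_t n_tiles) {
    float ping[TILE], pong[TILE];            // two minimum-sized buffers
    float acc = 0.0f;
    fetch_tile(weights_dram, 0, ping);       // prime the first buffer
    for (std::size_t t = 0; t < n_tiles; ++t) {
        float* cur = (t % 2 == 0) ? ping : pong;
        float* nxt = (t % 2 == 0) ? pong : ping;
        if (t + 1 < n_tiles)                      // prefetch next tile; on an
            fetch_tile(weights_dram, t + 1, nxt); // FPGA this overlaps compute
        acc = compute_tile(cur, acc);
    }
    return acc;
}
\end{verbatim}

On an actual \emph{FPGA}, the prefetch and compute of consecutive tiles would run concurrently, so keeping only two tile-sized buffers on chip hides off-chip latency while minimizing BRAM usage.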