Abstract
The availability of OpenCL for FPGAs, along with High-Level Synthesis tools, has made FPGAs an attractive platform for implementing compute-intensive, massively parallel applications. With their customizable data-path, deep pipelining abilities and enhanced power efficiency, FPGAs offer the most viable solutions for programming and integration with heterogeneous platforms. However, OpenCL for FPGAs raises many design challenges that require an in-depth understanding in order to better utilize the devices' enormous capabilities. Inefficient routing of data, a high number of memory stalls exposed to execution, and under-utilization of FPGA resources are significant execution bottlenecks that overshadow the advantages of data-path customization. Furthermore, leveraging OpenCL's parallelism abilities and throughput-oriented principles is paramount to the success of FPGAs in the high-performance computing environment.
In this research, we identify, analyze and categorize the architectural differences between the OpenCL parallel programming model and FPGA execution semantics. We propose a generic taxonomy for classifying FPGA parallelism potential so that it can be exploited to the fullest. To benefit from massive thread-level parallelism, we introduce a unique LLVM-based automation tool that decouples memory access from computation, thereby hiding memory stalls from the execution path. We further present a novel parallelism granularity that splits kernels into a data-path and a memory-path (memory read/write), which work concurrently to overlap the computation of current threads with the memory access of future threads. We validate these principles on the Xilinx-based AWS Cloud FPGA platform.
We then conduct a thorough investigation into the scalability of OpenCL coarse-grain parallelism, as well as an examination of Compute Unit (CU) replication, Double Data Rate (DDR) and Burst Transfer (BT) optimizations on Cloud FPGAs.
To address the programming challenges, we present generic templates and a front-end design tool to aid the programmer in rapid exploration and testing. Overall, this dissertation is an amalgamation of principles and techniques to improve the performance and programmability of OpenCL on FPGAs when running massively parallel applications on the 'Edge' as well as in the 'Cloud'.