Abstract

Field Programmable Gate Arrays (FPGAs) are becoming a preferred platform for the high-performance computing community because of their flexibility to adapt to new computing challenges. FPGAs also provide a more power-efficient alternative to GPUs thanks to their customizable data paths and deep pipelining capability. Using high-level synthesis (HLS) and optimization tools, we can achieve performance comparable to, or better than, that of CPUs and GPUs. OpenCL is the standard programming language for general-purpose parallel programming of heterogeneous systems, and its availability has enabled high-performance execution of massively parallel applications. OpenCL-HLS for FPGAs enables programmers to explore various software optimizations alongside enhanced hardware capabilities. We introduce a novel approach to study the scalability of OpenCL coarse-grain parallelism, Compute Unit (CU) replication, on cloud FPGAs. This work demonstrates that for every application there is an optimum number of CUs that achieves the maximum performance benefit with higher memory bandwidth utilization and optimal use of FPGA resources. We also provide a generic source-code template and a front-end design exploration tool to explore and identify the optimum CU count for a given application. For evaluation we used the Xilinx SDAccel 2017.4 synthesis toolchain, an integrated development environment for FPGAs; on the hardware side, the AWS cloud-based Xilinx VU9P FPGA was employed. This project was funded by the Xilinx University Program (XUP). Our experimental results on a mix of 15 applications taken from the Xilinx benchmark suite vs2017.4 and the Rodinia Benchmark Suite vs3.1 show an average speedup of 6.4× and an average bandwidth utilization improvement of 3.4× over the baseline. Further to this, a mere 8% average resource utilization and a 1.33× power overhead were reported. Our tool also yields a 31% improvement in total design synthesis time for an illustrative Histogram application. Xilinx SDAccel-based ‘DDR’ and ‘burst transfer’ optimizations were also explored to improve bandwidth and performance; these optimizations help data-hungry applications for which bandwidth is the major bottleneck. Combining CU replication with the DDR optimization, we achieved a 7.5× speedup for the Largeloop OCL application (from the SDAccel benchmark suite). In addition, we address the memory wall and hide memory latency by using OpenCL pipes: an application is split into ‘read’, ‘compute’, and ‘write back’ sub-kernels that work concurrently. Results on seven massively parallel applications show an average speedup of 5.2× with a 2.2× bandwidth improvement on cloud FPGAs.
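The CU-replication flow outlined above is driven mainly by the SDAccel link step rather than by kernel source changes. As a minimal sketch (the kernel name vadd, the chunk-based partitioning, and the CU count of 4 below are illustrative assumptions, not taken from this work), a kernel written so that each invocation processes an independent chunk can be replicated at link time:

    /* vadd.cl -- minimal OpenCL C kernel; each enqueued task processes one chunk */
    __kernel void vadd(__global const int *a,
                       __global const int *b,
                       __global int *c,
                       const int chunk_size)
    {
        for (int i = 0; i < chunk_size; i++)
            c[i] = a[i] + b[i];
    }

The number of compute units is then requested with the SDAccel linker option, e.g. xocc -l --nk vadd:4, and the host enqueues one task per chunk so the runtime can dispatch the tasks across the replicated CUs. Sweeping that CU count while re-measuring performance, bandwidth, and resource usage is essentially what the source-code template and front-end exploration tool automate.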
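The pipe-based read/compute/write-back decomposition can be sketched as below. This is a minimal illustration only: the kernel and pipe names, the pipe depth of 32, and the doubling computation are invented for the example, and it assumes the Xilinx blocking pipe calls (read_pipe_block / write_pipe_block) and the xcl_reqd_pipe_depth attribute available in SDAccel OpenCL C.

    /* Pipes are declared at program scope; the depth of 32 is an assumption. */
    pipe int p_in  __attribute__((xcl_reqd_pipe_depth(32)));
    pipe int p_out __attribute__((xcl_reqd_pipe_depth(32)));

    /* 'read' sub-kernel: streams input from global memory into p_in */
    __kernel void read_stage(__global const int *in, const int n)
    {
        for (int i = 0; i < n; i++) {
            int v = in[i];
            write_pipe_block(p_in, &v);
        }
    }

    /* 'compute' sub-kernel: consumes p_in, produces p_out */
    __kernel void compute_stage(const int n)
    {
        for (int i = 0; i < n; i++) {
            int v;
            read_pipe_block(p_in, &v);
            v = 2 * v;                     /* placeholder computation */
            write_pipe_block(p_out, &v);
        }
    }

    /* 'write back' sub-kernel: drains p_out to global memory */
    __kernel void write_stage(__global int *out, const int n)
    {
        for (int i = 0; i < n; i++) {
            int v;
            read_pipe_block(p_out, &v);
            out[i] = v;
        }
    }

Because the three sub-kernels run concurrently and exchange data only through on-chip pipes, global-memory reads, computation, and write-backs overlap, which is how the approach hides memory latency.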
