Generating Intel OpenCL Design
Author: Jie Wang (jiewang@cs.ucla.edu)
AutoSA can generate systolic arrays in Intel OpenCL. This page shows an example about generating a systolic array design for Intel FPGAs.
Note
The Intel OpenCL back-end is not performant currently due to the channel overheads and may halt on-board for certain test cases. This back-end is provided only for demo purpose. Please consider Xilinx or Catapult back-end for stable use.
Generating the Design
The design example used by this tutorial is at ${AUTOSA_ROOT}/autosa_tests/mm_intel.
Run the following command to generate the systolic array.
./autosa ./autosa_tests/mm_intel/kernel.c \
--config=./autosa_config/autosa_config.json \
--target=autosa_opencl \
--output-dir=./autosa.tmp/output \
--sa-sizes="{kernel[]->space_time[3];kernel[]->array_part[16,16,16];kernel[]->array_part_L2[2,2,2];kernel[]->latency[8,8];kernel[]->simd[2]}" \
--simd-info=./autosa_tests/mm_intel/simd_info.json \
--host-serialize \
--loop-infinitize \
--double-buffer-style=0 \
--mem-port-map="{kernel[]->A[0];kernel[]->B[1];kernel[]->C[2]}"
After compilation, you will find the generated designs under the directory
${AUTOSA_ROOT}/autosa.tmp/output/src.
We also provide an example Makefile for testing the design. Copy it to the design directory.
cp ${AUTOSA_ROOT}/autosa_tests/mm_intel/Makefile ${AUTOSA_ROOT}/autosa.tmp/output/
You may modify the Makefile based on your target FPGA board or use your own Makefile. In the example Makfile, we target the Intel Stratix 10 board with HBM memory.
AOCL_BOARD ?= s10mx_hbm_es
Set up your local Intel OpenCL SDK environment. Make sure the environment variable
INTELFPGAOCLSDKROOT is set properly. Then, to perform software emulation, run:
make sw_emu_check
The design will be compiled and simulated on CPU. You should be able to see the following information printed on your terminal.
AOCX file: kernel_sw_emu.aocx
FPGA Time: 0.146633 s
Host Time: 0.14696 s
Passed!
which shows the design is successfully compiled and the simulation passed successfully.
To synthesize the design to RTL, run:
make hls
The design will be synthesized to RTL. This process will take some time to finish.
Intel OpenCL SDK generates the detailed hardware information in HTML format, which
can be found at ${AUTOSA_ROOT}/autosa.tmp/output/bin/kernel/reports.
Lastly, to generate the bitstream, run:
make hw
More Details
Compared to generating Xilinx HLS designs, when generating the Intel OpenCL code, we add the following three arguments to the compilation command.
--loop-infinitize: Xilinx HLS requires the loops to be bounded. Such a limitation is
no longer required for Intel OpenCL. Loops can be eliminated if possible as the function can be
run infinitely. Performing loop infitinization will eliminate the unnecessary outer loops
in each function to reduce the hardware overheads.
--double-buffer-style=0: When generating the double buffer logic, by default,
we will generate the ping-pong logic explicitly as you may see in the Xilinx HLS code as below.
// outer loops
for (...)
for (...) {
// double buffer logic
if (arb == 0) {
func1(ping_array);
func2(pong_array);
} else if (arb == 1) {
func1(pong_array);
func2(ping_array);
}
}
However, such a coding style no longer works in Intel OpenCL design as Intel OpenCL SDK
lacks the ability to identify that func1 and func2 can be executed in parallel.
As a temporary solution, we will modify this coding style by inlining the function contents of
func1 and func2 directly. By setting --double-buffer-style=0, we will generate the
functional double buffering logic for Intel OpenCL. The generated logic looks like below:
while (1) {
if (func1_en) {
// func1 logic
...
}
if (func2_en) {
// func2 logic
...
}
}
--mem-port-map="{kernel[]->A[0];kernel[]->B[1];kernel[]->C[2]}":
As the target FPGA board is equipped with HBM memory, we may assign the global pointer to
different HBM banks. In Xilinx Vitis flow, we will write a separate configuration file
to map global pointers to different banks. However, in Intel flow, we will need to code it
explicitly in the OpenCL kernel code. This arugment is optional. It maps the global pointers
A, B, and C to bank 0, 1, and 2. You should find the following code in the OpenCL code.
__kernel void A_IO_L3_in_serialize(__global volatile __attribute__((buffer_location("HBM0"))) A_t16 *restrict A)
in which we use the __attribute__((buffer_location("HBM0"))) to assign the pointer A to the bank HBM0.