Added GPU+FPGA Accelerated Application Example

c303f8a7 · Filip Vaverka · Jan Siwiec · d1d9f293 · c303f8a7
Commit c303f8a7 authored 1 year ago by Filip Vaverka Committed by Jan Siwiec 1 year ago
--- a/docs.it4i/cs/guides/xilinx.md
+++ b/docs.it4i/cs/guides/xilinx.md
@@ -226,11 +226,7 @@ Number of devices:                               2

 which shows that both `Xilinx` platform and accelerator devices are present.

-## Building Applications for Emulation
-
-The two main approaches to building FPGA accelerated applications using Xilinx platform and HLS are **XRT** and **OpenCL**.
-The XRT uses dedicated host interface and dialect of C for accelerated kernels,
-while OpenCL allows to share most of the host interface with other accelerators.
+## Building Applications

 To simplify the build process we define two environment variables `IT4I_PLATFORM` and `IT4I_BUILD_MODE`.
 The first `IT4I_PLATFORM` denotes specific accelerator hardware such as `Alveo u250` or `Alveo u280`
@@ -278,13 +274,45 @@ The `IT4I_BUILD_MODE` is used to specify build type (`hw`, `hw_emu` and `sw_emu`
 - `hw_emu` allows to run both synthesis and emulation for debugging
 - `sw_emu` compiles kernels only for emulation (doesn't require accelerator and allows much faster build)

-For example, to configure software emulation mode build for `Alveo u280` we set:
+For example to configure build for `Alveo u280` we set:

 ```console
 $ export IT4I_PLATFORM=xilinx_u280_gen3x16_xdma_1_202211_1
+```
+
+### Software Emulation Mode
+
+The software emulation mode is preferable for development as HLS synthesis is very time consuming. To build following applications in this mode we set:
+
+```console
 $ export IT4I_BUILD_MODE=sw_emu
 ```

+and run each application with `XCL_EMULATION_MODE` set to `sw_emu`:
+
+```
+$ XCL_EMULATION_MODE=sw_emu <application>
+```
+
+### Hardware Synthesis Mode
+
+!!! note
+    The HLS of these simple applications **can take up to 2 hours** to finish.
+
+To allow the application to utilize real hardware we have to synthetize FPGA design for the accelerator. This can be done by repeating same steps used to build kernels in emulation mode, but with `IT4I_BUILD_MODE` set to `hw` like so:
+```console
+$ export IT4I_BUILD_MODE=hw
+```
+the host application binary can be reused, but it has to be run without `XCL_EMULATION_MODE`:
+```console
+$ <application>
+```
+
+## Sample Applications
+
+The first two samples illustrate two main approaches to building FPGA accelerated applications using Xilinx platform - **XRT** and **OpenCL**. 
+The final example combines **HIP** with **XRT** to show basics necessary to build application, which utilizes both GPU and FPGA accelerators.
+
 ### Using HLS and XRT

 The applications are typically separated into host and accelerator/kernel side.
@@ -417,7 +445,7 @@ $ v++ -l -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM vadd.xo -o vadd.xclbin

 This process should result in `vadd.xclbin`, which can be loaded by host-side application.

-### Running in Emulation Mode
+### Running the Application

 With both host application and kernel binary at hand the application (in emulation mode) can be launched as

@@ -425,6 +453,12 @@ With both host application and kernel binary at hand the application (in emulati
 $ XCL_EMULATION_MODE=sw_emu ./host vadd.xclbin
 ```

+or with real hardware (having compiled kernels with `IT4I_BUILD_MODE=hw`)
+
+```console
+./host vadd.xclbin
+```
+
 ## Using HLS and OpenCL

 The host-side application code should be saved as `host.cpp`.
@@ -617,7 +651,7 @@ $ v++ -p vadd.link.xclbin -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM -o vadd.

 This process should result in `vadd.xclbin`, which can be loaded by host-side application.

-### Running in Emulation Mode
+### Running the Application

 With both host application and kernel binary at hand the application (in emulation mode) can be launched as

@@ -625,23 +659,210 @@ With both host application and kernel binary at hand the application (in emulati
 $ XCL_EMULATION_MODE=sw_emu ./host vadd.xclbin
 ```

-## Building Application for Real HW
+or with real hardware (having compiled kernels with `IT4I_BUILD_MODE=hw`)
+
+```console
+./host vadd.xclbin
+```

-So far we have assumed software emulation (`sw_emu`), however the same steps can be used to build application for real hardware.
-To do so we have to rebuild our kernel binaries in `hw` mode by setting
+## Hybrid GPU and FPGA Application (HIP+XRT)
+
+This simple 8-bit quantized dot product (`R = sum(X[i]*Y[i])`) example illustrates basic approach to utilize both GPU and FPGA accelerators in a single application. 
+The application takes the simplest approach, where both synchronization and data transfers are handled explicitly by the host. 
+The HIP toolchain is used to compile the single source host/GPU code as usual, but it is also linked with XRT runtime, which allows host to control the FPGA accelerator. 
+The FPGA kernels are built separately as in previous examples.
+
+The host/GPU HIP code should be saved as `main.hip`
+
+```c++
+#include <iostream>
+#include <vector>
+
+#include "xrt/xrt_bo.h"
+#include "experimental/xrt_xclbin.h"
+#include "xrt/xrt_device.h"
+#include "xrt/xrt_kernel.h"
+#include "hip/hip_runtime.h"
+
+const size_t DATA_SIZE = 1024;
+
+float compute_reference(const float *srcX, const float *srcY, size_t count);
+
+__global__ void quantize(int8_t *out, const float *in, size_t count)
+{
+    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
+
+    for(size_t i = idx; i < count; i += blockDim.x * gridDim.x)
+        out[i] = int8_t(in[i] * 127);
+}
+
+__global__ void dequantize(float *out, const int16_t *in, size_t count)
+{
+    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
+
+    for(size_t i = idx; i < count; i += blockDim.x * gridDim.x)
+        out[i] = float(in[i] / float(127*127));
+}
+
+int main(int argc, char *argv[])
+{
+    if(argc != 2)
+    {
+        std::cout << "Usage: " << argv[0] << " <XCLBIN File>" << std::endl;
+        return EXIT_FAILURE;
+    }
+
+    // Prepare experiment data
+    std::vector<float> srcX(DATA_SIZE);
+    std::vector<float> srcY(DATA_SIZE);
+    float outR = 0.0f;
+
+    for(size_t i = 0; i < DATA_SIZE; ++i)
+    {
+        srcX[i] = float(rand()) / float(RAND_MAX);
+        srcY[i] = float(rand()) / float(RAND_MAX);
+        outR += srcX[i] * srcY[i];
+    }
+
+    float outR_quant = compute_reference(srcX.data(), srcY.data(), DATA_SIZE);
+
+    std::cout << "REFERENCE: " << outR_quant << " (" << outR << ")" << std::endl;
+
+    // Initialize XRT (FPGA device), load kernels binary and create kernel object
+    xrt::device device(0);
+    std::cout << "Loading xclbin file " << argv[1] << std::endl;
+    xrt::uuid xclbinId = device.load_xclbin(argv[1]);
+    xrt::kernel mulKernel(device, xclbinId, "multiply", xrt::kernel::cu_access_mode::exclusive);
+
+    // Allocate GPU buffers
+    float   *srcX_gpu, *srcY_gpu, *res_gpu;
+    int8_t  *srcX_gpu_quant, *srcY_gpu_quant;
+    int16_t *res_gpu_quant;
+    hipMalloc(&srcX_gpu, DATA_SIZE * sizeof(float));
+    hipMalloc(&srcY_gpu, DATA_SIZE * sizeof(float));
+    hipMalloc(&res_gpu,  DATA_SIZE * sizeof(float));
+    hipMalloc(&srcX_gpu_quant, DATA_SIZE * sizeof(int8_t));
+    hipMalloc(&srcY_gpu_quant, DATA_SIZE * sizeof(int8_t));
+    hipMalloc(&res_gpu_quant,  DATA_SIZE * sizeof(int16_t));
+
+    // Allocate FPGA buffers
+    xrt::bo srcX_fpga_quant(device, DATA_SIZE * sizeof(int8_t), mulKernel.group_id(0));
+    xrt::bo srcY_fpga_quant(device, DATA_SIZE * sizeof(int8_t), mulKernel.group_id(1));
+    xrt::bo res_fpga_quant(device, DATA_SIZE * sizeof(int16_t), mulKernel.group_id(2));
+
+    // Copy experiment data from HOST to GPU
+    hipMemcpy(srcX_gpu, srcX.data(), DATA_SIZE * sizeof(float), hipMemcpyHostToDevice);
+    hipMemcpy(srcY_gpu, srcY.data(), DATA_SIZE * sizeof(float), hipMemcpyHostToDevice);
+
+    // Execute quantization kernels on both input vectors
+    quantize<<<16, 256>>>(srcX_gpu_quant, srcX_gpu, DATA_SIZE);
+    quantize<<<16, 256>>>(srcY_gpu_quant, srcY_gpu, DATA_SIZE);
+
+    // Map FPGA buffers into HOST memory, copy data from GPU to these mapped buffers and synchronize them into FPGA memory
+    hipMemcpy(srcX_fpga_quant.map<int8_t *>(), srcX_gpu_quant, DATA_SIZE * sizeof(int8_t), hipMemcpyDeviceToHost);
+    srcX_fpga_quant.sync(XCL_BO_SYNC_BO_TO_DEVICE);
+    hipMemcpy(srcY_fpga_quant.map<int8_t *>(), srcY_gpu_quant, DATA_SIZE * sizeof(int8_t), hipMemcpyDeviceToHost);
+    srcY_fpga_quant.sync(XCL_BO_SYNC_BO_TO_DEVICE);
+
+    // Execute FPGA kernel (8-bit integer multiplication)
+    auto kernelRun = mulKernel(res_fpga_quant, srcX_fpga_quant, srcY_fpga_quant, DATA_SIZE);
+    kernelRun.wait();
+
+    // Synchronize output FPGA buffer back to HOST and copy its contents to GPU buffer for dequantization
+    res_fpga_quant.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
+    hipMemcpy(res_gpu_quant, res_fpga_quant.map<int16_t *>(), DATA_SIZE * sizeof(int16_t), hipMemcpyDeviceToHost);
+
+    // Dequantize multiplication result on GPU
+    dequantize<<<16, 256>>>(res_gpu, res_gpu_quant, DATA_SIZE);
+
+    // Copy dequantized results from GPU to HOST
+    std::vector<float> res(DATA_SIZE);
+    hipMemcpy(res.data(), res_gpu, DATA_SIZE * sizeof(float), hipMemcpyDeviceToHost);
+
+    // Perform simple sum on CPU
+    float out = 0.0;
+    for(size_t i = 0; i < DATA_SIZE; ++i)
+        out += res[i];
+
+    std::cout << "RESULT: " << out << std::endl;
+
+    hipFree(srcX_gpu);
+    hipFree(srcY_gpu);
+    hipFree(res_gpu);
+    hipFree(srcX_gpu_quant);
+    hipFree(srcY_gpu_quant);
+    hipFree(res_gpu_quant);
+
+    return 0;
+}
+
+float compute_reference(const float *srcX, const float *srcY, size_t count)
+{
+    float out = 0.0f;
+
+    for(size_t i = 0; i < count; ++i)
+    {
+        int16_t quantX(srcX[i] * 127);
+        int16_t quantY(srcY[i] * 127);
+
+        out += float(int16_t(quantX * quantY) / float(127*127));
+    }
+
+    return out;
+}
+```
+
+The host/GPU application can be built using HIPCC as:

 ```console
-$ export IT4I_BUILD_MODE=hw
+$ hipcc -I$XILINX_XRT/include -I$XILINX_VIVADO/include -L$XILINX_XRT/lib -lxrt_coreutil main.hip -o host
+```
+
+The accelerator side (simple vector-multiply kernel) should be saved as `kernels.cpp`.
+
+```c++
+extern "C" {
+    void multiply(
+        short *out,
+        const char *inX,
+        const char *inY,
+        int size)
+    {
+#pragma HLS INTERFACE m_axi port=inX bundle=aximm1
+#pragma HLS INTERFACE m_axi port=inY bundle=aximm2
+#pragma HLS INTERFACE m_axi port=out bundle=aximm1
+        for(int i = 0; i < size; ++i)
+            out[i] = short(inX[i]) * short(inY[i]);
+    }
+}
 ```

-and the application can be run without setting `XCL_EMULATION_MODE`:
+Once again the HLS kernel is build using Vitis `v++` in two steps:

 ```console
-$ ./host vadd.xclbin
+v++ -c -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM -k multiply kernels.cpp -o kernels.xo
+v++ -l -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM kernels.xo -o kernels.xclbin
 ```

-!!! note
-    The HLS of these simple applications **can take up to 2 hours** to finish.
+### Running the Application
+
+In emulation mode (FPGA emulation, GPU HW is required) the application can be launched as:
+
+```console
+$ XCL_EMULATION_MODE=sw_emu ./host kernels.xclbin
+REFERENCE: 256.554 (260.714)
+Loading xclbin file ./kernels.xclbin
+RESULT: 256.554
+```
+
+or, having compiled kernels with `IT4I_BUILD_MODE=hw` set, using real hardware (both FPGA and GPU HW is required)
+
+```console
+$ ./host kernels.xclbin
+REFERENCE: 256.554 (260.714)
+Loading xclbin file ./kernels.xclbin
+RESULT: 256.554
+```

 ## Additional Resources