Skip to content
Snippets Groups Projects
Commit c303f8a7 authored by Filip Vaverka's avatar Filip Vaverka Committed by Jan Siwiec
Browse files

Added GPU+FPGA Accelerated Application Example

parent d1d9f293
No related branches found
No related tags found
1 merge request!445Added GPU+FPGA Accelerated Application Example
......@@ -226,11 +226,7 @@ Number of devices: 2
which shows that both `Xilinx` platform and accelerator devices are present.
## Building Applications for Emulation
The two main approaches to building FPGA accelerated applications using Xilinx platform and HLS are **XRT** and **OpenCL**.
The XRT uses dedicated host interface and dialect of C for accelerated kernels,
while OpenCL allows to share most of the host interface with other accelerators.
## Building Applications
To simplify the build process we define two environment variables `IT4I_PLATFORM` and `IT4I_BUILD_MODE`.
The first `IT4I_PLATFORM` denotes specific accelerator hardware such as `Alveo u250` or `Alveo u280`
......@@ -278,13 +274,45 @@ The `IT4I_BUILD_MODE` is used to specify build type (`hw`, `hw_emu` and `sw_emu`
- `hw_emu` allows to run both synthesis and emulation for debugging
- `sw_emu` compiles kernels only for emulation (doesn't require accelerator and allows much faster build)
For example, to configure software emulation mode build for `Alveo u280` we set:
For example to configure build for `Alveo u280` we set:
```console
$ export IT4I_PLATFORM=xilinx_u280_gen3x16_xdma_1_202211_1
```
### Software Emulation Mode
The software emulation mode is preferable for development as HLS synthesis is very time consuming. To build following applications in this mode we set:
```console
$ export IT4I_BUILD_MODE=sw_emu
```
and run each application with `XCL_EMULATION_MODE` set to `sw_emu`:
```
$ XCL_EMULATION_MODE=sw_emu <application>
```
### Hardware Synthesis Mode
!!! note
The HLS of these simple applications **can take up to 2 hours** to finish.
To allow the application to utilize real hardware we have to synthetize FPGA design for the accelerator. This can be done by repeating same steps used to build kernels in emulation mode, but with `IT4I_BUILD_MODE` set to `hw` like so:
```console
$ export IT4I_BUILD_MODE=hw
```
the host application binary can be reused, but it has to be run without `XCL_EMULATION_MODE`:
```console
$ <application>
```
## Sample Applications
The first two samples illustrate two main approaches to building FPGA accelerated applications using Xilinx platform - **XRT** and **OpenCL**.
The final example combines **HIP** with **XRT** to show basics necessary to build application, which utilizes both GPU and FPGA accelerators.
### Using HLS and XRT
The applications are typically separated into host and accelerator/kernel side.
......@@ -417,7 +445,7 @@ $ v++ -l -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM vadd.xo -o vadd.xclbin
This process should result in `vadd.xclbin`, which can be loaded by host-side application.
### Running in Emulation Mode
### Running the Application
With both host application and kernel binary at hand the application (in emulation mode) can be launched as
......@@ -425,6 +453,12 @@ With both host application and kernel binary at hand the application (in emulati
$ XCL_EMULATION_MODE=sw_emu ./host vadd.xclbin
```
or with real hardware (having compiled kernels with `IT4I_BUILD_MODE=hw`)
```console
./host vadd.xclbin
```
## Using HLS and OpenCL
The host-side application code should be saved as `host.cpp`.
......@@ -617,7 +651,7 @@ $ v++ -p vadd.link.xclbin -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM -o vadd.
This process should result in `vadd.xclbin`, which can be loaded by host-side application.
### Running in Emulation Mode
### Running the Application
With both host application and kernel binary at hand the application (in emulation mode) can be launched as
......@@ -625,23 +659,210 @@ With both host application and kernel binary at hand the application (in emulati
$ XCL_EMULATION_MODE=sw_emu ./host vadd.xclbin
```
## Building Application for Real HW
or with real hardware (having compiled kernels with `IT4I_BUILD_MODE=hw`)
```console
./host vadd.xclbin
```
So far we have assumed software emulation (`sw_emu`), however the same steps can be used to build application for real hardware.
To do so we have to rebuild our kernel binaries in `hw` mode by setting
## Hybrid GPU and FPGA Application (HIP+XRT)
This simple 8-bit quantized dot product (`R = sum(X[i]*Y[i])`) example illustrates basic approach to utilize both GPU and FPGA accelerators in a single application.
The application takes the simplest approach, where both synchronization and data transfers are handled explicitly by the host.
The HIP toolchain is used to compile the single source host/GPU code as usual, but it is also linked with XRT runtime, which allows host to control the FPGA accelerator.
The FPGA kernels are built separately as in previous examples.
The host/GPU HIP code should be saved as `main.hip`
```c++
#include <iostream>
#include <vector>
#include "xrt/xrt_bo.h"
#include "experimental/xrt_xclbin.h"
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"
#include "hip/hip_runtime.h"
const size_t DATA_SIZE = 1024;
float compute_reference(const float *srcX, const float *srcY, size_t count);
__global__ void quantize(int8_t *out, const float *in, size_t count)
{
size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
for(size_t i = idx; i < count; i += blockDim.x * gridDim.x)
out[i] = int8_t(in[i] * 127);
}
__global__ void dequantize(float *out, const int16_t *in, size_t count)
{
size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
for(size_t i = idx; i < count; i += blockDim.x * gridDim.x)
out[i] = float(in[i] / float(127*127));
}
int main(int argc, char *argv[])
{
if(argc != 2)
{
std::cout << "Usage: " << argv[0] << " <XCLBIN File>" << std::endl;
return EXIT_FAILURE;
}
// Prepare experiment data
std::vector<float> srcX(DATA_SIZE);
std::vector<float> srcY(DATA_SIZE);
float outR = 0.0f;
for(size_t i = 0; i < DATA_SIZE; ++i)
{
srcX[i] = float(rand()) / float(RAND_MAX);
srcY[i] = float(rand()) / float(RAND_MAX);
outR += srcX[i] * srcY[i];
}
float outR_quant = compute_reference(srcX.data(), srcY.data(), DATA_SIZE);
std::cout << "REFERENCE: " << outR_quant << " (" << outR << ")" << std::endl;
// Initialize XRT (FPGA device), load kernels binary and create kernel object
xrt::device device(0);
std::cout << "Loading xclbin file " << argv[1] << std::endl;
xrt::uuid xclbinId = device.load_xclbin(argv[1]);
xrt::kernel mulKernel(device, xclbinId, "multiply", xrt::kernel::cu_access_mode::exclusive);
// Allocate GPU buffers
float *srcX_gpu, *srcY_gpu, *res_gpu;
int8_t *srcX_gpu_quant, *srcY_gpu_quant;
int16_t *res_gpu_quant;
hipMalloc(&srcX_gpu, DATA_SIZE * sizeof(float));
hipMalloc(&srcY_gpu, DATA_SIZE * sizeof(float));
hipMalloc(&res_gpu, DATA_SIZE * sizeof(float));
hipMalloc(&srcX_gpu_quant, DATA_SIZE * sizeof(int8_t));
hipMalloc(&srcY_gpu_quant, DATA_SIZE * sizeof(int8_t));
hipMalloc(&res_gpu_quant, DATA_SIZE * sizeof(int16_t));
// Allocate FPGA buffers
xrt::bo srcX_fpga_quant(device, DATA_SIZE * sizeof(int8_t), mulKernel.group_id(0));
xrt::bo srcY_fpga_quant(device, DATA_SIZE * sizeof(int8_t), mulKernel.group_id(1));
xrt::bo res_fpga_quant(device, DATA_SIZE * sizeof(int16_t), mulKernel.group_id(2));
// Copy experiment data from HOST to GPU
hipMemcpy(srcX_gpu, srcX.data(), DATA_SIZE * sizeof(float), hipMemcpyHostToDevice);
hipMemcpy(srcY_gpu, srcY.data(), DATA_SIZE * sizeof(float), hipMemcpyHostToDevice);
// Execute quantization kernels on both input vectors
quantize<<<16, 256>>>(srcX_gpu_quant, srcX_gpu, DATA_SIZE);
quantize<<<16, 256>>>(srcY_gpu_quant, srcY_gpu, DATA_SIZE);
// Map FPGA buffers into HOST memory, copy data from GPU to these mapped buffers and synchronize them into FPGA memory
hipMemcpy(srcX_fpga_quant.map<int8_t *>(), srcX_gpu_quant, DATA_SIZE * sizeof(int8_t), hipMemcpyDeviceToHost);
srcX_fpga_quant.sync(XCL_BO_SYNC_BO_TO_DEVICE);
hipMemcpy(srcY_fpga_quant.map<int8_t *>(), srcY_gpu_quant, DATA_SIZE * sizeof(int8_t), hipMemcpyDeviceToHost);
srcY_fpga_quant.sync(XCL_BO_SYNC_BO_TO_DEVICE);
// Execute FPGA kernel (8-bit integer multiplication)
auto kernelRun = mulKernel(res_fpga_quant, srcX_fpga_quant, srcY_fpga_quant, DATA_SIZE);
kernelRun.wait();
// Synchronize output FPGA buffer back to HOST and copy its contents to GPU buffer for dequantization
res_fpga_quant.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
hipMemcpy(res_gpu_quant, res_fpga_quant.map<int16_t *>(), DATA_SIZE * sizeof(int16_t), hipMemcpyDeviceToHost);
// Dequantize multiplication result on GPU
dequantize<<<16, 256>>>(res_gpu, res_gpu_quant, DATA_SIZE);
// Copy dequantized results from GPU to HOST
std::vector<float> res(DATA_SIZE);
hipMemcpy(res.data(), res_gpu, DATA_SIZE * sizeof(float), hipMemcpyDeviceToHost);
// Perform simple sum on CPU
float out = 0.0;
for(size_t i = 0; i < DATA_SIZE; ++i)
out += res[i];
std::cout << "RESULT: " << out << std::endl;
hipFree(srcX_gpu);
hipFree(srcY_gpu);
hipFree(res_gpu);
hipFree(srcX_gpu_quant);
hipFree(srcY_gpu_quant);
hipFree(res_gpu_quant);
return 0;
}
float compute_reference(const float *srcX, const float *srcY, size_t count)
{
float out = 0.0f;
for(size_t i = 0; i < count; ++i)
{
int16_t quantX(srcX[i] * 127);
int16_t quantY(srcY[i] * 127);
out += float(int16_t(quantX * quantY) / float(127*127));
}
return out;
}
```
The host/GPU application can be built using HIPCC as:
```console
$ export IT4I_BUILD_MODE=hw
$ hipcc -I$XILINX_XRT/include -I$XILINX_VIVADO/include -L$XILINX_XRT/lib -lxrt_coreutil main.hip -o host
```
The accelerator side (simple vector-multiply kernel) should be saved as `kernels.cpp`.
```c++
extern "C" {
void multiply(
short *out,
const char *inX,
const char *inY,
int size)
{
#pragma HLS INTERFACE m_axi port=inX bundle=aximm1
#pragma HLS INTERFACE m_axi port=inY bundle=aximm2
#pragma HLS INTERFACE m_axi port=out bundle=aximm1
for(int i = 0; i < size; ++i)
out[i] = short(inX[i]) * short(inY[i]);
}
}
```
and the application can be run without setting `XCL_EMULATION_MODE`:
Once again the HLS kernel is build using Vitis `v++` in two steps:
```console
$ ./host vadd.xclbin
v++ -c -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM -k multiply kernels.cpp -o kernels.xo
v++ -l -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM kernels.xo -o kernels.xclbin
```
!!! note
The HLS of these simple applications **can take up to 2 hours** to finish.
### Running the Application
In emulation mode (FPGA emulation, GPU HW is required) the application can be launched as:
```console
$ XCL_EMULATION_MODE=sw_emu ./host kernels.xclbin
REFERENCE: 256.554 (260.714)
Loading xclbin file ./kernels.xclbin
RESULT: 256.554
```
or, having compiled kernels with `IT4I_BUILD_MODE=hw` set, using real hardware (both FPGA and GPU HW is required)
```console
$ ./host kernels.xclbin
REFERENCE: 256.554 (260.714)
Loading xclbin file ./kernels.xclbin
RESULT: 256.554
```
## Additional Resources
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment