# Heterogeneous Memory Management on Intel Platforms
Partition `p10-intel` offers heterogeneous memory directly exposed to the user. This makes it possible to manually pick the appropriate kind of memory at process or even single-allocation granularity. Both kinds of memory are exposed as memory-only NUMA nodes, which allows both coarse-grained (process-level) and fine-grained (allocation-level) control over the memory type used.
## Overview
At the process level, the `numactl` facilities can be utilized, while the Intel-provided `memkind` library allows for finer control. Both the `memkind` library and `numactl` are available after loading the `memkind` module; the `OpenMPI` module provides `numactl` only.
```bash
ml memkind
```
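If you only need `numactl` (and not the `memkind` headers), the `OpenMPI` module mentioned above provides it as well; a quick sanity check might look like:
```bash
ml OpenMPI
numactl --hardware | head   # print the first lines of the NUMA topology
```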
### Process Level (NUMACTL)
The `numactl` tool allows you to either restrict the memory pool of the process to a specific set of memory NUMA nodes
```bash
numactl --membind <node_ids_set>
```
or select a single preferred node
```bash
numactl --preferred <node_id>
```
where `<node_ids_set>` is a comma-separated list (e.g., `0,2,5,...`), optionally combined with ranges (such as `0-5`). The `membind` option kills the process if it requests more memory than can be satisfied from the specified nodes. The `preferred` option merely falls back to other nodes (according to their NUMA distance) in the same situation.
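For illustration, a minimal sketch of both forms (`./app` is a placeholder for your application):
```bash
numactl --membind 0,2,4-7 ./app   # the process is killed if these nodes cannot satisfy an allocation
numactl --preferred 8 ./app       # prefer node 8, fall back to other nodes when it is full
```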
A convenient way to check the `numactl` configuration is
```bash
numactl -s
```
which prints the configuration of its execution environment, e.g.,
```bash
numactl --membind 8-15 numactl -s
policy: bind
preferred node: 0
physcpubind: 0 1 2 ... 189 190 191
cpubind: 0 1 2 3 4 5 6 7
nodebind: 0 1 2 3 4 5 6 7
membind: 8 9 10 11 12 13 14 15
```
The last row shows that memory allocations are restricted to NUMA nodes `8-15`.
### Allocation Level (MEMKIND)
The `memkind` library (in its simplest use case) offers a variant of the `malloc/free` function pair, which allows you to specify the kind of memory to be used for a given allocation. Moving a specific allocation from the default to the HBM memory pool can then be achieved by replacing:
```cpp
void *pData = malloc(<SIZE>);
/* ... */
free(pData);
```
with
```cpp
#include <memkind.h>
void *pData = memkind_malloc(MEMKIND_HBW, <SIZE>);
/* ... */
memkind_free(NULL, pData); // "kind" parameter is deduced from the address
```
Other memory kinds can be selected similarly.
!!! note
    The allocation returns a `NULL` pointer when memory of the specified kind is not available.
## High Bandwidth Memory (HBM)
Intel Sapphire Rapids (partition `p10-intel`) consists of two sockets, each with `128GB` of DDR DRAM and `64GB` of on-package HBM memory. The machine is configured in FLAT mode and therefore exposes the HBM as memory-only NUMA nodes (`16GB` per 12-core tile). The configuration can be verified by running
```bash
numactl -H
```
which should show 16 NUMA nodes (nodes `0-7` should each contain 12 cores and `32GB` of DDR DRAM, while nodes `8-15` should have no cores and `16GB` of HBM each).
![](../../img/cs/guides/p10_numa_sc4_flat.png)
### Process Level
With this, we can easily restrict an application to DDR DRAM or HBM memory:
```bash
# Only DDR DRAM
numactl --membind 0-7 ./stream
# ...
Function    Best Rate MB/s  Avg time   Min time   Max time
Copy:            369745.8   0.043355   0.043273   0.043588
Scale:           366989.8   0.043869   0.043598   0.045355
Add:             378054.0   0.063652   0.063483   0.063899
Triad:           377852.5   0.063621   0.063517   0.063884

# Only HBM
numactl --membind 8-15 ./stream
# ...
Function    Best Rate MB/s  Avg time   Min time   Max time
Copy:           1128430.1   0.015214   0.014179   0.015615
Scale:          1045065.2   0.015814   0.015310   0.016309
Add:            1096992.2   0.022619   0.021878   0.024182
Triad:          1065152.4   0.023449   0.022532   0.024559
```
The DDR DRAM achieves a bandwidth of around 400 GB/s, while the HBM clears the 1 TB/s bar.
Further improvements can be achieved by isolating a process entirely to a single tile. This can be useful for MPI jobs, where `$OMPI_COMM_WORLD_RANK` can be used to bind each process individually. A simple wrapper script (e.g., `membind_wrapper.sh`) to do this may look like
```bash
#!/bin/bash
numactl --membind $((8 + OMPI_COMM_WORLD_RANK)) "$@"
```
and can be used as
```bash
mpirun -np 8 --map-by slot:pe=12 membind_wrapper.sh ./stream_mpi
```
(8 tiles with 12 cores each). However, this approach assumes that the `16GB` of HBM memory local to the tile is sufficient for each process (memory cannot spill between tiles). The approach may be significantly more useful in combination with `--preferred` instead of `--membind`, which forces a preference for local HBM while allowing spill to DDR DRAM. Otherwise,
```bash
mpirun -n 8 --map-by slot:pe=12 numactl --membind 8-15 ./stream_mpi
```
is most likely preferable even for MPI workloads. Applying the above approach to MPI STREAM with 8 ranks and 1-24 threads per rank, we can expect the following results:
![](../../img/cs/guides/p10_stream_dram.png)
![](../../img/cs/guides/p10_stream_hbm.png)
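For completeness, a sketch of the `--preferred` variant of the wrapper script discussed above (same assumptions as for `membind_wrapper.sh`):
```bash
#!/bin/bash
# Prefer the HBM node local to this rank's tile, but allow spilling into DDR DRAM
numactl --preferred $((8 + OMPI_COMM_WORLD_RANK)) "$@"
```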
### Allocation Level
Allocation-level memory kind selection using the `memkind` library can be illustrated with a modified STREAM benchmark. The benchmark uses three working arrays (A, B, and C), whose allocation can be changed to `memkind_malloc` as follows
```cpp
#include <memkind.h>
// ...
STREAM_TYPE *a = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
STREAM_TYPE *b = (STREAM_TYPE *)memkind_malloc(MEMKIND_REGULAR, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
STREAM_TYPE *c = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
// ...
memkind_free(NULL, a);
memkind_free(NULL, b);
memkind_free(NULL, c);
```
Arrays A and C are allocated from HBM (`MEMKIND_HBW_ALL`), while DDR DRAM (`MEMKIND_REGULAR`) is used for B.
The code then has to be linked with the `memkind` library
```bash
gcc -march=native -O3 -fopenmp -lmemkind memkind_stream.c -o memkind_stream
```
and can be run as
```bash
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
# N is the number of 12-core tiles to use (1-8)
OMP_NUM_THREADS=$((N*12)) OMP_PROC_BIND=spread ./memkind_stream
```
While the `memkind` library should be able to detect HBM memory on its own (through `HMAT` and `hwloc`), this is not supported on `p10-intel`. This means that the NUMA nodes representing HBM have to be specified manually using the `MEMKIND_HBW_NODES` environment variable.
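One possible (untested) way to derive the variable automatically, assuming `numactl -H` lists memory-only nodes with an empty `cpus:` field:
```bash
export MEMKIND_HBW_NODES=$(numactl -H | awk '/cpus:/ && NF == 3 { printf "%s%s", s, $2; s = "," }')
echo $MEMKIND_HBW_NODES   # expected: 8,9,10,11,12,13,14,15
```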
![](../../img/cs/guides/p10_stream_memkind.png)
With this setup, we can see that the simple copy operation (`C[i] = A[i]`) achieves bandwidth comparable to the application bound entirely to HBM memory. On the other hand, the scale operation (`B[i] = s*C[i]`) is mostly limited by DDR DRAM bandwidth. It's also worth noting that operations combining all three arrays perform close to the HBM-only configuration.
## Simple Application
One application that can greatly benefit from the availability of a large, slower memory and a smaller, faster memory is computing a histogram with many bins over a large dataset.
```cpp
#include <iostream>
#include <vector>
#include <chrono>
#include <cmath>
#include <cstring>
#include <omp.h>
#include <memkind.h>
const size_t N_DATA_SIZE = 2 * 1024 * 1024 * 1024ull;
const size_t N_BINS_COUNT = 1 * 1024 * 1024ull;
const size_t N_ITERS = 10;
#if defined(HBM)
#define DATA_MEMKIND MEMKIND_REGULAR
#define BINS_MEMKIND MEMKIND_HBW_ALL
#else
#define DATA_MEMKIND MEMKIND_REGULAR
#define BINS_MEMKIND MEMKIND_REGULAR
#endif
int main(int argc, char *argv[])
{
    const double binWidth = 1.0 / double(N_BINS_COUNT + 1);

    double *pData = (double *)memkind_malloc(DATA_MEMKIND, N_DATA_SIZE * sizeof(double));
    size_t *pBins = (size_t *)memkind_malloc(BINS_MEMKIND, N_BINS_COUNT * omp_get_max_threads() * sizeof(size_t));

    #pragma omp parallel
    {
        drand48_data state;
        srand48_r(omp_get_thread_num(), &state);

        #pragma omp for
        for(size_t i = 0; i < N_DATA_SIZE; ++i)
            drand48_r(&state, &pData[i]);
    }

    auto c1 = std::chrono::steady_clock::now();

    for(size_t it = 0; it < N_ITERS; ++it)
    {
        #pragma omp parallel
        {
            // Each thread accumulates into its own private copy of the bins.
            for(size_t i = 0; i < N_BINS_COUNT; ++i)
                pBins[omp_get_thread_num()*N_BINS_COUNT + i] = size_t(0);

            #pragma omp for
            for(size_t i = 0; i < N_DATA_SIZE; ++i)
            {
                const size_t idx = size_t(pData[i] / binWidth) % N_BINS_COUNT;
                pBins[omp_get_thread_num()*N_BINS_COUNT + idx]++;
            }
        }
    }

    auto c2 = std::chrono::steady_clock::now();

    // Reduce the per-thread bins into the first copy.
    #pragma omp parallel for
    for(size_t i = 0; i < N_BINS_COUNT; ++i)
    {
        for(int j = 1; j < omp_get_max_threads(); ++j)
            pBins[i] += pBins[j*N_BINS_COUNT + i];
    }

    std::cout << "Elapsed Time [s]: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;

    size_t total = 0;
    #pragma omp parallel for reduction(+:total)
    for(size_t i = 0; i < N_BINS_COUNT; ++i)
        total += pBins[i];

    std::cout << "Total Items: " << total << std::endl;

    memkind_free(NULL, pData);
    memkind_free(NULL, pBins);

    return 0;
}
```
### Using HBM Memory (P10-Intel)
The following commands can be used to compile and run the example application above:
```bash
ml GCC memkind
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
g++ -O3 -fopenmp -lmemkind histogram.cpp -o histogram_dram
g++ -O3 -fopenmp -lmemkind -DHBM histogram.cpp -o histogram_hbm
OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_dram
OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_hbm
```
Moving the histogram bins into HBM memory should speed up the algorithm by more than a factor of two. Note that moving the `pData` array into HBM memory as well worsens this result (presumably because the algorithm can otherwise saturate both memory interfaces).
## Additional Resources
- [https://linux.die.net/man/8/numactl][1]
- [http://memkind.github.io/memkind/man_pages/memkind.html][2]
- [https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory][3]
[1]: https://linux.die.net/man/8/numactl
[2]: http://memkind.github.io/memkind/man_pages/memkind.html
[3]: https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory
# Using VMware Horizon
VMware Horizon is a virtual desktop infrastructure (VDI) solution
that enables users to access virtual desktops and applications from any device and any location.
It provides a comprehensive end-to-end solution for managing and delivering virtual desktops and applications,
including features such as session management, user authentication, and virtual desktop provisioning.
![](../../img/horizon.png)
## How to Access VMware Horizon
!!! important
    Access to VMware Horizon requires IT4I VPN.
1. Contact [IT4I support][a] with a request for access and VM allocation.
1. [Download][1] and install the VMware Horizon Client for Windows.
1. Add a new server `https://vdi-cs01.msad.it4i.cz/` in the Horizon client.
1. Connect to the server using your IT4I username and password.
Username is in the `domain\username` format and the domain is `msad.it4i.cz`.
For example: `msad.it4i.cz\user123`
## Example
Below is an example of how to mount a remote folder and check the connection on Windows OS:
### Prerequisites
3D applications
* [Blender][3]
SSHFS for remote access
* [sshfs-win][4]
* [winfsp][5]
* [sshfs-win-manager][6]
* ssh keys for access to clusters
### Steps
1. Start the VPN and connect to the server via VMware Horizon Client.
![](../../img/vmware.png)
1. Mount a remote folder.
* Run sshfs-win-manager.
![](../../img/sshfs.png)
* Add a new connection.
![](../../img/sshfs1.png)
* Click on **Connect**.
![](../../img/sshfs2.png)
1. Check that the folder is mounted.
![](../../img/mount.png)
1. Check the GPU resources.
![](../../img/gpu.png)
### Blender
Now if you run, for example, Blender, you can check the available GPU resources in Blender Preferences.
![](../../img/blender.png)
[a]: mailto:support@it4i.cz
[1]: https://vdi-cs01.msad.it4i.cz/
[2]: https://www.paraview.org/download/
[3]: https://www.blender.org/download/
[4]: https://github.com/winfsp/sshfs-win/releases
[5]: https://github.com/winfsp/winfsp/releases/
[6]: https://github.com/evsar3/sshfs-win-manager/releases
# Using IBM Power Partition
For testing your application on the IBM Power partition,
you need to prepare a job script for that partition or use the interactive job:
```console
salloc -N 1 -c 192 -A PROJECT-ID -p p07-power --time=08:00:00
```
where:
- `-N 1` allocates a single node,
- `-c 192` allocates 192 cores (threads),
- `-p p07-power` selects the IBM Power partition,
- `--time=08:00:00` requests the allocation for 8 hours.
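Alternatively, a minimal batch job script for this partition might look like the following sketch (`your_application` is a placeholder):
```
#!/bin/bash
#SBATCH -A PROJECT-ID
#SBATCH -p p07-power
#SBATCH -N 1
#SBATCH -c 192
#SBATCH --time=08:00:00

ml architecture/ppc64le
./your_application
```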
On the partition, you should reload the list of modules:
```
ml architecture/ppc64le
```
The platform offers both GNU-based and proprietary IBM toolchains for building applications. IBM also provides an optimized BLAS routines library ([ESSL](https://www.ibm.com/docs/en/essl/6.1)), which can be used with both toolchains.
## Building Applications
Our sample application depends on `BLAS`, therefore we start by loading the following modules (regardless of which toolchain we want to use):
```
ml GCC OpenBLAS
```
### GCC Toolchain
In the case of the GCC toolchain, we can compile the application using either `g++`
```
g++ -lopenblas hello.cpp -o hello
```
or `gfortran`
```
gfortran -lopenblas hello.f90 -o hello
```
as usual.
### IBM Toolchain
The IBM toolchain requires additional environment setup, as it is installed in `/opt/ibm` and is not exposed as a module:
```
IBM_ROOT=/opt/ibm
OPENXLC_ROOT=$IBM_ROOT/openxlC/17.1.1
OPENXLF_ROOT=$IBM_ROOT/openxlf/17.1.1
export PATH=$OPENXLC_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$OPENXLC_ROOT/lib:$LD_LIBRARY_PATH
export PATH=$OPENXLF_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$OPENXLF_ROOT/lib:$LD_LIBRARY_PATH
```
From there, we can use either `ibm-clang++`
```
ibm-clang++ -lopenblas hello.cpp -o hello
```
or `xlf`
```
xlf -lopenblas hello.f90 -o hello
```
to build the application as usual.
!!! note
    The combination of `xlf` and `openblas` seems to cause severe performance degradation. Therefore, the `ESSL` library should be preferred (see below).
### Using ESSL Library
The [ESSL](https://www.ibm.com/docs/en/essl/6.1) library is installed in `/opt/ibm/math/essl/7.1`, so we define additional environment variables:
```
IBM_ROOT=/opt/ibm
ESSL_ROOT=${IBM_ROOT}/math/essl/7.1
export LD_LIBRARY_PATH=$ESSL_ROOT/lib64:$LD_LIBRARY_PATH
```
The simplest way to utilize `ESSL` in an application that already uses `BLAS` or `CBLAS` routines is to link with the provided `libessl.so`. This can be done by replacing `-lopenblas` with `-lessl`, or with `-lessl -lopenblas` in case `ESSL` does not provide all required `BLAS` routines.
In practice this can look like
```
g++ -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.cpp -o hello
```
or
```
gfortran -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.f90 -o hello
```
and similarly for the IBM compilers (`ibm-clang++` and `xlf`); a sketch is shown below.
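For example, hedged equivalents for the IBM compilers might look like:
```
ibm-clang++ -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.cpp -o hello
xlf -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.f90 -o hello
```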
## Hello World Applications
The `hello world` example application (written in `C++` and `Fortran`) uses a simple stationary probability vector estimation to illustrate the use of GEMM (a BLAS level 3 routine).
Stationary probability vector estimation in `C++`:
```c++
#include <algorithm>
#include <iostream>
#include <vector>
#include <chrono>
#include "cblas.h"

const size_t ITERATIONS  = 32;
const size_t MATRIX_SIZE = 1024;

int main(int argc, char *argv[])
{
    const size_t matrixElements = MATRIX_SIZE*MATRIX_SIZE;

    std::vector<float> a(matrixElements, 1.0f / float(MATRIX_SIZE));

    for(size_t i = 0; i < MATRIX_SIZE; ++i)
        a[i] = 0.5f / (float(MATRIX_SIZE) - 1.0f);
    a[0] = 0.5f;

    std::vector<float> w1(matrixElements, 0.0f);
    std::vector<float> w2(matrixElements, 0.0f);

    std::copy(a.begin(), a.end(), w1.begin());

    std::vector<float> *t1, *t2;
    t1 = &w1;
    t2 = &w2;

    auto c1 = std::chrono::steady_clock::now();

    for(size_t i = 0; i < ITERATIONS; ++i)
    {
        std::fill(t2->begin(), t2->end(), 0.0f);

        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, MATRIX_SIZE, MATRIX_SIZE, MATRIX_SIZE,
                    1.0f, t1->data(), MATRIX_SIZE,
                    a.data(), MATRIX_SIZE,
                    1.0f, t2->data(), MATRIX_SIZE);

        std::swap(t1, t2);
    }

    auto c2 = std::chrono::steady_clock::now();

    for(size_t i = 0; i < MATRIX_SIZE; ++i)
    {
        std::cout << (*t1)[i*MATRIX_SIZE + i] << " ";
    }
    std::cout << std::endl;

    std::cout << "Elapsed Time: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;

    return 0;
}
```
Stationary probability vector estimation in `Fortran`:
```fortran
program main
    implicit none

    integer :: matrix_size, iterations
    integer :: i
    real, allocatable, target :: a(:,:), w1(:,:), w2(:,:)
    real, dimension(:,:), contiguous, pointer :: t1, t2, tmp
    real, pointer :: out_data(:), out_diag(:)
    integer :: cr, cm, c1, c2

    iterations  = 32
    matrix_size = 1024

    call system_clock(count_rate=cr)
    call system_clock(count_max=cm)

    allocate(a(matrix_size, matrix_size))
    allocate(w1(matrix_size, matrix_size))
    allocate(w2(matrix_size, matrix_size))

    a(:,:) = 1.0 / real(matrix_size)
    a(:,1) = 0.5 / real(matrix_size - 1)
    a(1,1) = 0.5

    w1 = a
    w2(:,:) = 0.0

    t1 => w1
    t2 => w2

    call system_clock(c1)

    do i = 0, iterations
        t2(:,:) = 0.0
        call sgemm('N', 'N', matrix_size, matrix_size, matrix_size, 1.0, t1, matrix_size, a, matrix_size, 1.0, t2, matrix_size)
        tmp => t1
        t1  => t2
        t2  => tmp
    end do

    call system_clock(c2)

    out_data(1:size(t1)) => t1
    out_diag => out_data(1::matrix_size+1)

    print *, out_diag
    print *, "Elapsed Time: ", (c2 - c1) / real(cr)

    deallocate(a)
    deallocate(w1)
    deallocate(w2)
end program main
```
# Using Xilinx Accelerator Platform
The first step to use Xilinx accelerators is to initialize Vitis (compiler) and XRT (runtime) environments.
```console
$ . /tools/Xilinx/Vitis/2023.1/settings64.sh
$ . /opt/xilinx/xrt/setup.sh
```
## Platform Level Accelerator Management
This should allow you to examine the current platform using `xbutil examine`,
which outputs user-level information about the XRT platform and lists the available devices:
```
$ xbutil examine
System Configuration
OS Name : Linux
Release : 4.18.0-477.27.1.el8_8.x86_64
Version : #1 SMP Thu Aug 31 10:29:22 EDT 2023
Machine : x86_64
CPU Cores : 64
Memory : 257145 MB
Distribution : Red Hat Enterprise Linux 8.8 (Ootpa)
GLIBC : 2.28
Model : ProLiant XL675d Gen10 Plus
XRT
Version : 2.16.0
Branch : master
Hash : f2524a2fcbbabd969db19abf4d835c24379e390d
Hash Date : 2023-10-11 14:01:19
XOCL : 2.16.0, f2524a2fcbbabd969db19abf4d835c24379e390d
XCLMGMT : 2.16.0, f2524a2fcbbabd969db19abf4d835c24379e390d
Devices present
BDF : Shell Logic UUID Device ID Device Ready*
-------------------------------------------------------------------------------------------------------------------------
[0000:88:00.1] : xilinx_u280_gen3x16_xdma_base_1 283BAB8F-654D-8674-968F-4DA57F7FA5D7 user(inst=132) Yes
[0000:8c:00.1] : xilinx_u280_gen3x16_xdma_base_1 283BAB8F-654D-8674-968F-4DA57F7FA5D7 user(inst=133) Yes
* Devices that are not ready will have reduced functionality when using XRT tools
```
Here, two Xilinx Alveo u280 accelerators (`0000:88:00.1` and `0000:8c:00.1`) are available.
The `xbutil` tool can also be used to query additional information about a specific device using its BDF address:
```console
$ xbutil examine -d "0000:88:00.1"
-------------------------------------------------
[0000:88:00.1] : xilinx_u280_gen3x16_xdma_base_1
-------------------------------------------------
Platform
XSA Name : xilinx_u280_gen3x16_xdma_base_1
Logic UUID : 283BAB8F-654D-8674-968F-4DA57F7FA5D7
FPGA Name :
JTAG ID Code : 0x14b7d093
DDR Size : 0 Bytes
DDR Count : 0
Mig Calibrated : true
P2P Status : disabled
Performance Mode : not supported
P2P IO space required : 64 GB
Clocks
DATA_CLK (Data) : 300 MHz
KERNEL_CLK (Kernel) : 500 MHz
hbm_aclk (System) : 450 MHz
Mac Addresses : 00:0A:35:0E:20:B0
: 00:0A:35:0E:20:B1
Device Status: HEALTHY
Hardware Context ID: 0
Xclbin UUID: 6306D6AE-1D66-AEA7-B15D-446D4ECC53BD
PL Compute Units
Index Name Base Address Usage Status
-------------------------------------------------
0 vadd:vadd_1 0x800000 1 (IDLE)
```
Basic functionality of the device can be checked using `xbutil validate -d <BDF>` as
```console
$ xbutil validate -d "0000:88:00.1"
Validate Device : [0000:88:00.1]
Platform : xilinx_u280_gen3x16_xdma_base_1
SC Version : 4.3.27
Platform ID : 283BAB8F-654D-8674-968F-4DA57F7FA5D7
-------------------------------------------------------------------------------
Test 1 [0000:88:00.1] : aux-connection
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 2 [0000:88:00.1] : pcie-link
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 3 [0000:88:00.1] : sc-version
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 4 [0000:88:00.1] : verify
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 5 [0000:88:00.1] : dma
Details : Buffer size - '16 MB' Memory Tag - 'HBM[0]'
Host -> PCIe -> FPGA write bandwidth = 11988.9 MB/s
Host <- PCIe <- FPGA read bandwidth = 12571.2 MB/s
...
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 6 [0000:88:00.1] : iops
Details : IOPS: 387240(verify)
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 7 [0000:88:00.1] : mem-bw
Details : Throughput (Type: DDR) (Bank count: 2) : 33932.9MB/s
Throughput of Memory Tag: DDR[0] is 16974.1MB/s
Throughput of Memory Tag: DDR[1] is 16974.2MB/s
Throughput (Type: HBM) (Bank count: 1) : 12383.7MB/s
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 8 [0000:88:00.1] : p2p
Test 9 [0000:88:00.1] : vcu
Test 10 [0000:88:00.1] : aie
Test 11 [0000:88:00.1] : ps-aie
Test 12 [0000:88:00.1] : ps-pl-verify
Test 13 [0000:88:00.1] : ps-verify
Test 14 [0000:88:00.1] : ps-iops
```
Finally, the device can be reinitialized using `xbutil reset -d <BDF>` as
```console
$ xbutil reset -d "0000:88:00.1"
Performing 'HOT Reset' on '0000:88:00.1'
Are you sure you wish to proceed? [Y/n]: Y
Successfully reset Device[0000:88:00.1]
```
This can be useful to recover the device from states such as `HANGING`, reported by `xbutil examine -d <BDF>`.
## OpenCL Platform Level
The `clinfo` utility can be used to verify that the accelerator is visible to OpenCL
```console
$ clinfo
Number of platforms: 2
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3590.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback
Platform Profile: EMBEDDED_PROFILE
Platform Version: OpenCL 1.0
Platform Name: Xilinx
Platform Vendor: Xilinx
Platform Extensions: cl_khr_icd
<...>
Platform Name: Xilinx
Number of devices: 2
Device Type: CL_DEVICE_TYPE_ACCELERATOR
Vendor ID: 0h
Max compute units: 0
Max work items dimensions: 3
Max work items[0]: 4294967295
Max work items[1]: 4294967295
Max work items[2]: 4294967295
Max work group size: 4294967295
Preferred vector width char: 1
Preferred vector width short: 1
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 0
Max clock frequency: 0Mhz
Address bits: 64
Max memory allocation: 4294967296
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 0
Max size of kernel argument: 2048
Alignment (bits) of base address: 32768
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: No
Cache type: None
Cache line size: 64
Cache size: 0
Global memory size: 0
Constant buffer size: 4194304
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 16384
Error correction support: 1
Profiling timer resolution: 1
Device endianess: Little
Available: No
Compiler available: No
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: Yes
Profiling: Yes
Platform ID: 0x16fbae8
Name: xilinx_u280_gen3x16_xdma_base_1
Vendor: Xilinx
Driver version: 1.0
Profile: EMBEDDED_PROFILE
Version: OpenCL 1.0
<...>
```
which shows that both the `Xilinx` platform and the accelerator devices are present.
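A shorter check, which only lists the detected platforms and device names, is:
```console
$ clinfo -l
```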
## Building Applications
To simplify the build process, we define two environment variables, `IT4I_PLATFORM` and `IT4I_BUILD_MODE`.
The first, `IT4I_PLATFORM`, denotes the specific accelerator hardware, such as `Alveo u250` or `Alveo u280`,
and its configuration stored in `*.xpfm` files.
The list of available platforms can be obtained using the `platforminfo` utility:
```console
$ platforminfo -l
{
"platforms": [
{
"baseName": "xilinx_u280_gen3x16_xdma_1_202211_1",
"version": "202211.1",
"type": "sdaccel",
"dataCenter": "true",
"embedded": "false",
"externalHost": "true",
"serverManaged": "true",
"platformState": "impl",
"usesPR": "true",
"platformFile": "\/opt\/xilinx\/platforms\/xilinx_u280_gen3x16_xdma_1_202211_1\/xilinx_u280_gen3x16_xdma_1_202211_1.xpfm"
},
{
"baseName": "xilinx_u250_gen3x16_xdma_4_1_202210_1",
"version": "202210.1",
"type": "sdaccel",
"dataCenter": "true",
"embedded": "false",
"externalHost": "true",
"serverManaged": "true",
"platformState": "impl",
"usesPR": "true",
"platformFile": "\/opt\/xilinx\/platforms\/xilinx_u250_gen3x16_xdma_4_1_202210_1\/xilinx_u250_gen3x16_xdma_4_1_202210_1.xpfm"
}
]
}
```
Here, `baseName` and potentially `platformFile` are of interest and either can be specified as the value of `IT4I_PLATFORM`.
In this case, we have the platforms `xilinx_u280_gen3x16_xdma_1_202211_1` (Alveo u280) and `xilinx_u250_gen3x16_xdma_4_1_202210_1` (Alveo u250).
The `IT4I_BUILD_MODE` is used to specify build type (`hw`, `hw_emu` and `sw_emu`):
- `hw` performs full synthesis for the accelerator
- `hw_emu` allows running both synthesis and emulation for debugging
- `sw_emu` compiles kernels only for emulation (does not require the accelerator and allows a much faster build)
For example, to configure a build for the `Alveo u280`, we set:
```console
$ export IT4I_PLATFORM=xilinx_u280_gen3x16_xdma_1_202211_1
```
### Software Emulation Mode
The software emulation mode is preferable for development, as HLS synthesis is very time-consuming. To build the following applications in this mode, we set:
```console
$ export IT4I_BUILD_MODE=sw_emu
```
and run each application with `XCL_EMULATION_MODE` set to `sw_emu`:
```
$ XCL_EMULATION_MODE=sw_emu <application>
```
### Hardware Synthesis Mode
!!! note
    The HLS of these simple applications **can take up to 2 hours** to finish.
To allow the application to utilize real hardware, we have to synthesize the FPGA design for the accelerator. This can be done by repeating the same steps used to build the kernels in emulation mode, but with `IT4I_BUILD_MODE` set to `hw`:
```console
$ export IT4I_BUILD_MODE=hw
```
The host application binary can be reused, but it has to be run without `XCL_EMULATION_MODE`:
```console
$ <application>
```
## Sample Applications
The first two samples illustrate the two main approaches to building FPGA-accelerated applications on the Xilinx platform - **XRT** and **OpenCL**.
The final example combines **HIP** with **XRT** to show the basics necessary to build an application that utilizes both GPU and FPGA accelerators.
### Using HLS and XRT
The applications are typically separated into host and accelerator/kernel side.
The following host-side code should be saved as `host.cpp`
```c++
/*
# Copyright (C) 2023, Advanced Micro Devices, Inc. All rights reserved.
# SPDX-License-Identifier: X11
*/
#include <algorithm>
#include <iostream>
#include <cstring>
// XRT includes
#include "xrt/xrt_bo.h"
#include <experimental/xrt_xclbin.h>
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"
#define DATA_SIZE 4096
int main(int argc, char** argv)
{
    if(argc != 2)
    {
        std::cout << "Usage: " << argv[0] << " <XCLBIN File>" << std::endl;
        return EXIT_FAILURE;
    }

    // Read settings
    std::string binaryFile = argv[1];
    int device_index = 0;

    std::cout << "Open the device" << device_index << std::endl;
    auto device = xrt::device(device_index);
    std::cout << "Load the xclbin " << binaryFile << std::endl;
    auto uuid = device.load_xclbin(binaryFile);

    size_t vector_size_bytes = sizeof(int) * DATA_SIZE;

    //auto krnl = xrt::kernel(device, uuid, "vadd");
    auto krnl = xrt::kernel(device, uuid, "vadd", xrt::kernel::cu_access_mode::exclusive);

    std::cout << "Allocate Buffer in Global Memory\n";
    auto boIn1 = xrt::bo(device, vector_size_bytes, krnl.group_id(0)); // Match kernel arguments to RTL kernel
    auto boIn2 = xrt::bo(device, vector_size_bytes, krnl.group_id(1));
    auto boOut = xrt::bo(device, vector_size_bytes, krnl.group_id(2));

    // Map the contents of the buffer object into host memory
    auto bo0_map = boIn1.map<int*>();
    auto bo1_map = boIn2.map<int*>();
    auto bo2_map = boOut.map<int*>();
    std::fill(bo0_map, bo0_map + DATA_SIZE, 0);
    std::fill(bo1_map, bo1_map + DATA_SIZE, 0);
    std::fill(bo2_map, bo2_map + DATA_SIZE, 0);

    // Create the test data
    int bufReference[DATA_SIZE];
    for (int i = 0; i < DATA_SIZE; ++i)
    {
        bo0_map[i] = i;
        bo1_map[i] = i;
        bufReference[i] = bo0_map[i] + bo1_map[i]; // Generate check data for validation
    }

    // Synchronize buffer content with device side
    std::cout << "synchronize input buffer data to device global memory\n";
    boIn1.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    boIn2.sync(XCL_BO_SYNC_BO_TO_DEVICE);

    std::cout << "Execution of the kernel\n";
    auto run = krnl(boIn1, boIn2, boOut, DATA_SIZE); // DATA_SIZE=size
    run.wait();

    // Get the output
    std::cout << "Get the output data from the device" << std::endl;
    boOut.sync(XCL_BO_SYNC_BO_FROM_DEVICE);

    // Validate results
    if (std::memcmp(bo2_map, bufReference, vector_size_bytes))
        throw std::runtime_error("Value read back does not match reference");

    std::cout << "TEST PASSED\n";
    return 0;
}
```
The host-side code can now be compiled using GCC toolchain as:
```console
$ g++ host.cpp -I$XILINX_XRT/include -I$XILINX_VIVADO/include -L$XILINX_XRT/lib -lxrt_coreutil -o host
```
The accelerator side (simple vector-add kernel) should be saved as `vadd.cpp`.
```c++
/*
# Copyright (C) 2023, Advanced Micro Devices, Inc. All rights reserved.
# SPDX-License-Identifier: X11
*/
extern "C" {
void vadd(
const unsigned int *in1, // Read-Only Vector 1
const unsigned int *in2, // Read-Only Vector 2
unsigned int *out, // Output Result
int size // Size in integer
)
{
#pragma HLS INTERFACE m_axi port=in1 bundle=aximm1
#pragma HLS INTERFACE m_axi port=in2 bundle=aximm2
#pragma HLS INTERFACE m_axi port=out bundle=aximm1
for(int i = 0; i < size; ++i)
{
out[i] = in1[i] + in2[i];
}
}
}
```
The accelerator-side code is built using Vitis `v++`.
This is a two-step process, which either builds an emulation binary or performs full HLS (depending on the value of the `-t` argument).
The platform (specific accelerator) also has to be specified at this step (both for emulation and full HLS).
```console
$ v++ -c -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM -k vadd vadd.cpp -o vadd.xo
$ v++ -l -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM vadd.xo -o vadd.xclbin
```
This process should result in `vadd.xclbin`, which can be loaded by the host-side application.
### Running the Application
With both the host application and the kernel binary at hand, the application can be launched in emulation mode as
```console
$ XCL_EMULATION_MODE=sw_emu ./host vadd.xclbin
```
or with real hardware (having compiled kernels with `IT4I_BUILD_MODE=hw`)
```console
./host vadd.xclbin
```
### Using HLS and OpenCL
The host-side application code should be saved as `host.cpp`.
This application attempts to find the `Xilinx` OpenCL platform in the system and selects the first device in that platform.
The device is then configured with the provided kernel binary.
Other than that, the only difference from a typical OpenCL vector-add is the use of `enqueueTask(...)` to launch the kernel
(instead of the typical `enqueueNDRangeKernel`).
```c++
#include <iostream>
#include <fstream>
#include <iterator>
#include <vector>
#define CL_HPP_TARGET_OPENCL_VERSION 120
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY 1
#define CL_USE_DEPRECATED_OPENCL_1_2_APIS
#include <CL/cl2.hpp>
#include <CL/cl_ext_xilinx.h>
std::vector<unsigned char> read_binary_file(const std::string &filename)
{
    std::cout << "INFO: Reading " << filename << std::endl;

    std::ifstream file(filename, std::ios::binary);
    file.unsetf(std::ios::skipws);

    std::streampos file_size;
    file.seekg(0, std::ios::end);
    file_size = file.tellg();
    file.seekg(0, std::ios::beg);

    std::vector<unsigned char> data;
    data.reserve(file_size);
    data.insert(data.begin(),
                std::istream_iterator<unsigned char>(file),
                std::istream_iterator<unsigned char>());

    return data;
}

cl::Device select_device()
{
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    cl::Platform platform;
    for(cl::Platform &p: platforms)
    {
        const std::string name = p.getInfo<CL_PLATFORM_NAME>();
        std::cout << "PLATFORM: " << name << std::endl;

        if(name == "Xilinx")
        {
            platform = p;
            break;
        }
    }

    if(platform == cl::Platform())
    {
        std::cout << "Xilinx platform not found!" << std::endl;
        exit(EXIT_FAILURE);
    }

    std::vector<cl::Device> devices;
    platform.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);

    return devices[0];
}

static const int DATA_SIZE = 1024;

int main(int argc, char *argv[])
{
    if(argc != 2)
    {
        std::cout << "Usage: " << argv[0] << " <XCLBIN File>" << std::endl;
        return EXIT_FAILURE;
    }

    std::string binary_file = argv[1];

    std::vector<int> source_a(DATA_SIZE, 10);
    std::vector<int> source_b(DATA_SIZE, 32);

    auto program_binary = read_binary_file(binary_file);
    cl::Program::Binaries bins{{program_binary.data(), program_binary.size()}};

    cl::Device device = select_device();
    cl::Context context(device, nullptr, nullptr, nullptr);
    cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);
    cl::Program program(context, {device}, bins, nullptr);
    cl::Kernel vadd_kernel = cl::Kernel(program, "vector_add");

    cl::Buffer buffer_a(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, source_a.size() * sizeof(int), source_a.data());
    cl::Buffer buffer_b(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, source_b.size() * sizeof(int), source_b.data());
    cl::Buffer buffer_res(context, CL_MEM_READ_WRITE, source_a.size() * sizeof(int));

    int narg = 0;
    vadd_kernel.setArg(narg++, buffer_res);
    vadd_kernel.setArg(narg++, buffer_a);
    vadd_kernel.setArg(narg++, buffer_b);
    vadd_kernel.setArg(narg++, DATA_SIZE);

    q.enqueueTask(vadd_kernel);

    std::vector<int> result(DATA_SIZE, 0);
    q.enqueueReadBuffer(buffer_res, CL_TRUE, 0, result.size() * sizeof(int), result.data());

    int mismatch_count = 0;
    for(size_t i = 0; i < DATA_SIZE; ++i)
    {
        int host_result = source_a[i] + source_b[i];

        if(result[i] != host_result)
        {
            mismatch_count++;
            std::cout << "ERROR: " << result[i] << " != " << host_result << std::endl;
            break;
        }
    }

    std::cout << "RESULT: " << (mismatch_count == 0 ? "PASSED" : "FAILED") << std::endl;
    return 0;
}
```
The host-side code can now be compiled using GCC toolchain as:
```console
$ g++ host.cpp -I$XILINX_XRT/include -I$XILINX_VIVADO/include -lOpenCL -o host
```
The accelerator side (simple vector-add kernel) should be saved as `vadd.cl`.
```c++
#define BUFFER_SIZE 256
#define DATA_SIZE 1024
// TRIPCOUNT identifier
__constant uint c_len = DATA_SIZE / BUFFER_SIZE;
__constant uint c_size = BUFFER_SIZE;

__attribute__((reqd_work_group_size(1, 1, 1)))
__kernel void vector_add(__global int* c,
                         __global const int* a,
                         __global const int* b,
                         const int n_elements)
{
    int arrayA[BUFFER_SIZE];
    int arrayB[BUFFER_SIZE];

    __attribute__((xcl_loop_tripcount(c_len, c_len)))
    for (int i = 0; i < n_elements; i += BUFFER_SIZE)
    {
        int size = BUFFER_SIZE;

        if(i + size > n_elements)
            size = n_elements - i;

        __attribute__((xcl_loop_tripcount(c_size, c_size)))
        __attribute__((xcl_pipeline_loop(1))) readA:
        for(int j = 0; j < size; j++)
            arrayA[j] = a[i + j];

        __attribute__((xcl_loop_tripcount(c_size, c_size)))
        __attribute__((xcl_pipeline_loop(1))) readB:
        for(int j = 0; j < size; j++)
            arrayB[j] = b[i + j];

        __attribute__((xcl_loop_tripcount(c_size, c_size)))
        __attribute__((xcl_pipeline_loop(1))) vadd_writeC:
        for(int j = 0; j < size; j++)
            c[i + j] = arrayA[j] + arrayB[j];
    }
}
```
The accelerator-side code is built using Vitis `v++`.
This is a three-step process, which either builds an emulation binary or performs full HLS (depending on the value of the `-t` argument).
The platform (specific accelerator) also has to be specified at this step (both for emulation and full HLS).
```console
$ v++ -c -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM -k vector_add -o vadd.xo vadd.cl
$ v++ -l -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM -o vadd.link.xclbin vadd.xo
$ v++ -p vadd.link.xclbin -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM -o vadd.xclbin
```
This process should result in `vadd.xclbin`, which can be loaded by the host-side application.
### Running the Application
With both the host application and the kernel binary at hand, the application can be launched in emulation mode as
```console
$ XCL_EMULATION_MODE=sw_emu ./host vadd.xclbin
```
or with real hardware (having compiled kernels with `IT4I_BUILD_MODE=hw`)
```console
./host vadd.xclbin
```
### Hybrid GPU and FPGA Application (HIP+XRT)
This simple 8-bit quantized dot product (`R = sum(X[i]*Y[i])`) example illustrates a basic approach to utilizing both GPU and FPGA accelerators in a single application.
The application takes the simplest approach, where both synchronization and data transfers are handled explicitly by the host.
The HIP toolchain is used to compile the single-source host/GPU code as usual, but it is also linked with the XRT runtime, which allows the host to control the FPGA accelerator.
The FPGA kernels are built separately, as in the previous examples.
The host/GPU HIP code should be saved as `main.hip`
```c++
#include <iostream>
#include <vector>
#include "xrt/xrt_bo.h"
#include "experimental/xrt_xclbin.h"
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"
#include "hip/hip_runtime.h"
const size_t DATA_SIZE = 1024;
float compute_reference(const float *srcX, const float *srcY, size_t count);
__global__ void quantize(int8_t *out, const float *in, size_t count)
{
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;

    for(size_t i = idx; i < count; i += blockDim.x * gridDim.x)
        out[i] = int8_t(in[i] * 127);
}

__global__ void dequantize(float *out, const int16_t *in, size_t count)
{
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;

    for(size_t i = idx; i < count; i += blockDim.x * gridDim.x)
        out[i] = float(in[i] / float(127*127));
}

int main(int argc, char *argv[])
{
    if(argc != 2)
    {
        std::cout << "Usage: " << argv[0] << " <XCLBIN File>" << std::endl;
        return EXIT_FAILURE;
    }

    // Prepare experiment data
    std::vector<float> srcX(DATA_SIZE);
    std::vector<float> srcY(DATA_SIZE);
    float outR = 0.0f;

    for(size_t i = 0; i < DATA_SIZE; ++i)
    {
        srcX[i] = float(rand()) / float(RAND_MAX);
        srcY[i] = float(rand()) / float(RAND_MAX);
        outR += srcX[i] * srcY[i];
    }

    float outR_quant = compute_reference(srcX.data(), srcY.data(), DATA_SIZE);
    std::cout << "REFERENCE: " << outR_quant << " (" << outR << ")" << std::endl;

    // Initialize XRT (FPGA device), load kernels binary and create kernel object
    xrt::device device(0);
    std::cout << "Loading xclbin file " << argv[1] << std::endl;
    xrt::uuid xclbinId = device.load_xclbin(argv[1]);
    xrt::kernel mulKernel(device, xclbinId, "multiply", xrt::kernel::cu_access_mode::exclusive);

    // Allocate GPU buffers
    float *srcX_gpu, *srcY_gpu, *res_gpu;
    int8_t *srcX_gpu_quant, *srcY_gpu_quant;
    int16_t *res_gpu_quant;
    hipMalloc(&srcX_gpu, DATA_SIZE * sizeof(float));
    hipMalloc(&srcY_gpu, DATA_SIZE * sizeof(float));
    hipMalloc(&res_gpu, DATA_SIZE * sizeof(float));
    hipMalloc(&srcX_gpu_quant, DATA_SIZE * sizeof(int8_t));
    hipMalloc(&srcY_gpu_quant, DATA_SIZE * sizeof(int8_t));
    hipMalloc(&res_gpu_quant, DATA_SIZE * sizeof(int16_t));

    // Allocate FPGA buffers
    xrt::bo srcX_fpga_quant(device, DATA_SIZE * sizeof(int8_t), mulKernel.group_id(0));
    xrt::bo srcY_fpga_quant(device, DATA_SIZE * sizeof(int8_t), mulKernel.group_id(1));
    xrt::bo res_fpga_quant(device, DATA_SIZE * sizeof(int16_t), mulKernel.group_id(2));

    // Copy experiment data from HOST to GPU
    hipMemcpy(srcX_gpu, srcX.data(), DATA_SIZE * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(srcY_gpu, srcY.data(), DATA_SIZE * sizeof(float), hipMemcpyHostToDevice);

    // Execute quantization kernels on both input vectors
    quantize<<<16, 256>>>(srcX_gpu_quant, srcX_gpu, DATA_SIZE);
    quantize<<<16, 256>>>(srcY_gpu_quant, srcY_gpu, DATA_SIZE);

    // Map FPGA buffers into HOST memory, copy data from GPU to these mapped buffers and synchronize them into FPGA memory
    hipMemcpy(srcX_fpga_quant.map<int8_t *>(), srcX_gpu_quant, DATA_SIZE * sizeof(int8_t), hipMemcpyDeviceToHost);
    srcX_fpga_quant.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    hipMemcpy(srcY_fpga_quant.map<int8_t *>(), srcY_gpu_quant, DATA_SIZE * sizeof(int8_t), hipMemcpyDeviceToHost);
    srcY_fpga_quant.sync(XCL_BO_SYNC_BO_TO_DEVICE);

    // Execute FPGA kernel (8-bit integer multiplication)
    auto kernelRun = mulKernel(res_fpga_quant, srcX_fpga_quant, srcY_fpga_quant, DATA_SIZE);
    kernelRun.wait();

    // Synchronize output FPGA buffer back to HOST and copy its contents to the GPU buffer for dequantization
    res_fpga_quant.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
    hipMemcpy(res_gpu_quant, res_fpga_quant.map<int16_t *>(), DATA_SIZE * sizeof(int16_t), hipMemcpyHostToDevice);

    // Dequantize multiplication result on GPU
    dequantize<<<16, 256>>>(res_gpu, res_gpu_quant, DATA_SIZE);

    // Copy dequantized results from GPU to HOST
    std::vector<float> res(DATA_SIZE);
    hipMemcpy(res.data(), res_gpu, DATA_SIZE * sizeof(float), hipMemcpyDeviceToHost);

    // Perform simple sum on CPU
    float out = 0.0;
    for(size_t i = 0; i < DATA_SIZE; ++i)
        out += res[i];

    std::cout << "RESULT: " << out << std::endl;

    hipFree(srcX_gpu);
    hipFree(srcY_gpu);
    hipFree(res_gpu);
    hipFree(srcX_gpu_quant);
    hipFree(srcY_gpu_quant);
    hipFree(res_gpu_quant);

    return 0;
}

float compute_reference(const float *srcX, const float *srcY, size_t count)
{
    float out = 0.0f;

    for(size_t i = 0; i < count; ++i)
    {
        int16_t quantX(srcX[i] * 127);
        int16_t quantY(srcY[i] * 127);
        out += float(int16_t(quantX * quantY) / float(127*127));
    }

    return out;
}
```
The host/GPU application can be built using HIPCC as:
```console
$ hipcc -I$XILINX_XRT/include -I$XILINX_VIVADO/include -L$XILINX_XRT/lib -lxrt_coreutil main.hip -o host
```
The accelerator side (simple vector-multiply kernel) should be saved as `kernels.cpp`.
```c++
extern "C" {
void multiply(
short *out,
const char *inX,
const char *inY,
int size)
{
#pragma HLS INTERFACE m_axi port=inX bundle=aximm1
#pragma HLS INTERFACE m_axi port=inY bundle=aximm2
#pragma HLS INTERFACE m_axi port=out bundle=aximm1
for(int i = 0; i < size; ++i)
out[i] = short(inX[i]) * short(inY[i]);
}
}
```
Once again, the HLS kernel is built using Vitis `v++` in two steps:
```console
v++ -c -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM -k multiply kernels.cpp -o kernels.xo
v++ -l -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM kernels.xo -o kernels.xclbin
```
### Running the Application
In emulation mode (FPGA emulation, GPU HW is required) the application can be launched as:
```console
$ XCL_EMULATION_MODE=sw_emu ./host kernels.xclbin
REFERENCE: 256.554 (260.714)
Loading xclbin file ./kernels.xclbin
RESULT: 256.554
```
or, having compiled kernels with `IT4I_BUILD_MODE=hw` set, using real hardware (both FPGA and GPU HW is required)
```console
$ ./host kernels.xclbin
REFERENCE: 256.554 (260.714)
Loading xclbin file ./kernels.xclbin
RESULT: 256.554
```
## Additional Resources
- [https://xilinx.github.io/Vitis-Tutorials/][1]
- [http://xilinx.github.io/Vitis_Accel_Examples/][2]
[1]: https://xilinx.github.io/Vitis-Tutorials/
[2]: http://xilinx.github.io/Vitis_Accel_Examples/
# Complementary Systems
Complementary systems offer a development environment for users
who need to port and optimize their code and applications
for various hardware architectures and software technologies
that are not available on standard clusters.
## Complementary Systems 1
The first stage of the complementary systems implementation comprises these partitions:
- compute partition 0 – based on ARM technology - legacy
- compute partition 1 – based on ARM technology - A64FX
- compute partition 2 – based on Intel technologies - Ice Lake, NVDIMMs + Bitware FPGAs
- compute partition 3 – based on AMD technologies - Milan, MI100 GPUs + Xilinx FPGAs
- compute partition 4 – reflecting Edge type of servers
- partition 5 – FPGA synthesis server
![](../img/cs1_1.png)
## Complementary Systems 2
The second stage of the complementary systems implementation comprises these partitions:
- compute partition 6 - based on ARM technology + CUDA programmable GPGPU accelerators on the Ampere architecture + DPU network processing units
- compute partition 7 - based on IBM Power10 architecture
- compute partition 8 - modern CPU with a very high L3 cache capacity (over 750MB)
- compute partition 9 - virtual GPU accelerated workstations
- compute partition 10 - Sapphire Rapids-HBM server
- compute partition 11 - NVIDIA Grace CPU Superchip
![](../img/cs2_2.png)
## Modules and Architecture Availability
Complementary systems list available modules automatically based on the detected architecture.
However, you can load one of the three modules -- `aarch64`, `avx2`, and `avx512` --
to reload the list of modules available for the respective architecture:
```console
[user@login.cs ~]$ ml architecture/aarch64
aarch64 modules + all modules
[user@login.cs ~]$ ml architecture/avx2
avx2 modules + all modules
[user@login.cs ~]$ ml architecture/avx512
avx512 modules + all modules
```
# Complementary System Job Scheduling
## Introduction
The [Slurm][1] workload manager is used to allocate and access Complementary systems resources.
## Getting Partition Information
Display partitions/queues
```console
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
p00-arm up 1-00:00:00 0/1/0/1 p00-arm01
p01-arm* up 1-00:00:00 0/8/0/8 p01-arm[01-08]
p02-intel up 1-00:00:00 0/2/0/2 p02-intel[01-02]
p03-amd up 1-00:00:00 0/2/0/2 p03-amd[01-02]
p04-edge up 1-00:00:00 0/1/0/1 p04-edge01
p05-synt up 1-00:00:00 0/1/0/1 p05-synt01
p06-arm up 1-00:00:00 0/2/0/2 p06-arm[01-02]
p07-power up 1-00:00:00 0/1/0/1 p07-power01
p08-amd up 1-00:00:00 0/1/0/1 p08-amd01
p10-intel up 1-00:00:00 0/1/0/1 p10-intel01
```
## Getting Job Information
Show jobs
```console
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
104 p01-arm interact user R 1:48 2 p01-arm[01-02]
```
Show job details for specific job
```console
$ scontrol -d show job JOBID
```
Show job details for executing job from job session
```console
$ scontrol -d show job $SLURM_JOBID
```
## Running Interactive Jobs
Run interactive job
```console
$ salloc -A PROJECT-ID -p p01-arm
```
Run interactive job, with X11 forwarding
```console
$ salloc -A PROJECT-ID -p p01-arm --x11
```
!!! warning
    Do not use `srun` to initiate interactive jobs; subsequent `srun` or `mpirun` invocations would block forever.
## Running Batch Jobs
Run batch job
```console
$ sbatch -A PROJECT-ID -p p01-arm ./script.sh
```
Useful command options for `salloc`, `sbatch`, and `srun` (an example combining them is shown after this list):
* -n, --ntasks
* -c, --cpus-per-task
* -N, --nodes
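For example, the options above might be combined as follows (values are illustrative only):
```console
$ sbatch -A PROJECT-ID -p p01-arm -N 2 -n 96 -c 1 ./script.sh
```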
## Slurm Job Environment Variables
Slurm provides useful information to the job via environment variables. Environment variables are available on all nodes allocated to the job when accessed via Slurm-supported means (`srun`, compatible `mpirun`).
See all Slurm variables
```
set | grep ^SLURM
```
### Useful Variables
| variable name | description | example |
| ------ | ------ | ------ |
| SLURM_JOB_ID | job id of the executing job| 593 |
| SLURM_JOB_NODELIST | nodes allocated to the job | p03-amd[01-02] |
| SLURM_JOB_NUM_NODES | number of nodes allocated to the job | 2 |
| SLURM_STEP_NODELIST | nodes allocated to the job step | p03-amd01 |
| SLURM_STEP_NUM_NODES | number of nodes allocated to the job step | 1 |
| SLURM_JOB_PARTITION | name of the partition | p03-amd |
| SLURM_SUBMIT_DIR | submit directory | /scratch/project/open-xx-yy/work |
See [Slurm srun documentation][2] for details.
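A minimal job script sketch printing some of these variables:
```
#!/bin/bash
echo "Job ${SLURM_JOB_ID} runs on ${SLURM_JOB_NODELIST} (${SLURM_JOB_NUM_NODES} nodes)"
echo "Partition: ${SLURM_JOB_PARTITION}, submitted from: ${SLURM_SUBMIT_DIR}"
```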
Get job nodelist
```
$ echo $SLURM_JOB_NODELIST
p03-amd[01-02]
```
Expand nodelist to list of nodes.
```
$ scontrol show hostnames $SLURM_JOB_NODELIST
p03-amd01
p03-amd02
```
## Modifying Jobs
```
$ scontrol update JobId=JOBID ATTR=VALUE
```
for example
```
$ scontrol update JobId=JOBID Comment='The best job ever'
```
## Deleting Jobs
```
$ scancel JOBID
```
## Partitions
| PARTITION | nodes | whole node | cores per node | features |
| --------- | ----- | ---------- | -------------- | -------- |
| p00-arm | 1 | yes | 64 | aarch64,cortex-a72 |
| p01-arm | 8 | yes | 48 | aarch64,a64fx,ib |
| p02-intel | 2 | no | 64 | x86_64,intel,icelake,ib,fpga,bitware,nvdimm |
| p03-amd | 2 | no | 64 | x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx |
| p04-edge | 1 | yes | 16 | x86_64,intel,broadwell,ib |
| p05-synt | 1 | yes | 8 | x86_64,amd,milan,ib,ht |
| p06-arm | 2 | yes | 80 | aarch64,ib |
| p07-power | 1 | yes | 192 | ppc64le,ib |
| p08-amd | 1 | yes | 128 | x86_64,amd,milan-x,ib,ht |
| p10-intel | 1 | yes | 96 | x86_64,intel,sapphire_rapids,ht|
Use the `-t`/`--time` option to specify the job run time limit. The default job time limit is 2 hours, the maximum job time limit is 24 hours.
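For example, to request a 4-hour time limit (illustrative values):
```console
$ salloc -A PROJECT-ID -p p01-arm -t 04:00:00
```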
FIFO scheduling with backfilling is employed.
## Partition 00 - ARM (Cortex-A72)
Whole node allocation.
One node:
```console
salloc -A PROJECT-ID -p p00-arm
```
## Partition 01 - ARM (A64FX)
Whole node allocation.
One node:
```console
salloc -A PROJECT-ID -p p01-arm
```
```console
salloc -A PROJECT-ID -p p01-arm -N 1
```
Multiple nodes:
```console
salloc -A PROJECT-ID -p p01-arm -N 8
```
## Partition 02 - Intel (Ice Lake, NVDIMMs + Bitware FPGAs)
FPGAs are treated as resources. See below for more details about resources.
Partial allocation - per FPGA, resource separation is not enforced.
Use only FPGAs allocated to the job!
One FPGA:
```console
salloc -A PROJECT-ID -p p02-intel --gres=fpga
```
Two FPGAs on the same node:
```console
salloc -A PROJECT-ID -p p02-intel --gres=fpga:2
```
All FPGAs:
```console
salloc -A PROJECT-ID -p p02-intel -N 2 --gres=fpga:2
```
## Partition 03 - AMD (Milan, MI100 GPUs + Xilinx FPGAs)
GPUs and FPGAs are treated as resources. See below for more details about resources.
Partial allocation - per GPU and per FPGA, resource separation is not enforced.
Use only GPUs and FPGAs allocated to the job!
One GPU:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu
```
Two GPUs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:2
```
Four GPUs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:4
```
All GPUs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpu:4
```
One FPGA:
```console
salloc -A PROJECT-ID -p p03-amd --gres=fpga
```
Two FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd --gres=fpga:2
```
All FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=fpga:2
```
One GPU and one FPGA on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu,fpga
```
Four GPUs and two FPGAs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:4,fpga:2
```
All GPUs and FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpu:4,fpga:2
```
## Partition 04 - Edge Server
Whole node allocation:
```console
salloc -A PROJECT-ID -p p04-edge
```
## Partition 05 - FPGA Synthesis Server
Whole node allocation:
```console
salloc -A PROJECT-ID -p p05-synt
```
## Partition 06 - ARM
Whole node allocation:
```console
salloc -A PROJECT-ID -p p06-arm
```
## Partition 07 - IBM Power
Whole node allocation:
```console
salloc -A PROJECT-ID -p p07-power
```
## Partition 08 - AMD Milan-X
Whole node allocation:
```console
salloc -A PROJECT-ID -p p08-amd
```
## Partition 10 - Intel Sapphire Rapids
Whole node allocation:
```console
salloc -A PROJECT-ID -p p10-intel
```
## Features
Nodes have feature tags assigned to them.
Users can select nodes based on the feature tags using the `--constraint` option.
| Feature | Description |
| ------ | ------ |
| aarch64 | platform |
| x86_64 | platform |
| ppc64le | platform |
| amd | manufacturer |
| intel | manufacturer |
| icelake | processor family |
| broadwell | processor family |
| sapphire_rapids | processor family |
| milan | processor family |
| milan-x | processor family |
| ib | Infiniband |
| gpu | equipped with GPU |
| fpga | equipped with FPGA |
| nvdimm | equipped with NVDIMMs |
| ht | Hyperthreading enabled |
| noht | Hyperthreading disabled |
```
$ sinfo -o '%16N %f'
NODELIST AVAIL_FEATURES
p00-arm01 aarch64,cortex-a72
p01-arm[01-08] aarch64,a64fx,ib
p02-intel01 x86_64,intel,icelake,ib,fpga,bitware,nvdimm,ht
p02-intel02 x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
p03-amd02 x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx,noht
p03-amd01 x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx,ht
p04-edge01 x86_64,intel,broadwell,ib,ht
p05-synt01 x86_64,amd,milan,ib,ht
p06-arm[01-02] aarch64,ib
p07-power01 ppc64le,ib
p08-amd01 x86_64,amd,milan-x,ib,ht
p10-intel01 x86_64,intel,sapphire_rapids,ht
```
```
$ salloc -A PROJECT-ID -p p02-intel --constraint noht
```
```
$ scontrol -d show node p02-intel02 | grep ActiveFeatures
ActiveFeatures=x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
```
## Resources, GRES
Slurm supports the ability to define and schedule arbitrary resources - Generic RESources (GRES) in Slurm's terminology. We use GRES for scheduling/allocating GPUs and FPGAs.
!!! warning
    Use only allocated GPUs and FPGAs. Resource separation is not enforced. If you use non-allocated resources, you can observe strange behavior and run into trouble.
### Node Resources
Get information about GRES on node.
```
$ scontrol -d show node p02-intel01 | grep Gres=
Gres=fpga:bitware_520n_mx:2
$ scontrol -d show node p02-intel02 | grep Gres=
Gres=fpga:bitware_520n_mx:2
$ scontrol -d show node p03-amd01 | grep Gres=
Gres=gpu:amd_mi100:4,fpga:xilinx_alveo_u250:2
$ scontrol -d show node p03-amd02 | grep Gres=
Gres=gpu:amd_mi100:4,fpga:xilinx_alveo_u280:2
```
### Request Resources
To allocate the required resources (GPUs or FPGAs), use the `--gres` option of `salloc`/`srun`.
Example: Allocate one FPGA
```
$ salloc -A PROJECT-ID -p p03-amd --gres fpga:1
```
### Find Out Allocated Resources
Information about allocated resources is available in Slurm job details, attributes `JOB_GRES` and `GRES`.
```
$ scontrol -d show job $SLURM_JOBID |grep GRES=
JOB_GRES=fpga:xilinx_alveo_u250:1
Nodes=p03-amd01 CPU_IDs=0-1 Mem=0 GRES=fpga:xilinx_alveo_u250:1(IDX:0)
```
The `IDX` value in the `GRES` attribute specifies the index(es) of the FPGA(s) or GPU(s) allocated to the job on the node. In the given example, the allocated resource is `fpga:xilinx_alveo_u250:1(IDX:0)`, so we should use the FPGA with index 0 on node p03-amd01.
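For AMD GPUs on p03-amd, one possible (hedged) way to point a ROCm application at the allocated index from `IDX` is the `ROCR_VISIBLE_DEVICES` environment variable:
```
$ export ROCR_VISIBLE_DEVICES=0   # use the GPU index reported in IDX
```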
### Request Specific Resources
It is possible to allocate specific resources. This is useful for partition p03-amd, which is equipped with FPGAs of different types.
A GRES entry uses the format `name[[:type]:count]`; in the following example, the name is `fpga`, the type is `xilinx_alveo_u280`, and the count is 2.
```
$ salloc -A PROJECT-ID -p p03-amd --gres=fpga:xilinx_alveo_u280:2
salloc: Granted job allocation XXX
salloc: Waiting for resource configuration
salloc: Nodes p03-amd02 are ready for job
$ scontrol -d show job $SLURM_JOBID | grep -i gres
JOB_GRES=fpga:xilinx_alveo_u280:2
Nodes=p03-amd02 CPU_IDs=0 Mem=0 GRES=fpga:xilinx_alveo_u280(IDX:0-1)
TresPerNode=gres:fpga:xilinx_alveo_u280:2
```
[1]: https://slurm.schedmd.com/
[2]: https://slurm.schedmd.com/srun.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES
# Complementary Systems Specifications
Below are the technical specifications of individual Complementary systems.
## Partition 0 - ARM (Cortex-A72)
The partition is based on the [ARMv8-A 64-bit][4] architecture.
- Cortex-A72
- ARMv8-A 64-bit
- 2x 32 cores @ 2 GHz
- 255 GB memory
- disk capacity 3.7 TB
- 1x Infiniband FDR 56 Gb/s
## Partition 1 - ARM (A64FX)
The partition is based on the Armv8.2-A architecture
with the SVE instruction set extension and
consists of 8 compute nodes with the following per-node parameters:
- 1x Fujitsu A64FX CPU
- Armv8.2-A ISA CPU with the Scalable Vector Extension (SVE)
- 48 cores at 2.0 GHz
- 32 GB of HBM2 memory
- 400 GB SSD (m.2 form factor) – mixed used type
- 1x Infiniband HDR100 interface
- connected via 16x PCI-e Gen3 slot to the CPU
## Partition 2 - Intel (Ice Lake, NVDIMMs) <!--- + Bitware FPGAs) -->
The partition is based on the Intel Ice Lake x86 architecture.
It contains two servers with Intel NVDIMM memories.
<!--- The key technologies installed are Intel NVDIMM memories. and Intel FPGA accelerators.
The partition contains two servers each with two FPGA accelerators. -->
Each server has the following parameters:
- 2x 3rd Gen Xeon Scalable Processors Intel Xeon Gold 6338 CPU
- 32-cores @ 2.00GHz
- 16x 16GB RAM with ECC
- DDR4-3200
- 1x Infiniband HDR100 interface
- connected to CPU 8x PCI-e Gen4 interface
- 3.2 TB NVMe local storage – mixed use type
<!---
2x FPGA accelerators
Bitware [520N-MX][1]
-->
In addition, the servers have the following parameters:
- Intel server 1 – low NVDIMM memory server with 2304 GB NVDIMM memory
- 16x 128GB NVDIMM persistent memory modules
- Intel server 2 – high NVDIMM memory server with 8448 GB NVDIMM memory
- 16x 512GB NVDIMM persistent memory modules
Software installed on the partition:
FPGA boards support application development using the following design flows:
- OpenCL
- High-Level Synthesis (C/C++) including support for OneAPI
- Verilog and VHDL
## Partition 3 - AMD (Milan, MI100 GPUs + Xilinx FPGAs)
The partition is based on two servers equipped with AMD Milan x86 CPUs,
AMD GPUs and Xilinx FPGAs architectures and represents an alternative
to the Intel-based partition's ecosystem.
Each server has the following parameters:
- 2x AMD Milan 7513 CPU
- 32 cores @ 2.6 GHz
- 16x 16GB RAM with ECC
- DDR4-3200
- 4x AMD GPU accelerators MI 100
- Interconnected with AMD Infinity Fabric™ Link for fast GPU to GPU communication
- 1x 100 GBps Infiniband HDR100
- connected to CPU via 8x PCI-e Gen4 interface
- 3.2 TB NVMe local storage – mixed use
In addition:
- AMD server 1 has 2x FPGA [Xilinx Alveo U250 Data Center Accelerator Card][2]
- AMD server 2 has 2x FPGA [Xilinx Alveo U280 Data Center Accelerator Card][3]
Software installed on the partition includes developer tools and libraries for AMD GPUs.
The FPGA boards support application development using the following design flows:
- OpenCL
- High-Level Synthesis (C/C++)
- Verilog and VHDL
## Partition 4 - Edge Server
The partition provides an overview of the so-called edge computing class of resources,
with solutions powerful enough to provide data analytic capabilities (both CPU and GPU)
in a form factor that does not require a data center to operate.
The partition consists of one edge computing server with the following parameters:
- 1x x86_64 CPU Intel Xeon D-1587
- TDP 65 W,
- 16 cores,
- 435 GFlop/s theoretical max performance in double precision
- 1x CUDA programmable GPU NVIDIA Tesla T4
- TDP 70W
- theoretical performance 8.1 TFlop/s in FP32
- 128 GB RAM
- 1.92TB SSD storage
- connectivity:
- 2x 10 Gbps Ethernet,
- WiFi 802.11 ac,
- LTE connectivity
## Partition 5 - FPGA Synthesis Server
FPGA design tools usually run for several hours up to one day to generate the final bitstream (logic design) of large FPGA chips. These tools are usually sequential; therefore, a dedicated server for this task is part of the system.
This server runs the development tools needed for the FPGA boards installed in compute partitions 2 and 3.
- AMD EPYC 72F3, 8 cores @ 3.7 GHz nominal frequency
- 8 memory channels with ECC
- 128 GB of DDR4-3200 memory with ECC
- memory is fully populated to maximize memory subsystem performance
- 1x 10Gb Ethernet port used for connection to LAN
- NVMe local storage
- 2x NVMe disks 3.2TB, configured RAID 1
## Partition 6 - ARM + CUDA GPGPU (Ampere) + DPU
This partition is based on the ARM architecture and is equipped with CUDA-programmable GPGPU accelerators
based on the Ampere architecture and with DPU network processing units.
The partition consists of two nodes with the following per-node parameters:
- Server Gigabyte G242-P36, Ampere Altra Q80-30 (80c, 3.0GHz)
- 512GB DIMM DDR4, 3200MHz, ECC, CL22
- 2x Micron 7400 PRO 1920GB NVMe M.2 Non-SED Enterprise SSD
- 2x NVIDIA A30 GPU Accelerator
- 2x NVIDIA BlueField-2 E-Series DPU 25GbE Dual-Port SFP56, PCIe Gen4 x16, 16GB DDR + 64, 200Gb Ethernet
- Mellanox ConnectX-5 EN network interface card, 10/25GbE dual-port SFP28, PCIe3.0 x8
- Mellanox ConnectX-6 VPI adapter card, 100Gb/s (HDR100, EDR IB and 100GbE), single-port QSFP56
## Partition 7 - IBM
The IBM Power10 server is a single-node partition with the following parameters:
- Server IBM POWER S1022
- 2x Power10 12-CORE TYPICAL 2.90 TO 4.0 GHZ (MAX) PO
- 512GB DDIMMS, 3200 MHZ, 8GBIT DDR4
- 2x ENTERPRISE 1.6 TB SSD PCIE4 NVME U.2 MOD
- 2x ENTERPRISE 6.4 TB SSD PCIE4 NVME U.2 MOD
- PCIE3 LP 2-PORT 25/10GB NIC&ROCE SR/CU A
## Partition 8 - HPE Proliant
This partition provides a modern CPU with a very large L3 cache.
The goal is to enable users to develop algorithms and libraries
that will efficiently utilize this technology.
The processor is very efficient, for example, for linear algebra on relatively small matrices.
This is a single-node partition with the following parameters:
- Server HPE Proliant DL 385 Gen10 Plus v2 CTO
- 2x AMD EPYC 7773X Milan-X, 64 cores, 2.2GHz, 768 MB L3 cache
- 16x HPE 16GB (1x+16GB) x4 DDR4-3200 Registered Smart Memory Kit
- 2x 3.84TB NVMe RI SFF BC U.3ST MV SSD
- BCM 57412 10GbE 2p SFP+ OCP3 Adptr
- HPE IB HDR100/EN 100Gb 1p QSFP56 Adptr1
- HPE Cray Programming Environment for x86 Systems 2 Seats
## Partition 9 - Virtual GPU Accelerated Workstation
This partition provides users with a remote/virtual workstation running MS Windows OS.
It offers rich graphical environment with a focus on 3D OpenGL
or RayTracing-based applications with the smallest possible degradation of user experience.
The partition consists of two nodes with the following per-node parameters:
- Server HPE Proliant DL 385 Gen10 Plus v2 CTO
- 2x AMD EPYC 7413, 24 cores, 2.55GHz
- 16x HPE 32GB 2Rx4 PC4-3200AA-R Smart Kit
- 2x 3.84TB NVMe RI SFF BC U.3ST MV SSD
- BCM 57412 10GbE 2p SFP+ OCP3 Adptr
- 2x NVIDIA A40 48GB GPU Accelerator
### Available Software
The following software is available on partition 9:
- Academic VMware Horizon 8 Enterprise Term Edition: 10 Concurrent User Pack for 4 year term license; includes SnS
- 8x NVIDIA RTX Virtual Workstation, per concurrent user, EDU, perpetual license
- 32x NVIDIA RTX Virtual Workstation, per concurrent user, EDU SUMS per year
- 7x Windows Server 2022 Standard - 16 Core License Pack
- 10x Windows Server 2022 - 1 User CAL
- 40x Windows 10/11 Enterprise E3 VDA (Microsoft) per year
- Hardware VMware Horizon management
## Partition 10 - Sapphire Rapids-HBM Server
The primary purpose of this server is to evaluate how HBM memory on an x86 processor
impacts the performance of user applications.
Until now, HBM was available only on GPGPU accelerators,
where it provides a significant boost to memory-bound applications.
Users can also compare the impact of HBM memory with the impact of the large L3 cache
available on the AMD Milan-X processor, which is also part of the complementary systems.
The server is additionally equipped with DDR5 memory, enabling comparative studies with reference to DDR4-based systems.
- 2x Intel® Xeon® CPU Max 9468, 48 cores, base 2.1 GHz, max 3.5 GHz
- 16x 16GB DDR5 4800 MHz
- 2x Intel D3 S4520 960GB SATA 6Gb/s
- 1x Supermicro Standard LP 2-port 10GbE RJ45, Broadcom BCM57416
## Partition 11 - NVIDIA Grace CPU Superchip
The [NVIDIA Grace CPU Superchip][6] uses the [NVIDIA® NVLink®-C2C][5] technology to deliver 144 Arm® Neoverse V2 cores and 1TB/s of memory bandwidth.
It runs all NVIDIA software stacks and platforms, including NVIDIA RTX™, NVIDIA HPC SDK, NVIDIA AI, and NVIDIA Omniverse™.
- Superchip design with up to 144 Arm Neoverse V2 CPU cores with Scalable Vector Extensions (SVE2)
- World’s first LPDDR5X with error-correcting code (ECC) memory, 1TB/s total bandwidth
- 900GB/s coherent interface, 7X faster than PCIe Gen 5
- NVIDIA Scalable Coherency Fabric with 3.2TB/s of aggregate bisectional bandwidth
- 2X the packaging density of DIMM-based solutions
- 2X the performance per watt of today’s leading CPU
- FP64 Peak of 7.1TFLOPS
[1]: https://www.bittware.com/fpga/520n-mx/
[2]: https://www.xilinx.com/products/boards-and-kits/alveo/u250.html#overview
[3]: https://www.xilinx.com/products/boards-and-kits/alveo/u280.html#overview
[4]: https://developer.arm.com/documentation/100095/0003/
[5]: https://www.nvidia.com/en-us/data-center/nvlink-c2c/
[6]: https://www.nvidia.com/en-us/data-center/grace-cpu-superchip/
# Accessing the DGX-2
## Before You Access
!!! warning
GPUs are single-user devices. GPU memory is not purged between job runs and it can be read (but not written) by any user. Consider the confidentiality of your running jobs.
## How to Access
The DGX-2 machine is integrated into [Barbora cluster][3].
The DGX-2 machine can be accessed from the Barbora login nodes `barbora.it4i.cz` through the Barbora scheduler queue `qdgx` as the compute node `cn202`.
## Storage
There are three shared file systems on the DGX-2 system: HOME, SCRATCH (LSCRATCH), and PROJECT.
### HOME
The HOME filesystem is realized as an NFS filesystem. This is a shared home from the [Barbora cluster][1].
### SCRATCH
The SCRATCH is realized on an NVME storage. The SCRATCH filesystem is mounted in the `/scratch` directory.
Accessible capacity is 22TB, shared among all users.
!!! warning
Files on the SCRATCH filesystem that are not accessed for more than 60 days will be automatically deleted.
### PROJECT
The PROJECT data storage is IT4Innovations' central data storage accessible from all clusters.
For more information on accessing PROJECT, its quotas, etc., see the [PROJECT Data Storage][2] section.
[1]: ../../barbora/storage/#home-file-system
[2]: ../../storage/project-storage
[3]: ../../barbora/introduction
# NVIDIA DGX-2
The DGX-2 is a very powerful computational node, featuring high end x86_64 processors and 16 NVIDIA V100-SXM3 GPUs.
| NVIDIA DGX-2 | |
| --- | --- |
| CPUs | 2 x Intel Xeon Platinum |
| GPUs | 16 x NVIDIA Tesla V100 32GB HBM2 |
| System Memory | Up to 1.5 TB DDR4 |
| GPU Memory | 512 GB HBM2 (16 x 32 GB) |
| Storage | 30 TB NVMe, Up to 60 TB |
| Networking | 8 x Infiniband or 8 x 100 GbE |
| Power | 10 kW |
| Weight | 350 lbs |
| GPU Throughput | Tensor: 1920 TFLOPs, FP16: 520 TFLOPs, FP32: 260 TFLOPs, FP64: 130 TFLOPs |
The [DGX-2][a] introduces NVIDIA’s new NVSwitch, enabling 300 GB/s chip-to-chip communication at 12 times the speed of PCIe.
With NVLink2, it enables 16x NVIDIA V100-SXM3 GPUs in a single system, for a total bandwidth going beyond 14 TB/s.
Featuring a pair of Xeon 8168 CPUs, 1.5 TB of memory, and 30 TB of NVMe storage,
the system consumes 10 kW, weighs 163.29 kg, and offers double precision performance in excess of 130 TFLOPS.
The DGX-2 is designed to be a powerful server in its own right.
On the storage side, the DGX-2 comes with 30TB of NVMe-based solid state storage.
For clustering or further inter-system communications, it also offers InfiniBand and 100GigE connectivity, up to eight of them.
Further, the [DGX-2][b] offers a total of ~2 PFLOPs of half precision performance in a single system, when using the tensor cores.
![](../img/dgx1.png)
With the DGX-2, training AlexNet, the network that 'started' the latest machine learning revolution, now takes 18 minutes.
The DGX-2 is able to complete the training process
for FAIRSEQ – a neural network model for language translation – 10x faster than a DGX-1 system,
bringing it down to less than two days total rather than 15 days.
The new NVSwitches mean that the PCIe lanes of the CPUs can be redirected elsewhere, most notably towards storage and networking connectivity.
The topology of the DGX-2 means that all 16 GPUs are able to pool their memory into a unified memory space,
though with the usual tradeoffs involved if going off-chip.
![](../img/dgx2-nvlink.png)
[a]: https://www.nvidia.com/content/dam/en-zz/es_em/Solutions/Data-Center/dgx-2/nvidia-dgx-2-datasheet.pdf
[b]: https://www.youtube.com/embed/OTOGw0BRqK0
# Resource Allocation and Job Execution
To run a job, computational resources of DGX-2 must be allocated.
The DGX-2 machine is integrated into and accessible through the Barbora cluster; the queue for the DGX-2 machine is called **qdgx**.
When allocating computational resources for the job, specify:
1. your Project ID
1. a queue for your job - **qdgx**;
1. the maximum time allocated to your calculation (default is **4 hours**, maximum is **48 hours**);
1. a jobscript if batch processing is intended.
Submit the job using the `sbatch` (for batch processing) or `salloc` (for interactive session) command:
**Example**
```console
[kru0052@login2.barbora ~]$ salloc -A PROJECT-ID -p qdgx --time=02:00:00
salloc: Granted job allocation 36631
salloc: Waiting for resource configuration
salloc: Nodes cn202 are ready for job
kru0052@cn202:~$ nvidia-smi
Wed Jun 16 07:46:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM3... On | 00000000:34:00.0 Off | 0 |
| N/A 32C P0 51W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM3... On | 00000000:36:00.0 Off | 0 |
| N/A 31C P0 48W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM3... On | 00000000:39:00.0 Off | 0 |
| N/A 35C P0 53W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM3... On | 00000000:3B:00.0 Off | 0 |
| N/A 36C P0 53W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM3... On | 00000000:57:00.0 Off | 0 |
| N/A 29C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM3... On | 00000000:59:00.0 Off | 0 |
| N/A 35C P0 51W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM3... On | 00000000:5C:00.0 Off | 0 |
| N/A 30C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM3... On | 00000000:5E:00.0 Off | 0 |
| N/A 35C P0 53W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 8 Tesla V100-SXM3... On | 00000000:B7:00.0 Off | 0 |
| N/A 30C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 9 Tesla V100-SXM3... On | 00000000:B9:00.0 Off | 0 |
| N/A 30C P0 51W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 10 Tesla V100-SXM3... On | 00000000:BC:00.0 Off | 0 |
| N/A 35C P0 51W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 11 Tesla V100-SXM3... On | 00000000:BE:00.0 Off | 0 |
| N/A 35C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 12 Tesla V100-SXM3... On | 00000000:E0:00.0 Off | 0 |
| N/A 31C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 13 Tesla V100-SXM3... On | 00000000:E2:00.0 Off | 0 |
| N/A 29C P0 51W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 14 Tesla V100-SXM3... On | 00000000:E5:00.0 Off | 0 |
| N/A 34C P0 51W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 15 Tesla V100-SXM3... On | 00000000:E7:00.0 Off | 0 |
| N/A 34C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
kru0052@cn202:~$ exit
```
!!! tip
Submit the interactive job using the `salloc` command.
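For batch processing, a minimal jobscript sketch may look as follows; the walltime and the command are only illustrative:

```bash
#!/usr/bin/env bash
#SBATCH --account PROJECT-ID
#SBATCH --partition qdgx
#SBATCH --time 04:00:00

# the commands below run on the DGX-2 node (cn202)
nvidia-smi
```

Submit the jobscript with `sbatch jobscript.sh`.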
## Job Execution
The DGX-2 machine runs only a bare-bone, minimal operating system. Users are expected to run
**[Apptainer/Singularity][1]** containers in order to enrich the environment according to their needs.
Containers (Docker images) optimized for DGX-2 may be downloaded from
[NVIDIA GPU Cloud][2]. Select the code of interest and
copy the docker nvcr.io link from the Pull Command section. This link may be directly used
to download the container via Apptainer/Singularity, see the example below:
### Example - Apptainer/Singularity Run Tensorflow
```console
[kru0052@login2.barbora ~] $ salloc -A PROJECT-ID -p qdgx --time=02:00:00
salloc: Granted job allocation 36633
salloc: Waiting for resource configuration
salloc: Nodes cn202 are ready for job
kru0052@cn202:~$ singularity shell docker://nvcr.io/nvidia/tensorflow:19.02-py3
Singularity tensorflow_19.02-py3.sif:~>
Singularity tensorflow_19.02-py3.sif:~> mpiexec --bind-to socket -np 16 python /opt/tensorflow/nvidia-examples/cnn/resnet.py --layers=18 --precision=fp16 --batch_size=512
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
...
...
...
2019-03-11 08:30:12.263822: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
1 1.0 338.2 6.999 7.291 2.00000
10 10.0 3658.6 5.658 5.950 1.62000
20 20.0 25628.6 2.957 3.258 1.24469
30 30.0 30815.1 0.177 0.494 0.91877
40 40.0 30826.3 0.004 0.330 0.64222
50 50.0 30884.3 0.002 0.327 0.41506
60 60.0 30888.7 0.001 0.325 0.23728
70 70.0 30763.2 0.001 0.324 0.10889
80 80.0 30845.5 0.001 0.324 0.02988
90 90.0 26350.9 0.001 0.324 0.00025
kru0052@cn202:~$ exit
```
**GPU stat**
The GPU load can be determined by the `gpustat` utility.
```console
Every 2,0s: gpustat --color
dgx Mon Mar 11 09:31:00 2019
[0] Tesla V100-SXM3-32GB | 47'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[1] Tesla V100-SXM3-32GB | 48'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[2] Tesla V100-SXM3-32GB | 56'C, 97 % | 23660 / 32480 MB | kru0052(23645M)
[3] Tesla V100-SXM3-32GB | 57'C, 97 % | 23660 / 32480 MB | kru0052(23645M)
[4] Tesla V100-SXM3-32GB | 46'C, 97 % | 23660 / 32480 MB | kru0052(23645M)
[5] Tesla V100-SXM3-32GB | 55'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[6] Tesla V100-SXM3-32GB | 45'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[7] Tesla V100-SXM3-32GB | 54'C, 97 % | 23660 / 32480 MB | kru0052(23645M)
[8] Tesla V100-SXM3-32GB | 45'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[9] Tesla V100-SXM3-32GB | 46'C, 95 % | 23660 / 32480 MB | kru0052(23645M)
[10] Tesla V100-SXM3-32GB | 55'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[11] Tesla V100-SXM3-32GB | 56'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[12] Tesla V100-SXM3-32GB | 47'C, 95 % | 23660 / 32480 MB | kru0052(23645M)
[13] Tesla V100-SXM3-32GB | 45'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[14] Tesla V100-SXM3-32GB | 55'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[15] Tesla V100-SXM3-32GB | 58'C, 95 % | 23660 / 32480 MB | kru0052(23645M)
```
[1]: https://docs.it4i.cz/software/tools/singularity/
[2]: https://ngc.nvidia.com/
# Software Deployment
Software deployment on DGX-2 is based on containers. NVIDIA provides a wide range of prepared Docker containers with a variety of different software. Users can easily download these containers and use them directly on the DGX-2.
The catalog of all container images can be found on [NVIDIA site][a]. Supported software includes:
* TensorFlow
* MATLAB
* GROMACS
* Theano
* Caffe2
* LAMMPS
* ParaView
* ...
## Running Containers on DGX-2
NVIDIA expects usage of Docker as a containerization tool, but Docker is not a suitable solution in a multiuser environment. For this reason, the [Apptainer/Singularity container][b] solution is used.
Singularity can be used similarly to Docker; just change the image URL. For example, the original Docker command `docker run -it nvcr.io/nvidia/theano:18.08` should be changed to `singularity shell docker://nvcr.io/nvidia/theano:18.08`. More about Apptainer/Singularity [here][1].
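To run a single command inside a container with the NVIDIA GPUs exposed, the `--nv` flag can be used. A minimal sketch, where the image tag and the command are only illustrative:

```console
$ singularity exec --nv docker://nvcr.io/nvidia/tensorflow:19.02-py3 python -c "import tensorflow as tf; print(tf.__version__)"
```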
For fast container deployment, all images are cached after first use in the *lscratch* directory. This behavior can be changed by the *SINGULARITY_CACHEDIR* environment variable, but the start time of the container will increase significantly.
```console
$ ml av Singularity
---------------------------- /apps/modules/tools ----------------------------
Singularity/3.3.0
```
## MPI Modules
```console
$ ml av MPI
---------------------------- /apps/modules/mpi ----------------------------
OpenMPI/2.1.5-GCC-6.3.0-2.27 OpenMPI/3.1.4-GCC-6.3.0-2.27 OpenMPI/4.0.0-GCC-6.3.0-2.27 (D) impi/2017.4.239-iccifort-2017.7.259-GCC-6.3.0-2.27
```
## Compiler Modules
```console
$ ml av gcc
---------------------------- /apps/modules/compiler ----------------------------
GCC/6.3.0-2.27 GCCcore/6.3.0 icc/2017.7.259-GCC-6.3.0-2.27 ifort/2017.7.259-GCC-6.3.0-2.27
```
[1]: ../software/tools/singularity.md
[a]: https://ngc.nvidia.com/catalog/landing
[b]: https://www.sylabs.io/
# What Is the DICE Project?
DICE (Data Infrastructure Capacity for EOSC) is an international project funded by the European Union
that provides cutting-edge data management services and a significant amount of storage resources for the EOSC.
The EOSC (European Open Science Cloud) project provides European researchers, innovators, companies,
and citizens with a federated and open multi-disciplinary environment
where they can publish, find, and re-use data, tools, and services for research, innovation and educational purposes.
For more information, see the official [DICE project][b] and [EOSC project][q] pages.
**IT4Innovations participates in DICE. DICE uses the iRODS software.**
The integrated Rule-Oriented Data System (iRODS) is an open source data management software
used by research organizations and government agencies worldwide.
iRODS is released as a production-level distribution aimed at deployment in mission critical environments.
It virtualizes data storage resources, so users can take control of their data,
regardless of where and on what device the data is stored.
As data volumes grow and data services become more complex,
iRODS is serving an increasingly important role in data management.
For more information, see [the official iRODS page][c].
## How to Put Your Data to Our Server
**Prerequisites:**
First, we need to verify your identity. This is done through the following steps:
1. Sign in with your organization [B2ACCESS][d]; the page requests a valid personal certificate (e.g. GEANT).
Accounts with "Low" level of assurance are not granted access to IT4I zone.
1. Confirm your certificate in the browser:
![](img/B2ACCESS_chrome_eng.jpg)
1. Confirm your certificate in the OS (Windows):
![](img/crypto_v2.jpg)
1. Sign in to EUDAT/B2ACCESS:
![](img/eudat_v2.jpg)
1. After successful login to B2Access:
1. **For Non IT4I Users**
Sign in to our [AAI][f] through your B2Access account.
You have to set a new password for iRODS access.
1. **For IT4I Users**
Sign in to our [AAI][f] through your B2Access account and link your B2ACCESS identity with your existing account.
The iRODS password will be the same as your IT4I LDAP password (i.e. code.it4i.cz password).
![](img/aai.jpg)
![](img/aai2.jpg)
![](img/aai3-passwd.jpg)
![](img/irods_linking_link.jpg)
1. Contact [support@it4i.cz][a], so we can create your account at our iRODS server.
1. **Fill in this request on [EOSC-MARKETPLACE][h] (recommended)** or at [EUDAT][l]; please specify the requested capacity.
![](img/eosc-marketplace-active.jpg)
![](img/eosc-providers.jpg)
![](img/eudat_request.jpg)
## Access to iRODS Collection From Karolina
Access to iRODS Collection requires access to the Karolina cluster (i.e. [IT4I account][4]),
since iRODS clients are provided as a module on Karolina (Barbora is in progress).
The `irodsfs` module provides configuration files for both irodsfs and iCommands.
Note that you can change your iRODS password at [aai.it4i.cz][m].
### Mounting Your Collection
```console
ssh some_user@karolina.it4i.cz
ml irodsfs
```
Now you can choose between the Fuse client and iCommands:
#### Fuse
```console
ssh some_user@karolina.it4i.cz
[some_use@login4.karolina ~]$ ml irodsfs
irodsfs configuration file has been created at /home/dvo0012/.irods/config.yml
iRODS environment file has been created at /home/dvo0012/.irods/irods_environment.json
to start irodsfs, run: irodsfs -config ~/.irods/config.yml ~/IRODS
to start iCommands, run: iinit
For more information, see https://docs.it4i.cz/dice/
```
To mount your iRODS collection to ~/IRODS, run
```console
[some_user@login4.karolina ~]$ irodsfs -config ~/.irods/config.yml ~/IRODS
time="2022-08-04 08:54:13.222836" level=info msg="Logging to /tmp/irodsfs_cblmq5ab1lsaj31vrv20.log" function=processArguments package=main
Password:
time="2022-08-04 08:54:18.698811" level=info msg="Found FUSE Device. Starting iRODS FUSE Lite." function=parentMain package=main
time="2022-08-04 08:54:18.699080" level=info msg="Running the process in the background mode" function=parentRun package=main
time="2022-08-04 08:54:18.699544" level=info msg="Process id = 27145" function=parentRun package=main
time="2022-08-04 08:54:18.699572" level=info msg="Sending configuration data" function=parentRun package=main
time="2022-08-04 08:54:18.699730" level=info msg="Successfully sent configuration data to background process" function=parentRun package=main
time="2022-08-04 08:54:18.922490" level=info msg="Successfully started background process" function=parentRun package=main
```
To unmount it, run
```console
fusermount -u ~/IRODS
```
You can work with Fuse as an ordinary directory (`ls`, `cd`, `cp`, `mv`, etc.).
#### iCommands
```console
ssh some_user@karolina.it4i.cz
[some_use@login4.karolina ~]$ ml irodsfs
irodsfs configuration file has been created at /home/dvo0012/.irods/config.yml.
to start irods fs run: irodsfs -config ~/.irods/config.yml ~/IRODS
iCommands environment file has been created at /home/$USER/.irods/irods_environment.json.
to start iCommands run: iinit
[some_user@login4.karolina ~]$ iinit
Enter your current PAM password:
```
```console
[some_use@login4.karolina ~]$ ils
/IT4I/home/some_user:
test.1
test.2
test.3
test.4
```
Use the command `iput` for upload, `iget` for download, or `ihelp` for help.
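A minimal transfer session may look like this; the file names are only illustrative:

```console
[some_user@login4.karolina ~]$ iput results.tar.gz
[some_user@login4.karolina ~]$ ils
/IT4I/home/some_user:
  results.tar.gz
[some_user@login4.karolina ~]$ iget results.tar.gz ~/downloads/
```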
## Access to iRODS Collection From Other Resource
!!! note
This guide assumes you are uploading your data from your local PC/VM.
Use the password from [AAI][f].
### You Need a Client to Connect to iRODS Server
There are many iRODS clients, but we recommend the following:
- Cyberduck - Windows/Mac, GUI
- Fuse (irodsfs lite) - Linux, CLI
- iCommands - Linux, CLI.
For access, set PAM passwords at [AAI][f].
### Cyberduck
1. Download [Cyberduck][i].
2. Download [connection profile][1] for IT4I iRods server.
3. Left double-click this file to open connection.
![](img/irods-cyberduck.jpg)
### Fuse
!!!note "Linux client only"
This is a Linux client only, basic knowledge of the command line is necessary.
Fuse allows you to work with your iRODS collection like an ordinary directory.
```console
cd ~
wget https://github.com/cyverse/irodsfs/releases/download/v0.7.6/irodsfs_amd64_linux_v0.7.6.tar
tar -xvf ~/irodsfs_amd64_linux_v0.7.6.tar
mkdir ~/IRODS ~/.irods/ && cd "$_" && wget https://docs.it4i.cz/config.yml
wget https://pki.cesnet.cz/_media/certs/chain_geant_ov_rsa_ca_4_full.pem -P ~/.irods/
```
Edit `~/.irods/config.yml` with username from [AAI][f].
#### Mounting Your Collection
```console
[some_user@local_pc ~]$ ./irodsfs -config ~/.irods/config.yml ~/IRODS
time="2022-07-29 09:51:11.720831" level=info msg="Logging to /tmp/irodsfs_cbhp2rucso0ef0s7dtl0.log" function=processArguments package=main
Password:
time="2022-07-29 09:51:17.691988" level=info msg="Found FUSE Device. Starting iRODS FUSE Lite." function=parentMain package=main
time="2022-07-29 09:51:17.692683" level=info msg="Running the process in the background mode" function=parentRun package=main
time="2022-07-29 09:51:17.693381" level=info msg="Process id = 74772" function=parentRun package=main
time="2022-07-29 09:51:17.693421" level=info msg="Sending configuration data" function=parentRun package=main
time="2022-07-29 09:51:17.693772" level=info msg="Successfully sent configuration data to background process" function=parentRun package=main
time="2022-07-29 09:51:18.008166" level=info msg="Successfully started background process" function=parentRun package=main
```
#### Putting Your Data to iRODS
```console
[some_use@local_pc ~]$ cp test1G.txt ~/IRODS
```
It works as an ordinary file system:
```console
[some_user@local_pc ~]$ ls -la ~/IRODS
total 0
-rwx------ 1 some_user some_user 1073741824 Nov 4 2021 test1G.txt
```
#### Unmounting Your Collection
To stop/unmount your collection, use:
```console
[some_user@local_pc ~]$ fusermount -u ~/IRODS
```
### iCommands
!!!note "Linux client only"
This is a Linux client only, basic knowledge of the command line is necessary.
    We recommend CentOS 7; Ubuntu 20 is also an option.
#### Steps for Ubuntu 20
```console
LSB_RELEASE="bionic"
wget -qO - https://packages.irods.org/irods-signing-key.asc | sudo apt-key add -
echo "deb [arch=amd64] https://packages.irods.org/apt/ ${LSB_RELEASE} main" \
> | sudo tee /etc/apt/sources.list.d/renci-irods.list
deb [arch=amd64] https://packages.irods.org/apt/ bionic main
sudo apt-get update
apt-cache search irods
wget -c \
http://security.ubuntu.com/ubuntu/pool/main/p/python-urllib3/python-urllib3_1.22-1ubuntu0.18.04.2_all.deb \
http://security.ubuntu.com/ubuntu/pool/main/r/requests/python-requests_2.18.4-2ubuntu0.1_all.deb \
http://security.ubuntu.com/ubuntu/pool/main/o/openssl1.0/libssl1.0.0_1.0.2n-1ubuntu5.10_amd64.deb
sudo apt install \
./python-urllib3_1.22-1ubuntu0.18.04.2_all.deb \
./python-requests_2.18.4-2ubuntu0.1_all.deb \
  ./libssl1.0.0_1.0.2n-1ubuntu5.10_amd64.deb
sudo rm -rf \
./python-urllib3_1.22-1ubuntu0.18.04.2_all.deb \
./python-requests_2.18.4-2ubuntu0.1_all.deb \
  ./libssl1.0.0_1.0.2n-1ubuntu5.10_amd64.deb
sudo apt install -y irods-icommands
mkdir ~/.irods/ && cd "$_" && wget https://docs.it4i.cz/irods_environment.json
wget https://pki.cesnet.cz/_media/certs/chain_geant_ov_rsa_ca_4_full.pem -P ~/.irods
sed -i 's,~,'"$HOME"',g' ~/.irods/irods_environment.json
```
#### Steps for Centos
```console
sudo rpm --import https://packages.irods.org/irods-signing-key.asc
sudo wget -qO - https://packages.irods.org/renci-irods.yum.repo | sudo tee /etc/yum.repos.d/renci-irods.yum.repo
sudo yum install epel-release -y
sudo yum install python-psutil python-jsonschema
sudo yum install irods-icommands
mkdir ~/.irods/ && cd "$_" && wget https://docs.it4i.cz/irods_environment.json
wget https://pki.cesnet.cz/_media/certs/chain_geant_ov_rsa_ca_4_full.pem -P ~/.irods
sed -i 's,~,'"$HOME"',g' ~/.irods/irods_environment.json
```
Edit ***irods_user_name*** in `~/.irods/irods_environment.json` with the username from [AAI][f].
```console
[some_user@local_pc ~]$ pwd
/some_user/.irods
[some_user@local_pc ~]$ ls -la
total 16
drwx------. 2 some_user some_user 136 Sep 29 08:53 .
dr-xr-x---. 6 some_user some_user 206 Sep 29 08:53 ..
-rw-r--r--. 1 some_user some_user 253 Sep 29 08:14 irods_environment.json
```
**How to start:**
```console
[some_user@local_pc ~]$ iinit
Enter your current PAM password:
[some_user@local_pc ~]$ ils
/IT4I/home/some_user:
file.jpg
```
**How to put your data to iRODS**
```console
[some_user@local_pc ~]$ iput cesnet.crt
```
```console
[some_user@local_pc ~]$ ils
/IT4I/home/some_user:
cesnet.crt
```
**How to download data**
```console
[some_user@local_pc ~]$ iget cesnet.crt
ls -la ~
-rw-r--r--. 1 some_user some_user 1464 Jul 20 13:44 cesnet.crt
```
For more commands, use the `ihelp` command.
## PID Services
You, as a user, may want to index your datasets and assign Persistent Identifiers (PIDs) to them. We host a PID system based on hdl-surfsara ([https://it4i-handle.it4i.cz][o]), which is connected to [https://hdl.handle.net][p], and you can create your own PIDs by calling `irule`.
### How to Create a PID
PIDs are created by calling `irule`. Create the rule file in your `$HOME` directory or anywhere else you want,
but make sure the path is specified correctly.
Rules for PID operations always have the `.r` suffix.
This can be done only through iCommands.
Example of a rule for PID creation only:
```console
user in ~ λ pwd
/home/user
user in ~ λ ils
/IT4I/home/user:
C- /IT4I/home/dvo0012/Collection_A
user in ~ λ ls -l | grep pid
-rw-r--r-- 1 user user 249 Sep 30 10:55 create_pid.r
user in ~ λ cat create_pid.r
PID_DO_reg {
EUDATCreatePID(*parent_pid, *source, *ror, *fio, *fixed, *newPID);
writeLine("stdout","PID: *newPID");
}
INPUT *source="/IT4I/home/user/Collection_A",*parent_pid="None",*ror="None",*fio="None",*fixed="true"
OUTPUT ruleExecOut
user in ~ λ irule -F create_pid.r
PID: 21.12149/f3b9b1a5-7b4d-4fff-bfb7-826676f6fe14
```
After creation, your PID is searchable worldwide:
![](img/hdl_net.jpg)
![](img/hdl_pid.jpg)
**More info at [www.eudat.eu][n]**
### Metadata
To add metadata to your collection/dataset, you can use `imeta` from iCommands.
The following shows the metadata after PID creation:
```console
user in ~ λ imeta ls -C /IT4I/home/user/Collection_A
AVUs defined for collection /IT4I/home/user/Collection_A:
attribute: EUDAT/FIXED_CONTENT
value: True
units:
----
attribute: PID
value: 21.12149/f3b9b1a5-7b4d-4fff-bfb7-826676f6fe14
units:
```
To add any other metadata, use:
```console
user in ~ λ imeta add -C /IT4I/home/user/Collection_A EUDAT_B2SHARE_TITLE Some_Title
user in ~ λ imeta ls -C /IT4I/home/user/Collection_A
AVUs defined for collection /IT4I/home/user/Collection_A:
attribute: EUDAT/FIXED_CONTENT
value: True
units:
----
attribute: PID
value: 21.12149/f3b9b1a5-7b4d-4fff-bfb7-826676f6fe14
units:
----
attribute: EUDAT_B2SHARE_TITLE
value: Some_Title
units:
```
[1]: irods.cyberduckprofile
[2]: irods_environment.json
[3]: config.yml
[4]: general/access/account-introduction.md
[a]: mailto:support@it4i.cz
[b]: https://www.dice-eosc.eu/
[c]: https://irods.org/
[d]: https://b2access.eudat.eu/
[f]: https://aai.it4i.cz/realms/IT4i_IRODS/account/#/
[h]: https://marketplace.eosc-portal.eu/services/b2safe/offers
[i]: https://cyberduck.io/download/
[l]: https://www.eudat.eu/contact-support-request?Service=B2SAFE
[m]: https://aai.it4i.cz/
[n]: https://www.eudat.eu/catalogue/b2handle
[o]: https://it4i-handle.it4i.cz
[p]: https://hdl.handle.net
[q]: https://eosc-portal.eu/
# Migration to e-INFRA CZ
## Introduction
IT4Innovations is a part of [e-INFRA CZ][1], a strategic research infrastructure of the Czech Republic, which provides capacities and resources for the transmission, storage, and processing of scientific and research data. In January 2022, IT4I began the process of integrating its services.
As a part of the process, a joint e-INFRA CZ user base has been established. This included a migration of eligible IT4I accounts.
## Who Has Been Affected
The migration affects all accounts of users affiliated with an academic organization in the Czech Republic who also have an OPEN-XX-XX project. Affected users have received an email with information about changes in personal data processing.
## Who Has Not Been Affected
Commercial users, training accounts, suppliers, and service accounts were **not** affected by the migration.
## Process
During the process, additional steps may have been required for a successful migration.
These may have included:
1. e-INFRA CZ registration, if one does not already exist.
2. e-INFRA CZ password reset, if one does not already exist.
## Steps After Migration
After the migration, you must use your **e-INFRA CZ credentials** to access all IT4I services as well as [e-INFRA CZ services][5].
Successfully migrated accounts tied to e-INFRA CZ can be self-managed at [e-INFRA CZ User profile][4].
!!! tip "Recommendation"
We recommend [verifying your SSH keys][6] for cluster access.
## Troubleshooting
If you have a problem with your account migrated to e-INFRA CZ user base, contact the [CESNET support][7].
If you have questions or a problem with IT4I account (i.e. account not eligible for migration), contact the [IT4I support][2].
[1]: https://www.e-infra.cz/en
[2]: mailto:support@it4i.cz
[3]: https://www.cesnet.cz/?lang=en
[4]: https://profile.e-infra.cz/
[5]: https://www.e-infra.cz/en/services
[6]: https://profile.e-infra.cz/profile/settings/sshKeys
[7]: mailto:support@cesnet.cz
# Environment and Modules
## Shells on Clusters
The table shows which shells are available on the IT4Innovations clusters.
Note that bash is the only supported shell.
| Cluster Name | bash | tcsh | zsh | ksh | dash |
| --------------- | ---- | ---- | --- | --- | ---- |
| Karolina | yes | yes | yes | yes | yes |
| Barbora | yes | yes | yes | yes | no |
| DGX-2 | yes | no | no | no | no |
!!! info
Bash is the default shell. Should you need a different shell, contact [support\[at\]it4i.cz][3].
## Environment Customization
After logging in, you may want to configure the environment. Write your preferred path definitions, aliases, functions, and module loads in the `.bashrc` file:
```console
# .bashrc
# user's custom module path
export MODULEPATH=${MODULEPATH}:/home/$USER/.local/easybuild/modules/all
# User specific aliases and functions
alias sq='squeue --me'
# load the default Intel compiler (not recommended in .bashrc)
ml intel
# Display information on standard output - only in an interactive SSH session
if [ -n "$SSH_TTY" ]
then
ml # Display loaded modules
fi
```
!!! note
    Do not run commands outputting to standard output (echo, module list, etc.) in .bashrc for non-interactive SSH sessions. Doing so breaks fundamental functionality (e.g. SCP) of your account. Guard such commands with a check for SSH session interactivity, as shown in the example above.
### Application Modules
In order to configure your shell for running a particular application on clusters, we use a module package interface.
Application modules on clusters are built using [EasyBuild][1]. The modules are divided into the following groups:
```
base: Default module class
bio: Bioinformatics, biology and biomedical
cae: Computer Aided Engineering (incl. CFD)
chem: Chemistry, Computational Chemistry and Quantum Chemistry
compiler: Compilers
data: Data management & processing tools
debugger: Debuggers
devel: Development tools
geo: Earth Sciences
ide: Integrated Development Environments (e.g. editors)
lang: Languages and programming aids
lib: General purpose libraries
math: High-level mathematical software
mpi: MPI stacks
numlib: Numerical Libraries
perf: Performance tools
phys: Physics and physical systems simulations
system: System utilities (e.g. highly depending on system OS and hardware)
toolchain: EasyBuild toolchains
tools: General purpose tools
vis: Visualization, plotting, documentation and typesetting
OS: singularity image
python: python packages
```
!!! note
The modules set up the application paths, library paths and environment variables for running a particular application.
The modules may be loaded, unloaded, and switched according to momentary needs. For details, see [lmod][2].
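A few common module operations, as a quick sketch (the module name is only illustrative):

```console
$ ml av GCC    # list available GCC modules
$ ml GCC       # load the default GCC module
$ ml           # show currently loaded modules
$ ml -GCC      # unload the module
$ ml purge     # unload all modules
```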
[1]: software/tools/easybuild.md
[2]: software/modules/lmod.md
[3]: mailto:support@it4i.cz
# Introduction
This section provides basic information on how to gain access to IT4Innovations Information systems and project membership.
## Account Types
There are two types of accounts at IT4Innovations:
* [**e-INFRA CZ Account**][1]
intended for all persons affiliated with an academic institution from the Czech Republic ([eduID.cz][a]).
* [**IT4I Account**][2]
intended for all persons who are not eligible for an e-INFRA CZ account.
Once you create an account, you can use it only for communication with IT4I support and for accessing the SCS information system.
If you want to access IT4I clusters, your account must also be **assigned to a project**.
For more information, see the section:
* [**Get Project Membership**][3]
if you want to become a collaborator on a project, or
* [**Get Project**][4]
if you want to become a project owner.
[1]: ./einfracz-account.md
[2]: ../obtaining-login-credentials/obtaining-login-credentials.md
[3]: ../access/project-access.md
[4]: ../applying-for-resources.md
[a]: https://www.eduid.cz/
# e-INFRA CZ Account
[e-INFRA CZ][1] is a unique research and development e-infrastructure in the Czech Republic,
which provides capacities and resources for the transmission, storage and processing of scientific and research data.
IT4Innovations became a member of e-INFRA CZ in January 2022.
!!! important
Only persons affiliated with an academic institution from the Czech Republic ([eduID.cz][6]) are eligible for an e-INFRA CZ account.
## Request e-INFRA CZ Account
1. Request an account:
1. Go to [https://signup.e-infra.cz/fed/registrar/?vo=IT4Innovations][2]
1. Select a member academic institution you are affiliated with.
1. Fill out the e-INFRA CZ Account information (username, password and ssh key(s)).
Your account should be created in a few minutes after submitting the request.
Once your e-INFRA CZ account is created, it is propagated into IT4I systems
and can be used to access [SCS portal][3] and [Request Tracker][4].
1. Provide additional information via [IT4I support][a] or email [support\[at\]it4i.cz][b] (**required**, note that without this information, you cannot use IT4I resources):
1. **Full name**
1. **Gender**
1. **Citizenship**
1. **Country of residence**
1. **Organization/affiliation**
1. **Organization/affiliation country**
1. **Organization/affiliation type** (university, company, R&D institution, private/public sector (hospital, police), academy of sciences, etc.)
1. **Job title** (student, PhD student, researcher, research assistant, employee, etc.)
Continue to apply for a project or project membership to access clusters through the [SCS portal][3].
## Logging Into IT4I Services
The table below shows how different IT4I services are accessed:
| Services | Access |
| -------- | ------- |
| Clusters | SSH key |
| IS, RT, web, VPN | e-INFRA CZ login |
| Profile<br>Change&nbsp;password<br>Change&nbsp;SSH&nbsp;key | Academic institution's credentials<br>e-INFRA CZ / eduID |
You can change your profile settings at any time.
[1]: https://www.e-infra.cz/en
[2]: https://signup.e-infra.cz/fed/registrar/?vo=IT4Innovations
[3]: https://scs.it4i.cz/
[4]: https://support.it4i.cz/
[5]: ../../management/einfracz-profile.md
[6]: https://www.eduid.cz/
[a]: https://support.it4i.cz/rt/
[b]: mailto:support@it4i.cz
# Get Project Membership
!!! note
You need to be named as a collaborator by a Primary Investigator (PI) in order to access and use the clusters.
## Authorization by Web
This is a preferred method if you have an IT4I or e-INFRA CZ account.
Log in to the [IT4I SCS portal][a] and go to the **Authorization Requests** section. Here you can submit your requests for becoming a project member. You will have to wait until the project PI authorizes your request.
## Authorization by Email
Alternatively, you can become a project member upon a request sent via [email by the project PI][1].
[1]: ../../applying-for-resources/#authorization-by-email-an-alternative-approach
[a]: https://scs.it4i.cz/