# Heterogeneous Memory Management on Intel Platforms
Partition `p10-intel` offers heterogeneous memory directly exposed to the user. This makes it possible to manually pick the appropriate kind of memory at process or even single-allocation granularity. Both kinds of memory are exposed as memory-only NUMA nodes, which allows both coarse-grained (process-level) and fine-grained (allocation-level) control over the memory type used.
## Overview
At the process level, the `numactl` facilities can be utilized, while the Intel-provided `memkind` library allows for finer control. Both the `memkind` library and `numactl` are available after loading the `memkind` module; the `OpenMPI` module provides `numactl` only.
```bash
ml memkind
```
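If you only need `numactl` (and not the `memkind` headers), the `OpenMPI` module mentioned above provides it as well; a quick sanity check might look like:
```bash
ml OpenMPI
numactl --hardware | head   # print the first lines of the NUMA topology
```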
### Process Level (NUMACTL)
The `numactl` tool allows you to either restrict the memory pool of the process to a specific set of memory NUMA nodes
```bash
numactl --membind <node_ids_set>
```
or select a single preferred node
```bash
numactl --preferred <node_id>
```
where `<node_ids_set>` is a comma-separated list (e.g., `0,2,5,...`), optionally combined with ranges (such as `0-5`). The `membind` option kills the process if it requests more memory than can be satisfied from the specified nodes. The `preferred` option merely falls back to other nodes (according to their NUMA distance) in the same situation.
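For illustration, a minimal sketch of both forms (`./app` is a placeholder for your application):
```bash
numactl --membind 0,2,4-7 ./app   # the process is killed if these nodes cannot satisfy an allocation
numactl --preferred 8 ./app       # prefer node 8, fall back to other nodes when it is full
```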
A convenient way to check the `numactl` configuration is
```bash
numactl -s
```
which prints the configuration of its execution environment, e.g.,
```bash
numactl --membind 8-15 numactl -s
policy: bind
preferred node: 0
physcpubind: 0 1 2 ... 189 190 191
cpubind: 0 1 2 3 4 5 6 7
nodebind: 0 1 2 3 4 5 6 7
membind: 8 9 10 11 12 13 14 15
```
The last row shows that memory allocations are restricted to NUMA nodes `8-15`.
### Allocation Level (MEMKIND)
The `memkind` library (in its simplest use case) offers a variant of the `malloc/free` function pair, which allows you to specify the kind of memory to be used for a given allocation. Moving a specific allocation from the default to the HBM memory pool can then be achieved by replacing:
```cpp
void *pData = malloc(<SIZE>);
/* ... */
free(pData);
```
with
```cpp
#include <memkind.h>
void *pData = memkind_malloc(MEMKIND_HBW, <SIZE>);
/* ... */
memkind_free(NULL, pData); // "kind" parameter is deduced from the address
```
Other memory kinds can be selected similarly.
!!! note
    The allocation returns a `NULL` pointer when memory of the specified kind is not available.
## High Bandwidth Memory (HBM)
Intel Sapphire Rapids (partition `p10-intel`) consists of two sockets, each with `128GB` of DDR DRAM and `64GB` of on-package HBM memory. The machine is configured in FLAT mode and therefore exposes the HBM as memory-only NUMA nodes (`16GB` per 12-core tile). The configuration can be verified by running
```bash
numactl -H
```
which should show 16 NUMA nodes (nodes `0-7` should each contain 12 cores and `32GB` of DDR DRAM, while nodes `8-15` should have no cores and `16GB` of HBM each).
![](../../img/cs/guides/p10_numa_sc4_flat.png)
### Process Level
With this, we can easily restrict an application to DDR DRAM or HBM memory:
```bash
# Only DDR DRAM
numactl --membind 0-7 ./stream
# ...
Function    Best Rate MB/s  Avg time   Min time   Max time
Copy:            369745.8   0.043355   0.043273   0.043588
Scale:           366989.8   0.043869   0.043598   0.045355
Add:             378054.0   0.063652   0.063483   0.063899
Triad:           377852.5   0.063621   0.063517   0.063884

# Only HBM
numactl --membind 8-15 ./stream
# ...
Function    Best Rate MB/s  Avg time   Min time   Max time
Copy:           1128430.1   0.015214   0.014179   0.015615
Scale:          1045065.2   0.015814   0.015310   0.016309
Add:            1096992.2   0.022619   0.021878   0.024182
Triad:          1065152.4   0.023449   0.022532   0.024559
```
The DDR DRAM achieves a bandwidth of around 400 GB/s, while the HBM clears the 1 TB/s bar.
Further improvements can be achieved by isolating a process entirely to a single tile. This can be useful for MPI jobs, where `$OMPI_COMM_WORLD_RANK` can be used to bind each process individually. A simple wrapper script (e.g., `membind_wrapper.sh`) to do this may look like
```bash
#!/bin/bash
numactl --membind $((8 + OMPI_COMM_WORLD_RANK)) "$@"
```
and can be used as
```bash
mpirun -np 8 --map-by slot:pe=12 membind_wrapper.sh ./stream_mpi
```
(8 tiles with 12 cores each). However, this approach assumes that the `16GB` of HBM memory local to the tile is sufficient for each process (memory cannot spill between tiles). The approach may be significantly more useful in combination with `--preferred` instead of `--membind`, which forces a preference for local HBM while allowing spill to DDR DRAM. Otherwise,
```bash
mpirun -n 8 --map-by slot:pe=12 numactl --membind 8-15 ./stream_mpi
```
is most likely preferable even for MPI workloads. Applying the above approach to MPI STREAM with 8 ranks and 1-24 threads per rank, we can expect the following results:
![](../../img/cs/guides/p10_stream_dram.png)
![](../../img/cs/guides/p10_stream_hbm.png)
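For completeness, a sketch of the `--preferred` variant of the wrapper script discussed above (same assumptions as for `membind_wrapper.sh`):
```bash
#!/bin/bash
# Prefer the HBM node local to this rank's tile, but allow spilling into DDR DRAM
numactl --preferred $((8 + OMPI_COMM_WORLD_RANK)) "$@"
```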
### Allocation Level
Allocation-level memory kind selection using the `memkind` library can be illustrated with a modified STREAM benchmark. The benchmark uses three working arrays (A, B, and C), whose allocation can be changed to `memkind_malloc` as follows
```cpp
#include <memkind.h>
// ...
STREAM_TYPE *a = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
STREAM_TYPE *b = (STREAM_TYPE *)memkind_malloc(MEMKIND_REGULAR, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
STREAM_TYPE *c = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
// ...
memkind_free(NULL, a);
memkind_free(NULL, b);
memkind_free(NULL, c);
```
Arrays A and C are allocated from HBM (`MEMKIND_HBW_ALL`), while DDR DRAM (`MEMKIND_REGULAR`) is used for B.
The code then has to be linked with the `memkind` library
```bash
gcc -march=native -O3 -fopenmp -lmemkind memkind_stream.c -o memkind_stream
```
and can be run as
```bash
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
# N is the number of 12-core tiles to use (1-8)
OMP_NUM_THREADS=$((N*12)) OMP_PROC_BIND=spread ./memkind_stream
```
While the `memkind` library should be able to detect HBM memory on its own (through `HMAT` and `hwloc`), this is not supported on `p10-intel`. This means that the NUMA nodes representing HBM have to be specified manually using the `MEMKIND_HBW_NODES` environment variable.
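One possible (untested) way to derive the variable automatically, assuming `numactl -H` lists memory-only nodes with an empty `cpus:` field:
```bash
export MEMKIND_HBW_NODES=$(numactl -H | awk '/cpus:/ && NF == 3 { printf "%s%s", s, $2; s = "," }')
echo $MEMKIND_HBW_NODES   # expected: 8,9,10,11,12,13,14,15
```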
![](../../img/cs/guides/p10_stream_memkind.png)
With this setup, we can see that the simple copy operation (`C[i] = A[i]`) achieves bandwidth comparable to the application bound entirely to HBM memory. On the other hand, the scale operation (`B[i] = s*C[i]`) is mostly limited by DDR DRAM bandwidth. It's also worth noting that operations combining all three arrays perform close to the HBM-only configuration.
## Simple Application
One application that can greatly benefit from the availability of a large, slower memory and a smaller, faster memory is computing a histogram with many bins over a large dataset.
```cpp
#include <iostream>
#include <vector>
#include <chrono>
#include <cmath>
#include <cstring>
#include <omp.h>
#include <memkind.h>
const size_t N_DATA_SIZE = 2 * 1024 * 1024 * 1024ull;
const size_t N_BINS_COUNT = 1 * 1024 * 1024ull;
const size_t N_ITERS = 10;
#if defined(HBM)
#define DATA_MEMKIND MEMKIND_REGULAR
#define BINS_MEMKIND MEMKIND_HBW_ALL
#else
#define DATA_MEMKIND MEMKIND_REGULAR
#define BINS_MEMKIND MEMKIND_REGULAR
#endif
int main(int argc, char *argv[])
{
    const double binWidth = 1.0 / double(N_BINS_COUNT + 1);

    double *pData = (double *)memkind_malloc(DATA_MEMKIND, N_DATA_SIZE * sizeof(double));
    size_t *pBins = (size_t *)memkind_malloc(BINS_MEMKIND, N_BINS_COUNT * omp_get_max_threads() * sizeof(size_t));

    #pragma omp parallel
    {
        drand48_data state;
        srand48_r(omp_get_thread_num(), &state);

        #pragma omp for
        for(size_t i = 0; i < N_DATA_SIZE; ++i)
            drand48_r(&state, &pData[i]);
    }

    auto c1 = std::chrono::steady_clock::now();

    for(size_t it = 0; it < N_ITERS; ++it)
    {
        #pragma omp parallel
        {
            // Each thread accumulates into its own private copy of the bins.
            for(size_t i = 0; i < N_BINS_COUNT; ++i)
                pBins[omp_get_thread_num()*N_BINS_COUNT + i] = size_t(0);

            #pragma omp for
            for(size_t i = 0; i < N_DATA_SIZE; ++i)
            {
                const size_t idx = size_t(pData[i] / binWidth) % N_BINS_COUNT;
                pBins[omp_get_thread_num()*N_BINS_COUNT + idx]++;
            }
        }
    }

    auto c2 = std::chrono::steady_clock::now();

    // Reduce the per-thread bins into the first copy.
    #pragma omp parallel for
    for(size_t i = 0; i < N_BINS_COUNT; ++i)
    {
        for(int j = 1; j < omp_get_max_threads(); ++j)
            pBins[i] += pBins[j*N_BINS_COUNT + i];
    }

    std::cout << "Elapsed Time [s]: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;

    size_t total = 0;
    #pragma omp parallel for reduction(+:total)
    for(size_t i = 0; i < N_BINS_COUNT; ++i)
        total += pBins[i];

    std::cout << "Total Items: " << total << std::endl;

    memkind_free(NULL, pData);
    memkind_free(NULL, pBins);

    return 0;
}
```
### Using HBM Memory (P10-Intel)
The following commands can be used to compile and run the example application above:
```bash
ml GCC memkind
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
g++ -O3 -fopenmp -lmemkind histogram.cpp -o histogram_dram
g++ -O3 -fopenmp -lmemkind -DHBM histogram.cpp -o histogram_hbm
OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_dram
OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_hbm
```
Moving the histogram bins into HBM memory should speed up the algorithm by more than a factor of two. Note that moving the `pData` array into HBM memory as well worsens this result (presumably because the algorithm can otherwise saturate both memory interfaces).
## Additional Resources
- [https://linux.die.net/man/8/numactl][1]
- [http://memkind.github.io/memkind/man_pages/memkind.html][2]
- [https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory][3]
[1]: https://linux.die.net/man/8/numactl
[2]: http://memkind.github.io/memkind/man_pages/memkind.html
[3]: https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory
# Using VMware Horizon
VMware Horizon is a virtual desktop infrastructure (VDI) solution
that enables users to access virtual desktops and applications from any device and any location.
It provides a comprehensive end-to-end solution for managing and delivering virtual desktops and applications,
including features such as session management, user authentication, and virtual desktop provisioning.
![](../../img/horizon.png)
## How to Access VMware Horizon
!!! important
    Access to VMware Horizon requires IT4I VPN.
1. Contact [IT4I support][a] with a request for access and VM allocation.
1. [Download][1] and install the VMware Horizon Client for Windows.
1. Add a new server `https://vdi-cs01.msad.it4i.cz/` in the Horizon client.
1. Connect to the server using your IT4I username and password.
Username is in the `domain\username` format and the domain is `msad.it4i.cz`.
For example: `msad.it4i.cz\user123`
## Example
Below is an example of how to mount a remote folder and check the connection on Windows OS:
### Prerequisites
3D applications
* [Blender][3]
SSHFS for remote access
* [sshfs-win][4]
* [winfsp][5]
* [sshfs-win-manager][6]
* ssh keys for access to clusters
### Steps
1. Start the VPN and connect to the server via VMware Horizon Client.
![](../../img/vmware.png)
1. Mount a remote folder.
* Run sshfs-win-manager.
![](../../img/sshfs.png)
* Add a new connection.
![](../../img/sshfs1.png)
* Click on **Connect**.
![](../../img/sshfs2.png)
1. Check that the folder is mounted.
![](../../img/mount.png)
1. Check the GPU resources.
![](../../img/gpu.png)
### Blender
Now if you run, for example, Blender, you can check the available GPU resources in Blender Preferences.
![](../../img/blender.png)
[a]: mailto:support@it4i.cz
[1]: https://vdi-cs01.msad.it4i.cz/
[2]: https://www.paraview.org/download/
[3]: https://www.blender.org/download/
[4]: https://github.com/winfsp/sshfs-win/releases
[5]: https://github.com/winfsp/winfsp/releases/
[6]: https://github.com/evsar3/sshfs-win-manager/releases
# Using IBM Power Partition
For testing your application on the IBM Power partition,
you need to prepare a job script for that partition or use the interactive job:
```console
salloc -N 1 -c 192 -A PROJECT-ID -p p07-power --time=08:00:00
```
where:
- `-N 1` allocates a single node,
- `-c 192` allocates 192 cores (threads),
- `-p p07-power` selects the IBM Power partition,
- `--time=08:00:00` requests the allocation for 8 hours.
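Alternatively, a minimal batch job script for this partition might look like the following sketch (`your_application` is a placeholder):
```
#!/bin/bash
#SBATCH -A PROJECT-ID
#SBATCH -p p07-power
#SBATCH -N 1
#SBATCH -c 192
#SBATCH --time=08:00:00

ml architecture/ppc64le
./your_application
```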
On the partition, you should reload the list of modules:
```
ml architecture/ppc64le
```
The platform offers both GNU-based and proprietary IBM toolchains for building applications. IBM also provides an optimized BLAS routines library ([ESSL](https://www.ibm.com/docs/en/essl/6.1)), which can be used with both toolchains.
## Building Applications
Our sample application depends on `BLAS`, therefore we start by loading the following modules (regardless of which toolchain we want to use):
```
ml GCC OpenBLAS
```
### GCC Toolchain
In the case of the GCC toolchain, we can compile the application using either `g++`
```
g++ -lopenblas hello.cpp -o hello
```
or `gfortran`
```
gfortran -lopenblas hello.f90 -o hello
```
as usual.
### IBM Toolchain
The IBM toolchain requires additional environment setup, as it is installed in `/opt/ibm` and is not exposed as a module:
```
IBM_ROOT=/opt/ibm
OPENXLC_ROOT=$IBM_ROOT/openxlC/17.1.1
OPENXLF_ROOT=$IBM_ROOT/openxlf/17.1.1
export PATH=$OPENXLC_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$OPENXLC_ROOT/lib:$LD_LIBRARY_PATH
export PATH=$OPENXLF_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$OPENXLF_ROOT/lib:$LD_LIBRARY_PATH
```
From there, we can use either `ibm-clang++`
```
ibm-clang++ -lopenblas hello.cpp -o hello
```
or `xlf`
```
xlf -lopenblas hello.f90 -o hello
```
to build the application as usual.
!!! note
    The combination of `xlf` and `openblas` seems to cause severe performance degradation. Therefore, the `ESSL` library should be preferred (see below).
### Using ESSL Library
The [ESSL](https://www.ibm.com/docs/en/essl/6.1) library is installed in `/opt/ibm/math/essl/7.1`, so we define additional environment variables:
```
IBM_ROOT=/opt/ibm
ESSL_ROOT=${IBM_ROOT}/math/essl/7.1
export LD_LIBRARY_PATH=$ESSL_ROOT/lib64:$LD_LIBRARY_PATH
```
The simplest way to utilize `ESSL` in an application that already uses `BLAS` or `CBLAS` routines is to link with the provided `libessl.so`. This can be done by replacing `-lopenblas` with `-lessl`, or with `-lessl -lopenblas` in case `ESSL` does not provide all required `BLAS` routines.
In practice this can look like
```
g++ -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.cpp -o hello
```
or
```
gfortran -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.f90 -o hello
```
and similarly for the IBM compilers (`ibm-clang++` and `xlf`); a sketch is shown below.
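For example, hedged equivalents for the IBM compilers might look like:
```
ibm-clang++ -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.cpp -o hello
xlf -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.f90 -o hello
```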
## Hello World Applications
The `hello world` example application (written in `C++` and `Fortran`) uses a simple stationary probability vector estimation to illustrate the use of GEMM (a BLAS level 3 routine).
Stationary probability vector estimation in `C++`:
```c++
#include <algorithm>
#include <iostream>
#include <vector>
#include <chrono>
#include "cblas.h"

const size_t ITERATIONS  = 32;
const size_t MATRIX_SIZE = 1024;

int main(int argc, char *argv[])
{
    const size_t matrixElements = MATRIX_SIZE*MATRIX_SIZE;

    std::vector<float> a(matrixElements, 1.0f / float(MATRIX_SIZE));

    for(size_t i = 0; i < MATRIX_SIZE; ++i)
        a[i] = 0.5f / (float(MATRIX_SIZE) - 1.0f);
    a[0] = 0.5f;

    std::vector<float> w1(matrixElements, 0.0f);
    std::vector<float> w2(matrixElements, 0.0f);

    std::copy(a.begin(), a.end(), w1.begin());

    std::vector<float> *t1, *t2;
    t1 = &w1;
    t2 = &w2;

    auto c1 = std::chrono::steady_clock::now();

    for(size_t i = 0; i < ITERATIONS; ++i)
    {
        std::fill(t2->begin(), t2->end(), 0.0f);

        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, MATRIX_SIZE, MATRIX_SIZE, MATRIX_SIZE,
                    1.0f, t1->data(), MATRIX_SIZE,
                    a.data(), MATRIX_SIZE,
                    1.0f, t2->data(), MATRIX_SIZE);

        std::swap(t1, t2);
    }

    auto c2 = std::chrono::steady_clock::now();

    for(size_t i = 0; i < MATRIX_SIZE; ++i)
    {
        std::cout << (*t1)[i*MATRIX_SIZE + i] << " ";
    }
    std::cout << std::endl;

    std::cout << "Elapsed Time: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;

    return 0;
}
```
Stationary probability vector estimation in `Fortran`:
```fortran
program main
    implicit none

    integer :: matrix_size, iterations
    integer :: i
    real, allocatable, target :: a(:,:), w1(:,:), w2(:,:)
    real, dimension(:,:), contiguous, pointer :: t1, t2, tmp
    real, pointer :: out_data(:), out_diag(:)
    integer :: cr, cm, c1, c2

    iterations  = 32
    matrix_size = 1024

    call system_clock(count_rate=cr)
    call system_clock(count_max=cm)

    allocate(a(matrix_size, matrix_size))
    allocate(w1(matrix_size, matrix_size))
    allocate(w2(matrix_size, matrix_size))

    a(:,:) = 1.0 / real(matrix_size)
    a(:,1) = 0.5 / real(matrix_size - 1)
    a(1,1) = 0.5

    w1 = a
    w2(:,:) = 0.0

    t1 => w1
    t2 => w2

    call system_clock(c1)

    do i = 0, iterations
        t2(:,:) = 0.0
        call sgemm('N', 'N', matrix_size, matrix_size, matrix_size, 1.0, t1, matrix_size, a, matrix_size, 1.0, t2, matrix_size)
        tmp => t1
        t1  => t2
        t2  => tmp
    end do

    call system_clock(c2)

    out_data(1:size(t1)) => t1
    out_diag => out_data(1::matrix_size+1)

    print *, out_diag
    print *, "Elapsed Time: ", (c2 - c1) / real(cr)

    deallocate(a)
    deallocate(w1)
    deallocate(w2)
end program main
```
# Using Xilinx Accelerator Platform
The first step to use Xilinx accelerators is to initialize Vitis (compiler) and XRT (runtime) environments.
```console
$ . /tools/Xilinx/Vitis/2023.1/settings64.sh
$ . /opt/xilinx/xrt/setup.sh
```
## Platform Level Accelerator Management
This should allow you to examine the current platform using `xbutil examine`,
which outputs user-level information about the XRT platform and lists the available devices:
```
$ xbutil examine
System Configuration
OS Name : Linux
Release : 4.18.0-477.27.1.el8_8.x86_64
Version : #1 SMP Thu Aug 31 10:29:22 EDT 2023
Machine : x86_64
CPU Cores : 64
Memory : 257145 MB
Distribution : Red Hat Enterprise Linux 8.8 (Ootpa)
GLIBC : 2.28
Model : ProLiant XL675d Gen10 Plus
XRT
Version : 2.16.0
Branch : master
Hash : f2524a2fcbbabd969db19abf4d835c24379e390d
Hash Date : 2023-10-11 14:01:19
XOCL : 2.16.0, f2524a2fcbbabd969db19abf4d835c24379e390d
XCLMGMT : 2.16.0, f2524a2fcbbabd969db19abf4d835c24379e390d
Devices present
BDF : Shell Logic UUID Device ID Device Ready*
-------------------------------------------------------------------------------------------------------------------------
[0000:88:00.1] : xilinx_u280_gen3x16_xdma_base_1 283BAB8F-654D-8674-968F-4DA57F7FA5D7 user(inst=132) Yes
[0000:8c:00.1] : xilinx_u280_gen3x16_xdma_base_1 283BAB8F-654D-8674-968F-4DA57F7FA5D7 user(inst=133) Yes
* Devices that are not ready will have reduced functionality when using XRT tools
```
Here, two Xilinx Alveo u280 accelerators (`0000:88:00.1` and `0000:8c:00.1`) are available.
The `xbutil` tool can also be used to query additional information about a specific device using its BDF address:
```console
$ xbutil examine -d "0000:88:00.1"
-------------------------------------------------
[0000:88:00.1] : xilinx_u280_gen3x16_xdma_base_1
-------------------------------------------------
Platform
XSA Name : xilinx_u280_gen3x16_xdma_base_1
Logic UUID : 283BAB8F-654D-8674-968F-4DA57F7FA5D7
FPGA Name :
JTAG ID Code : 0x14b7d093
DDR Size : 0 Bytes
DDR Count : 0
Mig Calibrated : true
P2P Status : disabled
Performance Mode : not supported
P2P IO space required : 64 GB
Clocks
DATA_CLK (Data) : 300 MHz
KERNEL_CLK (Kernel) : 500 MHz
hbm_aclk (System) : 450 MHz
Mac Addresses : 00:0A:35:0E:20:B0
: 00:0A:35:0E:20:B1
Device Status: HEALTHY
Hardware Context ID: 0
Xclbin UUID: 6306D6AE-1D66-AEA7-B15D-446D4ECC53BD
PL Compute Units
Index Name Base Address Usage Status
-------------------------------------------------
0 vadd:vadd_1 0x800000 1 (IDLE)
```
Basic functionality of the device can be checked using `xbutil validate -d <BDF>` as
```console
$ xbutil validate -d "0000:88:00.1"
Validate Device : [0000:88:00.1]
Platform : xilinx_u280_gen3x16_xdma_base_1
SC Version : 4.3.27
Platform ID : 283BAB8F-654D-8674-968F-4DA57F7FA5D7
-------------------------------------------------------------------------------
Test 1 [0000:88:00.1] : aux-connection
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 2 [0000:88:00.1] : pcie-link
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 3 [0000:88:00.1] : sc-version
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 4 [0000:88:00.1] : verify
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 5 [0000:88:00.1] : dma
Details : Buffer size - '16 MB' Memory Tag - 'HBM[0]'
Host -> PCIe -> FPGA write bandwidth = 11988.9 MB/s
Host <- PCIe <- FPGA read bandwidth = 12571.2 MB/s
...
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 6 [0000:88:00.1] : iops
Details : IOPS: 387240(verify)
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 7 [0000:88:00.1] : mem-bw
Details : Throughput (Type: DDR) (Bank count: 2) : 33932.9MB/s
Throughput of Memory Tag: DDR[0] is 16974.1MB/s
Throughput of Memory Tag: DDR[1] is 16974.2MB/s
Throughput (Type: HBM) (Bank count: 1) : 12383.7MB/s
Test Status : [PASSED]
-------------------------------------------------------------------------------
Test 8 [0000:88:00.1] : p2p
Test 9 [0000:88:00.1] : vcu
Test 10 [0000:88:00.1] : aie
Test 11 [0000:88:00.1] : ps-aie
Test 12 [0000:88:00.1] : ps-pl-verify
Test 13 [0000:88:00.1] : ps-verify
Test 14 [0000:88:00.1] : ps-iops
```
Finally, the device can be reinitialized using `xbutil reset -d <BDF>` as
```console
$ xbutil reset -d "0000:88:00.1"
Performing 'HOT Reset' on '0000:88:00.1'
Are you sure you wish to proceed? [Y/n]: Y
Successfully reset Device[0000:88:00.1]
```
This can be useful to recover the device from states such as `HANGING`, reported by `xbutil examine -d <BDF>`.
## OpenCL Platform Level
The `clinfo` utility can be used to verify that the accelerator is visible to OpenCL
```console
$ clinfo
Number of platforms: 2
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (3590.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback
Platform Profile: EMBEDDED_PROFILE
Platform Version: OpenCL 1.0
Platform Name: Xilinx
Platform Vendor: Xilinx
Platform Extensions: cl_khr_icd
<...>
Platform Name: Xilinx
Number of devices: 2
Device Type: CL_DEVICE_TYPE_ACCELERATOR
Vendor ID: 0h
Max compute units: 0
Max work items dimensions: 3
Max work items[0]: 4294967295
Max work items[1]: 4294967295
Max work items[2]: 4294967295
Max work group size: 4294967295
Preferred vector width char: 1
Preferred vector width short: 1
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 0
Max clock frequency: 0Mhz
Address bits: 64
Max memory allocation: 4294967296
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 8192
Max image 2D height: 8192
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 0
Max size of kernel argument: 2048
Alignment (bits) of base address: 32768
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: No
Round to +ve and infinity: No
IEEE754-2008 fused multiply-add: No
Cache type: None
Cache line size: 64
Cache size: 0
Global memory size: 0
Constant buffer size: 4194304
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 16384
Error correction support: 1
Profiling timer resolution: 1
Device endianess: Little
Available: No
Compiler available: No
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: Yes
Profiling: Yes
Platform ID: 0x16fbae8
Name: xilinx_u280_gen3x16_xdma_base_1
Vendor: Xilinx
Driver version: 1.0
Profile: EMBEDDED_PROFILE
Version: OpenCL 1.0
<...>
```
which shows that both the `Xilinx` platform and the accelerator devices are present.
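A shorter check, which only lists the detected platforms and device names, is:
```console
$ clinfo -l
```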
## Building Applications
To simplify the build process, we define two environment variables, `IT4I_PLATFORM` and `IT4I_BUILD_MODE`.
The first, `IT4I_PLATFORM`, denotes the specific accelerator hardware, such as `Alveo u250` or `Alveo u280`,
and its configuration stored in `*.xpfm` files.
The list of available platforms can be obtained using the `platforminfo` utility:
```console
$ platforminfo -l
{
"platforms": [
{
"baseName": "xilinx_u280_gen3x16_xdma_1_202211_1",
"version": "202211.1",
"type": "sdaccel",
"dataCenter": "true",
"embedded": "false",
"externalHost": "true",
"serverManaged": "true",
"platformState": "impl",
"usesPR": "true",
"platformFile": "\/opt\/xilinx\/platforms\/xilinx_u280_gen3x16_xdma_1_202211_1\/xilinx_u280_gen3x16_xdma_1_202211_1.xpfm"
},
{
"baseName": "xilinx_u250_gen3x16_xdma_4_1_202210_1",
"version": "202210.1",
"type": "sdaccel",
"dataCenter": "true",
"embedded": "false",
"externalHost": "true",
"serverManaged": "true",
"platformState": "impl",
"usesPR": "true",
"platformFile": "\/opt\/xilinx\/platforms\/xilinx_u250_gen3x16_xdma_4_1_202210_1\/xilinx_u250_gen3x16_xdma_4_1_202210_1.xpfm"
}
]
}
```
Here, `baseName` and potentially `platformFile` are of interest and either can be specified as the value of `IT4I_PLATFORM`.
In this case, we have the platforms `xilinx_u280_gen3x16_xdma_1_202211_1` (Alveo u280) and `xilinx_u250_gen3x16_xdma_4_1_202210_1` (Alveo u250).
The `IT4I_BUILD_MODE` is used to specify build type (`hw`, `hw_emu` and `sw_emu`):
- `hw` performs full synthesis for the accelerator
- `hw_emu` allows running both synthesis and emulation for debugging
- `sw_emu` compiles kernels only for emulation (does not require the accelerator and allows a much faster build)
For example, to configure a build for the `Alveo u280`, we set:
```console
$ export IT4I_PLATFORM=xilinx_u280_gen3x16_xdma_1_202211_1
```
### Software Emulation Mode
The software emulation mode is preferable for development, as HLS synthesis is very time-consuming. To build the following applications in this mode, we set:
```console
$ export IT4I_BUILD_MODE=sw_emu
```
and run each application with `XCL_EMULATION_MODE` set to `sw_emu`:
```
$ XCL_EMULATION_MODE=sw_emu <application>
```
### Hardware Synthesis Mode
!!! note
    The HLS of these simple applications **can take up to 2 hours** to finish.
To allow the application to utilize real hardware, we have to synthesize the FPGA design for the accelerator. This can be done by repeating the same steps used to build the kernels in emulation mode, but with `IT4I_BUILD_MODE` set to `hw`:
```console
$ export IT4I_BUILD_MODE=hw
```
The host application binary can be reused, but it has to be run without `XCL_EMULATION_MODE`:
```console
$ <application>
```
## Sample Applications
The first two samples illustrate the two main approaches to building FPGA-accelerated applications on the Xilinx platform - **XRT** and **OpenCL**.
The final example combines **HIP** with **XRT** to show the basics necessary to build an application that utilizes both GPU and FPGA accelerators.
### Using HLS and XRT
The applications are typically separated into host and accelerator/kernel side.
The following host-side code should be saved as `host.cpp`
```c++
/*
# Copyright (C) 2023, Advanced Micro Devices, Inc. All rights reserved.
# SPDX-License-Identifier: X11
*/
#include <algorithm>
#include <iostream>
#include <cstring>
// XRT includes
#include "xrt/xrt_bo.h"
#include <experimental/xrt_xclbin.h>
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"
#define DATA_SIZE 4096
int main(int argc, char** argv)
{
    if(argc != 2)
    {
        std::cout << "Usage: " << argv[0] << " <XCLBIN File>" << std::endl;
        return EXIT_FAILURE;
    }

    // Read settings
    std::string binaryFile = argv[1];
    int device_index = 0;

    std::cout << "Open the device" << device_index << std::endl;
    auto device = xrt::device(device_index);
    std::cout << "Load the xclbin " << binaryFile << std::endl;
    auto uuid = device.load_xclbin(binaryFile);

    size_t vector_size_bytes = sizeof(int) * DATA_SIZE;

    //auto krnl = xrt::kernel(device, uuid, "vadd");
    auto krnl = xrt::kernel(device, uuid, "vadd", xrt::kernel::cu_access_mode::exclusive);

    std::cout << "Allocate Buffer in Global Memory\n";
    auto boIn1 = xrt::bo(device, vector_size_bytes, krnl.group_id(0)); // Match kernel arguments to RTL kernel
    auto boIn2 = xrt::bo(device, vector_size_bytes, krnl.group_id(1));
    auto boOut = xrt::bo(device, vector_size_bytes, krnl.group_id(2));

    // Map the contents of the buffer object into host memory
    auto bo0_map = boIn1.map<int*>();
    auto bo1_map = boIn2.map<int*>();
    auto bo2_map = boOut.map<int*>();
    std::fill(bo0_map, bo0_map + DATA_SIZE, 0);
    std::fill(bo1_map, bo1_map + DATA_SIZE, 0);
    std::fill(bo2_map, bo2_map + DATA_SIZE, 0);

    // Create the test data
    int bufReference[DATA_SIZE];
    for (int i = 0; i < DATA_SIZE; ++i)
    {
        bo0_map[i] = i;
        bo1_map[i] = i;
        bufReference[i] = bo0_map[i] + bo1_map[i]; // Generate check data for validation
    }

    // Synchronize buffer content with device side
    std::cout << "synchronize input buffer data to device global memory\n";
    boIn1.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    boIn2.sync(XCL_BO_SYNC_BO_TO_DEVICE);

    std::cout << "Execution of the kernel\n";
    auto run = krnl(boIn1, boIn2, boOut, DATA_SIZE); // DATA_SIZE=size
    run.wait();

    // Get the output
    std::cout << "Get the output data from the device" << std::endl;
    boOut.sync(XCL_BO_SYNC_BO_FROM_DEVICE);

    // Validate results
    if (std::memcmp(bo2_map, bufReference, vector_size_bytes))
        throw std::runtime_error("Value read back does not match reference");

    std::cout << "TEST PASSED\n";
    return 0;
}
```
The host-side code can now be compiled using GCC toolchain as:
```console
$ g++ host.cpp -I$XILINX_XRT/include -I$XILINX_VIVADO/include -L$XILINX_XRT/lib -lxrt_coreutil -o host
```
The accelerator side (simple vector-add kernel) should be saved as `vadd.cpp`.
```c++
/*
# Copyright (C) 2023, Advanced Micro Devices, Inc. All rights reserved.
# SPDX-License-Identifier: X11
*/
extern "C" {
void vadd(
const unsigned int *in1, // Read-Only Vector 1
const unsigned int *in2, // Read-Only Vector 2
unsigned int *out, // Output Result
int size // Size in integer
)
{
#pragma HLS INTERFACE m_axi port=in1 bundle=aximm1
#pragma HLS INTERFACE m_axi port=in2 bundle=aximm2
#pragma HLS INTERFACE m_axi port=out bundle=aximm1
for(int i = 0; i < size; ++i)
{
out[i] = in1[i] + in2[i];
}
}
}
```
The accelerator-side code is built using Vitis `v++`.
This is a two-step process, which either builds an emulation binary or performs full HLS (depending on the value of the `-t` argument).
The platform (specific accelerator) also has to be specified at this step (both for emulation and full HLS).
```console
$ v++ -c -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM -k vadd vadd.cpp -o vadd.xo
$ v++ -l -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM vadd.xo -o vadd.xclbin
```
This process should result in `vadd.xclbin`, which can be loaded by the host-side application.
### Running the Application
With both the host application and the kernel binary at hand, the application can be launched in emulation mode as
```console
$ XCL_EMULATION_MODE=sw_emu ./host vadd.xclbin
```
or with real hardware (having compiled kernels with `IT4I_BUILD_MODE=hw`)
```console
./host vadd.xclbin
```
### Using HLS and OpenCL
The host-side application code should be saved as `host.cpp`.
This application attempts to find the `Xilinx` OpenCL platform in the system and selects the first device in that platform.
The device is then configured with the provided kernel binary.
Other than that, the only difference from a typical OpenCL vector-add is the use of `enqueueTask(...)` to launch the kernel
(instead of the typical `enqueueNDRangeKernel`).
```c++
#include <iostream>
#include <fstream>
#include <iterator>
#include <vector>
#define CL_HPP_TARGET_OPENCL_VERSION 120
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_ENABLE_PROGRAM_CONSTRUCTION_FROM_ARRAY_COMPATIBILITY 1
#define CL_USE_DEPRECATED_OPENCL_1_2_APIS
#include <CL/cl2.hpp>
#include <CL/cl_ext_xilinx.h>
std::vector<unsigned char> read_binary_file(const std::string &filename)
{
    std::cout << "INFO: Reading " << filename << std::endl;

    std::ifstream file(filename, std::ios::binary);
    file.unsetf(std::ios::skipws);

    std::streampos file_size;
    file.seekg(0, std::ios::end);
    file_size = file.tellg();
    file.seekg(0, std::ios::beg);

    std::vector<unsigned char> data;
    data.reserve(file_size);
    data.insert(data.begin(),
                std::istream_iterator<unsigned char>(file),
                std::istream_iterator<unsigned char>());

    return data;
}

cl::Device select_device()
{
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);

    cl::Platform platform;
    for(cl::Platform &p: platforms)
    {
        const std::string name = p.getInfo<CL_PLATFORM_NAME>();
        std::cout << "PLATFORM: " << name << std::endl;

        if(name == "Xilinx")
        {
            platform = p;
            break;
        }
    }

    if(platform == cl::Platform())
    {
        std::cout << "Xilinx platform not found!" << std::endl;
        exit(EXIT_FAILURE);
    }

    std::vector<cl::Device> devices;
    platform.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);

    return devices[0];
}

static const int DATA_SIZE = 1024;

int main(int argc, char *argv[])
{
    if(argc != 2)
    {
        std::cout << "Usage: " << argv[0] << " <XCLBIN File>" << std::endl;
        return EXIT_FAILURE;
    }

    std::string binary_file = argv[1];

    std::vector<int> source_a(DATA_SIZE, 10);
    std::vector<int> source_b(DATA_SIZE, 32);

    auto program_binary = read_binary_file(binary_file);
    cl::Program::Binaries bins{{program_binary.data(), program_binary.size()}};

    cl::Device device = select_device();
    cl::Context context(device, nullptr, nullptr, nullptr);
    cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);
    cl::Program program(context, {device}, bins, nullptr);
    cl::Kernel vadd_kernel = cl::Kernel(program, "vector_add");

    cl::Buffer buffer_a(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, source_a.size() * sizeof(int), source_a.data());
    cl::Buffer buffer_b(context, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR, source_b.size() * sizeof(int), source_b.data());
    cl::Buffer buffer_res(context, CL_MEM_READ_WRITE, source_a.size() * sizeof(int));

    int narg = 0;
    vadd_kernel.setArg(narg++, buffer_res);
    vadd_kernel.setArg(narg++, buffer_a);
    vadd_kernel.setArg(narg++, buffer_b);
    vadd_kernel.setArg(narg++, DATA_SIZE);

    q.enqueueTask(vadd_kernel);

    std::vector<int> result(DATA_SIZE, 0);
    q.enqueueReadBuffer(buffer_res, CL_TRUE, 0, result.size() * sizeof(int), result.data());

    int mismatch_count = 0;
    for(size_t i = 0; i < DATA_SIZE; ++i)
    {
        int host_result = source_a[i] + source_b[i];

        if(result[i] != host_result)
        {
            mismatch_count++;
            std::cout << "ERROR: " << result[i] << " != " << host_result << std::endl;
            break;
        }
    }

    std::cout << "RESULT: " << (mismatch_count == 0 ? "PASSED" : "FAILED") << std::endl;
    return 0;
}
```
The host-side code can now be compiled using GCC toolchain as:
```console
$ g++ host.cpp -I$XILINX_XRT/include -I$XILINX_VIVADO/include -lOpenCL -o host
```
The accelerator side (simple vector-add kernel) should be saved as `vadd.cl`.
```c++
#define BUFFER_SIZE 256
#define DATA_SIZE 1024
// TRIPCOUNT identifier
__constant uint c_len = DATA_SIZE / BUFFER_SIZE;
__constant uint c_size = BUFFER_SIZE;

__attribute__((reqd_work_group_size(1, 1, 1)))
__kernel void vector_add(__global int* c,
                         __global const int* a,
                         __global const int* b,
                         const int n_elements)
{
    int arrayA[BUFFER_SIZE];
    int arrayB[BUFFER_SIZE];

    __attribute__((xcl_loop_tripcount(c_len, c_len)))
    for (int i = 0; i < n_elements; i += BUFFER_SIZE)
    {
        int size = BUFFER_SIZE;

        if(i + size > n_elements)
            size = n_elements - i;

        __attribute__((xcl_loop_tripcount(c_size, c_size)))
        __attribute__((xcl_pipeline_loop(1))) readA:
        for(int j = 0; j < size; j++)
            arrayA[j] = a[i + j];

        __attribute__((xcl_loop_tripcount(c_size, c_size)))
        __attribute__((xcl_pipeline_loop(1))) readB:
        for(int j = 0; j < size; j++)
            arrayB[j] = b[i + j];

        __attribute__((xcl_loop_tripcount(c_size, c_size)))
        __attribute__((xcl_pipeline_loop(1))) vadd_writeC:
        for(int j = 0; j < size; j++)
            c[i + j] = arrayA[j] + arrayB[j];
    }
}
```
The accelerator-side code is built using Vitis `v++`.
This is a three-step process, which either builds an emulation binary or performs full HLS (depending on the value of the `-t` argument).
The platform (specific accelerator) also has to be specified at this step (both for emulation and full HLS).
```console
$ v++ -c -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM -k vector_add -o vadd.xo vadd.cl
$ v++ -l -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM -o vadd.link.xclbin vadd.xo
$ v++ -p vadd.link.xclbin -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM -o vadd.xclbin
```
This process should result in `vadd.xclbin`, which can be loaded by the host-side application.
### Running the Application
With both the host application and the kernel binary at hand, the application can be launched in emulation mode as
```console
$ XCL_EMULATION_MODE=sw_emu ./host vadd.xclbin
```
or with real hardware (having compiled kernels with `IT4I_BUILD_MODE=hw`)
```console
./host vadd.xclbin
```
### Hybrid GPU and FPGA Application (HIP+XRT)
This simple 8-bit quantized dot product (`R = sum(X[i]*Y[i])`) example illustrates a basic approach to utilizing both GPU and FPGA accelerators in a single application.
The application takes the simplest approach, where both synchronization and data transfers are handled explicitly by the host.
The HIP toolchain is used to compile the single-source host/GPU code as usual, but it is also linked with the XRT runtime, which allows the host to control the FPGA accelerator.
The FPGA kernels are built separately, as in the previous examples.
The host/GPU HIP code should be saved as `main.hip`
```c++
#include <iostream>
#include <vector>
#include "xrt/xrt_bo.h"
#include "experimental/xrt_xclbin.h"
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"
#include "hip/hip_runtime.h"
const size_t DATA_SIZE = 1024;
float compute_reference(const float *srcX, const float *srcY, size_t count);
__global__ void quantize(int8_t *out, const float *in, size_t count)
{
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;

    for(size_t i = idx; i < count; i += blockDim.x * gridDim.x)
        out[i] = int8_t(in[i] * 127);
}

__global__ void dequantize(float *out, const int16_t *in, size_t count)
{
    size_t idx = blockIdx.x * blockDim.x + threadIdx.x;

    for(size_t i = idx; i < count; i += blockDim.x * gridDim.x)
        out[i] = float(in[i] / float(127*127));
}

int main(int argc, char *argv[])
{
    if(argc != 2)
    {
        std::cout << "Usage: " << argv[0] << " <XCLBIN File>" << std::endl;
        return EXIT_FAILURE;
    }

    // Prepare experiment data
    std::vector<float> srcX(DATA_SIZE);
    std::vector<float> srcY(DATA_SIZE);
    float outR = 0.0f;

    for(size_t i = 0; i < DATA_SIZE; ++i)
    {
        srcX[i] = float(rand()) / float(RAND_MAX);
        srcY[i] = float(rand()) / float(RAND_MAX);
        outR += srcX[i] * srcY[i];
    }

    float outR_quant = compute_reference(srcX.data(), srcY.data(), DATA_SIZE);
    std::cout << "REFERENCE: " << outR_quant << " (" << outR << ")" << std::endl;

    // Initialize XRT (FPGA device), load kernels binary and create kernel object
    xrt::device device(0);
    std::cout << "Loading xclbin file " << argv[1] << std::endl;
    xrt::uuid xclbinId = device.load_xclbin(argv[1]);
    xrt::kernel mulKernel(device, xclbinId, "multiply", xrt::kernel::cu_access_mode::exclusive);

    // Allocate GPU buffers
    float *srcX_gpu, *srcY_gpu, *res_gpu;
    int8_t *srcX_gpu_quant, *srcY_gpu_quant;
    int16_t *res_gpu_quant;
    hipMalloc(&srcX_gpu, DATA_SIZE * sizeof(float));
    hipMalloc(&srcY_gpu, DATA_SIZE * sizeof(float));
    hipMalloc(&res_gpu, DATA_SIZE * sizeof(float));
    hipMalloc(&srcX_gpu_quant, DATA_SIZE * sizeof(int8_t));
    hipMalloc(&srcY_gpu_quant, DATA_SIZE * sizeof(int8_t));
    hipMalloc(&res_gpu_quant, DATA_SIZE * sizeof(int16_t));

    // Allocate FPGA buffers
    xrt::bo srcX_fpga_quant(device, DATA_SIZE * sizeof(int8_t), mulKernel.group_id(0));
    xrt::bo srcY_fpga_quant(device, DATA_SIZE * sizeof(int8_t), mulKernel.group_id(1));
    xrt::bo res_fpga_quant(device, DATA_SIZE * sizeof(int16_t), mulKernel.group_id(2));

    // Copy experiment data from HOST to GPU
    hipMemcpy(srcX_gpu, srcX.data(), DATA_SIZE * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(srcY_gpu, srcY.data(), DATA_SIZE * sizeof(float), hipMemcpyHostToDevice);

    // Execute quantization kernels on both input vectors
    quantize<<<16, 256>>>(srcX_gpu_quant, srcX_gpu, DATA_SIZE);
    quantize<<<16, 256>>>(srcY_gpu_quant, srcY_gpu, DATA_SIZE);

    // Map FPGA buffers into HOST memory, copy data from GPU to these mapped buffers and synchronize them into FPGA memory
    hipMemcpy(srcX_fpga_quant.map<int8_t *>(), srcX_gpu_quant, DATA_SIZE * sizeof(int8_t), hipMemcpyDeviceToHost);
    srcX_fpga_quant.sync(XCL_BO_SYNC_BO_TO_DEVICE);
    hipMemcpy(srcY_fpga_quant.map<int8_t *>(), srcY_gpu_quant, DATA_SIZE * sizeof(int8_t), hipMemcpyDeviceToHost);
    srcY_fpga_quant.sync(XCL_BO_SYNC_BO_TO_DEVICE);

    // Execute FPGA kernel (8-bit integer multiplication)
    auto kernelRun = mulKernel(res_fpga_quant, srcX_fpga_quant, srcY_fpga_quant, DATA_SIZE);
    kernelRun.wait();

    // Synchronize output FPGA buffer back to HOST and copy its contents to the GPU buffer for dequantization
    res_fpga_quant.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
    hipMemcpy(res_gpu_quant, res_fpga_quant.map<int16_t *>(), DATA_SIZE * sizeof(int16_t), hipMemcpyHostToDevice);

    // Dequantize multiplication result on GPU
    dequantize<<<16, 256>>>(res_gpu, res_gpu_quant, DATA_SIZE);

    // Copy dequantized results from GPU to HOST
    std::vector<float> res(DATA_SIZE);
    hipMemcpy(res.data(), res_gpu, DATA_SIZE * sizeof(float), hipMemcpyDeviceToHost);

    // Perform simple sum on CPU
    float out = 0.0;
    for(size_t i = 0; i < DATA_SIZE; ++i)
        out += res[i];

    std::cout << "RESULT: " << out << std::endl;

    hipFree(srcX_gpu);
    hipFree(srcY_gpu);
    hipFree(res_gpu);
    hipFree(srcX_gpu_quant);
    hipFree(srcY_gpu_quant);
    hipFree(res_gpu_quant);

    return 0;
}

float compute_reference(const float *srcX, const float *srcY, size_t count)
{
    float out = 0.0f;

    for(size_t i = 0; i < count; ++i)
    {
        int16_t quantX(srcX[i] * 127);
        int16_t quantY(srcY[i] * 127);
        out += float(int16_t(quantX * quantY) / float(127*127));
    }

    return out;
}
```
The host/GPU application can be built using HIPCC as:
```console
$ hipcc -I$XILINX_XRT/include -I$XILINX_VIVADO/include -L$XILINX_XRT/lib -lxrt_coreutil main.hip -o host
```
The accelerator side (simple vector-multiply kernel) should be saved as `kernels.cpp`.
```c++
extern "C" {
void multiply(
short *out,
const char *inX,
const char *inY,
int size)
{
#pragma HLS INTERFACE m_axi port=inX bundle=aximm1
#pragma HLS INTERFACE m_axi port=inY bundle=aximm2
#pragma HLS INTERFACE m_axi port=out bundle=aximm1
for(int i = 0; i < size; ++i)
out[i] = short(inX[i]) * short(inY[i]);
}
}
```
Once again, the HLS kernel is built using Vitis `v++` in two steps:
```console
v++ -c -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM -k multiply kernels.cpp -o kernels.xo
v++ -l -t $IT4I_BUILD_MODE --platform $IT4I_PLATFORM kernels.xo -o kernels.xclbin
```
### Running the Application
In emulation mode (FPGA emulation, GPU HW is required) the application can be launched as:
```console
$ XCL_EMULATION_MODE=sw_emu ./host kernels.xclbin
REFERENCE: 256.554 (260.714)
Loading xclbin file ./kernels.xclbin
RESULT: 256.554
```
or, having compiled kernels with `IT4I_BUILD_MODE=hw` set, using real hardware (both FPGA and GPU HW is required)
```console
$ ./host kernels.xclbin
REFERENCE: 256.554 (260.714)
Loading xclbin file ./kernels.xclbin
RESULT: 256.554
```
## Additional Resources
- [https://xilinx.github.io/Vitis-Tutorials/][1]
- [http://xilinx.github.io/Vitis_Accel_Examples/][2]
[1]: https://xilinx.github.io/Vitis-Tutorials/
[2]: http://xilinx.github.io/Vitis_Accel_Examples/
# Complementary Systems
Complementary systems offer a development environment for users
who need to port and optimize their code and applications
for various hardware architectures and software technologies
that are not available on standard clusters.
## Complementary Systems 1
The first stage of the complementary systems implementation comprises these partitions:
- compute partition 0 – based on ARM technology - legacy
- compute partition 1 – based on ARM technology - A64FX
- compute partition 2 – based on Intel technologies - Ice Lake, NVDIMMs + Bitware FPGAs
- compute partition 3 – based on AMD technologies - Milan, MI100 GPUs + Xilinx FPGAs
- compute partition 4 – reflecting Edge type of servers
- partition 5 – FPGA synthesis server
![](../img/cs1_1.png)
## Complementary Systems 2
The second stage of the complementary systems implementation comprises these partitions:
- compute partition 6 - based on ARM technology + CUDA programmable GPGPU accelerators on the Ampere architecture + DPU network processing units
- compute partition 7 - based on IBM Power10 architecture
- compute partition 8 - modern CPU with a very high L3 cache capacity (over 750MB)
- compute partition 9 - virtual GPU accelerated workstations
- compute partition 10 - Sapphire Rapids-HBM server
- compute partition 11 - NVIDIA Grace CPU Superchip
![](../img/cs2_2.png)
## Modules and Architecture Availability
Complementary systems list available modules automatically based on the detected architecture.
However, you can load one of the three modules -- `aarch64`, `avx2`, and `avx512` --
to reload the list of modules available for the respective architecture:
```console
[user@login.cs ~]$ ml architecture/aarch64
aarch64 modules + all modules
[user@login.cs ~]$ ml architecture/avx2
avx2 modules + all modules
[user@login.cs ~]$ ml architecture/avx512
avx512 modules + all modules
```
# Complementary System Job Scheduling
## Introduction
The [Slurm][1] workload manager is used to allocate and access Complementary systems resources.
## Getting Partition Information
Display partitions/queues
```console
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
p00-arm up 1-00:00:00 0/1/0/1 p00-arm01
p01-arm* up 1-00:00:00 0/8/0/8 p01-arm[01-08]
p02-intel up 1-00:00:00 0/2/0/2 p02-intel[01-02]
p03-amd up 1-00:00:00 0/2/0/2 p03-amd[01-02]
p04-edge up 1-00:00:00 0/1/0/1 p04-edge01
p05-synt up 1-00:00:00 0/1/0/1 p05-synt01
p06-arm up 1-00:00:00 0/2/0/2 p06-arm[01-02]
p07-power up 1-00:00:00 0/1/0/1 p07-power01
p08-amd up 1-00:00:00 0/1/0/1 p08-amd01
p10-intel up 1-00:00:00 0/1/0/1 p10-intel01
```
## Getting Job Information
Show jobs
```console
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
104 p01-arm interact user R 1:48 2 p01-arm[01-02]
```
Show job details for specific job
```console
$ scontrol -d show job JOBID
```
Show job details for executing job from job session
```console
$ scontrol -d show job $SLURM_JOBID
```
## Running Interactive Jobs
Run interactive job
```console
$ salloc -A PROJECT-ID -p p01-arm
```
Run interactive job, with X11 forwarding
```console
$ salloc -A PROJECT-ID -p p01-arm --x11
```
!!! warning
    Do not use `srun` to initiate interactive jobs; subsequent `srun` or `mpirun` invocations would block forever.
## Running Batch Jobs
Run batch job
```console
$ sbatch -A PROJECT-ID -p p01-arm ./script.sh
```
Useful command options for `salloc`, `sbatch`, and `srun` (an example combining them is shown after this list):
* -n, --ntasks
* -c, --cpus-per-task
* -N, --nodes
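For example, the options above might be combined as follows (values are illustrative only):
```console
$ sbatch -A PROJECT-ID -p p01-arm -N 2 -n 96 -c 1 ./script.sh
```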
## Slurm Job Environment Variables
Slurm provides useful information to the job via environment variables. Environment variables are available on all nodes allocated to the job when accessed via Slurm-supported means (`srun`, compatible `mpirun`).
See all Slurm variables
```
set | grep ^SLURM
```
### Useful Variables
| variable name | description | example |
| ------ | ------ | ------ |
| SLURM_JOB_ID | job id of the executing job| 593 |
| SLURM_JOB_NODELIST | nodes allocated to the job | p03-amd[01-02] |
| SLURM_JOB_NUM_NODES | number of nodes allocated to the job | 2 |
| SLURM_STEP_NODELIST | nodes allocated to the job step | p03-amd01 |
| SLURM_STEP_NUM_NODES | number of nodes allocated to the job step | 1 |
| SLURM_JOB_PARTITION | name of the partition | p03-amd |
| SLURM_SUBMIT_DIR | submit directory | /scratch/project/open-xx-yy/work |
See [Slurm srun documentation][2] for details.
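A minimal job script sketch printing some of these variables:
```
#!/bin/bash
echo "Job ${SLURM_JOB_ID} runs on ${SLURM_JOB_NODELIST} (${SLURM_JOB_NUM_NODES} nodes)"
echo "Partition: ${SLURM_JOB_PARTITION}, submitted from: ${SLURM_SUBMIT_DIR}"
```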
Get job nodelist
```
$ echo $SLURM_JOB_NODELIST
p03-amd[01-02]
```
Expand nodelist to list of nodes.
```
$ scontrol show hostnames $SLURM_JOB_NODELIST
p03-amd01
p03-amd02
```
## Modifying Jobs
```
$ scontrol update JobId=JOBID ATTR=VALUE
```
for example
```
$ scontrol update JobId=JOBID Comment='The best job ever'
```
## Deleting Jobs
```
$ scancel JOBID
```
## Partitions
| PARTITION | nodes | whole node | cores per node | features |
| --------- | ----- | ---------- | -------------- | -------- |
| p00-arm | 1 | yes | 64 | aarch64,cortex-a72 |
| p01-arm | 8 | yes | 48 | aarch64,a64fx,ib |
| p02-intel | 2 | no | 64 | x86_64,intel,icelake,ib,fpga,bitware,nvdimm |
| p03-amd | 2 | no | 64 | x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx |
| p04-edge | 1 | yes | 16 | x86_64,intel,broadwell,ib |
| p05-synt | 1 | yes | 8 | x86_64,amd,milan,ib,ht |
| p06-arm | 2 | yes | 80 | aarch64,ib |
| p07-power | 1 | yes | 192 | ppc64le,ib |
| p08-amd | 1 | yes | 128 | x86_64,amd,milan-x,ib,ht |
| p10-intel | 1 | yes | 96 | x86_64,intel,sapphire_rapids,ht|
Use the `-t`/`--time` option to specify the job run time limit. The default job time limit is 2 hours, the maximum job time limit is 24 hours.
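For example, to request a 4-hour time limit (illustrative values):
```console
$ salloc -A PROJECT-ID -p p01-arm -t 04:00:00
```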
FIFO scheduling with backfilling is employed.
## Partition 00 - ARM (Cortex-A72)
Whole node allocation.
One node:
```console
salloc -A PROJECT-ID -p p00-arm
```
## Partition 01 - ARM (A64FX)
Whole node allocation.
One node:
```console
salloc -A PROJECT-ID -p p01-arm
```
```console
salloc -A PROJECT-ID -p p01-arm -N 1
```
Multiple nodes:
```console
salloc -A PROJECT-ID -p p01-arm -N 8
```
## Partition 02 - Intel (Ice Lake, NVDIMMs + Bitware FPGAs)
FPGAs are treated as resources. See below for more details about resources.
Partial allocation - per FPGA, resource separation is not enforced.
Use only FPGAs allocated to the job!
One FPGA:
```console
salloc -A PROJECT-ID -p p02-intel --gres=fpga
```
Two FPGAs on the same node:
```console
salloc -A PROJECT-ID -p p02-intel --gres=fpga:2
```
All FPGAs:
```console
salloc -A PROJECT-ID -p p02-intel -N 2 --gres=fpga:2
```
## Partition 03 - AMD (Milan, MI100 GPUs + Xilinx FPGAs)
GPUs and FPGAs are treated as resources. See below for more details about resources.
Partial allocation - per GPU and per FPGA, resource separation is not enforced.
Use only GPUs and FPGAs allocated to the job!
One GPU:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu
```
Two GPUs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:2
```
Four GPUs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:4
```
All GPUs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpu:4
```
One FPGA:
```console
salloc -A PROJECT-ID -p p03-amd --gres=fpga
```
Two FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd --gres=fpga:2
```
All FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=fpga:2
```
One GPU and one FPGA on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu,fpga
```
Four GPUs and two FPGAs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:4,fpga:2
```
All GPUs and FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpu:4,fpga:2
```
## Partition 04 - Edge Server
Whole node allocation:
```console
salloc -A PROJECT-ID -p p04-edge
```
## Partition 05 - FPGA Synthesis Server
Whole node allocation:
```console
salloc -A PROJECT-ID -p p05-synt
```
## Partition 06 - ARM
Whole node allocation:
```console
salloc -A PROJECT-ID -p p06-arm
```
## Partition 07 - IBM Power
Whole node allocation:
```console
salloc -A PROJECT-ID -p p07-power
```
## Partition 08 - AMD Milan-X
Whole node allocation:
```console
salloc -A PROJECT-ID -p p08-amd
```
## Partition 10 - Intel Sapphire Rapids
Whole node allocation:
```console
salloc -A PROJECT-ID -p p10-intel
```
## Features
Nodes have feature tags assigned to them.
Users can select nodes based on the feature tags using the `--constraint` option.
| Feature | Description |
| ------ | ------ |
| aarch64 | platform |
| x86_64 | platform |
| ppc64le | platform |
| amd | manufacturer |
| intel | manufacturer |
| icelake | processor family |
| broadwell | processor family |
| sapphire_rapids | processor family |
| milan | processor family |
| milan-x | processor family |
| ib | Infiniband |
| gpu | equipped with GPU |
| fpga | equipped with FPGA |
| nvdimm | equipped with NVDIMMs |
| ht | Hyperthreading enabled |
| noht | Hyperthreading disabled |
```
$ sinfo -o '%16N %f'
NODELIST AVAIL_FEATURES
p00-arm01 aarch64,cortex-a72
p01-arm[01-08] aarch64,a64fx,ib
p02-intel01 x86_64,intel,icelake,ib,fpga,bitware,nvdimm,ht
p02-intel02 x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
p03-amd02 x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx,noht
p03-amd01 x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx,ht
p04-edge01 x86_64,intel,broadwell,ib,ht
p05-synt01 x86_64,amd,milan,ib,ht
p06-arm[01-02] aarch64,ib
p07-power01 ppc64le,ib
p08-amd01 x86_64,amd,milan-x,ib,ht
p10-intel01 x86_64,intel,sapphire_rapids,ht
```
```
$ salloc -A PROJECT-ID -p p02-intel --constraint noht
```
```
$ scontrol -d show node p02-intel02 | grep ActiveFeatures
ActiveFeatures=x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
```
## Resources, GRES
Slurm supports the ability to define and schedule arbitrary resources - Generic RESources (GRES) in Slurm's terminology. We use GRES for scheduling/allocating GPUs and FPGAs.
!!! warning
    Use only allocated GPUs and FPGAs. Resource separation is not enforced. If you use non-allocated resources, you can observe strange behavior and run into trouble.
### Node Resources
Get information about GRES on node.
```
$ scontrol -d show node p02-intel01 | grep Gres=
Gres=fpga:bitware_520n_mx:2
$ scontrol -d show node p02-intel02 | grep Gres=
Gres=fpga:bitware_520n_mx:2
$ scontrol -d show node p03-amd01 | grep Gres=
Gres=gpu:amd_mi100:4,fpga:xilinx_alveo_u250:2
$ scontrol -d show node p03-amd02 | grep Gres=
Gres=gpu:amd_mi100:4,fpga:xilinx_alveo_u280:2
```
### Request Resources
To allocate the required resources (GPUs or FPGAs), use the `--gres` option of `salloc`/`srun`.
Example: Allocate one FPGA
```
$ salloc -A PROJECT-ID -p p03-amd --gres fpga:1
```
### Find Out Allocated Resources
Information about allocated resources is available in Slurm job details, attributes `JOB_GRES` and `GRES`.
```
$ scontrol -d show job $SLURM_JOBID |grep GRES=
JOB_GRES=fpga:xilinx_alveo_u250:1
Nodes=p03-amd01 CPU_IDs=0-1 Mem=0 GRES=fpga:xilinx_alveo_u250:1(IDX:0)
```
The `IDX` value in the `GRES` attribute specifies the index(es) of the FPGA(s) or GPU(s) allocated to the job on the node. In the given example, the allocated resource is `fpga:xilinx_alveo_u250:1(IDX:0)`, so we should use the FPGA with index 0 on node p03-amd01.
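For AMD GPUs on p03-amd, one possible (hedged) way to point a ROCm application at the allocated index from `IDX` is the `ROCR_VISIBLE_DEVICES` environment variable:
```
$ export ROCR_VISIBLE_DEVICES=0   # use the GPU index reported in IDX
```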
### Request Specific Resources
It is possible to allocate specific resources. This is useful for partition p03-amd, which is equipped with FPGAs of different types.
A GRES entry uses the format `name[[:type]:count]`; in the following example, the name is `fpga`, the type is `xilinx_alveo_u280`, and the count is 2.
```
$ salloc -A PROJECT-ID -p p03-amd --gres=fpga:xilinx_alveo_u280:2
salloc: Granted job allocation XXX
salloc: Waiting for resource configuration
salloc: Nodes p03-amd02 are ready for job
$ scontrol -d show job $SLURM_JOBID | grep -i gres
JOB_GRES=fpga:xilinx_alveo_u280:2
Nodes=p03-amd02 CPU_IDs=0 Mem=0 GRES=fpga:xilinx_alveo_u280(IDX:0-1)
TresPerNode=gres:fpga:xilinx_alveo_u280:2
```
[1]: https://slurm.schedmd.com/
[2]: https://slurm.schedmd.com/srun.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES
# Complementary Systems Specifications
Below are the technical specifications of individual Complementary systems.
## Partition 0 - ARM (Cortex-A72)
The partition is based on the [ARMv8-A 64-bit][4] architecture.
- Cortex-A72
- ARMv8-A 64-bit
- 2x 32 cores @ 2 GHz
- 255 GB memory
- disk capacity 3.7 TB
- 1x Infiniband FDR 56 Gb/s
## Partition 1 - ARM (A64FX)
The partition is based on the Armv8.2-A architecture
with the SVE instruction set extension and
consists of 8 compute nodes with the following per-node parameters:
- 1x Fujitsu A64FX CPU
- Armv8.2-A ISA CPU with the Scalable Vector Extension (SVE)
- 48 cores at 2.0 GHz
- 32 GB of HBM2 memory
- 400 GB SSD (m.2 form factor) – mixed used type
- 1x Infiniband HDR100 interface
- connected via 16x PCI-e Gen3 slot to the CPU
## Partition 2 - Intel (Ice Lake, NVDIMMs) <!--- + Bitware FPGAs) -->
The partition is based on the Intel Ice Lake x86 architecture.
It contains two servers with Intel NVDIMM memories.
<!--- The key technologies installed are Intel NVDIMM memories. and Intel FPGA accelerators.
The partition contains two servers each with two FPGA accelerators. -->
Each server has the following parameters:
- 2x 3rd Gen Xeon Scalable Processors Intel Xeon Gold 6338 CPU
- 32-cores @ 2.00GHz
- 16x 16GB RAM with ECC
- DDR4-3200
- 1x Infiniband HDR100 interface
- connected to CPU 8x PCI-e Gen4 interface
- 3.2 TB NVMe local storage – mixed use type
<!---
2x FPGA accelerators
Bitware [520N-MX][1]
-->
In addition, the servers have the following parameters:
- Intel server 1 – low NVDIMM memory server with 2304 GB NVDIMM memory
- 16x 128GB NVDIMM persistent memory modules
- Intel server 2 – high NVDIMM memory server with 8448 GB NVDIMM memory
- 16x 512GB NVDIMM persistent memory modules
Software installed on the partition:
FPGA boards support application development using the following design flows:
- OpenCL
- High-Level Synthesis (C/C++) including support for OneAPI
- Verilog and VHDL
## Partition 3 - AMD (Milan, MI100 GPUs + Xilinx FPGAs)
The partition is based on two servers equipped with AMD Milan x86 CPUs,
AMD GPUs and Xilinx FPGAs architectures and represents an alternative
to the Intel-based partition's ecosystem.
Each server has the following parameters:
- 2x AMD Milan 7513 CPU
- 32 cores @ 2.6 GHz
- 16x 16GB RAM with ECC
- DDR4-3200
- 4x AMD GPU accelerators MI 100
- Interconnected with AMD Infinity Fabric™ Link for fast GPU to GPU communication
- 1x 100 GBps Infiniband HDR100
- connected to CPU via 8x PCI-e Gen4 interface
- 3.2 TB NVMe local storage – mixed use
In addition:
- AMD server 1 has 2x FPGA [Xilinx Alveo U250 Data Center Accelerator Card][2]
- AMD server 2 has 2x FPGA [Xilinx Alveo U280 Data Center Accelerator Card][3]
Software installed on the partition includes developer tools and libraries for AMD GPUs.
The FPGA boards support application development using the following design flows:
- OpenCL
- High-Level Synthesis (C/C++)
- Verilog and VHDL
## Partition 4 - Edge Server
The partition provides an overview of the so-called edge computing class of resources,
with solutions powerful enough to provide data analytic capabilities (both CPU and GPU)
in a form factor that does not require a data center to operate.
The partition consists of one edge computing server with the following parameters:
- 1x x86_64 CPU Intel Xeon D-1587
- TDP 65 W,
- 16 cores,
- 435 GFlop/s theoretical max performance in double precision
- 1x CUDA programmable GPU NVIDIA Tesla T4
- TDP 70W
- theoretical performance 8.1 TFlop/s in FP32
- 128 GB RAM
- 1.92TB SSD storage
- connectivity:
- 2x 10 Gbps Ethernet,
- WiFi 802.11 ac,
- LTE connectivity
## Partition 5 - FPGA Synthesis Server
FPGA design tools usually run for several hours up to one day to generate the final bitstream (logic design) of large FPGA chips. These tools are usually sequential; therefore, a dedicated server for this task is part of the system.
This server runs the development tools needed for the FPGA boards installed in compute partitions 2 and 3.
- AMD EPYC 72F3, 8 cores @ 3.7 GHz nominal frequency
- 8 memory channels with ECC
- 128 GB of DDR4-3200 memory with ECC
- memory is fully populated to maximize memory subsystem performance
- 1x 10Gb Ethernet port used for connection to LAN
- NVMe local storage
- 2x NVMe disks 3.2TB, configured RAID 1
## Partition 6 - ARM + CUDA GPGPU (Ampere) + DPU
This partition is based on the ARM architecture and is equipped with CUDA-programmable GPGPU accelerators
based on the Ampere architecture and with DPU network processing units.
The partition consists of two nodes with the following per-node parameters:
- Server Gigabyte G242-P36, Ampere Altra Q80-30 (80c, 3.0GHz)
- 512GB DIMM DDR4, 3200MHz, ECC, CL22
- 2x Micron 7400 PRO 1920GB NVMe M.2 Non-SED Enterprise SSD
- 2x NVIDIA A30 GPU Accelerator
- 2x NVIDIA BlueField-2 E-Series DPU 25GbE Dual-Port SFP56, PCIe Gen4 x16, 16GB DDR + 64, 200Gb Ethernet
- Mellanox ConnectX-5 EN network interface card, 10/25GbE dual-port SFP28, PCIe3.0 x8
- Mellanox ConnectX-6 VPI adapter card, 100Gb/s (HDR100, EDR IB and 100GbE), single-port QSFP56
## Partition 7 - IBM
The IBM Power10 server is a single-node partition with the following parameters:
- Server IBM POWER S1022
- 2x Power10 12-CORE TYPICAL 2.90 TO 4.0 GHZ (MAX) PO
- 512GB DDIMMS, 3200 MHZ, 8GBIT DDR4
- 2x ENTERPRISE 1.6 TB SSD PCIE4 NVME U.2 MOD
- 2x ENTERPRISE 6.4 TB SSD PCIE4 NVME U.2 MOD
- PCIE3 LP 2-PORT 25/10GB NIC&ROCE SR/CU A
## Partition 8 - HPE Proliant
This partition provides a modern CPU with a very large L3 cache.
The goal is to enable users to develop algorithms and libraries
that will efficiently utilize this technology.
The processor is very efficient, for example, for linear algebra on relatively small matrices.
This is a single-node partition with the following parameters:
- Server HPE Proliant DL 385 Gen10 Plus v2 CTO
- 2x AMD EPYC 7773X Milan-X, 64 cores, 2.2GHz, 768 MB L3 cache
- 16x HPE 16GB (1x+16GB) x4 DDR4-3200 Registered Smart Memory Kit
- 2x 3.84TB NVMe RI SFF BC U.3ST MV SSD
- BCM 57412 10GbE 2p SFP+ OCP3 Adptr
- HPE IB HDR100/EN 100Gb 1p QSFP56 Adptr1
- HPE Cray Programming Environment for x86 Systems 2 Seats
## Partition 9 - Virtual GPU Accelerated Workstation
This partition provides users with a remote/virtual workstation running MS Windows OS.
It offers rich graphical environment with a focus on 3D OpenGL
or RayTracing-based applications with the smallest possible degradation of user experience.
The partition consists of two nodes with the following per-node parameters:
- Server HPE Proliant DL 385 Gen10 Plus v2 CTO
- 2x AMD EPYC 7413, 24 cores, 2.55GHz
- 16x HPE 32GB 2Rx4 PC4-3200AA-R Smart Kit
- 2x 3.84TB NVMe RI SFF BC U.3ST MV SSD
- BCM 57412 10GbE 2p SFP+ OCP3 Adptr
- 2x NVIDIA A40 48GB GPU Accelerator
### Available Software
The following software is available on partition 9:
- Academic VMware Horizon 8 Enterprise Term Edition: 10 Concurrent User Pack for 4 year term license; includes SnS
- 8x NVIDIA RTX Virtual Workstation, per concurrent user, EDU, perpetual license
- 32x NVIDIA RTX Virtual Workstation, per concurrent user, EDU SUMS per year
- 7x Windows Server 2022 Standard - 16 Core License Pack
- 10x Windows Server 2022 - 1 User CAL
- 40x Windows 10/11 Enterprise E3 VDA (Microsoft) per year
- Hardware VMware Horizon management
## Partition 10 - Sapphire Rapids-HBM Server
The primary purpose of this server is to evaluate how HBM memory on an x86 processor
impacts the performance of user applications.
Until now, HBM was available only on GPGPU accelerators,
where it provides a significant boost to memory-bound applications.
Users can also compare the impact of HBM memory with the impact of the large L3 cache
available on the AMD Milan-X processor, which is also part of the complementary systems.
The server is additionally equipped with DDR5 memory, enabling comparative studies with reference to DDR4-based systems.
- 2x Intel® Xeon® CPU Max 9468, 48 cores, base 2.1 GHz, max 3.5 GHz
- 16x 16GB DDR5 4800 MHz
- 2x Intel D3 S4520 960GB SATA 6Gb/s
- 1x Supermicro Standard LP 2-port 10GbE RJ45, Broadcom BCM57416
## Partition 11 - NVIDIA Grace CPU Superchip
The [NVIDIA Grace CPU Superchip][6] uses the [NVIDIA® NVLink®-C2C][5] technology to deliver 144 Arm® Neoverse V2 cores and 1TB/s of memory bandwidth.
It runs all NVIDIA software stacks and platforms, including NVIDIA RTX™, NVIDIA HPC SDK, NVIDIA AI, and NVIDIA Omniverse™.
- Superchip design with up to 144 Arm Neoverse V2 CPU cores with Scalable Vector Extensions (SVE2)
- World’s first LPDDR5X with error-correcting code (ECC) memory, 1TB/s total bandwidth
- 900GB/s coherent interface, 7X faster than PCIe Gen 5
- NVIDIA Scalable Coherency Fabric with 3.2TB/s of aggregate bisectional bandwidth
- 2X the packaging density of DIMM-based solutions
- 2X the performance per watt of today’s leading CPU
- FP64 Peak of 7.1TFLOPS
[1]: https://www.bittware.com/fpga/520n-mx/
[2]: https://www.xilinx.com/products/boards-and-kits/alveo/u250.html#overview
[3]: https://www.xilinx.com/products/boards-and-kits/alveo/u280.html#overview
[4]: https://developer.arm.com/documentation/100095/0003/
[5]: https://www.nvidia.com/en-us/data-center/nvlink-c2c/
[6]: https://www.nvidia.com/en-us/data-center/grace-cpu-superchip/
# Accessing the DGX-2
## Before You Access
!!! warning
GPUs are single-user devices. GPU memory is not purged between job runs and it can be read (but not written) by any user. Consider the confidentiality of your running jobs.
## How to Access
The DGX-2 machine is integrated into [Barbora cluster][3].
The DGX-2 machine can be accessed from the Barbora login nodes `barbora.it4i.cz` through the Barbora scheduler queue `qdgx` as the compute node `cn202`.
## Storage
There are three shared file systems on the DGX-2 system: HOME, SCRATCH (LSCRATCH), and PROJECT.
### HOME
The HOME filesystem is realized as an NFS filesystem. This is a shared home from the [Barbora cluster][1].
### SCRATCH
The SCRATCH is realized on an NVME storage. The SCRATCH filesystem is mounted in the `/scratch` directory.
Accessible capacity is 22TB, shared among all users.
!!! warning
Files on the SCRATCH filesystem that are not accessed for more than 60 days will be automatically deleted.
### PROJECT
The PROJECT data storage is IT4Innovations' central data storage accessible from all clusters.
For more information on accessing PROJECT, its quotas, etc., see the [PROJECT Data Storage][2] section.
[1]: ../../barbora/storage/#home-file-system
[2]: ../../storage/project-storage
[3]: ../../barbora/introduction
# NVIDIA DGX-2
The DGX-2 is a very powerful computational node, featuring high end x86_64 processors and 16 NVIDIA V100-SXM3 GPUs.
| NVIDIA DGX-2 | |
| --- | --- |
| CPUs | 2 x Intel Xeon Platinum |
| GPUs | 16 x NVIDIA Tesla V100 32GB HBM2 |
| System Memory | Up to 1.5 TB DDR4 |
| GPU Memory | 512 GB HBM2 (16 x 32 GB) |
| Storage | 30 TB NVMe, Up to 60 TB |
| Networking | 8 x Infiniband or 8 x 100 GbE |
| Power | 10 kW |
| Weight | 350 lbs |
| GPU Throughput | Tensor: 1920 TFLOPs, FP16: 520 TFLOPs, FP32: 260 TFLOPs, FP64: 130 TFLOPs |
The [DGX-2][a] introduces NVIDIA’s new NVSwitch, enabling 300 GB/s chip-to-chip communication at 12 times the speed of PCIe.
With NVLink2, it enables 16x NVIDIA V100-SXM3 GPUs in a single system, for a total bandwidth going beyond 14 TB/s.
Featuring a pair of Xeon 8168 CPUs, 1.5 TB of memory, and 30 TB of NVMe storage,
the system consumes 10 kW, weighs 163.29 kg, and offers double precision performance in excess of 130 TFLOPS.
The DGX-2 is designed to be a powerful server in its own right.
On the storage side, the DGX-2 comes with 30TB of NVMe-based solid state storage.
For clustering or further inter-system communications, it also offers InfiniBand and 100GigE connectivity, up to eight of them.
Further, the [DGX-2][b] offers a total of ~2 PFLOPs of half precision performance in a single system, when using the tensor cores.
![](../img/dgx1.png)
With the DGX-2, training AlexNet, the network that 'started' the latest machine learning revolution, now takes 18 minutes.
The DGX-2 is able to complete the training process
for FAIRSEQ – a neural network model for language translation – 10x faster than a DGX-1 system,
bringing it down to less than two days total rather than 15 days.
The new NVSwitches mean that the PCIe lanes of the CPUs can be redirected elsewhere, most notably towards storage and networking connectivity.
The topology of the DGX-2 means that all 16 GPUs are able to pool their memory into a unified memory space,
though with the usual tradeoffs involved if going off-chip.
![](../img/dgx2-nvlink.png)
[a]: https://www.nvidia.com/content/dam/en-zz/es_em/Solutions/Data-Center/dgx-2/nvidia-dgx-2-datasheet.pdf
[b]: https://www.youtube.com/embed/OTOGw0BRqK0
# Resource Allocation and Job Execution
To run a job, computational resources of DGX-2 must be allocated.
The DGX-2 machine is integrated into and accessible through the Barbora cluster; the queue for the DGX-2 machine is called **qdgx**.
When allocating computational resources for the job, specify:
1. your Project ID
1. a queue for your job - **qdgx**;
1. the maximum time allocated to your calculation (default is **4 hours**, maximum is **48 hours**);
1. a jobscript if batch processing is intended.
Submit the job using the `sbatch` (for batch processing) or `salloc` (for interactive session) command:
**Example**
```console
[kru0052@login2.barbora ~]$ salloc -A PROJECT-ID -p qdgx --time=02:00:00
salloc: Granted job allocation 36631
salloc: Waiting for resource configuration
salloc: Nodes cn202 are ready for job
kru0052@cn202:~$ nvidia-smi
Wed Jun 16 07:46:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM3... On | 00000000:34:00.0 Off | 0 |
| N/A 32C P0 51W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM3... On | 00000000:36:00.0 Off | 0 |
| N/A 31C P0 48W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM3... On | 00000000:39:00.0 Off | 0 |
| N/A 35C P0 53W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM3... On | 00000000:3B:00.0 Off | 0 |
| N/A 36C P0 53W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM3... On | 00000000:57:00.0 Off | 0 |
| N/A 29C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM3... On | 00000000:59:00.0 Off | 0 |
| N/A 35C P0 51W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM3... On | 00000000:5C:00.0 Off | 0 |
| N/A 30C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM3... On | 00000000:5E:00.0 Off | 0 |
| N/A 35C P0 53W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 8 Tesla V100-SXM3... On | 00000000:B7:00.0 Off | 0 |
| N/A 30C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 9 Tesla V100-SXM3... On | 00000000:B9:00.0 Off | 0 |
| N/A 30C P0 51W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 10 Tesla V100-SXM3... On | 00000000:BC:00.0 Off | 0 |
| N/A 35C P0 51W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 11 Tesla V100-SXM3... On | 00000000:BE:00.0 Off | 0 |
| N/A 35C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 12 Tesla V100-SXM3... On | 00000000:E0:00.0 Off | 0 |
| N/A 31C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 13 Tesla V100-SXM3... On | 00000000:E2:00.0 Off | 0 |
| N/A 29C P0 51W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 14 Tesla V100-SXM3... On | 00000000:E5:00.0 Off | 0 |
| N/A 34C P0 51W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 15 Tesla V100-SXM3... On | 00000000:E7:00.0 Off | 0 |
| N/A 34C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
kru0052@cn202:~$ exit
```
!!! tip
Submit the interactive job using the `salloc` command.
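For batch processing, a minimal jobscript sketch may look as follows; the walltime and the command are only illustrative:

```bash
#!/usr/bin/env bash
#SBATCH --account PROJECT-ID
#SBATCH --partition qdgx
#SBATCH --time 04:00:00

# the commands below run on the DGX-2 node (cn202)
nvidia-smi
```

Submit the jobscript with `sbatch jobscript.sh`.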
## Job Execution
The DGX-2 machine runs only a bare-bone, minimal operating system. Users are expected to run
**[Apptainer/Singularity][1]** containers in order to enrich the environment according to their needs.
Containers (Docker images) optimized for DGX-2 may be downloaded from
[NVIDIA GPU Cloud][2]. Select the code of interest and
copy the docker nvcr.io link from the Pull Command section. This link may be directly used
to download the container via Apptainer/Singularity, see the example below:
### Example - Apptainer/Singularity Run Tensorflow
```console
[kru0052@login2.barbora ~] $ salloc -A PROJECT-ID -p qdgx --time=02:00:00
salloc: Granted job allocation 36633
salloc: Waiting for resource configuration
salloc: Nodes cn202 are ready for job
kru0052@cn202:~$ singularity shell docker://nvcr.io/nvidia/tensorflow:19.02-py3
Singularity tensorflow_19.02-py3.sif:~>
Singularity tensorflow_19.02-py3.sif:~> mpiexec --bind-to socket -np 16 python /opt/tensorflow/nvidia-examples/cnn/resnet.py --layers=18 --precision=fp16 --batch_size=512
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
...
...
...
2019-03-11 08:30:12.263822: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
1 1.0 338.2 6.999 7.291 2.00000
10 10.0 3658.6 5.658 5.950 1.62000
20 20.0 25628.6 2.957 3.258 1.24469
30 30.0 30815.1 0.177 0.494 0.91877
40 40.0 30826.3 0.004 0.330 0.64222
50 50.0 30884.3 0.002 0.327 0.41506
60 60.0 30888.7 0.001 0.325 0.23728
70 70.0 30763.2 0.001 0.324 0.10889
80 80.0 30845.5 0.001 0.324 0.02988
90 90.0 26350.9 0.001 0.324 0.00025
kru0052@cn202:~$ exit
```
**GPU stat**
The GPU load can be determined by the `gpustat` utility.
```console
Every 2,0s: gpustat --color
dgx Mon Mar 11 09:31:00 2019
[0] Tesla V100-SXM3-32GB | 47'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[1] Tesla V100-SXM3-32GB | 48'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[2] Tesla V100-SXM3-32GB | 56'C, 97 % | 23660 / 32480 MB | kru0052(23645M)
[3] Tesla V100-SXM3-32GB | 57'C, 97 % | 23660 / 32480 MB | kru0052(23645M)
[4] Tesla V100-SXM3-32GB | 46'C, 97 % | 23660 / 32480 MB | kru0052(23645M)
[5] Tesla V100-SXM3-32GB | 55'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[6] Tesla V100-SXM3-32GB | 45'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[7] Tesla V100-SXM3-32GB | 54'C, 97 % | 23660 / 32480 MB | kru0052(23645M)
[8] Tesla V100-SXM3-32GB | 45'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[9] Tesla V100-SXM3-32GB | 46'C, 95 % | 23660 / 32480 MB | kru0052(23645M)
[10] Tesla V100-SXM3-32GB | 55'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[11] Tesla V100-SXM3-32GB | 56'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[12] Tesla V100-SXM3-32GB | 47'C, 95 % | 23660 / 32480 MB | kru0052(23645M)
[13] Tesla V100-SXM3-32GB | 45'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[14] Tesla V100-SXM3-32GB | 55'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[15] Tesla V100-SXM3-32GB | 58'C, 95 % | 23660 / 32480 MB | kru0052(23645M)
```
[1]: https://docs.it4i.cz/software/tools/singularity/
[2]: https://ngc.nvidia.com/
# Software Deployment
Software deployment on DGX-2 is based on containers. NVIDIA provides a wide range of prepared Docker containers with a variety of different software. Users can easily download these containers and use them directly on the DGX-2.
The catalog of all container images can be found on [NVIDIA site][a]. Supported software includes:
* TensorFlow
* MATLAB
* GROMACS
* Theano
* Caffe2
* LAMMPS
* ParaView
* ...
## Running Containers on DGX-2
NVIDIA expects usage of Docker as a containerization tool, but Docker is not a suitable solution in a multiuser environment. For this reason, the [Apptainer/Singularity container][b] solution is used.
Singularity can be used similarly to Docker; just change the image URL. For example, the original Docker command `docker run -it nvcr.io/nvidia/theano:18.08` should be changed to `singularity shell docker://nvcr.io/nvidia/theano:18.08`. More about Apptainer/Singularity [here][1].
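To run a single command inside a container with the NVIDIA GPUs exposed, the `--nv` flag can be used. A minimal sketch, where the image tag and the command are only illustrative:

```console
$ singularity exec --nv docker://nvcr.io/nvidia/tensorflow:19.02-py3 python -c "import tensorflow as tf; print(tf.__version__)"
```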
For fast container deployment, all images are cached after first use in the *lscratch* directory. This behavior can be changed by the *SINGULARITY_CACHEDIR* environment variable, but the start time of the container will increase significantly.
```console
$ ml av Singularity
---------------------------- /apps/modules/tools ----------------------------
Singularity/3.3.0
```
## MPI Modules
```console
$ ml av MPI
---------------------------- /apps/modules/mpi ----------------------------
OpenMPI/2.1.5-GCC-6.3.0-2.27 OpenMPI/3.1.4-GCC-6.3.0-2.27 OpenMPI/4.0.0-GCC-6.3.0-2.27 (D) impi/2017.4.239-iccifort-2017.7.259-GCC-6.3.0-2.27
```
## Compiler Modules
```console
$ ml av gcc
---------------------------- /apps/modules/compiler ----------------------------
GCC/6.3.0-2.27 GCCcore/6.3.0 icc/2017.7.259-GCC-6.3.0-2.27 ifort/2017.7.259-GCC-6.3.0-2.27
```
[1]: ../software/tools/singularity.md
[a]: https://ngc.nvidia.com/catalog/landing
[b]: https://www.sylabs.io/
# What Is the DICE Project?
DICE (Data Infrastructure Capacity for EOSC) is an international project funded by the European Union
that provides cutting-edge data management services and a significant amount of storage resources for the EOSC.
The EOSC (European Open Science Cloud) project provides European researchers, innovators, companies,
and citizens with a federated and open multi-disciplinary environment
where they can publish, find, and re-use data, tools, and services for research, innovation and educational purposes.
For more information, see the official [DICE project][b] and [EOSC project][q] pages.
**IT4Innovations participates in DICE. DICE uses the iRODS software.**
The integrated Rule-Oriented Data System (iRODS) is an open source data management software
used by research organizations and government agencies worldwide.
iRODS is released as a production-level distribution aimed at deployment in mission critical environments.
It virtualizes data storage resources, so users can take control of their data,
regardless of where and on what device the data is stored.
As data volumes grow and data services become more complex,
iRODS is serving an increasingly important role in data management.
For more information, see [the official iRODS page][c].
## How to Put Your Data to Our Server
**Prerequisites:**
First, we need to verify your identity. This is done through the following steps:
1. Sign in with your organization [B2ACCESS][d]; the page requests a valid personal certificate (e.g. GEANT).
Accounts with "Low" level of assurance are not granted access to IT4I zone.
1. Confirm your certificate in the browser:
![](img/B2ACCESS_chrome_eng.jpg)
1. Confirm your certificate in the OS (Windows):
![](img/crypto_v2.jpg)
1. Sign in to EUDAT/B2ACCESS:
![](img/eudat_v2.jpg)
1. After successful login to B2Access:
1. **For Non IT4I Users**
Sign in to our [AAI][f] through your B2Access account.
You have to set a new password for iRODS access.
1. **For IT4I Users**
Sign in to our [AAI][f] through your B2Access account and link your B2ACCESS identity with your existing account.
The iRODS password will be the same as your IT4I LDAP password (i.e. code.it4i.cz password).
![](img/aai.jpg)
![](img/aai2.jpg)
![](img/aai3-passwd.jpg)
![](img/irods_linking_link.jpg)
1. Contact [support@it4i.cz][a], so we can create your account at our iRODS server.
1. **Fill in this request on [EOSC-MARKETPLACE][h] (recommended)** or at [EUDAT][l]; please specify the requested capacity.
![](img/eosc-marketplace-active.jpg)
![](img/eosc-providers.jpg)
![](img/eudat_request.jpg)
## Access to iRODS Collection From Karolina
Access to iRODS Collection requires access to the Karolina cluster (i.e. [IT4I account][4]),
since iRODS clients are provided as a module on Karolina (Barbora is in progress).
The `irodsfs` module provides configuration files for both irodsfs and iCommands.
Note that you can change your iRODS password at [aai.it4i.cz][m].
### Mounting Your Collection
```console
ssh some_user@karolina.it4i.cz
ml irodsfs
```
Now you can choose between the Fuse client and iCommands:
#### Fuse
```console
ssh some_user@karolina.it4i.cz
[some_use@login4.karolina ~]$ ml irodsfs
irodsfs configuration file has been created at /home/dvo0012/.irods/config.yml
iRODS environment file has been created at /home/dvo0012/.irods/irods_environment.json
to start irodsfs, run: irodsfs -config ~/.irods/config.yml ~/IRODS
to start iCommands, run: iinit
For more information, see https://docs.it4i.cz/dice/
```
To mount your iRODS collection to ~/IRODS, run
```console
[some_user@login4.karolina ~]$ irodsfs -config ~/.irods/config.yml ~/IRODS
time="2022-08-04 08:54:13.222836" level=info msg="Logging to /tmp/irodsfs_cblmq5ab1lsaj31vrv20.log" function=processArguments package=main
Password:
time="2022-08-04 08:54:18.698811" level=info msg="Found FUSE Device. Starting iRODS FUSE Lite." function=parentMain package=main
time="2022-08-04 08:54:18.699080" level=info msg="Running the process in the background mode" function=parentRun package=main
time="2022-08-04 08:54:18.699544" level=info msg="Process id = 27145" function=parentRun package=main
time="2022-08-04 08:54:18.699572" level=info msg="Sending configuration data" function=parentRun package=main
time="2022-08-04 08:54:18.699730" level=info msg="Successfully sent configuration data to background process" function=parentRun package=main
time="2022-08-04 08:54:18.922490" level=info msg="Successfully started background process" function=parentRun package=main
```
To unmount it, run
```console
fusermount -u ~/IRODS
```
You can work with Fuse as an ordinary directory (`ls`, `cd`, `cp`, `mv`, etc.).
#### iCommands
```console
ssh some_user@karolina.it4i.cz
[some_use@login4.karolina ~]$ ml irodsfs
irodsfs configuration file has been created at /home/dvo0012/.irods/config.yml.
to start irods fs run: irodsfs -config ~/.irods/config.yml ~/IRODS
iCommands environment file has been created at /home/$USER/.irods/irods_environment.json.
to start iCommands run: iinit
[some_user@login4.karolina ~]$ iinit
Enter your current PAM password:
```
```console
[some_use@login4.karolina ~]$ ils
/IT4I/home/some_user:
test.1
test.2
test.3
test.4
```
Use the command `iput` for upload, `iget` for download, or `ihelp` for help.
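A minimal transfer session may look like this; the file names are only illustrative:

```console
[some_user@login4.karolina ~]$ iput results.tar.gz
[some_user@login4.karolina ~]$ ils
/IT4I/home/some_user:
  results.tar.gz
[some_user@login4.karolina ~]$ iget results.tar.gz ~/downloads/
```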
## Access to iRODS Collection From Other Resource
!!! note
This guide assumes you are uploading your data from your local PC/VM.
Use the password from [AAI][f].
### You Need a Client to Connect to iRODS Server
There are many iRODS clients, but we recommend the following:
- Cyberduck - Windows/Mac, GUI
- Fuse (irodsfs lite) - Linux, CLI
- iCommands - Linux, CLI.
For access, set PAM passwords at [AAI][f].
### Cyberduck
1. Download [Cyberduck][i].
2. Download [connection profile][1] for IT4I iRods server.
3. Left double-click this file to open connection.
![](img/irods-cyberduck.jpg)
### Fuse
!!!note "Linux client only"
This is a Linux client only, basic knowledge of the command line is necessary.
Fuse allows you to work with your iRODS collection like an ordinary directory.
```console
cd ~
wget https://github.com/cyverse/irodsfs/releases/download/v0.7.6/irodsfs_amd64_linux_v0.7.6.tar
tar -xvf ~/irodsfs_amd64_linux_v0.7.6.tar
mkdir ~/IRODS ~/.irods/ && cd "$_" && wget https://docs.it4i.cz/config.yml
wget https://pki.cesnet.cz/_media/certs/chain_geant_ov_rsa_ca_4_full.pem -P ~/.irods/
```
Edit `~/.irods/config.yml` with username from [AAI][f].
#### Mounting Your Collection
```console
[some_user@local_pc ~]$ ./irodsfs -config ~/.irods/config.yml ~/IRODS
time="2022-07-29 09:51:11.720831" level=info msg="Logging to /tmp/irodsfs_cbhp2rucso0ef0s7dtl0.log" function=processArguments package=main
Password:
time="2022-07-29 09:51:17.691988" level=info msg="Found FUSE Device. Starting iRODS FUSE Lite." function=parentMain package=main
time="2022-07-29 09:51:17.692683" level=info msg="Running the process in the background mode" function=parentRun package=main
time="2022-07-29 09:51:17.693381" level=info msg="Process id = 74772" function=parentRun package=main
time="2022-07-29 09:51:17.693421" level=info msg="Sending configuration data" function=parentRun package=main
time="2022-07-29 09:51:17.693772" level=info msg="Successfully sent configuration data to background process" function=parentRun package=main
time="2022-07-29 09:51:18.008166" level=info msg="Successfully started background process" function=parentRun package=main
```
#### Putting Your Data to iRODS
```console
[some_use@local_pc ~]$ cp test1G.txt ~/IRODS
```
It works as an ordinary file system:
```console
[some_user@local_pc ~]$ ls -la ~/IRODS
total 0
-rwx------ 1 some_user some_user 1073741824 Nov 4 2021 test1G.txt
```
#### Unmounting Your Collection
To stop/unmount your collection, use:
```console
[some_user@local_pc ~]$ fusermount -u ~/IRODS
```
### iCommands
!!!note "Linux client only"
This is a Linux client only, basic knowledge of the command line is necessary.
    We recommend CentOS 7; Ubuntu 20 is also an option.
#### Steps for Ubuntu 20
```console
LSB_RELEASE="bionic"
wget -qO - https://packages.irods.org/irods-signing-key.asc | sudo apt-key add -
echo "deb [arch=amd64] https://packages.irods.org/apt/ ${LSB_RELEASE} main" \
> | sudo tee /etc/apt/sources.list.d/renci-irods.list
deb [arch=amd64] https://packages.irods.org/apt/ bionic main
sudo apt-get update
apt-cache search irods
wget -c \
http://security.ubuntu.com/ubuntu/pool/main/p/python-urllib3/python-urllib3_1.22-1ubuntu0.18.04.2_all.deb \
http://security.ubuntu.com/ubuntu/pool/main/r/requests/python-requests_2.18.4-2ubuntu0.1_all.deb \
http://security.ubuntu.com/ubuntu/pool/main/o/openssl1.0/libssl1.0.0_1.0.2n-1ubuntu5.10_amd64.deb
sudo apt install \
./python-urllib3_1.22-1ubuntu0.18.04.2_all.deb \
./python-requests_2.18.4-2ubuntu0.1_all.deb \
  ./libssl1.0.0_1.0.2n-1ubuntu5.10_amd64.deb
sudo rm -rf \
./python-urllib3_1.22-1ubuntu0.18.04.2_all.deb \
./python-requests_2.18.4-2ubuntu0.1_all.deb \
  ./libssl1.0.0_1.0.2n-1ubuntu5.10_amd64.deb
sudo apt install -y irods-icommands
mkdir ~/.irods/ && cd "$_" && wget https://docs.it4i.cz/irods_environment.json
wget https://pki.cesnet.cz/_media/certs/chain_geant_ov_rsa_ca_4_full.pem -P ~/.irods
sed -i 's,~,'"$HOME"',g' ~/.irods/irods_environment.json
```
#### Steps for Centos
```console
sudo rpm --import https://packages.irods.org/irods-signing-key.asc
sudo wget -qO - https://packages.irods.org/renci-irods.yum.repo | sudo tee /etc/yum.repos.d/renci-irods.yum.repo
sudo yum install epel-release -y
sudo yum install python-psutil python-jsonschema
sudo yum install irods-icommands
mkdir ~/.irods/ && cd "$_" && wget https://docs.it4i.cz/irods_environment.json
wget https://pki.cesnet.cz/_media/certs/chain_geant_ov_rsa_ca_4_full.pem -P ~/.irods
sed -i 's,~,'"$HOME"',g' ~/.irods/irods_environment.json
```
Edit ***irods_user_name*** in `~/.irods/irods_environment.json` with the username from [AAI][f].
```console
[some_user@local_pc ~]$ pwd
/some_user/.irods
[some_user@local_pc ~]$ ls -la
total 16
drwx------. 2 some_user some_user 136 Sep 29 08:53 .
dr-xr-x---. 6 some_user some_user 206 Sep 29 08:53 ..
-rw-r--r--. 1 some_user some_user 253 Sep 29 08:14 irods_environment.json
```
**How to start:**
```console
[some_user@local_pc ~]$ iinit
Enter your current PAM password:
[some_user@local_pc ~]$ ils
/IT4I/home/some_user:
file.jpg
```
**How to put your data to iRODS**
```console
[some_user@local_pc ~]$ iput cesnet.crt
```
```console
[some_user@local_pc ~]$ ils
/IT4I/home/some_user:
cesnet.crt
```
**How to download data**
```console
[some_user@local_pc ~]$ iget cesnet.crt
ls -la ~
-rw-r--r--. 1 some_user some_user 1464 Jul 20 13:44 cesnet.crt
```
For more commands, use the `ihelp` command.
## PID Services
You, as a user, may want to index your datasets and assign Persistent Identifiers (PIDs) to them. We host a PID system based on hdl-surfsara ([https://it4i-handle.it4i.cz][o]), which is connected to [https://hdl.handle.net][p], and you can create your own PIDs by calling `irule`.
### How to Create a PID
PIDs are created by calling `irule`. Create the rule file in your `$HOME` directory or anywhere else you want,
but make sure the path is specified correctly.
Rules for PID operations always have the `.r` suffix.
This can be done only through iCommands.
Example of a rule for PID creation only:
```console
user in ~ λ pwd
/home/user
user in ~ λ ils
/IT4I/home/user:
C- /IT4I/home/dvo0012/Collection_A
user in ~ λ ls -l | grep pid
-rw-r--r-- 1 user user 249 Sep 30 10:55 create_pid.r
user in ~ λ cat create_pid.r
PID_DO_reg {
EUDATCreatePID(*parent_pid, *source, *ror, *fio, *fixed, *newPID);
writeLine("stdout","PID: *newPID");
}
INPUT *source="/IT4I/home/user/Collection_A",*parent_pid="None",*ror="None",*fio="None",*fixed="true"
OUTPUT ruleExecOut
user in ~ λ irule -F create_pid.r
PID: 21.12149/f3b9b1a5-7b4d-4fff-bfb7-826676f6fe14
```
After creation, your PID is searchable worldwide:
![](img/hdl_net.jpg)
![](img/hdl_pid.jpg)
**More info at [www.eudat.eu][n]**
### Metadata
To add metadata to your collection/dataset, you can use `imeta` from iCommands.
The following shows the metadata after PID creation:
```console
user in ~ λ imeta ls -C /IT4I/home/user/Collection_A
AVUs defined for collection /IT4I/home/user/Collection_A:
attribute: EUDAT/FIXED_CONTENT
value: True
units:
----
attribute: PID
value: 21.12149/f3b9b1a5-7b4d-4fff-bfb7-826676f6fe14
units:
```
To add any other metadata, use:
```console
user in ~ λ imeta add -C /IT4I/home/user/Collection_A EUDAT_B2SHARE_TITLE Some_Title
user in ~ λ imeta ls -C /IT4I/home/user/Collection_A
AVUs defined for collection /IT4I/home/user/Collection_A:
attribute: EUDAT/FIXED_CONTENT
value: True
units:
----
attribute: PID
value: 21.12149/f3b9b1a5-7b4d-4fff-bfb7-826676f6fe14
units:
----
attribute: EUDAT_B2SHARE_TITLE
value: Some_Title
units:
```
[1]: irods.cyberduckprofile
[2]: irods_environment.json
[3]: config.yml
[4]: general/access/account-introduction.md
[a]: mailto:support@it4i.cz
[b]: https://www.dice-eosc.eu/
[c]: https://irods.org/
[d]: https://b2access.eudat.eu/
[f]: https://aai.it4i.cz/realms/IT4i_IRODS/account/#/
[h]: https://marketplace.eosc-portal.eu/services/b2safe/offers
[i]: https://cyberduck.io/download/
[l]: https://www.eudat.eu/contact-support-request?Service=B2SAFE
[m]: https://aai.it4i.cz/
[n]: https://www.eudat.eu/catalogue/b2handle
[o]: https://it4i-handle.it4i.cz
[p]: https://hdl.handle.net
[q]: https://eosc-portal.eu/
# Migration to e-INFRA CZ
## Introduction
IT4Innovations is a part of [e-INFRA CZ][1], a strategic research infrastructure of the Czech Republic, which provides capacities and resources for the transmission, storage, and processing of scientific and research data. In January 2022, IT4I began the process of integrating its services.
As a part of the process, a joint e-INFRA CZ user base has been established. This included a migration of eligible IT4I accounts.
## Who Has Been Affected
The migration affects all accounts of users affiliated with an academic organization in the Czech Republic who also have an OPEN-XX-XX project. Affected users have received an email with information about changes in personal data processing.
## Who Has Not Been Affected
Commercial users, training accounts, suppliers, and service accounts were **not** affected by the migration.
## Process
During the process, additional steps may have been required for a successful migration.
These may have included:
1. e-INFRA CZ registration, if one does not already exist.
2. e-INFRA CZ password reset, if one does not already exist.
## Steps After Migration
After the migration, you must use your **e-INFRA CZ credentials** to access all IT4I services as well as [e-INFRA CZ services][5].
Successfully migrated accounts tied to e-INFRA CZ can be self-managed at [e-INFRA CZ User profile][4].
!!! tip "Recommendation"
We recommend [verifying your SSH keys][6] for cluster access.
## Troubleshooting
If you have a problem with your account migrated to e-INFRA CZ user base, contact the [CESNET support][7].
If you have questions or a problem with IT4I account (i.e. account not eligible for migration), contact the [IT4I support][2].
[1]: https://www.e-infra.cz/en
[2]: mailto:support@it4i.cz
[3]: https://www.cesnet.cz/?lang=en
[4]: https://profile.e-infra.cz/
[5]: https://www.e-infra.cz/en/services
[6]: https://profile.e-infra.cz/profile/settings/sshKeys
[7]: mailto:support@cesnet.cz
# Environment and Modules
## Shells on Clusters
The table shows which shells are available on the IT4Innovations clusters.
Note that bash is the only supported shell.
| Cluster Name | bash | tcsh | zsh | ksh | dash |
| --------------- | ---- | ---- | --- | --- | ---- |
| Karolina | yes | yes | yes | yes | yes |
| Barbora | yes | yes | yes | yes | no |
| DGX-2 | yes | no | no | no | no |
!!! info
Bash is the default shell. Should you need a different shell, contact [support\[at\]it4i.cz][3].
## Environment Customization
After logging in, you may want to configure the environment. Write your preferred path definitions, aliases, functions, and module loads in the `.bashrc` file:
```console
# .bashrc
# user's custom module path
export MODULEPATH=${MODULEPATH}:/home/$USER/.local/easybuild/modules/all
# User specific aliases and functions
alias sq='squeue --me'
# load the default Intel compiler (not recommended in .bashrc)
ml intel
# Display information on standard output - only in an interactive SSH session
if [ -n "$SSH_TTY" ]
then
ml # Display loaded modules
fi
```
!!! note
    Do not run commands outputting to standard output (echo, module list, etc.) in .bashrc for non-interactive SSH sessions. Doing so breaks fundamental functionality (e.g. SCP) of your account. Guard such commands with a check for SSH session interactivity, as shown in the example above.
### Application Modules
In order to configure your shell for running a particular application on clusters, we use a module package interface.
Application modules on clusters are built using [EasyBuild][1]. The modules are divided into the following groups:
```
base: Default module class
bio: Bioinformatics, biology and biomedical
cae: Computer Aided Engineering (incl. CFD)
chem: Chemistry, Computational Chemistry and Quantum Chemistry
compiler: Compilers
data: Data management & processing tools
debugger: Debuggers
devel: Development tools
geo: Earth Sciences
ide: Integrated Development Environments (e.g. editors)
lang: Languages and programming aids
lib: General purpose libraries
math: High-level mathematical software
mpi: MPI stacks
numlib: Numerical Libraries
perf: Performance tools
phys: Physics and physical systems simulations
system: System utilities (e.g. highly depending on system OS and hardware)
toolchain: EasyBuild toolchains
tools: General purpose tools
vis: Visualization, plotting, documentation and typesetting
OS: singularity image
python: python packages
```
!!! note
The modules set up the application paths, library paths and environment variables for running a particular application.
The modules may be loaded, unloaded, and switched according to momentary needs. For details, see [lmod][2].
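A few common module operations, as a quick sketch (the module name is only illustrative):

```console
$ ml av GCC    # list available GCC modules
$ ml GCC       # load the default GCC module
$ ml           # show currently loaded modules
$ ml -GCC      # unload the module
$ ml purge     # unload all modules
```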
[1]: software/tools/easybuild.md
[2]: software/modules/lmod.md
[3]: mailto:support@it4i.cz
# Introduction
This section provides basic information on how to gain access to IT4Innovations Information systems and project membership.
## Account Types
There are two types of accounts at IT4Innovations:
* [**e-INFRA CZ Account**][1]
intended for all persons affiliated with an academic institution from the Czech Republic ([eduID.cz][a]).
* [**IT4I Account**][2]
intended for all persons who are not eligible for an e-INFRA CZ account.
Once you create an account, you can use it only for communication with IT4I support and for accessing the SCS information system.
If you want to access IT4I clusters, your account must also be **assigned to a project**.
For more information, see the section:
* [**Get Project Membership**][3]
if you want to become a collaborator on a project, or
* [**Get Project**][4]
if you want to become a project owner.
[1]: ./einfracz-account.md
[2]: ../obtaining-login-credentials/obtaining-login-credentials.md
[3]: ../access/project-access.md
[4]: ../applying-for-resources.md
[a]: https://www.eduid.cz/
# e-INFRA CZ Account
[e-INFRA CZ][1] is a unique research and development e-infrastructure in the Czech Republic,
which provides capacities and resources for the transmission, storage and processing of scientific and research data.
IT4Innovations became a member of e-INFRA CZ in January 2022.
!!! important
Only persons affiliated with an academic institution from the Czech Republic ([eduID.cz][6]) are eligible for an e-INFRA CZ account.
## Request e-INFRA CZ Account
1. Request an account:
1. Go to [https://signup.e-infra.cz/fed/registrar/?vo=IT4Innovations][2]
1. Select a member academic institution you are affiliated with.
1. Fill out the e-INFRA CZ Account information (username, password and ssh key(s)).
Your account should be created in a few minutes after submitting the request.
Once your e-INFRA CZ account is created, it is propagated into IT4I systems
and can be used to access [SCS portal][3] and [Request Tracker][4].
1. Provide additional information via [IT4I support][a] or email [support\[at\]it4i.cz][b] (**required**, note that without this information, you cannot use IT4I resources):
1. **Full name**
1. **Gender**
1. **Citizenship**
1. **Country of residence**
1. **Organization/affiliation**
1. **Organization/affiliation country**
1. **Organization/affiliation type** (university, company, R&D institution, private/public sector (hospital, police), academy of sciences, etc.)
1. **Job title** (student, PhD student, researcher, research assistant, employee, etc.)
Continue to apply for a project or project membership to access clusters through the [SCS portal][3].
## Logging Into IT4I Services
The table below shows how different IT4I services are accessed:
| Services | Access |
| -------- | ------- |
| Clusters | SSH key |
| IS, RT, web, VPN | e-INFRA CZ login |
| Profile<br>Change&nbsp;password<br>Change&nbsp;SSH&nbsp;key | Academic institution's credentials<br>e-INFRA CZ / eduID |
You can change your profile settings at any time.
[1]: https://www.e-infra.cz/en
[2]: https://signup.e-infra.cz/fed/registrar/?vo=IT4Innovations
[3]: https://scs.it4i.cz/
[4]: https://support.it4i.cz/
[5]: ../../management/einfracz-profile.md
[6]: https://www.eduid.cz/
[a]: https://support.it4i.cz/rt/
[b]: mailto:support@it4i.cz
# Get Project Membership
!!! note
You need to be named as a collaborator by a Primary Investigator (PI) in order to access and use the clusters.
## Authorization by Web
This is a preferred method if you have an IT4I or e-INFRA CZ account.
Log in to the [IT4I SCS portal][a] and go to the **Authorization Requests** section. Here you can submit your requests for becoming a project member. You will have to wait until the project PI authorizes your request.
## Authorization by Email
Alternatively, you can become a project member upon a request sent via [email by the project PI][1].
[1]: ../../applying-for-resources/#authorization-by-email-an-alternative-approach
[a]: https://scs.it4i.cz/