# Heterogeneous Memory Management on Intel Platforms
Partition `p10-intel` offers heterogeneous memory directly exposed to the user. This allows manually picking the appropriate kind of memory at the process level or even for individual allocations. Both kinds of memory are exposed as memory-only NUMA nodes, enabling both coarse-grained (process level) and fine-grained (allocation level) control over the memory type used.
## Overview
At the process level, the `numactl` facilities can be utilized, while the Intel-provided `memkind` library allows for finer control. Both the `memkind` library and `numactl` are available after loading the `memkind` module; the `OpenMPI` module provides `numactl` only.
```bash
ml memkind
```
### Process Level (NUMACTL)
The `numactl` utility allows either restricting the memory pool of the process to a specific set of memory NUMA nodes
```bash
numactl --membind <node_ids_set>
```
or selecting a single preferred node
```bash
numactl --preferred <node_id>
```
where `<node_ids_set>` is a comma-separated list (e.g., `0,2,5`), possibly combined with ranges (such as `0-5`). With the `membind` option, the process is killed if it requests more memory than can be satisfied from the specified nodes. The `preferred` option instead falls back to other nodes, ordered by their NUMA distance, in the same situation.
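For illustration, both syntaxes can be combined in a single invocation (`./app` is a placeholder for an arbitrary application):

```bash
# Restrict all allocations of ./app to NUMA nodes 0, 2, and 5-7
numactl --membind 0,2,5-7 ./app

# Prefer NUMA node 8, falling back to other nodes when it fills up
numactl --preferred 8 ./app
```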
A convenient way to check the `numactl` configuration is
```bash
numactl -s
```
which prints the configuration of its execution environment, e.g.,
```bash
numactl --membind 8-15 numactl -s
policy: bind
preferred node: 0
physcpubind: 0 1 2 ... 189 190 191
cpubind: 0 1 2 3 4 5 6 7
nodebind: 0 1 2 3 4 5 6 7
membind: 8 9 10 11 12 13 14 15
```
The last row shows that memory allocations are restricted to NUMA nodes `8-15`.
### Allocation Level (MEMKIND)
In its simplest use case, the `memkind` library offers a variant of the `malloc/free` function pair that allows specifying the kind of memory to be used for a given allocation. Moving a specific allocation from the default to the HBM memory pool can then be achieved by replacing:
```cpp
void *pData = malloc(<SIZE>);
/* ... */
free(pData);
```
with
```cpp
#include <memkind.h>
void *pData = memkind_malloc(MEMKIND_HBW, <SIZE>);
/* ... */
memkind_free(NULL, pData); // "kind" parameter is deduced from the address
```
Other memory kinds can be selected similarly.
!!! note
    The allocation returns a `NULL` pointer when memory of the specified kind is not available.
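A minimal defensive-allocation sketch, falling back to regular DDR DRAM when HBM cannot be provided (the fallback policy is an illustrative choice, not something the library mandates):

```cpp
#include <memkind.h>
#include <stddef.h>

void *alloc_hbw_or_regular(size_t size)
{
    // Try high-bandwidth memory first.
    void *ptr = memkind_malloc(MEMKIND_HBW, size);

    // Fall back to regular DDR DRAM when no HBM is available.
    if (ptr == NULL)
        ptr = memkind_malloc(MEMKIND_REGULAR, size);

    return ptr;
}
```

Alternatively, the `MEMKIND_HBW_PREFERRED` kind implements a similar preference directly within the library.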
## High Bandwidth Memory (HBM)
The Intel Sapphire Rapids node (partition `p10-intel`) consists of two sockets, each with `128GB` of DDR DRAM and `64GB` of on-package HBM memory. The machine is configured in FLAT mode and therefore exposes the HBM as memory-only NUMA nodes (`16GB` per 12-core tile). The configuration can be verified by running
```bash
numactl -H
```
which should show 16 NUMA nodes (`0-7` should contain 12 cores and `32GB` of DDR DRAM, while `8-15` should have no cores and `16GB` of HBM each).
![](../../img/cs/guides/p10_numa_sc4_flat.png)
### Process Level
With this, we can easily restrict an application to DDR DRAM or HBM memory:
```bash
# Only DDR DRAM
numactl --membind 0-7 ./stream
# ...
Function Best Rate MB/s Avg time Min time Max time
Copy: 369745.8 0.043355 0.043273 0.043588
Scale: 366989.8 0.043869 0.043598 0.045355
Add: 378054.0 0.063652 0.063483 0.063899
Triad: 377852.5 0.063621 0.063517 0.063884
# Only HBM
numactl --membind 8-15 ./stream
# ...
Function Best Rate MB/s Avg time Min time Max time
Copy: 1128430.1 0.015214 0.014179 0.015615
Scale: 1045065.2 0.015814 0.015310 0.016309
Add: 1096992.2 0.022619 0.021878 0.024182
Triad: 1065152.4 0.023449 0.022532 0.024559
```
The DDR DRAM achieves a bandwidth of around 400 GB/s, while the HBM exceeds the 1 TB/s mark.
Further improvements can be achieved by isolating each process entirely on a single tile. This can be useful for MPI jobs, where `$OMPI_COMM_WORLD_RANK` can be used to bind each process individually. A simple wrapper script to do this may look like
```bash
#!/bin/bash
numactl --membind $((8 + OMPI_COMM_WORLD_RANK)) "$@"
```
and can be used as
```bash
mpirun -np 8 --map-by slot:pe=12 ./membind_wrapper.sh ./stream_mpi
```
(8 tiles with 12 cores each). However, this approach assumes that the `16GB` of HBM local to the tile is sufficient for each process (memory cannot spill between tiles). It may be significantly more useful in combination with `--preferred` instead of `--membind`, which forces a preference for the local HBM while allowing spill-over to DDR DRAM. Otherwise,
```bash
mpirun -n 8 --map-by slot:pe=12 numactl --membind 8-15 ./stream_mpi
```
is most likely preferable even for MPI workloads. Applying the above approach to the MPI stream benchmark with 8 ranks and 1-24 threads per rank, we can expect the following results:
![](../../img/cs/guides/p10_stream_dram.png)
![](../../img/cs/guides/p10_stream_hbm.png)
### Allocation Level
Allocation-level memory kind selection using the `memkind` library can be illustrated with a modified stream benchmark. The benchmark uses three working arrays (A, B, and C), whose allocations can be switched to `memkind_malloc` as follows:
```cpp
#include <memkind.h>
// ...
STREAM_TYPE *a = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
STREAM_TYPE *b = (STREAM_TYPE *)memkind_malloc(MEMKIND_REGULAR, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
STREAM_TYPE *c = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
// ...
memkind_free(NULL, a);
memkind_free(NULL, b);
memkind_free(NULL, c);
```
Arrays A and C are allocated from HBM (`MEMKIND_HBW_ALL`), while DDR DRAM (`MEMKIND_REGULAR`) is used for B.
The code then has to be linked against the `memkind` library
```bash
gcc -march=native -O3 -fopenmp memkind_stream.c -o memkind_stream -lmemkind
```
and can be run as follows, where `N` denotes the number of 12-core tiles to utilize:
```bash
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
OMP_NUM_THREADS=$((N*12)) OMP_PROC_BIND=spread ./memkind_stream
```
While the `memkind` library should be able to detect HBM memory on its own (through `HMAT` and `hwloc`), this is not supported on `p10-intel`. The NUMA nodes representing HBM therefore have to be specified manually using the `MEMKIND_HBW_NODES` environment variable.
![](../../img/cs/guides/p10_stream_memkind.png)
With this setup, the simple copy operation (`C[i] = A[i]`) achieves bandwidth comparable to the application bound entirely to HBM memory. On the other hand, the scale operation (`B[i] = s*C[i]`) is mostly limited by the DDR DRAM bandwidth. It is also worth noting that operations combining all three arrays perform close to the HBM-only configuration.
## Simple Application
One application that can greatly benefit from the combination of a large but slower memory and a smaller but faster one is the computation of a histogram with many bins over a large dataset.
```cpp
#include <iostream>
#include <vector>
#include <chrono>
#include <cmath>
#include <cstring>
#include <omp.h>
#include <memkind.h>
const size_t N_DATA_SIZE = 2 * 1024 * 1024 * 1024ull;
const size_t N_BINS_COUNT = 1 * 1024 * 1024ull;
const size_t N_ITERS = 10;
#if defined(HBM)
#define DATA_MEMKIND MEMKIND_REGULAR
#define BINS_MEMKIND MEMKIND_HBW_ALL
#else
#define DATA_MEMKIND MEMKIND_REGULAR
#define BINS_MEMKIND MEMKIND_REGULAR
#endif
int main(int argc, char *argv[])
{
    const double binWidth = 1.0 / double(N_BINS_COUNT + 1);

    double *pData = (double *)memkind_malloc(DATA_MEMKIND, N_DATA_SIZE * sizeof(double));
    size_t *pBins = (size_t *)memkind_malloc(BINS_MEMKIND, N_BINS_COUNT * omp_get_max_threads() * sizeof(size_t));

    #pragma omp parallel
    {
        drand48_data state;
        srand48_r(omp_get_thread_num(), &state);

        #pragma omp for
        for(size_t i = 0; i < N_DATA_SIZE; ++i)
            drand48_r(&state, &pData[i]);
    }

    auto c1 = std::chrono::steady_clock::now();

    for(size_t it = 0; it < N_ITERS; ++it)
    {
        #pragma omp parallel
        {
            for(size_t i = 0; i < N_BINS_COUNT; ++i)
                pBins[omp_get_thread_num()*N_BINS_COUNT + i] = size_t(0);

            #pragma omp for
            for(size_t i = 0; i < N_DATA_SIZE; ++i)
            {
                const size_t idx = size_t(pData[i] / binWidth) % N_BINS_COUNT;
                pBins[omp_get_thread_num()*N_BINS_COUNT + idx]++;
            }
        }
    }

    auto c2 = std::chrono::steady_clock::now();

    #pragma omp parallel for
    for(size_t i = 0; i < N_BINS_COUNT; ++i)
    {
        for(int j = 1; j < omp_get_max_threads(); ++j)
            pBins[i] += pBins[j*N_BINS_COUNT + i];
    }

    std::cout << "Elapsed Time [s]: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;

    size_t total = 0;
    #pragma omp parallel for reduction(+:total)
    for(size_t i = 0; i < N_BINS_COUNT; ++i)
        total += pBins[i];

    std::cout << "Total Items: " << total << std::endl;

    memkind_free(NULL, pData);
    memkind_free(NULL, pBins);

    return 0;
}
```
### Using HBM Memory (P10-Intel)
The following commands can be used to compile and run the example application above:
```bash
ml GCC memkind
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
g++ -O3 -fopenmp histogram.cpp -o histogram_dram -lmemkind
g++ -O3 -fopenmp -DHBM histogram.cpp -o histogram_hbm -lmemkind
OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_dram
OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_hbm
```
Moving the histogram bins into HBM memory should speed up the algorithm by more than a factor of two. Note that moving the `pData` array into HBM memory as well worsens this result (presumably because the algorithm can otherwise saturate both memory interfaces).
## Additional Resources
- [https://linux.die.net/man/8/numactl][1]
- [http://memkind.github.io/memkind/man_pages/memkind.html][2]
- [https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory][3]
[1]: https://linux.die.net/man/8/numactl
[2]: http://memkind.github.io/memkind/man_pages/memkind.html
[3]: https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory
# Using VMware Horizon
VMware Horizon is a virtual desktop infrastructure (VDI) solution
that enables users to access virtual desktops and applications from any device and any location.
It provides a comprehensive end-to-end solution for managing and delivering virtual desktops and applications,
including features such as session management, user authentication, and virtual desktop provisioning.
![](../../img/horizon.png)
## How to Access VMware Horizon
!!! important
    Access to VMware Horizon requires IT4I VPN.
1. Contact [IT4I support][a] with a request for access and VM allocation.
1. [Download][1] and install the VMware Horizon Client for Windows.
1. Add a new server `https://vdi-cs01.msad.it4i.cz/` in the Horizon client.
1. Connect to the server using your IT4I username and password.
Username is in the `domain\username` format and the domain is `msad.it4i.cz`.
For example: `msad.it4i.cz\user123`
## Example
Below is an example of how to mount a remote folder and check the connection on Windows OS:
### Prerequisites
3D applications
* [Blender][3]
SSHFS for remote access
* [sshfs-win][4]
* [winfsp][5]
* [sshfs-win-manager][6]
* SSH keys for access to the clusters
### Steps
1. Start the VPN and connect to the server via VMware Horizon Client.
![](../../img/vmware.png)
1. Mount a remote folder.
* Run sshfs-win-manager.
![](../../img/sshfs.png)
* Add a new connection.
![](../../img/sshfs1.png)
* Click on **Connect**.
![](../../img/sshfs2.png)
1. Check that the folder is mounted.
![](../../img/mount.png)
1. Check the GPU resources.
![](../../img/gpu.png)
### Blender
Now if you run, for example, Blender, you can check the available GPU resources in Blender Preferences.
![](../../img/blender.png)
[a]: mailto:support@it4i.cz
[1]: https://vdi-cs01.msad.it4i.cz/
[2]: https://www.paraview.org/download/
[3]: https://www.blender.org/download/
[4]: https://github.com/winfsp/sshfs-win/releases
[5]: https://github.com/winfsp/winfsp/releases/
[6]: https://github.com/evsar3/sshfs-win-manager/releases
# Using IBM Power Partition
For testing your application on the IBM Power partition,
you need to prepare a job script for that partition or use an interactive job:
```console
salloc -N 1 -c 192 -A PROJECT-ID -p p07-power --time=08:00:00
```
where:
- `-N 1` allocates a single node,
- `-c 192` allocates 192 cores (threads),
- `-p p07-power` selects the IBM Power partition,
- `--time=08:00:00` allocates the resources for 8 hours.
On the partition, you should reload the list of modules:
```
ml architecture/ppc64le
```
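Putting the pieces together, a minimal batch-job equivalent of the interactive allocation above might look as follows (the job name and executable are placeholders):

```
#!/bin/bash
#SBATCH --job-name=power-test
#SBATCH --account=PROJECT-ID
#SBATCH --partition=p07-power
#SBATCH --nodes=1
#SBATCH --cpus-per-task=192
#SBATCH --time=08:00:00

ml architecture/ppc64le
./hello
```

The script is then submitted with `sbatch ./script.sh`.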
The platform offers both `GNU`-based and proprietary IBM toolchains for building applications. IBM also provides an optimized BLAS routine library ([ESSL](https://www.ibm.com/docs/en/essl/6.1)), which can be used with both toolchains.
## Building Applications
Our sample application depends on `BLAS`; therefore, we start by loading the following modules (regardless of which toolchain we want to use):
```
ml GCC OpenBLAS
```
### GCC Toolchain
In the case of the GCC toolchain, we can go ahead and compile the application using either `g++`
```
g++ -lopenblas hello.cpp -o hello
```
or `gfortran`
```
gfortran -lopenblas hello.f90 -o hello
```
as usual.
### IBM Toolchain
The IBM toolchain requires additional environment setup because it is installed in `/opt/ibm` and is not exposed as a module:
```
IBM_ROOT=/opt/ibm
OPENXLC_ROOT=$IBM_ROOT/openxlC/17.1.1
OPENXLF_ROOT=$IBM_ROOT/openxlf/17.1.1
export PATH=$OPENXLC_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$OPENXLC_ROOT/lib:$LD_LIBRARY_PATH
export PATH=$OPENXLF_ROOT/bin:$PATH
export LD_LIBRARY_PATH=$OPENXLF_ROOT/lib:$LD_LIBRARY_PATH
```
From there, we can use either `ibm-clang++`
```
ibm-clang++ -lopenblas hello.cpp -o hello
```
or `xlf`
```
xlf -lopenblas hello.f90 -o hello
```
to build the application as usual.
!!! note
    The combination of `xlf` and `openblas` seems to cause severe performance degradation. Therefore, the `ESSL` library should be preferred (see below).
### Using ESSL Library
The [ESSL](https://www.ibm.com/docs/en/essl/6.1) library is installed in `/opt/ibm/math/essl/7.1`, so we define additional environment variables:
```
IBM_ROOT=/opt/ibm
ESSL_ROOT=${IBM_ROOT}/math/essl/7.1
export LD_LIBRARY_PATH=$ESSL_ROOT/lib64:$LD_LIBRARY_PATH
```
The simplest way to utilize `ESSL` in an application that already uses `BLAS` or `CBLAS` routines is to link with the provided `libessl.so`. This can be done by replacing `-lopenblas` with `-lessl`, or with `-lessl -lopenblas` (in case `ESSL` does not provide all of the required `BLAS` routines).
In practice, this can look like
```
g++ -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.cpp -o hello
```
or
```
gfortran -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.f90 -o hello
```
and similarly for IBM compilers (`ibm-clang++` and `xlf`).
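For illustration, the analogous commands for the IBM compilers might look like this (assuming the `OPENXLC_ROOT`/`OPENXLF_ROOT` and `ESSL_ROOT` environment setup from the previous sections):

```
ibm-clang++ -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.cpp -o hello
xlf -L${ESSL_ROOT}/lib64 -lessl -lopenblas hello.f90 -o hello
```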
## Hello World Applications
The `hello world` example application (written in `C++` and `Fortran`) uses a simple stationary probability vector estimation to illustrate the use of GEMM (a BLAS level 3 routine).
Stationary probability vector estimation in `C++`:
```c++
#include <iostream>
#include <vector>
#include <chrono>
#include "cblas.h"
const size_t ITERATIONS = 32;
const size_t MATRIX_SIZE = 1024;
int main(int argc, char *argv[])
{
    const size_t matrixElements = MATRIX_SIZE*MATRIX_SIZE;

    std::vector<float> a(matrixElements, 1.0f / float(MATRIX_SIZE));

    for(size_t i = 0; i < MATRIX_SIZE; ++i)
        a[i] = 0.5f / (float(MATRIX_SIZE) - 1.0f);
    a[0] = 0.5f;

    std::vector<float> w1(matrixElements, 0.0f);
    std::vector<float> w2(matrixElements, 0.0f);

    std::copy(a.begin(), a.end(), w1.begin());

    std::vector<float> *t1, *t2;
    t1 = &w1;
    t2 = &w2;

    auto c1 = std::chrono::steady_clock::now();

    for(size_t i = 0; i < ITERATIONS; ++i)
    {
        std::fill(t2->begin(), t2->end(), 0.0f);

        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, MATRIX_SIZE, MATRIX_SIZE, MATRIX_SIZE,
                    1.0f, t1->data(), MATRIX_SIZE,
                    a.data(), MATRIX_SIZE,
                    1.0f, t2->data(), MATRIX_SIZE);

        std::swap(t1, t2);
    }

    auto c2 = std::chrono::steady_clock::now();

    for(size_t i = 0; i < MATRIX_SIZE; ++i)
    {
        std::cout << (*t1)[i*MATRIX_SIZE + i] << " ";
    }
    std::cout << std::endl;

    std::cout << "Elapsed Time: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;

    return 0;
}
```
Stationary probability vector estimation in `Fortran`:
```fortran
program main
    implicit none

    integer :: matrix_size, iterations
    integer :: i
    real, allocatable, target :: a(:,:), w1(:,:), w2(:,:)
    real, dimension(:,:), contiguous, pointer :: t1, t2, tmp
    real, pointer :: out_data(:), out_diag(:)
    integer :: cr, cm, c1, c2

    iterations  = 32
    matrix_size = 1024

    call system_clock(count_rate=cr)
    call system_clock(count_max=cm)

    allocate(a(matrix_size, matrix_size))
    allocate(w1(matrix_size, matrix_size))
    allocate(w2(matrix_size, matrix_size))

    a(:,:) = 1.0 / real(matrix_size)
    a(:,1) = 0.5 / real(matrix_size - 1)
    a(1,1) = 0.5

    w1 = a
    w2(:,:) = 0.0

    t1 => w1
    t2 => w2

    call system_clock(c1)

    do i = 0, iterations
        t2(:,:) = 0.0
        call sgemm('N', 'N', matrix_size, matrix_size, matrix_size, 1.0, t1, matrix_size, a, matrix_size, 1.0, t2, matrix_size)

        tmp => t1
        t1  => t2
        t2  => tmp
    end do

    call system_clock(c2)

    out_data(1:size(t1)) => t1
    out_diag => out_data(1::matrix_size+1)

    print *, out_diag
    print *, "Elapsed Time: ", (c2 - c1) / real(cr)

    deallocate(a)
    deallocate(w1)
    deallocate(w2)
end program main
```
# Complementary Systems
Complementary systems offer a development environment for users
who need to port and optimize their code and applications
for various hardware architectures and software technologies
that are not available on the standard clusters.
## Complementary Systems 1
The first stage of the complementary systems implementation comprises these partitions:
- compute partition 0 – based on ARM technology - legacy
- compute partition 1 – based on ARM technology - A64FX
- compute partition 2 – based on Intel technologies - Ice Lake, NVDIMMs + Bitware FPGAs
- compute partition 3 – based on AMD technologies - Milan, MI100 GPUs + Xilinx FPGAs
- compute partition 4 – reflecting Edge type of servers
- partition 5 – FPGA synthesis server
![](../img/cs1_1.png)
## Complementary Systems 2
The second stage of the complementary systems implementation comprises these partitions:
- compute partition 6 - based on ARM technology + CUDA-programmable GPGPU accelerators of the Ampere architecture + DPU network processing units
- compute partition 7 - based on IBM Power10 architecture
- compute partition 8 - modern CPU with a very high L3 cache capacity (over 750MB)
- compute partition 9 - virtual GPU accelerated workstations
- compute partition 10 - Sapphire Rapids-HBM server
- compute partition 11 - NVIDIA Grace CPU Superchip
![](../img/cs2_2.png)
## Modules and Architecture Availability
The complementary systems list available modules automatically based on the detected architecture.
However, you can load one of the three modules -- `aarch64`, `avx2`, or `avx512` --
to reload the list of modules available for the respective architecture:
```console
[user@login.cs ~]$ ml architecture/aarch64
aarch64 modules + all modules
[user@login.cs ~]$ ml architecture/avx2
avx2 modules + all modules
[user@login.cs ~]$ ml architecture/avx512
avx512 modules + all modules
```
# Complementary System Job Scheduling
## Introduction
The [Slurm][1] workload manager is used to allocate and access Complementary systems resources.
## Getting Partition Information
Display partitions/queues
```console
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
p00-arm up 1-00:00:00 0/1/0/1 p00-arm01
p01-arm* up 1-00:00:00 0/8/0/8 p01-arm[01-08]
p02-intel up 1-00:00:00 0/2/0/2 p02-intel[01-02]
p03-amd up 1-00:00:00 0/2/0/2 p03-amd[01-02]
p04-edge up 1-00:00:00 0/1/0/1 p04-edge01
p05-synt up 1-00:00:00 0/1/0/1 p05-synt01
p06-arm up 1-00:00:00 0/2/0/2 p06-arm[01-02]
p07-power up 1-00:00:00 0/1/0/1 p07-power01
p08-amd up 1-00:00:00 0/1/0/1 p08-amd01
p10-intel up 1-00:00:00 0/1/0/1 p10-intel01
```
## Getting Job Information
Show jobs
```console
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
104 p01-arm interact user R 1:48 2 p01-arm[01-02]
```
Show job details for specific job
```console
$ scontrol -d show job JOBID
```
Show job details for executing job from job session
```console
$ scontrol -d show job $SLURM_JOBID
```
## Running Interactive Jobs
Run interactive job
```console
$ salloc -A PROJECT-ID -p p01-arm
```
Run interactive job, with X11 forwarding
```console
$ salloc -A PROJECT-ID -p p01-arm --x11
```
!!! warning
    Do not use `srun` for initiating interactive jobs; subsequent `srun` or `mpirun` invocations would block forever.
## Running Batch Jobs
Run batch job
```console
$ sbatch -A PROJECT-ID -p p01-arm ./script.sh
```
Useful command options (`salloc`, `sbatch`, `srun`); a sample batch script using them is shown below:
* -n, --ntasks
* -c, --cpus-per-task
* -N, --nodes
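A minimal illustrative batch script combining these options might look as follows (partition, project ID, and resource counts are placeholders):

```
#!/bin/bash
#SBATCH --account=PROJECT-ID
#SBATCH --partition=p01-arm
#SBATCH --nodes=2
#SBATCH --ntasks=8
#SBATCH --cpus-per-task=12
#SBATCH --time=01:00:00

# Launch the application on the allocated resources.
srun ./my_application
```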
## Slurm Job Environment Variables
Slurm provides useful information to the job via environment variables. These variables are available on all nodes allocated to the job when accessed via Slurm-supported means (`srun`, compatible `mpirun`).
See all Slurm variables
```
set | grep ^SLURM
```
### Useful Variables
| variable name | description | example |
| ------ | ------ | ------ |
| SLURM_JOB_ID | job id of the executing job| 593 |
| SLURM_JOB_NODELIST | nodes allocated to the job | p03-amd[01-02] |
| SLURM_JOB_NUM_NODES | number of nodes allocated to the job | 2 |
| SLURM_STEP_NODELIST | nodes allocated to the job step | p03-amd01 |
| SLURM_STEP_NUM_NODES | number of nodes allocated to the job step | 1 |
| SLURM_JOB_PARTITION | name of the partition | p03-amd |
| SLURM_SUBMIT_DIR | submit directory | /scratch/project/open-xx-yy/work |
See [Slurm srun documentation][2] for details.
Get job nodelist
```
$ echo $SLURM_JOB_NODELIST
p03-amd[01-02]
```
Expand nodelist to list of nodes.
```
$ scontrol show hostnames $SLURM_JOB_NODELIST
p03-amd01
p03-amd02
```
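For illustration, the expanded list can be used to run an ad-hoc command on every allocated node (assuming SSH access between allocated compute nodes is permitted; `hostname` is just a placeholder command):

```
for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
    ssh $node hostname
done
```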
## Modifying Jobs
```
$ scontrol update JobId=JOBID ATTR=VALUE
```
for example
```
$ scontrol update JobId=JOBID Comment='The best job ever'
```
## Deleting Jobs
```
$ scancel JOBID
```
## Partitions
| PARTITION | nodes | whole node | cores per node | features |
| --------- | ----- | ---------- | -------------- | -------- |
| p00-arm | 1 | yes | 64 | aarch64,cortex-a72 |
| p01-arm | 8 | yes | 48 | aarch64,a64fx,ib |
| p02-intel | 2 | no | 64 | x86_64,intel,icelake,ib,fpga,bitware,nvdimm |
| p03-amd | 2 | no | 64 | x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx |
| p04-edge | 1 | yes | 16 | x86_64,intel,broadwell,ib |
| p05-synt | 1 | yes | 8 | x86_64,amd,milan,ib,ht |
| p06-arm | 2 | yes | 80 | aarch64,ib |
| p07-power | 1 | yes | 192 | ppc64le,ib |
| p08-amd | 1 | yes | 128 | x86_64,amd,milan-x,ib,ht |
| p10-intel | 1 | yes | 96 | x86_64,intel,sapphire_rapids,ht|
Use the `-t`, `--time` option to specify the job run time limit. The default job time limit is 2 hours; the maximum job time limit is 24 hours.
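For example, an interactive allocation with a 4-hour limit (the partition is illustrative):

```console
salloc -A PROJECT-ID -p p01-arm --time=04:00:00
```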
FIFO scheduling with backfilling is employed.
## Partition 00 - ARM (Cortex-A72)
Whole node allocation.
One node:
```console
salloc -A PROJECT-ID -p p00-arm
```
## Partition 01 - ARM (A64FX)
Whole node allocation.
One node:
```console
salloc -A PROJECT-ID -p p01-arm
```
```console
salloc -A PROJECT-ID -p p01-arm -N 1
```
Multiple nodes:
```console
salloc -A PROJECT-ID -p p01-arm -N 8
```
## Partition 02 - Intel (Ice Lake, NVDIMMs + Bitware FPGAs)
FPGAs are treated as resources. See below for more details about resources.
Partial allocation - per FPGA, resource separation is not enforced.
Use only FPGAs allocated to the job!
One FPGA:
```console
salloc -A PROJECT-ID -p p02-intel --gres=fpga
```
Two FPGAs on the same node:
```console
salloc -A PROJECT-ID -p p02-intel --gres=fpga:2
```
All FPGAs:
```console
salloc -A PROJECT-ID -p p02-intel -N 2 --gres=fpga:2
```
## Partition 03 - AMD (Milan, MI100 GPUs + Xilinx FPGAs)
GPUs and FPGAs are treated as resources. See below for more details about resources.
Partial allocation - per GPU and per FPGA, resource separation is not enforced.
Use only GPUs and FPGAs allocated to the job!
One GPU:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu
```
Two GPUs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:2
```
Four GPUs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:4
```
All GPUs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpu:4
```
One FPGA:
```console
salloc -A PROJECT-ID -p p03-amd --gres=fpga
```
Two FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd --gres=fpga:2
```
All FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=fpga:2
```
One GPU and one FPGA on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu,fpga
```
Four GPUs and two FPGAs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:4,fpga:2
```
All GPUs and FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpu:4,fpga:2
```
## Partition 04 - Edge Server
Whole node allocation:
```console
salloc -A PROJECT-ID -p p04-edge
```
## Partition 05 - FPGA Synthesis Server
Whole node allocation:
```console
salloc -A PROJECT-ID -p p05-synt
```
## Partition 06 - ARM
Whole node allocation:
```console
salloc -A PROJECT-ID -p p06-arm
```
## Partition 07 - IBM Power
Whole node allocation:
```console
salloc -A PROJECT-ID -p p07-power
```
## Partition 08 - AMD Milan-X
Whole node allocation:
```console
salloc -A PROJECT-ID -p p08-amd
```
## Partition 10 - Intel Sapphire Rapids
Whole node allocation:
```console
salloc -A PROJECT-ID -p p10-intel
```
## Features
Nodes have feature tags assigned to them.
Users can select nodes based on the feature tags using the `--constraint` option.
| Feature | Description |
| ------ | ------ |
| aarch64 | platform |
| x86_64 | platform |
| ppc64le | platform |
| amd | manufacturer |
| intel | manufacturer |
| icelake | processor family |
| broadwell | processor family |
| sapphire_rapids | processor family |
| milan | processor family |
| milan-x | processor family |
| ib | Infiniband |
| gpu | equipped with GPU |
| fpga | equipped with FPGA |
| nvdimm | equipped with NVDIMMs |
| ht | Hyperthreading enabled |
| noht | Hyperthreading disabled |
```
$ sinfo -o '%16N %f'
NODELIST AVAIL_FEATURES
p00-arm01 aarch64,cortex-a72
p01-arm[01-08] aarch64,a64fx,ib
p02-intel01 x86_64,intel,icelake,ib,fpga,bitware,nvdimm,ht
p02-intel02 x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
p03-amd02 x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx,noht
p03-amd01 x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx,ht
p04-edge01 x86_64,intel,broadwell,ib,ht
p05-synt01 x86_64,amd,milan,ib,ht
p06-arm[01-02] aarch64,ib
p07-power01 ppc64le,ib
p08-amd01 x86_64,amd,milan-x,ib,ht
p10-intel01 x86_64,intel,sapphire_rapids,ht
```
```
$ salloc -A PROJECT-ID -p p02-intel --constraint noht
```
```
$ scontrol -d show node p02-intel02 | grep ActiveFeatures
ActiveFeatures=x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
```
## Resources, GRES
Slurm supports the ability to define and schedule arbitrary resources - Generic RESources (GRES) in Slurm's terminology. We use GRES for scheduling/allocating GPUs and FPGAs.
!!! warning
    Use only allocated GPUs and FPGAs. Resource separation is not enforced. If you use non-allocated resources, you may observe strange behavior and run into trouble.
### Node Resources
Get information about GRES on node.
```
$ scontrol -d show node p02-intel01 | grep Gres=
Gres=fpga:bitware_520n_mx:2
$ scontrol -d show node p02-intel02 | grep Gres=
Gres=fpga:bitware_520n_mx:2
$ scontrol -d show node p03-amd01 | grep Gres=
Gres=gpu:amd_mi100:4,fpga:xilinx_alveo_u250:2
$ scontrol -d show node p03-amd02 | grep Gres=
Gres=gpu:amd_mi100:4,fpga:xilinx_alveo_u280:2
```
### Request Resources
To allocate the required resources (GPUs or FPGAs), use the `--gres` option of the `salloc`/`srun` commands.
Example: Allocate one FPGA
```
$ salloc -A PROJECT-ID -p p03-amd --gres fpga:1
```
### Find Out Allocated Resources
Information about allocated resources is available in Slurm job details, attributes `JOB_GRES` and `GRES`.
```
$ scontrol -d show job $SLURM_JOBID |grep GRES=
JOB_GRES=fpga:xilinx_alveo_u250:1
Nodes=p03-amd01 CPU_IDs=0-1 Mem=0 GRES=fpga:xilinx_alveo_u250:1(IDX:0)
```
`IDX` in the `GRES` attribute specifies the index(es) of the FPGA(s) (or GPUs) allocated to the job on the node. In the given example, the allocated resource is `fpga:xilinx_alveo_u250:1(IDX:0)`, so we should use the FPGA with index 0 on node p03-amd01.
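For example, with a GPU allocation on the p03-amd nodes, the reported index can be used to point a ROCm application at the allocated device only (an illustrative sketch; `./rocm_app` is a placeholder):

```
$ export ROCR_VISIBLE_DEVICES=0   # index taken from GRES=...(IDX:0)
$ ./rocm_app
```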
### Request Specific Resources
It is possible to allocate specific resources. This is useful for partition p03-amd, which is equipped with FPGAs of two different types.
A GRES entry uses the format `name[[:type]:count]`; in the following example, the name is `fpga`, the type is `xilinx_alveo_u280`, and the count is 2.
```
$ salloc -A PROJECT-ID -p p03-amd --gres=fpga:xilinx_alveo_u280:2
salloc: Granted job allocation XXX
salloc: Waiting for resource configuration
salloc: Nodes p03-amd02 are ready for job
$ scontrol -d show job $SLURM_JOBID | grep -i gres
JOB_GRES=fpga:xilinx_alveo_u280:2
Nodes=p03-amd02 CPU_IDs=0 Mem=0 GRES=fpga:xilinx_alveo_u280(IDX:0-1)
TresPerNode=gres:fpga:xilinx_alveo_u280:2
```
[1]: https://slurm.schedmd.com/
[2]: https://slurm.schedmd.com/srun.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES
# Accessing the DGX-2
## Before You Access
!!! warning
    GPUs are single-user devices. GPU memory is not purged between job runs and can be read (but not written) by any user. Consider the confidentiality of your running jobs.
## How to Access
The DGX-2 machine is integrated into the [Barbora cluster][3].
It can be accessed from the Barbora login nodes `barbora.it4i.cz` through the Barbora scheduler queue `qdgx` as the compute node `cn202`.
## Storage
There are three shared file systems on the DGX-2 system: HOME, SCRATCH (LSCRATCH), and PROJECT.
### HOME
The HOME filesystem is realized as an NFS filesystem. This is a shared home from the [Barbora cluster][1].
### SCRATCH
The SCRATCH filesystem is realized on NVMe storage and is mounted in the `/scratch` directory.
The accessible capacity is 22TB, shared among all users.
!!! warning
    Files on the SCRATCH filesystem that are not accessed for more than 60 days will be automatically deleted.
### PROJECT
The PROJECT data storage is IT4Innovations' central data storage accessible from all clusters.
For more information on accessing PROJECT, its quotas, etc., see the [PROJECT Data Storage][2] section.
[1]: ../../barbora/storage/#home-file-system
[2]: ../../storage/project-storage
[3]: ../../barbora/introduction
# NVIDIA DGX-2
The DGX-2 is a very powerful computational node, featuring high-end x86_64 processors and 16 NVIDIA V100-SXM3 GPUs.
| NVIDIA DGX-2 | |
| --- | --- |
| CPUs | 2 x Intel Xeon Platinum |
| GPUs | 16 x NVIDIA Tesla V100 32GB HBM2 |
| System Memory | Up to 1.5 TB DDR4 |
| GPU Memory | 512 GB HBM2 (16 x 32 GB) |
| Storage | 30 TB NVMe, Up to 60 TB |
| Networking | 8 x Infiniband or 8 x 100 GbE |
| Power | 10 kW |
| Weight | 350 lbs |
| GPU Throughput | Tensor: 1920 TFLOPs, FP16: 520 TFLOPs, FP32: 260 TFLOPs, FP64: 130 TFLOPs |
The [DGX-2][a] introduces NVIDIA’s new NVSwitch, enabling 300 GB/s chip-to-chip communication at 12 times the speed of PCIe.
With NVLink2, it enables 16x NVIDIA V100-SXM3 GPUs in a single system, for a total bandwidth going beyond 14 TB/s.
Featuring a pair of Xeon 8168 CPUs, 1.5 TB of memory, and 30 TB of NVMe storage,
we get a system that consumes 10 kW, weighs 163.29 kg, yet offers double-precision performance in excess of 130 TFLOPS.
The DGX-2 is designed to be a powerful server in its own right.
On the storage side, the DGX-2 comes with 30TB of NVMe-based solid state storage.
For clustering or further inter-system communications, it also offers InfiniBand and 100GigE connectivity, up to eight of them.
Further, the [DGX-2][b] offers a total of ~2 PFLOPs of half precision performance in a single system, when using the tensor cores.
![](../img/dgx1.png)
With the DGX-2, training AlexNet, the network that 'started' the latest machine learning revolution, takes just 18 minutes.
The DGX-2 is able to complete the training process
for FAIRSEQ – a neural network model for language translation – 10x faster than a DGX-1 system,
bringing it down to less than two days total rather than 15 days.
The new NVSwitches mean that the PCIe lanes of the CPUs can be redirected elsewhere, most notably towards storage and networking connectivity.
The topology of the DGX-2 means that all 16 GPUs are able to pool their memory into a unified memory space,
though with the usual tradeoffs involved if going off-chip.
![](../img/dgx2-nvlink.png)
[a]: https://www.nvidia.com/content/dam/en-zz/es_em/Solutions/Data-Center/dgx-2/nvidia-dgx-2-datasheet.pdf
[b]: https://www.youtube.com/embed/OTOGw0BRqK0
# Migration to e-INFRA CZ
## Introduction
IT4Innovations is a part of [e-INFRA CZ][1], a strategic research infrastructure of the Czech Republic which provides capacities and resources for the transmission, storage, and processing of scientific and research data. In January 2022, IT4I began the process of integrating its services.
As a part of the process, a joint e-INFRA CZ user base has been established. This included a migration of eligible IT4I accounts.
## Who Has Been Affected
The migration affects all accounts of users affiliated with an academic organization in the Czech Republic who also have an OPEN-XX-XX project. Affected users have received an email with information about changes in personal data processing.
## Who Has Not Been Affected
Commercial users, training accounts, suppliers, and service accounts were **not** affected by the migration.
## Process
During the process, additional steps may have been required for a successful migration.
This may have included:
1. e-INFRA CZ registration, if one did not already exist.
2. e-INFRA CZ password reset, if a password was not already set.
## Steps After Migration
After the migration, you must use your **e-INFRA CZ credentials** to access all IT4I services as well as [e-INFRA CZ services][5].
Successfully migrated accounts tied to e-INFRA CZ can be self-managed at [e-INFRA CZ User profile][4].
!!! tip "Recommendation"
    We recommend [verifying your SSH keys][6] for cluster access.
## Troubleshooting
If you have a problem with your account migrated to the e-INFRA CZ user base, contact [CESNET support][7].
If you have questions or a problem with your IT4I account (i.e., an account not eligible for migration), contact [IT4I support][2].
[1]: https://www.e-infra.cz/en
[2]: mailto:support@it4i.cz
[3]: https://www.cesnet.cz/?lang=en
[4]: https://profile.e-infra.cz/
[5]: https://www.e-infra.cz/en/services
[6]: https://profile.e-infra.cz/profile/settings/sshKeys
[7]: mailto:support@cesnet.cz