# Visualization Servers
Karolina includes two nodes for remote visualization with [VirtualGL 2][3] and TurboVNC 2.
* 64 cores in total
* 2x AMD EPYC™ 7452 32-core, 2.35 GHz processors per node
* 256 GiB of DDR4 ECC physical memory per node, 3200 MT/s (12x 16 GB)
* HPE ProLiant DL385 Gen10 Plus servers
* 2406.4 GFLOP/s per compute node
* NVIDIA Quadro RTX 6000 card with OpenGL support
* 2x 100 Gb/s Ethernet and 1x 1 Gb/s Ethernet
* 1x HDR 200 Gb/s IB port
* 2x SSD 480 GB in RAID1
![](img/proliantdl385.png)
## NVIDIA® Quadro RTX™ 6000
* GPU Memory: 24 GB GDDR6
* Memory Interface: 384-bit
* Memory Bandwidth: Up to 672 GB/s
* NVIDIA® CUDA® Cores: 4,608
* NVIDIA® Tensor Cores: 576
* NVIDIA® RT Cores: 72
* System Interface: PCI Express 3.0 x16
* Max Power Consumption: 295 W
* Thermal Solution: Active
* Form Factor: 111 mm W x 267 mm L, Dual Slot, Full Height
* Display Connectors: 4x DP 1.4 + DVI-D DL
* Graphics APIs: Shader Model 5.1, OpenGL 4.6, DirectX 12.0, Vulkan 1.1
* Compute APIs: CUDA, DirectCompute, OpenCL™
* Single-Precision Floating-Point Performance: 16.3 TFLOP/s (peak)
* Tensor Performance: 130.5 TFLOP/s
![](img/qrtx6000.png)
## Resource Allocation Policy
| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime |
| ----- | -------------- | ----------------- | --------------------------------- | --------- | -------- | ------------- | -------- |
| qviz | yes | none required | 2 (with NVIDIA® Quadro RTX™ 6000) | 8 | 150 | no | 1h/8h |
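For instance, an interactive visualization job may be requested in the qviz queue as follows (the project ID, resources, and walltime are placeholders; this is a PBS-style sketch matching the submission examples used elsewhere in this documentation):

```console
$ qsub -q qviz -A OPEN-XX-XX -l select=1:ncpus=8 -l walltime=02:00:00 -I
```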
## References
* [Graphical User Interface][1]
* [VPN Access][2]
[1]: ../general/shell-and-data-access.md#graphical-user-interface
[2]: ../general/shell-and-data-access.md#vpn-access
[3]: ../software/viz/vgl.md
# About LUMI
The European High-Performance Computing Joint Undertaking (EuroHPC JU) is pooling European resources
to develop top-of-the-range exascale supercomputers for processing big data,
based on competitive European technology.
One of the pan-European pre-exascale supercomputers, [LUMI][1], is located in CSC’s data center in Kajaani, Finland.
The supercomputer is hosted by the Large Unified Modern Infrastructure consortium.
The LUMI consortium countries are Finland, Belgium, the Czech Republic,
Denmark, Estonia, Iceland, Norway, Poland, Sweden, and Switzerland.
LUMI is one of the world’s best-known scientific instruments for its lifespan of 2021–2027.
## LUMI AI
LUMI can assist users in migrating their machine learning applications from smaller-scale computing environments to LUMI.
For more information, see the [LUMI AI][c] subsection.
## LUMI Software
For the list of software modules installed on LUMI,
as well as direct links to documentation for some of the most used modules,
see the [LUMI Software][a] subsection.
## LUMI Support
LUMI offers general support, Czech national support, events, and training.
For more information, see the [LUMI Support][b] subsection.
## Technical Reference
For more information about how to access the LUMI supercomputer,
see the [official documentation][2].
[1]: https://lumi-supercomputer.eu/
[2]: https://docs.lumi-supercomputer.eu/
[a]: software.md
[b]: support.md
[c]: lumiai.md
# LUMI AI
LUMI can assist users in migrating their machine learning applications from smaller-scale computing environments to LUMI.
## LUMI AI Guide
The guide is available on the [LUMI GitHub][1] page.
Note that the project is still a work in progress and changes are made frequently.
## Requirements
Before proceeding, please ensure you meet the following prerequisites:
* A basic understanding of machine learning concepts and Python programming. This guide will focus primarily on aspects specific to training models on LUMI.
* An active user account on LUMI and familiarity with its basic operations.
* If you wish to run the included examples, you need to be part of a project with GPU hours on LUMI.
## Examples
For examples, visit the [LUMI AI workshop][2].
[1]: https://github.com/Lumi-supercomputer/LUMI-AI-Guide/blob/main/README.md
[2]: https://github.com/Lumi-supercomputer/Getting_Started_with_AI_workshop
# OpenFOAM
OpenFOAM is a free, open source CFD software package.
OpenFOAM has an extensive range of features to solve anything
from complex fluid flows involving chemical reactions, turbulence and heat transfer, to solid dynamics and electromagnetics.
## CSC Installed Software Collection
- [https://docs.lumi-supercomputer.eu/software/local/csc/][2]
- [https://docs.csc.fi/apps/openfoam/][3]
## Install 32bit/64bit
!!! warning
There is a very small quota for the maximum number of files on LUMI:
projappl (100K), scratch (2.0M), flash (1.0M). Check it with `lumi-quota`.
```
#!/bin/bash
SCRATCH="/pfs/lustre..."
cd $SCRATCH
mkdir -p openfoam
cd openfoam
export EBU_USER_PREFIX=$PWD/easybuild/lumi-c-23.09
module load LUMI/23.09 partition/container EasyBuild-user
#32bit - use eb file from this repository
#eb OpenFOAM-v2312-cpeGNU-23.09.eb -r
#64bit - use eb file from this repository
eb eb/OpenFOAM-v2312-64bit-cpeGNU-23.09.eb -r
```
## Run 32bit/64bit
```
#!/bin/bash
SCRATCH="/pfs/lustre..."
cd $SCRATCH/openfoam
export EBU_USER_PREFIX=$PWD/easybuild/lumi-c-23.09
module load LUMI/23.09 partition/container EasyBuild-user
ml OpenFOAM/v2312-cpeGNU-23.09-64bit
#32bit
#source $EBROOTOPENFOAM/etc/bashrc WM_COMPILER=Cray WM_MPLIB=CRAY-MPICH
#64bit
source $EBROOTOPENFOAM/etc/bashrc WM_COMPILER=Cray WM_MPLIB=CRAY-MPICH WM_LABEL_SIZE=64
OPENFOAM_PROJECT="/pfs/lustre..."
cd $OPENFOAM_PROJECT
srun -n 1 blockMesh
srun -n 1 decomposePar
srun -n $SLURM_NTASKS snappyHexMesh -overwrite -parallel | tee log.snappy
srun -n $SLURM_NTASKS createPatch -overwrite -parallel | tee log.createPatch
srun -n $SLURM_NTASKS transformPoints -scale '(0.01 0.01 0.01)' -parallel | tee log.transformPoint
srun -n $SLURM_NTASKS renumberMesh -overwrite -parallel | tee log.renum
srun -n $SLURM_NTASKS pimpleFoam -parallel | tee log.pimpleFoam1
```
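The run script above relies on `$SLURM_NTASKS`, so it is intended to run inside a Slurm allocation. A minimal batch wrapper sketch (the account, partition, resource values, and script name are placeholders):

```
#!/bin/bash
#SBATCH --account=project_XXXX
#SBATCH --partition=standard
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=128
#SBATCH --time=02:00:00

# run the 32bit/64bit script shown above
bash run-openfoam.sh
```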
## License
OpenFOAM is the free, open source CFD software developed primarily by OpenCFD Ltd since 2004.
OpenCFD Ltd, owner of the OpenFOAM Trademark, is a wholly owned subsidiary of ESI Group.
ESI-OpenCFD produces the OpenFOAM® open source CFD toolbox and distributes it freely
via [https://www.openfoam.com/][1]. OpenCFD Ltd was established in 2004 to coincide with
the release of its OpenFOAM software under a general public license.
OpenFOAM is distributed under the GPL v3 license.
## References
- Homepage: [http://www.openfoam.com/][1]
[1]: https://www.openfoam.com/
[2]: https://docs.lumi-supercomputer.eu/software/local/csc/
[3]: https://docs.csc.fi/apps/openfoam/
# PyTorch
## PyTorch Highlight
* Official page: [https://pytorch.org/][1]
* Code: [https://github.com/pytorch/pytorch][2]
* Python-based framework for machine learning
* Auto-differentiation on tensor types
* Official LUMI page: [https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/][3]
* **Warning:** be careful where the SIF image is installed or copied ($HOME is not recommended for quota reasons). For EasyBuild you must specify the installation path: `export EBU_USER_PREFIX=/project/project_XXXX/EasyBuild`.
## CSC Installed Software Collection
* [https://docs.csc.fi/support/tutorials/ml-multi/][8]
* [https://docs.lumi-supercomputer.eu/software/local/csc/][9]
* [https://docs.csc.fi/apps/pytorch/][10]
## PyTorch Install
### Base Environment
```console
module purge
module load CrayEnv
module load PrgEnv-cray/8.3.3
module load craype-accel-amd-gfx90a
module load cray-python
# Default ROCm – more recent versions are preferable (e.g. ROCm 5.6.0)
module load rocm/5.2.3.lua
```
### Scripts
* natively
* [01-install-direct-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/01-install-direct-torch1.13.1-rocm5.2.3.sh)
* [01-install-direct-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/01-install-direct-torch2.1.2-rocm5.5.3.sh)
* virtual env
* [02-install-venv-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/02-install-venv-torch1.13.1-rocm5.2.3.sh)
* [02-install-venv-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/02-install-venv-torch2.1.2-rocm5.5.3.sh)
* conda env
* [03-install-conda-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/03-install-conda-torch1.13.1-rocm5.2.3.sh)
* [03-install-conda-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/03-install-conda-torch2.1.2-rocm5.5.3.sh)
* from source: [04-install-source-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/04-install-source-torch1.13.1-rocm5.2.3.sh)
* containers (singularity)
* [05-install-container-torch2.0.1-rocm5.5.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/05-install-container-torch2.0.1-rocm5.5.1.sh)
* [05-install-container-torch2.1.0-rocm5.6.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/install/05-install-container-torch2.1.0-rocm5.6.1.sh)
## PyTorch Tests
### Run Interactive Job on Single Node
```console
salloc -A project_XXX --partition=standard-g -N 1 -n 1 --gpus 8 -t 01:00:00
```
### Scripts
* natively
* [01-simple-test-direct-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/01-simple-test-direct-torch1.13.1-rocm5.2.3.sh)
* [01-simple-test-direct-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/01-simple-test-direct-torch2.1.2-rocm5.5.3.sh)
* virtual env
* [02-simple-test-venv-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/02-simple-test-venv-torch1.13.1-rocm5.2.3.sh)
* [02-simple-test-venv-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/02-simple-test-venv-torch2.1.2-rocm5.5.3.sh)
* conda env
* [03-simple-test-conda-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/03-simple-test-conda-torch1.13.1-rocm5.2.3.sh)
* [03-simple-test-conda-torch2.1.2-rocm5.5.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/03-simple-test-conda-torch2.1.2-rocm5.5.3.sh)
* from source: [04-simple-test-source-torch1.13.1-rocm5.2.3.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/04-simple-test-source-torch1.13.1-rocm5.2.3.sh)
* containers (singularity)
* [05-simple-test-container-torch2.0.1-rocm5.5.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/05-simple-test-container-torch2.0.1-rocm5.5.1.sh)
* [05-simple-test-container-torch2.1.0-rocm5.6.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/05-simple-test-container-torch2.1.0-rocm5.6.1.sh)
### Run Interactive Job on Multiple Nodes
```
salloc -A project_XXX --partition=standard-g -N 2 -n 16 --gpus 16 -t 01:00:00
```
### Scripts
* containers (singularity)
* [07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/07-mnist-distributed-learning-container-torch2.0.1-rocm5.5.1.sh)
* [07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/07-mnist-distributed-learning-container-torch2.1.0-rocm5.6.1.sh)
* [08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/08-cnn-distributed-container-torch2.0.1-rocm5.5.1.sh)
* [08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh](https://code.it4i.cz/lumi-g/pytorch/-/blob/main/scripts/tests/08-cnn-distributed-container-torch2.1.0-rocm5.6.1.sh)
## Tips
### Official Containers
```
ls -la /appl/local/containers/easybuild-sif-images/
```
### Unofficial Versions of ROCM
```
module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
ml rocm/5.4.3
ml rocm/5.6.0
```
### Unofficial Containers
```
ls -la /pfs/lustrep2/projappl/project_462000125/samantao-public/containers/
```
### Installing Python Modules in Containers
```console
#!/bin/bash
wd=$(pwd)
SIF=/pfs/lustrep2/projappl/project_462000125/samantao-public/containers/lumi-pytorch-rocm-5.6.1-python-3.10-pytorch-v2.1.0-dockerhash-aa8dbea5e0e4.sif
rm -rf $wd/setup-me.sh
cat > $wd/setup-me.sh << EOF
#!/bin/bash -e
\$WITH_CONDA
pip3 install scipy h5py tqdm
EOF
chmod +x $wd/setup-me.sh
mkdir -p $wd/pip_install
srun -n 1 --gpus 8 singularity exec \
-B /var/spool/slurmd:/var/spool/slurmd \
-B /opt/cray:/opt/cray \
-B /usr/lib64/libcxi.so.1:/usr/lib64/libcxi.so.1 \
-B $wd:/workdir \
-B $wd/pip_install:$HOME/.local/lib \
$SIF /workdir/setup-me.sh
# Add the path of pip_install to singularity-exec in run.sh:
# -B $wd/pip_install:$HOME/.local/lib \
```
### Controlling Device Visibility
* `HIP_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; print(torch.cuda.device_count())'`
* `ROCR_VISIBLE_DEVICES=0,1,2,3 python -c 'import torch; print(torch.cuda.device_count())'`
* SLURM sets `ROCR_VISIBLE_DEVICES`
* Implications of both ways of setting visibility – blit kernels and/or DMA
### RCCL
* The problem – on startup we can see:
* `NCCL error in: /pfs/lustrep2/projappl/project_462000125/samantao/pytorchexample/pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, unhandled system error, NCCL version 2.12.12`
* Checking error origin:
* `export NCCL_DEBUG=INFO`
* `NCCL INFO NET/Socket : Using [0]nmn0:10.120.116.65<0> [1]hsn0:10.253.6.67<0> [2]hsn1:10.253.6.68<0>[3]hsn2:10.253.2.12<0> [4]hsn3:10.253.2.11<0>`
* `NCCL INFO /long_pathname_so_that_rpms_can_package_the_debug_info/data/driver/rccl/src/init.cc:1292`
* The fix:
* `export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3`
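Putting the diagnostics and the fix together, a hypothetical job-step sketch (the training script name is a placeholder):

```console
export NCCL_DEBUG=INFO                          # keep logging on to confirm the interfaces used
export NCCL_SOCKET_IFNAME=hsn0,hsn1,hsn2,hsn3   # restrict RCCL to the Slingshot interfaces
srun python train.py                            # placeholder distributed training script
```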
### RCCL AWS-CXI Plugin
* RCCL relies on runtime plug-ins to connect with some transport layers
* Libfabric – provider for Slingshot
* A HIPified plugin adapted from the AWS OpenFabrics support is available
* [https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl][7]
* 3-4x faster collectives
* The plugin needs to be pointed to by the loading environment
```console
module use /pfs/lustrep2/projappl/project_462000125/samantao-public/mymodules
module load aws-ofi-rccl/rocm-5.2.3.lua
# Or
export LD_LIBRARY_PATH=/pfs/lustrep2/projappl/project_462000125/samantao-public/apps-rocm-5.2.3/aws-ofirccl
# (will detect librccl-net.so)
```
* Verify the plugin is detected
```console
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT
# and search the logs for:
# [0] NCCL INFO NET/OFI Using aws-ofi-rccl 1.4.0
```
### amdgpu.ids Issue
[https://github.com/pytorch/builder/issues/1410][4]
## References
* Samuel Antao (AMD), LUMI Courses
* [https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530/extra_4_10_Best_Practices_GPU_Optimization/][5]
* [https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20231003/extra_4_10_Best_Practices_GPU_Optimization/][6]
* Multi-GPU and multi-node machine learning by CSC
* [https://docs.csc.fi/support/tutorials/ml-multi/][11]
[1]: https://pytorch.org/
[2]: https://github.com/pytorch/pytorch
[3]: https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/p/PyTorch/
[4]: https://github.com/pytorch/builder/issues/1410
[5]: https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20230530/extra_4_10_Best_Practices_GPU_Optimization/
[6]: https://lumi-supercomputer.github.io/LUMI-training-materials/4day-20231003/extra_4_10_Best_Practices_GPU_Optimization/
[7]: https://github.com/ROCmSoftwarePlatform/aws-ofi-rccl
[8]: https://docs.csc.fi/support/tutorials/ml-multi
[9]: https://docs.lumi-supercomputer.eu/software/local/csc/
[10]: https://docs.csc.fi/apps/pytorch/
[11]: https://docs.csc.fi/support/tutorials/ml-multi/
# LUMI Software
Below are links to LUMI guides for selected [LUMI Software modules][1]:
## PyTorch
[PyTorch][8] is an optimized tensor library for deep learning using GPUs and CPUs.
### Comprehensive Guide on PyTorch
See the [PyTorch][a] subsection for guides on how to install PyTorch and run interactive jobs.
### How to Run PyTorch on Lumi-G AMD GPU Accelerators
Link to LUMI guide on how to run PyTorch on LUMI GPUs:
[https://docs.lumi-supercomputer.eu/software/packages/pytorch/][2]
## How to Run Gromacs on Lumi-G AMD GPU Accelerators
Gromacs is a very efficient engine for performing molecular dynamics simulations
and energy minimizations, particularly for proteins.
However, it can also be used to model, for example, polymers, membranes, and coarse-grained systems.
It also comes with plenty of analysis scripts.
[https://docs.csc.fi/apps/gromacs/#example-batch-script-for-lumi-full-gpu-node][3]
## AMD Infinity Hub
The AMD Infinity Hub contains a collection of advanced software containers and deployment guides for HPC and AI applications,
including build recipes for code customization.
[https://www.amd.com/fr/developer/resources/infinity-hub.html][4]
## GPU-Accelerated Applications With AMD INSTINCT™ Accelerators Enabled by AMD ROCm™
The AMD Infinity Hub contains a collection of advanced software containers
and deployment guides for HPC and AI applications,
enabling researchers, scientists, and engineers to speed up their time to science.
[https://www.amd.com/system/files/documents/gpu-accelerated-applications-catalog.pdf][5]
## CSC Installed Software
The link below contains a list of codes enabled by CSC
and available for all, including PyTorch, TensorFlow, JAX, GROMACS, and others.
[https://docs.lumi-supercomputer.eu/software/local/csc/][6]
## Installation of SW via Conda
Conda is an open-source, cross-platform, language-agnostic package manager and environment management system.
[https://docs.lumi-supercomputer.eu/software/installing/container-wrapper/][7]
[1]: https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/
[2]: https://docs.lumi-supercomputer.eu/software/packages/pytorch/
[3]: https://docs.csc.fi/apps/gromacs/#example-batch-script-for-lumi-full-gpu-node
[4]: https://www.amd.com/fr/developer/resources/infinity-hub.html
[5]: https://www.amd.com/system/files/documents/gpu-accelerated-applications-catalog.pdf
[6]: https://docs.lumi-supercomputer.eu/software/local/csc/
[7]: https://docs.lumi-supercomputer.eu/software/installing/container-wrapper/
[8]: https://pytorch.org/docs/stable/index.html
[a]: pytorch.md
# LUMI Support
You can use the [LUMI support portal][5] for help and support regarding the cluster and SW technology at LUMI.
Czech national support for LUMI is provided by Jan Vicherek (contact: [support\[at\]it4i.cz][a]).
Additionally, LUMI organizes a number of events and trainings, a list of which can be found at:
[https://lumi-supercomputer.eu/events/][1]
## LUMI User Coffee Breaks
These are the LUMI User Support Team’s online meetings,
which you can join even if you are not yet a LUMI user.
[https://lumi-supercomputer.eu/lumi-user-coffee-breaks/][2]
You can submit questions for the session using HedgeDoc.
The Zoom link for the meeting can be found at:
[https://www.lumi-supercomputer.eu/events/usercoffeebreaks/][3]
## Overview
An overview of LUMI's events and trainings can be found at:
[https://lumi-supercomputer.github.io/LUMI-training-materials/][4]
[1]: https://lumi-supercomputer.eu/events/
[2]: https://lumi-supercomputer.eu/lumi-user-coffee-breaks/
[3]: https://www.lumi-supercomputer.eu/events/usercoffeebreaks/
[4]: https://lumi-supercomputer.github.io/LUMI-training-materials/
[5]: https://lumi-supercomputer.eu/user-support/need-help/
[a]: mailto:support@it4i.cz
# PRACE User Support
## Introduction
PRACE users coming to the TIER-1 systems offered through the DECI calls are, in general, treated as standard users, so most of the general documentation applies to them as well. This section shows the main differences for quicker orientation, but often refers to the original documentation. PRACE users who do not undergo the full procedure (including signing the IT4I AuP on top of the PRACE AuP) will not have a password and thus no access to some services intended for regular users. However, even with the limited access, they should be able to use the TIER-1 system as intended. If the same level of access is required, see the [Obtaining Login Credentials][1] section.
All general [PRACE User Documentation][a] should be read before continuing reading the local documentation here.
## Help and Support
If you need any information, request support, or want to install additional software, use the PRACE Helpdesk.
Information about the local services is provided in the introduction of the general user documentation for [Salomon][2] and [Barbora][3]. Keep in mind that standard PRACE accounts don't have a password to access the web interface of the local (IT4Innovations) request tracker, and thus a new ticket should be created by sending an email to support[at]it4i.cz.
## Obtaining Login Credentials
In general, PRACE users already have a PRACE account set up through their HOMESITE (institution from their country) as a result of an awarded PRACE project proposal. This includes a signed PRACE AuP, generated and registered certificates, etc.
If there is a special need, a PRACE user can get a standard (local) account at IT4Innovations. To get an account on a cluster, the user needs to obtain the login credentials. The procedure is the same as for general users of the cluster, see the corresponding [section of the general documentation here][1].
## Accessing the Cluster
### Access With GSI-SSH
For all PRACE users, the method for interactive access (login) and data transfer based on grid services from Globus Toolkit (GSI SSH and GridFTP) is supported.
The user will need a valid certificate and to be present in the PRACE LDAP (contact your HOME SITE or the Primary Investigator of your project for LDAP account creation).
For more information, see the [PRACE FAQ][b].
Before you start using any of the services, do not forget to create a proxy certificate from your certificate:
```console
$ grid-proxy-init
```
To check whether your proxy certificate is still valid (12 hours by default), use:
```console
$ grid-proxy-info
```
To access the cluster, several login nodes running the GSI SSH service are available. The service is available from the public Internet as well as from the internal PRACE network (accessible only from other PRACE partners).
#### Access From PRACE Network:
It is recommended to use the single DNS name **name-cluster**-prace.it4i.cz which is distributed among the four login nodes. If needed, the user can log in directly to one of the login nodes. The addresses are:
Salomon cluster:
| Login address | Port | Protocol | Login node |
| ---------------------------- | ---- | -------- | -------------------------------- |
| salomon-prace.it4i.cz | 2222 | gsissh | login1, login2, login3 or login4 |
| login1-prace.salomon.it4i.cz | 2222 | gsissh | login1 |
| login2-prace.salomon.it4i.cz | 2222 | gsissh | login2 |
| login3-prace.salomon.it4i.cz | 2222 | gsissh | login3 |
| login4-prace.salomon.it4i.cz | 2222 | gsissh | login4 |
```console
$ gsissh -p 2222 salomon-prace.it4i.cz
```
When logging in from another PRACE system, the prace_service script can be used:
```console
$ gsissh `prace_service -i -s salomon`
```
#### Access From Public Internet:
It is recommended to use the single DNS name **name-cluster**.it4i.cz which is distributed among the four login nodes. If needed, the user can log in directly to one of the login nodes. The addresses are:
Salomon cluster:
| Login address | Port | Protocol | Login node |
| ---------------------------- | ---- | -------- | -------------------------------- |
| salomon.it4i.cz | 2222 | gsissh | login1, login2, login3 or login4 |
| login1.salomon.it4i.cz | 2222 | gsissh | login1 |
| login2.salomon.it4i.cz       | 2222 | gsissh   | login2                           |
| login3.salomon.it4i.cz       | 2222 | gsissh   | login3                           |
| login4.salomon.it4i.cz       | 2222 | gsissh   | login4                           |
```console
$ gsissh -p 2222 salomon.it4i.cz
```
When logging in from another PRACE system, the prace_service script can be used:
```console
$ gsissh `prace_service -e -s salomon`
```
Although the preferred and recommended file transfer mechanism is [using GridFTP][5], the GSI SSH implementation also supports SCP, so for small file transfers, gsiscp can be used:
```console
$ gsiscp -P 2222 _LOCAL_PATH_TO_YOUR_FILE_ salomon.it4i.cz:_SALOMON_PATH_TO_YOUR_FILE_
$ gsiscp -P 2222 salomon.it4i.cz:_SALOMON_PATH_TO_YOUR_FILE_ _LOCAL_PATH_TO_YOUR_FILE_
$ gsiscp -P 2222 _LOCAL_PATH_TO_YOUR_FILE_ salomon-prace.it4i.cz:_SALOMON_PATH_TO_YOUR_FILE_
$ gsiscp -P 2222 salomon-prace.it4i.cz:_SALOMON_PATH_TO_YOUR_FILE_ _LOCAL_PATH_TO_YOUR_FILE_
```
### Access to X11 Applications (VNC)
If the user needs to run an X11-based graphical application and does not have an X11 server, the applications can be run using the VNC service. If the user is using regular SSH-based access, see this [section in the general documentation][6].
If the user uses GSI SSH-based access, the procedure is similar to the [SSH-based access][6]; only the port forwarding must be done using GSI SSH:
```console
$ gsissh -p 2222 salomon.it4i.cz -L 5961:localhost:5961
```
### Access With SSH
After they successfully obtain the login credentials for the local IT4Innovations account, PRACE users can access the cluster as regular users using SSH. For more information, see this [section in the general documentation][9].
## File Transfers
PRACE users can use the same transfer mechanisms as regular users (if they have undergone the full registration procedure). For more information, see the [Accessing the Clusters][9] section.
Apart from the standard mechanisms, a GridFTP server running the Globus Toolkit GridFTP service is available to PRACE users for transferring data to/from the Salomon cluster. The service is available from the public Internet as well as from the internal PRACE network (accessible only from other PRACE partners).
There is one control server and three backend servers for striping and/or backup in case one of them fails.
### Access From PRACE Network
Salomon cluster:
| Login address | Port | Node role |
| ----------------------------- | ---- | --------------------------- |
| gridftp-prace.salomon.it4i.cz | 2812 | Front end / control server  |
| lgw1-prace.salomon.it4i.cz | 2813 | Backend / data mover server |
| lgw2-prace.salomon.it4i.cz | 2813 | Backend / data mover server |
| lgw3-prace.salomon.it4i.cz | 2813 | Backend / data mover server |
Copy files **to** Salomon by running the following commands on your local machine:
```console
$ globus-url-copy file://_LOCAL_PATH_TO_YOUR_FILE_ gsiftp://gridftp-prace.salomon.it4i.cz:2812/home/prace/_YOUR_ACCOUNT_ON_SALOMON_/_PATH_TO_YOUR_FILE_
```
Or by using the prace_service script:
```console
$ globus-url-copy file://_LOCAL_PATH_TO_YOUR_FILE_ gsiftp://`prace_service -i -f salomon`/home/prace/_YOUR_ACCOUNT_ON_SALOMON_/_PATH_TO_YOUR_FILE_
```
Copy files **from** Salomon:
```console
$ globus-url-copy gsiftp://gridftp-prace.salomon.it4i.cz:2812/home/prace/_YOUR_ACCOUNT_ON_SALOMON_/_PATH_TO_YOUR_FILE_ file://_LOCAL_PATH_TO_YOUR_FILE_
```
Or by using the prace_service script:
```console
$ globus-url-copy gsiftp://`prace_service -i -f salomon`/home/prace/_YOUR_ACCOUNT_ON_SALOMON_/_PATH_TO_YOUR_FILE_ file://_LOCAL_PATH_TO_YOUR_FILE_
```
### Access From Public Internet
Salomon cluster:
| Login address | Port | Node role |
| ----------------------- | ---- | --------------------------- |
| gridftp.salomon.it4i.cz | 2812 | Front end / control server  |
| lgw1.salomon.it4i.cz | 2813 | Backend / data mover server |
| lgw2.salomon.it4i.cz | 2813 | Backend / data mover server |
| lgw3.salomon.it4i.cz | 2813 | Backend / data mover server |
Copy files **to** Salomon by running the following commands on your local machine:
```console
$ globus-url-copy file://_LOCAL_PATH_TO_YOUR_FILE_ gsiftp://gridftp.salomon.it4i.cz:2812/home/prace/_YOUR_ACCOUNT_ON_SALOMON_/_PATH_TO_YOUR_FILE_
```
Or by using the prace_service script:
```console
$ globus-url-copy file://_LOCAL_PATH_TO_YOUR_FILE_ gsiftp://`prace_service -e -f salomon`/home/prace/_YOUR_ACCOUNT_ON_SALOMON_/_PATH_TO_YOUR_FILE_
```
Copy files **from** Salomon:
```console
$ globus-url-copy gsiftp://gridftp.salomon.it4i.cz:2812/home/prace/_YOUR_ACCOUNT_ON_SALOMON_/_PATH_TO_YOUR_FILE_ file://_LOCAL_PATH_TO_YOUR_FILE_
```
Or by using the prace_service script:
```console
$ globus-url-copy gsiftp://`prace_service -e -f salomon`/home/prace/_YOUR_ACCOUNT_ON_SALOMON_/_PATH_TO_YOUR_FILE_ file://_LOCAL_PATH_TO_YOUR_FILE_
```
Generally, both shared file systems are available through GridFTP:
| File system mount point | Filesystem | Comment |
| ----------------------- | ---------- | -------------------------------------------------------------- |
| /home | Lustre | Default HOME directories of users in format /home/prace/login/ |
| /scratch | Lustre | Shared SCRATCH mounted on the whole cluster |
More information about the shared file systems on Salomon is available [here][10].
!!! hint
The `prace` directory is used for PRACE users on the SCRATCH file system.
Salomon cluster /scratch:
| Data type | Default path |
| ---------------------------- | ------------------------------- |
| large project files | /scratch/work/user/prace/login/ |
| large scratch/temporary data | /scratch/temp/ |
## Usage of the Cluster
There are some limitations for PRACE users when using the cluster. By default, PRACE users are not allowed to access special queues in PBS Pro that grant high priority or exclusive access to special equipment such as accelerated nodes and high-memory (fat) nodes. There may also be restrictions on obtaining a working license for the commercial software installed on the cluster, mostly because of the license agreement or an insufficient number of licenses.
For production runs, always use the scratch file systems. The available file systems on Salomon are described [here][10].
### Software, Modules and PRACE Common Production Environment
All system-wide installed software on the cluster is made available to the users via the modules. For more information about the environment and modules usage, see the [Environment and Modules][12] section.
PRACE users can use the "prace" module for the PRACE Common Production Environment.
```console
$ ml prace
```
### Resource Allocation and Job Execution
For general information about the resource allocation, job queuing, and job execution, see [Resources Allocation Policy][13].
For PRACE users, the default production queue is "qprod", the same queue as for the national users of IT4I. Previously, "qprace" was the default queue for PRACE users, but since it gradually became identical to the "qprod" queue, it has been retired. For legacy reasons, the "qprace" queue is enabled on systems where it used to be the default, but it is not available on current and future systems. PRACE users can also use two other queues, "qexp" and "qfree".
Salomon:
| queue | Active project | Project resources | Nodes | priority | authorization | walltime |
| ---------------------------------- | -------------- | ----------------- | -------------------------- | -------- | ------------- | --------- |
| **qexp** Express queue | no | none required | 32 nodes, max 8 per user | 150 | no | 1 / 1 h |
| **qprod** Production queue | yes | >0 | 1006 nodes, max 86 per job | 0 | no | 24 / 48 h |
| **qfree** Free resource queue | yes | none required | 752 nodes, max 86 per job | -1024 | no | 12 / 12 h |
| **qprace** Legacy production queue | yes | >0 | 1006 nodes, max 86 per job | 0 | no | 24 / 48 h |
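For illustration, a PRACE user with an active project might submit a production job to the qprod queue as follows (the project ID, node count, and walltime are placeholders):

```console
$ qsub -A OPEN-XX-XX -q qprod -l select=4:ncpus=24 -l walltime=24:00:00 ./myjob
```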
### Accounting & Quota
The resources that are currently subject to accounting are the core hours. The core hours are accounted on a wall-clock basis. The accounting runs whenever the computational cores are allocated or blocked via the PBS Pro workload manager (the qsub command), regardless of whether the cores are actually used for any calculation. See the [example in the general documentation][13].
PRACE users should check their project accounting using the PRACE Accounting Tool (DART).
Users who have undergone the full local registration procedure (including signing the IT4Innovations Acceptable Use Policy) and who have received a local password may check at any time how many core-hours they and their projects have consumed using the "it4ifree" command. Note that you need to know your user password to use the command, and that the displayed core hours are "system core hours", which differ from PRACE "standardized core hours".
!!! note
The **it4ifree** command is part of the it4i.portal.clients package, [located here][pypi].
```console
$ it4ifree
Projects I am participating in
==============================
PID Days left Total Used WCHs Used NCHs WCHs by me NCHs by me Free
---------- ----------- ------- ----------- ----------- ------------ ------------ -------
OPEN-XX-XX 323 0 5169947 5169947 50001 50001 1292555
Projects I am Primarily Investigating
=====================================
PID Login Used WCHs Used NCHs
---------- ---------- ----------- -----------
OPEN-XX-XX user1 376670 376670
user2 4793277 4793277
Legend
======
WCH = Wall-clock Core Hour
NCH = Normalized Core Hour
```
By default, a file system quota is applied. To check the current status of the quota (separate for HOME and SCRATCH), use:
```console
$ quota
$ lfs quota -u USER_LOGIN /scratch
```
If the quota is insufficient, contact the [support][15] and request an increase.
[1]: general/obtaining-login-credentials/obtaining-login-credentials.md
[2]: salomon/introduction.md
[3]: barbora/introduction.md
[5]: #file-transfers
[6]: general/accessing-the-clusters/graphical-user-interface/x-window-system.md
[9]: general/shell-and-data-access.md
[10]: salomon/storage.md
[12]: environment-and-modules.md
[13]: general/resources-allocation-policy.md
[15]: #help-and-support
[a]: https://prace-ri.eu/training-support/
[b]: https://prace-ri.eu/about/faqs/
[pypi]: https://pypi.python.org/pypi/it4i.portal.clients
# 7D Enhanced Hypercube
![](../img/7D_Enhanced_hypercube.png)
| Node type | Count | Short name | Long name | Rack |
| ------------------------------------ | ----- | ---------------- | ------------------------ | ----- |
| M-Cell compute nodes w/o accelerator | 576 | cns1 - cns576 | r1i0n0 - r4i7n17 | 1-4 |
| compute nodes MIC accelerated | 432 | cns577 - cns1008 | r21u01n577 - r37u31n1008 | 21-38 |
## IB Topology
![](../img/Salomon_IB_topology.png)
# Compute Nodes
## Nodes Configuration
Salomon is a cluster of x86-64 Intel-based nodes. The cluster contains two types of compute nodes of the same processor type and memory size.
Compute nodes with MIC accelerator **contain two Intel Xeon Phi 7120P accelerators.**
Read [more about][1] the schematic representation of the Salomon cluster compute nodes' IB topology.
### Compute Nodes Without Accelerator
* codename "grafton"
* 576 nodes
* 13 824 cores in total
* two Intel Xeon E5-2680v3, 12-core, 2.5 GHz processors per node
* 128 GB of physical memory per node
![cn_m_cell](../img/cn_m_cell.jpg)
### Compute Nodes With MIC Accelerator
* codename "perrin"
* 432 nodes
* 10 368 cores in total
* two Intel Xeon E5-2680v3, 12-core, 2.5 GHz processors per node
* 128 GB of physical memory per node
* MIC accelerator 2 x Intel Xeon Phi 7120P per node, 61-cores, 16 GB per accelerator
![cn_mic](../img/cn_mic-1.jpg)
![(source Silicon Graphics International Corp.)](../img/sgi-c1104-gp1.jpeg)
![cn_mic](../img/cn_mic.jpg)
### UV 2000
* codename "UV2000"
* 1 node
* 112 cores in total
* 14 x Intel Xeon E5-4627v2, 8-core, 3.3 GHz processors, in 14 NUMA nodes
* 3328 GB of physical memory per node
* 1 x NVIDIA GM200 (GeForce GTX TITAN X), 12 GB RAM
![](../img/uv-2000.jpeg)
### Compute Nodes Summary
| Node type | Count | Memory | Cores |
| -------------------------- | ----- | ----------------- | ----------------------------------- |
| Nodes without accelerator | 576 | 128 GB | 24 @ 2.5GHz |
| Nodes with MIC accelerator | 432 | 128 GB, MIC 32GB | 24 @ 2.5GHz, MIC 61 @ 1.238 GHz |
| UV2000 SMP node | 1 | 3328GB | 112 @ 3.3GHz |
## Processor Architecture
Salomon is equipped with Intel Xeon E5-2680v3 processors. The processors support the 256-bit Advanced Vector Extensions 2.0 (AVX2) instruction set.
### Intel Xeon E5-2680v3 Processor
* 12-core
* speed: 2.5 GHz, up to 3.3 GHz using Turbo Boost Technology
* peak performance: 40 GFLOP/s per core @ 2.5 GHz
* caches:
* Intel® Smart Cache: 30 MB
* memory bandwidth at the level of the processor: 68 GB/s
### MIC Accelerator Intel Xeon Phi 7120P Processor
* 61-core
* speed: 1.238 GHz, up to 1.333 GHz using Turbo Boost Technology
* peak performance: 18.4 GFLOP/s per core
* caches:
* L2: 30.5 MB
* memory bandwidth at the level of the processor: 352 GB/s
## Memory Architecture
Memory is equally distributed across all CPUs and cores for optimal performance. Memory is composed of memory modules of the same size and evenly distributed across all memory controllers and memory channels.
### Compute Node Without Accelerator
* 2 sockets
* Memory Controllers are integrated into processors.
* 8 DDR4 DIMMs per node
* 4 DDR4 DIMMs per CPU
* 1 DDR4 DIMM per channel
* Populated memory: 8 x 16 GB DDR4 DIMMs, 2133 MHz
### Compute Node With MIC Accelerator
* 2 sockets
* Memory Controllers are integrated into processors.
* 8 DDR4 DIMMs per node
* 4 DDR4 DIMMs per CPU
* 1 DDR4 DIMM per channel
* Populated memory: 8 x 16 GB DDR4 DIMMs, 2133 MHz

MIC Accelerator Intel Xeon Phi 7120P Processor:

* 2 sockets
* Memory Controllers are connected via an Interprocessor Network (IPN) ring.
* 16 GDDR5 DIMMs per node
* 8 GDDR5 DIMMs per CPU
* 2 GDDR5 DIMMs per channel
[1]: ib-single-plane-topology.md
# Hardware Overview
## Introduction
The Salomon cluster consists of 1008 computational nodes, of which 576 are regular compute nodes and 432 are accelerated nodes. Each node is a powerful x86-64 computer equipped with 24 cores (two twelve-core Intel Xeon processors) and 128 GB of RAM. The nodes are interlinked by high-speed InfiniBand and Ethernet networks. All nodes share a 0.5 PB /home NFS disk storage to store user files. Users may also use a DDN Lustre shared storage with a capacity of 1.69 PB, which is available for scratch project data. User access to the Salomon cluster is provided by four login nodes.
Read [more about][1] the schematic representation of the Salomon cluster compute nodes' IB topology.
![Salomon](../img/salomon-2.jpg)
The parameters are summarized in the following tables:
## General Information
| **In general** | |
| ------------------------------------------- | ------------------------------------------- |
| Primary purpose | High Performance Computing |
| Architecture of compute nodes | x86-64 |
| Operating system | CentOS 7.x Linux |
| [**Compute nodes**][2] | |
| Total | 1008 |
| Processor | 2 x Intel Xeon E5-2680v3, 2.5 GHz, 12 cores |
| RAM | 128GB, 5.3 GB per core, DDR4@2133 MHz |
| Local disk drive | no |
| Compute network / Topology | InfiniBand FDR56 / 7D Enhanced hypercube |
| w/o accelerator | 576 |
| MIC accelerated | 432 |
| **In total** | |
| Total theoretical peak performance (Rpeak) | 2011 TFLOP/s |
| Total amount of RAM | 129.024 TB |
## Compute Nodes
| Node | Count | Processor | Cores | Memory | Accelerator |
| --------------- | ----- | --------------------------------- | ----- | ------ | --------------------------------------------- |
| w/o accelerator | 576 | 2 x Intel Xeon E5-2680v3, 2.5 GHz | 24 | 128 GB | - |
| MIC accelerated | 432 | 2 x Intel Xeon E5-2680v3, 2.5 GHz | 24 | 128 GB | 2 x Intel Xeon Phi 7120P, 61 cores, 16 GB RAM |
For more details, refer to the [Compute nodes][2] section.
## Remote Visualization Nodes
For remote visualization, two nodes with NICE DCV software are available, each configured as follows:
| Node | Count | Processor | Cores | Memory | GPU Accelerator |
| ------------- | ----- | --------------------------------- | ----- | ------ | ----------------------------- |
| visualization | 2 | 2 x Intel Xeon E5-2695v3, 2.3 GHz | 28 | 512 GB | NVIDIA QUADRO K5000, 4 GB RAM |
## SGI UV 2000
For large memory computations, a special SMP/NUMA SGI UV 2000 server is available:
| Node | Count | Processor | Cores | Memory | Extra HW |
| ------ | ----- | ------------------------------------------- | ----- | --------------------- | ------------------------------------------------------------------------ |
| UV2000 | 1 | 14 x Intel Xeon E5-4627v2, 3.3 GHz, 8 cores | 112 | 3328 GB DDR3@1866 MHz | 2 x 400GB local SSD, 1x NVIDIA GM200 (GeForce GTX TITAN X), 12 GB RAM |
![](../img/uv-2000.jpeg)
[1]: ib-single-plane-topology.md
[2]: compute-nodes.md
# IB Single-Plane Topology
A complete M-Cell assembly consists of four compute racks. Each rack contains 4 physical IRUs (Independent Rack Units). Using one dual-socket node per blade slot leads to 8 logical IRUs. Each rack contains 4 x 2 SGI ICE X IB Premium Blades.
The SGI ICE X IB Premium Blade provides the first level of interconnection via a dual 36-port Mellanox FDR InfiniBand ASIC switch, with connections as follows:
* 9 ports from each switch chip connect to the unified backplane, to connect the 18 compute node slots
* 3 ports on each chip provide connectivity between the chips
* 24 ports from each switch chip connect to the external bulkhead, for a total of 48
## IB Single-Plane Topology - ICEX M-Cell
Each color in each physical IRU represents one dual-switch ASIC switch.
[IB single-plane topology - ICEX Mcell.pdf][1]
![IB single-plane topology - ICEX Mcell.pdf](../img/IBsingleplanetopologyICEXMcellsmall.png)
## IB Single-Plane Topology - Accelerated Nodes
Each of the 3 interconnected D racks is equivalent to one half of an M-Cell rack. The 18 D racks with MIC-accelerated nodes [r21-r38] are equivalent to 3 M-Cell racks, as shown in the [7D Enhanced Hypercube][2] diagram.
As shown in the [IB Topology][3] diagram:
* Racks 21, 22, 23, 24, 25, 26 are equivalent to one M-Cell rack.
* Racks 27, 28, 29, 30, 31, 32 are equivalent to one M-Cell rack.
* Racks 33, 34, 35, 36, 37, 38 are equivalent to one M-Cell rack.
[IB single-plane topology - Accelerated nodes.pdf][4]
![IB single-plane topology - Accelerated nodes.pdf](../img/IBsingleplanetopologyAcceleratednodessmall.png)
[1]: ../src/IB_single-plane_topology_-_ICEX_Mcell.pdf
[2]: 7d-enhanced-hypercube.md
[3]: 7d-enhanced-hypercube.md#ib-topology
[4]: ../src/IB_single-plane_topology_-_Accelerated_nodes.pdf
# Introduction
Welcome to the Salomon supercomputer cluster. The Salomon cluster consists of 1009 compute nodes, totaling 24192 compute cores with 129 TB of RAM, and gives over 2 PFLOP/s of theoretical peak performance. Each node is a powerful x86-64 computer equipped with 24 cores and at least 128 GB of RAM. Nodes are interconnected through a 7D Enhanced hypercube InfiniBand network and are equipped with Intel Xeon E5-2680v3 processors. The Salomon cluster consists of 576 nodes without accelerators and 432 nodes equipped with Intel Xeon Phi MIC accelerators. Read more in the [Hardware Overview][1].
The cluster runs with a [CentOS Linux][a] operating system, which is compatible with the Red Hat [Linux family][b].
## Water-Cooled Compute Nodes With MIC Accelerators
![](../img/salomon.jpg)
![](../img/salomon-1.jpeg)
## Tape Library T950B
![](../img/salomon-3.jpeg)
![](../img/salomon-4.jpeg)
[1]: hardware-overview.md
[a]: http://www.bull.com/bullx-logiciels/systeme-exploitation.html
[b]: http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg
# Network
All compute and login nodes of Salomon are interconnected by the 7D Enhanced hypercube [InfiniBand][a] network and by the Gigabit [Ethernet][b] network. Only the [InfiniBand][c] network may be used to transfer user data.
## InfiniBand Network
All compute and login nodes of Salomon are interconnected by the 7D Enhanced hypercube [InfiniBand][a] network (56 Gbps). The network topology is a [7D Enhanced hypercube][1].
Read more about the schematic representation of the Salomon cluster's [IB single-plane topology][2] ([hypercube dimension][1]).
The compute nodes may be accessed via the InfiniBand network using the ib0 network interface, in the address range 10.17.0.0 (mask 255.255.224.0). MPI may be used to establish a native InfiniBand connection among the nodes.
The network provides **2170 MB/s** transfer rates via a TCP connection (single stream) and up to **3600 MB/s** via the native InfiniBand protocol.
## Example
```console
$ qsub -q qexp -l select=4:ncpus=16 -N Name0 ./myjob
$ qstat -n -u username
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -- |---|---| ------ --- --- ------ ----- - -----
15209.isrv5 username qexp Name0 5530 4 96 -- 01:00 R 00:00
r4i1n0/0*24+r4i1n1/0*24+r4i1n2/0*24+r4i1n3/0*24
```
In this example, we access the node r4i1n0 over the InfiniBand network via the ib0 interface.
```console
$ ssh 10.17.35.19
```
In this example, we get information about the InfiniBand network.
```console
$ ifconfig
....
inet addr:10.17.35.19....
....
$ ip addr show ib0
....
inet 10.17.35.19....
....
```
[1]: 7d-enhanced-hypercube.md
[2]: ib-single-plane-topology.md
[a]: http://en.wikipedia.org/wiki/InfiniBand
[b]: http://en.wikipedia.org/wiki/Ethernet
[c]: http://en.wikipedia.org/wiki/InfiniBand
# CLP
## Introduction
Clp (Coin-or linear programming) is an open-source linear programming solver written in C++. It is primarily meant to be used as a callable library, but a basic, stand-alone executable version is also available.
Clp ([projects.coin-or.org/Clp][1]) is part of the COIN-OR (Computational Infrastructure for Operations Research) project ([projects.coin-or.org/][2]).
## Modules
Clp, version 1.16.10, is available on Salomon via the Clp module:
```console
$ ml Clp
```
The module sets up environment variables required for linking and running applications using Clp. This particular command loads the default module Clp/1.16.10-intel-2017a, the Intel module intel/2017a, and other related modules.
## Compiling and Linking
!!! note
Link with -lClp
Load the Clp module and link your code against Clp using the -lClp switch.
```console
$ ml Clp
$ icc myprog.c -o myprog.x -Wl,-rpath=$LIBRARY_PATH -lClp
```
## Example
An example of a Clp-enabled application follows. In this example, the library solves a linear programming problem loaded from a file.
```cpp
#include "coin/ClpSimplex.hpp"

int main(int argc, const char *argv[])
{
    ClpSimplex model;
    int status;

    // Read the LP problem in MPS format: from the file given as the first
    // argument, or from the bundled example problem if no argument is given.
    if (argc < 2)
        status = model.readMps("/apps/all/Clp/1.16.10-intel-2017a/lib/p0033.mps");
    else
        status = model.readMps(argv[1]);

    // Solve the problem using the primal simplex method.
    if (!status) {
        model.primal();
    }
    return 0;
}
```
### Load Modules and Compile:
```console
ml Clp
icc lp.c -o lp.x -Wl,-rpath=$LIBRARY_PATH -lClp
```
In this example, the lp.c code is compiled using the Intel compiler and linked with Clp. To run the code, the Intel module has to be loaded.
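For illustration, running the resulting binary might look as follows (the problem file name is a placeholder; without an argument, the example falls back to the bundled p0033.mps):

```console
$ ml Clp
$ ./lp.x my_problem.mps
```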
[1]: https://projects.coin-or.org/Clp
[2]: https://projects.coin-or.org/
# Storage
## Introduction
There are two main shared file systems on the Salomon cluster: [HOME][1] and [SCRATCH][2].
All login and compute nodes may access the same data on the shared file systems. Compute nodes are also equipped with local (non-shared) scratch, ramdisk, and tmp file systems.
## Policy (In a Nutshell)
!!! note
* Use [HOME][1] for your most valuable data and programs.
* Use [WORK][3] for your large project files.
* Use [TEMP][4] for large scratch data.
!!! warning
Do not use for [archiving][5]!
## Archiving
Do not use the shared file systems as a backup for large amounts of data or as a long-term archiving solution. Academic staff and students of research institutions in the Czech Republic can use the [CESNET storage service][6], which is available via SSHFS.
## Shared File Systems
The Salomon computer provides two main shared file systems, the [HOME file system][7] and the [SCRATCH file system][8]. The SCRATCH file system is partitioned into the [WORK and TEMP workspaces][9]. The HOME file system is realized as a tiered NFS disk storage. The SCRATCH file system is realized as a parallel Lustre file system. Both shared file systems are accessible via the InfiniBand network. Extended ACLs are provided on both HOME/SCRATCH file systems for sharing data with other users using fine-grained control.
### HOME File System
The HOME file system is realized as a tiered file system, exported via NFS. The first tier has a capacity of 100 TB, the second tier has a capacity of 400 TB. The file system is available on all login and computational nodes. The HOME file system hosts the [HOME workspace][1].
### SCRATCH File System
The architecture of Lustre on Salomon is composed of two metadata servers (MDS) and six data/object storage servers (OSS). Accessible capacity is 1.69 PB, shared among all users. The SCRATCH file system hosts the [WORK and TEMP workspaces][9].
Configuration of the SCRATCH Lustre storage
* SCRATCH Lustre object storage
* Disk array SFA12KX
* 540 x 4 TB SAS 7.2krpm disk
* 54 x OST of 10 disks in RAID6 (8+2)
* 15 x hot-spare disk
* 4 x 400 GB SSD cache
* SCRATCH Lustre metadata storage
* Disk array EF3015
* 12 x 600 GB SAS 15 krpm disk
### Understanding the Lustre File Systems
A user file on the Lustre file system can be divided into multiple chunks (stripes) and stored across a subset of the object storage targets (OSTs) (disks). The stripes are distributed among the OSTs in a round-robin fashion to ensure load balancing.
When a client (a compute node from your job) needs to create or access a file, the client queries the metadata server (MDS) and the metadata target (MDT) for the layout and location of the file's stripes. Once the file is opened and the client obtains the striping information, the MDS is no longer involved in the file I/O process. The client interacts directly with the object storage servers (OSSes) and OSTs to perform I/O operations such as locking, disk allocation, storage, and retrieval.
If multiple clients try to read and write the same part of a file at the same time, the Lustre distributed lock manager enforces coherency so that all clients see consistent results.
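Striping can be inspected and adjusted per file or directory with the standard `lfs` utility; a brief illustration (the directory path is a placeholder):

```console
$ lfs getstripe /scratch/temp/my_dataset        # show the current stripe count and OST layout
$ lfs setstripe -c 8 /scratch/temp/my_dataset   # stripe new files in this directory over 8 OSTs
```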
## Disk Usage and Quota Commands
Disk usage and user quotas can be checked and reviewed using the following command:
```console
$ it4i-disk-usage
```
Example for Salomon:
```console
$ it4i-disk-usage -h
# Using human-readable format
# Using power of 1000 for space
# Using power of 1000 for entries
Filesystem: /home
Space used: 110GB
Space limit: 250GB
Entries: 40K
Entries limit: 500K
# based on filesystem quota
Filesystem: /scratch
Space used: 377GB
Space limit: 100TB
Entries: 14K
Entries limit: 10M
# based on Lustre quota
Filesystem: /scratch
Space used: 377GB
Entries: 14K
# based on Robinhood
Filesystem: /scratch/work
Space used: 377GB
Entries: 14K
Entries: 40K
Entries limit: 1.0M
# based on Robinhood
Filesystem: /scratch/temp
Space used: 12K
Entries: 6
# based on Robinhood
```
In this example, we view the current size limits and the space occupied on the /home and /scratch file systems for the particular user executing the command.
Note that limits are also imposed on the number of objects (files, directories, links, etc.) that the user is allowed to create.
To have a better understanding of where exactly the space is used, use the following command:
```console
$ du -hs dir
```
Example for your HOME directory:
```console
$ cd /home
$ du -hs * .[a-zA-Z0-9]* | grep -E "[0-9]*G|[0-9]*M" | sort -hr
258M cuda-samples
15M .cache
13M .mozilla
5,5M .eclipse
2,7M .idb_13.0_linux_intel64_app
```
This will list all directories with megabytes or gigabytes of consumed space in your current (in this example HOME) directory. The list is sorted in descending order from largest to smallest files/directories.
To have a better understanding of the previous commands, read the man pages:
```console
$ man lfs
```
```console
$ man du
```
## Extended Access Control List (ACL)
Extended ACLs provide another security mechanism besides the standard POSIX ACLs, which are defined by three entries (for owner/group/others). Extended ACLs have more than the three basic entries. In addition, they also contain a mask entry and may contain any number of named user and named group entries.
ACLs on a Lustre file system work exactly like ACLs on any Linux file system. They are manipulated with the standard tools in the standard manner.
For more information, see the [Access Control List][11] section of the documentation.
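As a brief illustration (the collaborator login and directory path are placeholders), extended ACLs are manipulated with the standard `setfacl`/`getfacl` tools:

```console
$ setfacl -m user:collaborator_login:rwx /scratch/work/user/projectid/shared
$ getfacl /scratch/work/user/projectid/shared
```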
## Shared Workspaces
### Home
Users' home directories /home/username reside on the HOME file system. Accessible capacity is 0.5 PB, shared among all users. Individual users are restricted by file system usage quotas, set to 250 GB per user. If 250 GB proves insufficient for a particular user, contact [support][d]; the quota may be lifted upon request.
!!! note
The HOME file system is intended for preparation, evaluation, processing and storage of data generated by active Projects.
The HOME should not be used to archive data of past Projects or other unrelated data.
The files on HOME will not be deleted until the end of the user's lifecycle.
The workspace is backed up, such that it can be restored in case of a catastrophic failure resulting in significant data loss. This backup, however, is not intended to restore old versions of user data or (accidentally) deleted files.
| HOME workspace | |
| ----------------- | -------------- |
| Accesspoint | /home/username |
| Capacity | 500TB |
| Throughput | 6GB/s |
| User space quota | 250GB |
| User inodes quota | 500K |
| Protocol | NFS, 2-Tier |
### Scratch
The SCRATCH is realized as a Lustre parallel file system and is available from all login and computational nodes. There are 54 OSTs dedicated to the SCRATCH file system.
Accessible capacity is 1.6 PB, shared among all users on TEMP and WORK. Individual users are restricted by file system usage quotas, set to 10M inodes and 100 TB per user. The purpose of this quota is to prevent runaway programs from filling the entire file system and denying service to other users. Should 100 TB of space or 10M inodes prove insufficient, contact [support][d]; the quota may be lifted upon request.
#### Work
The WORK workspace resides on the SCRATCH file system. Users may create subdirectories and files in the **/scratch/work/project/projectid** directory. The directory is accessible to all users involved in the `projectid` project.
!!! note
The WORK workspace is intended for storing users' project data as well as for high-performance access to input and output files. All project data should be removed once the project is finished. The data on the WORK workspace are not backed up.
Files on the WORK file system are **persistent** (not automatically deleted) throughout duration of the project.
#### Temp
The TEMP workspace resides on the SCRATCH file system. The TEMP workspace accesspoint is /scratch/temp. Users may freely create subdirectories and files on the workspace. Accessible capacity is 1.6 PB, shared among all users on TEMP and WORK.
!!! note
The TEMP workspace is intended for temporary scratch data generated during the calculation as well as for high performance access to input and output files. All I/O intensive jobs must use the TEMP workspace as their working directory.
Users are advised to save the necessary data from the TEMP workspace to HOME or WORK after the calculations and clean up the scratch files.
!!! warning
Files on the TEMP file system that are **not accessed for more than 90 days** will be automatically **deleted**.
<table>
<tr>
<td style="background-color: rgba(0, 0, 0, 0.54); color: white;"></td>
<td style="background-color: rgba(0, 0, 0, 0.54); color: white;">WORK workspace</td>
<td style="background-color: rgba(0, 0, 0, 0.54); color: white;">TEMP workspace</td>
</tr>
<tr>
<td style="vertical-align : middle">Accesspoints</td>
<td>/scratch/work/user/projectid</td>
<td>/scratch/temp</td>
</tr>
<tr>
<td>Capacity</td>
<td colspan="2" style="vertical-align : middle;text-align:center;">1.6PB</td>
</tr>
<tr>
<td>Throughput</td>
<td colspan="2" style="vertical-align : middle;text-align:center;">30GB/s</td>
</tr>
<tr>
<td>User space quota</td>
<td colspan="2" style="vertical-align : middle;text-align:center;">100TB</td>
</tr>
<tr>
<td>User inodes quota</td>
<td colspan="2" style="vertical-align : middle;text-align:center;">10M</td>
</tr>
<tr>
<td>Number of OSTs</td>
<td colspan="2" style="vertical-align : middle;text-align:center;">54</td>
</tr>
<tr>
<td>Protocol</td>
<td colspan="2" style="vertical-align : middle;text-align:center;">Lustre</td>
</tr>
</table>
## RAM Disk
### Local RAM Disk
Every compute node is equipped with a file system realized in memory, the so-called RAM disk.
The local RAM disk is mounted as /ramdisk and is accessible to users at the /ramdisk/$PBS_JOBID directory.
The RAM disk is private to a job and local to the node; it is created when the job starts and deleted when the job ends.
!!! note
The local RAM disk directory /ramdisk/$PBS_JOBID will be deleted immediately after the calculation ends. Users should take care to save the output data from within the jobscript (see the example below the table).
The local RAM disk file system is intended for temporary scratch data generated during the calculation as well as
for high-performance access to input and output files. The size of the RAM disk file system is limited;
it is not recommended to allocate a large amount of memory and use a large amount of data in the RAM disk file system at the same time.
!!! warning
Be very careful: use of the RAM disk file system is at the expense of operational memory.
| Local RAM disk | |
| ----------- | ------------------------------------------------------------------------------------------------------- |
| Mountpoint | /ramdisk |
| Accesspoint | /ramdisk/$PBS_JOBID |
| Capacity | 110GB |
| Throughput | over 1.5GB/s write, over 5GB/s read (single thread); over 10GB/s write, over 50GB/s read (16 threads) |
| User quota | none |
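A minimal jobscript sketch using the local RAM disk; the input path, executable, and destination directory are placeholders. The output must be copied out before the job ends, because /ramdisk/$PBS_JOBID is deleted automatically:

```bash
#!/bin/bash
# Work inside the per-job local RAM disk
RAMDIR=/ramdisk/$PBS_JOBID
cd "$RAMDIR" || exit 1

# Stage input files (placeholder path)
cp "$HOME"/my-inputs/* .

# Run the I/O-intensive calculation (placeholder executable)
./my_application

# Copy the output to a persistent workspace before the job ends (placeholder destination)
cp results.out /scratch/work/project/projectid/$USER/
```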
### Global RAM Disk
The Global RAM disk spans the local RAM disks of all the nodes within a single job.
For more information, see the [Job Features][12] section.
## Summary
<table>
<tr>
<td style="background-color: rgba(0, 0, 0, 0.54); color: white;">Mountpoint</td>
<td style="background-color: rgba(0, 0, 0, 0.54); color: white;">Usage</td>
<td style="background-color: rgba(0, 0, 0, 0.54); color: white;">Protocol</td>
<td style="background-color: rgba(0, 0, 0, 0.54); color: white;">Net Capacity</td>
<td style="background-color: rgba(0, 0, 0, 0.54); color: white;">Throughput</td>
<td style="background-color: rgba(0, 0, 0, 0.54); color: white;">Space/Inodes quota</td>
<td style="background-color: rgba(0, 0, 0, 0.54); color: white;">Access</td>
<td style="background-color: rgba(0, 0, 0, 0.54); color: white;">Service</td>
</tr>
<tr>
<td>/home</td>
<td>home directory</td>
<td>NFS, 2-Tier</td>
<td>500TB</td>
<td>6GB/s</td>
<td>250GB / 500K</td>
<td>Compute and login nodes</td>
<td>backed up</td>
</tr>
<tr>
<td style="background-color: #D3D3D3;">/scratch/work</td>
<td style="background-color: #D3D3D3;">large project files</td>
<td rowspan="2" style="background-color: #D3D3D3; vertical-align : middle;text-align:center;">Lustre</td>
<td rowspan="2" style="background-color: #D3D3D3; vertical-align : middle;text-align:center;">1.69PB</td>
<td rowspan="2" style="background-color: #D3D3D3; vertical-align : middle;text-align:center;">30GB/s</td>
<td rowspan="2" style="background-color: #D3D3D3; vertical-align : middle;text-align:center;">100TB / 10M</td>
<td style="background-color: #D3D3D3;">Compute and login nodes</td>
<td style="background-color: #D3D3D3;">none</td>
</tr>
<tr>
<td style="background-color: #D3D3D3;">/scratch/temp</td>
<td style="background-color: #D3D3D3;">job temporary data</td>
<td style="background-color: #D3D3D3;">Compute and login nodes</td>
<td style="background-color: #D3D3D3;">files older 90 days removed</td>
</tr>
<tr>
<td>/ramdisk</td>
<td>job temporary data, node local</td>
<td>tmpfs</td>
<td>110GB</td>
<td>90GB/s</td>
<td>none / none</td>
<td>Compute nodes, node local</td>
<td>purged after job ends</td>
</tr>
<tr>
<td style="background-color: #D3D3D3;">/mnt/global_ramdisk</td>
<td style="background-color: #D3D3D3;">job temporary data</td>
<td style="background-color: #D3D3D3;">BeeGFS</td>
<td style="background-color: #D3D3D3;">(N*110)GB</td>
<td style="background-color: #D3D3D3;">3*(N+1)GB/s</td>
<td style="background-color: #D3D3D3;">none / none</td>
<td style="background-color: #D3D3D3;">Compute nodes, job shared</td>
<td style="background-color: #D3D3D3;">purged after job ends</td>
</tr>
</table>
N = number of compute nodes in the job.
[1]: #home
[2]: #shared-filesystems
[3]: #work
[4]: #temp
[5]: #archiving
[6]: ../storage/cesnet-storage.md
[7]: #home-filesystem
[8]: #scratch-filesystem
[9]: #shared-workspaces
[11]: ../storage/standard-file-acl.md
[12]: ../job-features.md#global-ram-disk
[c]: https://access.redhat.com/documentation/en-US/Red_Hat_Storage/2.0/html/Administration_Guide/ch09s05.html
[d]: https://support.it4i.cz/rt
# Visualization Servers
Remote visualization with [NICE DCV software][3] or [VirtualGL][4] is available on two nodes.
| Node | Count | Processor | Cores | Memory | GPU Accelerator |
|---------------|-------|-----------------------------------|-------|--------|------------------------------|
| visualization | 2 | 2 x Intel Xeon E5-2695v3, 2.3 GHz | 28 | 512 GB | NVIDIA QUADRO K5000 4 GB |
## Resource Allocation Policy
| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime |
|-------|----------------|-------------------|-------|-----------|----------|---------------|----------|
| qviz Visualization queue | yes | none required | 2 (with NVIDIA Quadro K5000) | 4 | 150 | no | 1h/8h |
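A minimal sketch of requesting a visualization node via the qviz queue with PBS; the project ID, core count, and walltime are placeholders to adjust:

```bash
# Start an interactive session on a visualization node
qsub -q qviz -A PROJECT-ID -l select=1:ncpus=4 -l walltime=01:00:00 -I
```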
## References
* [Graphical User Interface][1]
* [VPN Access][2]
[1]: ../general/shell-and-data-access.md#graphical-user-interface
[2]: ../general/shell-and-data-access.md#vpn-access
[3]: ../software/viz/NICEDCVsoftware.md
[4]: ../software/viz/vgl.md
# Diagnostic Component (TEAM)
## Access
TEAM is available at the [following address][a].
!!! note
The address is accessible only via VPN.
## Diagnostic Component
VCF files are scanned by this diagnostic tool for known diagnostic disease-associated variants. When no diagnostic mutation is found, the file can be sent to the disease-causing gene discovery tool to see whether new disease-associated variants can be found.
TEAM (27) is an intuitive and easy-to-use web tool that fills the gap between predicted mutations and the final diagnosis in targeted enrichment sequencing analysis. The tool searches the patient's predicted variants for known diagnostic mutations corresponding to a disease panel. Diagnostic variants for the disease are taken from four databases of disease-related variants (HGMD, HUMSAVAR, ClinVar, and COSMIC). If no primary diagnostic variant is found, a list of secondary findings that can help to establish a diagnosis is produced. TEAM also provides an interface for the definition and customization of panels, by means of which genes and mutations can be added or discarded to adjust panel definitions.
![Interface of the application. Panels defining targeted regions of interest can be set up by simply dragging and dropping known disease genes or disease definitions from the lists. Thus, virtual panels can be interactively improved as the knowledge of the disease increases.](../../../img/fig5.png)
**Figure 5.** Interface of the application. Panels defining targeted regions of interest can be set up by simply dragging and dropping known disease genes or disease definitions from the lists. Thus, virtual panels can be interactively improved as the knowledge of the disease increases.
[a]: http://omics.it4i.cz/team/