# Documentation
Welcome to the IT4Innovations documentation.
The IT4Innovations National Supercomputing Center operates the [Karolina][1] and [Barbora][2] supercomputers.
The supercomputers are available to the academic community within the Czech Republic and Europe, and the industrial community worldwide.
The purpose of these pages is to provide comprehensive documentation of the hardware, software, and usage of the computers.
## How to Read the Documentation
1. Select the subject of interest from the left column or use the Search tool in the upper right corner.
1. Scan for all the notes and reminders on the page.
1. If more information is needed, read the details and **look for examples** illustrating the concepts.
## Required Proficiency
!!! note
    Basic proficiency in Linux environments is required.
In order to use the system for your calculations, you need basic proficiency in Linux environments.
To gain this proficiency, we recommend you read the [introduction to Linux][a] operating system environments,
and install a Linux distribution on your personal computer.
For example, the [CentOS][b] distribution is similar to systems on the clusters at IT4Innovations and it is easy to install and use,
but any Linux distribution would do.
!!! note
    Learn how to parallelize your code.
In many cases, you will run your own code on the cluster.
In order to fully exploit the cluster, you will need to carefully consider how to utilize all the cores available on the node
and how to use multiple nodes at the same time.
You need to **parallelize** your code.
Proficiency in MPI, OpenMP, CUDA, UPC, or GPI2 programming may be gained via [training provided by IT4Innovations][c].
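As a minimal sketch of what running a parallelized code looks like in practice (the module, project ID, and program names are illustrative, not a prescription):

```console
$ ml OpenMPI                    # load an MPI toolchain; available module names may differ
$ mpicc -O2 my_code.c -o my_code
$ srun -A PROJECT-ID --nodes 2 --ntasks-per-node 128 ./my_code   # one MPI rank per core on 128-core nodes
```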
## Terminology Frequently Used on These Pages
* **node:** a computer interconnected via a network with other computers – computational nodes are powerful computers designed for and dedicated to executing demanding scientific computations.
* **core:** a processor core, the unit of a processor that executes computations
* **node-hour:** a metric of computer utilization, [see definition][3].
* **job:** a calculation running on the supercomputer – the job allocates and utilizes the resources of the supercomputer for a certain time.
* **HPC:** High Performance Computing
* **HPC (computational) resources:** node-hours, storage capacity, software licenses
* **code:** a program
* **primary investigator (PI):** a person responsible for the execution of a computational project and the utilization of the computational resources allocated to that project
* **collaborator:** a person participating in the execution of a computational project and the utilization of the computational resources allocated to that project
* **project:** a computational project under investigation by the PI – the project is identified by the project ID. Computational resources are allocated and charged per project.
* **jobscript:** a script to be executed by the Slurm workload manager
## Conventions
In this documentation, you will find a number of examples.
We use the following conventions:
Cluster command prompt:
```console
$
```
Your local Linux host command prompt:
```console
local $
```
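For example, a session that starts on your local machine and continues on the cluster would be shown as follows (the username and login address are illustrative):

```console
local $ ssh user123@karolina.it4i.cz
$ pwd
/home/user123
```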
## Errors
Although we have taken every care to ensure the accuracy of the content, mistakes do happen.
If you find an inconsistency or error, please report it by visiting [support][d], creating a new ticket, and entering the details.
By doing so, you can save other readers from frustration and help us improve.
[1]: karolina/introduction.md
[2]: barbora/introduction.md
[3]: general/resources-allocation-policy.md#resource-accounting-policy
[a]: http://www.tldp.org/LDP/intro-linux/html/
[b]: http://www.centos.org/
[c]: https://www.it4i.cz/en/education/training-activities
[d]: http://support.it4i.cz/rt
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ Copyright (c) 2002-2017 iterate GmbH. All rights reserved.
~ https://cyberduck.io/
~
~ This program is free software; you can redistribute it and/or modify
~ it under the terms of the GNU General Public License as published by
~ the Free Software Foundation; either version 2 of the License, or
~ (at your option) any later version.
~
~ This program is distributed in the hope that it will be useful,
~ but WITHOUT ANY WARRANTY; without even the implied warranty of
~ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
~ GNU General Public License for more details.
-->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Protocol</key>
<string>irods</string>
<key>Vendor</key>
<string>it4i</string>
<key>Description</key>
<string>it4iInnovations - VSB - TU Ostrava</string>
<key>Hostname Configurable</key>
<false/>
<key>Port Configurable</key>
<false/>
<key>Default Hostname</key>
<string>irods.it4i.cz</string>
<key>Region</key>
<string>IT4I:it4iResc</string>
<key>Default Port</key>
<string>1247</string>
<key>Authorization</key>
<string>PAM</string>
</dict>
</plist>
{
"irods_host": "irods.it4i.cz",
"irods_port": 1247,
"irods_user_name": "some_user",
"irods_zone_name": "IT4I",
"irods_authentication_scheme": "pam_password",
"irods_ssl_verify_server": "cert",
"irods_ssl_ca_certificate_file": "~/.irods/chain_geant_ov_rsa_ca_4_full.pem",
"irods_encryption_algorithm": "AES-256-CBC",
"irods_encryption_key_size": 32,
"irods_encryption_num_hash_rounds": 16,
"irods_encryption_salt_size": 8
}
# Job Features
Special features are installed or configured on the fly on allocated nodes. They are requested in a Slurm job using specially formatted comments.
```console
$ salloc ... --comment "use:feature=req"
```
or
```
#SBATCH --comment "use:feature=req"
```
or for multiple features
```console
$ salloc ... --comment "use:feature1=req1 use:feature2=req2 ..."
```
where `feature` is a feature name and `req` is a requested value (`true`, a version string, etc.).
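For example, a feature request embedded in a complete jobscript might look like this (a sketch; the project ID, partition `qcpu`, feature, and application are illustrative):

```console
$ cat myjob.sh
#!/bin/bash
#SBATCH -A PROJECT-ID
#SBATCH -p qcpu
#SBATCH --nodes 1
#SBATCH --comment "use:feature=req"
srun ./my_app
$ sbatch myjob.sh
```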
## Xorg
[Xorg][2] is a free and open source implementation of the X Window System display server maintained by the X.Org Foundation. Xorg is available only on Karolina accelerated nodes Acn[01-72].
```console
$ salloc ... --comment "use:xorg=True"
```
## VTune Support
Load the VTune kernel modules.
```console
$ salloc ... --comment "use:vtune=version_string"
```
`version_string` is the VTune version, e.g. `2019_update4`.
## Global RAM Disk
!!! warning
    The feature has not been implemented on Slurm yet.
The Global RAM disk deploys the BeeGFS On Demand parallel filesystem,
using the allocated nodes' local RAM disks as the storage backend.
The Global RAM disk is mounted at `/mnt/global_ramdisk`.
```console
$ salloc ... --comment "use:global_ramdisk=true"
```
![Global RAM disk](../img/global_ramdisk.png)
### Example
```console
$ sbatch -A PROJECT-ID -p qcpu --nodes 4 --comment="use:global_ramdisk=true" ./jobscript
```
This command submits a 4-node job in the `qcpu` queue;
once running, a RAM disk shared across the 4 nodes will be created.
The RAM disk will be accessible at `/mnt/global_ramdisk`
and files written to this RAM disk will be visible on all 4 nodes.
The file system is private to a job and shared among the nodes,
created when the job starts and deleted at the job's end.
!!! warning
    The Global RAM disk will be deleted immediately after the calculation ends.
    Users should take care to save the output data from within the jobscript.
The files on the Global RAM disk will be striped equally across all the nodes, using a 512k stripe size.
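A minimal jobscript sketch that saves results off the RAM disk before the job ends (the application and output directory names are illustrative):

```console
$ cat ramdisk_job.sh
#!/bin/bash
#SBATCH -A PROJECT-ID
#SBATCH -p qcpu
#SBATCH --nodes 4
#SBATCH --comment "use:global_ramdisk=true"
cd /mnt/global_ramdisk
srun ./my_app                       # writes its output to the shared RAM disk
cp -r output/ "$SLURM_SUBMIT_DIR"/  # save results before the RAM disk is deleted
```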
Check the Global RAM disk status:
```console
$ beegfs-df -p /mnt/global_ramdisk
$ beegfs-ctl --mount=/mnt/global_ramdisk --getentryinfo /mnt/global_ramdisk
```
Use the Global RAM disk if you need very large RAM disk space.
The Global RAM disk allows for high performance sharing of data among compute nodes within a job.
!!! warning
    Use of the Global RAM disk file system comes at the expense of the nodes' operational memory (RAM).
| Global RAM disk | |
| ------------------ | --------------------------------------------------------------------------|
| Mountpoint | /mnt/global_ramdisk |
| Accesspoint | /mnt/global_ramdisk |
| Capacity | Barbora (Nx180)GB |
| User quota | none |
N = number of compute nodes in the job.
!!! warning
    Available on Barbora nodes only.
## MSR-SAFE Support
Load a kernel module that allows saving/restoring values of MSR registers.
Uses [LLNL MSR-SAFE][a].
```console
$ salloc ... --comment "use:msr=version_string"
```
`version_string` is the MSR-SAFE version, e.g. `1.4.0`.
!!! danger
    Hazardous: this feature causes CPU frequency disruption.
!!! warning
    Available on Barbora nodes only.
!!! warning
    It is recommended to combine this with the `mon-flops=off` feature.
## Cluster Monitoring
Disable monitoring of certain registers that are used to collect performance
monitoring counter (PMC) values, such as CPU FLOPs or memory bandwidth:
```console
$ salloc ... --comment "use:mon-flops=off"
```
!!! warning
    Available on Karolina nodes only.
## HDEEM Support
Load the HDEEM software stack. The [High Definition Energy Efficiency Monitoring][b] (HDEEM) library is a software interface used to measure power consumption of HPC clusters with bullx blades.
```console
$ salloc ... --comment "use:hdeem=version_string"
```
`version_string` is the HDEEM version, e.g. `2.2.8-1`.
!!! warning
    Available on Barbora nodes only.
## NVMe Over Fabrics File System
!!! warning
    The feature has not been implemented on Slurm yet.
Attach a volume from the NVMe storage and mount it as a file system. The file system is mounted at `/mnt/nvmeof` (on the first node of the job).
The Barbora cluster provides two NVMe-oF storage nodes equipped with NVMe disks. Each storage node contains seven 1.6 TB NVMe disks and provides a net aggregated capacity of 10.18 TiB. Storage space is provided using the NVMe over Fabrics protocol; an RDMA network (InfiniBand) is used for data transfers.
```console
$ salloc ... --comment "use:nvmeof=size"
```
`size` is the size of the requested volume; the usual size suffixes are used, e.g. `10t`.
To create a shared file system on the attached NVMe volume and make it available on all nodes of the job, append `:shared` to the size specification. The shared file system is mounted at `/mnt/nvmeof-shared`.
```console
$ salloc ... --comment "use:nvmeof=size:shared"
```
For example:
```console
$ salloc ... --comment "use:nvmeof=10t:shared"
```
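Once the job is running, the shared volume behaves like an ordinary directory on every node of the job. A short usage sketch, run from within the job (the paths and application are illustrative):

```console
$ df -h /mnt/nvmeof-shared                   # check the mounted capacity
$ cp -r input_data/ /mnt/nvmeof-shared/      # stage working data onto the NVMe volume
$ srun ./my_app /mnt/nvmeof-shared/input_data
```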
!!! warning
    Available on Barbora nodes only.
## Smart Burst Buffer
!!! warning
    The feature has not been implemented on Slurm yet.
Accelerate SCRATCH storage using the Smart Burst Buffer (SBB) technology. A dedicated Burst Buffer process is launched, and Burst Buffer resources (CPUs, memory, flash storage) are allocated on an SBB storage node for acceleration (I/O caching) of SCRATCH data operations. The SBB profile file `/lscratch/$SLURM_JOB_ID/sbb.sh` is created on the first allocated node of the job. For SCRATCH acceleration, the SBB profile file has to be sourced into the shell environment so that the provided environment variables are defined in the process environment. Modified data is written asynchronously to the backend (Lustre) filesystem; writes may still complete after job termination.
The Barbora cluster provides two SBB storage nodes equipped with NVMe disks. Each storage node contains ten 3.2 TB NVMe disks and provides a net aggregated capacity of 29.1 TiB. Acceleration uses an RDMA network (InfiniBand) for data transfers.
```console
$ salloc ... --comment "use:sbb=spec"
```
`spec` specifies the amount of resources requested for the Burst Buffer (CPUs, memory, flash storage); available values are `small`, `medium`, and `large`.
Loading SBB profile:
```console
$ source /lscratch/$SLURM_JOB_ID/sbb.sh
```
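A minimal jobscript sketch combining the feature request and the profile sourcing (the application name is illustrative):

```console
$ cat sbb_job.sh
#!/bin/bash
#SBATCH -A PROJECT-ID
#SBATCH -p qcpu
#SBATCH --nodes 1
#SBATCH --comment "use:sbb=small"
source /lscratch/$SLURM_JOB_ID/sbb.sh   # export the SBB environment variables
srun ./my_io_app                        # SCRATCH I/O is now cached by the burst buffer
```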
!!! warning
    Available on Barbora nodes only.
[1]: software/tools/virtualization.md#tap-interconnect
[2]: general/accessing-the-clusters/graphical-user-interface/xorg.md
[a]: https://software.llnl.gov/news/2019/04/29/msrsafe-1.3.0/
[b]: https://tu-dresden.de/zih/forschung/projekte/hdeem
# Compute Nodes
Karolina is a cluster of x86-64 AMD- and Intel-based nodes built with HPE technology. The cluster contains four types of compute nodes.
## Compute Nodes Without Accelerators
Standard compute nodes without accelerators (such as GPUs or FPGAs) are based on the x86 CPU architecture and provide quick accessibility for the users and their existing codes.
* 720 nodes
* 92,160 cores in total
* 2x AMD EPYC™ 7H12, 64-core, 2.6 GHz processors per node
* 256 GB DDR4 3200MT/s of physical memory per node
* 5,324.8 GFLOP/s per compute node
* 1x 100 Gb/s IB port
* Cn[001-720]
![](img/apolloproliant.png)
## Compute Nodes With a GPU Accelerator
Accelerated compute nodes deliver most of the compute power usable for HPC, as well as excellent performance in HPDA and AI workloads, especially in the training phase of deep neural networks.
* 72 nodes
* 9,216 cores in total
* 2x AMD EPYC™ 7763, 64-core, 2.45 GHz processors per node
* 1024 GB DDR4 3200MT/s of physical memory per node
* 8x GPU accelerator NVIDIA A100 per node, 320GB HBM2 memory per node
* 5,017.6 GFLOP/s per compute node
* 4x 200 Gb/s IB port
* Acn[01-72]
![](img/hpeapollo6500.png)
## Data Analytics Compute Node
The data analytics compute node is oriented towards supporting huge-memory jobs by implementing a NUMA SMP system with a large cache-coherent memory.
* 1x HPE Superdome Flex server
* 768 cores in total
* 32x Intel® Xeon® Platinum, 24-core, 2.9 GHz, 205W
* 24 TB DDR4 2933MT/s of physical memory per node
* 2x 200 Gb/s IB port
* 71.2704 TFLOP/s
* Sdf1
![](img/superdomeflex.png)
## Cloud Compute Node
Cloud compute nodes support both the research and operation of Infrastructure/HPC as a Service. They are intended for the provision and operation of cloud technologies such as OpenStack and Kubernetes.
* 36 nodes
* 4,608 cores in total
* 2x AMD EPYC™ 7H12, 64-core, 2.6 GHz processors per node
* 256 GB DDR4 3200MT/s of physical memory per node
* HPE ProLiant XL225n Gen10 Plus servers
* 5,324.8 GFLOP/s per compute node
* 2x 10 Gb/s Ethernet
* 1x 100 Gb/s IB port
* CLn[01-36]
## Compute Node Summary
| Node type | Count | Range | Memory | Cores |
| ---------------------------- | ----- | ------------ | ------- | -------------- |
| Nodes without an accelerator | 720 | Cn[001-720] | 256 GB | 128 @ 2.6 GHz |
| Nodes with a GPU accelerator | 72 | Acn[01-72] | 1024 GB | 128 @ 2.45 GHz |
| Data analytics nodes | 1 | Sdf1 | 24 TB | 768 @ 2.9 GHz |
| Cloud partition              | 36    | CLn[01-36]   | 256 GB  | 128 @ 2.6 GHz  |
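The per-node peak figures quoted above follow from cores × clock × double-precision FLOP per cycle, assuming 16 DP FLOP/cycle for the AVX2-capable EPYC cores (two 256-bit FMA units) and 32 DP FLOP/cycle for the AVX-512-capable Xeon cores:

$$
\begin{aligned}
\text{Cn / CLn:}\;& 128 \times 2.6\,\text{GHz} \times 16 = 5\,324.8\ \text{GFLOP/s}\\
\text{Acn (CPU portion):}\;& 128 \times 2.45\,\text{GHz} \times 16 = 5\,017.6\ \text{GFLOP/s}\\
\text{Sdf1:}\;& 768 \times 2.9\,\text{GHz} \times 32 = 71\,270.4\ \text{GFLOP/s} \approx 71.27\ \text{TFLOP/s}
\end{aligned}
$$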
## Processor Architecture
Karolina is equipped with AMD EPYC™ 7H12 processors (nodes without accelerators, Cloud partition), AMD EPYC™ 7763 (nodes with accelerators), and Intel Cascade Lake Xeon-SC 8268 (Data analytics partition).
### AMD [Epyc™ 7H12][d]
EPYC™ 7H12 is a 64-bit 64-core x86 server microprocessor designed and introduced by AMD in late 2019. This multi-chip processor, which is based on the Zen 2 microarchitecture, incorporates logic fabricated on the TSMC 7 nm process and I/O fabricated on the GlobalFoundries 14 nm process. The 7H12 has a TDP of 280 W with a base frequency of 2.6 GHz and a boost frequency of up to 3.3 GHz. This processor supports up to two-way SMP and up to 4 TiB of eight-channel DDR4-3200 memory per socket.
* **Family**: EPYC™
* **Cores**: 64
* **Threads**: 128
* **L1I Cache**: 2 MiB, 64x32 KiB, 8-way set associative
* **L1D Cache**: 2 MiB, 64x32 KiB, 8-way set associative
* **L2 Cache**: 32 MiB, 64x512 KiB, 8-way set associative, write-back
* **L3 Cache**: 256 MiB, 16x16 MiB
* **Instructions**: x86-16, x86-32, x86-64, MMX, EMMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE4a, AVX, AVX2, AES, CLMUL, RdRanD, FMA3, F16C, ABM, BMI1, BMI2, AMD-Vi, AMD-V, SHA, ADX, Real, Protected, SMM, FPU, NX, SMT, SME, TSME, SEV, SenseMI, Boost2
* **Frequency**: 2.6 GHz
* **Max turbo**: 3.3 GHz
* **Process**: 7 nm, 14 nm
* **TDP**: 280 W
### AMD [Epyc™ 7763][e]
EPYC 7763 is a 64-bit 64-core x86 server microprocessor designed and introduced by AMD in March 2021. This multi-chip processor, which is based on the Zen 3 microarchitecture, incorporates eight Core Complex Dies fabricated on an advanced TSMC 7 nm process and a large I/O die manufactured by GlobalFoundries. The 7763 has a TDP of 280 W with a base frequency of 2.45 GHz and a boost frequency of up to 3.5 GHz. This processor supports up to two-way SMP and up to 4 TiB of eight-channel DDR4-3200 memory per socket.
* **Family**: EPYC™
* **Cores**: 64
* **Threads**: 128
* **L1I Cache**: 2 MiB, 64x32 KiB, 8-way set associative, write-back
* **L1D Cache**: 2 MiB, 64x32 KiB, 8-way set associative, write-back
* **L2 Cache**: 32 MiB, 64x512 KiB, 8-way set associative, write-back
* **L3 Cache**: 256 MiB, 8x32 MiB, 16-way set associative, write-back
* **Instructions**: x86-16, x86-32, x86-64, MMX, EMMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE4a, AVX, AVX2, AES, CLMUL, RdRanD, FMA3, F16C, ABM, BMI1, BMI2, AMD-Vi, AMD-V, SHA, ADX, Real, Protected, SMM, FPU, NX, SMT, SME, TSME, SEV, SenseMI
* **Frequency**: 2.45 GHz
* **Max turbo**: 3.5 GHz
* **Process**: 7 nm
* **TDP**: 280 W
### Intel [Cascade Lake Platinum 8268][f]
Xeon Platinum 8268 is a 64-bit 24-core x86 high-performance server microprocessor introduced by Intel in early 2019. The Platinum 8268 is based on the Cascade Lake microarchitecture and is manufactured on a 14 nm process. This chip supports 8-way multiprocessing, sports two AVX-512 FMA units as well as three Ultra Path Interconnect links. This microprocessor supports up to 1 TiB of hexa-channel DDR4-2933 memory, operates at 2.9 GHz with a TDP of 205 W, and features a turbo boost frequency of up to 3.9 GHz.
* **Family**: Xeon Platinum
* **Cores**: 24
* **Threads**: 48
* **L1I Cache**: 768 KiB, 24x32 KiB, 8-way set associative
* **L1D Cache**: 768 KiB, 24x32 KiB, 8-way set associative, write-back
* **L2 Cache**: 24 MiB, 24x1 MiB, 16-way set associative, write-back
* **L3 Cache**: 35.75 MiB, 26x1.375 MiB, 11-way set associative, write-back
* **Instructions**: x86-64, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA3, F16C, BMI, BMI2, VT-x, VT-d, TXT, TSX, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVE, SGX, MPX, AVX-512
* **Frequency**: 2.9 GHz
* **Max turbo**: 3.9 GHz
* **Process**: 14 nm
* **TDP**: 205 W
## GPU Accelerator
Karolina's accelerated nodes are equipped with [NVIDIA A100][g] accelerators.
| NVIDIA A100 | |
| --- | --- |
| GPU Architecture | NVIDIA Ampere |
| NVIDIA Tensor Cores | 432 |
| NVIDIA CUDA® Cores | 6912 |
| Double-Precision Performance | 9.7 TFLOP/s |
| Single-Precision Performance | 19.5 TFLOP/s |
| Tensor Performance | 312 TFLOP/s |
| GPU Memory | 40 GB HBM2 |
| Memory Bandwidth | 1555 GB/sec |
| ECC | Yes |
| Interconnect Bandwidth | 600 GB/sec |
| System Interface | NVIDIA NVLink |
| Form Factor | SXM4 |
| Max Power Consumption | 400 W |
| Thermal Solution | Passive |
| Compute APIs | CUDA, DirectCompute, OpenCL™, OpenACC |
[c]: https://en.wikichip.org/wiki/x86/avx512vnni
[d]: https://en.wikichip.org/wiki/amd/epyc/7h12
[e]: https://en.wikichip.org/wiki/amd/epyc/7763
[f]: https://en.wikichip.org/wiki/intel/xeon_platinum/8268
[g]: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/a100-80gb-datasheet-update-nvidia-us-1521051-r2-web.pdf
# Hardware Overview
Karolina consists of 829 computational nodes, of which 720 are universal compute nodes (**Cn[001-720]**), 72 are NVIDIA A100 accelerated nodes (**Acn[01-72]**), 1 is a data analytics node (**Sdf1**), and 36 are cloud compute nodes (**CLn[01-36]**). Each node is a powerful x86-64 computer equipped with 128/768 cores (64-core AMD EPYC™ 7H12 / 64-core AMD EPYC™ 7763 / 24-core Intel Xeon-SC 8268) and at least 256 GB of RAM.
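The node and core counts add up as follows:

$$
720 + 72 + 1 + 36 = 829\ \text{nodes}, \qquad
720 \times 128 + 72 \times 128 + 768 + 36 \times 128 = 106\,752\ \text{cores}.
$$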
[User access][5] to Karolina is provided by four login nodes **login[1-4]**. The nodes are interlinked through high speed InfiniBand and Ethernet networks.
The Accelerated nodes, Data analytics node, and Cloud nodes are available [upon request][a] from a PI. For more information about accessing the nodes, see also the [Resources Allocation Policy][2] section.
For more technical information, see the [Compute Nodes][1] section.
The parameters are summarized in the following tables:
| **In general** | |
| ------------------------------------------- | ---------------------------------------------- |
| Primary purpose | High Performance Computing |
| Architecture of compute nodes | x86-64 |
| Operating system | Linux |
| **Compute nodes** | |
| Total | 829 |
| Processor cores | 128/768 (2x64 cores/32x24 cores) |
| RAM | min. 256 GB |
| Local disk drive | no |
| Compute network | InfiniBand HDR |
| Universal compute node | 720, Cn[001-720] |
| Accelerated compute nodes | 72, Acn[01-72] |
| Data analytics compute nodes | 1, Sdf1 |
| Cloud compute nodes | 36, CLn[01-36] |
| **In total** | |
| Total theoretical peak performance (Rpeak) | 15.7 PFLOP/s |
| Total amount of RAM | 313 TB |
| Node | Processor | Memory | Accelerator |
| ------------------------ | --------------------------------------- | ------ | ---------------------------- |
| Universal compute node | 2 x AMD Zen 2 EPYC™ 7H12, 2.6 GHz | 256 GB | - |
| Accelerated compute node | 2 x AMD Zen 3 EPYC™ 7763, 2.45 GHz | 1024 GB | 8 x NVIDIA A100 (40 GB HBM2) |
| Data analytics node | 32 x Intel Xeon-SC 8268, 2.9 GHz | 24 TB | - |
| Cloud compute node | 2 x AMD Zen 2 EPYC™ 7H12, 2.6 GHz | 256 GB | - |
[1]: compute-nodes.md
[2]: ../general/resources-allocation-policy.md
[3]: network.md
[4]: storage.md
[5]: ../general/shell-and-data-access.md
[6]: visualization.md
[a]: https://support.it4i.cz/rt
# Introduction
!!! important "Karolina Update"
    Karolina has been updated. This includes updates to cluster management tools and new node images with **Rocky Linux 8.9**. Expect new versions of kernels, libraries, and drivers on compute nodes.
Karolina is the latest and most powerful supercomputer cluster built for IT4Innovations in Q2 of 2021. The Karolina cluster consists of 829 compute nodes, totaling 106,752 compute cores with 313 TB RAM, giving over 15.7 PFLOP/s theoretical peak performance.
Nodes are interconnected through a fully non-blocking fat-tree InfiniBand network and are equipped with AMD Zen 2, Zen 3, and Intel Cascade Lake architecture processors. Seventy-two nodes are also equipped with NVIDIA A100 accelerators. Read more in [Hardware Overview][1].
The cluster runs with an operating system compatible with the Red Hat [Linux family][a]. We have installed a wide range of software packages targeted at different scientific domains. These packages are accessible via the [modules environment][2].
Shared file systems for user data and for job data are available to users.
The [Slurm][b] workload manager provides [computing resources allocations and job execution][3].
Read more on how to [apply for resources][4], [obtain login credentials][5] and [access the cluster][6].
[1]: hardware-overview.md
[2]: ../environment-and-modules.md
[3]: ../general/job-submission-and-execution.md
[4]: ../general/applying-for-resources.md
[5]: ../general/obtaining-login-credentials/obtaining-login-credentials.md
[6]: ../general/shell-and-data-access.md
[a]: http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg
[b]: https://slurm.schedmd.com/
# Network
All of the compute and login nodes of Karolina are interconnected through an [InfiniBand][a] HDR 200 Gbps network and a Gigabit Ethernet network.
The compute network is configured as a non-blocking fat tree, which consists of 60 x 40-port Mellanox Quantum™ HDR switches (40 leaf HDR switches and 20 spine HDR switches).
![](img/compute_network_topology_v2.png)<br>*For a higher resolution, open the image in a new browser tab.*
Compute nodes and the service infrastructure are connected by HDR100 technology, which allows one 200 Gbps HDR port (an aggregation of 4x 50 Gbps) to be divided into two HDR100 ports, each with 100 Gbps (2x 50 Gbps) bandwidth. The cabling between the L1 and L2 layers uses HDR cables; end devices are connected with so-called Y (splitter) cables (1x HDR200 to 2x HDR100).
**The compute network has the following parameters**
* 100Gbps
* Latencies less than 10 microseconds (0.6μs end-to-end, <90ns switch hop)
* Adaptive routing support
* MPI communication support
* IP protocol support (IPoIB)
* Support for SCRATCH Data Storage and NVMe over Fabric Data Storage.
## Mellanox Quantum™ QM8790 40-Ports Switch
[Mellanox][b] provides the world’s smartest switch, enabling in-network computing through the Co-Design Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ technology.
QM8790 has the highest fabric performance available in the market with up to 16Tb/s of non-blocking bandwidth with sub-130ns port-to-port latency.
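The aggregate figure follows from the port count, counting both directions of each port:

$$
40 \times 200\ \text{Gb/s} \times 2 = 16\ \text{Tb/s}.
$$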
**Performance**
* 40 x HDR200 200Gb/s ports in a 1U switch
* 80 x HDR100 100Gb/s ports (using splitter cables)
* 16Tb/s aggregate switch throughput
* Sub-130ns switch latency – Optimized design
**Optimized Design**
* 1+1 redundant & hot-swappable power
* N+1 redundant & hot-swappable fans
* 80 Gold+ and Energy Star certified power supplies
**Advanced Design**
* Adaptive routing
* Congestion control
* Collective offloads (Mellanox SHARP™ technology)
* VL mapping (VL2VL)
[a]: http://en.wikipedia.org/wiki/InfiniBand
[b]: https://network.nvidia.com/files/doc-2020/pb-qm8790.pdf
# Storage
The Karolina cluster provides two main shared filesystems, the [HOME filesystem][1] and the [SCRATCH filesystem][2], and also has access to IT4Innovations' central PROJECT storage. All login and compute nodes may access the same data on the shared file systems. Compute nodes are also equipped with local (non-shared) scratch, RAM disk, and TMP file systems.
## Archiving
Shared filesystems should not be used as a backup for large amounts of data or for long-term data storage. The academic staff and students of research institutions in the Czech Republic can use the [CESNET storage][6] service, which is available via SSHFS.
### HOME File System
The HOME filesystem is an HA cluster of two active-passive NFS servers. This filesystem contains users' home directories `/home/username`. The accessible capacity is 31 TB, shared among all users. Individual users are restricted by filesystem usage quotas, set to 25 GB per user. Should 25 GB prove insufficient, contact [support][d]; the quota may be increased upon request.
!!! note
    The HOME filesystem is intended for the preparation, evaluation, processing, and storage of data generated by active projects.
    The files on the HOME filesystem will not be deleted until the end of the [user's lifecycle][4].
The filesystem is backed up, so that it can be restored in case of a catastrophic failure resulting in significant data loss. However, this backup is not intended to restore old versions of user data or to restore deleted files.
| HOME filesystem | |
| -------------------- | ------------------------------ |
| Mountpoint | /home/username |
| Capacity | 31 TB |
| Throughput | 1.93 GB/s write, 3.1 GB/s read |
| User space quota | 25 GB |
| User inodes quota | 500 k |
| Protocol | NFS |
Configuration of the storage:
**2x NFS server HPE ProLiant DL325 Gen10 Plus**
* 1x AMD EPYC 7302P (3.0GHz/16-core/155W)
* 8x 16GB (1x16GB) Dual Rank x8 DDR4-3200 CAS-22-22-22
* 2x 240GB SATA 6G Read Intensive SFF (2.5in) SC SSD – (HW RAID1)
* 1x Smart Array E208i-a SR Gen10 (No Cache) 12G SAS Modular LH Controller
* 1x HPE SN1100Q 16Gb Dual Port Fibre Channel Host Bus Adapter
* 1x Intel I350-T4 Ethernet 1Gb 4-port BASE-T OCP3 Adapter
* ILO5
* 1x InfiniBand HDR100/Ethernet 100Gb 2-port QSFP56 PCIe4 x16 MCX653106A-ECAT Adapter
* 2x 500W Flex Slot Platinum Hot Plug Low Halogen Power Supply Kit
* OS: Red Hat Enterprise Linux Server
**1x Storage array HPE MSA 2060 16Gb Fibre Channel SFF Storage**
* 1x Base MSA 2060 SFF Storage Drive Enclosure
* 22x MSA 1.92TB SAS 12G SFF (2.5in) M2 SSD
* 1x MSA 16Gb Short Wave Fibre Channel SFP+ 4-pack Transceiver
* Dual-controller, 4x 16Gb FC host interface
* LAN connectivity 2x 1Gb/s
* Redundant, hot-swap power supplies
### SCRATCH File System
The SCRATCH filesystem is realized as a parallel Lustre filesystem. It is accessible via the Infiniband network and is available from all login and compute nodes. Extended ACLs are provided on the Lustre filesystems for sharing data with other users using fine-grained control. For basic information about Lustre, see the [Understanding the Lustre Filesystems][7] subsection of the Barbora's storage documentation.
The SCRATCH filesystem is mounted in the `/scratch/project/PROJECT_ID` directory, created automatically with the `PROJECT_ID` project. The accessible capacity is 1000 TB, shared among all users. Users are restricted by PROJECT quotas set to 20 TB. The purpose of this quota is to prevent runaway programs from filling the entire filesystem and denying service to other users. Should 20 TB prove insufficient, contact [support][d]; the quota may be increased upon request.
To find out current SCRATCH quotas, use:
```console
[usr0123@login1.karolina ~]$ getent group OPEN-XX-XX
open-xx-xx:*:1234:user1,...,usern
[usr0123@login1.karolina ~]$ lfs quota -p 1234 /scratch/
Disk quotas for prj 1234 (pid 1234):
Filesystem kbytes quota limit grace files quota limit grace
/scratch/ 14356700796 0 19531250000 - 82841 0 20000000 -
```
!!! note
    The SCRATCH filesystem is intended for temporary scratch data generated during the calculation as well as for high-performance access to input and output files. All I/O intensive jobs must use the SCRATCH filesystem as their working directory.
    Users are advised to save the necessary data from the SCRATCH filesystem to the HOME filesystem after the calculations and to clean up the scratch files.
!!! warning
    Files on the SCRATCH filesystem that are **not accessed for more than 90 days** will be automatically **deleted**.
| SCRATCH filesystem | |
| -------------------- | ---------------------------------- |
| Mountpoint | /scratch |
| Capacity | 1361 TB |
| Throughput | 730.9 GB/s write, 1198.3 GB/s read |
| PROJECT quota | 20 TB |
| PROJECT inodes quota | 20 M |
| Default stripe size | 1 MB |
| Default stripe count | 1 |
| Protocol | Lustre |
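Large shared files may benefit from striping over more storage targets than the default; a sketch using the standard Lustre tools (the directory path and stripe count are illustrative):

```console
$ lfs setstripe -c 8 /scratch/project/PROJECT_ID/results   # new files in this directory use 8 stripes
$ lfs getstripe /scratch/project/PROJECT_ID/results        # verify the striping layout
```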
Configuration of the storage:
**1x SMU - ClusterStor 2U/24 System Management Unit Storage Controller**
* 5x Cray ClusterStor 1.6TB NVMe x4 Lanes Mixed Use SFF (2.5in) U.2 with Carrier
* 2x Cray ClusterStor InfiniBand HDR/Ethernet 200Gb 1-port QSFP PCIe4 Adapter (Mellanox ConnectX-6)
**1x MDU - ClusterStor 2U/24 Metadata Unit Storage Controller**
* 24x Cray ClusterStor 1.6TB NVMe x4 Lanes Mixed Use SFF (2.5in) U.2 with Carrier
* 2x Cray ClusterStor InfiniBand HDR/Ethernet 200Gb 1-port QSFP PCIe4 Adapter (Mellanox ConnectX-6)
**24x SSU-F - ClusterStor 2U24 Scalable Storage Unit Flash Storage Controller**
* 24x Cray ClusterStor 3.2TB NVMe x4 Lanes Mixed Use SFF (2.5in) U.2 with Carrier
* 4x Cray ClusterStor InfiniBand HDR/Ethernet 200Gb 1-port QSFP PCIe4 Adapter (Mellanox ConnectX-6)
**2x LMN - Aruba 6300M 48-port 1GbE**
* Aruba X371 12VDC 250W 100-240VAC Power-to-Port Power Supply
### PROJECT File System
The PROJECT data storage is a central storage for projects' and users' data at IT4Innovations that is accessible from all clusters.
For more information, see the [PROJECT Data Storage][9] section.
### Disk Usage and Quota Commands
For more information about disk usage and user quotas, see the Barbora's [storage section][8].
### Extended ACLs
Extended ACLs provide another security mechanism besides the standard POSIX ACL, which is defined by three entries (for owner/group/others). Extended ACLs have more than the three basic entries; in addition, they also contain a mask entry and may contain any number of named user and named group entries.
ACLs on a Lustre file system work exactly like ACLs on any Linux file system. They are manipulated with the standard tools in the standard manner.
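For example, granting a collaborator read access to a project directory with the standard tools (the username and path are illustrative):

```console
$ setfacl -m u:collaborator:rx /scratch/project/PROJECT_ID/shared_dir
$ getfacl /scratch/project/PROJECT_ID/shared_dir
```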
For more information, see the [Access Control List][10] section of the documentation.
## Local Filesystems
### TMP
Each node is equipped with a local `/tmp` directory with a capacity of a few GB. The `/tmp` directory should be used for working with small temporary files. Old files in the `/tmp` directory are automatically purged.
## Summary
| Mountpoint | Usage                     | Protocol | Net Capacity | Throughput                         | Limitations | Access                  | Services                         |
| ---------- | ------------------------- | -------- | ------------ | ---------------------------------- | ----------- | ----------------------- | -------------------------------- |
| /home      | home directory            | NFS      | 31 TB        | 1.93 GB/s write, 3.1 GB/s read     | Quota 25 GB | Compute and login nodes | backed up                        |
| /scratch   | cluster shared jobs' data | Lustre   | 1361 TB      | 730.9 GB/s write, 1198.3 GB/s read | Quota 20 TB | Compute and login nodes | files older than 90 days removed |
| /tmp       | local temporary files     | local    | a few GB     | -                                  | none        | Compute and login nodes | auto purged                      |
[1]: #home-file-system
[2]: #scratch-file-system
[4]: ../general/obtaining-login-credentials/obtaining-login-credentials.md
[5]: #project-file-system
[6]: ../storage/cesnet-storage.md
[7]: ../barbora/storage.md#understanding-the-lustre-filesystems
[8]: ../barbora/storage.md#disk-usage-and-quota-commands
[9]: ../storage/project-storage.md
[10]: ../storage/standard-file-acl.md
[a]: http://www.nas.nasa.gov
[b]: http://www.nas.nasa.gov/hecc/support/kb/Lustre_Basics_224.html#striping
[c]: http://doc.lustre.org/lustre_manual.xhtml#managingstripingfreespace
[d]: https://support.it4i.cz/rt
[e]: http://man7.org/linux/man-pages/man1/nfs4_setfacl.1.html
[l]: http://man7.org/linux/man-pages/man1/nfs4_getfacl.1.html