# Compute Nodes
Barbora is a cluster of x86-64 Intel-based nodes built with BullSequana Computing technology. The cluster contains three types of compute nodes.
## Compute Nodes Without Accelerators
* 192 nodes
* 6912 cores in total
* 2x Intel Cascade Lake 6240, 18-core, 2.6 GHz processors per node
* 192 GB DDR4 2933MT/s of physical memory per node (12x 16 GB)
* BullSequana X1120 blade servers
* 2995.2 GFLOP/s per compute node
* 1x 1 Gb Ethernet
* 1x HDR100 IB port
* 3 compute nodes per X1120 blade server
* cn[1-192]
![](img/BullSequanaX1120.png)
## Compute Nodes With a GPU Accelerator
* 8 nodes
* 192 cores in total
* 2x Intel Skylake Gold 6126, 12-core, 2.6 GHz processors per node
* 192 GB DDR4 2933MT/s with ECC of physical memory per node (12x 16 GB)
* 4x NVIDIA Tesla V100-SXM2 GPU accelerators per node
* BullSequana X410-E5 NVLink-V blade servers
* 1996.8 GFLOP/s per compute node
* GPU-to-GPU All-to-All NVLINK 2.0, GPU-Direct
* 1x 1 Gb Ethernet
* 2x HDR100 IB ports
* cn[193-200]
![](img/BullSequanaX410E5GPUNVLink.jpg)
## Fat Compute Node
* 1x BullSequana X808 server
* 128 cores in total
* 8x Intel Skylake Platinum 8153, 16-core, 2.0 GHz, 125 W processors
* 6144 GiB DDR4 2667MT/s of physical memory per node (96x 64 GB)
* 2x HDR100 IB ports
* 8192 GFLOP/s
* cn[201]
![](img/BullSequanaX808.jpg)
## Compute Node Summary
| Node type | Count | Range | Memory | Cores | Queues |
| ---------------------------- | ----- | ----------- | ------ | ----------- | -------------------------- |
| Nodes without an accelerator | 192   | cn[1-192]   | 192 GB   | 36 @ 2.6 GHz  | qexp, qprod, qlong, qfree |
| Nodes with a GPU accelerator | 8     | cn[193-200] | 192 GB   | 24 @ 2.6 GHz  | qnvidia                   |
| Fat compute nodes            | 1     | cn[201]     | 6144 GiB | 128 @ 2.0 GHz | qfat                      |
## Processor Architecture
Barbora is equipped with Intel Cascade Lake Xeon Gold 6240 processors (nodes without accelerators), Intel Skylake Gold 6126 processors (nodes with GPU accelerators), and Intel Skylake Platinum 8153 processors (the fat node).
### Intel [Cascade Lake 6240][d]
The Cascade Lake core is largely identical to [Skylake's][a]. For in-depth detail of the Skylake core/pipeline, see [Skylake (client) § Pipeline][b].
Xeon Gold 6240 is a 64-bit 18-core x86 multi-socket high-performance server microprocessor introduced by Intel in 2019. This chip supports up to 4-way multiprocessing. The Gold 6240, which is based on the Cascade Lake microarchitecture and is manufactured on a 14 nm process, sports 2 AVX-512 FMA units as well as three Ultra Path Interconnect links. This microprocessor, which operates at 2.6 GHz with a TDP of 150 W and a turbo boost frequency of up to 3.9 GHz, supports up to 1 TiB of hexa-channel DDR4-2933 ECC memory.
* **Family**: Xeon Gold
* **Cores**: 18
* **Threads**: 36
* **L1I Cache**: 576 KiB, 18x32 KiB, 8-way set associative
* **L1D Cache**: 576 KiB, 18x32 KiB, 8-way set associative, write-back
* **L2 Cache**: 18 MiB, 18x1 MiB, 16-way set associative, write-back
* **L3 Cache**: 24.75 MiB, 18x1.375 MiB, 11-way set associative, write-back
* **Instructions**: x86-64, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA3, F16C, BMI, BMI2, VT-x, VT-d, TXT, TSX, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVE, SGX, MPX, AVX-512 (New instructions for [Vector Neural Network Instructions][c])
* **Frequency**: 2.6 GHz
* **Max turbo**: 3.9 GHz
* **Process**: 14 nm
* **TDP**: 150 W
### Intel [Skylake Gold 6126][e]
Xeon Gold 6126 is a 64-bit dodeca-core x86 multi-socket high-performance server microprocessor introduced by Intel in mid-2017. This chip supports up to 4-way multiprocessing. The Gold 6126, which is based on the server configuration of the Skylake microarchitecture and is manufactured on a 14 nm+ process, sports 2 AVX-512 FMA units as well as three Ultra Path Interconnect links. This microprocessor, which operates at 2.6 GHz with a TDP of 125 W and a turbo boost frequency of up to 3.7 GHz, supports up to 768 GiB of hexa-channel DDR4-2666 ECC memory.
* **Family**: Xeon Gold
* **Cores**: 12
* **Threads**: 24
* **L1I Cache**: 384 KiB, 12x32 KiB, 8-way set associative
* **L1D Cache**: 384 KiB, 12x32 KiB, 8-way set associative, write-back
* **L2 Cache**: 12 MiB, 12x1 MiB, 16-way set associative, write-back
* **L3 Cache**: 19.25 MiB, 14x1.375 MiB, 11-way set associative, write-back
* **Instructions**: x86-64, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA3, F16C, BMI, BMI2, VT-x, VT-d, TXT, TSX, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVE, SGX, MPX, AVX-512
* **Frequency**: 2.6 GHz
* **Max turbo**: 3.7 GHz
* **Process**: 14 nm
* **TDP**: 125 W
### Intel [Skylake Platinum 8153][f]
Xeon Platinum 8153 is a 64-bit 16-core x86 multi-socket highest-performance server microprocessor introduced by Intel in mid-2017. This chip supports up to 8-way multiprocessing. The Platinum 8153, which is based on the server configuration of the Skylake microarchitecture and is manufactured on a 14 nm+ process, sports 2 AVX-512 FMA units as well as three Ultra Path Interconnect links. This microprocessor, which operates at 2.0 GHz with a TDP of 125 W and a turbo boost frequency of up to 2.8 GHz, supports up to 768 GiB of hexa-channel DDR4-2666 ECC memory.
* **Family**: Xeon Platinum
* **Cores**: 16
* **Threads**: 32
* **L1I Cache**: 512 KiB, 16x32 KiB, 8-way set associative
* **L1D Cache**: 512 KiB, 16x32 KiB, 8-way set associative, write-back
* **L2 Cache**: 16 MiB, 16x1 MiB, 16-way set associative, write-back
* **L3 Cache**: 22 MiB, 16x1.375 MiB, 11-way set associative, write-back
* **Instructions**: x86-64, MOVBE, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, POPCNT, AVX, AVX2, AES, PCLMUL, FSGSBASE, RDRND, FMA3, F16C, BMI, BMI2, VT-x, VT-d, TXT, TSX, RDSEED, ADCX, PREFETCHW, CLFLUSHOPT, XSAVE, SGX, MPX, AVX-512
* **Frequency**: 2.0 GHz
* **Max turbo**: 2.8 GHz
* **Process**: 14 nm
* **TDP**: 125 W
## GPU Accelerator
Barbora is equipped with [NVIDIA Tesla V100-SXM2][g] accelerators.
![](img/gpu-v100.png)
| NVIDIA Tesla V100-SXM2       |                                        |
| ---------------------------- | -------------------------------------- |
| GPU Architecture             | NVIDIA Volta                           |
| NVIDIA Tensor Cores          | 640                                    |
| NVIDIA CUDA® Cores           | 5120                                   |
| Double-Precision Performance | 7.8 TFLOP/s                            |
| Single-Precision Performance | 15.7 TFLOP/s                           |
| Tensor Performance           | 125 TFLOP/s                            |
| GPU Memory                   | 16 GB HBM2                             |
| Memory Bandwidth             | 900 GB/sec                             |
| ECC                          | Yes                                    |
| Interconnect Bandwidth       | 300 GB/sec                             |
| System Interface             | NVIDIA NVLink                          |
| Form Factor                  | SXM2                                   |
| Max Power Consumption        | 300 W                                  |
| Thermal Solution             | Passive                                |
| Compute APIs                 | CUDA, DirectCompute, OpenCL™, OpenACC  |
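On a GPU-accelerated node, the installed accelerators can be listed with the nvidia-smi tool; a minimal sketch (the output below is illustrative, UUIDs omitted):
```console
$ nvidia-smi -L
GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-...)
GPU 1: Tesla V100-SXM2-16GB (UUID: GPU-...)
GPU 2: Tesla V100-SXM2-16GB (UUID: GPU-...)
GPU 3: Tesla V100-SXM2-16GB (UUID: GPU-...)
```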
[a]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(server)#Core
[b]: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Pipeline
[c]: https://en.wikichip.org/wiki/x86/avx512vnni
[d]: https://en.wikichip.org/wiki/intel/xeon_gold/6240
[e]: https://en.wikichip.org/wiki/intel/xeon_gold/6126
[f]: https://en.wikichip.org/wiki/intel/xeon_platinum/8153
[g]: https://images.nvidia.com/content/technologies/volta/pdf/tesla-volta-v100-datasheet-letter-fnl-web.pdf
# Hardware Overview
The Barbora cluster consists of 201 computational nodes named **cn[1-201]**, of which 192 are regular compute nodes, 8 are GPU Tesla V100 accelerated nodes, and 1 is a fat node. Each node is a powerful x86-64 computer, equipped with 36/24/128 cores (18-core Intel Cascade Lake 6240 / 12-core Intel Skylake Gold 6126 / 16-core Intel Skylake Platinum 8153) and at least 192 GB of RAM. User access to the Barbora cluster is provided by two login nodes **login[1,2]**. The nodes are interlinked through high-speed InfiniBand and Ethernet networks.
The fat node is equipped with a large amount of memory (6144 GB). Virtualization infrastructure provides resources for running long-term servers and services in virtual mode. Accelerated nodes, the fat node, and the virtualization infrastructure are available [upon request][a] from a PI.
**There are three types of compute nodes:**
* 192 compute nodes without an accelerator
* 8 compute nodes with a GPU accelerator - 4x NVIDIA Tesla V100-SXM2
* 1 fat node - equipped with 6144 GB of RAM
[More about Compute nodes][1].
GPU-accelerated and fat nodes are available upon request, see the [Resources Allocation Policy][2].
All of these nodes are interconnected through fast InfiniBand and Ethernet networks. [More about the Network][3].
Every chassis provides an InfiniBand switch, marked **isw**, connecting all nodes in the chassis, as well as connecting the chassis to the upper level switches.
User access to the Barbora cluster is provided by two login nodes, login1 and login2. [More about accessing the cluster][5].
The parameters are summarized in the following tables:
| **In general** | |
| ------------------------------------------- | -------------------------------------------- |
| Primary purpose | High Performance Computing |
| Architecture of compute nodes | x86-64 |
| Operating system | Linux |
| [**Compute nodes**][1] | |
| Total | 201 |
| Processor cores | 36/24/128 (2x18 cores/2x12 cores/8x16 cores) |
| RAM | min. 192 GB |
| Local disk drive | no |
| Compute network | InfiniBand HDR |
| w/o accelerator | 192, cn[1-192] |
| GPU accelerated | 8, cn[193-200] |
| Fat compute nodes | 1, cn[201] |
| **In total** | |
| Total theoretical peak performance (Rpeak) | 840 TFLOP/s |
| Total max. LINPACK performance (Rmax) | XX TFLOP/s |
| Total amount of RAM | 44.544 TB |
| Node | Processor | Memory | Accelerator |
| ---------------- | --------------------------------------- | ------ | ---------------------- |
| w/o accelerator | 2 x Intel Cascade Lake 6240, 2.6 GHz | 192 GB | - |
| GPU accelerated | 2 x Intel Skylake Gold 6126, 2.6 GHz | 192 GB | NVIDIA Tesla V100-SXM2 |
| Fat compute node | 8 x Intel Skylake Platinum 8153, 2.0 GHz | 6144 GB | - |
For more details refer to [Compute nodes][1], [Storage][4], [Visualization servers][6] and [Network][3].
[1]: compute-nodes.md
[2]: ../general/resources-allocation-policy.md
[3]: network.md
[4]: storage.md
[5]: ../general/shell-and-data-access.md
[6]: visualization.md
[a]: https://support.it4i.cz/rt
# Introduction
Welcome to the Barbora supercomputer cluster. The Barbora cluster consists of 201 compute nodes, totaling 7232 compute cores with 44544 GB RAM, giving over 840 TFLOP/s theoretical peak performance.
Nodes are interconnected through a fully non-blocking fat-tree InfiniBand network and are equipped with Intel Cascade Lake processors. A few nodes are also equipped with NVIDIA Tesla V100-SXM2 accelerators. Read more in [Hardware Overview][1].
The cluster runs an operating system compatible with the Red Hat [Linux family][a]. We have installed a wide range of software packages targeted at different scientific domains. These packages are accessible via the [modules environment][2].
!!! warning
Cluster integration is in progress. The resulting settings may vary. The documentation will be updated.
Shared filesystems for user data and for job data are available to users.
The [PBS Professional Open Source Project][b] workload manager provides [computing resource allocation and job execution][3].
Read more on how to [apply for resources][4], [obtain login credentials][5] and [access the cluster][6].
![](img/BullSequanaX.png)
[1]: hardware-overview.md
[2]: ../environment-and-modules.md
[3]: ../general/resources-allocation-policy.md
[4]: ../general/applying-for-resources.md
[5]: ../general/obtaining-login-credentials/obtaining-login-credentials.md
[6]: ../general/shell-and-data-access.md
[a]: http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg
[b]: https://www.pbspro.org/
# Network
All of the compute and login nodes of Barbora are interconnected through a [Mellanox][c] [InfiniBand][a] HDR 200 Gbps network and a Gigabit [Ethernet][b] network.
Compute nodes and the service infrastructure are connected by HDR100 technology, which allows a single 200 Gbps HDR port (an aggregation of 4x 50 Gbps) to be split into two HDR100 ports, each with 100 Gbps (2x 50 Gbps) bandwidth.
The cabling between the L1 and L2 layers is realized with HDR cables, while end devices are connected by so-called Y (splitter) cables (1x HDR200 to 2x HDR100).
![](img/hdr.jpg)
**The computing network thus implemented fulfills the following parameters:**
* 100Gbps
* Latencies less than 10 microseconds (0.6 μs end-to-end, <90ns switch hop)
* Adaptive routing support
* MPI communication support
* IP protocol support (IPoIB)
* Support for SCRATCH Data Storage and NVMe over Fabric Data Storage.
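To verify the HDR100 link rate on a node, you can query the InfiniBand port with the ibstat tool (assuming the infiniband-diags utilities are available; output abbreviated and illustrative):
```console
$ ibstat
CA 'mlx5_0'
        Port 1:
            State: Active
            Physical state: LinkUp
            Rate: 100
            Link layer: InfiniBand
```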
## Mellanox QM8700 40-Port Switch
**Performance**
* 40x HDR 200 Gb/s ports in a 1U switch
* 80x HDR100 100 Gb/s ports in a 1U switch
* 16Tb/s aggregate switch throughput
* Up to 15.8 billion messages-per-second
* 90ns switch latency
**Optimized Design**
* 1+1 redundant & hot-swappable power
* 80 PLUS Gold and Energy Star certified power supplies
* Dual-core x86 CPU
**Advanced Design**
* Adaptive routing
* Collective offloads (Mellanox SHARP technology)
* VL mapping (VL2VL)
![](img/QM8700.jpg)
## BullSequana XH2000 HDRx WH40 MODULE
* Mellanox QM8700 switch modified for direct liquid cooling (Atos Cold Plate), with a form factor for installation in the Bull Sequana XH2000 rack
![](img/XH2000.png)
[a]: http://en.wikipedia.org/wiki/InfiniBand
[b]: http://en.wikipedia.org/wiki/Ethernet
[c]: http://www.mellanox.com/
# Storage - WORK IN PROGRESS
!!! warning
Cluster integration is in progress. The resulting settings may vary. The documentation will be updated.
There are three main shared file systems on the Barbora cluster: [HOME][1], [SCRATCH][2], and [PROJECT][5]. All login and compute nodes may access the same data on the shared file systems. Compute nodes are also equipped with local (non-shared) scratch, RAM disk, and tmp file systems.
## Archiving
Do not use the shared filesystems as a backup for large amounts of data or as a means of long-term archiving. The academic staff and students of research institutions in the Czech Republic can use the [CESNET storage service][3], which is available via SSHFS.
## Shared Filesystems
The Barbora cluster provides three main shared filesystems: the [HOME filesystem][1], the [SCRATCH filesystem][2], and the [PROJECT filesystem][5].
*Both the HOME and SCRATCH filesystems are realized as parallel Lustre filesystems and are accessible via the InfiniBand network. Extended ACLs are provided on both Lustre filesystems for sharing data with other users using fine-grained control.*
### Understanding the Lustre Filesystems
A user file on the [Lustre filesystem][a] can be divided into multiple chunks (stripes) and stored across a subset of the object storage targets (OSTs) (disks). The stripes are distributed among the OSTs in a round-robin fashion to ensure load balancing.
When a client (a compute node from your job) needs to create or access a file, the client queries the metadata server (MDS) and the metadata target (MDT) for the layout and location of the [file's stripes][b]. Once the file is opened and the client obtains the striping information, the MDS is no longer involved in the file I/O process. The client interacts directly with the object storage servers (OSSes) and OSTs to perform I/O operations such as locking, disk allocation, storage, and retrieval.
If multiple clients try to read and write the same part of a file at the same time, the Lustre distributed lock manager enforces coherency so that all clients see consistent results.
There is a default stripe configuration for Barbora's Lustre filesystems. However, users can set the following stripe parameters for their own directories or files to get optimum I/O performance:
1. stripe_size: the size of the chunk in bytes; specify with k, m, or g to use units of KB, MB, or GB, respectively; the size must be an even multiple of 65,536 bytes; the default is 1 MB for all Barbora Lustre filesystems.
1. stripe_count: the number of OSTs to stripe across; the default is 1 for Barbora Lustre filesystems; one can specify -1 to use all OSTs in the filesystem.
1. stripe_offset: the index of the OST where the first stripe is to be placed; the default is -1, which results in random selection; using a non-default value is NOT recommended.
!!! note
Setting stripe size and stripe count correctly for your needs may significantly impact the I/O performance you experience.
Use the lfs getstripe command to view the stripe parameters and the lfs setstripe command to set them. The correct stripe setting depends on your needs and file access patterns.
```console
$ lfs getstripe dir|filename
$ lfs setstripe -s stripe_size -c stripe_count -o stripe_offset dir|filename
```
Example:
```console
$ lfs getstripe /scratch/username/
/scratch/username/
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
$ lfs setstripe -c -1 /scratch/username/
$ lfs getstripe /scratch/username/
/scratch/username/
stripe_count: 10 stripe_size: 1048576 stripe_offset: -1
```
In this example, we view the current stripe setting of the /scratch/username/ directory. The stripe count is changed to all OSTs and verified. All files written to this directory will be striped over 10 OSTs.
Use the lfs check osts command to see the number and status of active OSTs for each filesystem on Barbora. Learn more by reading the man page:
```console
$ lfs check osts
$ man lfs
```
### Hints on Lustre Striping
!!! note
Increase the stripe_count for parallel I/O to the same file.
When multiple processes are writing blocks of data to the same file in parallel, the I/O performance for large files will improve when the stripe_count is set to a larger value. The stripe count sets the number of OSTs the file will be written to. By default, the stripe count is set to 1. While this default setting provides for efficient access of metadata (for example to support the ls -l command), large files should use stripe counts of greater than 1. This will increase the aggregate I/O bandwidth by using multiple OSTs in parallel instead of just one. A rule of thumb is to use a stripe count approximately equal to the number of gigabytes in the file.
Another good practice is to make the stripe count be an integral factor of the number of processes performing the write in parallel, so that you achieve load balance among the OSTs. For example, set the stripe count to 16 instead of 15 when you have 64 processes performing the writes.
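For instance, applying both hints above to a large file written by 64 parallel processes, one might pre-create the output directory with a stripe count of 16 (a sketch; the path is illustrative):
```console
$ lfs setstripe -c 16 /scratch/username/parallel-output/
```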
!!! note
Using a large stripe size can improve performance when accessing very large files
Large stripe size allows each client to have exclusive access to its own part of a file. However, it can be counterproductive in some cases if it does not match your I/O pattern. The choice of stripe size has no effect on a single-stripe file.
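As a sketch, a larger stripe size may be set together with the stripe count using the same syntax shown earlier (values and path illustrative; verify against your own I/O pattern):
```console
$ lfs setstripe -s 4m -c 8 /scratch/username/large-files/
```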
Read more [here][c].
### Lustre on Barbora
The architecture of Lustre on Barbora is composed of two metadata servers (MDS) and four data/object storage servers (OSS). Two object storage servers are used for the HOME filesystem and another two for the SCRATCH filesystem.
Configuration of the storage:
* HOME Lustre object storage
* One disk array NetApp E5400
* 22 OSTs
* 227 2TB NL-SAS 7.2krpm disks
* 22 groups of 10 disks in RAID6 (8+2)
* 7 hot-spare disks
* SCRATCH Lustre object storage
* 54x 8 TB 10kRPM 2.5" SAS HDDs
* 5 x RAID6(8+2)
* 4 hotspare
* Lustre metadata storage
* One disk array NetApp E2600
* 12 300GB SAS 15krpm disks
* 2 groups of 5 disks in RAID5
* 2 hot-spare disks
### HOME File System
The HOME filesystem is mounted in the /home directory. Users' home directories /home/username reside on this filesystem. The accessible capacity is 320 TB, shared among all users. Individual users are restricted by a filesystem usage quota, set to 250 GB per user. If 250 GB proves insufficient for a particular user, contact [support][d]; the quota may be increased upon request.
!!! note
The HOME filesystem is intended for preparation, evaluation, processing and storage of data generated by active Projects.
The HOME filesystem should not be used to archive data of past Projects or other unrelated data.
The files on the HOME filesystem will not be deleted until the end of the [user's lifecycle][4].
The filesystem is backed up, so that it can be restored in case of a catastrophic failure resulting in significant data loss. However, this backup is not intended to restore old versions of user data or to restore (accidentally) deleted files.
The HOME filesystem is realized as a Lustre parallel filesystem and is available on all login and computational nodes.
The default stripe size is 1 MB and the default stripe count is 1. There are 22 OSTs dedicated to the HOME filesystem.
!!! note
Setting stripe size and stripe count correctly for your needs may significantly impact the I/O performance you experience.
| HOME filesystem | |
| -------------------- | ------ |
| Mountpoint | /home |
| Capacity | 320 TB |
| Throughput | 2 GB/s |
| User quota | 250 GB |
| Default stripe size | 1 MB |
| Default stripe count | 1 |
| Number of OSTs | 22 |
### SCRATCH File System
The SCRATCH filesystem is mounted in the /scratch directory. Users may freely create subdirectories and files on the filesystem. The accessible capacity is 146 TB, shared among all users. Individual users are restricted by a filesystem usage quota, set to 100 TB per user. The purpose of this quota is to prevent runaway programs from filling the entire filesystem and denying service to other users. If 100 TB proves insufficient for a particular user, contact [support][d]; the quota may be increased upon request.
!!! note
The Scratch filesystem is intended for temporary scratch data generated during the calculation as well as for high performance access to input and output files. All I/O intensive jobs must use the SCRATCH filesystem as their working directory.
Users are advised to save the necessary data from the SCRATCH filesystem to the HOME filesystem after the calculations and to clean up the scratch files.
!!! warning
Files on the SCRATCH filesystem that are **not accessed for more than 90 days** will be automatically **deleted**.
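To identify files at risk of automatic deletion, you can list files that have not been accessed for a long time, for example 60 or more days (a sketch; the path is illustrative):
```console
$ find /scratch/username -type f -atime +60
```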
The SCRATCH filesystem is realized as a Lustre parallel filesystem and is available from all login and computational nodes. The default stripe size is 1 MB and the default stripe count is 1. There are 10 OSTs dedicated to the SCRATCH filesystem.
!!! note
Setting stripe size and stripe count correctly for your needs may significantly impact the I/O performance you experience.
| SCRATCH filesystem | |
| -------------------- | -------- |
| Mountpoint | /scratch |
| Capacity             | 146 TB   |
| Throughput           | 6 GB/s   |
| User quota           | 100 TB   |
| Default stripe size  | 1 MB     |
| Default stripe count | 1 |
| Number of OSTs | 10 |
### PROJECT File System
to do...
### Disk Usage and Quota Commands
Disk usage and user quotas can be checked and reviewed using the following command:
```console
$ it4i-disk-usage
```
Example:
```console
$ it4i-disk-usage -h
# Using human-readable format
# Using power of 1024 for space
# Using power of 1000 for entries
Filesystem: /home
Space used: 112G
Space limit: 238G
Entries: 15k
Entries limit: 500k
Filesystem: /scratch
Space used: 0
Space limit: 93T
Entries: 0
Entries limit: 0
```
In this example, we view the current size limits and the space occupied on the /home and /scratch filesystems for the particular user executing the command.
Note that limits are also imposed on the number of objects (files, directories, links, etc.) that a user is allowed to create.
To better understand where exactly the space is used, run the following command:
```console
$ du -hs dir
```
Example for your HOME directory:
```console
$ cd /home
$ du -hs * .[a-zA-Z0-9]* | grep -E "[0-9]*G|[0-9]*M" | sort -hr
258M cuda-samples
15M .cache
13M .mozilla
5,5M .eclipse
2,7M .idb_13.0_linux_intel64_app
```
This will list all directories consuming megabytes or gigabytes of space in your current (in this example, HOME) directory. The list is sorted in descending order, from the largest to the smallest files/directories.
To better understand the previous commands, read the man pages:
```console
$ man lfs
```
```console
$ man du
```
### Extended ACLs
Extended ACLs provide another security mechanism besides the standard POSIX access rights, which are defined by three entries (for owner/group/others). Extended ACLs have more than the three basic entries. In addition, they also contain a mask entry and may contain any number of named user and named group entries.
ACLs on a Lustre file system work exactly like ACLs on any Linux file system. They are manipulated with the standard tools in the standard manner. Below, we create a directory and allow a specific user access.
```console
[vop999@login1.barbora ~]$ umask 027
[vop999@login1.barbora ~]$ mkdir test
[vop999@login1.barbora ~]$ ls -ld test
drwxr-x--- 2 vop999 vop999 4096 Nov 5 14:17 test
[vop999@login1.barbora ~]$ getfacl test
# file: test
# owner: vop999
# group: vop999
user::rwx
group::r-x
other::---
[vop999@login1.barbora ~]$ setfacl -m user:johnsm:rwx test
[vop999@login1.barbora ~]$ ls -ld test
drwxrwx---+ 2 vop999 vop999 4096 Nov 5 14:17 test
[vop999@login1.barbora ~]$ getfacl test
# file: test
# owner: vop999
# group: vop999
user::rwx
user:johnsm:rwx
group::r-x
mask::rwx
other::---
```
The default ACL mechanism can be used to replace setuid/setgid permissions on directories. Setting a default ACL on a directory (the -d flag to setfacl) will cause the ACL permissions to be inherited by any newly created file or subdirectory within the directory, as illustrated below. For more information on Linux ACLs, refer to the [Red Hat guide][e].
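Continuing the example above, a minimal sketch of setting a default ACL so that user johnsm (an illustrative username) inherits rwx on newly created content of the test directory:
```console
[vop999@login1.barbora ~]$ setfacl -d -m user:johnsm:rwx test
[vop999@login1.barbora ~]$ getfacl --default test
# file: test
# owner: vop999
# group: vop999
user::rwx
user:johnsm:rwx
group::r-x
mask::rwx
other::---
```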
## Local Filesystems
### Tmp
Each node is equipped with a local /tmp directory with a capacity of a few GB. The /tmp directory should be used to work with small temporary files. Old files in the /tmp directory are automatically purged.
## Summary
| Mountpoint | Usage                     | Protocol | Net Capacity | Throughput | Limitations  | Access                  | Services                         |
| ---------- | ------------------------- | -------- | ------------ | ---------- | ------------ | ----------------------- | -------------------------------- |
| /home      | home directory            | Lustre   | 320 TiB      | 2 GB/s     | Quota 250 GB | Compute and login nodes | backed up                        |
| /scratch   | cluster shared jobs' data | Lustre   | 146 TiB      | 6 GB/s     | Quota 100 TB | Compute and login nodes | files older than 90 days removed |
| /tmp       | local temporary files     | local    | 9.5 GB       | 100 MB/s   | none         | Compute and login nodes | auto purged                      |
## CESNET Data Storage
Do not use the shared filesystems at IT4Innovations as a backup for large amounts of data or for long-term archiving purposes.
!!! note
IT4Innovations does not provide storage capacity for data archiving. Academic staff and students of research institutions in the Czech Republic can use the [CESNET Storage service][f].
The CESNET Storage service can be used for research purposes, mainly by academic staff and students of research institutions in the Czech Republic.
Users of the CESNET data storage (DU) may be organizations or individuals in a current employment relationship (employees) or a current study relationship (students) with a legal entity (organization) that meets the “Principles for access to CESNET Large infrastructure (Access Policy)”.
Users may only use the CESNET data storage for the transfer and storage of data associated with activities in science, research, development, the spread of education, culture, and prosperity. For details, see the “Acceptable Use Policy CESNET Large Infrastructure (Acceptable Use Policy, AUP)”.
The service is documented [here][g]. For special requirements, contact the CESNET Storage Department directly via e-mail at [du-support(at)cesnet.cz][h].
The procedure to obtain CESNET access is quick and trouble-free.
## CESNET Storage Access
### Understanding CESNET Storage
!!! note
It is very important to understand the CESNET storage before uploading data. [Read][i] first.
Once registered for CESNET Storage, you may [access the storage][j] in a number of ways. We recommend the SSHFS and RSYNC methods.
### SSHFS Access
!!! note
SSHFS: The storage will be mounted like a local hard drive
SSHFS provides a very convenient way to access the CESNET Storage. The storage will be mounted onto a local directory, exposing the vast CESNET Storage as if it were a local removable hard drive. Files can then be copied in and out in the usual fashion.
First, create the mount point:
```console
$ mkdir cesnet
```
Mount the storage. Note that you can choose among ssh.du1.cesnet.cz (Plzen), ssh.du2.cesnet.cz (Jihlava), and ssh.du3.cesnet.cz (Brno). Mount tier1_home **(only 5120 MB!)**:
```console
$ sshfs username@ssh.du1.cesnet.cz:. cesnet/
```
For easy future access from Barbora, install your public key:
```console
$ cp .ssh/id_rsa.pub cesnet/.ssh/authorized_keys
```
Mount tier1_cache_tape for the Storage VO:
```console
$ sshfs username@ssh.du1.cesnet.cz:/cache_tape/VO_storage/home/username cesnet/
```
View the archive, and copy files and directories in and out:
```console
$ ls cesnet/
$ cp -a mydir cesnet/.
$ cp cesnet/myfile .
```
Once done, remember to unmount the storage:
```console
$ fusermount -u cesnet
```
### RSYNC Access
!!! info
RSYNC provides delta transfer for best performance and can resume interrupted transfers.
RSYNC is a fast and extraordinarily versatile file copying tool. It is famous for its delta-transfer algorithm, which reduces the amount of data sent over the network by sending only the differences between the source files and the existing files in the destination. RSYNC is widely used for backups and mirroring and as an improved copy command for everyday use.
RSYNC finds files that need to be transferred using a "quick check" algorithm (by default) that looks for files that have changed in size or in last-modified time. Any changes in the other preserved attributes (as requested by options) are made on the destination file directly when the quick check indicates that the file's data does not need to be updated.
[More about RSYNC][k].
Transfer large files to/from CESNET storage, assuming membership in the Storage VO:
```console
$ rsync --progress datafile username@ssh.du1.cesnet.cz:VO_storage-cache_tape/.
$ rsync --progress username@ssh.du1.cesnet.cz:VO_storage-cache_tape/datafile .
```
Transfer large directories to/from CESNET storage, assuming membership in the Storage VO:
```console
$ rsync --progress -av datafolder username@ssh.du1.cesnet.cz:VO_storage-cache_tape/.
$ rsync --progress -av username@ssh.du1.cesnet.cz:VO_storage-cache_tape/datafolder .
```
Transfer rates of about 28 MB/s can be expected.
[1]: #home
[2]: #scratch
[3]: #cesnet-data-storage
[4]: ../general/obtaining-login-credentials/obtaining-login-credentials.md
[5]: #project
[a]: http://www.nas.nasa.gov
[b]: http://www.nas.nasa.gov/hecc/support/kb/Lustre_Basics_224.html#striping
[c]: http://doc.lustre.org/lustre_manual.xhtml#managingstripingfreespace
[d]: https://support.it4i.cz/rt
[e]: https://access.redhat.com/documentation/en-US/Red_Hat_Storage/2.0/html/Administration_Guide/ch09s05.html
[f]: https://du.cesnet.cz/
[g]: https://du.cesnet.cz/en/start
[h]: mailto:du-support@cesnet.cz
[i]: https://du.cesnet.cz/en/navody/home-migrace-plzen/start
[j]: https://du.cesnet.cz/en/navody/faq/start
[k]: https://du.cesnet.cz/en/navody/rsync/start#pro_bezne_uzivatele
# Visualization Servers
Remote visualization with the [NICE DCV software][3] is available on two nodes.
* 2 nodes
* 64 cores in total
* 2x Intel Skylake Gold 6130, 16-core, 2.1 GHz processors per node
* 192 GB DDR4 2667MT/s of physical memory per node (12x 16 GB)
* BullSequana X450-E5 blade servers
* 2150.4 GFLOP/s per compute node
* 1x 1 Gb Ethernet and 2x 10 Gb Ethernet
* 1x HDR100 IB port
* 2x SSD 240 GB
![](img/bullsequanaX450-E5.png)
## NVIDIA Quadro P6000
* GPU Memory: 24 GB GDDR5X
* Memory Interface: 384-bit
* Memory Bandwidth: Up to 432 GB/s
* NVIDIA CUDA® Cores: 3840
* System Interface: PCI Express 3.0 x16
* Max Power Consumption: 250 W
* Thermal Solution: Active
* Form Factor: 4.4" H x 10.5" L, Dual Slot, Full Height
* Display Connectors: 4x DP 1.4 + DVI-D DL
* Max Simultaneous Displays: 4 direct, 4 DP1.4 Multi-Stream
* Max DP 1.4 Resolution: 7680 x 4320 @ 30 Hz
* Max DVI-D DL Resolution: 2560 x 1600 @ 60 Hz
* Graphics APIs: Shader Model 5.1, OpenGL 4.5, DirectX 12.0, Vulkan 1.0
* Compute APIs: CUDA, DirectCompute, OpenCL™
* Single-Precision Floating-Point Performance: 12.6 TFLOP/s (peak)
![](img/quadrop6000.jpg)
## Resource Allocation Policy
| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime |
|-------|----------------|-------------------|-------|-----------|----------|---------------|----------|
| qviz Visualization queue | yes | none required | 2 | 4 | 150 | no | 1h/8h |
## References
* [Graphical User Interface][1]
* [VPN Access][2]
[1]: ../general/shell-and-data-access.md#graphical-user-interface
[2]: ../general/shell-and-data-access.md#vpn-access
[3]: ../software/viz/NICEDCVsoftware.md
@@ -28,7 +28,7 @@ The qsub command submits the job to the queue, i.e. the qsub command creates a r
### Job Submission Examples
!!! note
Anselm ... ncpus=16, Salomon ... ncpus=24, Barbora ... ncpus=36 or ncpus=24 for accelerated nodes
```console
$ qsub -A OPEN-0-0 -q qprod -l select=64:ncpus=16,walltime=03:00:00 ./myjob
```
@@ -57,6 +57,28 @@ The resources are allocated to the job in a fair-share fashion, subject to const
!!! note
To access node with Xeon Phi co-processor user needs to specify that in [job submission select statement][3].
### Barbora
| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime |
| ------------------- | -------------- | -------------------- | ---------------------------------------------------- | --------- | -------- | ------------- | -------- |
| qexp | no | none required | 189 nodes | 36 | 150 | no | 1 h |
| qprod | yes | > 0 | 187 nodes w/o accelerator | 36 | 0 | no | 24/48 h |
| qlong | yes | > 0 | 60 nodes w/o accelerator | 36 | 0 | no | 72/144 h |
| qnvidia | yes | > 0 | 8 nvidia nodes | 24 | 200 | yes | 24/48 h |
| qfat                | yes            | > 0                  | 1 fat node                                            | 8         | 200      | yes           | 24/144 h |
| qfree | yes | < 120% of allocation | 189 w/o accelerator | 36 | -1024 | no | 12 h |
!!! note
**The qfree queue is not free of charge**. [Normal accounting][2] applies. However, it allows for the utilization of free resources once a project has exhausted all of its allocated computational resources. This does not apply to Director's Discretion projects (DD projects) by default. Usage of qfree after exhaustion of a DD project's computational resources is allowed only upon request for this queue.
**The qexp queue is equipped with nodes which do not have exactly the same CPU clock speed.** Should you need the nodes to have exactly the same CPU speed, you have to select the proper nodes during the PBS job submission, as illustrated below.
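For example, specific hosts can be requested in the select statement via the host resource (a sketch; the project ID and node names are placeholders):
```console
$ qsub -A OPEN-0-0 -q qexp -l select=1:ncpus=36:host=cn101+1:ncpus=36:host=cn102 ./myjob
```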
* **qexp**, the Express queue: This queue is dedicated to testing and running very small jobs. It is not required to specify a project to enter the qexp. There are always 2 nodes reserved for this queue (w/o accelerators), and a maximum of 8 nodes is available via qexp for a particular user. The nodes may be allocated on a per-core basis. No special authorization is required to use qexp. The maximum runtime in qexp is 1 hour.
* **qprod**, the Production queue: This queue is intended for normal production runs. It is required that an active project with nonzero remaining resources is specified to enter the qprod. All nodes may be accessed via the qprod queue, except the reserved ones. 187 nodes without accelerators are included. Full nodes, 36 cores per node, are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qprod is 48 hours.
* **qlong**, the Long queue: This queue is intended for long production runs. It is required that an active project with nonzero remaining resources is specified to enter the qlong. Only 60 nodes without acceleration may be accessed via the qlong queue. Full nodes, 36 cores per node, are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qlong is 144 hours (three times that of the standard qprod time - 3 x 48 h).
* **qnvidia**, **qfat**, the Dedicated queues: The qnvidia queue is dedicated to accessing the NVIDIA-accelerated nodes and qfat to the fat node. It is required that an active project with nonzero remaining resources is specified to enter these queues. 8 NVIDIA nodes (4 NVIDIA cards per node) and 1 fat node are included. Full nodes, 24 cores per node, are allocated. The queues run with very high priority. A PI needs to explicitly ask [support][a] for authorization to enter the dedicated queues for all users associated with her/his project.
* **qfree**, the Free resource queue: The qfree queue is intended for the utilization of free resources, after a project has exhausted all of its allocated computational resources (this does not apply to DD projects by default; DD projects have to request permission to use qfree after exhaustion of their computational resources). It is required that an active project is specified to enter the queue. Consumed resources will be accounted to the Project. Access to the qfree queue is automatically removed if consumed resources exceed 120% of the resources allocated to the Project. Only 189 nodes without accelerators may be accessed from this queue. Full nodes, 36 cores per node, are allocated. The queue runs with very low priority and no special authorization is required to use it. The maximum runtime in qfree is 12 hours. A submission sketch follows this list.
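A qfree submission might look as follows (a sketch; the project ID is a placeholder):
```console
$ qsub -A OPEN-0-0 -q qfree -l select=10:ncpus=36,walltime=12:00:00 ./myjob
```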
## Queue Notes
The job wall clock time defaults to **half the maximum time**, see the table above. Longer wall time limits can be [set manually, see examples][3].
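For instance, a qprod job requesting the full 48-hour limit instead of the default might be submitted as follows (a sketch; the project ID is a placeholder):
```console
$ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=36,walltime=48:00:00 ./myjob
```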
@@ -15,6 +15,14 @@ The all IT4Innovations clusters are accessed by SSH protocol via login nodes log
| login1.anselm.it4i.cz | 22 | ssh | login1 |
| login2.anselm.it4i.cz | 22 | ssh | login2 |
### Barbora Cluster
| Login address | Port | Protocol | Login node |
| ------------------------- | ---- | -------- | ------------------------------------- |
| barbora.it4i.cz | 22 | ssh | round-robin DNS record for login[1-2] |
| login1.barbora.it4i.cz | 22 | ssh | login1 |
| login2.barbora.it4i.cz | 22 | ssh | login2 |
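For example, to log in to Barbora with your private key (paths and username are placeholders):
```console
$ ssh -i /path/to/id_rsa username@barbora.it4i.cz
```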
### Salomon Cluster
| Login address | Port | Protocol | Login node |
@@ -27,38 +27,20 @@ The all IT4Innovations clusters are accessed by SSH protocol via login nodes log
## Authentication
Authentication is available by [private key][1] only.
!!! note
Verify SSH fingerprints during the first logon. They are identical on all login nodes:
Anselm:
```console
md5:
29:b3:f4:64:b0:73:f5:6f:a7:85:0f:e0:0d:be:76:bf (DSA)
d4:6f:5c:18:f4:3f:70:ef:bc:fc:cc:2b:fd:13:36:b7 (RSA)
1a:19:75:31:ab:53:45:53:ce:35:82:13:29:e4:0d:d5 (ECDSA)
sha256:
LX2034TYy6Lf0Q7Zf3zOIZuFlG09DaSGROGBz6LBUy4 (DSA)
+DcED3GDoA9piuyvQOho+ltNvwB9SJSYXbB639hbejY (RSA)
2Keuu9gzrcs1K8pu7ljm2wDdUXU6f+QGGSs8pyrMM3M (ECDSA)
```
Salomon:
```console
md5:
f6:28:98:e4:f9:b2:a6:8f:f2:f4:2d:0a:09:67:69:80 (DSA)
70:01:c9:9a:5d:88:91:c7:1b:c0:84:d1:fa:4e:83:5c (RSA)
66:32:0a:ef:50:01:77:a7:52:3f:d9:f8:23:7c:2c:3a (ECDSA)
sha256:
epkqEU2eFzXnMeMMkpX02CykyWjGyLwFj528Vumpzn4 (DSA)
WNIrR7oeQDYpBYy4N2d5A6cJ2p0837S7gzzTpaDBZrc (RSA)
cYO4UdtUBYlS46GEFUB75BkgxkI6YFQvjVuFxOlRG3g (ECDSA)
```
Private key authentication:
@@ -107,6 +97,14 @@ Data in and out of the system may be transferred by the [scp][a] and sftp protoc
| login1.anselm.it4i.cz | 22 | scp |
| login2.anselm.it4i.cz | 22 | scp |
### Barbora Cluster
| Address | Port | Protocol |
| ------------------------- | ---- | ------- |
| barbora.it4i.cz | 22 | scp |
| login1.barbora.it4i.cz | 22 | scp |
| login2.barbora.it4i.cz | 22 | scp |
### Salomon Cluster
| Address | Port | Protocol |
@@ -122,6 +120,8 @@ Authentication is by [private key][1] only.
!!! note
If you experience degraded data transfer performance, consult your local network provider.
On Linux or macOS, use an scp or sftp client to transfer data to Barbora:
```console
$ scp -i /path/to/id_rsa my-local-file username@cluster-name.it4i.cz:directory/file
```
@@ -142,6 +142,8 @@ A very convenient way to transfer files in and out of cluster is via the fuse fi
$ sshfs -o IdentityFile=/path/to/id_rsa username@cluster-name.it4i.cz:. mountpoint
```
Using sshfs, the user's Barbora home directory will be mounted on your local computer, just like an external disk.
Learn more about ssh, scp, and sshfs by reading the man pages:
```console
$ man ssh
$ man scp
$ man sshfs
```
# Documentation
Welcome to the IT4Innovations documentation pages. The IT4Innovations national supercomputing center operates the supercomputers [Anselm][2], [Barbora][3] and [Salomon][1]. The supercomputers are [available][4] to the academic community within the Czech Republic and Europe, and the industrial community worldwide. The purpose of these pages is to provide comprehensive documentation of the hardware, software and usage of the computers.
## How to Read the Documentation
@@ -68,6 +68,7 @@ By doing so, you can save other readers from frustration and help us improve.
[1]: salomon/introduction.md
[2]: anselm/introduction.md
[3]: barbora/introduction.md
[4]: general/applying-for-resources.md
[5]: general/resources-allocation-policy.md#normalized-core-hours-nch