# Xorg
## Introduction
!!! note
Available only for Karolina accelerated nodes acn[01-72] and visualization servers viz[1-2]
Some applications (e.g. ParaView, Ensight, Blender, Ovito) require not only visualization but also computational resources such as multiple cores or multiple graphics accelerators. Processing demanding tasks also requires more operating memory and more memory on the graphics card. These requirements are met by all accelerated nodes on the Karolina cluster, which are equipped with eight graphics cards with 40 GB of GPU memory each and 1 TB of CPU memory. To run properly, the Xorg server must be running and the VirtualGL environment must be installed.
## Xorg
[Xorg][a] is a free and open-source implementation of the X Window System display server maintained by the X.Org Foundation. Client-side implementations of the protocol are available, for example, in the form of Xlib and XCB. While Xorg usually supports 2D hardware acceleration, 3D hardware acceleration is often missing. With hardware 3D acceleration, 3D rendering uses the graphics processor on the graphics card instead of taking up valuable CPU resources when rendering 3D images. It is also referred to as hardware acceleration (as opposed to software acceleration) because without it, the processor is forced to draw everything itself using the [Mesa][c] software rendering libraries, which takes up quite a bit of computing power. The VirtualGL package solves these problems.
## VirtualGL
[VirtualGL][b] is an open source software package that redirects 3D rendering commands from Linux OpenGL applications to 3D accelerator hardware in a dedicated server and sends the rendered output to a client located elsewhere on the network. On the server side, VirtualGL consists of a library that handles the redirection and a wrapper that instructs applications to use the library. Clients can connect to the server either using a remote X11 connection or using an X11 proxy such as a VNC server. In the case of an X11 connection, some VirtualGL software is also required on the client side to receive the rendered graphical output separately from the X11 stream. In the case of VNC connections, no specific client-side software is needed other than the VNC client itself. VirtualGL works seamlessly with [headless][d] NVIDIA GPUs (Ampere, Tesla).
## Running Paraview With GUI and Interactive Job on Karolina
1. Run [VNC environment][1]
1. Run terminal in VNC session:
```console
[loginX.karolina]$ gnome-terminal
```
1. Run interactive job in gnome terminal
```console
[loginX.karolina]$ salloc -A PROJECT-ID -q qgpu --x11 --comment use:xorg=true
```
1. Run Xorg server
```console
[acnX.karolina]$ Xorg :0 &
```
1. Load VirtualGL:
```console
[acnX.karolina]$ ml VirtualGL
```
1. Find number of DISPLAY:
```console
[acnX.karolina]$ echo $DISPLAY
localhost:XX.0 (for ex. localhost:50.0)
```
1. Load ParaView:
```console
[acnX.karolina]$ ml ParaView
```
1. Run ParaView:
```console
[acnX.karolina]$ DISPLAY=:XX vglrun paraview
```
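Optionally, you can verify that rendering really goes through the GPU. This is only a quick sanity check (assuming the `glxinfo` utility is available on the node and `XX` is your VNC display number from the previous steps):
```console
[acnX.karolina]$ DISPLAY=:XX vglrun glxinfo | grep -i "opengl renderer"
```
If the reported renderer string mentions the NVIDIA GPU rather than a software renderer such as llvmpipe, hardware-accelerated rendering via VirtualGL is working.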
!!! note
It is not necessary to run Xorg from the command line on the visualization servers viz[1-2]. Xorg runs without interruption and is started when the visualization server boots.<br> Another option is to use [vglclient][2] on the visualization servers.
## Running Blender (Eevee) on the Background Without GUI and Without Interactive Job on Karolina
1. Download and extract Blender and Eevee scene:
```console
[loginX.karolina]$ wget https://ftp.nluug.nl/pub/graphics/blender/release/Blender2.93/blender-2.93.6-linux-x64.tar.xz ; tar -xvf blender-2.93.6-linux-x64.tar.xz ; wget https://download.blender.org/demo/eevee/mr_elephant/mr_elephant.blend
```
1. Create a running script:
```console
[loginX.karolina]$ echo 'Xorg :0 &' > run_eevee.sh ; echo 'cd' $PWD >> run_eevee.sh ; echo 'DISPLAY=:0 ./blender-2.93.6-linux-x64/blender --factory-startup --enable-autoexec -noaudio --background ./mr_elephant.blend --render-output ./#### --render-frame 0' >> run_eevee.sh ; chmod +x run_eevee.sh
```
1. Run job from terminal:
```console
[loginX.karolina]$ sbatch -A PROJECT-ID -q qcpu --comment use:xorg=true ./run_eevee.sh
```
[1]: ./vnc.md
[2]: ../../../software/viz/vgl.md
[a]: https://www.x.org/wiki/
[b]: https://en.wikipedia.org/wiki/VirtualGL
[c]: https://docs.mesa3d.org/index.html
[d]: https://virtualgl.org/Documentation/HeadlessNV
# PuTTY (Windows)
## Windows PuTTY Installer
We recommend downloading "**A Windows installer for everything except PuTTYtel**" with **Pageant** (SSH authentication agent) and **PuTTYgen** (PuTTY key generator), which is available [here][a].
!!! note
"Pageant" is optional.
"Change Password for Existing Private Key" is optional.
## PuTTY - How to Connect to the IT4Innovations Cluster
* Run PuTTY
* Fill in the _Host Name_ and _Saved Sessions_ fields with the login address, then browse to the Connection - SSH - Auth menu. The _Host Name_ input may be in the format **"username@clustername.it4i.cz"** so that you do not have to type your login each time. In this example, replace the word `cluster` in the `cluster.it4i.cz` address with the name of the cluster you want to connect to.
![](../../../img/PuTTY_host_cluster.png)
* Category - Connection - SSH - Auth:
Select Attempt authentication using Pageant.
Select Allow agent forwarding.
Browse and select your private key file.
![](../../../img/PuTTY_keyV.png)
* Return to Session page and Save selected configuration with _Save_ button.
![](../../../img/PuTTY_save_cluster.png)
* Now you can log in using _Open_ button.
![](../../../img/PuTTY_open_cluster.png)
* Enter your username if the _Host Name_ input is not in the format "username@cluster.it4i.cz".
* Enter passphrase for selected private key file if Pageant **SSH authentication agent is not used.**
## Other PuTTY Settings
* Category - Windows - Translation - Remote character set and select **UTF-8**.
* Category - Terminal - Features and select **Disable application keypad mode** (enable numpad)
* Save your configuration in the Session - Basic options for your PuTTY section with the _Save_ button.
## Pageant SSH Agent
Pageant holds your private key in memory without needing to retype a passphrase on every login.
* Run Pageant.
* On Pageant Key List press _Add key_ and select your private key (id_rsa.ppk).
* Enter your passphrase.
* Now you have your private key in memory without needing to retype a passphrase on every login.
![](../../../img/PageantV.png)
## PuTTY Key Generator
PuTTYgen is the PuTTY key generator. You can load in an existing private key and change your passphrase or generate a new public/private key pair.
### Change Password for Existing Private Key
You can change the password of your SSH key with "PuTTY Key Generator". Make sure to back up the key.
* Load your private key file with _Load_ button.
* Enter your current passphrase.
* Change key passphrase.
* Confirm key passphrase.
* Save your private key with the _Save private key_ button.
![](../../../img/PuttyKeygeneratorV.png)
### Generate a New Public/Private Key
You can generate an additional public/private key pair and insert the public key into the `authorized_keys` file for authentication with your own private key.
* Start with _Generate_ button.
![](../../../img/PuttyKeygenerator_001V.png)
* Generate some randomness.
![](../../../img/PuttyKeygenerator_002V.png)
* Wait.
![](../../../img/PuttyKeygenerator_003V.png)
* Enter a comment for your key using the 'username@organization.example.com' format.
Enter a key passphrase, confirm it and save your new private key in the _ppk_ format.
![](../../../img/PuttyKeygenerator_004V.png)
* Save the public key with the _Save public key_ button.
You can copy the public key out of the ‘Public key for pasting into the authorized_keys file’ box.
![](../../../img/PuttyKeygenerator_005V.png)
* Export the private key in the OpenSSH format "id_rsa" using Conversion - Export OpenSSH key
![](../../../img/PuttyKeygenerator_006V.png)
## Managing Your SSH Key
To manage your SSH key for authentication to clusters, see the [SSH Key Management][3] section.
[3]: ./ssh-key-management.md
[a]: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
# SSH
Secure Shell (SSH) is a cryptographic network protocol for operating network services securely over an unsecured network.
SSH uses a public-private key pair for authentication, allowing users to log in without having to specify a password. The public key is placed on all computers that must allow access to the owner of the matching private key (the private key must be kept **secret**).
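For illustration, on a generic Linux server the public key is typically appended to the `~/.ssh/authorized_keys` file of the target account, e.g. with `ssh-copy-id` (a hedged sketch assuming an Ed25519 key; on IT4I clusters, manage your keys via the web interfaces described in the SSH Key Management section below):
```console
local $ ssh-copy-id -i ~/.ssh/id_ed25519.pub username@cluster.it4i.cz
```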
## Private Key
!!! note
The path to a private key is usually /home/username/.ssh/
A private key file in the `id_rsa` or `*.ppk` format is stored on the local side and used, for example, in the Pageant SSH agent (for Windows users). The private key should always be kept in a safe place.
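OpenSSH also refuses to use a private key that is readable by other users. A minimal sketch of the commonly recommended permissions (assuming an OpenSSH-format key named `id_rsa` stored under `~/.ssh/`):
```console
local $ chmod 700 ~/.ssh
local $ chmod 600 ~/.ssh/id_rsa
```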
### Example of RSA Private Key Format
```console
-----BEGIN RSA PRIVATE KEY-----
MIIEpAIBAAKCAQEAqbo7jokygnBpG2wYa5NB45ns6+UKTNLMLHF0BO3zmRtKEElE
aGqXfbYwvXlcuRb2d9/Y5dVpCZHV0kbY3NhtVOcEIe+1ROaiU9BEsUAhMNEvgiLV
gSql4QvRO4BWPlM8+WAWXDp3oeoBh8glXyuh9teb8yq98fv1r1peYGRrW3/s4V+q
O1SQ0XY2T7rWCYRLIP6rTMXArTI35v3WU513mn7nm1fJ7oN0QgVH5b0W9V1Kyc4l
9vILHeMXxvz+i/5jTEfLOJpiRGYZYcaYrE4dIiHPl3IlbV7hlkK23Xb1US8QJr5G
ADxp1VTkHjY+mKagEfxl1hQIb42JLHhKMEGqNQIDAQABAoIBAQCkypPuxZjL+vai
UGa5dAWiRZ46P2yrwHPKpvEdpCdDPbLAc1K/CtdBkHZsUPxNHVV6eFWweW99giIY
Av+mFWC58X8asBHQ7xkmxW0cqAZRzpkRAl9IBS9/fKjO28Fgy/p+suOi8oWbKIgJ
3LMkX0nnT9oz1AkOfTNC6Tv+3SE7eTj1RPcMjur4W1Cd1N3EljLszdVk4tLxlXBS
yl9NzVnJJbJR4t01l45VfFECgYEAno1WJSB/SwdZvS9GkfhvmZd3r4vyV9Bmo3dn
XZAh8HRW13imOnpklDR4FRe98D9A7V3yh9h60Co4oAUd6N+Oc68/qnv/8O9efA+M
/neI9ANYFo8F0+yFCp4Duj7zPV3aWlN/pd8TNzLqecqh10uZNMy8rAjCxybeZjWd
DyhgywXhAoGBAN3BCazNefYpLbpBQzwes+f2oStvwOYKDqySWsYVXeVgUI+OWTVZ
eZ26Y86E8MQO+q0TIxpwou+TEaUgOSqCX40Q37rGSl9K+rjnboJBYNCmwVp9bfyj
kCLL/3g57nTSqhgHNa1xwemePvgNdn6FZteA8sXiCg5ZzaISqWAffek5AoGBAMPw
V/vwQ96C8E3l1cH5cUbmBCCcfXM2GLv74bb1V3SvCiAKgOrZ8gEgUiQ0+TfcbAbe
7MM20vRNQjaLTBpai/BTbmqM1Q+r1KNjq8k5bfTdAoGANgzlNM9omM10rd9WagL5
yuJcal/03p048mtB4OI4Xr5ZJISHze8fK4jQ5veUT9Vu2Fy/w6QMsuRf+qWeCXR5
RPC2H0JzkS+2uZp8BOHk1iDPqbxWXJE9I57CxBV9C/tfzo2IhtOOcuJ4LY+sw+y/
ocKpJbdLTWrTLdqLHwicdn8OxeWot1mOukyK2l0UeDkY6H5pYPtHTpAZvRBd7ETL
Zs2RP3KFFvho6aIDGrY0wee740/jWotx7fbxxKwPyDRsbH3+1Wx/eX2RND4OGdkH
gejJEzpk/7y/P/hCad7bSDdHZwO+Z03HIRC0E8yQz+JYatrqckaRCtd7cXryTmTR
FbvLJmECgYBDpfno2CzcFJCTdNBZFi34oJRiDb+HdESXepk58PcNcgK3R8PXf+au
OqDBtZIuFv9U1WAg0gzGwt/0Y9u2c8m0nXziUS6AePxy5sBHs7g9C9WeZRz/nCWK
+cHIm7XOwBEzDKz5f9eBqRGipm0skDZNKl8X/5QMTT5K3Eci2n+lTw==
-----END RSA PRIVATE KEY-----
```
### Example of Ed25519 Private Key Format
```console
PuTTY-User-Key-File-3: ssh-ed25519
Encryption: aes256-cbc
Comment: eddsa-key-20240910
Public-Lines: 2
AAAAC3NzaC1lZDI1NTE5AAAAIBKNwqaWU260wueN00nBGRwIqeOedRedtS0T7QVn
h0i2
Key-Derivation: Argon2id
Argon2-Memory: 8192
Argon2-Passes: 21
Argon2-Parallelism: 1
Argon2-Salt: bb64fc32b368aa16d6e8159c8d921f63
Private-Lines: 1
+7StvvEmCMchEy1tUyIMLfGTZBk7dgGUpJEJzNl82qmNZD1TmQOqNmCRiK84P/TL
Private-MAC: dc3f83cef42026a2038f28e96f87367d762e72265621d82e2fe124634ec3c905
```
## Public Key
A public key file in the `*.pub` format is present on the remote side and allows access to the owner of the matching private key.
### Example of RSA Public Key Format
```console
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCpujuOiTKCcGkbbBhrk0Hjmezr5QpM0swscXQE7fOZG0oQSURoapd9tjC9eVy5FvZ339jl1WkJkdXSRtjc2G1U5wQh77VE5qJT0ESxQCEw0S+CItWBKqXhC9E7gFY+UyP5YBZcOneh6gGHyCVfK6H215vzKr3x+/WvWl5gZGtbf+zhX6o4RJDRdjZPutYJhEsg/qtMxcCtMjfm/dZTnXeafuebV8nug3RCBUflvRb1XUrJuiX28gsd4xfG/P6L/mNMR8s4kmJEZhlhxpj8Th0iIc+XciVtXuGWQrbddcVRLxAmvkYAPGnVVOQeNj69pqAR/GXaFAhvjYkseEowQao1 username@organization.example.com
```
### Example of Ed25519 Public Key Format
```console
---- BEGIN SSH2 PUBLIC KEY ----
Comment: "eddsa-key-20240910"
AAAAC3NzaC1lZDI1NTE5AAAAIBKNwqaWU260wueN00nBGRwIqeOedRedtS0T7QVn
h0i2
---- END SSH2 PUBLIC KEY ----
```
## SSH Key Management
You can manage your own SSH key for authentication to clusters:
* [e-INFRA CZ account][3]
* [IT4I account][4]
[1]: ./ssh-keys.md
[2]: ./putty.md
[3]: ../../management/einfracz-profile.md
[4]: ../../management/it4i-profile.md
# OpenSSH Keys (UNIX)
## Creating Your Own Key
To generate a new public/private key pair, use the `ssh-keygen` tool:
```console
local $ ssh-keygen -t ed25519 -C 'username@organization.example.com' -f additional_key
```
!!! note
Enter a **strong** **passphrase** for securing your private key.
If you do not specify a filename with `-f`, the key pair is saved in the `.ssh` directory under a default name based on the key type (e.g. `id_ed25519` and `id_ed25519.pub` for an Ed25519 key, or `id_rsa` and `id_rsa.pub` for an RSA key).
## Adding SSH Key to Linux System SSH Agent
1. Start the SSH agent (if it is not already running):
```
eval "$(ssh-agent -s)"
```
1. Add the key to SSH Agent:
```
ssh-add ~/.ssh/name_of_your_ssh_key_file
```
1. Verify that the key was added to the SSH agent:
```
ssh-add -l
```
## Managing Your SSH Key
To manage your SSH key for authentication to clusters, see the [SSH Key Management][1] section.
[1]: ./ssh-key-management.md
# Tmux
[Tmux][1] is an open-source terminal multiplexer which allows multiple terminal sessions to be accessed simultaneously in a single window. Tmux allows you to switch easily between several programs in one terminal, detach them (they keep running in the background) and reattach them to a different terminal.
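A minimal usage sketch (assuming the default `Ctrl-b` prefix key):
```console
$ tmux new -s mysession      # start a new named session
$ tmux ls                    # list running sessions (after detaching with Ctrl-b d)
$ tmux attach -t mysession   # reattach to the session later
```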
Note that [GNU Screen][2] is not supported, but if you prefer it, you can install it in your `/home` folder:
```console
wget https://ftp.gnu.org/gnu/screen/screen-4.9.0.tar.gz
tar xf screen-4.9.0.tar.gz && rm screen-4.9.0.tar.gz
cd screen-4.9.0
./autogen.sh
./configure --prefix=$HOME/.local/screen
make
make install
mkdir $HOME/.local/screen/etc
cp etc/etcscreenrc $HOME/.local/screen/etc/screenrc
echo "export PATH=\$HOME/.local/screen/bin:\$PATH" >> $HOME/.bashrc
cd ../ && rm -rf screen-4.9.0
```
[1]: https://github.com/tmux/tmux/wiki
[2]: https://www.gnu.org/software/screen/
# VPN Access
## Accessing IT4Innovations Internal Resources via VPN
To access IT4Innovations' resources and licenses, it is necessary to connect to its local network via VPN. IT4Innovations uses the FortiClient VPN software. For the list of supported operating systems, see the [FortiClient Administration Guide][a].
## VPN Client Download
* Windows: Download the FortiClient app from the [official page][g] (Microsoft Store app is not recommended).
* Mac: Download the FortiClient VPN app from the [Apple Store][d].
* Linux: Download the [FortiClient][e] or [OpenFortiVPN][f] app.
## Working With Windows/Mac VPN Client
Before the first login, you must configure the VPN. In the New VPN Connection section, provide the name of your VPN connection and the following settings:
Name | Value
:-------------------|:------------------
VPN | SSL-VPN
Remote Gateway | reconnect.it4i.cz
Port | 443
Client Certificate | None
Optionally, you can describe the VPN connection and select Save Login under Authentication.
!!! Note "Realms"
If you are a member of a partner organization, we may ask you to use a so-called realm in your VPN connection. In the Remote Gateway field, include the realm path after the IP address or hostname. For example, for the realm `excellent`, the field would read `reconnect.it4i.cz:443/excellent`.
![](../../img/fc_vpn_web_login_2_1.png)
Save the settings, enter your login credentials and click Connect.
![](../../img/fc_vpn_web_login_3_1.png)
## Linux Client
The connection will work with the following settings:
Name | Value
:------------|:----------------------
VPN-Server | reconnect.it4i.cz
VPN-Port | 443
Set-Routes | Enabled
Set-DNS | Enabled
DNS Servers | 10.5.8.11, 10.5.8.22
Linux VPN clients need to run as root.
[OpenFortiGUI][c] uses sudo by default; be sure that your user is allowed to use sudo.
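For illustration, a connection using the command-line [OpenFortiVPN][f] client with the settings above might look as follows (a hedged sketch; `your_username` is a placeholder for your IT4I login):
```console
$ sudo openfortivpn reconnect.it4i.cz:443 -u your_username
```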
[a]: http://docs.fortinet.com/document/forticlient/latest/administration-guide/646779/installation-requirements
[c]: https://github.com/theinvisible/openfortigui
[d]: https://apps.apple.com/cz/app/forticlient-vpn/id1475674905?l=cs
[e]: https://www.fortinet.com/support/product-downloads/linux
[f]: https://github.com/adrienverge/openfortivpn
[g]: https://www.fortinet.com/support/product-downloads#vpn
# Get Project
The computational resources of IT4I are allocated by the Allocation Committee via several [allocation mechanisms][a] to a project investigated by a Primary Investigator. By allocating the computational resources, the Allocation Committee authorizes the PI to access and use the clusters. The PI may decide to authorize a number of their collaborators to access and use the clusters to consume the resources allocated to their project. These collaborators will be associated with the project. The figure below depicts the authorization chain:
![](../img/Authorization_chain.png)
**Allocation Mechanisms:**
* Academic researchers may apply via Open Access Competitions.
* Commercial and non-commercial institutions may also apply via the Director's Discretion.
In all cases, IT4Innovations’ access mechanisms are aimed at distributing computational resources while taking into account the development and application of supercomputing methods and their benefits and usefulness for society. The applicants are expected to submit a proposal. In the proposal, the applicants **apply for a particular amount of core-hours** of computational resources. The requested core-hours should be substantiated by scientific excellence of the proposal, its computational maturity and expected impacts. The allocation decision is based on the scientific, technical, and economic evaluation of the proposal.
## Becoming Primary Investigator
Once you create an account, log in to the [IT4I SCS portal][e] and apply for a project.
You will be informed by IT4I about the Allocation Committee decision.
Once approved by the Allocation Committee, you become the Primary Investigator (PI) for the project
and are authorized to use the clusters and any allocated resources as well as authorize collaborators for your project.
### Authorize Collaborators for Your Project
As a PI, you can approve or deny users' requests to join your project. There are two methods of authorizing collaborators:
#### Authorization by Web
This is a preferred method if you have an IT4I or e-INFRA CZ account.
Log in to the [IT4I SCS portal][e] using your credentials and go to the **Authorization Requests** section.
Here you can authorize collaborators for your project.
#### Authorization by Email (An Alternative Approach)
In order to authorize a Collaborator to utilize the allocated resources, the PI should contact the [IT4I support][f] (email: [support\[at\]it4i.cz][g]) and provide the following information:
1. Identify their project by project ID.
1. Provide a list of people, including themselves, who are authorized to use the resources allocated to the project. The list must include the full name, email, and affiliation. If collaborators' login access already exists in the IT4I systems, provide their usernames as well.
1. Include "Authorization to IT4Innovations" into the subject line.
!!! warning
Should the above information be provided by email, the email **must be** digitally signed. Read more on [digital signatures][2].
Example (the subject line must be in English; otherwise, you may use Czech or Slovak for communication with us):
```console
Subject: Authorization to IT4Innovations
Dear support,
Please include my collaborators to project OPEN-0-0.
John Smith, john.smith@myemail.com, Department of Chemistry, MIT, US
Jonas Johansson, jjohansson@otheremail.se, Department of Physics, RIT, Sweden
Luisa Fibonacci, lf@emailitalia.it, Department of Mathematics, National Research Council, Italy
Thank you,
PI
(Digitally signed)
```
!!! note
Web-based email interfaces cannot be used for secure communication; an external application such as Thunderbird or Outlook must be used. This way, your new credentials will be visible only in applications that have access to your certificate.
[2]: https://docs.it4i.cz/general/obtaining-login-credentials/obtaining-login-credentials/#certificates-for-digital-signatures
[a]: https://www.it4i.cz/en/for-users/computing-resources-allocation
[e]: https://scs.it4i.cz
[f]: https://support.it4i.cz/rt/
[g]: mailto:support@it4i.cz
# Acceptable Use Policy
![Acceptable Use Policy](../general/AUP-final.pdf){ type=application/pdf style="min-height:100vh;width:100%" }
---
hide:
- toc
---
# Barbora Partitions
!!! important
Active [project membership][1] is required to run jobs.
Below is the list of partitions available on the Barbora cluster:
| Partition | Project resources | Nodes | Min ncpus | Priority | Authorization | Walltime (def/max) |
| ---------------- | -------------------- | -------------------------- | --------- | -------- | ------------- | ------------------ |
| **qcpu** | > 0 | 190 | 36 | 2 | no | 24 / 48h |
| **qcpu_biz** | > 0 | 190 | 36 | 3 | no | 24 / 48h |
| **qcpu_exp** | < 150% of allocation | 16 | 36 | 4 | no | 1 / 1h |
| **qcpu_free** | < 150% of allocation | 124<br>max 4 per job | 36 | 1 | no | 12 / 18h |
| **qcpu_long** | > 0 | 60<br>max 20 per job | 36 | 2 | no | 72 / 144h |
| **qcpu_preempt** | active Barbora<br>CPU alloc. | 190<br>max 4 per job | 36 | 0 | no | 12 / 12h |
| **qgpu** | > 0 | 8 | 24 | 2 | yes | 24 / 48h |
| **qgpu_biz** | > 0 | 8 | 24 | 3 | yes | 24 / 48h |
| **qgpu_exp** | < 150% of allocation | 4<br>max 1 per job | 24 | 4 | no | 1 / 1h |
| **qgpu_free** | < 150% of allocation | 5<br>max 2 per job | 24 | 1 | no | 12 / 18h |
| **qgpu_preempt** | active Barbora<br>GPU alloc. | 4<br>max 2 per job | 24 | 0 | no | 12 / 12h |
| **qdgx** | > 0 | cn202 | 96 | 2 | yes | 4 / 48h |
| **qviz** | > 0 | 2 with NVIDIA Quadro P6000 | 4 | 2 | no | 1 / 8h |
| **qfat** | > 0 | 1 fat node | 128 | 2 | yes | 24 / 48h |
[1]: access/project-access.md
# Capacity Computing
## Introduction
In many cases, it is useful to submit a huge (>100) number of computational jobs into the Slurm queue system.
A huge number of (small) jobs is one of the most effective ways to execute embarrassingly parallel calculations,
achieving the best runtime, throughput, and computer utilization.
However, executing a huge number of jobs via the Slurm queue may strain the system. This strain may
result in slow response to commands, inefficient scheduling, and overall degradation of performance
and user experience for all users.
[//]: # (For this reason, the number of jobs is **limited to 100 jobs per user, 4,000 jobs and subjobs per user, 1,500 subjobs per job array**.)
!!! note
Follow one of the procedures below if you wish to schedule more than 100 jobs at a time.
You can use [HyperQueue][1] when running a huge number of jobs. HyperQueue can help efficiently
load balance a large number of jobs amongst available computing nodes.
[1]: hyperqueue.md
# Energy Saving
IT4Innovations has implemented a set of energy saving measures on the supercomputing clusters. The measures are selected to minimize the performance impact while achieving a significant reduction in cost, energy, and carbon footprint.
The energy saving measures are effective as of **1.2.2023**.
## Karolina
### Measures
The CPU core and GPU streaming multiprocessors frequency limit is implemented for the Karolina supercomputer:
|Measure | Value |
|---------------------------------------------------------|---------|
|Compute nodes **cn[001-720]**<br> CPU core frequency limit | 2.100 GHz |
|Accelerated compute nodes **acn[001-72]**<br> CPU core frequency limit | 2.600 GHz |
|Accelerated compute nodes **acn[001-72]**<br> GPU SMs frequency limit | 1.290 GHz |
### Performance Impact
The performance impact depends on the [arithmetic intensity][1] of the executed workload.
The [arithmetic intensity][2] is a measure of floating-point operations (FLOPs) performed by a given code (or code section) relative to the amount of memory accesses (Bytes) that are required to support those operations. It is defined as a FLOP per Byte ratio (F/B). Arithmetic intensity is a characteristic of the computational algorithm.
In general, the processor frequency [capping][3] has low performance impact for memory bound computations (arithmetic intensity below the [ridge point][2]). For processor bound computations (arithmetic intensity above the [ridge point][2]), the impact is proportional to the frequency reduction.
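For illustration (a generic textbook example, not an IT4I measurement): a vector update `y[i] = y[i] + a*x[i]` performs 2 FLOPs per element while moving about 24 bytes (loading `x[i]` and `y[i]`, storing `y[i]`), i.e. an arithmetic intensity of roughly 2/24 ≈ 0.08 F/B. Such a kernel lies far below the ridge point and is memory bound, so the frequency cap has practically no effect on its runtime; a dense matrix-matrix multiplication, by contrast, has a high F/B ratio and its runtime scales with the frequency reduction.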
On Karolina, a runtime increase of **up to 16%** is [observed][4] for arithmetically intensive CPU workloads and **up to 10%** for intensive GPU workloads. **No slowdown** is [observed][4] for memory bound workloads.
### Energy Efficiency
The energy efficiency in floating-point operations per energy unit is increased by **up to 30%** for both CPU and GPU workloads. The efficiency depends on the arithmetic intensity; however, energy savings are always achieved.
## Barbora
None implemented yet.
## NVIDIA DGX-2
None implemented yet.
## Complementary Systems
None implemented yet.
[1]: https://en.wikipedia.org/wiki/Roofline_model
[2]: https://dl.acm.org/doi/10.1145/1498765.1498785
[3]: https://slovnik.seznam.cz/preklad/anglicky_cesky/capping
[4]: Energy_saving_Karolina.pdf
# Satisfaction and Feedback
IT4Innovations National Supercomputing Center is interested in [user satisfaction and feedback][1]. It allows us to prioritize and focus on the most pressing issues. With the help of user feedback, we strive to provide a smooth and productive environment where computational tasks may be solved without distraction or annoyance.
## Feedback Form
Please provide us with feedback regarding your satisfaction with our services using [the online form][1]. Set the values and comment on the individual aspects of our services.
We prefer that you enter [**new inputs 3 times a year**][1].
You may view your [feedback history][2] any time.
You are welcome to modify your most recent input.
The form inquires about:
- Resource allocation and access
- Computing environment
- Added value services
You may set the satisfaction score on a **scale of 1 to 5** as well as leave **text comments**.
The score is interpreted as follows:
|Value | Interpretation |
|-----|---|
| 1-2 | Values below 3 indicate a level of dissatisfaction; improvements or other actions are desirable. The values are interpreted as a measure of how deep the dissatisfaction is.|
| 3 | Value 3 indicates a degree of satisfaction. Users are reasonably happy with the environment and services and do not require changes, although there still might be room for improvements. |
| 4-5 | Values above 3 indicate a level of exceptional appreciation and satisfaction; the values are interpreted as a measure of how rewarding the experience is. |
## Feedback Automation
In order to obtain ample feedback data without requiring our users
to spend effort filling out the feedback form, we have implemented automatic data collection.
The automation works as follows:
If the last feedback entry is older than 4 months, a new feedback entry is created as a copy of the last entry.
The new entry is modified in this way:
- score values greater than 3 are decremented by one;
- score values lower than 3 are incremented by one;
- score values equal to 3 are preserved;
- text fields are set blank.
Once a new feedback is created, users are notified by email and invited to [modify the feedback entry][2] as they see fit.
**Rationale:** Feedback automation takes away some effort from a group of moderately satisfied users,
while prompting the users to express satisfaction/dissatisfaction.
We assume that moderately satisfied users (satisfaction value 3) do not require changes to the environment
and tend to remain moderately satisfied in time.
Further, we assume that satisfied users (values 4-5) develop in time towards moderately satisfied (value 3)
by getting accustomed to the provided standards.
The dissatisfied users (values 1-2) also develop towards moderately satisfied due to
gradual improvements implemented by the IT4I.
## Request Tracker Feedback
Please use the [user satisfaction and feedback][1] form to provide your overall view.
For acute, pressing issues and immediate contact, reach out to support via the [Request tracker portal][3] or the [support\[at\]it4i.cz][4] email.
Express your satisfaction with the solution of an individual [Request tracker][3] ticket by selecting the **Feedback** menu on the ticket form.
## Evaluation
The user feedback is evaluated 4 times a year, at the end of March, June, September, and December.
We consider the text comments, as well as evaluate the score average, distribution and trends.
This is done in summary as well as per individual category.
[1]: https://scs.it4i.cz/feedbacks/new
[2]: https://scs.it4i.cz/feedbacks/
[3]: https://support.it4i.cz/rt
[4]: mailto:support@it4i.cz
# HyperQueue
HyperQueue lets you build a computation plan consisting of a large number of tasks and then execute it transparently over a system like Slurm/PBS.
It dynamically groups tasks into Slurm jobs and distributes them to fully utilize allocated nodes.
You thus do not have to manually aggregate your tasks into Slurm jobs.
Find more about HyperQueue in its [documentation][a].
![](../img/hq-idea-s.png)
## Features
* **Transparent task execution on top of a Slurm/PBS cluster**
* Automatic task distribution amongst jobs, nodes, and cores
* Automatic submission of PBS/Slurm jobs
* **Dynamic load balancing across jobs**
* Work-stealing scheduler
* NUMA-aware, core planning, task priorities, task arrays
* Nodes and tasks may be added/removed on the fly
* **Scalable**
* Low overhead per task (~100μs)
* Handles hundreds of nodes and millions of tasks
* Output streaming avoids creating many files on network filesystems
* **Easy deployment**
* Single binary, no installation, depends only on *libc*
* No elevated privileges required
## Installation
* On Barbora and Karolina, you can simply load the HyperQueue module:
```console
$ ml HyperQueue
```
* If you want to install/compile HyperQueue manually, follow the steps on the [official webpage][b].
## Usage
### Starting the Server
To use HyperQueue, you first have to start the HyperQueue server. It is a long-lived process that
is supposed to be running on a login node. You can start it with the following command:
```console
$ hq server start
```
### Submitting Computation
Once the HyperQueue server is running, you can submit jobs into it. Here are a few examples of job submissions.
You can find more information in the [documentation][1].
* Submit a simple job (command `echo 'Hello world'` in this case)
```console
$ hq submit echo 'Hello world'
```
* Submit a job with 10000 tasks
```console
$ hq submit --array 1-10000 my-script.sh
```
Once you start some jobs, you can observe their status using the following commands:
```console
# Display status of a single job
$ hq job <job-id>
# Display status of all jobs
$ hq jobs
```
!!! important
Before the jobs can start executing, you have to provide HyperQueue with some computational resources.
### Providing Computational Resources
Before HyperQueue can execute your jobs, it needs to have access to some computational resources.
You can provide these by starting HyperQueue *workers* which connect to the server and execute your jobs.
The workers should run on computing nodes, therefore they should be started inside Slurm jobs.
There are two ways of providing computational resources.
* **Allocate Slurm jobs automatically**
HyperQueue can automatically submit Slurm jobs with workers on your behalf. This system is called
[automatic allocation][c]. After the server is started, you can add a new automatic allocation
queue using the `hq alloc add` command:
```console
$ hq alloc add slurm -- -A<PROJECT-ID> -p qcpu_exp
```
After you run this command, HQ will automatically start submitting Slurm jobs on your behalf
once some HQ jobs are submitted.
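You can review the state of your automatic allocation queues at any time (a brief example; see the [automatic allocation][c] documentation for the full command reference):
```console
$ hq alloc list
```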
* **Manually start Slurm jobs with HQ workers**
With the following command, you can submit a Slurm job that will start a single HQ worker which
will connect to a running HQ server.
```console
$ salloc <salloc-params> -- /bin/bash -l -c "$(which hq) worker start"
```
!!! tip
For debugging purposes, you can also start the worker e.g. on a login node, simply by running
`$ hq worker start`. Do not use such worker for any long-running computations though!
## Architecture
Here you can see the architecture of HyperQueue.
The user submits jobs into the server which schedules them onto a set of workers running on compute nodes.
![](../img/hq-architecture.png)
[1]: https://it4innovations.github.io/hyperqueue/stable/jobs/jobs/
[a]: https://it4innovations.github.io/hyperqueue/stable/
[b]: https://it4innovations.github.io/hyperqueue/stable/installation/
[c]: https://it4innovations.github.io/hyperqueue/stable/deployment/allocation/
!!!warning
This page has not been updated yet. The page does not reflect the transition from PBS to Slurm.
# Job Arrays
A job array is a compact representation of many jobs called subjobs. Subjobs share the same job script, and have the same values for all attributes and resources, with the following exceptions:
* each subjob has a unique index, $PBS_ARRAY_INDEX
* job Identifiers of subjobs only differ by their indices
* the state of subjobs can differ (R, Q, etc.)
All subjobs within a job array have the same scheduling priority and schedule as independent jobs. An entire job array is submitted through a single `qsub` command and may be managed by `qdel`, `qalter`, `qhold`, `qrls`, and `qsig` commands as a single job.
## Shared Jobscript
All subjobs in a job array use the very same single jobscript. Each subjob runs its own instance of the jobscript. The instances execute different work controlled by the `$PBS_ARRAY_INDEX` variable.
Example:
Assume we have 900 input files, the name of each beginning with "file" (e.g. file001, ..., file900). Assume we would like to use each of these input files with the myprog.x executable, each as a separate job.
First, we create a tasklist file (or subjobs list), listing all tasks (subjobs) - all input files in our example:
```console
$ find . -name 'file*' > tasklist
```
Then we create a jobscript:
```bash
#!/bin/bash
#PBS -A OPEN-00-00
#PBS -q qprod
#PBS -l select=1,walltime=02:00:00
# change to scratch directory
SCRDIR=/scratch/project/${PBS_ACCOUNT,,}/${USER}/${PBS_JOBID}
mkdir -p $SCRDIR
cd $SCRDIR || exit
# get individual tasks from tasklist with index from PBS JOB ARRAY
TASK=$(sed -n "${PBS_ARRAY_INDEX}p" $PBS_O_WORKDIR/tasklist)
# copy input file and executable to scratch
cp $PBS_O_WORKDIR/$TASK input
cp $PBS_O_WORKDIR/myprog.x .
# execute the calculation
./myprog.x < input > output
# copy output file to submit directory
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, the submit directory contains the 900 input files, the myprog.x executable, and the jobscript file. As input for each run, we take the filename of the input file from the created tasklist file. We copy the input file to the scratch directory (`$SCRDIR`), execute myprog.x, and copy the output file back to the submit directory under the `$TASK.out` name. The myprog.x executable runs on one node only and must use threads to run in parallel. Be aware that if myprog.x **is not multithreaded**, then all the **jobs are run as single-thread programs in a sequential manner**. Due to the allocation of the whole node, the accounted time is equal to the usage of the whole node, while using only 1/16 of the node.
If you need to run a huge number of parallel multicore (i.e. multinode multithreaded, e.g. MPI-enabled) jobs, then the job array approach should be used. The main difference, compared to the previous example using one node, is that the local scratch memory should not be used (as it is not shared between nodes) and MPI or other techniques for parallel multinode processing have to be used properly.
## Submitting Job Array
To submit the job array, use the `qsub -J` command. The 900 jobs of the [example above][3] may be submitted like this:
```console
$ qsub -N JOBNAME -J 1-900 jobscript
506493[].isrv5
```
In this example, we submit a job array of 900 subjobs. Each subjob will run on one full node and is assumed to take less than 2 hours (note the #PBS directives at the beginning of the jobscript file; do not forget to set your valid PROJECT_ID and desired queue).
Sometimes for testing purposes, you may need to submit a one-element only array. This is not allowed by PBSPro, but there is a workaround:
```console
$ qsub -N JOBNAME -J 9-10:2 jobscript
```
This will only choose the lower index (9 in this example) for submitting/running your job.
## Managing Job Array
Check status of the job array using the `qstat` command.
```console
$ qstat -a 12345[].dm2
dm2:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -- |---|---| ------ --- --- ------ ----- - -----
12345[].dm2 user2 qprod xx 13516 1 16 -- 00:50 B 00:02
```
When the status is B, it means that some subjobs are already running.
Check the status of the first 100 subjobs using the `qstat` command.
```console
$ qstat -a 12345[1-100].dm2
dm2:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -- |---|---| ------ --- --- ------ ----- - -----
12345[1].dm2 user2 qprod xx 13516 1 16 -- 00:50 R 00:02
12345[2].dm2 user2 qprod xx 13516 1 16 -- 00:50 R 00:02
12345[3].dm2 user2 qprod xx 13516 1 16 -- 00:50 R 00:01
12345[4].dm2 user2 qprod xx 13516 1 16 -- 00:50 Q --
. . . . . . . . . . .
, . . . . . . . . . .
12345[100].dm2 user2 qprod xx 13516 1 16 -- 00:50 Q --
```
Delete the entire job array. Running subjobs will be killed, queued subjobs will be deleted.
```console
$ qdel 12345[].dm2
```
Deleting large job arrays may take a while.
Display status information for all user's jobs, job arrays, and subjobs.
```console
$ qstat -u $USER -t
```
Display status information for all user's subjobs.
```console
$ qstat -u $USER -tJ
```
For more information on job arrays, see the [PBSPro Users guide][1].
## Examples
Download the examples in [capacity.zip][2], illustrating the above listed ways to run a huge number of jobs. We recommend trying out the examples before using this for running production jobs.
Unzip the archive in an empty directory on the cluster and follow the instructions in the README file:
```console
$ unzip capacity.zip
$ cat README
```
[1]: ../pbspro.md
[2]: capacity.zip
[3]: #shared-jobscript
# Job Scheduling
## Job Priority
The scheduler gives each job a priority and then uses this job priority to select which job(s) to run.
Job priority is determined by these job properties (in order of importance):
1. queue priority
1. fair-share priority
1. job age/eligible time
### Queue Priority
Queue priority is the priority of the queue in which the job is waiting prior to execution.
Queue priority has the biggest impact on job priority. The priority of jobs in higher priority queues is always greater than the priority of jobs in lower priority queues. Other properties of jobs used for determining the job priority (fair-share priority, eligible time) cannot compete with queue priority.
Queue priorities can be seen [here][a].
### Fair-Share Priority
Fair-share priority is calculated based on recent usage of resources. Fair-share priority is calculated per project, i.e. all members of a project share the same fair-share priority. Projects with higher recent usage have a lower fair-share priority than projects with lower or no recent usage.
Fair-share priority is used for ranking jobs with equal queue priority.
Usage decays, halving at intervals of 7 days.
### Job Age/Eligible Time
The job age factor represents the length of time a job has been sitting in the queue and eligible to run.
Job age has the least impact on priority.
### Formula
Job priority is calculated as:
---8<--- "job_sort_formula.md"
### Job Backfilling
The scheduler uses job backfilling.
Backfilling means fitting smaller jobs around the higher-priority jobs that the scheduler is going to run next, in such a way that the higher-priority jobs are not delayed. Backfilling allows us to keep resources from becoming idle when the top job (the job with the highest priority) cannot run.
The scheduler makes a list of jobs to run in order of priority. The scheduler looks for smaller jobs that can fit into the usage gaps around the highest-priority jobs in the list. The scheduler looks in the prioritized list of jobs and chooses the highest-priority smaller jobs that fit. Filler jobs are run only if they will not delay the start time of top jobs.
This means that jobs with lower priority can be run before jobs with higher priority.
!!! note
It is **very beneficial to specify the timelimit** when submitting jobs.
Specifying a more accurate timelimit enables better scheduling, shorter queue times, and better resource usage. Jobs with a suitable (short) timelimit can be backfilled and may overtake job(s) with a higher priority, as shown in the example below.
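For example (a minimal sketch using standard Slurm options; `myjob.sh` is a placeholder for your job script), a job expected to finish within two hours can request exactly that:
```console
$ sbatch -A PROJECT-ID -p qcpu --time=02:00:00 myjob.sh
```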
---8<--- "mathjax.md"
## Technical Details
Priorities are set using Slurm's [Multifactor Priority Plugin][1]. Current settings are as follows:
```
$ grep ^Priority /etc/slurm/slurm.conf
PriorityFlags=DEPTH_OBLIVIOUS
PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityMaxAge=14-0
PriorityWeightAge=100000
PriorityWeightFairshare=10000000
PriorityWeightPartition=1000000000
```
## Inspecting Job Priority
You can inspect job priority using the `sprio` command. The job priority is shown in the PRIORITY field and comprises the PARTITION, FAIRSHARE, and AGE priorities.
```
$ sprio -l -j 894782
JOBID PARTITION USER ACCOUNT PRIORITY SITE AGE ASSOC FAIRSHARE JOBSIZE PARTITION QOSNAME QOS NICE TRES
894782 qgpu user1 service 300026688 0 17 0 26671 0 300000000 normal 0 0
```
[1]: https://slurm.schedmd.com/priority_multifactor.html
[a]: https://extranet.it4i.cz/rsweb/karolina/queues
!!!warning
This page has not been updated yet. The page does not reflect the transition from PBS to Slurm.
# Parallel Runs Setting on Karolina
An important aspect of each parallel application is the correct placement of MPI processes
or threads on the available hardware resources.
Since incorrect settings can cause a significant degradation of performance,
all users should be familiar with the basic principles explained below.
First, a basic [hardware overview][1] is provided, since it influences the settings of the `mpirun` command.
Then, placement is explained for the major MPI implementations [Intel MPI][2] and [Open MPI][3].
The last section describes appropriate placement for [memory bound][4] and [compute bound][5] applications.
## Hardware Overview
[Karolina][6] contains several types of nodes.
This section describes the basic hardware structure of the universal and accelerated nodes.
More technical details can be found in [this presentation][a].
### Universal Nodes
- 720 x 2 x AMD 7H12, 64 cores, 2.6 GHz
<table>
<tr>
<td rowspan="8">universal<br/>node</td>
<td rowspan="4">socket 0<br/> AMD 7H12</td>
<td>NUMA 0</td>
<td>2 x ch DDR4-3200</td>
<td>4 x 16MB L3</td>
<td>16 cores (4 cores / L3)</td>
</tr>
<tr>
<td>NUMA 1</td>
<td>2 x ch DDR4-3200</td>
<td>4 x 16MB L3</td>
<td>16 cores (4 cores / L3)</td>
</tr>
<tr>
<td>NUMA 2</td>
<td>2 x ch DDR4-3200</td>
<td>4 x 16MB L3</td>
<td>16 cores (4 cores / L3)</td>
</tr>
<tr>
<td>NUMA 3</td>
<td>2 x ch DDR4-3200</td>
<td>4 x 16MB L3</td>
<td>16 cores (4 cores / L3)</td>
</tr>
<tr>
<td rowspan="4">socket 1<br/> AMD 7H12</td>
<td>NUMA 4</td>
<td>2 x ch DDR4-3200</td>
<td>4 x 16MB L3</td>
<td>16 cores (4 cores / L3)</td>
</tr>
<tr>
<td>NUMA 5</td>
<td>2 x ch DDR4-3200</td>
<td>4 x 16MB L3</td>
<td>16 cores (4 cores / L3)</td>
</tr>
<tr>
<td>NUMA 6</td>
<td>2 x ch DDR4-3200</td>
<td>4 x 16MB L3</td>
<td>16 cores (4 cores / L3)</td>
</tr>
<tr>
<td>NUMA 7</td>
<td>2 x ch DDR4-3200</td>
<td>4 x 16MB L3</td>
<td>16 cores (4 cores / L3)</td>
</tr>
</table>
### Accelerated Nodes
- 72 x 2 x AMD 7763, 64 cores, 2.45 GHz
- 72 x 8 x NVIDIA A100 GPU
<table>
<tr>
<td rowspan="8">accelerated<br/>node</td>
<td rowspan="4">socket 0<br/> AMD 7763</td>
<td>NUMA 0</td>
<td>2 x ch DDR4-3200</td>
<td>2 x 32MB L3</td>
<td>16 cores (8 cores / L3)</td>
<td></td>
</tr>
<tr>
<td>NUMA 1</td>
<td>2 x ch DDR4-3200</td>
<td>2 x 32MB L3</td>
<td>16 cores (8 cores / L3)</td>
<td>2 x A100 </td>
</tr>
<tr>
<td>NUMA 2</td>
<td>2 x ch DDR4-3200</td>
<td>2 x 32MB L3</td>
<td>16 cores (8 cores / L3)</td>
<td></td>
</tr>
<tr>
<td>NUMA 3</td>
<td>2 x ch DDR4-3200</td>
<td>2 x 32MB L3</td>
<td>16 cores (8 cores / L3)</td>
<td>2 x A100 </td>
</tr>
<tr>
<td rowspan="4">socket 1<br/> AMD 7763</td>
<td>NUMA 4</td>
<td>2 x ch DDR4-3200</td>
<td>2 x 32MB L3</td>
<td>16 cores (8 cores / L3)</td>
<td></td>
</tr>
<tr>
<td>NUMA 5</td>
<td>2 x ch DDR4-3200</td>
<td>2 x 32MB L3</td>
<td>16 cores (8 cores / L3)</td>
<td>2 x A100 </td>
</tr>
<tr>
<td>NUMA 6</td>
<td>2 x ch DDR4-3200</td>
<td>2 x 32MB L3</td>
<td>16 cores (8 cores / L3)</td>
<td></td>
</tr>
<tr>
<td>NUMA 7</td>
<td>2 x ch DDR4-3200</td>
<td>2 x 32MB L3</td>
<td>16 cores (8 cores / L3)</td>
<td>2 x A100 </td>
</tr>
</table>
## Assigning Processes / Threads to Particular Hardware
When an application is started, the operating system maps MPI processes and threads to particular cores.
This mapping is not fixed as the system is allowed to move your application to other cores.
Inappropriate mapping or frequent migration between cores can lead to a significant degradation
of your application's performance.
Hence, a user should:
- set **mapping** according to their application needs;
- **pin** the application to particular hardware resources.
Settings can be described by environment variables that are briefly described on the [HPC wiki][b].
However, mapping and pinning are highly non-portable.
They depend on the particular system and the MPI library used.
The following sections describe settings for the Karolina cluster.
The number of MPI processes per node should be set by PBS via the [`qsub`][7] command.
Mapping and pinning are set for [Intel MPI](#intel-mpi) and [Open MPI](#open-mpi) differently.
## Open MPI
In the case of Open MPI, mapping can be set by the parameter `--map-by`.
Pinning can be set by the parameter `--bind-to`.
The list of all available options can be found [here](https://www-lb.open-mpi.org/doc/v4.1/man1/mpirun.1.php#sect6).
The most relevant options are:
- bind-to: core, l3cache, numa, socket
- map-by: core, l3cache, numa, socket, slot
Mapping and pinning to, for example, L3 cache can be set by the `mpirun` command in the following way:
```
mpirun -n 32 --map-by l3cache --bind-to l3cache ./app
```
Both parameters can also be set by environment variables:
```
export OMPI_MCA_rmaps_base_mapping_policy=l3cache
export OMPI_MCA_hwloc_base_binding_policy=l3cache
mpirun -n 32 ./app
```
## Intel MPI
In the case of Intel MPI, mapping and pinning can be set by environment variables
that are described [on Intel's Developer Reference][c].
The most important variable is `I_MPI_PIN_DOMAIN`.
It denotes the number of cores allocated for each MPI process
and specifies both mapping and pinning.
The default setting is `I_MPI_PIN_DOMAIN=auto:compact`.
It computes the number of cores allocated to each MPI process
from the number of available cores and the requested number of MPI processes
(total cores / requested MPI processes).
This is usually the optimal setting, and the majority of applications can be run
with the simple `mpirun -n N ./app` command, where `N` denotes the number of MPI processes.
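If the default does not fit your application, you can set the domain explicitly, for example one MPI process per NUMA domain (a brief sketch; `./app` stands for your binary):
```console
$ export I_MPI_PIN_DOMAIN=numa
$ mpirun -n 8 ./app
```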
### Examples of Placement to Different Hardware
Let us have a job allocated by the following `qsub`:
```console
qsub -lselect=2,nprocs=128,mpiprocs=4,ompthreads=4
```
Then the following table shows placement of `app` started
with 8 MPI processes on the universal node for various mappings and pinnings:
<table style="text-align: center">
<tr>
<th style="text-align: center" colspan="2">Open MPI</th>
<th style="text-align: center">Intel MPI</th>
<th style="text-align: center">node</th>
<th style="text-align: center" colspan="4">0</th>
<th style="text-align: center" colspan="4">1</th>
</tr>
<tr>
<th style="text-align: center">map-by</th>
<th style="text-align: center">bind-to</th>
<th style="text-align: center">I_MPI_PIN_DOMAIN</th>
<th style="text-align: center">rank</th>
<th style="text-align: center">0</th>
<th style="text-align: center">1</th>
<th style="text-align: center">2</th>
<th style="text-align: center">3</th>
<th style="text-align: center">4</th>
<th style="text-align: center">5</th>
<th style="text-align: center">6</th>
<th style="text-align: center">7</th>
</tr>
<tr>
<td rowspan="3">socket</td>
<td rowspan="3">socket</td>
<td rowspan="3">socket</td>
<td>socket</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>numa</td>
<td>0-3</td>
<td>4-7</td>
<td>0-3</td>
<td>4-7</td>
<td>0-3</td>
<td>4-7</td>
<td>0-3</td>
<td>4-7</td>
</tr>
<tr>
<td>cores</td>
<td>0-63</td>
<td>64-127</td>
<td>0-63</td>
<td>64-127</td>
<td>0-63</td>
<td>64-127</td>
<td>0-63</td>
<td>64-127</td>
</tr>
<tr>
<td rowspan="3">numa</td>
<td rowspan="3">numa</td>
<td rowspan="3">numa</td>
<td>socket</td>
<td colspan="4">0</td>
<td colspan="4">0</td>
</tr>
<tr>
<td>numa</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr>
<td>cores</td>
<td>0-15</td>
<td>16-31</td>
<td>32-47</td>
<td>48-63</td>
<td>0-15</td>
<td>16-31</td>
<td>32-47</td>
<td>48-63</td>
</tr>
<tr>
<td rowspan="3">l3cache</td>
<td rowspan="3">l3cache</td>
<td rowspan="3"><s>cache3</s></td>
<td>socket</td>
<td colspan="4">0</td>
<td colspan="4">0</td>
</tr>
<tr>
<td>numa</td>
<td colspan="4">0</td>
<td colspan="4">0</td>
</tr>
<tr>
<td>cores</td>
<td>0-3</td>
<td>4-7</td>
<td>8-11</td>
<td>12-15</td>
<td>0-3</td>
<td>4-7</td>
<td>8-11</td>
<td>12-15</td>
</tr>
<tr>
<td rowspan="3">slot:pe=32</td>
<td rowspan="3">core</td>
<td rowspan="3">32</td>
<td>socket</td>
<td colspan="2">0</td>
<td colspan="2">1</td>
<td colspan="2">0</td>
<td colspan="2">1</td>
</tr>
<tr>
<td>numa</td>
<td>0-1</td>
<td>2-3</td>
<td>4-5</td>
<td>6-7</td>
<td>0-1</td>
<td>2-3</td>
<td>4-5</td>
<td>6-7</td>
</tr>
<tr>
<td>cores</td>
<td>0-31</td>
<td>32-63</td>
<td>64-95</td>
<td>96-127</td>
<td>0-31</td>
<td>32-63</td>
<td>64-95</td>
<td>96-127</td>
</tr>
</table>
We can see from the above table that mapping starts from the first node.
When the first node is fully occupied
(according to the number of MPI processes per node specified by `qsub`),
mapping continues to the second node, etc.
We note that in the case of `--map-by numa` and `--map-by l3cache`,
the application is not spawned across the whole node.
To utilize a whole node, more MPI processes per node should be used.
In addition, `I_MPI_PIN_DOMAIN=cache3` maps processes incorrectly.
The last mapping (`--map-by slot:pe=32` or `I_MPI_PIN_DOMAIN=32`) is the most general one.
In this way, a user can directly specify the number of cores for each MPI process
independently of the hardware specification.
## Memory Bound Applications
The performance of memory bound applications depends on the memory throughput.
Hence, it is optimal to use a number of cores equal to the number of memory channels;
i.e., 16 cores per node (see the tables with the hardware description at the top of this document).
Running your memory bound application on more than 16 cores can result in lower performance.
Two MPI processes must be assigned to each NUMA domain in order to fully utilize the memory bandwidth.
It can be achieved by the following commands (for a single node):
- Intel MPI: `mpirun -n 16 ./app`
- Open MPI: `mpirun -n 16 --map-by slot:pe=8 ./app`
Intel MPI automatically places MPI processes on every 8th core.
In the case of Open MPI, the `--map-by` parameter must be used.
The required mapping can be achieved, for example, by `--map-by slot:pe=8`,
which maps MPI processes to every 8th core (in the same way as Intel MPI).
This mapping also ensures that each MPI process is assigned to a different L3 cache.
## Compute Bound Applications
For compute bound applications, it is optimal to use as many cores as possible, i.e. 128 cores per node.
The following command can be used:
- Intel MPI: `mpirun -n 128 ./app`
- Open MPI: `mpirun -n 128 --map-by core --bind-to core ./app`
Pinning ensures that the operating system does not migrate MPI processes between cores.
## Finding Optimal Setting for Your Application
Sometimes it is not clear what the best setting for your application is.
In that case, you should test your application with different numbers of MPI processes.
A good practice is to test your application with 16-128 MPI processes per node
and measure the time required to finish the computation.
With Intel MPI, it is enough to start your application with the required number of MPI processes.
For Open MPI, you can specify mapping in the following way:
```
mpirun -n 16 --map-by slot:pe=8 --bind-to core ./app
mpirun -n 32 --map-by slot:pe=4 --bind-to core ./app
mpirun -n 64 --map-by slot:pe=2 --bind-to core ./app
mpirun -n 128 --map-by core --bind-to core ./app
```
[1]: #hardware-overview
[2]: #intel-mpi
[3]: #open-mpi
[4]: #memory-bound-applications
[5]: #compute-bound-applications
[6]: ../karolina/introduction.md
[7]: job-submission-and-execution.md
[a]: https://events.it4i.cz/event/123/attachments/417/1578/Technical%20features%20and%20the%20use%20of%20Karolina%20GPU%20accelerated%20partition.pdf
[b]: https://hpc-wiki.info/hpc/Binding/Pinning
[c]: https://www.intel.com/content/www/us/en/develop/documentation/mpi-developer-reference-linux/top/environment-variable-reference.html
---
hide:
- toc
---
# Karolina Partitions
!!! important
Active [project membership][1] is required to run jobs.
Below is the list of partitions available on the Karolina cluster:
| Partition | Project resources | Nodes | Min ncpus | Priority | Authorization | Walltime (def/max) |
| ---------------- | -------------------- | --------------------------------------------------------- | ----------- | -------- | ------------- | ------------------ |
| **qcpu** | > 0 | 720 | 128 | 2 | no | 24 / 48h |
| **qcpu_biz** | > 0 | 720 | 128 | 3 | no | 24 / 48h |
| **qcpu_exp** | < 150% of allocation | 720<br>max 2 per user | 128 | 4 | no | 1 / 1h |
| **qcpu_free** | < 150% of allocation | 720<br>max 4 per job | 128 | 1 | no | 12 / 18h |
| **qcpu_long** | > 0 | 200<br>max 20 per job, only non-accelerated nodes allowed | 128 | 2 | no | 72 / 144h |
| **qcpu_preempt** | active Karolina<br> CPU alloc. | 720<br>max 4 per job | 128 | 0 | no | 12 / 12h |
| **qgpu** | > 0 | 72<br>max 16 per job | 16<br>1 gpu | 3 | yes | 24 / 48h |
| **qgpu_big** | > 0 | 72<br>max 64 per job | 128 | 2 | yes | 12 / 12h |
| **qgpu_biz** | > 0 | 72<br>max 16 per job | 128 | 4 | yes | 24 / 48h |
| **qgpu_exp** | < 150% of allocation | 4<br>max 1 per job | 16<br>1 gpu | 5 | no | 1 / 1h |
| **qgpu_free** | < 150% of allocation | 46<br>max 2 per job | 16<br>1 gpu | 1 | no | 12 / 18h |
| **qgpu_preempt** | active Karolina<br> GPU alloc. | 72<br>max 2 per job | 16<br>1 gpu | 0 | no | 12 / 12h |
| **qviz** | > 0 | 2 with NVIDIA® Quadro RTX™ 6000 | 8 | 2 | no | 1 / 8h |
| **qfat** | > 0 | 1 (sdf1) | 24 | 2 | yes | 24 / 48h |
[1]: access/project-access.md
# Karolina - Job Submission and Execution
## Introduction
[Slurm][1] workload manager is used to allocate and access Karolina cluster's resources.
This page describes Karolina cluster's specific Slurm settings and usage.
General information about Slurm usage at IT4Innovations can be found at [Slurm Job Submission and Execution][2].
## Partition Information
Partitions/queues on the system:
```console
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
qcpu* up 2-00:00:00 1/717/0/718 cn[001-718]
qcpu_biz up 2-00:00:00 1/717/0/718 cn[001-718]
qcpu_exp up 1:00:00 1/719/0/720 cn[001-720]
qcpu_free up 18:00:00 1/717/0/718 cn[001-718]
qcpu_long up 6-00:00:00 1/717/0/718 cn[001-718]
qcpu_preempt up 12:00:00 1/717/0/718 cn[001-718]
qgpu up 2-00:00:00 0/70/0/70 acn[01-70]
qgpu_big up 12:00:00 71/1/0/72 acn[01-72]
qgpu_biz up 2-00:00:00 0/70/0/70 acn[01-70]
qgpu_exp up 1:00:00 0/72/0/72 acn[01-72]
qgpu_free up 18:00:00 0/70/0/70 acn[01-70]
qgpu_preempt up 12:00:00 0/70/0/70 acn[01-70]
qfat up 2-00:00:00 0/1/0/1 sdf1
qviz up 8:00:00 0/2/0/2 viz[1-2]
```
For more information about Karolina's queues, see [this page][8].
A graphical representation of cluster usage, partitions, nodes, and jobs can be found
at [https://extranet.it4i.cz/rsweb/karolina][3].
On the Karolina cluster:
* All CPU queues/partitions provide full node allocation; whole nodes (all node resources) are allocated to a job.
* Other queues/partitions (gpu, fat, viz) provide partial node allocation. A job's resources (CPU, memory) are separated and dedicated to that job.
!!! important "Partial node allocation and security"
Division of nodes means that if two users allocate a portion of the same node, they can see each other's running processes.
If this solution is inconvenient for you, consider allocating a whole node.
IT4I clusters are monitored for resource utilization.
One of the monitoring daemons uses registers to collect performance
monitoring counters (PMC), which a user may need when analyzing the performance
of the executed application (e.g. with the perf or [Score-P][10] profiling tools).
To deactivate the daemon and release the respective registers, set the job feature
during allocation, as specified [here][9].
## Using CPU Queues
Access [standard compute nodes][4].
Whole nodes are allocated. Use the `--nodes` option to specify the number of requested nodes.
There is no need to specify the number of cores and memory size.
```console
#!/usr/bin/bash
#SBATCH --job-name MyJobName
#SBATCH --account PROJECT-ID
#SBATCH --partition qcpu
#SBATCH --time 12:00:00
#SBATCH --nodes 8
...
```
## Using GPU Queues
!!! important "Nodes per job limit"
Because we are still in the process of fine-tuning and setting optimal parameters for SLURM,
we have temporarily limited the maximum number of nodes per job on `qgpu` and `qgpu_biz` to **16**.
Access [GPU accelerated nodes][5].
Every GPU accelerated node is divided into eight parts; each part contains one GPU, 16 CPU cores, and the corresponding memory.
By default, only one part, i.e. 1/8 of the node (one GPU and the corresponding CPU cores and memory), is allocated.
There is no need to specify the number of cores and memory size; on the contrary, it is undesirable.
Some restrictions are in place to ensure a fair division and efficient use of node resources.
```console
#!/usr/bin/bash
#SBATCH --job-name MyJobName
#SBATCH --account PROJECT-ID
#SBATCH --partition qgpu
#SBATCH --time 12:00:00
...
```
To allocate more GPUs, use the `--gpus` option.
The default behavior is to allocate enough nodes to satisfy the requested resources as expressed by the `--gpus` option, without delaying the initiation of the job.
The following code requests four GPUs; the scheduler can allocate from one up to four nodes, depending on the actual cluster state (i.e. GPU availability), to fulfil the request.
```console
#SBATCH --gpus 4
```
The following code requests 16 GPUs; the scheduler can allocate from two up to sixteen nodes, depending on the actual cluster state (i.e. GPU availability), to fulfil the request.
```console
#SBATCH --gpus 16
```
To allocate GPUs within one node, you have to specify the `--nodes` option.
The following code requests four GPUs on exactly one node:
```console
#SBATCH --gpus 4
#SBATCH --nodes 1
```
The following code requests 16 GPUs on exactly two nodes.
```console
#SBATCH --gpus 16
#SBATCH --nodes 2
```
Alternatively, you can use the `--gpus-per-node` option.
Only value 8 is allowed for multi-node allocation to prevent fragmenting nodes.
The following code requests 16 GPUs on exactly two nodes.
```console
#SBATCH --gpus-per-node 8
#SBATCH --nodes 2
```
## Using Fat Queue
Access [data analytics aka fat node][6].
The fat node is divided into 32 parts; each part contains one socket/processor (24 cores) and the corresponding memory.
By default, only one part, i.e. 1/32 of the node (one processor and the corresponding memory), is allocated.
To allocate the requested memory, use the `--mem` option;
the corresponding CPUs will be allocated.
The fat node has about 22.5 TB of memory available for jobs.
```console
#!/usr/bin/bash
#SBATCH --job-name MyJobName
#SBATCH --account PROJECT-ID
#SBATCH --partition qfat
#SBATCH --time 2:00:00
#SBATCH --mem 6TB
...
```
You can also specify CPU-oriented options (like `--cpus-per-task`); the appropriate memory will then be allocated to the job, as in the sketch below.
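For example (a hedged sketch), requesting 48 cores allocates the corresponding share of the node's memory:
```console
#SBATCH --partition qfat
#SBATCH --cpus-per-task 48
```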
To allocate a whole fat node, use the `--exclusive` option:
```console
#SBATCH --exclusive
```
## Using Viz Queue
Access [visualization nodes][7].
Every visualization node is divided into eight parts.
By default, only one part, i.e. 1/8 of the node, is allocated.
```console
$ salloc -A PROJECT-ID -p qviz
```
To allocate a whole visualization node, use the `--exclusive` option:
```console
$ salloc -A PROJECT-ID -p qviz --exclusive
```
[1]: https://slurm.schedmd.com/
[2]: /general/slurm-job-submission-and-execution
[3]: https://extranet.it4i.cz/rsweb/karolina
[4]: /karolina/compute-nodes/#compute-nodes-without-accelerators
[5]: /karolina/compute-nodes/#compute-nodes-with-a-gpu-accelerator
[6]: /karolina/compute-nodes/#data-analytics-compute-node
[7]: /karolina/visualization/
[8]: ./karolina-partitions.md
[9]: /job-features/#cluster-monitoring
[10]: /software/debuggers/score-p/