# Change e-INFRA CZ Profile Settings
To change the settings of your e-INFRA CZ profile go to:<br>[https://profile.e-infra.cz/][1]
## Change Password
To change your e-INFRA CZ account password, go to:<br>[https://profile.e-infra.cz/profile/settings/passwordReset][2]
## Change SSH Key
To change SSH key(s) associated with your e-INFRA CZ account, go to:<br>[https://profile.e-infra.cz/profile/settings/sshKeys][3]
[1]: https://profile.e-infra.cz/profile
[2]: https://profile.e-infra.cz/profile/settings/passwordReset
[3]: https://profile.e-infra.cz/profile/settings/sshKeys
# Change IT4I Account Settings
## Change Password
To change your IT4I account password, go to:<br>[https://extranet.it4i.cz/ssp/][2]
## Change SSH Key
To change SSH key(s) associated with your IT4I account, go to:<br>[https://extranet.it4i.cz/ssp/?action=changesshkey][3]
[1]: https://scs.it4i.cz/
[2]: https://extranet.it4i.cz/ssp/
[3]: https://extranet.it4i.cz/ssp/?action=changesshkey
# Certificates FAQ
FAQ about certificates in general.
## Q: What Are Certificates?
IT4Innovations employs X.509 certificates for secure communication (e.g. credentials exchange) and for grid services related to PRACE, as they present a single method of authentication for all PRACE services, where only one password is required.
There are different kinds of certificates, each with a different scope of use. We mention here:
* User (Private) certificates
* Certificate Authority (CA) certificates
* Host certificates
* Service certificates
However, users only need to manage User and CA certificates. Note that your user certificate is protected by an associated private key, and this **private key must never be disclosed**.
## Q: Which X.509 Certificates Are Recognized by IT4Innovations?
See the [Certificates for Digital Signatures][1] section.
## Q: How Do I Get a User Certificate That Can Be Used With IT4Innovations?
To get a certificate, you must make a request to your local, IGTF approved Certificate Authority (CA). Then, you must usually visit, in person, your nearest Registration Authority (RA) to verify your affiliation and identity (photo identification is required). Usually, you will then be emailed details on how to retrieve your certificate, although procedures can vary between CAs. If you are in Europe, you can locate [your trusted CA][a].
In some countries, certificates can also be retrieved using the TERENA Certificate Service, see the FAQ below for the link.
## Q: Does IT4Innovations Support Short Lived Certificates (SLCS)?
Yes, if the CA which provides this service is also a member of IGTF.
## Q: Does IT4Innovations Support the TERENA Certificate Service?
Yes, IT4Innovations supports TERENA eScience personal certificates. For more information, visit [TCS - Trusted Certificate Service][b], where you can also find if your organization/country can use this service.
## Q: What Format Should My Certificate Take?
User certificates come in many formats, the three most common being the PKCS12, PEM, and JKS formats.
The PKCS12 (often abbreviated to ’p12’) format stores your user certificate, along with your associated private key, in a single file. This form of your certificate is typically employed by web browsers, mail clients, and grid services like UNICORE, DART, gsissh-term, and Globus toolkit (GSI-SSH, GridFTP, and GRAM5).
The PEM format (`*.pem`) stores your user certificate and your associated private key in two separate files. This form of your certificate can be used by PRACE’s gsissh-term and with the grid related services like Globus toolkit (GSI-SSH, GridFTP, and GRAM5).
To convert your Certificate from PEM to p12 formats and _vice versa_, IT4Innovations recommends using the OpenSSL tool (see the [separate FAQ entry][2]).
JKS is the Java KeyStore and may contain both your personal certificate with your private key and a list of your trusted CA certificates. This form of your certificate can be used by grid services like DART and UNICORE6.
To convert your certificate from p12 to JKS, IT4Innovations recommends using the keytool utility (see the [separate FAQ entry][3]).
## Q: What Are CA Certificates?
Certification Authority (CA) certificates are used to verify the link between your user certificate and the issuing authority. They are also used to verify the link between the host certificate of an IT4Innovations server and the CA that issued the certificate. In essence, they establish a chain of trust between you and the target server. Thus, for some grid services, users must have a copy of all the CA certificates.
To assist users, SURFsara (a member of PRACE) provides a complete and up-to-date bundle of all the CA certificates that any PRACE user (or IT4Innovations grid services user) will require. Bundles of certificates, in either p12, PEM, or JKS format, are [available here][c].
It is worth noting that gsissh-term and DART automatically update their CA certificates from this SURFsara website. In other cases, if you receive a warning that a server’s certificate cannot be validated (not trusted), update your CA certificates via the SURFsara website. If this fails, contact the IT4Innovations helpdesk.
Lastly, if you need the CA certificates for a personal Globus 5 installation, you can install the CA certificates from a MyProxy server with the following command:
```console
myproxy-get-trustroots -s myproxy-prace.lrz.de
```
If you run this command as `root`, it will install the certificates into `/etc/grid-security/certificates`. Otherwise, the certificates will be installed into `$HOME/.globus/certificates`. For Globus, you can download the `globuscerts.tar.gz` package [available here][c].
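As a sketch (assuming the bundle is published as `globuscerts.tar.gz` at the SURFsara address referenced above; the exact layout inside the archive may differ), downloading and unpacking into your personal Globus directory could look like this:
```console
wget https://winnetou.surfsara.nl/prace/certs/globuscerts.tar.gz
mkdir -p $HOME/.globus/certificates
tar -xzf globuscerts.tar.gz -C $HOME/.globus/certificates    # unpack the CA bundle
```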
## Q: What Is a DN and How Do I Find Mine?
DN stands for Distinguished Name and is a part of your user certificate. IT4Innovations needs to know your DN to enable your account to use the grid services. You may use OpenSSL (see [below][2]) to determine your DN or, if your browser contains your user certificate, you can extract your DN from your browser.
For Internet Explorer users, the DN is referred to as the "subject" of your certificate: Tools > Internet Options > Content > Certificates > View > Details > Subject.
For users running Firefox under Windows, the DN is referred to as the "subject" of your certificate: Tools > Options > Advanced > Encryption > View Certificates. Highlight your name and then click View > Details > Subject.
## Q: How Do I Use the Openssl Tool?
The following examples are for Unix/Linux operating systems only.
To convert from PEM to p12, enter the following command:
```console
openssl pkcs12 -export -in usercert.pem -inkey userkey.pem -out username.p12
```
To convert from p12 to PEM, type the following _four_ commands:
```console
openssl pkcs12 -in username.p12 -out usercert.pem -clcerts -nokeys
openssl pkcs12 -in username.p12 -out userkey.pem -nocerts
chmod 444 usercert.pem
chmod 400 userkey.pem
```
To check your Distinguished Name (DN), enter the following command:
```console
openssl x509 -in usercert.pem -noout -subject -nameopt RFC2253
```
To check your certificate (e.g. DN, validity, issuer, public key algorithm, etc.), enter the following command:
```console
openssl x509 -in usercert.pem -text -noout
```
If OpenSSL is not pre-installed, you can download it [here][d]. On Mac OS X, OpenSSL is pre-installed and can be used immediately.
## Q: How Do I Create and Then Manage a Keystore?
IT4Innovations recommends the Java-based keytool utility to create and manage keystores, which themselves are stores of keys and certificates. For example, to convert your pkcs12-formatted key pair into a Java keystore, use the following command:
```console
keytool -importkeystore -srckeystore $my_p12_cert -destkeystore $my_keystore \
  -srcstoretype pkcs12 -deststoretype jks -alias $my_nickname -destalias $my_nickname
```
where `$my_p12_cert` is the name of your p12 (pkcs12) certificate, `$my_keystore` is the name that you give to your new java keystore and `$my_nickname` is the alias name that the p12 certificate was given and is also used for the new keystore.
You can also import CA certificates into your Java keystore with the tool, for example:
```console
keytool -import -trustcacerts -alias $mydomain -file $mydomain.crt -keystore $my_keystore
```
where `$mydomain.crt` is the certificate of a trusted signing authority (CA) and `$mydomain` is the alias name that you give to the entry.
More information on the tool can be found [here][e].
## Q: How Do I Use My Certificate to Access Different Grid Services?
Most grid services require the use of your certificate; however, the format of your certificate depends on the grid service you wish to employ.
If employing the PRACE version of GSISSH-term (also a Java Web Start Application), you may use either the PEM or p12 formats. Note that this service automatically installs up-to-date PRACE CA certificates.
If the grid service is UNICORE, then you bind your certificate, in either the p12 format or JKS, to UNICORE during the installation of the client on your local machine.
If the grid service is a part of Globus (e.g. GSI-SSH, GridFTP, or GRAM5), the certificates can be in either p12 or PEM format and must reside in the `$HOME/.globus` directory for Linux and Mac users or `%HOMEPATH%\.globus` for Windows users. (Windows users will have to use the DOS command `cmd` to create a directory whose name starts with a `.`.) Further, user certificates should be named either `usercred.p12` or `usercert.pem` and `userkey.pem`, and the CA certificates must be kept in a pre-specified directory as follows. For Linux and Mac users, this directory is either `$HOME/.globus/certificates` or `/etc/grid-security/certificates`. For Windows users, this directory is `%HOMEPATH%\.globus\certificates`. (If you are using GSISSH-Term from prace-ri.eu, you do not have to create the `.globus` directory nor install CA certificates to use this tool alone.)
## Q: How Do I Manually Import My Certificate Into My Browser?
In Firefox, you can import your certificate by first choosing the "Preferences" window. For Windows, this is Tools > Options. For Linux, this is Edit > Preferences. For Mac, this is Firefox > Preferences. Then choose the "Advanced" button, followed by the "Encryption" tab. Then choose the "Certificates" panel, select the "Select one automatically" option if you have only one certificate, or "Ask me every time" if you have more than one. Then, click on the "View Certificates" button to open the "Certificate Manager" window. You can then select the "Your Certificates" tab and click on the "Import" button. Then locate the PKCS12 (.p12) certificate you wish to import and employ its associated password.
If you are a Safari user, then simply open the "Keychain Access" application and follow File > Import Items.
If you are an Internet Explorer user, click Start > Settings > Control Panel and then double-click on Internet. On the Content tab, click Personal and then click Import. Type your password in the Password field. You may be prompted multiple times for your password. In the "Certificate File To Import" box, type the filename of the certificate you wish to import, and then click OK. Click Close, and then click OK.
## Q: What Is a Proxy Certificate?
A proxy certificate is a short-lived certificate, which may be employed by UNICORE and the Globus services. The proxy certificate consists of a new user certificate and a newly generated proxy private key. This proxy typically has a rather short lifetime (normally 12 hours) and often allows only a limited delegation of rights. Its default location for Unix/Linux, is /tmp/x509_u_uid_ but can be set via the `$X509_USER_PROXY` environment variable.
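For illustration, a proxy can typically be generated and inspected with the Globus command-line tools (a sketch only; it assumes `grid-proxy-init`/`grid-proxy-info` are installed and that your PEM certificate and key reside in `$HOME/.globus`):
```console
grid-proxy-init -valid 12:00   # create a 12-hour proxy from usercert.pem/userkey.pem
grid-proxy-info                # show the proxy subject, path, and remaining lifetime
```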
## Q: What Is the MyProxy Service?
[MyProxy Service][g] can be employed by gsissh-term and Globus tools and is an online repository that allows users to store long-lived proxy certificates remotely, which can then be retrieved for later use. Each proxy is protected by a password provided by the user at the time of storage. This is beneficial to Globus users, as they do not have to carry their private keys and certificates when travelling; nor do users have to install private keys and certificates on possibly insecure computers.
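A minimal sketch of the typical workflow with the MyProxy command-line tools (the server name is the one mentioned above; `username` is a placeholder):
```console
myproxy-init -s myproxy-prace.lrz.de -l username    # store a proxy, protected by a password of your choice
myproxy-logon -s myproxy-prace.lrz.de -l username   # retrieve a short-lived proxy later, on any machine
```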
## Q: Someone May Have Copied or Had Access to the Private Key of My Certificate Either in a Separate File or in the Browser. What Should I Do?
Please ask the Certificate Authority that issued your certificate to revoke this certificate and to supply you with a new one. In addition, report this to IT4Innovations by contacting [the support team][h].
## Q: My Certificate Expired. What Should I Do?
In order to still be able to communicate with us, make a request for a new certificate to your CA. There is no need to explicitly send us any information about your new certificate if a new one has the same Distinguished Name (DN) as the old one.
[1]: obtaining-login-credentials.md#certificates-for-digital-signatures
[2]: #q-how-do-i-use-the-openssl-tool
[3]: #q-how-do-i-create-and-then-manage-a-keystore
[a]: https://www.eugridpma.org/members/worldmap/
[b]: https://tcs-escience-portal.terena.org/
[c]: https://winnetou.surfsara.nl/prace/certs/
[d]: https://www.openssl.org/source/
[e]: http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/keytool.html
[g]: http://grid.ncsa.illinois.edu/myproxy/
[h]: https://support.it4i.cz/rt
# IT4I Account
!!! important
If you are affiliated with an academic institution from the Czech Republic ([eduID.cz][u]), create an [e-INFRA CZ account][8] instead.
If you are not eligible for an e-INFRA CZ account, contact the [IT4I support][a] (email: [support\[at\]it4i.cz][b]) and provide the following information:
1. Personal information (**required**, note that without this information, you cannot use IT4I resources):
1. **Full name**
1. **Gender**
1. **Citizenship**
1. **Country of residence**
1. **Organization/affiliation**
1. **Organization/affiliation country**
1. **Organization/affiliation type** (university, company, R&D institution, private/public sector (hospital, police), academy of sciences, etc.)
1. **Job title** (student, PhD student, researcher, research assistant, employee, etc.)
1. Project name and/or primary investigator's (PI) name. Project name consists of project type (OPEN|DD|EU|ATR|FTA|ICA) and number in -XX-XX format, for example OPEN-33-12.
1. Statement that you have read and accepted the [Acceptable use policy document][c] (AUP)
1. Attach the AUP file
1. Your preferred username (4 to 7 letters)<br>The preferred username must be associated with your first and last name or otherwise derived from it. Note that the system will automatically add the `it4i-` prefix to your username.
1. Public part of your SSH key<br>If you don't provide it in the ticket, you must [add it manually][s] after your account is created.
1. All information above should be provided by email that is **digitally signed by a CA authority**. Read more on [digital signatures][4] below. If you do not have such a digital signature, you can choose an [Alternative way to personal certificate][3].
Example (except for the subject line, which must be in English, you may use Czech or Slovak for communication with us):
```console
Subject: Access to IT4Innovations
Dear support,
Please open the user account for me and attach the account to PROJECTNAME-XX-XX.
Personal information: John Smith, USA, Department of Chemistry, MIT, MA, US.
I have read and accept the Acceptable use policy document (attached).
Preferred username: johnsm
Thank you,
John Smith
(Digitally signed)
```
You will receive your personal login credentials in an encrypted email. The login credentials include:
1. username
1. system password
The clusters are accessed by the [private key][5] and username. Username and password are used for login to the [information systems][d].
## Certificates for Digital Signatures
We accept personal certificates issued by any widely respected certification authority (CA). This includes certificates by CAs organized in [International Grid Trust Federation][f], its European branch [EUGridPMA][g] and its member organizations, e.g. the [CESNET certification authority][h]. The Czech _"Qualified certificate" (Kvalifikovaný certifikát)_ provided by [PostSignum][i] or [I.CA][j], which is used in electronic contact with Czech authorities, is accepted as well. **In general, we accept certificates issued by any trusted CA that ensures unambiguous identification of the user.**
Certificate generation process for academic purposes, utilizing the CESNET certification authority, is well described here:
* [How to generate a personal TCS certificate in Mozilla Firefox ESR web browser.][k] (in Czech)
!!! note
The certificate file can be installed into your email client. Web-based email interfaces cannot be used for secure communication; an external application, such as Thunderbird or Outlook, must be used. This way, your new credentials will be visible only in applications that have access to your certificate.
If you are not able to obtain the certificate from any of the respected certification authorities, follow the Alternative Way below.
FAQ about certificates can be found here: [Certificates FAQ][7].
## Alternative Way to Personal Certificate
!!! important
Choose this alternative **only** if you cannot obtain your certificate in a standard way.
Note that in this case **you must attach a scan of your photo ID** (personal ID, passport, or driver's license) when applying for login credentials.
An alternative to a personal certificate is an S/MIME certificate, which allows secure email communication,
e.g. when providing sensitive information such as an ID scan or user login/password.
The following example is for Actalis free S/MIME certificate, but you can choose your preferred CA.
1. Go to the [Actalis Free Email Certificate][l] request form.
1. Follow the instructions: fill out the form, accept the terms and conditions, and submit the request.
1. You will receive an email with the certificate.
1. Import the certificate to one of the supported email clients.
1. Attach a scan of your photo ID (personal ID, passport, or driver's license) to your email request for an IT4I account.
!!! note
Web-based email interfaces cannot be used for secure communication; an external application, such as Thunderbird or Outlook, must be used. This way, your new credentials will be visible only in applications that have access to your certificate.
[1]: ./obtaining-login-credentials.md#certificates-for-digital-signatures
[2]: #authorization-by-web
[3]: #alternative-way-to-personal-certificate
[4]: #certificates-for-digital-signatures
[5]: ../accessing-the-clusters/shell-access-and-data-transfer/ssh-keys.md
[6]: ../accessing-the-clusters/shell-access-and-data-transfer/putty.md#putty-key-generator
[7]: ../obtaining-login-credentials/certificates-faq.md
[8]: ../access/einfracz-account.md
[10]: ../access/project-access.md
[a]: https://support.it4i.cz/rt/
[b]: mailto:support@it4i.cz
[c]: https://docs.it4i.cz/general/aup/
[d]: http://support.it4i.cz/
[e]: https://scs.it4i.cz
[f]: http://www.igtf.net/
[g]: https://www.eugridpma.org
[h]: https://tcs.cesnet.cz
[i]: http://www.postsignum.cz/
[j]: http://www.ica.cz/Kvalifikovany-certifikat.aspx
[k]: http://idoc.vsb.cz/xwiki/wiki/infra/view/uzivatel/moz-cert-gen
[l]: https://extrassl.actalis.it/portal/uapub/freemail?lang=en
[r]: https://www.it4i.cz/computing-resources-allocation/?lang=en
[s]: https://extranet.it4i.cz/ssp/?action=changesshkey
[u]: https://www.eduid.cz/
!!!warning
This page has not been updated yet. The page does not reflect the transition from PBS to Slurm.
# Job Submission and Execution
## Job Submission
When allocating computational resources for the job, specify:
1. a suitable queue for your job (the default is qprod)
1. the number of computational nodes (required)
1. the number of cores per node (not required)
1. the maximum wall time allocated to your calculation; note that jobs exceeding the maximum wall time will be killed
1. your Project ID
1. a Jobscript or interactive switch
Submit the job using the `qsub` command:
```console
$ qsub -A Project_ID -q queue -l select=x:ncpus=y,walltime=[[hh:]mm:]ss[.ms] jobscript
```
The `qsub` command submits the job to the queue, i.e. it creates a request to the PBS Job manager for allocation of specified resources. The resources will be allocated when available, subject to the above described policies and constraints. **After the resources are allocated, the jobscript or interactive shell is executed on the first of the allocated nodes.**
!!! note
`ncpus=y` is usually not required, because the smallest allocation unit is an entire node. The exceptions are corner cases for `qviz` and `qfat` on Karolina.
### Job Submission Examples
```console
$ qsub -A OPEN-0-0 -q qprod -l select=64,walltime=03:00:00 ./myjob
```
In this example, we allocate 64 nodes, 36 cores per node, for 3 hours. We allocate these resources via the `qprod` queue; consumed resources will be accounted to the project identified by Project ID `OPEN-0-0`. The jobscript `myjob` will be executed on the first node in the allocation.
```console
$ qsub -q qexp -l select=4 -I
```
In this example, we allocate 4 nodes, 36 cores per node, for 1 hour. We allocate these resources via the `qexp` queue. The resources will be available interactively.
```console
$ qsub -A OPEN-0-0 -q qnvidia -l select=10 ./myjob
```
In this example, we allocate 10 NVIDIA accelerated nodes, 24 cores per node, for 24 hours. We allocate these resources via the `qnvidia` queue. The jobscript `myjob` will be executed on the first node in the allocation.
```console
$ qsub -A OPEN-0-0 -q qfree -l select=10 ./myjob
```
In this example, we allocate 10 nodes, 24 cores per node, for 12 hours. We allocate these resources via the `qfree` queue. It is not required that the project `OPEN-0-0` has any available resources left. Consumed resources are still accounted for. The jobscript `myjob` will be executed on the first node in the allocation.
#### Dependency Job Submission
To submit dependent jobs in sequence, use the `depend` function of `qsub`.
First submit the first job in a standard manner:
```console
$ qsub -A OPEN-0-0 -q qprod -l select=64,walltime=02:00:00 ./firstjob
123456[].isrv1
```
Then submit the second job using the `depend` function:
```console
$ qsub -W depend=afterok:123456 ./secondjob
```
Both jobs will be queued, but the second job won't start until the first job has finished successfully.
Below is the list of arguments that can be used with `-W depend=dependency:jobid`:
| Argument | Description |
| ----------- | --------------------------------------------------------------- |
| after | This job is scheduled after `jobid` begins execution. |
| afterok | This job is scheduled after `jobid` finishes successfully. |
| afternotok  | This job is scheduled after `jobid` finishes unsuccessfully.     |
| afterany | This job is scheduled after `jobid` finishes in any state. |
| before | This job must begin execution before `jobid` is scheduled. |
| beforeok | This job must finish successfully before `jobid` begins. |
| beforenotok | This job must finish unsuccessfully before `jobid` begins. |
| beforeany | This job must finish in any state before `jobid` begins. |
### Useful Tricks
All `qsub` options may be [saved directly into the jobscript][1]. In such a case, no options to `qsub` are needed.
```console
$ qsub ./myjob
```
By default, the PBS batch system sends an email only when the job is aborted. Disabling mail events completely can be done like this:
```console
$ qsub -m n
```
<!--- NOT IMPLEMENTED ON KAROLINA YET
## Advanced Job Placement
### Salomon - Placement by Network Location
The network location of allocated nodes in the [InfiniBand network][3] influences efficiency of network communication between nodes of job. Nodes on the same InfiniBand switch communicate faster with lower latency than distant nodes. To improve communication efficiency of jobs, PBS scheduler on Salomon is configured to allocate nodes (from currently available resources), which are as close as possible in the network topology.
For communication intensive jobs, it is possible to set stricter requirement - to require nodes directly connected to the same InfiniBand switch or to require nodes located in the same dimension group of the InfiniBand network.
### Salomon - Placement by InfiniBand Switch
Nodes directly connected to the same InfiniBand switch can communicate most efficiently. Using the same switch prevents hops in the network and provides for unbiased, most efficient network communication. There are 9 nodes directly connected to every InfiniBand switch.
!!! note
We recommend allocating compute nodes of a single switch when the best possible computational network performance is required to run job efficiently.
Nodes directly connected to the one InfiniBand switch can be allocated using node grouping on the PBS resource attribute `switch`.
In this example, we request all 9 nodes directly connected to the same switch using node grouping placement.
```console
$ qsub -A OPEN-0-0 -q qprod -l select=9 -l place=group=switch ./myjob
```
-->
## Advanced Job Handling
### Selecting Turbo Boost Off
!!! note
For Barbora only.
Intel Turbo Boost Technology is on by default. We strongly recommend keeping the default.
If necessary (such as in the case of benchmarking), you can disable Turbo for all nodes of the job by using the PBS resource attribute `cpu_turbo_boost`:
```console
$ qsub -A OPEN-0-0 -q qprod -l select=4 -l cpu_turbo_boost=0 -I
```
More information about Intel Turbo Boost can be found in the TurboBoost section.
### Advanced Examples
In the following example, we select an allocation for benchmarking a very special and demanding MPI program. We request Turbo off, and 2 full chassis of compute nodes (nodes sharing the same IB switches) for 30 minutes:
```console
$ qsub -A OPEN-0-0 -q qprod \
  -l select=18:ibswitch=isw10:mpiprocs=1:ompthreads=16+18:ibswitch=isw20:mpiprocs=16:ompthreads=1 \
  -l cpu_turbo_boost=0,walltime=00:30:00 \
  -N Benchmark ./mybenchmark
```
The MPI processes will be distributed differently on the nodes connected to the two switches. On the isw10 nodes, we will run 1 MPI process per node with 16 threads per process; on the isw20 nodes, we will run 16 plain MPI processes per node.
Although this example is somewhat artificial, it demonstrates the flexibility of the qsub command options.
## Job Management
!!! note
Check the status of your jobs using the `qstat` and `check-pbs-jobs` commands
```console
$ qstat -a
$ qstat -a -u username
$ qstat -an -u username
$ qstat -f 12345.srv11
```
Example:
```console
$ qstat -a
srv11:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -- |---|---| ------ --- --- ------ ----- - -----
16287.srv11 user1 qlong job1 6183 4 64 -- 144:0 R 38:25
16468.srv11 user1 qlong job2 8060 4 64 -- 144:0 R 17:44
16547.srv11 user2 qprod job3x 13516 2 32 -- 48:00 R 00:58
```
In this example, user1 and user2 are running jobs named `job1`, `job2`, and `job3x`. `job1` and `job2` are each using 4 nodes with 16 cores per node. `job1` has already run for 38 hours and 25 minutes, and `job2` for 17 hours 44 minutes. So `job1`, for example, has already consumed `64 x 38.41 = 2,458.6` core-hours. `job3x` has already consumed `32 x 0.96 = 30.93` core-hours. These consumed core-hours will be [converted to node-hours][10] and accounted for on the respective project accounts, regardless of whether the allocated cores were actually used for computations.
The `check-pbs-jobs` command allows you to check for the presence of your PBS jobs' processes on the execution hosts, display their load and processes, display the jobs' standard and error output, and continuously display (`tail -f`) the standard or error output.
```console
$ check-pbs-jobs --check-all
$ check-pbs-jobs --print-load --print-processes
$ check-pbs-jobs --print-job-out --print-job-err
$ check-pbs-jobs --jobid JOBID --check-all --print-all
$ check-pbs-jobs --jobid JOBID --tailf-job-out
```
Examples:
```console
$ check-pbs-jobs --check-all
JOB 35141.dm2, session_id 71995, user user2, nodes cn164,cn165
Check session id: OK
Check processes
cn164: OK
cn165: No process
```
In this example, we see that job `35141.dm2` is not currently running any processes on the allocated node `cn165`, which may indicate an execution error.
```console
$ check-pbs-jobs --print-load --print-processes
JOB 35141.dm2, session_id 71995, user user2, nodes cn164,cn165
Print load
cn164: LOAD: 16.01, 16.01, 16.00
cn165: LOAD: 0.01, 0.00, 0.01
Print processes
%CPU CMD
cn164: 0.0 -bash
cn164: 0.0 /bin/bash /var/spool/PBS/mom_priv/jobs/35141.dm2.SC
cn164: 99.7 run-task
...
```
In this example, we see that job `35141.dm2` is currently running a process `run-task` on node `cn164`, using one thread only, while node `cn165` is empty, which may indicate an execution error.
```console
$ check-pbs-jobs --jobid 35141.dm2 --print-job-out
JOB 35141.dm2, session_id 71995, user user2, nodes cn164,cn165
Print job standard output:
======================== Job start ==========================
Started at : Fri Aug 30 02:47:53 CEST 2013
Script name : script
Run loop 1
Run loop 2
Run loop 3
```
In this example, we see the actual output (some iteration loops) of the job `35141.dm2`.
!!! note
Manage your queued or running jobs, using the `qhold`, `qrls`, `qdel`, `qsig`, or `qalter` commands
You may release your allocation at any time, using the `qdel` command
```console
$ qdel 12345.srv11
```
You may kill a running job by force, using the `qsig` command
```console
$ qsig -s 9 12345.srv11
```
Learn more by reading the PBS man page
```console
$ man pbs_professional
```
## Job Execution
### Jobscript
!!! note
Prepare the jobscript to run batch jobs in the PBS queue system
The jobscript is a user-made script controlling a sequence of commands for executing the calculation. It is often written in bash, though other scripts may be used as well. The jobscript is supplied to the PBS `qsub` command as an argument and is executed by the PBS Professional workload manager.
!!! note
The jobscript or interactive shell is executed on the first of the allocated nodes.
```console
$ qsub -q qexp -l select=4 -N Name0 ./myjob
$ qstat -n -u username
srv11:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -- |---|---| ------ --- --- ------ ----- - -----
15209.srv11 username qexp Name0 5530 4 128 -- 01:00 R 00:00
cn17/0*32+cn108/0*32+cn109/0*32+cn110/0*32
```
In this example, the nodes `cn17`, `cn108`, `cn109`, and `cn110` were allocated for 1 hour via the qexp queue. The `myjob` jobscript will be executed on the node `cn17`, while the nodes `cn108`, `cn109`, and `cn110` are available for use as well.
The jobscript or interactive shell is by default executed in the `/home` directory:
```console
$ qsub -q qexp -l select=4 -I
qsub: waiting for job 15210.srv11 to start
qsub: job 15210.srv11 ready
$ pwd
/home/username
```
In this example, 4 nodes were allocated interactively for 1 hour via the `qexp` queue. The interactive shell is executed in the `/home` directory.
!!! note
All nodes within the allocation may be accessed via SSH. Unallocated nodes are not accessible to the user.
The allocated nodes are accessible via SSH from login nodes. The nodes may access each other via SSH as well.
Calculations on allocated nodes may be executed remotely via MPI, SSH, pdsh, or clush. You may find out which nodes belong to the allocation by reading the `$PBS_NODEFILE` file:
```console
$ qsub -q qexp -l select=4 -I
qsub: waiting for job 15210.srv11 to start
qsub: job 15210.srv11 ready
$ pwd
/home/username
$ sort -u $PBS_NODEFILE
cn17.bullx
cn108.bullx
cn109.bullx
cn110.bullx
$ pdsh -w cn17,cn[108-110] hostname
cn17: cn17
cn108: cn108
cn109: cn109
cn110: cn110
```
In this example, the hostname program is executed via `pdsh` from the interactive shell. The execution runs on all four allocated nodes. The same result would be achieved if the `pdsh` were called from any of the allocated nodes or from the login nodes.
### Example Jobscript for MPI Calculation
!!! note
Production jobs must use the /scratch directory for I/O
The recommended way to run production jobs is to change to the `/scratch` directory early in the jobscript, copy all inputs to `/scratch`, execute the calculations, and copy outputs to the `/home` directory.
```bash
#!/bin/bash
cd $PBS_O_WORKDIR
SCRDIR=/scratch/project/open-00-00/${USER}/myjob
mkdir -p $SCRDIR
# change to scratch directory, exit on failure
cd $SCRDIR || exit
# copy input file to scratch
cp $PBS_O_WORKDIR/input .
cp $PBS_O_WORKDIR/mympiprog.x .
# load the MPI module
# (Always specify the module's name and version in your script;
# for the reason, see https://docs.it4i.cz/software/modules/lmod/#loading-modules.)
ml OpenMPI/4.1.1-GCC-10.2.0-Java-1.8.0_221
# execute the calculation
mpirun -pernode ./mympiprog.x
# copy output file to home
cp output $PBS_O_WORKDIR/.
#exit
exit
```
In this example, a directory in `/home` holds the input file `input` and the `mympiprog.x` executable. We create the `myjob` directory on the `/scratch` filesystem, copy the input and executable files from the `/home` directory where the `qsub` was invoked (`$PBS_O_WORKDIR`) to `/scratch`, execute the MPI program `mympiprog.x`, and copy the output file back to the `/home` directory. `mympiprog.x` is executed as one process per node, on all allocated nodes.
!!! note
Consider preloading inputs and executables onto the [shared scratch][6] storage before the calculation starts.
In some cases, it may be impractical to copy the inputs to the `/scratch` storage and the outputs to the `/home` directory. This is especially true when very large input and output files are expected, or when the files should be reused by a subsequent calculation. In such cases, it is the users' responsibility to preload the input files on the shared `/scratch` storage before the job submission, and retrieve the outputs manually after all calculations are finished.
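For illustration, inputs may be preloaded from your local machine before job submission, for example with rsync over SSH (a sketch only; the project directory `open-00-00`, `username`, and the paths are placeholders):
```console
$ rsync -av -e "ssh -i /path/to/id_rsa" my-inputs/ username@cluster-name.it4i.cz:/scratch/project/open-00-00/username/myjob/
```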
!!! note
Store the `qsub` options within the jobscript. Use the `mpiprocs` and `ompthreads` qsub options to control the MPI job execution.
### Example Jobscript for MPI Calculation With Preloaded Inputs
Example jobscript for an MPI job with preloaded inputs and executables, options for `qsub` are stored within the script:
```bash
#!/bin/bash
#PBS -q qprod
#PBS -N MYJOB
#PBS -l select=100:mpiprocs=1:ompthreads=16
#PBS -A OPEN-00-00
# job is run using project resources; here ${PBS_ACCOUNT,,} translates to "open-00-00"
SCRDIR=/scratch/project/${PBS_ACCOUNT,,}/${USER}/myjob
# change to scratch directory, exit on failure
cd $SCRDIR || exit
# load the MPI module
# (Always specify the module's name and version in your script;
# for the reason, see https://docs.it4i.cz/software/modules/lmod/#loading-modules.)
ml OpenMPI/4.1.1-GCC-10.2.0-Java-1.8.0_221
# execute the calculation
mpirun ./mympiprog.x
#exit
exit
```
In this example, input and executable files are assumed to be preloaded manually in the `/scratch/project/open-00-00/$USER/myjob` directory. Because we used the `qprod` queue, we had to specify which project's resources we want to use, and our `PBS_ACCOUNT` variable will be set accordingly (OPEN-00-00). `${PBS_ACCOUNT,,}` uses one of bash's built-in parameter expansions to translate it into lowercase.
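The expansion can be checked interactively, for example:
```console
$ PBS_ACCOUNT=OPEN-00-00
$ echo ${PBS_ACCOUNT,,}
open-00-00
```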
Note the `mpiprocs` and `ompthreads` qsub options controlling the behavior of the MPI execution. `mympiprog.x` is executed as one process per node, on all 100 allocated nodes. If `mympiprog.x` implements OpenMP threads, it will run 16 threads per node.
### Example Jobscript for Single Node Calculation
!!! note
The local scratch directory is often useful for single node jobs. The local scratch directory will be deleted immediately after the job ends.
Example jobscript for single node calculation, using the [local scratch][6] directory on the node:
```bash
#!/bin/bash
# change to local scratch directory
cd /lscratch/$PBS_JOBID || exit
# copy input file to scratch
cp $PBS_O_WORKDIR/input .
cp $PBS_O_WORKDIR/myprog.x .
# execute the calculation
./myprog.x
# copy output file to home
cp output $PBS_O_WORKDIR/.
#exit
exit
```
In this example, a directory in `/home` holds the input file `input` and the executable `myprog.x`. We copy the input and executable files from the `/home` directory where the `qsub` was invoked (`$PBS_O_WORKDIR`) to the local scratch directory `/lscratch/$PBS_JOBID`, execute `myprog.x`, and copy the output file back to the `/home` directory. `myprog.x` runs on one node only and may use threads.
### Other Jobscript Examples
Further jobscript examples may be found in the software section and the [Capacity computing][9] section.
[1]: #example-jobscript-for-mpi-calculation-with-preloaded-inputs
[2]: resources-allocation-policy.md
[3]: ../salomon/network.md
[5]: ../salomon/7d-enhanced-hypercube.md
[6]: ../salomon/storage.md
[9]: capacity-computing.md
[10]: resources-allocation-policy.md#resource-accounting-policy
# Resource Accounting Policy
Starting with the 24<sup>th</sup> open access grant competition,
the accounting policy has been changed from [normalized core hours (NCH)][2a] to **node-hours (NH)**.
This means that it is now required to apply for node hours of the specific cluster and node type:
1. [Barbora CPU][3a]
1. [Barbora GPU][4a]
1. [Barbora FAT][5a]
1. [DGX-2][6a]
1. [Karolina CPU][7a]
1. [Karolina GPU][8a]
1. [Karolina FAT][9a]
The accounting runs whenever the nodes are allocated via the Slurm workload manager (the `sbatch` or `salloc` command),
regardless of whether the nodes are actually used for any calculation.
The same rule applies to unspent [reservations][10a].
## Resource Accounting Formula
| Resources | NH Consumed |
| ------------------------------- | ---------------------------- |
| Barbora All types, Karolina CPU | allocated nodes \* time |
| Karolina GPU | allocated gpus \* time / 8 |
| Karolina FAT | allocated cpus \* time / 768 |
| Karolina VIZ | allocated cpus \* time / 64 |
*time*: duration of the Slurm job in hours
!!! important "CPU/GPU resources granularity"
Minimal granularity of all Barbora partitions and of Karolina's CPU partition is 1 node.
This means that if you request, for example, 32 cores on Karolina's CPU partition,
your job will still consume 1 NH \* time.
All other Karolina partitions (GPU, FAT, VIZ) allow partial node allocation;
i.e., if you request 4 GPUs on Karolina, you will consume only 0.5 NH \* time.
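For illustration, a 10-hour job on 4 Barbora CPU nodes consumes 4 \* 10 = 40 NH, while a 10-hour job allocating 4 GPUs on Karolina consumes 4 \* 10 / 8 = 5 NH.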
[1a]: job-submission-and-execution.md
[2a]: #normalized-core-hours-nch
[3a]: ../../barbora/compute-nodes/#compute-nodes-without-accelerators
[4a]: ../../barbora/compute-nodes/#compute-nodes-with-a-gpu-accelerator
[5a]: ../../barbora/compute-nodes/#fat-compute-node
[6a]: ../../dgx2/introduction/
[7a]: ../../karolina/compute-nodes/#compute-nodes-without-accelerators
[8a]: ../../karolina/compute-nodes/#compute-nodes-with-a-gpu-accelerator
[9a]: ../../karolina/compute-nodes/#data-analytics-compute-node
[10a]: resource_allocation_and_job_execution.md#resource-reservation
# How to Run Jobs
## Job Submission and Execution
To run a [job][1], computational resources for this particular job must be allocated. This is done via the [Slurm][a] job workload manager software, which distributes workloads across the supercomputer.
The `sbatch` or `salloc` command creates a request to the Slurm job manager for allocation of specified resources.
The resources will be allocated when available, subject to allocation policies and constraints.
**After the resources are allocated, the jobscript or interactive shell is executed on the first of the allocated nodes.**
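A minimal sketch of a batch and an interactive request (the project ID, partition, and jobscript name are placeholders; see the page linked below for the full syntax):
```console
$ sbatch -A PROJECT-ID -p qcpu --nodes=2 --time=01:00:00 ./myjob.sh
$ salloc -A PROJECT-ID -p qcpu --nodes=1 --time=01:00:00
```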
Read more on the [Job Submission and Execution][5] page.
## Resource Allocation Policy
Resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue and resources available to the Project. [The Fair-share][3] ensures that individual users may consume approximately equal amounts of resources per week. The resources are accessible via queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources.
!!! note
See the queue status for [Karolina][d] or [Barbora][e].
Read more on the [Resource Allocation Policy][4] page.
## Resource Reservation
You can request a reservation of a specific number, range, or type of computational resources at [support@it4i.cz][c].
Note that unspent reserved node-hours count towards the total computational resources used.
[1]: ../index.md#terminology-frequently-used-on-these-pages
[2]: https://slurm.schedmd.com/documentation.html
[3]: job-priority.md#fair-share-priority
[4]: resources-allocation-policy.md
[5]: job-submission-and-execution.md
[a]: https://slurm.schedmd.com/
[b]: https://slurm.schedmd.com/documentation.html
[c]: mailto:support@it4i.cz
[d]: https://extranet.it4i.cz/rsweb/karolina/queues
[e]: https://extranet.it4i.cz/rsweb/barbora/queues
# Resource Allocation Policy
## Job Queue Policies
Resources are allocated to jobs in a fair-share fashion,
subject to constraints set by the queue and the resources available to the project.
The fair-share system ensures that individual users may consume approximately equal amounts of resources per week.
Detailed information can be found in the [Job scheduling][1] section.
Resources are accessible via several queues for queueing the jobs.
Queues provide prioritized and exclusive access to the computational resources.
Computational resources are subject to [accounting policy][7].
!!! important
Queues are divided based on resource type: `qcpu_` for non-accelerated nodes and `qgpu_` for accelerated nodes. <br><br>
EuroHPC queues are no longer available. If you are a EuroHPC user, use the standard queues based on the allocated/required type of resources.
### Queues
| <div style="width:86px">Queue</div>| Description |
| -------------------------------- | ----------- |
| `qcpu` | Production queue for non-accelerated nodes intended for standard production runs. Requires an active project with nonzero remaining resources. Full nodes are allocated. Identical to `qprod`. |
| `qgpu` | Dedicated queue for accessing the NVIDIA accelerated nodes. Requires an active project with nonzero remaining resources. It utilizes 8x NVIDIA A100 with 320GB HBM2 memory per node. The PI needs to explicitly ask support for authorization to enter the queue for all users associated with their project. **On Karolina, you can allocate 1/8 of the node - 1 GPU and 16 cores**. For more information, see [Karolina qgpu allocation][4]. |
| `qgpu_big` | Intended for big jobs (>16 nodes); queue priority is lower than the production queue priority, **priority is temporarily increased every even weekend**. |
| `qcpu_biz`<br>`qgpu_biz` | Commercial queues, slightly higher priority. |
| `qcpu_exp`<br>`qgpu_exp` | Express queues for testing and running very small jobs. There are 2 nodes always reserved (w/o accelerators), max 8 nodes available per user. The nodes may be allocated on a per core basis. It is configured to run one job and accept five jobs in a queue per user. |
| `qcpu_free`<br>`qgpu_free` | Intended for utilization of free resources, after a project exhausted all its allocated resources. Note that the queue is **not free of charge**. [Normal accounting][2] applies. Consumed resources will be accounted to the Project. Access to the queue is removed if consumed resources exceed 150% of the allocation. Full nodes are allocated. |
| `qcpu_long` | Queue for long production runs. Requires an active project with nonzero remaining resources. Only 200 nodes without acceleration may be accessed. Full nodes are allocated. |
| `qcpu_preempt`<br>`qgpu_preempt` | Free queues with the lowest priority (LP). The queues require a project with allocation of the respective resource type. There is no limit on resource overdraft. Jobs are killed if other jobs with a higher priority (HP) request the nodes and there are no other nodes available. LP jobs are automatically re-queued once HP jobs finish, so **make sure your jobs are re-runnable**. |
| `qdgx` | Queue for DGX-2, accessible from Barbora. |
| `qfat` | Queue for fat node, PI must request authorization to enter the queue for all users associated to their project. |
| `qviz` | Visualization queue intended for pre-/post-processing using OpenGL accelerated graphics. Each user gets 8 cores of a CPU allocated (approx. 64 GB of RAM and 1/8 of the GPU capacity, i.e. the default "chunk"). If more GPU power or RAM is required, it is recommended to allocate more chunks (with 8 cores each), up to one whole node per user; this is currently also the maximum allowed allocation per user. One hour of work is allocated by default; the user may ask for 2 hours maximum. |
See the following subsections for the list of queues:
* [Karolina queues][5]
* [Barbora queues][6]
## Queue Notes
The job time limit defaults to **half the maximum time**, see the table above.
Longer time limits can be [set manually, see examples][3].
Jobs that exceed the reserved time limit get killed automatically.
The time limit can be changed for queuing jobs (state Q) using the `scontrol update` command;
however, it cannot be changed for a running job.
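For example (a sketch; replace the job ID and time limit with your own values):
```console
$ scontrol update JobId=123456 TimeLimit=02:00:00
```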
## Queue Status
!!! tip
Check the status of jobs, queues and compute nodes [here][c].
![rsweb interface](../img/barbora_cluster_usage.png)
Display the queue status:
```console
$ sinfo -s
```
The Slurm allocation overview may also be obtained using the `rsslurm` command:
```console
$ rsslurm
Usage: rsslurm [options]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
--get-server-details Print server
--get-queues Print queues
--get-queues-details Print queues details
--get-reservations Print reservations
--get-reservations-details
Print reservations details
...
..
.
```
---8<--- "mathjax.md"
[1]: job-priority.md
[2]: #resource-accounting-policy
[3]: job-submission-and-execution.md
[4]: karolina-slurm.md
[5]: ./karolina-partitions.md
[6]: ./barbora-partitions.md
[7]: ./resource-accounting.md
[a]: https://support.it4i.cz/rt/
[c]: https://extranet.it4i.cz/rsweb
# Access to IT4I Services
Once you have created an e-INFRA CZ or an IT4I account, you can access the following IT4I services:
## IT4Innovations Information System (SCS IS)
SCS IS is a system where users can apply for project membership and primary investigators can apply for a project
or manage their projects (e.g. accept/deny users' requests to become project members).
You can also submit feedback on support services, etc. SCS IS is available at [https://scs.it4i.cz/][1].
## Request Tracker (RT)
If you have a question or need help, you can contact our support at [https://support.it4i.cz/][2].
Please note that the first response to a new ticket may take up to 24 hours.
## Cluster Usage Overview
For information about current cluster usage, go to [https://extranet.it4i.cz/rsweb][3].
You can switch between the clusters by clicking on a cluster's name in the upper right corner.
You can filter your search by clicking on the respective keywords.
[1]: https://scs.it4i.cz/
[2]: https://support.it4i.cz/
[3]: https://extranet.it4i.cz/rsweb
# Accessing the Clusters
## Shell Access
All IT4Innovations clusters are accessed by the SSH protocol via login nodes at the address **cluster-name.it4i.cz**. The login nodes may be addressed specifically by prepending the loginX node name to the address.
!!! note "Workgroups Access Limitation"
Projects from the **EUROHPC** workgroup can only access the **Karolina** cluster.
!!! important "Supported keys"
We accept only RSA or ED25519 keys for logging into our systems.
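If you do not yet have a supported key, one can be generated locally with standard OpenSSH tooling, for example (a sketch; choose your own file name and passphrase):
```console
local $ ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_it4i
local $ cat ~/.ssh/id_ed25519_it4i.pub    # the public part to register with your account
```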
### Karolina Cluster
| Login address | Port | Protocol | Login node |
| ------------------------------- | ---- | -------- | ----------------------------------------- |
| karolina.it4i.cz | 22 | SSH | round-robin DNS record for login{1,2,3,4} |
| login{1,2,3,4}.karolina.it4i.cz | 22 | SSH | login{1,2,3,4} |
### Barbora Cluster
| Login address | Port | Protocol | Login node |
| ----------------------------- | ---- | -------- | ------------------------------------- |
| barbora.it4i.cz | 22 | SSH | round-robin DNS record for login{1,2} |
| login{1,2}.barbora.it4i.cz | 22 | SSH | login{1,2} |
## Authentication
Authentication is available by [private key][1] only. Verify SSH fingerprints during the first logon:
### Karolina
**Fingerprints**
Fingerprints are identical for all login nodes.
```console
# login{1,2,3,4}:22 SSH-2.0-OpenSSH_7.4
2048 MD5:41:3a:40:32:da:08:77:51:79:04:af:53:e4:57:d0:7c (RSA)
2048 SHA256:Ip37d/bE6XwtWf3KnWA+sqA+zRGSFlf5vXai0v3MBmo (RSA)
256 MD5:e9:b6:8e:7d:f8:c6:8f:42:34:10:71:02:14:a6:7c:22 (ED25519)
256 SHA256:zKEtQMi2KRsxzzgo/sHcog+NFZqQ9tIyvJ7BVxOfzgI (ED25519)
```
**Public Keys \ Known Hosts**
Public Keys \ Known Hosts are identical for all login nodes.
```console
login1,login1.karolina.it4i.cz,login1.karolina,karolina.it4i.cz ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9Cp8/a3F7eOPQvH4+HjC778XvYgRXWmCEOQnE3clPcKw15iIat3bvKc8ckYLudAzomipWy4VYdDI2OnEXay5ba8HqdREJO31qNBtW1AXgydCfPnkeuUZS4WVlAWM+HDlK6caB8KlvHoarCnNj2jvuYsMbARgGEq3vrk3xW4uiGpS6Y/uGVBBwMFWFaINbmXUrU1ysv/ZD1VpH4eHykkD9+8xivhhZtcz5Z2T7ZnIib4/m9zZZvjKs4ejOo58cKXGYVl27kLkfyOzU3cirYNQOrGqllN/52fATfrXKMcQor9onsbTkNNjMgPFZkddufxTrUaS7EM6xYsj8xrPJ2RaN
login1,login1.karolina.it4i.cz,login1.karolina,karolina.it4i.cz ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDkIdDODkUYRgMy1h6g/UtH34RnDCQkwwiJZFB0eEu1c
login2,login2.karolina.it4i.cz,login2.karolina,karolina.it4i.cz ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9Cp8/a3F7eOPQvH4+HjC778XvYgRXWmCEOQnE3clPcKw15iIat3bvKc8ckYLudAzomipWy4VYdDI2OnEXay5ba8HqdREJO31qNBtW1AXgydCfPnkeuUZS4WVlAWM+HDlK6caB8KlvHoarCnNj2jvuYsMbARgGEq3vrk3xW4uiGpS6Y/uGVBBwMFWFaINbmXUrU1ysv/ZD1VpH4eHykkD9+8xivhhZtcz5Z2T7ZnIib4/m9zZZvjKs4ejOo58cKXGYVl27kLkfyOzU3cirYNQOrGqllN/52fATfrXKMcQor9onsbTkNNjMgPFZkddufxTrUaS7EM6xYsj8xrPJ2RaN
login2,login2.karolina.it4i.cz,login2.karolina,karolina.it4i.cz ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDkIdDODkUYRgMy1h6g/UtH34RnDCQkwwiJZFB0eEu1c
login3,login3.karolina.it4i.cz,login3.karolina,karolina.it4i.cz ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9Cp8/a3F7eOPQvH4+HjC778XvYgRXWmCEOQnE3clPcKw15iIat3bvKc8ckYLudAzomipWy4VYdDI2OnEXay5ba8HqdREJO31qNBtW1AXgydCfPnkeuUZS4WVlAWM+HDlK6caB8KlvHoarCnNj2jvuYsMbARgGEq3vrk3xW4uiGpS6Y/uGVBBwMFWFaINbmXUrU1ysv/ZD1VpH4eHykkD9+8xivhhZtcz5Z2T7ZnIib4/m9zZZvjKs4ejOo58cKXGYVl27kLkfyOzU3cirYNQOrGqllN/52fATfrXKMcQor9onsbTkNNjMgPFZkddufxTrUaS7EM6xYsj8xrPJ2RaN
login3,login3.karolina.it4i.cz,login3.karolina,karolina.it4i.cz ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDkIdDODkUYRgMy1h6g/UtH34RnDCQkwwiJZFB0eEu1c
login4,login4.karolina.it4i.cz,login4.karolina,karolina.it4i.cz ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9Cp8/a3F7eOPQvH4+HjC778XvYgRXWmCEOQnE3clPcKw15iIat3bvKc8ckYLudAzomipWy4VYdDI2OnEXay5ba8HqdREJO31qNBtW1AXgydCfPnkeuUZS4WVlAWM+HDlK6caB8KlvHoarCnNj2jvuYsMbARgGEq3vrk3xW4uiGpS6Y/uGVBBwMFWFaINbmXUrU1ysv/ZD1VpH4eHykkD9+8xivhhZtcz5Z2T7ZnIib4/m9zZZvjKs4ejOo58cKXGYVl27kLkfyOzU3cirYNQOrGqllN/52fATfrXKMcQor9onsbTkNNjMgPFZkddufxTrUaS7EM6xYsj8xrPJ2RaN
login4,login4.karolina.it4i.cz,login4.karolina,karolina.it4i.cz ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDkIdDODkUYRgMy1h6g/UtH34RnDCQkwwiJZFB0eEu1c
```
### Barbora
**Fingerprints**
```console
md5:
39:55:e2:b9:2a:a2:c4:9e:b1:8e:f0:f7:b1:66:a8:73 (RSA)
40:67:03:26:d3:6c:a0:7f:0a:df:0e:e7:a0:52:cc:4e (ED25519)
sha256:
TO5szOJf0bG7TWVLO3WABUpGKkP7nBm/RLyHmpoNpro (RSA)
ZQzFTJVDdZa3I0ics9ME2qz4v5a3QzXugvyVioaH6tI (ED25519)
```
**Public Keys \ Known Hosts**
```console
barbora.it4i.cz, ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABgQDHUHvIrv7VUcGIcfsrcBjYfHpFBst8uhtJqfiYckfbeMRIdaodfjTO0pIXvd5wx+61a0C14zy1pdhvx6ykT5lwYkkn8l2tf+LRd6qN0alq/s+NGDJKpWGvdAGD3mM9AO1RmUPt+Vfg4VePQUZMu2PXZQu2C4TFFbaH2yiyCFlKz/Md9q+7NM+9U86uf3uLFbBu8mzkk2z3jyDGR6pjmpYTAiV/goUGpHgsW8Qx4GUdCreObQ6GUfPVOPvYaTlfXfteD9HluB7gwCWaUi5hevHhc+kK4xj61v64mGBOPmCobnAlr2RYQv6cDn7PHgI2mE7ZwRsZkNyMXqGr1S2JK2M64K53ZfF70aGrW/muHlFrYVFaJg6s1f7K/Xqu21wjwwvnJ8CcP7lUjASqhfSn9OBzEI38KMMo5Qon9p108wvqSKP2QnEdrdv1QOsBPtOZMNRMfEVpw6xVvyPka0X6gxzGfEc9nn3nOok35Fbvoo3G0P8RmOeDJLqDjUOggOs0Gwk=
barbora.it4i.cz, ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOmUm4btn7OC0QLIT3xekKTTdg5ziby8WdxccEczEeE1
```
!!! note
Barbora has identical SSH fingerprints on all login nodes.
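To compare these values, the keys presented by a server can be fingerprinted locally, for example (a sketch using `ssh-keyscan` and `ssh-keygen`):
```console
local $ ssh-keyscan karolina.it4i.cz > karolina.keys 2>/dev/null
local $ ssh-keygen -l -f karolina.keys           # SHA256 fingerprints
local $ ssh-keygen -l -E md5 -f karolina.keys    # MD5 fingerprints
```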
### Private Key Authentication
On **Linux** or **Mac**, use:
```console
local $ ssh -i /path/to/id_rsa username@cluster-name.it4i.cz
```
If you see a warning message **UNPROTECTED PRIVATE KEY FILE!**, use this command to set more restrictive permissions on the private key file:
```console
local $ chmod 600 /path/to/id_rsa
```
On **Windows**, use the [PuTTY SSH client][2].
After logging in, you will see the command prompt with the name of the cluster and the message of the day.
!!! note
The environment is **not** shared between login nodes, except for shared filesystems.
## Data Transfer
### Serial Transfer
Data in and out of the system may be transferred by SCP and SFTP protocols.
| Cluster | Port | Protocol |
| -------- | ---- | --------- |
| Karolina | 22 | SCP, SFTP |
| Barbora | 22 | SCP |
Authentication is by [private key][1] only.
On Linux or Mac, use an SCP or SFTP client to transfer data to the cluster:
```console
local $ scp -i /path/to/id_rsa my-local-file username@cluster-name.it4i.cz:directory/file
```
```console
local $ scp -i /path/to/id_rsa -r my-local-dir username@cluster-name.it4i.cz:directory
```
or
```console
local $ sftp -o IdentityFile=/path/to/id_rsa username@cluster-name.it4i.cz
```
You may request the **aes256-gcm@openssh.com cipher** for a more efficient SSH-based transfer:
```console
local $ scp -c aes256-gcm@openssh.com -i /path/to/id_rsa -r my-local-dir username@cluster-name.it4i.cz:directory
```
The `-c` argument may be used with `ssh`, `scp`, and `sftp`, and is also applicable to `sshfs` and `rsync` below.
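For example, the same cipher can be passed to rsync through its `-e` option (a sketch):
```console
local $ rsync -e "ssh -c aes256-gcm@openssh.com -i /path/to/id_rsa" -r my-local-dir username@cluster-name.it4i.cz:directory
```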
A very convenient way to transfer files in and out of the cluster is via the fuse filesystem [SSHFS][b].
```console
local $ sshfs -o IdentityFile=/path/to/id_rsa username@cluster-name.it4i.cz:. mountpoint
```
Using SSHFS, the user's home directory will be mounted on your local computer, just like an external disk.
Learn more about SSH, SCP, and SSHFS by reading the manpages:
```console
local $ man ssh
local $ man scp
local $ man sshfs
```
The `rsync` client uses `ssh` to establish the connection.
```console
local $ rsync my-local-file username@cluster-name.it4i.cz:directory/file
```
```console
local $ rsync -r my-local-dir username@cluster-name.it4i.cz:directory
```
### Parallel Transfer
!!! note
The data transfer speed is limited by the single TCP stream and single-core SSH encryption speed to about **250 MB/s** (750 MB/s in the case of the aes256-gcm@openssh.com cipher).
Run **multiple** streams to overcome this limit.
#### Many Files
Parallel execution of multiple rsync processes utilizes multiple cores to accelerate encryption and multiple TCP streams for enhanced bandwidth.
First, set up ssh-agent single sign on:
```console
local $ eval `ssh-agent`
local $ ssh-add
Enter passphrase for /home/user/.ssh/id_rsa:
```
Then run multiple rsync instances in parallel, for example:
```console
local $ cd my-local-dir
local $ ls | xargs -n 2 -P 4 /bin/bash -c 'rsync "$@" username@cluster-name.it4i.cz:mydir' sh
```
The **-n** argument determines the number of files to transfer in one rsync call. Set it according to file size and count (large for many small files).
The **-P** argument determines the number of parallel rsync processes. Set it to the number of cores on your local machine.
Alternatively, use [HyperQueue][11]. First get [HyperQueue binary][e], then run:
```console
local $ hq server start &
local $ hq worker start &
local $ find my-local-dir -type f | xargs -n 2 > jobfile
local $ hq submit --log=/dev/null --progress --each-line jobfile \
bash -c 'rsync -R $HQ_ENTRY username@cluster-name.it4i.cz:mydir'
```
Again, the **-n** argument determines the number of files to transfer in one rsync call. Set it according to file size and count (large for many small files).
#### Single Very Large File
To transfer a single very large file efficiently, we need to transfer many blocks of the file in parallel, utilizing multiple cores to accelerate SSH encryption and multiple TCP streams for enhanced bandwidth.
First, set up ssh-agent single sign on as [described above][10].
Second, start the [HyperQueue server and HyperQueue worker][f]:
```console
local $ hq server start &
local $ hq worker start &
```
Once set up, run the hqtransfer script listed below:
```console
local $ ./hqtransfer mybigfile username@cluster-name.it4i.cz outputpath/outputfile
```
The hqtransfer script:
```shell
#!/bin/bash
# Read input
if [ -z "$1" ]; then echo "Usage: $0 input_file ssh_destination [output_path/output_file]"; exit 1; fi
INFILE=$1
if [ -z "$2" ]; then echo "Usage: $0 input_file ssh_destination [output_path/output_file]"; exit 1; fi
DEST=$2
OUTFILE=$INFILE
if [ -n "$3" ]; then OUTFILE=$3; fi
# Calculate the number of 1GB transfer blocks
SIZE=$(($(stat --printf %s "$INFILE")/1024/1024/1024))
echo "Transferring $(($SIZE+1)) x 1GB blocks"
# Execute: each HyperQueue task transfers one 1GB block over ssh
hq submit --log=/dev/null --progress --array 0-$SIZE /bin/bash -c \
    "dd if=$INFILE bs=1G count=1 skip=\$HQ_TASK_ID | \
     ssh -c aes256-gcm@openssh.com $DEST \
     dd of=$OUTFILE bs=1G conv=notrunc seek=\$HQ_TASK_ID"
exit 0
```
Copy-paste the script into `hqtransfer` file and set executable flags:
```console
local $ chmod u+x hqtransfer
```
The `hqtransfer` script is ready for use.
### Data Transfer From Windows Clients
On Windows, use the [WinSCP client][c] to transfer data. The [win-sshfs client][d] provides a way to mount the cluster filesystems directly as an external disk.
## Connection Restrictions
Outgoing connections from cluster login nodes to the outside world are restricted to the following ports:
| Port | Protocol |
| ---- | -------- |
| 22 | SSH |
| 80 | HTTP |
| 443 | HTTPS |
| 873 | Rsync |
!!! note
Use **SSH port forwarding** and proxy servers to connect from cluster to all other remote ports.
Outgoing connections from cluster compute nodes are restricted to the internal network. Direct connections from compute nodes to the outside world are cut, with the following exceptions:

| Service          | IP range           | Allowed ports          |
| ---------------- | ------------------ | ---------------------- |
| e-INFRA CZ Cloud | 195.113.243.0/24   | TCP/22, TCP 1024-65535 |
| IT4I Cloud       | 195.113.175.128/26 | TCP/22, TCP 1024-65535 |
## Port Forwarding
### Port Forwarding From Login Nodes
!!! note
    Port forwarding allows an application running on the cluster to connect to arbitrary remote hosts and ports.
    It works by tunneling the connection from the cluster back to the user's workstation and forwarding it from the workstation to the remote host.
Select an unused port on the cluster login node (for example 6000) and establish the port forwarding:
```console
$ ssh -R 6000:remote.host.com:1234 cluster-name.it4i.cz
```
In this example, we establish port forwarding between port 6000 on the cluster and port 1234 on `remote.host.com`. By accessing `localhost:6000` on the cluster, an application will see the response of `remote.host.com:1234`. The traffic will run via the user's local workstation.
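For illustration, assuming the service on `remote.host.com:1234` speaks HTTP (a hypothetical setup), the tunnel can be verified directly from the cluster login node:

```console
$ curl http://localhost:6000/
```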
Port forwarding may be done **using PuTTY** as well. On the PuTTY Configuration screen, load your cluster configuration first. Then go to *Connection > SSH > Tunnels* to set up the port forwarding. Click the _Remote_ radio button. Insert 6000 into the _Source port_ textbox and `remote.host.com:1234` into the _Destination_ textbox. Click _Add_, then _Open_.
Port forwarding may be established directly to the remote host. However, this requires that the user has SSH access to `remote.host.com`.
```console
$ ssh -L 6000:localhost:1234 remote.host.com
```
!!! note
Port number 6000 is chosen as an example only. Pick any free port.
### Port Forwarding From Compute Nodes
Remote port forwarding from compute nodes allows applications running on the compute nodes to access hosts outside the cluster.
First, establish the remote port forwarding from the login node, as [described above][5].
Second, invoke port forwarding from the compute node to the login node. Insert the following line into your jobscript or interactive shell:
```console
$ ssh -TN -f -L 6000:localhost:6000 login1
```
In this example, we assume that port forwarding from `login1:6000` to `remote.host.com:1234` has been established beforehand. By accessing `localhost:6000`, an application running on a compute node will see the response of `remote.host.com:1234`.
### Using Proxy Servers
Port forwarding is static; each single port is mapped to a particular port on a remote host. Connection to another remote host requires a new forward.
!!! note
    Applications with built-in proxy support can reach any remote host via a single proxy server.
To establish a local proxy server on your workstation, install and run the SOCKS proxy server software. On Linux, the sshd daemon provides this functionality. To establish the SOCKS proxy server listening on port 1080, run:
```console
local $ ssh -D 1080 localhost
```
On Windows, install and run the free, open source Sock Puppet server.
Once the proxy server is running, establish the SSH port forwarding from cluster to the proxy server, port 1080, exactly as [described above][5]:
```console
local $ ssh -R 6000:localhost:1080 cluster-name.it4i.cz
```
Now, configure your application's proxy settings to `localhost:6000`. Use port forwarding to access the [proxy server from compute nodes][9], as well.
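Many command line tools also accept a SOCKS proxy directly; for example, `curl` running on the cluster could use the forwarded proxy as follows (the target URL is a placeholder):

```console
$ curl --socks5-hostname localhost:6000 http://example.com/
```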
[1]: ../general/accessing-the-clusters/shell-access-and-data-transfer/ssh-key-management.md
[2]: ../general/accessing-the-clusters/shell-access-and-data-transfer/putty.md
[5]: #port-forwarding-from-login-nodes
[6]: ../general/accessing-the-clusters/graphical-user-interface/x-window-system.md
[7]: ../general/accessing-the-clusters/graphical-user-interface/vnc.md
[8]: ../general/accessing-the-clusters/vpn-access.md
[9]: #port-forwarding-from-compute-nodes
[10]: #many-files
[11]: ../general/hyperqueue.md
[b]: http://linux.die.net/man/1/sshfs
[c]: http://winscp.net/eng/download.php
[d]: http://code.google.com/p/win-sshfs/
[e]: https://github.com/It4innovations/hyperqueue/releases/latest
[f]: https://it4innovations.github.io/hyperqueue/stable/cheatsheet/
---
hide:
- toc
---
# Slurm Batch Jobs Examples
Below is an excerpt from the [2024 e-INFRA CZ conference][1]
describing best practices for Slurm batch calculations and data management, including examples, by Ondrej Meca.
![PDF presentation on Slurm Batch Jobs Examples](../src/srun_karolina.pdf){ type=application/pdf style="min-height:100vh;width:100%" }
[1]: https://www.e-infra.cz/en/e-infra-cz-conference
# Job Submission and Execution
!!! warning
Don't use the `#SBATCH --exclusive` parameter as it is already included in the SLURM configuration.<br><br>
Use the `#SBATCH --mem=` parameter **on `qfat` only**. On `cpu_` queues, whole nodes are allocated.
    Accelerated nodes (`gpu_` queues) are each divided into eight parts with corresponding memory.
## Introduction
The [Slurm][1] workload manager is used to allocate and access the resources of Karolina, Barbora, and the Complementary systems.
A `man` page exists for all Slurm commands, as well as the `--help` command option,
which provides a brief summary of options.
Slurm [documentation][c] and [man pages][d] are also available online.
## Getting Partition Information
Display partitions/queues on system:
```console
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
qcpu* up 2-00:00:00 1/191/0/192 cn[1-192]
qcpu_biz up 2-00:00:00 1/191/0/192 cn[1-192]
qcpu_exp up 1:00:00 1/191/0/192 cn[1-192]
qcpu_free up 18:00:00 1/191/0/192 cn[1-192]
qcpu_long up 6-00:00:00 1/191/0/192 cn[1-192]
qcpu_preempt up 12:00:00 1/191/0/192 cn[1-192]
qgpu up 2-00:00:00 0/8/0/8 cn[193-200]
qgpu_biz up 2-00:00:00 0/8/0/8 cn[193-200]
qgpu_exp up 1:00:00 0/8/0/8 cn[193-200]
qgpu_free up 18:00:00 0/8/0/8 cn[193-200]
qgpu_preempt up 12:00:00 0/8/0/8 cn[193-200]
qfat up 2-00:00:00 0/1/0/1 cn201
qdgx up 2-00:00:00 0/1/0/1 cn202
qviz up 8:00:00 0/2/0/2 vizserv[1-2]
```
The `NODES(A/I/O/T)` column summarizes the node count per state, where `A/I/O/T` stands for `allocated/idle/other/total`.
The example output is from the Barbora cluster.
A graphical representation of the clusters' usage, partitions, nodes, and jobs can be found
* for Karolina at [https://extranet.it4i.cz/rsweb/karolina][5]
* for Barbora at [https://extranet.it4i.cz/rsweb/barbora][4]
* for Complementary Systems at [https://extranet.it4i.cz/rsweb/compsys][6]
On the Karolina cluster:

* all CPU queues/partitions provide full node allocation, i.e. whole nodes are allocated to the job
* other queues/partitions (gpu, fat, viz) provide partial node allocation
See [Karolina Slurm Specifics][7] for details.
On the Barbora cluster, all queues/partitions provide full node allocation; whole nodes are allocated to the job.
On Complementary systems, only some queues/partitions provide full node allocation,
see [Complementary systems documentation][2] for details.
## Running Interactive Jobs
Sometimes you may want to run your job interactively, for example for debugging,
running your commands one by one from the command line.
Run an interactive job in the `qcpu_exp` queue, with one node and one task by default:
```console
$ salloc -A PROJECT-ID -p qcpu_exp
```
Run an interactive job on four nodes with 128 tasks per node (the recommended value for the Karolina cluster CPU partitions, based on node core count)
and a two-hour time limit:
```console
$ salloc -A PROJECT-ID -p qcpu -N 4 --ntasks-per-node 128 -t 2:00:00
```
Run interactive job, with X11 forwarding:
```console
$ salloc -A PROJECT-ID -p qcpu_exp --x11
```
To finish the interactive job, use the Ctrl+D (`^D`) control sequence.
!!! warning
    Do not use `srun` to initiate interactive jobs; subsequent `srun` or `mpirun` invocations would block forever.
## Running Batch Jobs
Batch jobs are the standard way of running jobs and utilizing HPC clusters.
### Job Script
Create an example job script called `script.sh` with the following content:
```shell
#!/usr/bin/bash
#SBATCH --job-name MyJobName
#SBATCH --account PROJECT-ID
#SBATCH --partition qcpu
#SBATCH --nodes 4
#SBATCH --ntasks-per-node 128
#SBATCH --time 12:00:00
ml purge
ml OpenMPI/4.1.4-GCC-11.3.0
srun hostname | sort | uniq -c
```
The script will:
* use bash shell interpreter
* use `MyJobName` as job name
* use project `PROJECT-ID` for job access and accounting
* use partition/queue `qcpu`
* use `4` nodes
* use `128` tasks per node - value used by MPI
* set job time limit to `12` hours
* load appropriate module
* run the command; `srun` serves as Slurm's native way of executing MPI-enabled applications, and `hostname` is used in the example just for the sake of simplicity
!!! tip "Excluding Specific Nodes"
Use `#SBATCH --exclude=<node_name_list>` directive to exclude specific nodes from your job, e.g.: `#SBATCH --exclude=cn001,cn002,cn003`.
The submit directory will be used as the working directory for the submitted job,
so there is no need to change the directory in the job script.
Alternatively, you can specify the job working directory using the sbatch `--chdir` (or short `-D`) option.
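For example, to submit the same script with a different working directory (the path below is only illustrative):

```console
$ sbatch --chdir /scratch/project/PROJECT-ID/my_work_dir script.sh
```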
### Srun Over mpirun
While `mpirun` can be used to run parallel jobs on our Slurm-managed clusters, we recommend using `srun` for better integration with Slurm's scheduling and resource management. `srun` ensures more efficient job execution and resource control by leveraging Slurm’s features directly, and it simplifies the process by reducing the need for additional configurations often required with `mpirun`.
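As an illustration, inside a job script or an interactive allocation, an MPI application (here `./my_mpi_app`, a placeholder name) is launched across all allocated tasks simply with:

```console
$ srun ./my_mpi_app
```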
### Job Submit
Submit batch job:
```console
$ cd my_work_dir
$ sbatch script.sh
```
A path to `script.sh` (relative or absolute) should be given
if the job script is in a different location than the job working directory.
By default, job output is stored in a file called `slurm-JOBID.out` and contains both the job's standard output and error output.
This can be changed using the sbatch options `--output` (or `-o`) and `--error` (or `-e`).
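For example, the following directives (file names are illustrative) write standard output and error output to separate files; `%j` is expanded by Slurm to the job ID:

```shell
#SBATCH --output=myjob.%j.out
#SBATCH --error=myjob.%j.err
```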
Example output of the job:
```shell
128 cn017.karolina.it4i.cz
128 cn018.karolina.it4i.cz
128 cn019.karolina.it4i.cz
128 cn020.karolina.it4i.cz
```
### Job Environment Variables
Slurm provides useful information to the job via environment variables.
Environment variables are available on all nodes allocated to the job when accessed via Slurm-supported means (`srun`, compatible `mpirun`).
See all Slurm variables:
```
$ set | grep ^SLURM
```
Commonly used variables are:
| variable name | description | example |
| ------ | ------ | ------ |
| SLURM_JOB_ID | job id of the executing job| 593 |
| SLURM_JOB_NODELIST | nodes allocated to the job | cn[101-102] |
| SLURM_JOB_NUM_NODES | number of nodes allocated to the job | 2 |
| SLURM_STEP_NODELIST | nodes allocated to the job step | cn101 |
| SLURM_STEP_NUM_NODES | number of nodes allocated to the job step | 1 |
| SLURM_JOB_PARTITION | name of the partition | qcpu |
| SLURM_SUBMIT_DIR | submit directory | /scratch/project/open-xx-yy/work |
See relevant [Slurm documentation][3] for details.
Get job nodelist:
```
$ echo $SLURM_JOB_NODELIST
cn[101-102]
```
Expand nodelist to list of nodes:
```
$ scontrol show hostnames
cn101
cn102
```
## Job Management
### Getting Job Information
Show all jobs on system:
```console
$ squeue
```
Show my jobs:
```console
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
104 qcpu interact user R 1:48 2 cn[101-102]
```
Show job details for a specific job:
```console
$ scontrol show job JOBID
```
Show job details for executing job from job session:
```console
$ scontrol show job $SLURM_JOBID
```
Show my jobs using a long output format which includes time limit:
```console
$ squeue --me -l
```
Show my jobs in running state:
```console
$ squeue --me -t running
```
Show my jobs in pending state:
```console
$ squeue --me -t pending
```
Show jobs for a given project:
```console
$ squeue -A PROJECT-ID
```
### Job States
The most common job states are (in alphabetical order):
| Code | Job State | Explanation |
| :--: | :------------ | :------------------------------------------------------------------------------------------------------------- |
| CA | CANCELLED | Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated. |
| CD | COMPLETED | Job has terminated all processes on all nodes with an exit code of zero. |
| CG | COMPLETING | Job is in the process of completing. Some processes on some nodes may still be active. |
| F | FAILED | Job terminated with non-zero exit code or other failure condition. |
| NF | NODE_FAIL | Job terminated due to failure of one or more allocated nodes. |
| OOM | OUT_OF_MEMORY | Job experienced out of memory error. |
| PD | PENDING | Job is awaiting resource allocation. |
| PR | PREEMPTED | Job terminated due to preemption. |
| R | RUNNING | Job currently has an allocation. |
| RQ | REQUEUED | Completing job is being requeued. |
| SI | SIGNALING | Job is being signaled. |
| TO | TIMEOUT | Job terminated upon reaching its time limit. |
### Modifying Jobs
In general:
```
$ scontrol update JobId=JOBID ATTR=VALUE
```
Modify job's time limit:
```
$ scontrol update JobId=JOBID timelimit=4:00:00
```
Set/modify job's comment:
```
$ scontrol update JobId=JOBID Comment='The best job ever'
```
### Deleting Jobs
Delete a job by job ID:
```
$ scancel JOBID
```
Delete all my jobs:
```
$ scancel --me
```
Delete all my jobs in interactive mode, confirming every action:
```
$ scancel --me -i
```
Delete all my running jobs:
```
$ scancel --me -t running
```
Delete all my pending jobs:
```
$ scancel --me -t pending
```
Delete all my pending jobs for a project PROJECT-ID:
```
$ scancel --me -t pending -A PROJECT-ID
```
## Troubleshooting
### Invalid Account
`sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified`
Possible causes:
* Invalid account (i.e. project) was specified in job submission.
* User does not have access to given account/project.
* Given account/project does not have access to given partition.
* Access to given partition was retracted due to the project's allocation exhaustion.
[1]: https://slurm.schedmd.com/
[2]: /cs/job-scheduling/#partitions
[3]: https://slurm.schedmd.com/srun.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES
[4]: https://extranet.it4i.cz/rsweb/barbora
[5]: https://extranet.it4i.cz/rsweb/karolina
[6]: https://extranet.it4i.cz/rsweb/compsys
[7]: /general/karolina-slurm
[a]: https://slurm.schedmd.com/
[b]: http://slurmlearning.deic.dk/
[c]: https://slurm.schedmd.com/documentation.html
[d]: https://slurm.schedmd.com/man_index.html
[e]: https://slurm.schedmd.com/sinfo.html
[f]: https://slurm.schedmd.com/squeue.html
[g]: https://slurm.schedmd.com/scancel.html
[h]: https://slurm.schedmd.com/scontrol.html
[i]: https://slurm.schedmd.com/job_array.html
# Getting Help and Support
Contact [support\[at\]it4i.cz][a] or use the [support][b] portal for help and support regarding the cluster technology at IT4Innovations.
For communication, use the **Czech**, **Slovak**, or **English** language.
Follow the status of your request to IT4Innovations [here][b].
The IT4Innovations support team will use best efforts to resolve requests within thirty days.
[a]: mailto:support@it4i.cz
[b]: http://support.it4i.cz/rt
# CI/CD
## Introduction
Continuous Integration (CI) is the practice of automatically executing a compilation script and set of test cases to ensure that the integrated codebase is in a workable state. The integration is often followed by Continuous Benchmarking (CB) to evaluate the impact of the code change on the application performance and Continuous Deployment (CD) to distribute a new version of the developed code.
IT4I offers its users the possibility to set up CI for their projects and to execute their dedicated CI jobs directly on compute nodes of the production HPC clusters (Karolina, Barbora) and the Complementary systems. The Complementary systems make it possible to run the tests on emerging, non-traditional, and highly specialized hardware architectures; they consist of compute nodes built on Intel Sapphire Rapids + HBM, NVIDIA Grace CPU, IBM Power10, A64FX, and many more.
Besides that, it is also possible to execute CI jobs in a customizable virtual environment (Docker containers). This allows testing the code in a clean build environment. It also makes dependency management more straightforward, since all dependencies for building the project can be put into the Docker image from which the corresponding containers are created.
## CI Infrastructure Deployed at IT4I
IT4Innovations maintains a GitLab server (code.it4i.cz) with built-in support for CI/CD. It provides a set of GitLab runners; a runner is an application that executes the jobs specified in a project's CI pipelines. A pipeline consists of jobs grouped into stages: stages run in sequence, while all jobs within a stage can run in parallel.
Detailed documentation about GitLab CI/CD is available [here][1].
### Karolina, Barbora, and Complementary Systems
A unified solution is provided that lets all users execute their CI jobs on Karolina, Barbora, and the Complementary systems without the need to create their own project runners. For each of the HPC clusters, a GitLab instance runner has been deployed. The runners run on the login nodes and are visible to all projects of the IT4I GitLab server. These runners are shared by all users.
These runners use the **Jacamar CI driver**, an HPC-focused open-source CI/CD driver for GitLab runners. It allows a GitLab runner to interact directly with the job scheduler of a given cluster. One of the main benefits this driver provides is a downscoping mechanism, which ensures that every command within each CI job is executed as the user who triggered the CI pipeline to which the job belongs.
For more information about the Jacamar CI driver, please visit [the official documentation][2].
The execution of CI pipelines works as follows. First, a user on the IT4I GitLab server triggers a CI pipeline (for example, by pushing to a repository). Then, the jobs the pipeline consists of are sent to the corresponding runner running on the login node. Lastly, for every CI job, the runner clones the repository (or just fetches changes to an already cloned one, if there are any), restores the [cache][3], downloads [artifacts][4] (if specified), and submits the job as a Slurm job to the corresponding HPC cluster using the `sbatch` command. After each job execution, the runner reports the results back to the server, creates the cache, and uploads artifacts (if specified).
<img src="../../../img/it4i-ci.svg" title="IT4I CI" width="750">
!!! note
    The GitLab runners at Karolina and Barbora are able to submit (as Slurm jobs) and execute 32 CI jobs concurrently, while the runner at the Complementary systems can submit at most 16 jobs concurrently. Jobs above this limit are held back and submitted to the respective Slurm queue only after a previous job has finished.
### Virtual Environment (Docker Containers)
There are also 5 GitLab instance runners with the Docker executor configured, deployed in the local virtual infrastructure (each runs in a dedicated virtual machine). The runners use Docker Engine to execute each job in a separate, isolated container created from the image specified beforehand. These runners are also visible to all projects of the IT4I GitLab server.
Detailed information about the Docker executor and its workflow (the execution of CI pipelines) can be found [here][5].
In addition, these runners have distributed caching enabled. This feature uses a pre-configured object storage server and allows sharing the [cache][3] between subsequent CI jobs (of the same project) executed on multiple runners (2 or more of the 5 deployed). Refer to [Caching in GitLab CI/CD][6] for information about the cache and how it differs from artifacts.
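As a sketch, a per-branch cache of a build directory (the directory name is illustrative) can be declared in a CI job with the standard GitLab `cache` keyword:

```yaml
cache:
  key: "$CI_COMMIT_REF_SLUG"
  paths:
    - build/
```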
## How to Set Up Continuous Integration for Your Project
To begin with, a CI pipeline of a project must be defined in a YAML file. The most common name of this file is `.gitlab-ci.yml` and it should be located at the repository top level. For detailed information, see the [tutorial][7] on how to create your first pipeline. Additionally, the [CI/CD YAML syntax reference][8] lists all possible keywords that can be specified in the definition of CI/CD pipelines and jobs.
!!! note
    The default maximum time a CI job can run before it times out is 1 hour. This can be changed in the [project's CI/CD settings][9]. When jobs exceed the specified timeout, they are marked as failed. Pending jobs are dropped after 24 hours of inactivity.
### Execution of CI Pipelines at the HPC Clusters
Every CI job in the project CI pipeline that is intended to be submitted as a Slurm job to one of the HPC clusters must have the following three keywords specified in its definition; a combined example is shown after the list.
* `id_tokens`, in which `SITE_ID_TOKEN` must be defined with `aud` set to the URL of IT4I GitLab server.
```yaml
id_tokens:
SITE_ID_TOKEN:
aud: https://code.it4i.cz/
```
* `tags`, by which the appropriate runner for the CI job is selected. Exactly 3 tags must be specified in the `tags` clause of the CI job. Two of these are `it4i` and `slurmjob`; the third one is the name of the target cluster: `karolina`, `barbora`, or `compsys`.
```yaml
tags:
- it4i
- karolina/barbora/compsys
- slurmjob
```
* `variables`, where the `SCHEDULER_PARAMETERS` variable must be specified. This variable should contain all the arguments that the developer wants to pass to the `sbatch` command during the submission of the CI job (project, queue/partition, number of nodes, etc.). Some arguments are added automatically by the Jacamar CI driver, namely `--wait`, `--job-name`, and `--output`.
```yaml
variables:
  SCHEDULER_PARAMETERS: "-A ... -p ... -N ..."
```
Optionally, a custom build directory can also be specified. The deployed GitLab runners are configured to store all files and directories for the CI job in the home directory of the user who triggers the associated CI pipeline (the repository is also cloned there into a unique subpath). This behavior can be changed by specifying the `CUSTOM_CI_BUILDS_DIR` variable in the `variables` clause of the CI job.
```yaml
variables:
SCHEDULER_PARAMETERS: ...
CUSTOM_CI_BUILDS_DIR: /path/to/custom/build/dir/
```
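Putting the keywords together, a minimal CI job targeting Karolina might look as follows; the job name, project ID, partition, node count, and script commands are placeholders:

```yaml
build-and-test:
  id_tokens:
    SITE_ID_TOKEN:
      aud: https://code.it4i.cz/
  tags:
    - it4i
    - karolina
    - slurmjob
  variables:
    SCHEDULER_PARAMETERS: "-A PROJECT-ID -p qcpu_exp -N 1"
  script:
    - hostname
    - ./build_and_test.sh
```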
A GitLab repository with examples of CI jobs can be found [here][10].
### Execution of CI Pipelines in Docker Containers
Every CI job in the project CI pipeline that is intended to be executed by one of the 5 runners with the Docker executor configured must have the following two keywords specified in its definition; a combined example is shown after the list.
* `image`, where the name of the Docker image must be specified. Image requirements are listed [here][11]. See also [the description][12] in CI/CD YAML syntax reference for information about all possible name formats. The runners are configured to pull the images from [Docker Hub][13].
```yaml
image: <image-name-in-one-of-the-accepted-formats>
# or
image:
name: <image-name-in-one-of-the-accepted-formats>
```
* `tags`, by which one of the 5 runners is selected (the selection is done automatically). There are exactly 2 tags that must be specified in the `tags` clause of the CI job. Those are `centos7` and `docker`.
```yaml
tags:
- centos7
- docker
```
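A minimal job definition for the Docker runners might then look as follows; the image and commands are illustrative only:

```yaml
docker-test:
  image: python:3.11
  tags:
    - centos7
    - docker
  script:
    - python --version
```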
[1]: https://docs.gitlab.com/ee/topics/build_your_application.html
[2]: https://ecp-ci.gitlab.io/docs/admin/jacamar/introduction.html
[3]: https://docs.gitlab.com/ee/ci/yaml/#cache
[4]: https://docs.gitlab.com/ee/ci/yaml/#artifacts
[5]: https://docs.gitlab.com/runner/executors/docker.html
[6]: https://docs.gitlab.com/ee/ci/caching/index.html
[7]: https://docs.gitlab.com/ee/ci/quick_start/
[8]: https://docs.gitlab.com/ee/ci/yaml/index.html
[9]: https://docs.gitlab.com/ee/ci/pipelines/settings.html#set-a-limit-for-how-long-jobs-can-run
[10]: https://code.it4i.cz/nie0056/it4i-cicd-example
[11]: https://docs.gitlab.com/ee/ci/docker/using_docker_images.html#image-requirements
[12]: https://docs.gitlab.com/ee/ci/yaml/index.html#image
[13]: https://hub.docker.com/
# Code.it4i.cz
[code.it4i.cz][1] is a GitLab server maintained by IT4Innovations.
It is available for all IT4I users and can be accessed via LDAP login credentials.
It offers various tools including:
## Project Management
Collaborate with teammates using built-in code review tools.
Control, track, and manage different versions of your project.
Manage access and permission rights for different roles within your project.
Report, track, and manage bugs and feature requests within your projects.
Maintain project-specific documentation directly within GitLab.
For more detailed information see the [GitLab documentation][2].
## Continuous Integration / Continuous Deployment
Automatically execute compilation scripts and test cases
and distribute new versions of your code.
See the [CI/CD section][a] for more information.
## IT4I Documentation
Participate in improving and expanding our documentation to help other users.
[1]: https://code.it4i.cz/
[2]: https://docs.gitlab.com/
[a]: cicd.md
# Opencode
[Opencode][b] is a GitLab-based site for projects in which IT4Innovations participates.
## Who Can Access Opencode
An IT4Innovations account is not required; access is possible through [B2ACCESS][a].
## Sign in Through B2ACCESS
First, we need to verify your identity:
1. Sign in with your organization via B2ACCESS; the page requests a valid personal certificate (e.g. GEANT).
   Accounts with a "Low" level of assurance are not granted access to the IT4I zone.
2. Confirm your certificate in the browser:
![](../../img/B2ACCESS_chrome_eng.jpg)
3. Confirm your certificate in the OS (Windows):
![](../../img/crypto_v2.jpg)
4. Sign in to EUDAT/B2ACCESS:
![](../../img/b2access-univerzity.jpg)
    - If you aren't affiliated with any university, try logging in with a social account. If you aren't able to sign in through an IdP, it might be a result of an internal policy of your employer.
- If you don't have, or refuse to use, any of the above-mentioned accounts, you can create an account in B2ACCESS.
![](../../img/b2access-no_account.jpg)
![](../../img/b2access-select.jpg)
![](../../img/b2access-fill.jpg)
[a]: https://b2access.eudat.eu/
[b]: https://opencode.it4i.eu
# it4i-portal-clients
it4i-portal-clients provides a simple, user-friendly shell interface
for calling [IT4I API](https://docs.it4i.cz/apiv1/) requests and displaying their responses.
!!! important
Python 2.7 is required.
Limits are placed on the number of requests you may make to [IT4I API](https://docs.it4i.cz/apiv1/).
    The rate limit can be changed without warning at any time, but the default is **6 requests per minute**.
    Exceeding the limit will lead to your IP address being temporarily blocked from making further requests.
    The block is automatically lifted after an hour.
## List of Available Utilities
* [it4icheckaccess](#it4icheckaccess) - Shows if IT4I account and/or related project has the access to specified cluster and queue.
* [it4idedicatedtime](#it4idedicatedtime) - Shows IT4I dedicated time.
* [it4ifree](#it4ifree) - Shows some basic information from IT4I Slurm accounting.
* [it4ifsusage](#it4ifsusage) - Shows filesystem usage of IT4I cluster storage systems.
* [it4iuserfsusage](#it4iuserfsusage) - Shows user filesystem usage of IT4I cluster storage systems.
* [it4iprojectfsusage](#it4iprojectfsusage) - Shows project filesystem usage of IT4I cluster storage systems.
* [it4imotd](#it4imotd) - Shows IT4I messages of the day as formatted text or an HTML page (using TAL / Zope Page Template).
## Installation/Upgrading
```bash
pip install --upgrade it4i.portal.clients
```
## Sample Configuration File main.cfg
```bash
[main]
# IT4I API
api_url = https://scs.it4i.cz/api/v1/
it4ifreetoken = <your_token>
```
The username is taken from the OS; therefore, the script has to be run under the same login name that you use to log in to the clusters.
* System-wide config file path: ```/usr/local/etc/it4i-portal-clients/main.cfg```
* Local user's config file path: ```~/.it4ifree```
## it4icheckaccess
### Help of IT4ICHECKACCESS
```console
$ it4icheckaccess -h
usage: it4icheckaccess [-h] -l LOGIN -c CLUSTER -q QUEUE [-p PROJECT]
The command shows if an IT4I account and/or related project has the access to
specified cluster and queue. Return exit code 99 if access is not granted.
optional arguments:
-h, --help show this help message and exit
-l LOGIN, --login LOGIN
user login
-c CLUSTER, --cluster CLUSTER
cluster name
-q QUEUE, --queue QUEUE
queue
-p PROJECT, --project PROJECT
project id
```
### Example of IT4ICHECKACCESS
```console
$ it4icheckaccess -l xxx0123 -c barbora -q qcpu_exp -p DD-12-345
OK Access granted for regular queue.
```
## it4idedicatedtime
### Help of IT4IDEDICATEDTIME
```console
$ it4idedicatedtime -h
usage: it4idedicatedtime [-h] [-m {active,planned}]
[-c {barbora,karolina}]
The command shows IT4I dedicated time. By default all planned and active
outages of all clusters are displayed. Return exit code 99 if there is no
outage, otherwise return 0.
optional arguments:
-h, --help show this help message and exit
-m {active,planned}, --message {active,planned}
select type of dedicated time. Planned contains also
active
-c {barbora,karolina}, --cluster {barbora,karolina}
select cluster
```
### Example of IT4IDEDICATEDTIME
```console
$ it4idedicatedtime
Cluster Start End Last update
--------- ------------------- ------------------- -------------------
barbora 2024-03-19 08:00:00 2024-03-19 09:30:00 2024-03-08 08:24:33
karolina 2024-03-19 08:00:00 2024-03-19 09:30:00 2024-03-08 08:23:40
```
## it4ifree
### Help of IT4IFREE
```console
$ it4ifree -h
usage: it4ifree [-h] [-p] [-a]
The command shows some basic information from IT4I SLURM accounting. The
data is related to the current user and to all projects in which user
participates.
optional arguments:
-h, --help show this help message and exit
-p, --percent
show values in percentage. Projects with unlimited resources are not displayed
-a, --all Show all resources include inactive and future ones.
Columns of "Projects I am participating in":
PID: Project ID/account string.
Type: Standard or multiyear project.
Days left: Days till the given project expires.
Total: Core-hours allocated to the given project.
Used: Sum of core-hours used by all project members.
My: Core-hours used by the current user only.
Free: Core-hours that haven't yet been utilized.
Columns of "Projects I am Primarily Investigating" (if present):
PID: Project ID/account string.
Type: Standard or multiyear project.
Login: Project member's login name.
Used: Project member's used core-hours.
```
### Example of IT4IFREE
```console
$ it4ifree
Projects I am participating in
==============================
PID Resource type Days left Total Used By me Free
---------- --------------- ------------- -------- -------- -------- --------
OPEN-XX-XX Karolina GPU 249 42 0 0 42
Barbora CPU 249 42 5 5 37
Legacy NCH 249 100 0 0 100
Projects I am Primarily Investigating
=====================================
PID Resource type Login Usage
---------- -------------- ------- --------
OPEN-XX-XX Barbora CPU user1 3
Barbora CPU user2 2
Karolina GPU N/A 0
Legacy NCH N/A 0
Legend
======
N/A = No one used this resource yet
Legacy Normalized core hours are in NCH
Everything else is in Node Hours
```
## it4ifsusage
### Help of IT4IFSUSAGE
```console
$ it4ifsusage -h
usage: it4ifsusage [-h]
The command shows filesystem usage of IT4I cluster storage systems
optional arguments:
-h, --help show this help message and exit
```
### Example of IT4IFSUSAGE
```console
$ it4ifsusage
Quota Type Cluster / PID File System Space used Space limit Entries used Entries limit Last update
------------- --------------- ------------- ------------ ------------- -------------- --------------- -------------------
User barbora /home 2.9 GB 25.0 GB 183 500,000 2024-03-22 16:50:10
User karolina /home 3.0 MB 25.0 GB 150 500,000 2024-03-22 17:00:07
User barbora /scratch 0 Bytes 10.0 TB 0 10,000,000 2024-03-22 16:50:28
User karolina /scratch 0 Bytes 0 Bytes 0 0 2024-03-22 17:00:43
Project service proj3 1.5 TB 1.0 TB 169,933 198,000 2024-03-22 17:00:02
```
## it4iuserfsusage
### Help of IT4IUSERFSUSAGE
```console
$ it4iuserfsusage -h
usage: it4iuserfsusage [-h] [-c {all,barbora, karolina}]
The command shows user filesystem usage of IT4I cluster storage systems
optional arguments:
-h, --help show this help message and exit
```
### Example of IT4IUSERFSUSAGE
```console
$ it4iuserfsusage
Cluster File System Space used Space limit Entries used Entries limit Last update
--------------- ------------- ------------ ------------- -------------- --------------- -------------------
barbora /home 2.9 GB 25.0 GB 183 500,000 2024-03-22 16:50:10
karolina /home 3.0 MB 25.0 GB 150 500,000 2024-03-22 17:00:07
barbora /scratch 0 Bytes 10.0 TB 0 10,000,000 2024-03-22 16:50:28
karolina /scratch 0 Bytes 0 Bytes 0 0 2024-03-22 17:00:43
```
## it4iprojectfsusage
### Help of IT4IPROJECTFSUSAGE
```console
$ it4iprojectfsusage -h
usage: it4iprojectfsusage [-h] [-p {PID, all}]
The command shows project filesystem usage of IT4I cluster storage systems
optional arguments:
-h, --help show this help message and exit
```
### Example of IT4IPROJECTFSUSAGE
```console
$ it4iprojectfsusage
PID File System Space used Space limit Entries used Entries limit Last update
--------------- ------------- ------------ ------------- -------------- --------------- -------------------
service proj3 3.1 GB 1.0 TB 5 100,000 2024-03-22 17:00:02
it4i-x-y proj1 3.1 TB 2.0 TB 5 100,000 2024-03-22 17:00:02
dd-13-5 proj3 2 GB 3.0 TB 5 100,000 2024-03-22 17:00:02
projectx proj2 150 TB 4.0 TB 5 100,000 2024-03-22 17:00:02
```
## it4imotd
### Help of IT4IMOTD
```console
$ it4imotd -h
usage: it4imotd [-h] [-t TEMPLATE] [-w WIDTH] [-c]
The command shows IT4I messages of the day into formatted text or HTML page.
optional arguments:
-h, --help show this help message and exit
-t TEMPLATE, --template TEMPLATE
path to TAL / Zope Page Template, output will be
formatted into HTML page
-w WIDTH, --width WIDTH
maximum line width (intended for text rendering,
default of 78 columns)
-c, --cron sleep from 10 up to 60 seconds prior to any actions
-m {TYPE}, --message {TYPE}
select type of messages
supported types:
all,
public-service-announcement,
service-recovered-up,
critical-service-down,
service-hard-down,
auxiliary-service-down,
planned-outage,
service-degraded,
important,
notice.
```
### Example of IT4IMOTD
```console
$ it4imotd
Message of the Day (DD/MM/YYYY)
(YYYY-MM-DD hh:mm:ss)
More on https://...
```
---
hide:
- toc
---
# IT4I Data Sharing Tools
Below is a list of data sharing tools available at IT4Innovations, together with their descriptions.
| Hostname | VPN | Access | Domain | Function | Technology | Target Group | License |
| ----------------------- | --- | -------------------------------------------------------------- | ------- | --------------------------------------------------------------------------------------------------------------------- | ----------- | ------------------------------------------- | ----------------------------------------------------------------------------------------------------- |
| adasoffice.vsb.cz | no | | vsb.cz | Data sharing<br>Project work<br>\- Tasks<br>\- Milestones<br>\- Chat<br>Calendar<br>CRM<br>Collaborative writing | onlyoffice | Adas only, to shut down | 110/150 - Counts for 30 days from file open;<br>returns to pool if not used in 30 days.<br>50ks/1000E |
| internal.office.it4i.cz | no | ldap vsb | it4i.cz | Data sharing<br>Project work<br>\- Tasks<br>\- Milestones<br>\- Chat<br>Calendar<br>CRM<br>Collaborative writing | onlyoffice | Private projects | 110/150 - Counts for 30 days from file open;<br>returns to pool if not used in 30 days.<br>50ks/1000E |
| events.it4i.cz | | local account | it4i.cz | Events management<br>Schedules & programs<br>Evaluation & feedback<br>Participants registration | indico | IT4I | |
| code.it4i.cz | no | ldap it4i | it4i.cz | Versions management<br>Project management<br>CI/CD<br>Tasks<br>Documentation | gitlab | IT4I users | free |
| gitlab.it4i.cz | yes | ldap it4i | it4i.cz | Versions management<br>Project management<br>CI/CD<br>Tasks<br>Documentation | gitlab | IT4I | free |
| opencode.it4i.eu | no | ldap vsb<br>b2access<br>MyAccessID | it4i.eu | Versions management<br>Project management<br>CI/CD<br>Tasks<br>Documentation | gitlab | Lexis, OWS, Partners | free |
| openmm.it4i.eu | no | redirecting to opencode:<br>ldap vsb<br>b2access<br>MyAccessID | it4i.eu | Chat channels<br>Data sharing<br>Integration<br>Direct messages | gitlab | Lexis, OWS, Partners | free |
| sharing.office.it4i.cz | no | SSO<br>ldap vsb<br>MyAccessID | it4i.cz | Data sharing<br>Project work<br>\- Tasks<br>\- Milestones<br>\- Chat<br>Calendar<br>CRM<br>Collaborative writing | onlyoffice | Partners outside VSB | 110/150 - Counts for 30 days from file open;<br>returns to pool if not used in 30 days.<br>50ks/1000E |
| ext-folder.it4i.cz | no | ldap vsb<br>ldap it4i<br>MyAccessID | vsb.cz | Data sharing<br>draw.io<br>Forms | nextcloud | IT4I,<br>VSB users,<br>Partners outside VSB | free |