Commit ccd01794 authored by Lukáš Krupčík

initial commit

Showing
with 571 additions and 0 deletions
File added
# Satisfaction and Feedback
IT4Innovations National Supercomputing Center is interested in [user satisfaction and feedback][1]. It allows us to prioritize and focus on the most pressing issues. With the help of user feedback, we strive to provide a smooth and productive environment, where computational tasks may be solved without distraction or annoyance.
## Feedback Form
Please provide us with feedback regarding your satisfaction with our services using [the online form][1]. Set the values and comment on the individual aspects of our services.
We prefer you enter [**new inputs 3 times a year**][1].
You may view your [feedback history][2] any time.
You are welcome to modify your most recent input.
The form inquires about:
- Resource allocation and access
- Computing environment
- Added value services
You may set the satisfaction score on a **scale of 1 to 5** as well as leave **text comments**.
The score is interpreted as follows:
|Value | Interpretation |
|-----|---|
| 1-2 | Values below 3 indicate a level of dissatisfaction; improvements or other actions are desirable. The values are interpreted as a measure of how deep the dissatisfaction is.|
| 3 | Value 3 indicates a degree of satisfaction. Users are reasonably happy with the environment and services and do not require changes, although there still might be room for improvements. |
| 4-5 | Values above 3 indicate a level of exceptional appreciation and satisfaction; the values are interpreted as a measure of how rewarding the experience is. |
## Feedback Automation
In order to obtain ample feedback data without forcing our users
to spend effort filling out the feedback form, we implement automatic data collection.
The automation works as follows:
If the last feedback entry is older than 4 months, a new feedback entry is created as a copy of the last entry.
The new entry is modified in this way:
- score values greater than 3 are decremented by one;
- score values lower than 3 are incremented by one;
- score values equal to 3 are preserved;
- text fields are set blank.
Once a new feedback is created, users are notified by email and invited to [modify the feedback entry][2] as they see fit.
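The automation rules can be illustrated with a minimal Python sketch. It is not the service's actual implementation: the entry layout (`created`, `scores`, `comments`), the function names, and the approximation of "4 months" as 120 days are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumption: "4 months" is approximated as 120 days; the real service
# may count calendar months instead.
MAX_AGE = timedelta(days=120)


def next_score(score: int) -> int:
    """Move a 1-5 score one step towards the neutral value 3."""
    if score > 3:
        return score - 1
    if score < 3:
        return score + 1
    return score


def automatic_entry(last_entry: dict, now: datetime):
    """Return the auto-generated entry, or None if the last entry is recent enough."""
    if now - last_entry["created"] <= MAX_AGE:
        return None
    return {
        "created": now,
        # Scores drift towards 3; text fields are set blank.
        "scores": {k: next_score(v) for k, v in last_entry["scores"].items()},
        "comments": {k: "" for k in last_entry["comments"]},
    }
```

Applied repeatedly without user intervention, every score converges to 3 within two automatic entries.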
**Rationale:** Feedback automation takes away some effort from a group of moderately satisfied users,
while prompting the users to express satisfaction/dissatisfaction.
We assume that moderately satisfied users (satisfaction value 3) do not require changes to the environment
and tend to remain moderately satisfied in time.
Further, we assume that satisfied users (values 4-5) develop in time towards moderately satisfied (value 3)
by getting accustomed to the provided standards.
Dissatisfied users (values 1-2) also develop towards moderately satisfied due to
gradual improvements implemented by IT4I.
## Request Tracker Feedback
Please use the [user satisfaction and feedback][1] form to provide your overall view.
For acute, pressing issues and immediate contact, reach out for support via the [Request tracker portal][3] or [support\[at\]it4i.cz][4] email.
Express your satisfaction with the solution of an individual [Request tracker][3] ticket by selecting the **Feedback** menu on the ticket form.
## Evaluation
The user feedback is evaluated 4 times a year, at the end of March, June, September, and December.
We consider the text comments, as well as evaluate the score average, distribution and trends.
This is done in summary as well as per individual category.
[1]: https://scs.it4i.cz/feedbacks/new
[2]: https://scs.it4i.cz/feedbacks/
[3]: https://support.it4i.cz/rt
[4]: mailto:support@it4i.cz
# Job Scheduling
## Job Execution Priority
The scheduler gives each job an execution priority and then uses this job execution priority to select which job(s) to run.
Job execution priority is determined by these job properties (in order of importance):
1. queue priority
1. fair-share priority
1. eligible time
### Queue Priority
Queue priority is the priority of the queue in which the job is waiting prior to execution.
Queue priority has the biggest impact on job execution priority. The execution priority of jobs in higher priority queues is always greater than the execution priority of jobs in lower priority queues. Other properties of jobs used for determining the job execution priority (fair-share priority, eligible time) cannot compete with queue priority.
Queue priorities can be seen [here][a].
### Fair-Share Priority
Fair-share priority is calculated based on recent usage of resources. Fair-share priority is calculated per project, i.e. all members of a project share the same fair-share priority. Projects with higher recent usage have a lower fair-share priority than projects with lower or no recent usage.
Fair-share priority is used for ranking jobs with equal queue priority.
Fair-share priority is calculated as:
---8<--- "fairshare_formula.md"
where MAX_FAIRSHARE has the value of 1E6
usage<sub>Project</sub> is the usage accumulated by all members of a selected project
usage<sub>Total</sub> is the total usage by all users, across all projects.
Usage counts allocated core-hours (`ncpus x walltime`). Usage decays, halving at intervals of 168 hours (one week).
Jobs queued in the queue qexp are not used to calculate the project's usage.
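The usage accounting described above can be sketched in Python. This is an illustration only: it assumes a stepwise decay (usage is halved once per completed 168-hour interval), and the job dictionary keys and function names are hypothetical. The fair-share formula itself lives in the included file and is not reproduced here.

```python
HALF_LIFE_HOURS = 168  # one week, as stated above


def core_hours(ncpus: int, walltime_hours: float) -> float:
    """Usage of a single job: allocated cores times walltime."""
    return ncpus * walltime_hours


def decayed_usage(usage: float, age_hours: float) -> float:
    """Halve the usage once per completed 168-hour interval (assumed stepwise model)."""
    intervals = int(age_hours // HALF_LIFE_HOURS)
    return usage / (2 ** intervals)


def project_usage(jobs) -> float:
    """Sum the decayed usage of a project's jobs, skipping jobs from qexp."""
    return sum(
        decayed_usage(core_hours(j["ncpus"], j["walltime_hours"]), j["age_hours"])
        for j in jobs
        if j["queue"] != "qexp"
    )
```

For example, a 36-core job that ran for 10 hours counts as 360 core-hours at first and as 90 core-hours two weeks later.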
!!! note
    Calculated usage and fair-share priority can be seen [here][b].
    Calculated fair-share priority can also be seen in the `Resource_List.fairshare` attribute of a job.
### Eligible Time
Eligible time is the amount of eligible time (in seconds) a job accrues while waiting to run. Jobs with higher eligible time gain higher priority.
Eligible time has the least impact on execution priority. Eligible time is used for sorting jobs with equal queue priority and fair-share priority. It is very difficult for eligible time to compete with fair-share priority.
Eligible time can be seen in the `eligible_time` attribute of a job.
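Taken together, the three properties behave roughly like a lexicographic sort key. The Python sketch below reproduces only that ordering; it is an approximation, not the scheduler's weighted job sort formula, and the dictionary keys are hypothetical names chosen for the example.

```python
def execution_order(jobs):
    """Order jobs by queue priority, then fair-share priority, then eligible time.

    Higher values win at every level (approximation of the ordering described above).
    """
    return sorted(
        jobs,
        key=lambda j: (j["queue_priority"], j["fairshare"], j["eligible_time"]),
        reverse=True,
    )


jobs = [
    {"name": "a", "queue_priority": 0, "fairshare": 900000, "eligible_time": 100},
    {"name": "b", "queue_priority": 150, "fairshare": 10, "eligible_time": 0},
    {"name": "c", "queue_priority": 0, "fairshare": 900000, "eligible_time": 500},
    {"name": "d", "queue_priority": 0, "fairshare": 100, "eligible_time": 10000},
]
ordered_names = [j["name"] for j in execution_order(jobs)]
```

Job `b` comes first despite its low fair-share priority because it waits in a higher-priority queue; `d` comes last even though it has waited the longest.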
### Formula
Job execution priority (job sort formula) is calculated as:
---8<--- "job_sort_formula.md"
### Job Backfilling
The scheduler uses job backfilling.
Backfilling means fitting smaller jobs around the higher-priority jobs that the scheduler is going to run next, in such a way that the higher-priority jobs are not delayed. Backfilling allows us to keep resources from becoming idle when the top job (the job with the highest execution priority) cannot run.
The scheduler makes a list of jobs to run in order of execution priority. The scheduler looks for smaller jobs that can fit into the usage gaps around the highest-priority jobs in the list. The scheduler looks in the prioritized list of jobs and chooses the highest-priority smaller jobs that fit. Filler jobs are run only if they will not delay the start time of top jobs.
This means that jobs with lower execution priority can be run before jobs with higher execution priority.
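A deliberately simplified Python model of backfilling follows. It is a sketch under stated assumptions, not the scheduler's algorithm: nodes are interchangeable, walltimes are exact, and a filler job is accepted only if it finishes before the earliest possible start of the top job. The function name and tuple layouts are hypothetical.

```python
def backfill(total_nodes, running, queue):
    """Return the names of the jobs that can start now.

    running: list of (nodes, hours_remaining) for jobs already executing
    queue:   list of (name, nodes, walltime_hours), highest priority first
    """
    active = list(running)
    free = total_nodes - sum(nodes for nodes, _ in active)
    pending = list(queue)
    started = []

    # Start jobs in priority order for as long as they fit.
    while pending and pending[0][1] <= free:
        name, nodes, walltime = pending.pop(0)
        free -= nodes
        active.append((nodes, walltime))
        started.append(name)
    if not pending:
        return started

    # The top job does not fit: find the earliest time enough nodes are released.
    top_nodes = pending[0][1]
    available, top_start = free, None
    for nodes, remaining in sorted(active, key=lambda job: job[1]):
        available += nodes
        if available >= top_nodes:
            top_start = remaining
            break
    if top_start is None:
        return started  # the top job never fits in this simple model

    # Filler jobs must fit into the free nodes and end before the top job starts.
    for name, nodes, walltime in pending[1:]:
        if nodes <= free and walltime <= top_start:
            free -= nodes
            started.append(name)
    return started
```

With 10 nodes, one running job holding 8 nodes for 4 more hours, and a 6-node top job, a 2-node job with a 3-hour walltime is backfilled, while a 2-node job with an 8-hour walltime is not. The model is conservative: a real scheduler could also run fillers on nodes the top job will not need.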
!!! note
    It is **very beneficial to specify the walltime** when submitting jobs.
Specifying more accurate walltime enables better scheduling, better execution times, and better resource usage. Jobs with suitable (small) walltime can be backfilled - and overtake job(s) with a higher priority.
---8<--- "mathjax.md"
### Job Placement
Job [placement can be controlled by flags during submission][1].
[1]: job-submission-and-execution.md#advanced-job-placement
[a]: https://extranet.it4i.cz/rsweb/barbora/queues
[b]: https://extranet.it4i.cz/rsweb/barbora/projects
# Certificates FAQ
FAQ about certificates in general.
## Q: What Are Certificates?
IT4Innovations employs X.509 certificates for secure communication (e.g. credentials exchange) and for grid services related to PRACE, as they present a single method of authentication for all PRACE services, where only one password is required.
There are different kinds of certificates, each with a different scope of use. We mention here:
* User (Private) certificates
* Certificate Authority (CA) certificates
* Host certificates
* Service certificates
However, users only need to manage User and CA certificates. Note that your user certificate is protected by an associated private key, and this **private key must never be disclosed**.
## Q: Which X.509 Certificates Are Recognized by IT4Innovations?
See the [Certificates for Digital Signatures][1] section.
## Q: How Do I Get a User Certificate That Can Be Used With IT4Innovations?
To get a certificate, you must make a request to your local, IGTF-approved Certificate Authority (CA). Then, you must usually visit, in person, your nearest Registration Authority (RA) to verify your affiliation and identity (photo identification is required). Usually, you will then be emailed details on how to retrieve your certificate, although procedures can vary between CAs. If you are in Europe, you can locate [your trusted CA][a].
In some countries, certificates can also be retrieved using the TERENA Certificate Service, see the FAQ below for the link.
## Q: Does IT4Innovations Support Short Lived Certificates (SLCS)?
Yes, if the CA which provides this service is also a member of IGTF.
## Q: Does IT4Innovations Support the TERENA Certificate Service?
Yes, IT4Innovations supports TERENA eScience personal certificates. For more information, visit [TCS - Trusted Certificate Service][b], where you can also find if your organization/country can use this service.
## Q: What Format Should My Certificate Take?
User certificates come in many formats, the three most common being the PKCS12, PEM, and JKS formats.
The PKCS12 (often abbreviated to p12) format stores your user certificate, along with your associated private key, in a single file. This form of your certificate is typically employed by web browsers, mail clients, and grid services like UNICORE, DART, gsissh-term, and Globus toolkit (GSI-SSH, GridFTP, and GRAM5).
The PEM format (`*.pem`) stores your user certificate and your associated private key in two separate files. This form of your certificate can be used by PRACE’s gsissh-term and with the grid related services like Globus toolkit (GSI-SSH, GridFTP, and GRAM5).
To convert your certificate from PEM to p12 format and _vice versa_, IT4Innovations recommends using the OpenSSL tool (see the [separate FAQ entry][2]).
JKS is the Java KeyStore and may contain both your personal certificate with your private key and a list of your trusted CA certificates. This form of your certificate can be used by grid services like DART and UNICORE6.
To convert your certificate from p12 to JKS, IT4Innovations recommends using the keytool utility (see the [separate FAQ entry][3]).
## Q: What Are CA Certificates?
Certification Authority (CA) certificates are used to verify the link between your user certificate and the issuing authority. They are also used to verify the link between the host certificate of an IT4Innovations server and the CA that issued the certificate. In essence, they establish a chain of trust between you and the target server. Thus, for some grid services, users must have a copy of all the CA certificates.
To assist users, SURFsara (a member of PRACE) provides a complete and up-to-date bundle of all the CA certificates that any PRACE user (or IT4Innovations grid services user) will require. Bundles of certificates, in either p12, PEM, or JKS format, are [available here][c].
It is worth noting that gsissh-term and DART automatically update their CA certificates from this SURFsara website. In other cases, if you receive a warning that a server’s certificate cannot be validated (not trusted), update your CA certificates via the SURFsara website. If this fails, contact the IT4Innovations helpdesk.
Lastly, if you need the CA certificates for a personal Globus 5 installation, you can install the CA certificates from a MyProxy server with the following command:
```console
myproxy-get-trustroots -s myproxy-prace.lrz.de
```
If you run this command as `root`, it will install the certificates into `/etc/grid-security/certificates`. Otherwise, the certificates will be installed into `$HOME/.globus/certificates`. For Globus, you can download the `globuscerts.tar.gz` package [available here][c].
## Q: What Is a DN and How Do I Find Mine?
DN stands for Distinguished Name and is a part of your user certificate. IT4Innovations needs to know your DN to enable your account to use the grid services. You may use OpenSSL (see [below][2]) to determine your DN or, if your browser contains your user certificate, you can extract your DN from your browser.
For Internet Explorer users, the DN is referred to as the "subject" of your certificate: Tools > Internet Options > Content > Certificates > View > Details > Subject.
For users running Firefox under Windows, the DN is referred to as the "subject" of your certificate: Tools > Options > Advanced > Encryption > View Certificates. Highlight your name and then click View > Details > Subject.
## Q: How Do I Use the OpenSSL Tool?
The following examples are for Unix/Linux operating systems only.
To convert from PEM to p12, enter the following command:
```console
openssl pkcs12 -export -in usercert.pem -inkey userkey.pem -out username.p12
```
To convert from p12 to PEM, type the following _four_ commands:
```console
openssl pkcs12 -in username.p12 -out usercert.pem -clcerts -nokeys
openssl pkcs12 -in username.p12 -out userkey.pem -nocerts
chmod 444 usercert.pem
chmod 400 userkey.pem
```
To check your Distinguished Name (DN), enter the following command:
```console
openssl x509 -in usercert.pem -noout -subject -nameopt RFC2253
```
To check your certificate (e.g. DN, validity, issuer, public key algorithm, etc.), enter the following command:
```console
openssl x509 -in usercert.pem -text -noout
```
If OpenSSL is not pre-installed, you can download it [here][d]. On Mac OS X, OpenSSL is already pre-installed and can be used immediately.
## Q: How Do I Create and Then Manage a Keystore?
IT4Innovations recommends the Java-based keytool utility to create and manage keystores, which themselves are stores of keys and certificates. For example, if you want to convert your pkcs12-formatted key pair into a Java keystore, you can use the following command:
```console
keytool -importkeystore -srckeystore $my_p12_cert -destkeystore $my_keystore \
    -srcstoretype pkcs12 -deststoretype jks -alias $my_nickname -destalias $my_nickname
```
where `$my_p12_cert` is the name of your p12 (pkcs12) certificate, `$my_keystore` is the name that you give to your new Java keystore, and `$my_nickname` is the alias name that the p12 certificate was given and is also used for the new keystore.
You can also import CA certificates into your Java keystore with the tool, for example:
```console
keytool -import -trustcacerts -alias $mydomain -file $mydomain.crt -keystore $my_keystore
```
where `$mydomain.crt` is the certificate of a trusted signing authority (CA) and `$mydomain` is the alias name that you give to the entry.
More information on the tool can be found [here][e].
## Q: How Do I Use My Certificate to Access Different Grid Services?
Most grid services require the use of your certificate; however, the format of your certificate depends on the grid service you wish to employ.
If employing the PRACE version of GSISSH-term (also a Java Web Start Application), you may use either the PEM or p12 formats. Note that this service automatically installs up-to-date PRACE CA certificates.
If the grid service is UNICORE, then you bind your certificate, in either the p12 format or JKS, to UNICORE during the installation of the client on your local machine.
If the grid service is a part of Globus (e.g. GSI-SSH, GridFTP, or GRAM5), the certificates can be in either p12 or PEM format and must reside in the `$HOME/.globus` directory for Linux and Mac users or `%HOMEPATH%\.globus` for Windows users. (Windows users will have to use the DOS command `cmd` to create a directory which starts with a '.'). Further, user certificates should be named either `usercred.p12` or `usercert.pem` and `userkey.pem`, and the CA certificates must be kept in a pre-specified directory as follows. For Linux and Mac users, this directory is either `$HOME/.globus/certificates` or `/etc/grid-security/certificates`. For Windows users, this directory is `%HOMEPATH%\.globus\certificates`. (If you are using GSISSH-Term from prace-ri.eu, you do not have to create the `.globus` directory nor install CA certificates to use this tool alone).
## Q: How Do I Manually Import My Certificate Into My Browser?
In Firefox, you can import your certificate by first choosing the "Preferences" window. For Windows, this is Tools > Options. For Linux, this is Edit > Preferences. For Mac, this is Firefox > Preferences. Then choose the "Advanced" button, followed by the "Encryption" tab. Then choose the "Certificates" panel, select the "Select one automatically" option if you have only one certificate, or "Ask me every time" if you have more than one. Then, click on the "View Certificates" button to open the "Certificate Manager" window. You can then select the "Your Certificates" tab and click on the "Import" button. Then locate the PKCS12 (.p12) certificate you wish to import and employ its associated password.
If you are a Safari user, then simply open the "Keychain Access" application and follow File > Import items.
If you are an Internet Explorer user, click Start > Settings > Control Panel and then double-click on Internet. On the Content tab, click Personal and then click Import. Type your password in the Password field. You may be prompted multiple times for your password. In the "Certificate File To Import" box, type the filename of the certificate you wish to import, and then click OK. Click Close, and then click OK.
## Q: What Is a Proxy Certificate?
A proxy certificate is a short-lived certificate, which may be employed by UNICORE and the Globus services. The proxy certificate consists of a new user certificate and a newly generated proxy private key. This proxy typically has a rather short lifetime (normally 12 hours) and often allows only a limited delegation of rights. Its default location on Unix/Linux is `/tmp/x509up_u<uid>` (where `<uid>` is your numeric user ID), but it can be changed via the `$X509_USER_PROXY` environment variable.
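The lookup order for the proxy file can be sketched in a few lines of Python. This is an illustration only, assuming the conventional Globus default file name `/tmp/x509up_u<uid>`; the function name `proxy_path` is hypothetical and not part of any grid tool.

```python
import os


def proxy_path(environ=None, uid=None):
    """Return the proxy certificate path: $X509_USER_PROXY if set, else the default."""
    environ = os.environ if environ is None else environ
    uid = os.getuid() if uid is None else uid  # os.getuid() is Unix-only
    # Assumption: the default follows the conventional /tmp/x509up_u<uid> pattern.
    return environ.get("X509_USER_PROXY") or f"/tmp/x509up_u{uid}"
```

Calling `proxy_path()` without arguments uses the current environment and user ID.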
## Q: What Is the MyProxy Service?
[MyProxy Service][g] can be employed by gsissh-term and Globus tools and is an online repository that allows users to store long-lived proxy certificates remotely, which can then be retrieved for later use. Each proxy is protected by a password provided by the user at the time of storage. This is beneficial to Globus users, as they do not have to carry their private keys and certificates when travelling; nor do users have to install private keys and certificates on possibly insecure computers.
## Q: Someone May Have Copied or Had Access to the Private Key of My Certificate Either in a Separate File or in the Browser. What Should I Do?
Please ask the Certificate Authority that issued your certificate to revoke this certificate and to supply you with a new one. In addition, report this to IT4Innovations by contacting [the support team][h].
## Q: My Certificate Expired. What Should I Do?
In order to still be able to communicate with us, request a new certificate from your CA. There is no need to explicitly send us any information about your new certificate if the new one has the same Distinguished Name (DN) as the old one.
[1]: obtaining-login-credentials.md#certificates-for-digital-signatures
[2]: #q-how-do-i-use-the-openssl-tool
[3]: #q-how-do-i-create-and-then-manage-a-keystore
[a]: https://www.eugridpma.org/members/worldmap/
[b]: https://tcs-escience-portal.terena.org/
[c]: https://winnetou.surfsara.nl/prace/certs/
[d]: https://www.openssl.org/source/
[e]: http://docs.oracle.com/javase/7/docs/technotes/tools/solaris/keytool.html
[g]: http://grid.ncsa.illinois.edu/myproxy/
[h]: https://support.it4i.cz/rt
# Resource Allocation and Job Execution
To run a [job][1], computational resources for this particular job must be allocated. This is done via the PBS Pro job workload manager software, which distributes workloads across the supercomputer. Extensive information about PBS Pro can be found in the [PBS Pro User's Guide][2].
## Resources Allocation Policy
Resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue and the resources available to the Project. [The Fair-share][3] ensures that individual users may consume approximately equal amounts of resources per week. The resources are accessible via queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources. The following queues are the most important:
* **qexp** - Express queue
* **qprod** - Production queue
* **qlong** - Long queue
* **qmpp** - Massively parallel queue
* **qnvidia**, **qmic**, **qfat** - Dedicated queues
* **qfree** - Free resource utilization queue
!!! note
    Check the queue status [here][a].
Read more on the [Resources Allocation Policy][4] page.
## Job Submission and Execution
!!! note
    Use the **qsub** command to submit your jobs.

The `qsub` command creates a request to the PBS Job manager for allocation of specified resources. The **smallest allocation unit is an entire node - 16 cores**, with the exception of the `qexp` queue. The resources will be allocated when available, subject to allocation policies and constraints. **After the resources are allocated, the jobscript or interactive shell is executed on the first of the allocated nodes.**
Read more on the [Job Submission and Execution][5] page.
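As an illustration of what such a request looks like, the Python sketch below assembles a `qsub` command line from the standard PBS options `-A` (project), `-q` (queue), and `-l` (resource list). The helper name, the project ID `PROJECT-ID`, and the script name `./myjob.sh` are placeholders, not values taken from this documentation.

```python
import shlex


def qsub_command(project, queue, nodes, ncpus, walltime, script):
    """Build the argument list for a whole-node PBS job request."""
    resources = f"select={nodes}:ncpus={ncpus},walltime={walltime}"
    return ["qsub", "-A", project, "-q", queue, "-l", resources, script]


# Hypothetical request: 4 full nodes (16 cores each) for 3 hours in qprod.
cmd = qsub_command("PROJECT-ID", "qprod", 4, 16, "03:00:00", "./myjob.sh")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run` on a login node; the printed form is what you would type in the shell.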
## Capacity Computing
!!! note
    Use Job arrays when running a huge number of jobs.
    Use GNU Parallel and/or Job arrays when running (many) single core jobs.
In many cases, it is useful to submit a huge (100+) number of computational jobs into the PBS queue system. A huge number of (small) jobs is one of the most effective ways to execute parallel calculations, achieving best runtime, throughput and computer utilization. In this chapter, we discuss the recommended way to run huge numbers of jobs, including **ways to run huge numbers of single core jobs**.
Read more on the [Capacity Computing][6] page.
[1]: ../index.md#terminology-frequently-used-on-these-pages
[2]: ../pbspro.md
[3]: job-priority.md#fair-share-priority
[4]: resources-allocation-policy.md
[5]: job-submission-and-execution.md
[6]: capacity-computing.md
[a]: https://extranet.it4i.cz/rsweb/salomon/queues
# Resources Allocation Policy
## Job Queue Policies
Resources are allocated to jobs in a fair-share fashion, subject to constraints set by the queue and the resources available to the project. The fair-share system ensures that individual users may consume approximately equal amounts of resources per week. Detailed information can be found in the [Job scheduling][1] section. Resources are accessible via several queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources. The following table provides the queue partitioning overview:
!!! hint
    The qexp queue is configured to run one job and accept five jobs in a queue per user.

!!! note
    **The qfree queue is not free of charge**. [Normal accounting][2] applies. However, it allows for utilization of free resources, once a project has exhausted all its allocated computational resources. This does not apply to Director's Discretion projects (DD projects) by default. Usage of qfree after exhaustion of DD projects' computational resources is allowed upon request.

!!! note
    **The qexp queue is equipped with nodes that do not have exactly the same CPU clock speed.** Should you need the nodes to have exactly the same CPU speed, you have to select the proper nodes during the PBS job submission.
### Salomon
| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime |
| ------------------------------- | -------------- | -------------------- | ------------------------------------------------------------- | --------- | -------- | ------------- | --------- |
| **qexp** Express queue | no | none required | 32 nodes, max 8 per user | 24 | 150 | no | 1 / 1h |
| **qprod** Production queue | yes | > 0 | 1006 nodes, max 86 per job | 24 | 0 | no | 24 / 48h |
| **qlong** Long queue | yes | > 0 | 256 nodes, max 40 per job, only non-accelerated nodes allowed | 24 | 0 | no | 72 / 144h |
| **qmpp** Massive parallel queue | yes | > 0 | 1006 nodes | 24 | 0 | yes | 2 / 4h |
| **qfat** UV2000 queue | yes | > 0 | 1 (uv1) | 8 | 200 | yes | 24 / 48h |
| **qfree** Free resource queue | yes | < 120% of allocation | 987 nodes, max 86 per job | 24 | -1024 | no | 12 / 12h |
| **qviz** Visualization queue | yes | none required | 2 (with NVIDIA Quadro K5000) | 4 | 150 | no | 1 / 8h |
| **qmic** Intel Xeon Phi cards | yes | > 0 | 864 Intel Xeon Phi cards, max 8 mic per job | 0 | 0 | no | 24 / 48h |
* **qexp**, the Express queue: This queue is dedicated to testing and running very small jobs. It is not required to specify a project to enter the qexp. There are always 2 nodes reserved for this queue (w/o accelerator); a maximum of 8 nodes is available via the qexp for a particular user. The nodes may be allocated on a per core basis. No special authorization is required to use it. The maximum runtime in qexp is 1 hour.
* **qprod**, the Production queue: This queue is intended for normal production runs. It is required that an active project with nonzero remaining resources is specified to enter the qprod. All nodes may be accessed via the qprod queue, however only 86 per job. Full nodes, 24 cores per node, are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qprod is 48 hours.
* **qlong**, the Long queue: This queue is intended for long production runs. It is required that an active project with nonzero remaining resources is specified to enter the qlong. Only 336 nodes without acceleration may be accessed via the qlong queue. Full nodes, 24 cores per node, are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qlong is 144 hours (three times the standard qprod time - 3 \* 48 h).
* **qmpp**, the Massively parallel queue: This queue is intended for massively parallel runs. It is required that an active project with nonzero remaining resources is specified to enter the qmpp. All nodes may be accessed via the qmpp queue. Full nodes, 24 cores per node, are allocated. The queue runs with medium priority. The maximum runtime in qmpp is 4 hours. The PI needs to explicitly ask support for authorization to enter the queue for all users associated with their Project.
* **qfat**, the UV2000 queue: This queue is dedicated to accessing the fat SGI UV2000 SMP machine. The machine (uv1) has 112 Intel IvyBridge cores at 3.3GHz and 3.25TB RAM (8 cores and 128GB RAM are dedicated to the system). The PI needs to explicitly ask support for authorization to enter the queue for all users associated with their Project.
* **qfree**, the Free resource queue: The queue qfree is intended for utilization of free resources, after a Project has exhausted all its allocated computational resources (does not apply to DD projects by default; DD projects have to request permission to use qfree after exhaustion of computational resources). It is required that an active project is specified to enter the queue. Consumed resources will be accounted to the Project. Access to the qfree queue is automatically removed if consumed resources exceed 120% of the resources allocated to the Project. Only 987 nodes without accelerator may be accessed from this queue. Full nodes, 24 cores per node, are allocated. The queue runs with very low priority and no special authorization is required to use it. The maximum runtime in qfree is 12 hours.
* **qviz**, the Visualization queue: Intended for pre-/post-processing using OpenGL accelerated graphics. Currently when accessing the node, each user gets 4 cores of a CPU allocated, thus approximately 73 GB of RAM and 1/7 of the GPU capacity (default "chunk"). If more GPU power or RAM is required, it is recommended to allocate more chunks (with 4 cores each) up to one whole node per user, so that all 28 cores, 512 GB RAM, and the whole GPU are exclusive. This is currently also the maximum allowed allocation per user. One hour of work is allocated by default; the user may ask for 2 hours maximum.
* **qmic**, the Intel Xeon Phi queue: This queue provides access to the MIC nodes. It is required that an active project with nonzero remaining resources is specified to enter the qmic. All 864 MICs are included.
!!! note
    To access a node with Xeon Phi co-processor, you need to specify it in a [job submission select statement][3].
### Barbora
| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime |
| ------------------- | -------------- | -------------------- | ---------------------------------------------------- | --------- | -------- | ------------- | -------- |
| qexp | no | none required | 189 nodes | 36 | 150 | no | 1 h |
| qprod | yes | > 0 | 187 nodes w/o accelerator | 36 | 0 | no | 24/48 h |
| qlong | yes | > 0 | 20 nodes w/o accelerator | 36 | 0 | no | 72/144 h |
| qnvidia | yes | > 0 | 8 NVIDIA nodes | 24 | 200 | yes | 24/48 h |
| qfat | yes | > 0 | 1 fat node | 8 | 200 | yes | 24/144 h |
| qfree | yes | < 120% of allocation | 189 w/o accelerator | 36 | -1024 | no | 12 h |
**The qexp queue is equipped with nodes that do not have exactly the same CPU clock speed.** Should you need the nodes to have exactly the same CPU speed, you have to select the proper nodes during the PBS job submission.
* **qexp**, the Express queue: This queue is dedicated to testing and running very small jobs. It is not required to specify a project to enter the qexp. There are always 2 nodes reserved for this queue (w/o accelerators); a maximum of 8 nodes is available via the qexp for a particular user. The nodes may be allocated on a per core basis. No special authorization is required to use qexp. The maximum runtime in qexp is 1 hour.
* **qprod**, the Production queue: This queue is intended for normal production runs. It is required that an active project with nonzero remaining resources is specified to enter the qprod. All nodes may be accessed via the qprod queue, except the reserved ones. 187 nodes without accelerators are included. Full nodes, 36 cores per node, are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qprod is 48 hours.
* **qlong**, the Long queue: This queue is intended for long production runs. It is required that an active project with nonzero remaining resources is specified to enter the qlong. Only 20 nodes without acceleration may be accessed via the qlong queue. Full nodes, 36 cores per node, are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qlong is 144 hours (three times the standard qprod time - 3 x 48 h).
* **qnvidia**, **qfat**, the Dedicated queues: The queue qnvidia is dedicated to accessing the NVIDIA accelerated nodes and qfat the fat nodes. It is required that an active project with nonzero remaining resources is specified to enter these queues. Included are 8 NVIDIA nodes (4 NVIDIA cards per node) and 1 fat node. Full nodes, 24 cores per node, are allocated. The queues run with very high priority. The PI needs to explicitly ask [support][a] for authorization to enter the dedicated queues for all users associated with their project.
* **qfree**, the Free resource queue: The queue qfree is intended for utilization of free resources, after a project has exhausted all of its allocated computational resources (does not apply to DD projects by default; DD projects have to request permission to use qfree after exhaustion of computational resources). It is required that an active project is specified to enter the queue. Consumed resources will be accounted to the Project. Access to the qfree queue is automatically removed if consumed resources exceed 120% of the resources allocated to the Project. Only 189 nodes without accelerators may be accessed from this queue. Full nodes, 36 cores per node, are allocated. The queue runs with very low priority and no special authorization is required to use it. The maximum runtime in qfree is 12 hours.
## Queue Notes
The job wall clock time defaults to **half the maximum time** (see the table above). Longer wall time limits can be [set manually; see the examples][3].
Jobs that exceed the reserved wall clock time (Req'd Time) get killed automatically. The wall clock time limit can be changed for queued jobs (state Q) using the `qalter` command; however, it cannot be changed for a running job (state R).
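As a rough illustration of the half-the-maximum rule, the sketch below derives the default wall time from the maximum runtimes stated in the queue descriptions (qprod 48 h, qlong 144 h, qfree 12 h). The function names are hypothetical, and only the queues whose maximum is given in this section are covered.

```shell
# Maximum runtimes in hours, as stated in the queue descriptions.
max_walltime_hours() {
    case "$1" in
        qprod) echo 48 ;;
        qlong) echo 144 ;;
        qfree) echo 12 ;;
        *) echo "unknown queue: $1" >&2; return 1 ;;
    esac
}

# Default wall time = half the maximum, formatted as HH:MM:SS.
default_walltime() {
    max=$(max_walltime_hours "$1") || return 1
    printf '%02d:00:00\n' $((max / 2))
}

default_walltime qprod   # prints 24:00:00
default_walltime qlong   # prints 72:00:00
default_walltime qfree   # prints 06:00:00
```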
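The state restriction can be sketched as a small dry-run helper. The function `walltime_change_cmd` and the job ID `123456` are hypothetical; the helper only prints the `qalter -l walltime=...` command it would run, so it can be tried without access to a PBS server.

```shell
# Hypothetical dry-run helper: prints the qalter command for a queued job.
# $1 = job ID, $2 = job state as shown by qstat, $3 = new wall time (HH:MM:SS)
walltime_change_cmd() {
    if [ "$2" != "Q" ]; then
        # Running jobs (state R) cannot have their wall time changed.
        echo "job $1 is in state $2; wall time can only be changed in state Q" >&2
        return 1
    fi
    echo "qalter -l walltime=$3 $1"
}

walltime_change_cmd 123456 Q 36:00:00   # prints: qalter -l walltime=36:00:00 123456
```

Calling the helper with state `R` prints a message to standard error and returns a nonzero status instead of a command.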
You can check the current queue configuration on rsweb: [Barbora][b] or [Salomon][d].
## Queue Status
!!! tip
    Check the status of jobs, queues, and compute nodes [here][c].
![rspbs web interface](../img/barbora_cluster_usage.png)
Display the queue status:
```console
$ qstat -q
```
The PBS allocation overview may also be obtained using the `rspbs` command:
```console
$ rspbs
Usage: rspbs [options]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --get-server-details  Print server
  --get-queues          Print queues
  --get-queues-details  Print queues details
  --get-reservations    Print reservations
  --get-reservations-details
                        Print reservations details
  --get-nodes           Print nodes of PBS complex
  --get-nodeset         Print nodeset of PBS complex
  --get-nodes-details   Print nodes details
  --get-vnodes          Print vnodes of PBS complex
  --get-vnodeset        Print vnodes nodeset of PBS complex
  --get-vnodes-details  Print vnodes details
  --get-jobs            Print jobs
  --get-jobs-details    Print jobs details
  --get-job-nodes       Print job nodes
  --get-job-nodeset     Print job nodeset
  --get-job-vnodes      Print job vnodes
  --get-job-vnodeset    Print job vnodes nodeset
  --get-jobs-check-params
                        Print jobid, job state, session_id, user, nodes
  --get-users           Print users of jobs
  --get-allocated-nodes
                        Print nodes allocated by jobs
  --get-allocated-nodeset
                        Print nodeset allocated by jobs
  --get-allocated-vnodes
                        Print vnodes allocated by jobs
  --get-allocated-vnodeset
                        Print vnodes nodeset allocated by jobs
  --get-node-users      Print node users
  --get-node-jobs       Print node jobs
  --get-node-ncpus      Print number of cpus per node
  --get-node-naccelerators
                        Print number of accelerators per node
  --get-node-allocated-ncpus
                        Print number of allocated cpus per node
  --get-node-allocated-naccelerators
                        Print number of allocated accelerators per node
  --get-node-qlist      Print node qlist
  --get-node-ibswitch   Print node ibswitch
  --get-vnode-users     Print vnode users
  --get-vnode-jobs      Print vnode jobs
  --get-vnode-ncpus     Print number of cpus per vnode
  --get-vnode-naccelerators
                        Print number of naccelerators per vnode
  --get-vnode-allocated-ncpus
                        Print number of allocated cpus per vnode
  --get-vnode-allocated-naccelerators
                        Print number of allocated accelerators per vnode
  --get-vnode-qlist     Print vnode qlist
  --get-vnode-ibswitch  Print vnode ibswitch
  --get-user-nodes      Print user nodes
  --get-user-nodeset    Print user nodeset
  --get-user-vnodes     Print user vnodes
  --get-user-vnodeset   Print user vnodes nodeset
  --get-user-jobs       Print user jobs
  --get-user-job-count  Print number of jobs per user
  --get-user-node-count
                        Print number of allocated nodes per user
  --get-user-vnode-count
                        Print number of allocated vnodes per user
  --get-user-ncpus      Print number of allocated ncpus per user
  --get-qlist-nodes     Print qlist nodes
  --get-qlist-nodeset   Print qlist nodeset
  --get-qlist-vnodes    Print qlist vnodes
  --get-qlist-vnodeset  Print qlist vnodes nodeset
  --get-ibswitch-nodes  Print ibswitch nodes
  --get-ibswitch-nodeset
                        Print ibswitch nodeset
  --get-ibswitch-vnodes
                        Print ibswitch vnodes
  --get-ibswitch-vnodeset
                        Print ibswitch vnodes nodeset
  --last-job            Print expected time of last running job
  --summary             Print summary
  --get-node-ncpu-chart
                        Obsolete. Print chart of allocated ncpus per node
  --server=SERVER       Use given PBS server
  --state=STATE         Only for given job state
  --jobid=JOBID         Only for given job ID
  --user=USER           Only for given user
  --node=NODE           Only for given node
  --vnode=VNODE         Only for given vnode
  --nodestate=NODESTATE
                        Only for given node state (affects only --get-node*
                        --get-vnode* --get-qlist-* --get-ibswitch-* actions)
  --incl-finished       Include finished jobs
  --walltime-exceeded-used-walltime
                        Job walltime exceeded - resources_used.walltime
  --walltime-exceeded-real-runtime
                        Job walltime exceeded - real runtime
  --backend-sqlite      Use SQLite backend - experimental
```
---8<--- "resource_accounting.md"
---8<--- "mathjax.md"
[1]: job-priority.md
[2]: #resource-accounting-policy
[3]: job-submission-and-execution.md
[a]: https://support.it4i.cz/rt/
[b]: https://extranet.it4i.cz/rsweb/barbora/queues
[c]: https://extranet.it4i.cz/rsweb
[d]: https://extranet.it4i.cz/rsweb/salomon/queues