Commit 0f9950d3 authored by Pavel Jirásek

Merge branch 'content_revision' into 'master'

Content revision

See merge request !27
parents 7b81e1fc 991aeb25
Pipeline #1680 passed with stages in 57 seconds
......@@ -5,12 +5,12 @@ Resources Allocation Policy
---------------------------
The resources are allocated to the job in a fairshare fashion, subject to constraints set by the queue and resources available to the Project. The Fairshare at Anselm ensures that individual users may consume approximately equal amount of resources per week. Detailed information in the [Job scheduling](job-priority/) section. The resources are accessible via several queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources. Following table provides the queue partitioning overview:
|queue |active project |project resources |nodes|min ncpus*|priority|authorization|>walltime |
| --- | --- |
|<strong>qexp</strong> |no |none required |2 reserved, 31 total including MIC, GPU and FAT nodes |1 |><em>150</em> |no |1h |
|<strong>qprod</strong> |yes |&gt; 0 |><em>178 nodes w/o accelerator</em> |16 |0 |no |24/48h |
|queue |active project |project resources |nodes|min ncpus|priority|authorization|walltime |
| --- | --- | --- | --- | --- | --- | --- | --- |
|<strong>qexp</strong> |no |none required |2 reserved, 31 total including MIC, GPU and FAT nodes |1 |<em>150</em> |no |1h |
|<strong>qprod</strong> |yes |&gt; 0 |<em>178 nodes w/o accelerator</em> |16 |0 |no |24/48h |
|<strong>qlong</strong> Long queue |yes |&gt; 0 |60 nodes w/o accelerator |16 |0 |no |72/144h |
|<strong>qnvidia, qmic, qfat</strong> Dedicated queues |yes |&gt; 0 |23 total qnvidia, 4 total qmic, 2 total qfat |16 |><em>200</em> |yes |24/48h |
|<strong>qnvidia, qmic, qfat</strong> Dedicated queues |yes |&gt; 0 |23 total qnvidia, 4 total qmic, 2 total qfat |16 |<em>200</em> |yes |24/48h |
|<strong>qfree</strong> |yes |none required |178 w/o accelerator |16 |-1024 |no |12h |
!!! Note "Note"
......@@ -21,7 +21,7 @@ The resources are allocated to the job in a fairshare fashion, subject to constr
- **qexp**, the Express queue: This queue is dedicated for testing and running very small jobs. It is not required to specify a project to enter the qexp. There are 2 nodes always reserved for this queue (w/o accelerator), maximum 8 nodes are available via the qexp for a particular user, from a pool of nodes containing Nvidia accelerated nodes (cn181-203), MIC accelerated nodes (cn204-207) and Fat nodes with 512GB RAM (cn208-209). This enables to test and tune also accelerated code or code with higher RAM requirements. The nodes may be allocated on per core basis. No special authorization is required to use it. The maximum runtime in qexp is 1 hour.
- **qprod**, the Production queue: This queue is intended for normal production runs. It is required that active project with nonzero remaining resources is specified to enter the qprod. All nodes may be accessed via the qprod queue, except the reserved ones. 178 nodes without accelerator are included. Full nodes, 16 cores per node are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qprod is 48 hours.
- **qlong**, the Long queue: This queue is intended for long production runs. It is required that active project with nonzero remaining resources is specified to enter the qlong. Only 60 nodes without acceleration may be accessed via the qlong queue. Full nodes, 16 cores per node are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qlong is 144 hours (three times of the standard qprod time - 3 * 48 h).
- **qnvidia**, qmic, qfat, the Dedicated queues: The queue qnvidia is dedicated to access the Nvidia accelerated nodes, the qmic to access MIC nodes and qfat the Fat nodes. It is required that active project with nonzero remaining resources is specified to enter these queues. 23 nvidia, 4 mic and 2 fat nodes are included. Full nodes, 16 cores per node are allocated. The queues run with very high priority; the jobs will be scheduled before the jobs coming from the qexp queue. A PI needs to explicitly ask support for authorization to enter the dedicated queues for all users associated with her/his Project.
- **qnvidia**, qmic, qfat, the Dedicated queues: The queue qnvidia is dedicated to access the Nvidia accelerated nodes, the qmic to access MIC nodes and qfat the Fat nodes. It is required that active project with nonzero remaining resources is specified to enter these queues. 23 nvidia, 4 mic and 2 fat nodes are included. Full nodes, 16 cores per node are allocated. The queues run with very high priority; the jobs will be scheduled before the jobs coming from the qexp queue. A PI needs to explicitly ask [support](https://support.it4i.cz/rt/) for authorization to enter the dedicated queues for all users associated with her/his Project.
- **qfree**, The Free resource queue: The queue qfree is intended for utilization of free resources, after a Project exhausted all its allocated computational resources (Does not apply to DD projects by default. DD projects have to request permission on qfree after exhaustion of computational resources.). It is required that active project is specified to enter the queue, however no remaining resources are required. Consumed resources will be accounted to the Project. Only 178 nodes without accelerator may be accessed from this queue. Full nodes, 16 cores per node are allocated. The queue runs with very low priority and no special authorization is required to use it. The maximum runtime in qfree is 12 hours.
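The walltime limits in the queue descriptions can be checked before submitting a job. The following is a minimal sketch, not part of the documented tooling: the function names `max_walltime_hours` and `walltime_fits` are hypothetical, and the 48-hour limit for the dedicated queues is an assumption that reads the table entry 24/48h as default/maximum.

```bash
#!/usr/bin/env bash
# Maximum walltime in hours per Anselm queue, taken from the queue descriptions.
# The value for qnvidia/qmic/qfat assumes "24/48h" means default/maximum.
max_walltime_hours() {
    case "$1" in
        qexp) echo 1 ;;
        qprod) echo 48 ;;
        qlong) echo 144 ;;
        qnvidia|qmic|qfat) echo 48 ;;
        qfree) echo 12 ;;
        *) echo "unknown queue: $1" >&2; return 1 ;;
    esac
}

# Succeeds when the requested walltime (whole hours) fits the queue limit.
walltime_fits() {
    local queue="$1" hours="$2" max
    max="$(max_walltime_hours "$queue")" || return 1
    [ "$hours" -le "$max" ]
}

walltime_fits qlong 100 && echo "qlong: 100 h is allowed"
walltime_fits qfree 24 || echo "qfree: 24 h exceeds the limit"
```

The limits are hard-coded here for illustration only; the current queue configuration on the cluster remains the authoritative source.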
### Notes
......@@ -36,7 +36,7 @@ Anselm users may check current queue configuration at <https://extranet.it4i.cz/
>Check the status of jobs, queues and compute nodes at <https://extranet.it4i.cz/anselm/>
![rspbs web interface](../../img/rsweb.png)
![rspbs web interface](../img/rsweb.png)
Display the queue status on Anselm:
......
......@@ -35,7 +35,7 @@ Compute nodes
-------------
|Node|Count|Processor|Cores|Memory|Accelerator|
|---|---|
|---|---|---|---|---|---|
|w/o accelerator|576|2x Intel Xeon E5-2680v3, 2.5GHz|24|128GB|-|
|MIC accelerated|432|2x Intel Xeon E5-2680v3, 2.5GHz|24|128GB|2x Intel Xeon Phi 7120P, 61cores, 16GB RAM|
......@@ -46,7 +46,7 @@ Remote visualization nodes
For remote visualization two nodes with NICE DCV software are available each configured:
|Node|Count|Processor|Cores|Memory|GPU Accelerator|
|---|---|
|---|---|---|---|---|---|
|visualization|2|2x Intel Xeon E5-2695v3, 2.3GHz|28|512GB|NVIDIA QUADRO K5000, 4GB RAM|
SGI UV 2000
......@@ -54,7 +54,7 @@ SGI UV 2000
For large memory computations a special SMP/NUMA SGI UV 2000 server is available:
|Node |Count |Processor |Cores|Memory|Extra HW |
| --- | --- |
|UV2000 |1 |14x Intel Xeon E5-4627v2, 3.3GHz, 8cores |112 |3328GB DDR3@1866MHz |2x 400GB local SSD1x NVIDIA GM200(GeForce GTX TITAN X),12GB RAM\ |
| --- | --- | --- | --- | --- | --- |
|UV2000 |1 |14x Intel Xeon E5-4627v2, 3.3GHz, 8cores |112 |3328GB DDR3@1866MHz |2x 400GB local SSD1x NVIDIA GM200(GeForce GTX TITAN X),12GB RAM |
![](../img/uv-2000.jpeg)
......@@ -93,7 +93,7 @@ In this example, we allocate 2000GB of memory on the UV2000 for 72 hours. By req
### Useful tricks
All qsub options may be [saved directly into the jobscript](job-submission-and-execution/#PBSsaved). In such a case, no options to qsub are needed.
All qsub options may be [saved directly into the jobscript](#example-jobscript-for-mpi-calculation-with-preloaded-inputs). In such a case, no options to qsub are needed.
```bash
$ qsub ./myjob
......@@ -110,13 +110,16 @@ Advanced job placement
### Placement by name
Specific nodes may be allocated via the PBS
!!! Note "Note"
Not useful for ordinary computing, suitable for node testing/benchmarking and management tasks.
Specific nodes may be selected using PBS resource attribute host (for hostnames):
```bash
qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=24:host=r24u35n680+1:ncpus=24:host=r24u36n681 -I
```
Or using short names
Specific nodes may be selected using PBS resource attribute cname (for short names in cns[0-1]+ format):
```bash
qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=24:host=cns680+1:ncpus=24:host=cns681 -I
......@@ -124,74 +127,111 @@ qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=24:host=cns680+1:ncpus=24:host=cns68
In this example, we allocate nodes r24u35n680 and r24u36n681, all 24 cores per node, for 24 hours.  Consumed resources will be accounted to the Project identified by Project ID OPEN-0-0. The resources will be available interactively.
### Placement by |Hypercube|dimension|
### Placement by network location
Network location of allocated nodes in the [Infiniband network](network/) influences efficiency of network communication between nodes of job. Nodes on the same Infiniband switch communicate faster with lower latency than distant nodes. To improve communication efficiency of jobs, PBS scheduler on Salomon is configured to allocate nodes - from currently available resources - which are as close as possible in the network topology.
For communication intensive jobs it is possible to set stricter requirement - to require nodes directly connected to the same Infiniband switch or to require nodes located in the same dimension group of the Infiniband network.
Nodes may be selected via the PBS resource attribute ehc_[1-7]d .
|Hypercube|dimension|
|---|---|
|1D|ehc_1d|
|2D|ehc_2d|
|3D|ehc_3d|
|4D|ehc_4d|
|5D|ehc_5d|
|6D|ehc_6d|
|7D|ehc_7d|
### Placement by Infiniband switch
Nodes directly connected to the same Infiniband switch can communicate most efficiently. Using the same switch prevents hops in the network and provides for unbiased, most efficient network communication. There are 9 nodes directly connected to every Infiniband switch.
!!! Note "Note"
We recommend allocating compute nodes of a single switch when the best possible computational network performance is required to run job efficiently.
Nodes directly connected to the one Infiniband switch can be allocated using node grouping on PBS resource attribute switch.
In this example, we request all 9 nodes directly connected to the same switch using node grouping placement.
```bash
$ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=24 -l place=group=ehc_1d -I
$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24 -l place=group=switch ./myjob
```
In this example, we allocate 4 nodes, 24 cores, selecting only the nodes with [hypercube dimension](../network/7d-enhanced-hypercube/) 1.
### Placement by specific Infiniband switch
!!! Note "Note"
Not useful for ordinary computing, suitable for testing and management tasks.
### Placement by IB switch
Groups of computational nodes are connected to chassis integrated Infiniband switches. These switches form the leaf switch layer of the [Infiniband  network](../network/) . Nodes sharing the leaf switch can communicate most efficiently. Sharing the same switch prevents hops in the network and provides for unbiased, most efficient network communication.
Nodes directly connected to the specific Infiniband switch can be selected using the PBS resource attribute *switch*.
There are at most 9 nodes sharing the same Infiniband switch.
In this example, we request all 9 nodes directly connected to r4i1s0sw1 switch.
Infiniband switch list:
```bash
$ qmgr -c "print node @a" | grep switch
set node r4i1n11 resources_available.switch = r4i1s0sw1
set node r2i0n0 resources_available.switch = r2i0s0sw1
set node r2i0n1 resources_available.switch = r2i0s0sw1
...
$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24:switch=r4i1s0sw1 ./myjob
```
List of all nodes per Infiniband switch:
List of all Infiniband switches:
```bash
$ qmgr -c "print node @a" | grep r36sw3
set node r36u31n964 resources_available.switch = r36sw3
set node r36u32n965 resources_available.switch = r36sw3
set node r36u33n966 resources_available.switch = r36sw3
set node r36u34n967 resources_available.switch = r36sw3
set node r36u35n968 resources_available.switch = r36sw3
set node r36u36n969 resources_available.switch = r36sw3
set node r37u32n970 resources_available.switch = r36sw3
set node r37u33n971 resources_available.switch = r36sw3
set node r37u34n972 resources_available.switch = r36sw3
$ qmgr -c 'print node @a' | grep switch | awk '{print $6}' | sort -u
r1i0s0sw0
r1i0s0sw1
r1i1s0sw0
r1i1s0sw1
r1i2s0sw0
...
...
```
List of all nodes directly connected to the specific Infiniband switch:
```bash
$ qmgr -c 'p n @d' | grep 'switch = r36sw3' | awk '{print $3}' | sort
r36u31n964
r36u32n965
r36u33n966
r36u34n967
r36u35n968
r36u36n969
r37u32n970
r37u33n971
r37u34n972
```
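The per-switch listings can also be summarized in one pass. The sketch below counts nodes per switch from lines in the `set node <name> resources_available.switch = <switch>` format shown in this section. The function name `nodes_per_switch` is hypothetical, and the sample input is only there so the snippet runs without access to the cluster.

```bash
#!/usr/bin/env bash
# Count nodes per Infiniband switch from qmgr-style lines read on stdin.
# Output: "<switch> <node count>", sorted by switch name.
nodes_per_switch() {
    awk '$4 == "resources_available.switch" { count[$6]++ }
         END { for (sw in count) print sw, count[sw] }' | sort
}

# Sample lines in the qmgr output format; on the cluster one would instead run:
#   qmgr -c 'print node @a' | nodes_per_switch
nodes_per_switch <<'EOF'
set node r4i1n11 resources_available.switch = r4i1s0sw1
set node r2i0n0 resources_available.switch = r2i0s0sw1
set node r2i0n1 resources_available.switch = r2i0s0sw1
EOF
```

A switch reporting fewer than 9 nodes in such a summary cannot satisfy a `select=9` request restricted to that switch.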
Nodes sharing the same switch may be selected via the PBS resource attribute switch.
### Placement by Hypercube dimension
We recommend allocating compute nodes of a single switch when best possible computational network performance is required to run the job efficiently:
Nodes located in the same dimension group may be allocated using node grouping on PBS resource attribute ehc_[1-7]d .
|Hypercube dimension|node_group_key|#nodes per group|
|---|---|---|
|1D|ehc_1d|18|
|2D|ehc_2d|36|
|3D|ehc_3d|72|
|4D|ehc_4d|144|
|5D|ehc_5d|144,288|
|6D|ehc_6d|432,576|
|7D|ehc_7d|all|
In this example, we allocate 16 nodes in the same [hypercube dimension](7d-enhanced-hypercube/) 1 group.
```bash
$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24:switch=r4i1s0sw1 ./myjob
$ qsub -A OPEN-0-0 -q qprod -l select=16:ncpus=24 -l place=group=ehc_1d -I
```
In this example, we request all the 9 nodes sharing the r4i1s0sw1 switch for 24 hours.
For better understanding:
List of all groups in dimension 1:
```bash
$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24 -l place=group=switch ./myjob
$ qmgr -c 'p n @d' | grep ehc_1d | awk '{print $6}' | sort |uniq -c
18 r1i0
18 r1i1
18 r1i2
18 r1i3
...
```
In this example, we request 9 nodes placed on the same switch using node grouping placement for 24 hours.
HTML commented section #1 (turbo boost is to be implemented)
List of all nodes in a specific dimension 1 group:
```bash
$ qmgr -c 'p n @d' | grep 'ehc_1d = r1i0' | awk '{print $3}' | sort
r1i0n0
r1i0n1
r1i0n10
r1i0n11
...
```
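Since the grouping attribute follows the pattern ehc_[1-7]d, the qsub command line for a given dimension can be assembled mechanically. This is a minimal sketch under stated assumptions: the helper name `ehc_qsub_cmd` is hypothetical, the queue is fixed to qprod with 24 cores per node as in the examples in this section, and the function only prints the command rather than submitting anything.

```bash
#!/usr/bin/env bash
# Print (do not run) a qsub command that groups nodes by hypercube dimension.
# Arguments: <dimension 1-7> <node count> <project id> <jobscript>
ehc_qsub_cmd() {
    local dim="$1" nodes="$2" project="$3" script="$4"
    case "$dim" in
        [1-7]) ;;
        *) echo "dimension must be 1-7, got: $dim" >&2; return 1 ;;
    esac
    echo "qsub -A ${project} -q qprod -l select=${nodes}:ncpus=24 -l place=group=ehc_${dim}d ${script}"
}

ehc_qsub_cmd 1 16 OPEN-0-0 ./myjob
```

Printing the command first makes it easy to review the requested placement before actually submitting the job.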
Job Management
--------------
......@@ -409,6 +449,8 @@ In some cases, it may be impractical to copy the inputs to scratch and outputs t
!!! Note "Note"
Store the qsub options within the jobscript. Use **mpiprocs** and **ompthreads** qsub options to control the MPI job execution.
### Example Jobscript for MPI Calculation with preloaded inputs
Example jobscript for an MPI job with preloaded inputs and executables, options for qsub are stored within the script :
```bash
......
......@@ -55,7 +55,7 @@ To access Salomon cluster, two login nodes running GSI SSH service are available
It is recommended to use the single DNS name salomon-prace.it4i.cz which is distributed between the two login nodes. If needed, user can login directly to one of the login nodes. The addresses are:
|Login address|Port|Protocol|Login node|
|---|---|
|---|---|---|---|
|salomon-prace.it4i.cz|2222|gsissh|login1, login2, login3 or login4|
|login1-prace.salomon.it4i.cz|2222|gsissh|login1|
|login2-prace.salomon.it4i.cz|2222|gsissh|login2|
......@@ -77,7 +77,7 @@ When logging from other PRACE system, the prace_service script can be used:
It is recommended to use the single DNS name salomon.it4i.cz which is distributed between the two login nodes. If needed, user can login directly to one of the login nodes. The addresses are:
|Login address|Port|Protocol|Login node|
|---|---|
|---|---|---|---|
|salomon.it4i.cz|2222|gsissh|login1, login2, login3 or login4|
|login1.salomon.it4i.cz|2222|gsissh|login1|
|login2-prace.salomon.it4i.cz|2222|gsissh|login2|
......@@ -132,7 +132,7 @@ There's one control server and three backend servers for striping and/or backup
**Access from PRACE network:**
|Login address|Port|Node role|
|---|---|
|---|---|---|
|gridftp-prace.salomon.it4i.cz|2812|Front end /control server|
|lgw1-prace.salomon.it4i.cz|2813|Backend / data mover server|
|lgw2-prace.salomon.it4i.cz|2813|Backend / data mover server|
......@@ -165,7 +165,7 @@ Or by using prace_service script:
**Access from public Internet:**
|Login address|Port|Node role|
|---|---|
|---|---|---|
|gridftp.salomon.it4i.cz|2812|Front end /control server|
|lgw1.salomon.it4i.cz|2813|Backend / data mover server|
|lgw2.salomon.it4i.cz|2813|Backend / data mover server|
......@@ -198,11 +198,11 @@ Or by using prace_service script:
Generally both shared file systems are available through GridFTP:
|File system mount point|Filesystem|Comment|
|---|---|
|---|---|---|
|/home|Lustre|Default HOME directories of users in format /home/prace/login/|
|/scratch|Lustre|Shared SCRATCH mounted on the whole cluster|
More information about the shared file systems is available [here](storage/storage/).
More information about the shared file systems is available [here](storage/).
Please note, that for PRACE users a "prace" directory is used also on the SCRATCH file system.
......@@ -234,7 +234,7 @@ General information about the resource allocation, job queuing and job execution
For PRACE users, the default production run queue is "qprace". PRACE users can also use two other queues "qexp" and "qfree".
|queue|Active project|Project resources|Nodes|priority|authorization|walltime |
|---|---|
|---|---|---|---|---|---|---|
|**qexp** Express queue|no|none required|32 nodes, max 8 per user|150|no|1 / 1h|
|**qprace** Production queue|yes|>0|1006 nodes, max 86 per job|0|no|24 / 48h|
|**qfree** Free resource queue|yes|none required|752 nodes, max 86 per job|-1024|no|12 / 12h|
......
......@@ -5,10 +5,10 @@ Resources Allocation Policy
---------------------------
The resources are allocated to the job in a fairshare fashion, subject to constraints set by the queue and resources available to the Project. The Fairshare at Anselm ensures that individual users may consume approximately equal amount of resources per week. Detailed information in the [Job scheduling](job-priority/) section. The resources are accessible via several queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources. Following table provides the queue partitioning overview:
|queue |active project |project resources |nodes|min ncpus*|priority|authorization|walltime |
| --- | --- |
|**qexp** Express queue|no |none required |32 nodes, max 8 per user |24 |>150 |no |1 / 1h |
|**qprod** Production queue|yes |&gt; 0 |>1006 nodes, max 86 per job |24 |0 |no |24 / 48h |
|queue |active project |project resources |nodes|min ncpus |priority|authorization|walltime |
| --- | --- |--- |--- |--- |--- |--- |--- |
|**qexp** Express queue|no |none required |32 nodes, max 8 per user |24 |150 |no |1 / 1h |
|**qprod** Production queue|yes |&gt; 0 |1006 nodes, max 86 per job |24 |0 |no |24 / 48h |
|**qlong** Long queue |yes |&gt; 0 |256 nodes, max 40 per job, only non-accelerated nodes allowed |24 |0 |no |72 / 144h |
|**qmpp** Massive parallel queue |yes |&gt; 0 |1006 nodes |24 |0 |yes |2 / 4h |
|**qfat** UV2000 queue |yes |&gt; 0 |1 (uv1) |8 |0 |yes |24 / 48h |
......@@ -42,7 +42,7 @@ Salomon users may check current queue configuration at <https://extranet.it4i.cz
!!! Note "Note"
Check the status of jobs, queues and compute nodes at [https://extranet.it4i.cz/rsweb/salomon/](https://extranet.it4i.cz/rsweb/salomon)
![RSWEB Salomon](../../img/rswebsalomon.png "RSWEB Salomon")
![RSWEB Salomon](../img/rswebsalomon.png "RSWEB Salomon")
Display the queue status on Salomon:
......@@ -118,7 +118,8 @@ The resources that are currently subject to accounting are the core-hours. The c
### Check consumed resources
The **it4ifree** command is a part of it4i.portal.clients package, located here: <https://pypi.python.org/pypi/it4i.portal.clients>
!!! Note "Note"
The **it4ifree** command is a part of it4i.portal.clients package, located here: <https://pypi.python.org/pypi/it4i.portal.clients>
User may check at any time, how many core-hours have been consumed by himself/herself and his/her projects. The command is available on clusters' login nodes.
......