Commit faf64a14 authored by David Hrbáč's avatar David Hrbáč

Merge branch 'dgx' into 'master'

Dgx

See merge request !240
parents 22727d7a ed5b910c
Pipeline #6197 passed with stages
in 1 minute and 27 seconds
NICE
DGX-2
DGX
DCV
In
CAE
CUBE
......
......@@ -60,9 +60,9 @@ The parameters are summarized in the following tables:
For more details refer to [Compute nodes][1], [Storage][4], and [Network][3].
[1]: compute-nodes.md
[2]: resources-allocation-policy.md
[2]: ../general/resources-allocation-policy.md
[3]: network.md
[4]: storage.md
[5]: shell-and-data-access.md
[5]: ../general/shell-and-data-access.md
[a]: https://support.it4i.cz/rt
# Introduction
Welcome to Anselm supercomputer cluster. The Anselm cluster consists of 209 compute nodes, totalling 3344 compute cores with 15 TB RAM, giving over 94 TFLOP/s theoretical peak performance. Each node is a powerful x86-64 computer, equipped with 16 cores, at least 64 GB of RAM, and a 500 GB hard disk drive. Nodes are interconnected through a fully non-blocking fat-tree InfiniBand network, and are equipped with Intel Sandy Bridge processors. A few nodes are also equipped with NVIDIA Kepler GPU or Intel Xeon Phi MIC accelerators. Read more in [Hardware Overview][1].
Welcome to Anselm supercomputer cluster. The Anselm cluster consists of 209 compute nodes, totaling 3344 compute cores with 15 TB RAM, giving over 94 TFLOP/s theoretical peak performance. Each node is a powerful x86-64 computer, equipped with 16 cores, at least 64 GB of RAM, and a 500 GB hard disk drive. Nodes are interconnected through a fully non-blocking fat-tree InfiniBand network, and are equipped with Intel Sandy Bridge processors. A few nodes are also equipped with NVIDIA Kepler GPU or Intel Xeon Phi MIC accelerators. Read more in [Hardware Overview][1].
The cluster runs an operating system compatible with the Red Hat [Linux family][a]. We have installed a wide range of software packages targeted at different scientific domains. These packages are accessible via the [modules environment][2].
......@@ -12,9 +12,9 @@ Read more on how to [apply for resources][4], [obtain login credentials][5] and
[1]: hardware-overview.md
[2]: ../environment-and-modules.md
[3]: resources-allocation-policy.md
[3]: ../general/resources-allocation-policy.md
[4]: ../general/applying-for-resources.md
[5]: ../general/obtaining-login-credentials/obtaining-login-credentials.md
[6]: shell-and-data-access.md
[6]: ../general/shell-and-data-access.md
[a]: http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg
# Resources Allocation Policy
## Job Queue Policies
The resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue and the resources available to the Project. The Fair-share system of Anselm ensures that individual users may consume approximately equal amounts of resources per week. Detailed information can be found in the [Job scheduling][1] section. The resources are accessible via several queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources. The following table provides the queue partitioning overview:
!!! note
Check the queue status at <https://extranet.it4i.cz/rsweb/anselm/>
| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime |
| ------------------- | -------------- | -------------------- | ---------------------------------------------------- | --------- | -------- | ------------- | -------- |
| qexp | no | none required | 209 nodes | 1 | 150 | no | 1 h |
| qprod | yes | > 0 | 180 nodes w/o accelerator | 16 | 0 | no | 24/48 h |
| qlong | yes | > 0 | 180 nodes w/o accelerator | 16 | 0 | no | 72/144 h |
| qnvidia, qmic | yes | > 0 | 23 nvidia nodes, 4 mic nodes | 16 | 200 | yes | 24/48 h |
| qfat | yes | > 0 | 2 fat nodes | 16 | 200 | yes | 24/144 h |
| qfree | yes | < 120% of allocation | 180 w/o accelerator | 16 | -1024 | no | 12 h |
!!! note
**The qfree queue is not free of charge**. [Normal accounting][2] applies. However, it allows for utilization of free resources, once a project has exhausted all its allocated computational resources. This does not apply to Director's Discretion projects (DD projects) by default. Usage of qfree after a DD project has exhausted its computational resources is allowed upon request for this queue.
**The qexp queue is equipped with nodes which do not have exactly the same CPU clock speed.** Should you need the nodes to have exactly the same CPU speed, you have to select the proper nodes during the PBS job submission.
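For illustration, nodes with a particular CPU can be requested via the cpu_freq attribute described in the job submission documentation (a sketch; the project ID is a placeholder):

```console
$ qsub -A OPEN-0-0 -q qexp -l select=4:ncpus=16:cpu_freq=24 -I
```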
* **qexp**, the Express queue: This queue is dedicated to testing and running very small jobs. It is not required to specify a project to enter the qexp. There are always 2 nodes reserved for this queue (w/o accelerators); a maximum of 8 nodes are available via the qexp for a particular user, from a pool of nodes containing Nvidia accelerated nodes (cn181-203), MIC accelerated nodes (cn204-207), and Fat nodes with 512 GB of RAM (cn208-209). This enables us to test and tune accelerated code and code with higher RAM requirements. The nodes may be allocated on a per core basis. No special authorization is required to use qexp. The maximum runtime in qexp is 1 hour.
* **qprod**, the Production queue: This queue is intended for normal production runs. It is required that an active project with nonzero remaining resources is specified to enter the qprod. All nodes may be accessed via the qprod queue, except the reserved ones. 178 nodes without accelerators are included. Full nodes, 16 cores per node, are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qprod is 48 hours.
* **qlong**, the Long queue: This queue is intended for long production runs. It is required that an active project with nonzero remaining resources is specified to enter the qlong. Only 60 nodes without acceleration may be accessed via the qlong queue. Full nodes, 16 cores per node, are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qlong is 144 hours (three times that of the standard qprod time - 3 x 48 h).
* **qnvidia**, qmic, qfat, the Dedicated queues: The queue qnvidia is dedicated to accessing the Nvidia accelerated nodes, the qmic to accessing MIC nodes, and qfat the Fat nodes. It is required that an active project with nonzero remaining resources is specified to enter these queues. 23 nvidia, 4 mic, and 2 fat nodes are included. Full nodes, 16 cores per node, are allocated. The queues run with very high priority; the jobs will be scheduled before the jobs coming from the qexp queue. A PI needs to explicitly ask [support][a] for authorization to enter the dedicated queues for all users associated with her/his project.
* **qfree**, the Free resource queue: The queue qfree is intended for utilization of free resources, after a project has exhausted all of its allocated computational resources (this does not apply to DD projects by default; DD projects have to request permission to use qfree after exhaustion of computational resources). It is required that an active project is specified to enter the queue. Consumed resources will be accounted to the Project. Access to the qfree queue is automatically removed if consumed resources exceed 120% of the resources allocated to the Project. Only 180 nodes without accelerators may be accessed from this queue. Full nodes, 16 cores per node, are allocated. The queue runs with very low priority and no special authorization is required to use it. The maximum runtime in qfree is 12 hours.
## Queue Notes
The job wall clock time defaults to **half the maximum time**, see the table above. Longer wall time limits can be [set manually, see examples][3].
Jobs that exceed the reserved wall clock time (Req'd Time) get killed automatically. The wall clock time limit can be changed for queuing jobs (state Q) using the qalter command, however it cannot be changed for a running job (state R).
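For example, a queued job's wall clock time limit might be extended like this (a sketch; the job ID is hypothetical):

```console
$ qalter -l walltime=24:00:00 12345.dm2
```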
Anselm users may check the current queue configuration [here][b].
## Queue Status
!!! tip
Check the status of jobs, queues and compute nodes [here][c].
![rspbs web interface](../img/rsweb.png)
Display the queue status on Anselm:
```console
$ qstat -q
```
The PBS allocation overview may also be obtained using the rspbs command:
```console
$ rspbs
Usage: rspbs [options]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
--get-node-ncpu-chart
Print chart of allocated ncpus per node
--summary Print summary
--get-server-details Print server
--get-queues Print queues
--get-queues-details Print queues details
--get-reservations Print reservations
--get-reservations-details
Print reservations details
--get-nodes Print nodes of PBS complex
--get-nodeset Print nodeset of PBS complex
--get-nodes-details Print nodes details
--get-jobs Print jobs
--get-jobs-details Print jobs details
--get-jobs-check-params
Print jobid, job state, session_id, user, nodes
--get-users Print users of jobs
--get-allocated-nodes
Print allocated nodes of jobs
--get-allocated-nodeset
Print allocated nodeset of jobs
--get-node-users Print node users
--get-node-jobs Print node jobs
--get-node-ncpus Print number of ncpus per node
--get-node-allocated-ncpus
Print number of allocated ncpus per node
--get-node-qlist Print node qlist
--get-node-ibswitch Print node ibswitch
--get-user-nodes Print user nodes
--get-user-nodeset Print user nodeset
--get-user-jobs Print user jobs
--get-user-jobc Print number of jobs per user
--get-user-nodec Print number of allocated nodes per user
--get-user-ncpus Print number of allocated ncpus per user
--get-qlist-nodes Print qlist nodes
--get-qlist-nodeset Print qlist nodeset
--get-ibswitch-nodes Print ibswitch nodes
--get-ibswitch-nodeset
Print ibswitch nodeset
--state=STATE Only for given job state
--jobid=JOBID Only for given job ID
--user=USER Only for given user
--node=NODE Only for given node
--nodestate=NODESTATE
Only for given node state (affects only --get-node*
--get-qlist-* --get-ibswitch-* actions)
--incl-finished Include finished jobs
```
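For example, to print only a summary or the per-node chart of allocated cores, one might run:

```console
$ rspbs --summary
$ rspbs --get-node-ncpu-chart
```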
---8<--- "resource_accounting.md"
---8<--- "mathjax.md"
[1]: job-priority.md
[2]: #resources-accounting-policy
[3]: job-submission-and-execution.md
[a]: https://support.it4i.cz/rt/
[b]: https://extranet.it4i.cz/rsweb/anselm/queues
[c]: https://extranet.it4i.cz/rsweb/anselm/
......@@ -4,7 +4,7 @@ There are two main shared file systems on Anselm cluster, the [HOME][1] and [SCR
## Archiving
Please don't use shared filesystems as a backup for large amount of data or long-term archiving mean. The academic staff and students of research institutions in the Czech Republic can use [CESNET storage service][3], which is available via SSHFS.
Don't use shared filesystems as a backup for large amounts of data or as a long-term archiving solution. The academic staff and students of research institutions in the Czech Republic can use the [CESNET storage service][3], which is available via SSHFS.
## Shared Filesystems
......@@ -333,7 +333,7 @@ The procedure to obtain the CESNET access is quick and trouble-free.
### Understanding CESNET Storage
!!! note
It is very important to understand the CESNET storage before uploading data. [Please read][i] first.
It is very important to understand the CESNET storage before uploading data. [Read][i] first.
Once registered for CESNET Storage, you may [access the storage][j] in a number of ways. We recommend the SSHFS and RSYNC methods.
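As a minimal sketch of an SSHFS mount (the CESNET host name and remote path are illustrative; use the values provided with your CESNET registration):

```console
local $ mkdir cesnet
local $ sshfs username@ssh.du1.cesnet.cz:. cesnet/
```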
......
# NVIDIA DGX-2
[DGX-2][a] builds upon [DGX-1][b] in several ways. It introduces NVIDIA's new NVSwitch, enabling 300 GB/s chip-to-chip communication at 12 times the speed of PCIe.
With NVLink2, it enables sixteen GPUs to be grouped together in a single system, for a total bandwidth going beyond 14 TB/s. Add a pair of Xeon CPUs, 1.5 TB of memory, and 30 TB of NVMe storage, and we get a system that consumes 10 kW, weighs 163.29 kg, but offers easily double the performance of the DGX-1.
NVIDIA likes to tout that this means it offers a total of ~2 PFLOPs of compute performance in a single system, when using the tensor cores.
<div align="center">
<iframe src="https://www.youtube.com/embed/OTOGw0BRqK0" width="50%" height="195" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</div>
![](../img/dgx1.png)
| NVIDIA DGX-2 | |
| --- | --- |
| CPUs | 2 x Intel Xeon Platinum |
| GPUs | 16 x NVIDIA Tesla V100 32GB HBM2 |
| System Memory | Up to 1.5 TB DDR4 |
| GPU Memory | 512 GB HBM2 (16 x 32 GB) |
| Storage | 30 TB NVMe, Up to 60 TB |
| Networking | 8 x InfiniBand or 8 x 100 GbE |
| Power | 10 kW |
| Weight | 350 lbs |
| GPU Throughput | Tensor: 1920 TFLOPs, FP16: 480 TFLOPs, FP32: 240 TFLOPs, FP64: 120 TFLOPs |
![](../img/dgx2.png)
AlexNet, the network that 'started' the latest machine learning revolution, now takes 18 minutes to train.
The topology of the DGX-2 means that all 16 GPUs are able to pool their memory into a unified memory space, though with the usual tradeoffs involved if going off-chip.
Not unlike the Tesla V100 memory capacity increase, one of NVIDIA's goals here is to build a system that can keep in memory workloads that would be too large for an 8 GPU cluster. Providing one such example, NVIDIA says that the DGX-2 is able to complete the training process for FAIRSEQ – a neural network model for language translation – 10x faster than a DGX-1 system, bringing it down to less than two days total rather than 15.
![](../img/dgx3.png)
Otherwise, similar to its DGX-1 counterpart, the DGX-2 is designed to be a powerful server in its own right. On the storage side the DGX-2 comes with 30TB of NVMe-based solid state storage. And for clustering or further inter-system communications, it also offers InfiniBand and 100GigE connectivity, up to eight of them.
![](../img/dgx4.png)
The new NVSwitches mean that the PCIe lanes of the CPUs can be redirected elsewhere, most notably towards storage and networking connectivity.
[a]: https://www.nvidia.com/content/dam/en-zz/es_em/Solutions/Data-Center/dgx-2/nvidia-dgx-2-datasheet.pdf
[b]: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/dgx-1/dgx-1-ai-supercomputer-datasheet-v4.pdf
......@@ -42,7 +42,7 @@ Also remember that display number should be less or equal 99.
Based on this, **we have chosen display number 61** for our examples, so this is the number you will see in the examples below.
!!! note
Your situation may be different so also choose of your number may be different. **Please choose and use your own display number accordingly!**
Your situation may be different, so your choice of display number may also be different. **Choose and use your own display number accordingly!**
Start your VNC server on the chosen display number (61):
......@@ -76,7 +76,7 @@ username :102
```
!!! note
The VNC server runs on port 59xx, where xx is the display number. So, you get your port number simply as 5900 + display number, in our example 5900 + 61 = 5961. Another example for display number 102 is calculation of TCP port 5900 + 102 = 6002 but be aware, that TCP ports above 6000 are often used by X11. **Please, calculate your own port number and use it instead of 5961 from examples below!**
The VNC server runs on port 59xx, where xx is the display number. So, you get your port number simply as 5900 + display number; in our example, 5900 + 61 = 5961. Another example: for display number 102, the TCP port is 5900 + 102 = 6002, but be aware that TCP ports above 6000 are often used by X11. **Calculate your own port number and use it instead of 5961 in the examples below!**
To access the VNC server, you have to create a tunnel between the login node using TCP port 5961 and your machine using a free TCP port (for simplicity, the very same) in the next step. See examples for [Linux/Mac OS][2] and [Windows][3].
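As a sketch of such a tunnel from a Linux/Mac machine (assuming display number 61, i.e. port 5961, and the Anselm login address; substitute your own port number):

```console
local $ ssh -L 5961:localhost:5961 username@anselm.it4i.cz
```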
......@@ -260,4 +260,4 @@ Example described above:
[1]: x-window-system.md
[2]: #linuxmac-os-example-of-creating-a-tunnel
[3]: #windows-example-of-creating-a-tunnel
[4]: ../../../anselm/job-submission-and-execution.md
[4]: ../../job-submission-and-execution.md
......@@ -93,7 +93,7 @@ local $ ssh-keygen -C 'username@organization.example.com' -f additional_key
```
!!! note
Please, enter **strong** **passphrase** for securing your private key.
Enter a **strong passphrase** to secure your private key.
You can insert an additional public key into the authorized_keys file for authentication with your own private key. Additional records in the authorized_keys file must be delimited by new lines. Users are not advised to remove the default public key from the authorized_keys file.
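For example, after copying additional_key.pub (from the ssh-keygen example above) to the cluster, it may be appended like this:

```console
$ cat additional_key.pub >> ~/.ssh/authorized_keys
```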
......
......@@ -7,7 +7,7 @@ In many cases, it is useful to submit a huge (>100+) number of computational job
However, executing a huge number of jobs via the PBS queue may strain the system. This strain may result in slow response to commands, inefficient scheduling, and overall degradation of performance and user experience, for all users. For this reason, the number of jobs is **limited to 100 per user, 1000 per job array**.
!!! note
Please follow one of the procedures below, in case you wish to schedule more than 100 jobs at a time.
Follow one of the procedures below if you wish to schedule more than 100 jobs at a time.
* Use [Job arrays][1] when running a huge number of [multithread][2] (bound to one node only) or multinode (multithread across several nodes) jobs
* Use [GNU parallel][3] when running single core jobs
......@@ -45,7 +45,7 @@ First, we create a tasklist file (or subjobs list), listing all tasks (subjobs)
$ find . -name 'file*' > tasklist
```
Then we create the jobscript:
Then we create the jobscript for the Anselm cluster:
```bash
#!/bin/bash
......@@ -70,6 +70,31 @@ cp $PBS_O_WORKDIR/$TASK input ; cp $PBS_O_WORKDIR/myprog.x .
cp output $PBS_O_WORKDIR/$TASK.out
```
Then we create the jobscript for the Salomon cluster:
```bash
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=24,walltime=02:00:00
# change to scratch directory
SCR=/scratch/work/user/$USER/$PBS_JOBID
mkdir -p $SCR ; cd $SCR || exit
# get individual tasks from tasklist with index from PBS JOB ARRAY
TASK=$(sed -n "${PBS_ARRAY_INDEX}p" $PBS_O_WORKDIR/tasklist)
# copy input file and executable to scratch
cp $PBS_O_WORKDIR/$TASK input ; cp $PBS_O_WORKDIR/myprog.x .
# execute the calculation
./myprog.x < input > output
# copy output file to submit directory
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, the submit directory holds the 900 input files, the executable myprog.x, and the jobscript file. As an input for each run, we take the filename of the input file from the created tasklist file. We copy the input file to the local scratch directory /lscratch/$PBS_JOBID, execute the myprog.x, and copy the output file back to the submit directory, under the $TASK.out name. The myprog.x runs on one node only and must use threads to run in parallel. Be aware that if the myprog.x **is not multithreaded**, then all the **jobs are run as single thread programs in a sequential** manner. Due to the allocation of the whole node, the accounted time is equal to the usage of the whole node, while using only 1/16 of the node!
If you need to run a huge number of parallel multicore jobs (i.e. multinode, multithreaded, e.g. MPI-enabled), a job array approach should be used. The main difference compared to the previous examples using one node is that the local scratch directory should not be used (as it is not shared between nodes) and MPI or other techniques for parallel multinode processing have to be used properly.
......@@ -78,11 +103,20 @@ If running a huge number of parallel multicore (in means of multinode multithrea
To submit the job array, use the qsub -J command. The 900 jobs of the [example above][5] may be submitted like this:
#### Anselm
```console
$ qsub -N JOBNAME -J 1-900 jobscript
12345[].dm2
```
#### Salomon
```console
$ qsub -N JOBNAME -J 1-900 jobscript
506493[].isrv5
```
In this example, we submit a job array of 900 subjobs. Each subjob will run on one full node and is assumed to take less than 2 hours (note the #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue).
Sometimes, for testing purposes, you may need to submit a one-element array. This is not allowed by PBSPro, but there's a workaround:
......@@ -152,7 +186,7 @@ Read more on job arrays in the [PBSPro Users guide][6].
!!! note
Use GNU parallel to run many single core tasks on one node.
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. GNU parallel is most useful when running single core jobs via the queue system on Anselm.
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. GNU parallel is most useful when running single core jobs via the queue systems.
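As a minimal sketch of the tool itself (file names are illustrative), GNU parallel runs the given command once per matching input file:

```console
$ parallel "./myprog.x < {} > {}.out" ::: input_*
```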
For more information and examples see the parallel man page:
......@@ -308,7 +342,7 @@ In this example, we submit a job array of 31 subjobs. Note the -J 1-992:**32**,
Download the examples in [capacity.zip][9], illustrating the above listed ways to run a huge number of jobs. We recommend trying out the examples before using this for running production jobs.
Unzip the archive in an empty directory on Anselm and follow the instructions in the README file
Unzip the archive in an empty directory on the cluster and follow the instructions in the README file.
```console
$ unzip capacity.zip
......
......@@ -56,7 +56,7 @@ Job execution priority (job sort formula) is calculated as:
### Job Backfilling
The Anselm cluster uses job backfilling.
The scheduler uses job backfilling.
Backfilling means fitting smaller jobs around the higher-priority jobs that the scheduler is going to run next, in such a way that the higher-priority jobs are not delayed. Backfilling allows us to keep resources from becoming idle when the top job (the job with the highest execution priority) cannot run.
......@@ -71,5 +71,11 @@ Specifying more accurate walltime enables better scheduling, better execution ti
---8<--- "mathjax.md"
### Job Placement
Job [placement can be controlled by flags during submission][1].
[1]: job-submission-and-execution.md#job_placement
[a]: https://extranet.it4i.cz/rsweb/anselm/queues
[b]: https://extranet.it4i.cz/rsweb/anselm/projects
......@@ -27,6 +27,9 @@ The qsub command submits the job to the queue, i.e. the qsub command creates a r
### Job Submission Examples
!!! note
Anselm ... ncpus=16, Salomon ... ncpus=24
```console
$ qsub -A OPEN-0-0 -q qprod -l select=64:ncpus=16,walltime=03:00:00 ./myjob
```
......@@ -63,6 +66,91 @@ By default, the PBS batch system sends an e-mail only when the job is aborted. D
$ qsub -m n
```
### Salomon - Intel Xeon Phi Co-Processors
To allocate a node with a Xeon Phi co-processor, the user needs to specify this in the select statement. Currently, only allocation of whole nodes with both Phi cards as the smallest chunk is supported. The standard PBSPro approach through the attributes "accelerator", "naccelerators", and "accelerator_model" is used. The "accelerator_model" can be omitted, since only one accelerator type/model is available on Salomon.
The absence of a specialized queue for accessing the nodes with Phi cards means that the cards can be utilized in any queue, including qexp for testing/experiments, qlong for longer jobs, qfree after the project resources have been spent, etc. The Phi cards are thus also available to PRACE users. There's no need to ask for permission to utilize the Phi cards in project proposals.
```console
$ qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 ./myjob
```
In this example, we allocate 1 node, with 24 cores, with 2 Xeon Phi 7120p cards, running the batch job ./myjob. The default walltime for qprod is used, i.e. 24 hours.
```console
$ qsub -A OPEN-0-0 -I -q qlong -l select=4:ncpus=24:accelerator=True:naccelerators=2 -l walltime=56:00:00
```
In this example, we allocate 4 nodes, with 24 cores per node (totaling 96 cores), with 2 Xeon Phi 7120p cards per node (totaling 8 Phi cards), running an interactive job for 56 hours. The accelerator model name was omitted.
#### Salomon - Intel Xeon Phi - Queue QMIC
Example executions:
```console
-l select=1
exec_vnode = (r21u05n581-mic0:naccelerators=1:ncpus=0)
-l select=4
(r21u05n581-mic0:naccelerators=1:ncpus=0)+(r21u05n581-mic1:naccelerators=1:ncpus=0)+(r21u06n582-mic0:naccelerators=1:ncpus=0)+(r21u06n582-mic1:naccelerators=1:ncpus=0)
-l select=4:naccelerators=1
(r21u05n581-mic0:naccelerators=1:ncpus=0)+(r21u05n581-mic1:naccelerators=1:ncpus=0)+(r21u06n582-mic0:naccelerators=1:ncpus=0)+(r21u06n582-mic1:naccelerators=1:ncpus=0)
-l select=1:naccelerators=2
(r21u05n581-mic0:naccelerators=1+r21u05n581-mic1:naccelerators=1)
-l select=2:naccelerators=2
(r21u05n581-mic0:naccelerators=1+r21u05n581-mic1:naccelerators=1)+(r21u06n582-mic0:naccelerators=1+r21u06n582-mic1:naccelerators=1)
-l select=1:ncpus=24:naccelerators=2
(r22u32n610:ncpus=24+r22u32n610-mic0:naccelerators=1+r22u32n610-mic1:naccelerators=1)
-l select=1:ncpus=24:naccelerators=0+4
(r33u17n878:ncpus=24:naccelerators=0)+(r33u13n874-mic0:naccelerators=1:ncpus=0)+(r33u13n874-mic1:naccelerators=1:ncpus=0)+(r33u16n877-mic0:naccelerators=1:ncpus=0)+(r33u16n877-mic1:naccelerators=1:ncpus=0)
```
### Salomon - UV2000 SMP
!!! note
13 NUMA nodes available on UV2000
Per NUMA node allocation.
Jobs are isolated by cpusets.
The UV2000 (node uv1) offers 3 TB of RAM and 104 cores, distributed over 13 NUMA nodes. A NUMA node packs 8 cores and approx. 247 GB RAM (with the exception of node 11, which has only 123 GB RAM). In PBS, the UV2000 provides 13 chunks, a chunk per NUMA node (see [Resource allocation policy][1]). The jobs on UV2000 are isolated from each other by cpusets, so that a job by one user may not utilize CPU or memory allocated to a job by another user. Full chunks are always allocated; a job may only use resources of the NUMA nodes allocated to itself.
```console
$ qsub -A OPEN-0-0 -q qfat -l select=13 ./myjob
```
In this example, we allocate all 13 NUMA nodes (corresponds to 13 chunks), 104 cores of the SGI UV2000 node for 24 hours. Jobscript myjob will be executed on the node uv1.
```console
$ qsub -A OPEN-0-0 -q qfat -l select=1:mem=2000GB ./myjob
```
In this example, we allocate 2000GB of memory on the UV2000 for 24 hours. By requesting 2000GB of memory, memory from 10 chunks and 8 cores are allocated. Jobscript myjob will be executed on the node uv1.
```console
$ qsub -A OPEN-0-0 -q qfat -l select=1:mem=3099GB,walltime=48:00:00 ./myjob
```
In this example, we allocate 3099GB of memory on the UV2000 for 48 hours. By requesting 3099GB of memory, memory from all 13 chunks and 8 cores are allocated. Jobscript myjob will be executed on the node uv1.
```console
$ qsub -A OPEN-0-0 -q qfat -l select=2:mem=1000GB,walltime=48:00:00 ./myjob
```
In this example, we allocate 2000GB of memory and 16 cores on the UV2000 for 48 hours. By requesting 1000GB of memory per chunk, 2000GB of memory and 16 cores are allocated. Jobscript myjob will be executed on the node uv1.
### Useful Tricks
All qsub options may be [saved directly into the jobscript][2]. In such a case, no options to qsub are needed.
```console
$ qsub ./myjob
```
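A minimal sketch of such a jobscript, with the options stored as #PBS directives (project ID, queue, and resource numbers are placeholders):

```bash
#!/bin/bash
#PBS -A OPEN-0-0
#PBS -q qprod
#PBS -l select=1:ncpus=24,walltime=02:00:00

# all options are taken from the directives above, so a plain `qsub ./myjob` suffices
./myprog.x < input > output
```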
By default, the PBS batch system sends an e-mail only when the job is aborted. Disabling mail events completely can be done like this:
```console
$ qsub -m n
```
## Advanced Job Placement
### Placement by Name
......@@ -73,12 +161,18 @@ Specific nodes may be allocated via the PBS
$ qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=16:host=cn171+1:ncpus=16:host=cn172 -I
```
```console
qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=24:host=r24u35n680+1:ncpus=24:host=r24u36n681 -I
```
In the first example, we allocate Anselm nodes cn171 and cn172, all 16 cores per node; in the second, Salomon nodes r24u35n680 and r24u36n681, all 24 cores per node, for 24 hours. Consumed resources will be accounted to the Project identified by Project ID OPEN-0-0. The resources will be available interactively.
### Placement by CPU Type
### Anselm - Placement by CPU Type
Nodes equipped with an Intel Xeon E5-2665 CPU have a base clock frequency of 2.4GHz, nodes equipped with an Intel Xeon E5-2470 CPU have a base frequency of 2.3 GHz (see the section Compute Nodes for details). Nodes may be selected via the PBS resource attribute cpu_freq .
#### Anselm
| CPU Type | base freq. | Nodes | cpu_freq attribute |
| ------------------ | ---------- | ---------------------- | ------------------ |
| Intel Xeon E5-2665 | 2.4GHz | cn[1-180], cn[208-209] | 24 |
......@@ -90,7 +184,7 @@ $ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=16:cpu_freq=24 -I
In this example, we allocate 4 nodes, 16 cores per node, selecting only the nodes with Intel Xeon E5-2665 CPU.
### Placement by IB Switch
### Anselm - Placement by IB Switch
Groups of computational nodes are connected to chassis-integrated InfiniBand switches. These switches form the leaf switch layer of the [InfiniBand network][2] fat tree topology. Nodes sharing the leaf switch can communicate most efficiently. Sharing the same switch prevents hops in the network and facilitates unbiased, highly efficient network communication.
......@@ -104,6 +198,111 @@ $ qsub -A OPEN-0-0 -q qprod -l select=18:ncpus=16:ibswitch=isw11 ./myjob
In this example, we request all of the 18 nodes sharing the isw11 switch for 24 hours. A full chassis will be allocated.
### Salomon - Placement by Network Location
Network location of the allocated nodes in the [InfiniBand network][3] influences the efficiency of network communication between the nodes of a job. Nodes on the same InfiniBand switch communicate faster, with lower latency, than distant nodes. To improve the communication efficiency of jobs, the PBS scheduler on Salomon is configured to allocate nodes - from currently available resources - which are as close as possible in the network topology.
For communication intensive jobs, it is possible to set a stricter requirement - to require nodes directly connected to the same InfiniBand switch or nodes located in the same dimension group of the InfiniBand network.
### Salomon - Placement by InfiniBand Switch
Nodes directly connected to the same InfiniBand switch can communicate most efficiently. Using the same switch prevents hops in the network and provides for unbiased, most efficient network communication. There are 9 nodes directly connected to every InfiniBand switch.
!!! note
We recommend allocating compute nodes of a single switch when the best possible computational network performance is required to run the job efficiently.
Nodes directly connected to one InfiniBand switch can be allocated using node grouping on the PBS resource attribute switch.
In this example, we request all 9 nodes directly connected to the same switch using node grouping placement.
```console
$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24 -l place=group=switch ./myjob
```
### Salomon - Placement by Specific InfiniBand Switch
!!! note
Not useful for ordinary computing, suitable for testing and management tasks.
Nodes directly connected to a specific InfiniBand switch can be selected using the PBS resource attribute _switch_.
In this example, we request all 9 nodes directly connected to r4i1s0sw1 switch.
```console
$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24:switch=r4i1s0sw1 ./myjob
```
List of all InfiniBand switches:
```console
$ qmgr -c 'print node @a' | grep switch | awk '{print $6}' | sort -u
r1i0s0sw0
r1i0s0sw1
r1i1s0sw0
r1i1s0sw1
r1i2s0sw0
...
```
List of all nodes directly connected to a specific InfiniBand switch:
```console
$ qmgr -c 'p n @d' | grep 'switch = r36sw3' | awk '{print $3}' | sort
r36u31n964
r36u32n965
r36u33n966
r36u34n967
r36u35n968
r36u36n969
r37u32n970
r37u33n971
r37u34n972
```
### Salomon - Placement by Hypercube Dimension
Nodes located in the same dimension group may be allocated using node grouping on the PBS resource attribute ehc\_[1-7]d.
| Hypercube dimension | node_group_key | #nodes per group |
| ------------------- | -------------- | ---------------- |
| 1D | ehc_1d | 18 |
| 2D | ehc_2d | 36 |
| 3D | ehc_3d | 72 |
| 4D | ehc_4d | 144 |
| 5D | ehc_5d | 144,288 |
| 6D | ehc_6d | 432,576 |
| 7D | ehc_7d | all |
In this example, we allocate 16 nodes in the same [hypercube dimension][4] 1 group.
```console
$ qsub -A OPEN-0-0 -q qprod -l select=16:ncpus=24 -l place=group=ehc_1d -I
```
For better understanding:
List of all groups in dimension 1:
```console
$ qmgr -c 'p n @d' | grep ehc_1d | awk '{print $6}' | sort |uniq -c
18 r1i0
18 r1i1
18 r1i2
18 r1i3
...
```
List of all nodes in a specific dimension 1 group:
```console
$ qmgr -c 'p n @d' | grep 'ehc_1d = r1i0' | awk '{print $3}' | sort
r1i0n0
r1i0n1
r1i0n10
r1i0n11
...
```
## Advanced Job Handling
### Selecting Turbo Boost Off
......@@ -409,9 +608,9 @@ In this example, a directory in /home holds the input file input and executable
Further jobscript examples may be found in the software section and the [Capacity computing][7] section.
[1]: #example-jobscript-for-mpi-calculation-with-preloaded-inputs
[2]: network.md
[3]: hardware-overview.md
[4]: storage.md
[2]: ../anselm/network.md
[3]: ../anselm/hardware-overview.md
[4]: ../anselm/storage.md
[5]: ../software/mpi/running_openmpi.md
[6]: ../software/mpi/running-mpich2.md
[7]: capacity-computing.md
......@@ -40,9 +40,9 @@ Read more on [Capacity computing][6] page.
[1]: #terminology-frequently-used-on-these-pages
[2]: ../pbspro.md
[3]: ../salomon/job-priority.md#fair-share-priority
[4]: ../salomon/resources-allocation-policy.md
[5]: ../salomon/job-submission-and-execution.md
[6]: ../salomon/capacity-computing.md
[3]: job-priority.md#fair-share-priority
[4]: resources-allocation-policy.md
[5]: job-submission-and-execution.md
[6]: capacity-computing.md
[a]: https://extranet.it4i.cz/rsweb/salomon/queues
......@@ -2,10 +2,34 @@
## Job Queue Policies
The resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue and resources available to the Project. The fair-share at Anselm ensures that individual users may consume approximately equal amount of resources per week. Detailed information in the [Job scheduling][1] section. The resources are accessible via several queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources. Following table provides the queue partitioning overview:
The resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue and the resources available to the Project. The Fair-share system of Anselm ensures that individual users may consume approximately equal amounts of resources per week. Detailed information can be found in the [Job scheduling][1] section. The resources are accessible via several queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources. The following table provides the queue partitioning overview:
!!! note
Check the queue status [here][a].
Check the queue status [here][z].
### Anselm
| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime |
| ------------------- | -------------- | -------------------- | ---------------------------------------------------- | --------- | -------- | ------------- | -------- |
| qexp | no | none required | 209 nodes | 1 | 150 | no | 1 h |
| qprod | yes | > 0 | 180 nodes w/o accelerator | 16 | 0 | no | 24/48 h |
| qlong | yes | > 0 | 180 nodes w/o accelerator | 16 | 0 | no | 72/144 h |
| qnvidia | yes | > 0 | 23 nvidia nodes | 16 | 200 | yes | 24/48 h |
| qfat | yes | > 0 | 2 fat nodes | 16 | 200 | yes | 24/144 h |
| qfree | yes | < 120% of allocation | 180 w/o accelerator | 16 | -1024 | no | 12 h |
!!! note
**The qfree queue is not free of charge**. [Normal accounting][2] applies. However, it allows for utilization of free resources, once a project has exhausted all its allocated computational resources. This does not apply to Director's Discretion projects (DD projects) by default. Usage of qfree after a DD project has exhausted its computational resources is allowed upon request for this queue.
**The qexp queue is equipped with nodes which do not have exactly the same CPU clock speed.** Should you need the nodes to have exactly the same CPU speed, you have to select the proper nodes during the PBS job submission.
* **qexp**, the Express queue: This queue is dedicated to testing and running very small jobs. It is not required to specify a project to enter the qexp. There are always 2 nodes reserved for this queue (w/o accelerators); a maximum of 8 nodes are available via the qexp for a particular user, from a pool of nodes containing Nvidia accelerated nodes (cn181-203), MIC accelerated nodes (cn204-207), and Fat nodes with 512 GB of RAM (cn208-209). This enables us to test and tune accelerated code and code with higher RAM requirements. The nodes may be allocated on a per core basis. No special authorization is required to use qexp. The maximum runtime in qexp is 1 hour.
* **qprod**, the Production queue: This queue is intended for normal production runs. It is required that an active project with nonzero remaining resources is specified to enter the qprod. All nodes may be accessed via the qprod queue, except the reserved ones. 178 nodes without accelerators are included. Full nodes, 16 cores per node, are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qprod is 48 hours.
* **qlong**, the Long queue: This queue is intended for long production runs. It is required that an active project with nonzero remaining resources is specified to enter the qlong. Only 60 nodes without acceleration may be accessed via the qlong queue. Full nodes, 16 cores per node, are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qlong is 144 hours (three times that of the standard qprod time - 3 x 48 h).
* **qnvidia**, **qmic**, **qfat**, the Dedicated queues: The queue qnvidia is dedicated to accessing the Nvidia accelerated nodes, the qmic to accessing MIC nodes, and qfat the Fat nodes. It is required that an active project with nonzero remaining resources is specified to enter these queues. 23 nvidia, 4 mic, and 2 fat nodes are included. Full nodes, 16 cores per node, are allocated. The queues run with very high priority; the jobs will be scheduled before the jobs coming from the qexp queue. A PI needs to explicitly ask [support][a] for authorization to enter the dedicated queues for all users associated with her/his project.
* **qfree**, the Free resource queue: The queue qfree is intended for utilization of free resources, after a project has exhausted all of its allocated computational resources (this does not apply to DD projects by default; DD projects have to request permission to use qfree after exhaustion of computational resources). It is required that an active project is specified to enter the queue. Consumed resources will be accounted to the Project. Access to the qfree queue is automatically removed if consumed resources exceed 120% of the resources allocated to the Project. Only 180 nodes without accelerators may be accessed from this queue. Full nodes, 16 cores per node, are allocated. The queue runs with very low priority and no special authorization is required to use it. The maximum runtime in qfree is 12 hours.
### Salomon
| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime |
| ------------------------------- | -------------- | -------------------- | ------------------------------------------------------------- | --------- | -------- | ------------- | --------- |
......@@ -35,26 +59,26 @@ The resources are allocated to the job in a fair-share fashion, subject to const
## Queue Notes
The job wall-clock time defaults to **half the maximum time**, see table above. Longer wall time limits can be [set manually, see examples][3].
The job wall clock time defaults to **half the maximum time**, see the table above. Longer wall time limits can be [set manually, see examples][3].
Jobs that exceed the reserved wall-clock time (Req'd Time) get killed automatically. Wall-clock time limit can be changed for queuing jobs (state Q) using the qalter command, however can not be changed for a running job (state R).
Jobs that exceed the reserved wall clock time (Req'd Time) get killed automatically. The wall clock time limit can be changed for queuing jobs (state Q) using the qalter command, however it cannot be changed for a running job (state R).
Salomon users may check current queue configuration [here][b].
Anselm users may check the current queue configuration [here][b].
## Queue Status
!!! note
Check the status of jobs, queues and compute nodes [here][a].
!!! tip
Check the status of jobs, queues and compute nodes [here][c].
![RSWEB Salomon](../img/rswebsalomon.png "RSWEB Salomon")
![rspbs web interface](../img/rsweb.png)
Display the queue status on Salomon: