Commit c8114c65 authored by Pavel Jirásek

Changes in Salomon Job submission and execution after LK migration

parent 79dd7b8c
Pipeline #1624 passed with stages in 57 seconds
......@@ -110,13 +110,16 @@ Advanced job placement
### Placement by name
Specific nodes may be allocated via the PBS
!!! Note "Note"
Not useful for ordinary computing; suitable for node testing/benchmarking and management tasks.
Specific nodes may be selected using PBS resource attribute host (for hostnames):
```bash
qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=24:host=r24u35n680+1:ncpus=24:host=r24u36n681 -I
```
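When more than a couple of hosts are requested, the `select` string gets long and easy to mistype. The sketch below assembles it from a list of host names. `build_select` is a hypothetical helper written for this illustration, not part of PBS, and the script only prints the resulting command rather than submitting it.

```bash
#!/usr/bin/env bash
# build_select is a hypothetical helper, not a PBS command.
# Usage: build_select NCPUS HOST [HOST...]
# Prints one single-node chunk per host, joined with '+'.
build_select() {
    local ncpus=$1
    shift
    local spec="" host
    for host in "$@"; do
        # ${spec:+$spec+} expands to "<spec>+" only when spec is non-empty.
        spec="${spec:+$spec+}1:ncpus=${ncpus}:host=${host}"
    done
    printf '%s\n' "$spec"
}

# Print the command instead of running it, so the sketch works without PBS.
echo "qsub -A OPEN-0-0 -q qprod -l select=$(build_select 24 r24u35n680 r24u36n681) -I"
```

For the two hosts shown, the printed command is the same as the hand-written one above.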
Or using short names
Specific nodes may be selected using PBS resource attribute cname (for short names in cns[0-1]+ format):
```bash
qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=24:host=cns680+1:ncpus=24:host=cns681 -I
```
......@@ -124,74 +127,111 @@ qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=24:host=cns680+1:ncpus=24:host=cns68
In this example, we allocate nodes r24u35n680 and r24u36n681, all 24 cores per node, for 24 hours.  Consumed resources will be accounted to the Project identified by Project ID OPEN-0-0. The resources will be available interactively.
### Placement by Hypercube dimension
### Placement by network location
The network location of allocated nodes in the [Infiniband network](network/) influences the efficiency of communication between the nodes of a job. Nodes on the same Infiniband switch communicate faster and with lower latency than distant nodes. To improve the communication efficiency of jobs, the PBS scheduler on Salomon is configured to allocate nodes, from the currently available resources, that are as close as possible in the network topology.
For communication-intensive jobs, a stricter requirement can be set: nodes directly connected to the same Infiniband switch, or nodes located in the same dimension group of the Infiniband network.
### Placement by Infiniband switch
Nodes directly connected to the same Infiniband switch can communicate most efficiently. Using the same switch avoids extra hops in the network and provides unbiased, most efficient network communication. There are 9 nodes directly connected to every Infiniband switch.
Nodes may be selected via the PBS resource attribute ehc_[1-7]d.
!!! Note "Note"
We recommend allocating compute nodes of a single switch when the best possible network performance is required to run the job efficiently.
Nodes directly connected to a single Infiniband switch can be allocated using node grouping on the PBS resource attribute switch.
|Hypercube|dimension|
|---|---|
|1D|ehc_1d|
|2D|ehc_2d|
|3D|ehc_3d|
|4D|ehc_4d|
|5D|ehc_5d|
|6D|ehc_6d|
|7D|ehc_7d|
In this example, we request all 9 nodes directly connected to the same switch using node grouping placement.
```bash
$ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=24 -l place=group=ehc_1d -I
$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24 -l place=group=switch ./myjob
```
In this example, we allocate 4 nodes, 24 cores, selecting only the nodes with [hypercube dimension](../network/7d-enhanced-hypercube/) 1.
### Placement by specific Infiniband switch
### Placement by IB switch
!!! Note "Note"
Not useful for ordinary computing; suitable for testing and management tasks.
Groups of computational nodes are connected to chassis-integrated Infiniband switches. These switches form the leaf switch layer of the [Infiniband network](../network/). Nodes sharing the leaf switch can communicate most efficiently. Sharing the same switch avoids extra hops in the network and provides unbiased, most efficient network communication.
There are at most 9 nodes sharing the same Infiniband switch.
Nodes directly connected to a specific Infiniband switch can be selected using the PBS resource attribute switch.
In this example, we request all 9 nodes directly connected to r4i1s0sw1 switch.
Infiniband switch list:
```bash
$ qmgr -c "print node @a" | grep switch
set node r4i1n11 resources_available.switch = r4i1s0sw1
set node r2i0n0 resources_available.switch = r2i0s0sw1
set node r2i0n1 resources_available.switch = r2i0s0sw1
...
$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24:switch=r4i1s0sw1 ./myjob
```
List of all nodes per Infiniband switch:
List of all Infiniband switches:
```bash
$ qmgr -c "print node @a" | grep r36sw3
set node r36u31n964 resources_available.switch = r36sw3
set node r36u32n965 resources_available.switch = r36sw3
set node r36u33n966 resources_available.switch = r36sw3
set node r36u34n967 resources_available.switch = r36sw3
set node r36u35n968 resources_available.switch = r36sw3
set node r36u36n969 resources_available.switch = r36sw3
set node r37u32n970 resources_available.switch = r36sw3
set node r37u33n971 resources_available.switch = r36sw3
set node r37u34n972 resources_available.switch = r36sw3
$ qmgr -c 'print node @a' | grep switch | awk '{print $6}' | sort -u
r1i0s0sw0
r1i0s0sw1
r1i1s0sw0
r1i1s0sw1
r1i2s0sw0
...
...
```
Nodes sharing the same switch may be selected via the PBS resource attribute switch.
List of all nodes directly connected to a specific Infiniband switch:
```bash
$ qmgr -c 'p n @d' | grep 'switch = r36sw3' | awk '{print $3}' | sort
r36u31n964
r36u32n965
r36u33n966
r36u34n967
r36u35n968
r36u36n969
r37u32n970
r37u33n971
r37u34n972
```
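Before requesting `select=9` on one switch, it can help to confirm how many nodes each switch actually reports. The sketch below counts nodes per switch from `qmgr` output. `count_nodes_per_switch` is a hypothetical helper written for this illustration, and the sample input is copied from the listings above so the script runs without PBS.

```bash
#!/usr/bin/env bash
# count_nodes_per_switch is a hypothetical helper: it reads qmgr
# "set node ..." lines on stdin and prints "<switch> <node count>".
count_nodes_per_switch() {
    awk '$4 == "resources_available.switch" { n[$6]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Sample input taken from the listings above. On the cluster, pipe in
#   qmgr -c 'print node @a'
# instead of the here-document.
count_nodes_per_switch <<'EOF'
set node r4i1n11 resources_available.switch = r4i1s0sw1
set node r2i0n0 resources_available.switch = r2i0s0sw1
set node r2i0n1 resources_available.switch = r2i0s0sw1
EOF
```

Lines for other resources (for example `ehc_1d`) are ignored because the filter matches only the `resources_available.switch` field.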
We recommend allocating compute nodes of a single switch when best possible computational network performance is required to run the job efficiently:
### Placement by Hypercube dimension
Nodes located in the same dimension group may be allocated using node grouping on the PBS resource attribute ehc_[1-7]d.
|Hypercube dimension|node_group_key|#nodes per group|
|---|---|---|
|1D|ehc_1d|18|
|2D|ehc_2d|36|
|3D|ehc_3d|72|
|4D|ehc_4d|144|
|5D|ehc_5d|144,288|
|6D|ehc_6d|432,576|
|7D|ehc_7d|all|
In this example, we allocate 16 nodes in the same [hypercube dimension](7d-enhanced-hypercube/) 1 group.
```bash
$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24:switch=r4i1s0sw1 ./myjob
$ qsub -A OPEN-0-0 -q qprod -l select=16:ncpus=24 -l place=group=ehc_1d -I
```
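The group sizes in the table suggest a simple rule for choosing the grouping key: take the lowest dimension whose group can hold the requested node count. The sketch below encodes that rule. `pick_ehc_key` is a hypothetical helper, and it assumes the larger of the two sizes listed for 5D and 6D groups (288 and 576), so a request near those limits may not fit every group. It does not check whether enough nodes in a group are currently free.

```bash
#!/usr/bin/env bash
# pick_ehc_key is a hypothetical helper, not a PBS command.
# It maps a node count to the smallest hypercube grouping key from the
# table above. Assumption: 5D and 6D use the larger listed group size.
pick_ehc_key() {
    local n=$1
    if   [ "$n" -le 18 ];  then echo ehc_1d
    elif [ "$n" -le 36 ];  then echo ehc_2d
    elif [ "$n" -le 72 ];  then echo ehc_3d
    elif [ "$n" -le 144 ]; then echo ehc_4d
    elif [ "$n" -le 288 ]; then echo ehc_5d
    elif [ "$n" -le 576 ]; then echo ehc_6d
    else                        echo ehc_7d
    fi
}

# Print the command instead of running it, so the sketch works without PBS.
nodes=16
echo "qsub -A OPEN-0-0 -q qprod -l select=${nodes}:ncpus=24 -l place=group=$(pick_ehc_key "$nodes") -I"
```

For 16 nodes the helper returns `ehc_1d`, which matches the grouping key used in the example above.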
In this example, we request all the 9 nodes sharing the r4i1s0sw1 switch for 24 hours.
For better understanding:
List of all groups in dimension 1:
```bash
$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24 -l place=group=switch ./myjob
$ qmgr -c 'p n @d' | grep ehc_1d | awk '{print $6}' | sort |uniq -c
18 r1i0
18 r1i1
18 r1i2
18 r1i3
...
```
In this example, we request 9 nodes placed on the same switch using node grouping placement for 24 hours.
HTML commented section #1 (turbo boost is to be implemented)
List of all nodes in a specific dimension 1 group:
```bash
$ qmgr -c 'p n @d' | grep 'ehc_1d = r1i0' | awk '{print $3}' | sort
r1i0n0
r1i0n1
r1i0n10
r1i0n11
...
```
Job Management
--------------
......