diff --git a/docs.it4i/anselm-cluster-documentation/resources-allocation-policy.md b/docs.it4i/anselm-cluster-documentation/resources-allocation-policy.md index f84fedbe5d78db31ad662e585efb362ad293b0e3..c8060b67d0e8f9393e2bebc1f890fe88dd271a98 100644 --- a/docs.it4i/anselm-cluster-documentation/resources-allocation-policy.md +++ b/docs.it4i/anselm-cluster-documentation/resources-allocation-policy.md @@ -5,12 +5,12 @@ Resources Allocation Policy --------------------------- The resources are allocated to the job in a fairshare fashion, subject to constraints set by the queue and resources available to the Project. The Fairshare at Anselm ensures that individual users may consume approximately equal amount of resources per week. Detailed information in the [Job scheduling](job-priority/) section. The resources are accessible via several queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources. Following table provides the queue partitioning overview: - |queue |active project |project resources |nodes|min ncpus*|priority|authorization|>walltime | - | --- | --- | - |<strong>qexp</strong> |no |none required |2 reserved, 31 totalincluding MIC, GPU and FAT nodes |1 |><em>150</em> |no |1h | - |<strong>qprod</strong> |yes |> 0 |><em>178 nodes w/o accelerator</em> |16 |0 |no |24/48h | + |queue |active project |project resources |nodes|min ncpus|priority|authorization|walltime | + | --- | --- | --- | --- | --- | --- | --- | --- | + |<strong>qexp</strong> |no |none required |2 reserved, 31 total including MIC, GPU and FAT nodes |1 |<em>150</em> |no |1h | + |<strong>qprod</strong> |yes |> 0 |<em>178 nodes w/o accelerator</em> |16 |0 |no |24/48h | |<strong>qlong</strong>Long queue |yes |> 0 |60 nodes w/o accelerator |16 |0 |no |72/144h | - |<strong>qnvidia, qmic, qfat</strong>Dedicated queues |yes |<p>> 0 |23 total qnvidia4 total qmic2 total qfat |16 |><em>200</em> |yes |24/48h | + |<strong>qnvidia, qmic, 
qfat</strong>Dedicated queues |yes |> 0 |23 total qnvidia, 4 total qmic, 2 total qfat |16 |<em>200</em> |yes |24/48h | |<strong>qfree</strong> |yes |none required |178 w/o accelerator |16 |-1024 |no |12h | !!! Note "Note" @@ -21,7 +21,7 @@ The resources are allocated to the job in a fairshare fashion, subject to constr - **qexp**, the Express queue: This queue is dedicated for testing and running very small jobs. It is not required to specify a project to enter the qexp. There are 2 nodes always reserved for this queue (w/o accelerator), maximum 8 nodes are available via the qexp for a particular user, from a pool of nodes containing Nvidia accelerated nodes (cn181-203), MIC accelerated nodes (cn204-207) and Fat nodes with 512GB RAM (cn208-209). This enables to test and tune also accelerated code or code with higher RAM requirements. The nodes may be allocated on per core basis. No special authorization is required to use it. The maximum runtime in qexp is 1 hour. - **qprod**, the Production queue: This queue is intended for normal production runs. It is required that active project with nonzero remaining resources is specified to enter the qprod. All nodes may be accessed via the qprod queue, except the reserved ones. 178 nodes without accelerator are included. Full nodes, 16 cores per node are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qprod is 48 hours. - **qlong**, the Long queue: This queue is intended for long production runs. It is required that active project with nonzero remaining resources is specified to enter the qlong. Only 60 nodes without acceleration may be accessed via the qlong queue. Full nodes, 16 cores per node are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qlong is 144 hours (three times of the standard qprod time - 3 * 48 h). 
-- **qnvidia**, qmic, qfat, the Dedicated queues: The queue qnvidia is dedicated to access the Nvidia accelerated nodes, the qmic to access MIC nodes and qfat the Fat nodes. It is required that active project with nonzero remaining resources is specified to enter these queues. 23 nvidia, 4 mic and 2 fat nodes are included. Full nodes, 16 cores per node are allocated. The queues run with very high priority, the jobs will be scheduled before the jobs coming from the qexp queue. An PI needs explicitly ask support for authorization to enter the dedicated queues for all users associated to her/his Project. +- **qnvidia**, qmic, qfat, the Dedicated queues: The queue qnvidia is dedicated to access the Nvidia accelerated nodes, the qmic to access MIC nodes and qfat the Fat nodes. It is required that active project with nonzero remaining resources is specified to enter these queues. 23 nvidia, 4 mic and 2 fat nodes are included. Full nodes, 16 cores per node are allocated. The queues run with very high priority, the jobs will be scheduled before the jobs coming from the qexp queue. A PI needs to explicitly ask [support](https://support.it4i.cz/rt/) for authorization to enter the dedicated queues for all users associated with her/his Project. - **qfree**, The Free resource queue: The queue qfree is intended for utilization of free resources, after a Project exhausted all its allocated computational resources (Does not apply to DD projects by default. DD projects have to request for persmission on qfree after exhaustion of computational resources.). It is required that active project is specified to enter the queue, however no remaining resources are required. Consumed resources will be accounted to the Project. Only 178 nodes without accelerator may be accessed from this queue. Full nodes, 16 cores per node are allocated. The queue runs with very low priority and no special authorization is required to use it. The maximum runtime in qfree is 12 hours. 
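The Anselm queue constraints described above can be sanity-checked programmatically. The sketch below is purely illustrative: the queue names and limits are taken from the table in this diff, while the dictionary layout and the `fits_queue` helper are hypothetical and not part of any IT4I tooling.

```python
# Illustrative sketch of the Anselm queue limits described above.
# Queue names, walltime limits and priorities come from the table;
# this helper is hypothetical, not an IT4I tool.
QUEUES = {
    "qexp":  {"needs_active_project": False, "max_walltime_h": 1,   "priority": 150},
    "qprod": {"needs_active_project": True,  "max_walltime_h": 48,  "priority": 0},
    "qlong": {"needs_active_project": True,  "max_walltime_h": 144, "priority": 0},
    "qfree": {"needs_active_project": True,  "max_walltime_h": 12,  "priority": -1024},
}

def fits_queue(queue, walltime_h, has_active_project):
    """Return True if a job request fits the queue limits from the table."""
    q = QUEUES[queue]
    if q["needs_active_project"] and not has_active_project:
        return False
    return walltime_h <= q["max_walltime_h"]
```

For example, a 72-hour job does not fit qprod (48 h limit) and would have to go to qlong instead.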
### Notes @@ -36,7 +36,7 @@ Anselm users may check current queue configuration at <https://extranet.it4i.cz/ >Check the status of jobs, queues and compute nodes at <https://extranet.it4i.cz/anselm/> - + Display the queue status on Anselm: diff --git a/docs.it4i/salomon/hardware-overview.md b/docs.it4i/salomon/hardware-overview.md index 3ab8ea22f052e450965c8560416e5de865793f4e..a16890dfe6c27587d826ed7027fcce5946ea25c9 100644 --- a/docs.it4i/salomon/hardware-overview.md +++ b/docs.it4i/salomon/hardware-overview.md @@ -35,7 +35,7 @@ Compute nodes ------------- |Node|Count|Processor|Cores|Memory|Accelerator| -|---|---| +|---|---|---|---|---|---| |w/o accelerator|576|2x Intel Xeon E5-2680v3, 2.5GHz|24|128GB|-| |MIC accelerated|432|2x Intel Xeon E5-2680v3, 2.5GHz|24|128GB|2x Intel Xeon Phi 7120P, 61cores, 16GB RAM| @@ -46,7 +46,7 @@ Remote visualization nodes For remote visualization two nodes with NICE DCV software are available each configured: |Node|Count|Processor|Cores|Memory|GPU Accelerator| -|---|---| +|---|---|---|---|---|---| |visualization|2|2x Intel Xeon E5-2695v3, 2.3GHz|28|512GB|NVIDIA QUADRO K5000, 4GB RAM| SGI UV 2000 @@ -54,7 +54,7 @@ SGI UV 2000 For large memory computations a special SMP/NUMA SGI UV 2000 server is available: |Node |Count |Processor |Cores|Memory|Extra HW | -| --- | --- | -|UV2000 |1 |14x Intel Xeon E5-4627v2, 3.3GHz, 8cores |112 |3328GB DDR3@1866MHz |2x 400GB local SSD1x NVIDIA GM200(GeForce GTX TITAN X),12GB RAM\ | +| --- | --- | --- | --- | --- | --- | +|UV2000 |1 |14x Intel Xeon E5-4627v2, 3.3GHz, 8cores |112 |3328GB DDR3@1866MHz |2x 400GB local SSD, 1x NVIDIA GM200 (GeForce GTX TITAN X), 12GB RAM | diff --git a/docs.it4i/salomon/job-submission-and-execution.md b/docs.it4i/salomon/job-submission-and-execution.md index 9738861315dadced3de4f71a1901c3765ee67bbd..3f81a3728bd80d287fcbb8e425fa8ab1d09e9b14 100644 --- a/docs.it4i/salomon/job-submission-and-execution.md +++ b/docs.it4i/salomon/job-submission-and-execution.md @@ -93,7 +93,7 @@ In 
this example, we allocate 2000GB of memory on the UV2000 for 72 hours. By req ### Useful tricks -All qsub options may be [saved directly into the jobscript](job-submission-and-execution/#PBSsaved). In such a case, no options to qsub are needed. +All qsub options may be [saved directly into the jobscript](#example-jobscript-for-mpi-calculation-with-preloaded-inputs). In such a case, no options to qsub are needed. ```bash $ qsub ./myjob ``` @@ -110,13 +110,16 @@ Advanced job placement ### Placement by name -Specific nodes may be allocated via the PBS +!!! Note "Note" + Not useful for ordinary computing, suitable for node testing/benchmarking and management tasks. + +Specific nodes may be selected using PBS resource attribute host (for hostnames): ```bash qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=24:host=r24u35n680+1:ncpus=24:host=r24u36n681 -I ``` -Or using short names +Specific nodes may also be selected using their short names (in cns[0-9]+ format): ```bash qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=24:host=cns680+1:ncpus=24:host=cns681 -I @@ -124,74 +127,111 @@ qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=24:host=cns680+1:ncpus=24:host=cns68 In this example, we allocate nodes r24u35n680 and r24u36n681, all 24 cores per node, for 24 hours. Consumed resources will be accounted to the Project identified by Project ID OPEN-0-0. The resources will be available interactively. -### Placement by |Hypercube|dimension| +### Placement by network location + +Network location of allocated nodes in the [Infiniband network](network/) influences the efficiency of network communication between the nodes of a job. Nodes on the same Infiniband switch communicate faster with lower latency than distant nodes. To improve communication efficiency of jobs, the PBS scheduler on Salomon is configured to allocate nodes - from the currently available resources - which are as close as possible in the network topology. 
+ +For communication-intensive jobs it is possible to set a stricter requirement - to require nodes directly connected to the same Infiniband switch or to require nodes located in the same dimension group of the Infiniband network. -Nodes may be selected via the PBS resource attribute ehc_[1-7]d . -|Hypercube|dimension| -|---|---| -|1D|ehc_1d| -|2D|ehc_2d| -|3D|ehc_3d| -|4D|ehc_4d| -|5D|ehc_5d| -|6D|ehc_6d| -|7D|ehc_7d| +### Placement by Infiniband switch + +Nodes directly connected to the same Infiniband switch can communicate most efficiently. Using the same switch prevents hops in the network and provides for unbiased, most efficient network communication. There are 9 nodes directly connected to every Infiniband switch. + +!!! Note "Note" + We recommend allocating compute nodes of a single switch when the best possible computational network performance is required to run a job efficiently. + +Nodes directly connected to the same Infiniband switch can be allocated using node grouping on the PBS resource attribute switch. + +In this example, we request all 9 nodes directly connected to the same switch using node grouping placement. ```bash -$ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=24 -l place=group=ehc_1d -I +$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24 -l place=group=switch ./myjob ``` -In this example, we allocate 4 nodes, 24 cores, selecting only the nodes with [hypercube dimension](../network/7d-enhanced-hypercube/) 1. +### Placement by specific Infiniband switch + +!!! Note "Note" + Not useful for ordinary computing, suitable for testing and management tasks. -### Placement by IB switch -Groups of computational nodes are connected to chassis integrated Infiniband switches. These switches form the leaf switch layer of the [Infiniband network](../network/) . Nodes sharing the leaf switch can communicate most efficiently. Sharing the same switch prevents hops in the network and provides for unbiased, most efficient network communication. 
+Nodes directly connected to a specific Infiniband switch can be selected using the PBS resource attribute *switch*. -There are at most 9 nodes sharing the same Infiniband switch. +In this example, we request all 9 nodes directly connected to the r4i1s0sw1 switch. -Infiniband switch list: ```bash -$ qmgr -c "print node @a" | grep switch -set node r4i1n11 resources_available.switch = r4i1s0sw1 -set node r2i0n0 resources_available.switch = r2i0s0sw1 -set node r2i0n1 resources_available.switch = r2i0s0sw1 -... +$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24:switch=r4i1s0sw1 ./myjob ``` -List of all nodes per Infiniband switch: +List of all Infiniband switches: ```bash -$ qmgr -c "print node @a" | grep r36sw3 -set node r36u31n964 resources_available.switch = r36sw3 -set node r36u32n965 resources_available.switch = r36sw3 -set node r36u33n966 resources_available.switch = r36sw3 -set node r36u34n967 resources_available.switch = r36sw3 -set node r36u35n968 resources_available.switch = r36sw3 -set node r36u36n969 resources_available.switch = r36sw3 -set node r37u32n970 resources_available.switch = r36sw3 -set node r37u33n971 resources_available.switch = r36sw3 -set node r37u34n972 resources_available.switch = r36sw3 +$ qmgr -c 'print node @a' | grep switch | awk '{print $6}' | sort -u +r1i0s0sw0 +r1i0s0sw1 +r1i1s0sw0 +r1i1s0sw1 +r1i2s0sw0 +... +... +``` + +List of all nodes directly connected to a specific Infiniband switch: +```bash +$ qmgr -c 'p n @d' | grep 'switch = r36sw3' | awk '{print $3}' | sort +r36u31n964 +r36u32n965 +r36u33n966 +r36u34n967 +r36u35n968 +r36u36n969 +r37u32n970 +r37u33n971 +r37u34n972 ``` -Nodes sharing the same switch may be selected via the PBS resource attribute switch. 
+### Placement by Hypercube dimension -We recommend allocating compute nodes of a single switch when best possible computational network performance is required to run the job efficiently: +Nodes located in the same dimension group may be allocated using node grouping on the PBS resource attribute ehc_[1-7]d. +|Hypercube dimension|node_group_key|#nodes per group| +|---|---|---| +|1D|ehc_1d|18| +|2D|ehc_2d|36| +|3D|ehc_3d|72| +|4D|ehc_4d|144| +|5D|ehc_5d|144,288| +|6D|ehc_6d|432,576| +|7D|ehc_7d|all| + +In this example, we allocate 16 nodes in the same [hypercube dimension](7d-enhanced-hypercube/) 1 group. ```bash -$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24:switch=r4i1s0sw1 ./myjob +$ qsub -A OPEN-0-0 -q qprod -l select=16:ncpus=24 -l place=group=ehc_1d -I ``` -In this example, we request all the 9 nodes sharing the r4i1s0sw1 switch for 24 hours. +For better understanding: + +List of all groups in dimension 1: ```bash -$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24 -l place=group=switch ./myjob +$ qmgr -c 'p n @d' | grep ehc_1d | awk '{print $6}' | sort | uniq -c + 18 r1i0 + 18 r1i1 + 18 r1i2 + 18 r1i3 +... ``` -In this example, we request 9 nodes placed on the same switch using node grouping placement for 24 hours. - -HTML commented section #1 (turbo boost is to be implemented) +List of all nodes in a specific dimension 1 group: +```bash +$ qmgr -c 'p n @d' | grep 'ehc_1d = r1i0' | awk '{print $3}' | sort +r1i0n0 +r1i0n1 +r1i0n10 +r1i0n11 +... +``` Job Management -------------- @@ -409,6 +449,8 @@ In some cases, it may be impractical to copy the inputs to scratch and outputs t !!! Note "Note" Store the qsub options within the jobscript. Use **mpiprocs** and **ompthreads** qsub options to control the MPI job execution. 
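The qmgr listings above pair each node with its Infiniband switch in lines of the form `set node <name> resources_available.switch = <switch>`. Grouping those lines per switch can be sketched as follows; the sample lines are copied from the listings above, while the `nodes_per_switch` helper is a hypothetical illustration, not IT4I tooling.

```python
from collections import defaultdict

# Sample output lines of `qmgr -c 'print node @a'`, copied from the
# listings above.
SAMPLE = """\
set node r36u31n964 resources_available.switch = r36sw3
set node r36u32n965 resources_available.switch = r36sw3
set node r4i1n11 resources_available.switch = r4i1s0sw1
"""

def nodes_per_switch(qmgr_output):
    """Group node names by their Infiniband switch attribute."""
    groups = defaultdict(list)
    for line in qmgr_output.splitlines():
        parts = line.split()
        # line shape: set node <name> resources_available.switch = <switch>
        if len(parts) == 6 and parts[3] == "resources_available.switch":
            groups[parts[5]].append(parts[2])
    return dict(groups)

print(nodes_per_switch(SAMPLE)["r36sw3"])  # ['r36u31n964', 'r36u32n965']
```

This mirrors what the `grep`/`awk` pipelines above do on the shell side: field 6 is the switch name and field 3 the node name.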
+### Example Jobscript for MPI Calculation with preloaded inputs + Example jobscript for an MPI job with preloaded inputs and executables, options for qsub are stored within the script : ```bash diff --git a/docs.it4i/salomon/prace.md b/docs.it4i/salomon/prace.md index f3b4d4f566b4a06303b7bb944a852404bd2393ff..68da2f4ada6f1bb2b6eb39d7de088f76429b108e 100644 --- a/docs.it4i/salomon/prace.md +++ b/docs.it4i/salomon/prace.md @@ -55,7 +55,7 @@ To access Salomon cluster, two login nodes running GSI SSH service are available It is recommended to use the single DNS name salomon-prace.it4i.cz which is distributed between the two login nodes. If needed, user can login directly to one of the login nodes. The addresses are: |Login address|Port|Protocol|Login node| -|---|---| +|---|---|---|---| |salomon-prace.it4i.cz|2222|gsissh|login1, login2, login3 or login4| |login1-prace.salomon.it4i.cz|2222|gsissh|login1| |login2-prace.salomon.it4i.cz|2222|gsissh|login2| @@ -77,7 +77,7 @@ When logging from other PRACE system, the prace_service script can be used: It is recommended to use the single DNS name salomon.it4i.cz which is distributed between the two login nodes. If needed, user can login directly to one of the login nodes. 
The addresses are: |Login address|Port|Protocol|Login node| -|---|---| +|---|---|---|---| |salomon.it4i.cz|2222|gsissh|login1, login2, login3 or login4| |login1.salomon.it4i.cz|2222|gsissh|login1| |login2-prace.salomon.it4i.cz|2222|gsissh|login2| @@ -132,7 +132,7 @@ There's one control server and three backend servers for striping and/or backup **Access from PRACE network:** |Login address|Port|Node role| -|---|---| +|---|---|---| |gridftp-prace.salomon.it4i.cz|2812|Front end /control server| |lgw1-prace.salomon.it4i.cz|2813|Backend / data mover server| |lgw2-prace.salomon.it4i.cz|2813|Backend / data mover server| @@ -165,7 +165,7 @@ Or by using prace_service script: **Access from public Internet:** |Login address|Port|Node role| -|---|---| +|---|---|---| |gridftp.salomon.it4i.cz|2812|Front end /control server| |lgw1.salomon.it4i.cz|2813|Backend / data mover server| |lgw2.salomon.it4i.cz|2813|Backend / data mover server| @@ -198,11 +198,11 @@ Or by using prace_service script: Generally both shared file systems are available through GridFTP: |File system mount point|Filesystem|Comment| -|---|---| +|---|---|---| |/home|Lustre|Default HOME directories of users in format /home/prace/login/| |/scratch|Lustre|Shared SCRATCH mounted on the whole cluster| -More information about the shared file systems is available [here](storage/storage/). +More information about the shared file systems is available [here](storage/). Please note, that for PRACE users a "prace" directory is used also on the SCRATCH file system. @@ -234,7 +234,7 @@ General information about the resource allocation, job queuing and job execution For PRACE users, the default production run queue is "qprace". PRACE users can also use two other queues "qexp" and "qfree". 
|queue|Active project|Project resources|Nodes|priority|authorization|walltime | - |---|---| + |---|---|---|---|---|---|---| |**qexp** Express queue|no|none required|32 nodes, max 8 per user|150|no|1 / 1h| |**qprace** Production queue|yes|>0|1006 nodes, max 86 per job|0|no|24 / 48h| |**qfree** Free resource queue|yes|none required|752 nodes, max 86 per job|-1024|no|12 / 12h| diff --git a/docs.it4i/salomon/resources-allocation-policy.md b/docs.it4i/salomon/resources-allocation-policy.md index be6679d155400b9c5be9696aa70874ee9d05f7ac..fa7f188a8a773b278748b9290cba550fde9a7c2b 100644 --- a/docs.it4i/salomon/resources-allocation-policy.md +++ b/docs.it4i/salomon/resources-allocation-policy.md @@ -5,10 +5,10 @@ Resources Allocation Policy --------------------------- The resources are allocated to the job in a fairshare fashion, subject to constraints set by the queue and resources available to the Project. The Fairshare at Anselm ensures that individual users may consume approximately equal amount of resources per week. Detailed information in the [Job scheduling](job-priority/) section. The resources are accessible via several queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources. 
Following table provides the queue partitioning overview: - |queue |active project |project resources |nodes|min ncpus*|priority|authorization|walltime | - | --- | --- | - |**qexe** Express queue|no |none required |32 nodes, max 8 per user |24 |>150 |no |1 / 1h | - |**qprod** Production queue|yes |> 0 |>1006 nodes, max 86 per job |24 |0 |no |24 / 48h | + |queue |active project |project resources |nodes|min ncpus |priority|authorization|walltime | + | --- | --- |--- |--- |--- |--- |--- |--- | + |**qexp** Express queue|no |none required |32 nodes, max 8 per user |24 |150 |no |1 / 1h | + |**qprod** Production queue|yes |> 0 |1006 nodes, max 86 per job |24 |0 |no |24 / 48h | |**qlong** Long queue |yes |> 0 |256 nodes, max 40 per job, only non-accelerated nodes allowed |24 |0 |no |72 / 144h | |**qmpp** Massive parallel queue |yes |> 0 |1006 nodes |24 |0 |yes |2 / 4h | |**qfat** UV2000 queue |yes |> 0 |1 (uv1) |8 |0 |yes |24 / 48h | @@ -42,7 +42,7 @@ Salomon users may check current queue configuration at <https://extranet.it4i.cz !!! Note "Note" Check the status of jobs, queues and compute nodes at [https://extranet.it4i.cz/rsweb/salomon/](https://extranet.it4i.cz/rsweb/salomon) - + Display the queue status on Salomon: @@ -118,7 +118,8 @@ The resources that are currently subject to accounting are the core-hours. The c ### Check consumed resources -The **it4ifree** command is a part of it4i.portal.clients package, located here: <https://pypi.python.org/pypi/it4i.portal.clients> +!!! Note "Note" + The **it4ifree** command is a part of it4i.portal.clients package, located here: <https://pypi.python.org/pypi/it4i.portal.clients> User may check at any time, how many core-hours have been consumed by himself/herself and his/her projects. The command is available on clusters' login nodes.
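The accounting section above names core-hours as the accounted unit. Assuming the usual full-node accounting model (allocated cores times wall-clock hours; this exact formula is an assumption here, not stated in the excerpt), consumption can be estimated with a small sketch. The `core_hours` function is hypothetical and is not the `it4ifree` tool itself.

```python
# Hedged sketch of core-hour accounting: allocated cores times wall-clock
# hours. Full nodes are allocated (24 cores per Salomon node, 16 per Anselm
# node); the function is a hypothetical illustration, not it4ifree.
def core_hours(nodes, cores_per_node, walltime_hours):
    return nodes * cores_per_node * walltime_hours

# e.g. 4 Salomon nodes for 48 hours
print(core_hours(4, 24, 48))  # 4608
```

The authoritative numbers always come from the `it4ifree` command on the login nodes; this sketch only illustrates the order of magnitude of a job's cost.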