From 13ca779c24eff4541ad0147d29b55c782048d165 Mon Sep 17 00:00:00 2001
From: Jan Siwiec <jan.siwiec@vsb.cz>
Date: Tue, 20 Sep 2022 07:59:30 +0200
Subject: [PATCH] Qfree update

---
 .spelling                                     |   4 +
 .../resource_allocation_and_job_execution.md  |  20 +-
 .../general/resources-allocation-policy.md    | 236 +++++++-----------
 docs.it4i/general/vnode-allocation.md         | 147 +++++++++++
 mkdocs.yml                                    |   1 +
 5 files changed, 251 insertions(+), 157 deletions(-)
 create mode 100644 docs.it4i/general/vnode-allocation.md

diff --git a/.spelling b/.spelling
index d33f6426b..276d1e590 100644
--- a/.spelling
+++ b/.spelling
@@ -21,6 +21,8 @@ Anselm
IT4I
IT4Innovations
PBS
+vnode
+vnodes
Salomon
TurboVNC
VNC
@@ -812,3 +814,5 @@ PROJECT
e-INFRA
e-INFRA CZ
DICE
+qgpu
+qcpu
diff --git a/docs.it4i/general/resource_allocation_and_job_execution.md b/docs.it4i/general/resource_allocation_and_job_execution.md
index c3c8d4b1f..854ead65e 100644
--- a/docs.it4i/general/resource_allocation_and_job_execution.md
+++ b/docs.it4i/general/resource_allocation_and_job_execution.md
@@ -4,16 +4,7 @@ To run a [job][1], computational resources for this particular job must be alloc

## Resources Allocation Policy

-Resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue and resources available to the Project. [The Fair-share][3] ensures that individual users may consume approximately equal amount of resources per week. The resources are accessible via queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources. Following queues are the most important:
-
-* **qexp** - Express queue
-* **qprod** - Production queue
-* **qlong** - Long queue
-* **qmpp** - Massively parallel queue
-* **qnvidia**, **qfat** - Dedicated queues
-* **qcpu_biz**, **qgpu_biz** - Queues for commercial users
-* **qcpu_eurohpc**, **qgpu_eurohpc** - Queues for EuroHPC users
-* **qfree** - Free resource utilization queue
+Resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue and the resources available to the Project. [The Fair-share][3] ensures that individual users may consume an approximately equal amount of resources per week. The resources are accessible via queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources.

!!! note
    See the queue status for [Karolina][a] or [Barbora][c].

@@ -38,7 +29,13 @@ Use GNU Parallel and/or Job arrays when running (many) single core jobs.

In many cases, it is useful to submit a huge (100+) number of computational jobs into the PBS queue system. A huge number of (small) jobs is one of the most effective ways to execute parallel calculations, achieving best runtime, throughput and computer utilization. In this chapter, we discuss the recommended way to run huge numbers of jobs, including **ways to run huge numbers of single core jobs**.

-Read more on [Capacity Computing][6] page.
+Read more on the [Capacity Computing][6] page.
+
+## Vnode Allocation
+
+The `qgpu` queue on Karolina takes advantage of the division of nodes into vnodes. An accelerated node equipped with two 64-core processors and eight GPU cards is treated as eight vnodes, each containing 16 CPU cores and 1 GPU card. Vnodes can be allocated to jobs individually; through a precise definition of the resource list at job submission, you may allocate a varying number of resources/GPU cards according to your needs.
+
+Read more on the [Vnode Allocation][7] page.
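+
+For a quick, illustrative preview (the project ID below is a placeholder), a single vnode, i.e. 1 GPU card and 16 CPU cores, can be requested in an interactive session with the default chunk:
+
+```console
+qsub -q qgpu -A OPEN-00-00 -l select=1 -I
+```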
[1]: ../index.md#terminology-frequently-used-on-these-pages
[2]: ../pbspro.md
@@ -46,6 +43,7 @@ Read more on [Capacity Computing][6] page.
[4]: resources-allocation-policy.md
[5]: job-submission-and-execution.md
[6]: capacity-computing.md
+[7]: vnode-allocation.md

[a]: https://extranet.it4i.cz/rsweb/karolina/queues
[b]: https://www.altair.com/pbs-works/
diff --git a/docs.it4i/general/resources-allocation-policy.md b/docs.it4i/general/resources-allocation-policy.md
index 1cc4af876..4b20841ec 100644
--- a/docs.it4i/general/resources-allocation-policy.md
+++ b/docs.it4i/general/resources-allocation-policy.md
@@ -2,74 +2,100 @@
## Job Queue Policies

-Resources are allocated to jobs in a fair-share fashion, subject to constraints set by the queue and the resources available to the project. The fair-share system ensures that individual users may consume approximately equal amounts of resources per week. Detailed information can be found in the [Job scheduling][1] section. Resources are accessible via several queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources. The following table provides the queue partitioning overview:
-
-!!! note "New queues"
-    As a part of a larger, gradual update of the queues, we are introducing new queues:<br><br>
-    **qcpu_preempt**, **qgpu_preempt** - Free queues with the lowest priority (LP). The queues require a project with allocation of the respective resource type. There is no limit on resource overdraft. Jobs are killed if other jobs with a higher priority (HP) request the nodes and there are no other nodes available. LP jobs are automatically requeued once HP jobs finish, so make sure your jobs are rerunnable<br><br>
-    **qcpu_free**, **qgpu_free** - limit increased from 120% to 150% of project's resources allocation, max walltime 18h, resources load reduced from 90% to 65%.
-
-!!! important
-    **The qfree, qcpu_free and qgpu_free queues are not free of charge**. [Normal accounting][2] applies. However, it allows for utilization of free resources, once a project has exhausted all its allocated computational resources. This does not apply to Director's Discretion projects (DD projects) by default. Usage of the queues after exhaustion of DD projects' computational resources is allowed upon request.
-
-!!! note
-    The qexp queue is configured to run one job and accept five jobs in a queue per user.
+Resources are allocated to jobs in a fair-share fashion,
+subject to constraints set by the queue and the resources available to the project.
+The fair-share system ensures that individual users may consume approximately equal amounts of resources per week.
+Detailed information can be found in the [Job scheduling][1] section.
+
+Resources are accessible via several queues for queueing the jobs.
+Queues provide prioritized and exclusive access to the computational resources.
+
+!!! important "Queues update"
+    We are introducing updated queues.
+    These have the same parameters as the legacy queues but are divided based on resource type (`qcpu_` for non-accelerated nodes and `qgpu_` for accelerated nodes).<br><br>
+    Note that on Karolina's `qgpu` queue, **you can now allocate 1/8 of the node - 1 GPU and 16 cores**. For more information, see [Allocation of vnodes on qgpu][4].<br><br>
+    We have also added completely new queues `qcpu_preempt` and `qgpu_preempt`. For more information, see the table below.
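+
+As an illustrative sketch only (the project ID, node count, walltime, and job script are placeholders), a job that previously targeted the legacy `qprod` queue is submitted to the updated `qcpu` queue in the same way; only the queue name changes:
+
+```console
+qsub -q qcpu -A OPEN-00-00 -l select=2:ncpus=128,walltime=04:00:00 ./myjob.sh
+```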
+
+### New Queues
+
+| <div style="width:86px">Queue</div>| Description |
+| -------------------------------- | ----------- |
+| `qcpu` | Production queue for non-accelerated nodes intended for standard production runs. Requires an active project with nonzero remaining resources. Full nodes are allocated. Identical to `qprod`. |
+| `qgpu` | Dedicated queue for accessing the NVIDIA accelerated nodes. Requires an active project with nonzero remaining resources. It utilizes 8x NVIDIA A100 with 320GB HBM2 memory per node. The PI needs to explicitly ask support for authorization to enter the queue for all users associated with their project. **On Karolina, you can allocate 1/8 of the node - 1 GPU and 16 cores**. For more information, see [Allocation of vnodes on qgpu][4]. |
+| `qcpu_biz`<br>`qgpu_biz` | Commercial queues, slightly higher priority. |
+| `qcpu_eurohpc`<br>`qgpu_eurohpc` | EuroHPC queues, slightly higher priority, **Karolina only**. |
+| `qcpu_exp`<br>`qgpu_exp` | Express queues for testing and running very small jobs. Do not require a project. There are 2 nodes always reserved (w/o accelerators), max 8 nodes available per user. The nodes may be allocated on a per core basis. The queues are configured to run one job and accept five jobs in a queue per user. |
+| `qcpu_free`<br>`qgpu_free` | Intended for utilization of free resources, after a project has exhausted all its allocated resources. Note that the queues are **not free of charge**. [Normal accounting][2] applies. (Does not apply to DD projects by default. DD projects have to request permission after exhaustion of computational resources.) Consumed resources will be accounted to the Project. Access to the queues is removed if consumed resources exceed 150% of the allocation. Full nodes are allocated. |
+| `qcpu_long`<br>`qgpu_long` | Queues for long production runs. Require an active project with nonzero remaining resources. Only 200 nodes without acceleration may be accessed. Full nodes are allocated. |
+| `qcpu_preempt`<br>`qgpu_preempt` | Free queues with the lowest priority (LP). The queues require a project with allocation of the respective resource type. There is no limit on resource overdraft. Jobs are killed if other jobs with a higher priority (HP) request the nodes and there are no other nodes available. LP jobs are automatically re-queued once HP jobs finish, so **make sure your jobs are re-runnable**. |
+| `qdgx` | Queue for DGX-2, accessible from Barbora. |
+| `qfat` | Queue for the fat node; the PI must request authorization to enter the queue for all users associated with their project. |
+| `qviz` | Visualization queue intended for pre-/post-processing using OpenGL accelerated graphics. Each user gets 8 cores of a CPU allocated (approx. 64 GB of RAM and 1/8 of the GPU capacity (default "chunk")). If more GPU power or RAM is required, it is recommended to allocate more chunks (with 8 cores each) up to one whole node per user. This is currently also the maximum allowed allocation per one user. One hour of work is allocated by default; the user may ask for 2 hours maximum. |
+
+### Legacy Queues
+
+Legacy queues stay in production until the end of 2022.
+
+| Legacy queue | Replaced by |
+| ------------ | ------------------------- |
+| `qexp` | `qcpu_exp` & `qgpu_exp` |
+| `qprod` | `qcpu` |
+| `qlong` | `qcpu_long` & `qgpu_long` |
+| `qnvidia` | `qgpu`<br>Note that unlike the new `qgpu` queue, the legacy queue allocates only full nodes. |
| +| `qfree` | `qcpu_free` & `qgpu_free` | + +The following table provides the queue partitioning per cluster overview: ### Karolina -| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime (default/max) | +| Queue | Active project | Project resources | Nodes | Min ncpus | Priority | Authorization | Walltime (default/max) | | ---------------- | -------------- | -------------------- | ------------------------------------------------------------- | --------- | -------- | ------------- | ----------------------- | -| **qexp** | no | none required | 32 nodes<br>max 2 nodes per job | 128 | 150 | no | 1 / 1h | -| **qprod** | yes | > 0 | 754 nodes | 128 | 0 | no | 24 / 48h | -| **qlong** | yes | > 0 | 200 nodes, max 20 nodes per job, only non-accelerated nodes allowed | 128 | 0 | no | 72 / 144h | -| **qnvidia** | yes | > 0 | 72 nodes | 128 | 0 | yes | 24 / 48h | -| **qfat** | yes | > 0 | 1 (sdf1) | 24 | 200 | yes | 24 / 48h | -| **qcpu_biz** | yes | > 0 | 754 nodes | 128 | 50 | no | 24 / 48h | -| **qgpu_biz** | yes | > 0 | 72 nodes | 128 | 50 | yes | 24 / 48h | -| **qcpu_eurohpc** | yes | > 0 | 754 nodes | 128 | 50 | no | 24 / 48h | -| **qgpu_eurohpc** | yes | > 0 | 72 nodes | 128 | 50 | yes | 24 / 48h | -| **qcpu_preempt** | yes | > 0 | 491 nodes<br>max 4 nodes per job | 128 | -200 | no | 12 / 12h | +| **qcpu** | yes | > 0 | 756 nodes | 128 | 0 | no | 24 / 48h | +| **qcpu_biz** | yes | > 0 | 756 nodes | 128 | 50 | no | 24 / 48h | +| **qcpu_eurohpc** | yes | > 0 | 756 nodes | 128 | 50 | no | 24 / 48h | +| **qcpu_exp** | yes | none required | 756 nodes<br>max 2 nodes per user | 128 | 150 | no | 1 / 1h | +| **qcpu_free** | yes | < 150% of allocation | 756 nodes<br>max 4 nodes per job | 128 | -100 | no | 12 / 12h | +| **qcpu_long** | yes | > 0 | 200 nodes<br>max 20 nodes per job, only non-accelerated nodes allowed | 128 | 0 | no | 72 / 144h | +| **qcpu_preempt** | yes | > 0 | 756 nodes<br>max 4 nodes per job | 128 | -200 | no | 12 / 12h | +| **qgpu** | yes | > 0 | 72 nodes | 16 cpus<br>1 gpu | 0 | yes | 24 / 48h | +| **qgpu_biz** | yes | > 0 | 70 nodes | 128 | 50 | yes | 24 / 48h | +| **qgpu_eurohpc** | yes | > 0 | 70 nodes | 128 | 50 | yes | 24 / 48h | +| **qgpu_exp** | yes | none required | 4 nodes<br>max 1 node per job | 16 cpus<br>1 gpu | 0 | no | 1 / 1h | +| **qgpu_free** | yes | < 150% of allocation | 46 nodes<br>max 2 nodes per job | 16 cpus<br>1 gpu|-100| no | 12 / 12h | | **qgpu_preempt** | yes | > 0 | 72 nodes<br>max 2 nodes per job | 16 cpus<br>1 gpu|-200| no | 12 / 12h | -| **qcpu_free** | yes | < 150% of allocation | 491 nodes<br>max 4 nodes per job | 128 | -100 | no | 12 / 18h | -| **qgpu_free** | yes | < 150% of allocation | 46 nodes<br>max 2 nodes per job | 16 cpus<br>1 gpu|-100| no | 12 / 18h | -| **qfree** | yes | < 150% of allocation | 491 nodes, max 4 nodes per job | 128 | -1024 | no | 12 / 12h | -| **qviz** | yes | none required | 2 nodes (with NVIDIA® Quadro RTX™ 6000) | 8 | 150 | no | 1 / 8h | - -* **qexp** Express queue: This queue is dedicated for testing and running very small jobs. It is not required to specify a project to enter the qexp. There are 2 nodes always reserved for this queue (w/o accelerators), a maximum 8 nodes are available via the qexp for a particular user. The nodes may be allocated on a per core basis. No special authorization is required to use the queue. Maximum runtime is 1 hour. -* **qprod** Production queue: This queue is intended for normal production runs. 
It is required that active project with nonzero remaining resources is specified to enter the qprod. All nodes may be accessed via the qprod queue. Full nodes, 128 cores per node are allocated. The queue runs with medium priority and no special authorization is required to use it. Maximum runtime is 48 hours. -* **qlong** Long queue: This queue is intended for long production runs. It is required that active project with nonzero remaining resources is specified to enter the qlong. Only 200 nodes without acceleration may be accessed via the qlong queue. Full nodes, 128 cores per node are allocated. The queue runs with medium priority and no special authorization is required to use it. Maximum runtime is 144 hours (3 \* qprod time) -* **qnvidia** Dedicated queue: This queue is dedicated to accessing the NVIDIA accelerated nodes. It is required that an active project with nonzero remaining resources is specified to enter this queue. It utilizes 8x NVIDIA A100 with 320GB HBM2 memory per node. Full nodes, 128 cores and 8 GPUs per node, are allocated. The PI needs to explicitly ask [support][a] for authorization to enter the queue for all users associated with their project. -* **qfat** HPE Superdome Flex queue. This queue is dedicated to access the fat HPE Superdome Flex machine. The machine (sdf1) has 768 Intel® Xeon® Platinum cores at 2.9GHz and 24TB RAM. The PI needs to explicitly ask support for authorization to enter the queue for all users associated to their Project. -* **qcpu_biz**, **qgpu_biz** Commercial queues similar to qprod and qnvidia. These queues are reserved for commercial customers and have slightly higher priorities. -* **qcpu_eurohpc**, **qgpu_eurohpc** EuroHPC queues similar to qprod and qnvidia. These queues are reserved for EuroHPC users. -* **qcpu_preempt**, **qgpu_preempt** free queues with a lower priority (LP), requires allocation of the resource type, jobs are killed if other jobs with a higher priority (HP) request the nodes, LP jobs are automatically requeued once HP jobs finish. -* **qcpu_free**, **qgpu_free** queues similar to **qfree** -* **qfree** Free resource queue: The queue qfree is intended for utilization of free resources, after a Project exhausted all its allocated computational resources. (Does not apply to DD projects by default. DD projects have to request for permission on qfree after exhaustion of computational resources.) It is required that an active project is specified to enter the queue. Consumed resources will be accounted to the Project. Access to the qfree queue is automatically removed if consumed resources exceed 150% of the resources allocated to the Project. Only 756 nodes without accelerator may be accessed from this queue. Full nodes, 128 cores per node are allocated. The queue runs with a very low priority and no special authorization is required to use it. The maximum runtime in qfree is 12 hours. -* **qviz** Visualization queue: Intended for pre-/post-processing using OpenGL accelerated graphics. Currently when accessing the node, each user gets 8 cores of a CPU allocated, thus approximately 64 GB of RAM and 1/8 of the GPU capacity (default "chunk"). If more GPU power or RAM is required, it is recommended to allocate more chunks (with 8 cores each) up to one whole node per user, so that all 64 cores, 256 GB RAM and a whole GPU is exclusive. This is currently also the maximum allowed allocation per one user. One hour of work is allocated by default, the user may ask for 2 hours maximum. 
+| **qviz** | yes | none required | 2 nodes (with NVIDIA® Quadro RTX™ 6000) | 8 | 0 | no | 1 / 8h | +| **qfat** | yes | > 0 | 1 (sdf1) | 24 | 0 | yes | 24 / 48h | +| **Legacy Queues** | +| **qfree** | yes | < 150% of allocation | 756 nodes<br>max 4 nodes per job | 128 | -100 | no | 12 / 12h | +| **qexp** | no | none required | 756 nodes<br>max 2 nodes per job | 128 | 150 | no | 1 / 1h | +| **qprod** | yes | > 0 | 756 nodes | 128 | 0 | no | 24 / 48h | +| **qlong** | yes | > 0 | 200 nodes<br>max 20 nodes per job, only non-accelerated nodes allowed | 128 | 0 | no | 72 / 144h | +| **qnvidia** | yes | > 0 | 72 nodes | 128 | 0 | yes | 24 / 48h | ### Barbora -| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime (default/max) | -| ---------------- | -------------- | -------------------- | ------------------------- | --------- | -------- | ------------- | ---------------------- | -| **qexp** | no | none required | 16 nodes<br>max 4 nodes per job | 36 | 150 | no | 1 / 1h | -| **qprod** | yes | > 0 | 190 nodes w/o accelerator | 36 | 0 | no | 24 / 48h | -| **qlong** | yes | > 0 | 20 nodes w/o accelerator | 36 | 0 | no | 72 / 144h | -| **qnvidia** | yes | > 0 | 8 NVIDIA nodes | 24 | 0 | yes | 24 / 48h | -| **qfat** | yes | > 0 | 1 fat node | 8 | 200 | yes | 24 / 144h | -| **qcpu_biz** | yes | > 0 | 187 nodes w/o accelerator | 36 | 50 | no | 24 / 48h | -| **qgpu_biz** | yes | > 0 | 8 NVIDIA nodes | 24 | 50 | yes | 24 / 48h | -| **qcpu_preempt** | yes | > 0 | 190 nodes<br>max 4 nodes per job | 36 | -200 | no | 12 / 12h | -| **qgpu_preempt** | yes | > 0 | 8 nodes<br>max 2 nodes per job | 24 | -200 | no | 12 / 12h | -| **qcpu_free** | yes | < 150% of allocation | 124 nodes<br>max 4 nodes per job | 36 | -100 | no | 12 / 18h | -| **qgpu_free** | yes | < 150% of allocation | 5 nodes<br>max 2 nodes per job | 24 | -100 | no | 12 / 18h | -| **qfree** | yes | < 150% of allocation | 192 w/o accelerator | 36 | -1024 | no | 12 / 12h | - -* **qexp**, Express queue: This queue is dedicated for testing and running very small jobs. It is not required to specify a project to enter the qexp. There are 2 nodes always reserved for this queue (w/o accelerators), a maximum 8 nodes are available via the qexp for a particular user. The nodes may be allocated on a per core basis. No special authorization is required to use the queue. The maximum runtime in qexp is 1 hour. -* **qprod**, Production queue: This queue is intended for normal production runs. It is required that an active project with nonzero remaining resources is specified to enter the qprod. All nodes may be accessed via the qprod queue, except the reserved ones. 187 nodes without accelerators are included. Full nodes, 36 cores per node, are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qprod is 48 hours. -* **qlong**, Long queue: This queue is intended for long production runs. It is required that an active project with nonzero remaining resources is specified to enter the qlong. Only 20 nodes without acceleration may be accessed via the qlong queue. Full nodes, 36 cores per node, are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qlong is 144 hours (three times that of the standard qprod time - 3 x 48 h). -* **qnvidia**, **qfat**, Dedicated queues: The queue qnvidia is dedicated to accessing the NVIDIA accelerated nodes and qfat the Fat nodes. 
It is required that an active project with nonzero remaining resources is specified to enter these queues. Included are 8 NVIDIA (4 NVIDIA cards per node) and 1 fat nodes. Full nodes, 24 cores per node, are allocated. qfat runs with very high priority. The PI needs to explicitly ask [support][a] for authorization to enter the dedicated queues for all users associated with their project. -* **qcpu_biz**, **qgpu_biz** Commercial queues similar to qprod and qnvidia: These queues are reserved for commercial customers and have slightly higher priorities. -* **qcpu_preempt**, **qgpu_preempt** free queues with a lower priority (LP), requires allocation of the resource type, jobs are killed if other jobs with a higher priority (HP) request the nodes, LP jobs are automatically requeued once HP jobs finish.* **qfree**, Free resource queue: The queue qfree is intended for utilization of free resources, after a project has exhausted all of its allocated computational resources (Does not apply to DD projects by default; DD projects have to request permission to use qfree after exhaustion of computational resources). It is required that an active project is specified to enter the queue. Consumed resources will be accounted to the Project. Access to the qfree queue is automatically removed if consumed resources exceed 120% of the resources allocated to the Project. Only 189 nodes without accelerators may be accessed from this queue. Full nodes, 16 cores per node, are allocated. The queue runs with a very low priority and no special authorization is required to use it. The maximum runtime in qfree is 12 hours. +| Queue | Active project | Project resources | Nodes | Min ncpus | Priority | Authorization | Walltime (default/max) | +| ---------------- | -------------- | -------------------- | -------------------------------- | --------- | -------- | ------------- | ---------------------- | +| **qcpu** | yes | > 0 | 190 nodes | 36 | 0 | no | 24 / 48h | +| **qcpu_biz** | yes | > 0 | 190 nodes | 36 | 50 | no | 24 / 48h | +| **qcpu_exp** | yes | none required | 16 nodes | 36 | 150 | no | 1 / 1h | +| **qcpu_free** | yes | < 150% of allocation | 124 nodes<br>max 4 nodes per job | 36 | -100 | no | 12 / 18h | +| **qcpu_long** | yes | > 0 | 60 nodes<br>max 20 nodes per job | 36 | 0 | no | 72 / 144h | +| **qcpu_preempt** | yes | > 0 | 190 nodes<br>max 4 nodes per job | 36 | -200 | no | 12 / 12h | +| **qgpu** | yes | > 0 | 8 nodes | 24 | 0 | yes | 24 / 48h | +| **qgpu_biz** | yes | > 0 | 8 nodes | 24 | 50 | yes | 24 / 48h | +| **qgpu_exp** | yes | none required | 4 nodes<br>max 1 node per job | 24 | 0 | no | 1 / 1h | +| **qgpu_free** | yes | < 150% of allocation | 5 nodes<br>max 2 nodes per job | 24 | -100 | no | 12 / 18h | +| **qgpu_preempt** | yes | > 0 | 4 nodes<br>max 2 nodes per job | 24 | -200 | no | 12 / 12h | +| **qdgx** | yes | > 0 | cn202 | 96 | 0 | yes | 4 / 48h | +| **qviz** | yes | none required | 2 nodes with NVIDIA Quadro P6000 | 4 | 0 | no | 1 / 8h | +| **qfat** | yes | > 0 | 1 fat node | 128 | 0 | yes | 24 / 48h | +| **Legacy Queues** | +| **qexp** | no | none required | 16 nodes<br>max 4 nodes per job | 36 | 150 | no | 1 / 1h | +| **qprod** | yes | > 0 | 190 nodes w/o accelerator | 36 | 0 | no | 24 / 48h | +| **qlong** | yes | > 0 | 60 nodes w/o accelerator<br>max 20 nodes per job | 36 | 0 | no | 72 / 144h | +| **qnvidia** | yes | > 0 | 8 NVIDIA nodes | 24 | 0 | yes | 24 / 48h | +| **qfree** | yes | < 150% of allocation | 192 w/o accelerator<br>max 32 nodes per job | 36 | -100 | no | 12 / 12h | ## 
Queue Notes @@ -105,92 +131,9 @@ Options: --get-reservations Print reservations --get-reservations-details Print reservations details - --get-nodes Print nodes of PBS complex - --get-nodeset Print nodeset of PBS complex - --get-nodes-details Print nodes details - --get-vnodes Print vnodes of PBS complex - --get-vnodeset Print vnodes nodeset of PBS complex - --get-vnodes-details Print vnodes details - --get-jobs Print jobs - --get-jobs-details Print jobs details - --get-job-nodes Print job nodes - --get-job-nodeset Print job nodeset - --get-job-vnodes Print job vnodes - --get-job-vnodeset Print job vnodes nodeset - --get-jobs-check-params - Print jobid, job state, session_id, user, nodes - --get-users Print users of jobs - --get-allocated-nodes - Print nodes allocated by jobs - --get-allocated-nodeset - Print nodeset allocated by jobs - --get-allocated-vnodes - Print vnodes allocated by jobs - --get-allocated-vnodeset - Print vnodes nodeset allocated by jobs - --get-node-users Print node users - --get-node-jobs Print node jobs - --get-node-ncpus Print number of cpus per node - --get-node-naccelerators - Print number of accelerators per node - --get-node-allocated-ncpus - Print number of allocated cpus per node - --get-node-allocated-naccelerators - Print number of allocated accelerators per node - --get-node-qlist Print node qlist - --get-node-ibswitch Print node ibswitch - --get-vnode-users Print vnode users - --get-vnode-jobs Print vnode jobs - --get-vnode-ncpus Print number of cpus per vnode - --get-vnode-naccelerators - Print number of naccelerators per vnode - --get-vnode-allocated-ncpus - Print number of allocated cpus per vnode - --get-vnode-allocated-naccelerators - Print number of allocated accelerators per vnode - --get-vnode-qlist Print vnode qlist - --get-vnode-ibswitch Print vnode ibswitch - --get-user-nodes Print user nodes - --get-user-nodeset Print user nodeset - --get-user-vnodes Print user vnodes - --get-user-vnodeset Print user vnodes nodeset - --get-user-jobs Print user jobs - --get-user-job-count Print number of jobs per user - --get-user-node-count - Print number of allocated nodes per user - --get-user-vnode-count - Print number of allocated vnodes per user - --get-user-ncpus Print number of allocated ncpus per user - --get-qlist-nodes Print qlist nodes - --get-qlist-nodeset Print qlist nodeset - --get-qlist-vnodes Print qlist vnodes - --get-qlist-vnodeset Print qlist vnodes nodeset - --get-ibswitch-nodes Print ibswitch nodes - --get-ibswitch-nodeset - Print ibswitch nodeset - --get-ibswitch-vnodes - Print ibswitch vnodes - --get-ibswitch-vnodeset - Print ibswitch vnodes nodeset - --last-job Print expected time of last running job - --summary Print summary - --get-node-ncpu-chart - Obsolete. Print chart of allocated ncpus per node - --server=SERVER Use given PBS server - --state=STATE Only for given job state - --jobid=JOBID Only for given job ID - --user=USER Only for given user - --node=NODE Only for given node - --vnode=VNODE Only for given vnode - --nodestate=NODESTATE - Only for given node state (affects only --get-node* - --get-vnode* --get-qlist-* --get-ibswitch-* actions) - --incl-finished Include finished jobs - --walltime-exceeded-used-walltime - Job walltime exceeded - resources_used.walltime - --walltime-exceeded-real-runtime - Job walltime exceeded - real runtime - --backend-sqlite Use SQLite backend - experimental + ... + .. + . 
``` ---8<--- "resource_accounting.md" @@ -200,6 +143,7 @@ Options:
[1]: job-priority.md
[2]: #resource-accounting-policy
[3]: job-submission-and-execution.md
+[4]: ./vnode-allocation.md
[a]: https://support.it4i.cz/rt/
[c]: https://extranet.it4i.cz/rsweb
diff --git a/docs.it4i/general/vnode-allocation.md b/docs.it4i/general/vnode-allocation.md
new file mode 100644
index 000000000..413358f7d
--- /dev/null
+++ b/docs.it4i/general/vnode-allocation.md
@@ -0,0 +1,147 @@
+# Allocation of vnodes on qgpu
+
+## Introduction
+
+The `qgpu` queue on Karolina takes advantage of the division of nodes into vnodes.
+An accelerated node equipped with two 64-core processors and eight GPU cards is treated as eight vnodes,
+each containing 16 CPU cores and 1 GPU card.
+Vnodes can be allocated to jobs individually;
+through a precise definition of the resource list at job submission,
+you may allocate a varying number of resources/GPU cards according to your needs.
+
+!!! important "Vnodes and Security"
+    The division of nodes into vnodes was implemented to be as secure as possible, but it is still a "multi-user mode",
+    which means that if two users allocate a portion of the same node, they can see each other's running processes.
+    If this solution is inconvenient for you, consider allocating a whole node.
+
+## Selection Statement and Chunks
+
+Requested resources are specified using a selection statement:
+
+```
+-l select=[<N>:]<chunk>[+[<N>:]<chunk> ...]
+```
+
+`N` specifies the number of chunks; if not specified then `N = 1`.<br>
+`chunk` declares the value of each resource in a set of resources which are to be allocated as a unit to a job.
+
+* `chunk` is seen by MPI as one node.
+* Multiple chunks are then seen as multiple nodes.
+* Maximum chunk size is equal to the size of a full physical node (8 GPU cards, 128 cores).
+
+The default chunk for the `qgpu` queue is configured to contain 1 GPU card and 16 CPU cores, i.e. `ncpus=16:ngpus=1`.
+
+* `ncpus` specifies the number of CPU cores
+* `ngpus` specifies the number of GPU cards
+
+### Allocating Single GPU
+
+A single GPU can be allocated in an interactive session using
+
+```console
+qsub -q qgpu -A OPEN-00-00 -l select=1 -I
+```
+
+or simply
+
+```console
+qsub -q qgpu -A OPEN-00-00 -I
+```
+
+In this case, the `ngpus` parameter is optional, since it defaults to `1`.
+You can verify your allocation either in PBS using the `qstat` command,
+or by checking the number of allocated GPU cards in the `CUDA_VISIBLE_DEVICES` variable:
+
+```console
+$ qstat -F json -f $PBS_JOBID | grep exec_vnode
+ "exec_vnode":"(acn53[0]:ncpus=16:ngpus=1)"
+
+$ echo $CUDA_VISIBLE_DEVICES
+GPU-8772c06c-0e5e-9f87-8a41-30f1a70baa00
+```
+
+The output shows that you have been allocated the vnode `acn53[0]`.
+
+### Allocating Single Accelerated Node
+
+!!! tip "Security tip"
+    Allocating a whole node prevents other users from seeing your running processes.
+
+A single accelerated node can be allocated in an interactive session using
+
+```console
+qsub -q qgpu -A OPEN-00-00 -l select=8 -I
+```
+
+Setting `select=8` automatically allocates a whole accelerated node and sets `mpiprocs`.
+So for `N` full nodes, set `select` to `N x 8`.
+However, note that it may take some time before your jobs are executed
+if the required number of full nodes isn't available.
+
+### Allocating Multiple GPUs
+
+!!! important "Security risk"
+    If two users allocate a portion of the same node, they can see each other's running processes.
+    When required for security reasons, consider allocating a whole node.
+
+Again, the following examples use only the selection statement, so no additional setting is required.
+
+```console
+qsub -q qgpu -A OPEN-00-00 -l select=2 -I
+```
+
+In this example, two chunks will be allocated on the same node, if possible.
+
+```console
+qsub -q qgpu -A OPEN-00-00 -l select=16 -I
+```
+
+This example allocates two whole accelerated nodes.
+
+Multiple vnodes can be allocated within the same chunk using the `ngpus` parameter.
+For example, to allocate 2 vnodes in interactive mode, run
+
+```console
+qsub -q qgpu -A OPEN-00-00 -l select=1:ngpus=2:mpiprocs=2 -I
+```
+
+Remember to **set the number of `mpiprocs` equal to that of `ngpus`** to spawn a corresponding number of MPI processes.
+
+To verify the allocation:
+
+```console
+$ qstat -F json -f $PBS_JOBID | grep exec_vnode
+ "exec_vnode":"(acn53[0]:ncpus=16:ngpus=1+acn53[1]:ncpus=16:ngpus=1)"
+
+$ echo $CUDA_VISIBLE_DEVICES | tr ',' '\n'
+GPU-8772c06c-0e5e-9f87-8a41-30f1a70baa00
+GPU-5e88c15c-e331-a1e4-c80c-ceb3f49c300e
+```
+
+The number of chunks to allocate is specified in the `select` parameter.
+For example, to allocate 2 chunks, each with 4 GPUs, run
+
+```console
+qsub -q qgpu -A OPEN-00-00 -l select=2:ngpus=4:mpiprocs=4 -I
+```
+
+To verify the allocation:
+
+```console
+$ cat > print-cuda-devices.sh <<EOF
+#!/bin/bash
+echo \$CUDA_VISIBLE_DEVICES
+EOF
+
+$ chmod +x print-cuda-devices.sh
+$ ml OpenMPI/4.1.4-GCC-11.3.0
+$ mpirun ./print-cuda-devices.sh | tr ',' '\n' | sort | uniq
+GPU-0910c544-aef7-eab8-f49e-f90d4d9b7560
+GPU-1422a1c6-15b4-7b23-dd58-af3a233cda51
+GPU-3dbf6187-9833-b50b-b536-a83e18688cff
+GPU-3dd0ae4b-e196-7c77-146d-ae16368152d0
+GPU-93edfee0-4cfa-3f82-18a1-1e5f93e614b9
+GPU-9c8143a6-274d-d9fc-e793-a7833adde729
+GPU-ad06ab8b-99cd-e1eb-6f40-d0f9694601c0
+GPU-dc0bc3d6-e300-a80a-79d9-3e5373cb84c9
+```
diff --git a/mkdocs.yml b/mkdocs.yml
index f2d76efb6..9a5eb22b9 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -77,6 +77,7 @@ nav: - Job Priority: general/job-priority.md - Job Submission and Execution: general/job-submission-and-execution.md - Capacity Computing: general/capacity-computing.md + - Vnode Allocation: general/vnode-allocation.md - Migrating from SLURM: general/slurmtopbs.md - Technical Information: - SSH Keys:
-- GitLab