diff --git a/docs.it4i/general/capacity-computing.md b/docs.it4i/general/capacity-computing.md index f0d36bf30426ae7fa8389289a1c6d7ecb8336787..c5e5311466cb404ea6b5a3f4607d0012c41d33a0 100644 --- a/docs.it4i/general/capacity-computing.md +++ b/docs.it4i/general/capacity-computing.md @@ -1,24 +1,21 @@ -!!!warning - This page has not been updated yet. The page does not reflect the transition from PBS to Slurm. - # Capacity Computing ## Introduction -In many cases, it is useful to submit a huge (>100) number of computational jobs into the PBS queue system. A huge number of (small) jobs is one of the most effective ways to execute embarrassingly parallel calculations, achieving the best runtime, throughput, and computer utilization. +In many cases, it is useful to submit a huge (>100) number of computational jobs into the Slurm queue system. +A huge number of (small) jobs is one of the most effective ways to execute embarrassingly parallel calculations, +achieving the best runtime, throughput, and computer utilization. + +However, executing a huge number of jobs via the Slurm queue may strain the system. This strain may +result in slow response to commands, inefficient scheduling, and overall degradation of performance +and user experience for all users. -However, executing a huge number of jobs via the PBS queue may strain the system. This strain may result in slow response to commands, inefficient scheduling, and overall degradation of performance and user experience for all users. For this reason, the number of jobs is **limited to 100 jobs per user, 4,000 jobs and subjobs per user, 1,500 subjobs per job array**. +[//]: # (For this reason, the number of jobs is **limited to 100 jobs per user, 4,000 jobs and subjobs per user, 1,500 subjobs per job array**.) !!! note Follow one of the procedures below, in case you wish to schedule more than 100 jobs at a time. 
-* Use [Job arrays][1] when running a huge number of multithread (bound to one node only) or multinode (multithread across several nodes) jobs. -* Use [HyperQueue][3] when running a huge number of multithread jobs. HyperQueue can help overcome the limits of job arrays. - -## Policy - -1. A user is allowed to submit at most 100 jobs. Each job may be [a job array][1]. -1. The array size is at most 1,000 subjobs. +You can use [HyperQueue][1] when running a huge number of jobs. HyperQueue can help efficiently +load balance a large number of jobs amongst available computing nodes. -[1]: job-arrays.md -[3]: hyperqueue.md +[1]: hyperqueue.md diff --git a/docs.it4i/general/hyperqueue.md b/docs.it4i/general/hyperqueue.md index 458ecfcb6d420842386347874e52adfae8b53480..4f0dea5b9ce3cc328839dd074f7a20cdbad4b7a5 100644 --- a/docs.it4i/general/hyperqueue.md +++ b/docs.it4i/general/hyperqueue.md @@ -1,11 +1,8 @@ -!!!warning - This page has not been updated yet. The page does not reflect the transition from PBS to Slurm. - # HyperQueue HyperQueue lets you build a computation plan consisting of a large amount of tasks and then execute it transparently over a system like SLURM/PBS. -It dynamically groups tasks into PBS jobs and distributes them to fully utilize allocated nodes. -You thus do not have to manually aggregate your tasks into PBS jobs. +It dynamically groups tasks into Slurm jobs and distributes them to fully utilize allocated nodes. +You thus do not have to manually aggregate your tasks into Slurm jobs. Find more about HyperQueue in its [documentation][a]. @@ -15,25 +12,25 @@ Find more about HyperQueue in its [documentation][a]. 
* **Transparent task execution on top of a Slurm/PBS cluster** - * Automatic task distribution amongst jobs, nodes, and cores - * Automatic submission of PBS/Slurm jobs + * Automatic task distribution amongst jobs, nodes, and cores + * Automatic submission of PBS/Slurm jobs * **Dynamic load balancing across jobs** - * Work-stealing scheduler - * NUMA-aware, core planning, task priorities, task arrays - * Nodes and tasks may be added/removed on the fly + * Work-stealing scheduler + * NUMA-aware, core planning, task priorities, task arrays + * Nodes and tasks may be added/removed on the fly * **Scalable** - * Low overhead per task (~100μs) - * Handles hundreds of nodes and millions of tasks - * Output streaming avoids creating many files on network filesystems + * Low overhead per task (~100μs) + * Handles hundreds of nodes and millions of tasks + * Output streaming avoids creating many files on network filesystems * **Easy deployment** - * Single binary, no installation, depends only on *libc* - * No elevated privileges required + * Single binary, no installation, depends only on *libc* + * No elevated privileges required ## Installation @@ -90,35 +87,35 @@ $ hq jobs Before HyperQueue can execute your jobs, it needs to have access to some computational resources. You can provide these by starting HyperQueue *workers* which connect to the server and execute your jobs. -The workers should run on computing nodes, therefore they should be started inside PBS jobs. +The workers should run on computing nodes, therefore they should be started inside Slurm jobs. There are two ways of providing computational resources. -* **Allocate PBS jobs automatically** +* **Allocate Slurm jobs automatically** - HyperQueue can automatically submit PBS jobs with workers on your behalf. This system is called + HyperQueue can automatically submit Slurm jobs with workers on your behalf. This system is called [automatic allocation][c]. 
After the server is started, you can add a new automatic allocation queue using the `hq alloc add` command: ```console - $ hq alloc add pbs -- -qqprod -AAccount1 + $ hq alloc add slurm -- -A<PROJECT-ID> -p qcpu_exp ``` - After you run this command, HQ will automatically start submitting PBS jobs on your behalf + After you run this command, HQ will automatically start submitting Slurm jobs on your behalf once some HQ jobs are submitted. -* **Manually start PBS jobs with HQ workers** +* **Manually start Slurm jobs with HQ workers** - With the following command, you can submit a PBS job that will start a single HQ worker which + With the following command, you can submit a Slurm job that will start a single HQ worker which will connect to a running HQ server. ```console - $ qsub <qsub-params> -- /bin/bash -l -c "$(which hq) worker start" + $ salloc <salloc-params> -- /bin/bash -l -c "$(which hq) worker start" ``` !!! tip For debugging purposes, you can also start the worker e.g. on a login node, simply by running - `$ hq worker start`. Do not use such worker for any long-running computations though. + `$ hq worker start`. Do not use such worker for any long-running computations though! ## Architecture diff --git a/mkdocs.yml b/mkdocs.yml index dd9e663d89c5750228766a9b0e40058f39b0087e..423cabe15f9344286e02cae0a82ae87da85ab083 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -106,10 +106,10 @@ nav: - Resource Accounting Policy: general/resource-accounting.md - Job Priority: general/job-priority.md # - Slurm Job Submission and Execution: general/slurm-job-submission-and-execution.md -# - Capacity Computing: -# - Introduction: general/capacity-computing.md + - Capacity Computing: + - Introduction: general/capacity-computing.md # - Job Arrays: general/job-arrays.md -# - HyperQueue: general/hyperqueue.md + - HyperQueue: general/hyperqueue.md # - Parallel Computing and MPI: general/karolina-mpi.md - Other Services: - OpenCode: general/opencode.md