Commit 7d840475 authored by Jakub Beránek
parent 8914e2f6

Update HyperQueue page for Slurm

Merge request !441: Update HyperQueue page for Slurm
# Capacity Computing

## Introduction

In many cases, it is useful to submit a huge (>100) number of computational jobs into the Slurm queue system.
A huge number of (small) jobs is one of the most effective ways to execute embarrassingly parallel calculations,
achieving the best runtime, throughput, and computer utilization.

However, executing a huge number of jobs via the Slurm queue may strain the system. This strain may
result in slow response to commands, inefficient scheduling, and overall degradation of performance
and user experience for all users.

[//]: # (For this reason, the number of jobs is **limited to 100 jobs per user, 4,000 jobs and subjobs per user, 1,500 subjobs per job array**.)

!!! note
    Follow the procedure below if you wish to schedule more than 100 jobs at a time.

You can use [HyperQueue][1] when running a huge number of jobs. HyperQueue can help efficiently
load balance a large number of jobs amongst available computing nodes, as sketched in the example below.

[1]: hyperqueue.md
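For illustration, a minimal sketch of how a large batch of small computations might be expressed as a single HyperQueue task array rather than many individual Slurm jobs (the `process.sh` script and the task range are placeholders):

```console
$ hq submit --array 1-1000 ./process.sh
```

Each task receives its index in the `HQ_TASK_ID` environment variable, so the script can select its own piece of work while HyperQueue load balances the tasks across the available workers.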
# HyperQueue
HyperQueue lets you build a computation plan consisting of a large number of tasks and then execute it transparently over a system like Slurm/PBS.
It dynamically groups tasks into Slurm jobs and distributes them to fully utilize allocated nodes.
You thus do not have to manually aggregate your tasks into Slurm jobs.

Find more about HyperQueue in its [documentation][a].
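As a quick orientation (a minimal sketch; installation and job submission are covered in the sections below), the basic workflow is to start a server, submit tasks to it, and inspect their state. The `echo` payload is just a placeholder:

```console
$ hq server start &    # or run the server in a separate terminal / tmux session
$ hq submit echo "Hello from HyperQueue"
$ hq jobs
```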
@@ -15,25 +12,25 @@ Find more about HyperQueue in its [documentation][a].
* **Transparent task execution on top of a Slurm/PBS cluster**
    * Automatic task distribution amongst jobs, nodes, and cores
    * Automatic submission of PBS/Slurm jobs
* **Dynamic load balancing across jobs**
    * Work-stealing scheduler
    * NUMA-aware, core planning, task priorities, task arrays
    * Nodes and tasks may be added/removed on the fly
* **Scalable**
    * Low overhead per task (~100μs)
    * Handles hundreds of nodes and millions of tasks
    * Output streaming avoids creating many files on network filesystems
* **Easy deployment**
    * Single binary, no installation, depends only on *libc*
    * No elevated privileges required

## Installation
@@ -90,35 +87,35 @@ $ hq jobs
Before HyperQueue can execute your jobs, it needs to have access to some computational resources.
You can provide these by starting HyperQueue *workers* which connect to the server and execute your jobs.
The workers should run on computing nodes; therefore, they should be started inside Slurm jobs.

There are two ways of providing computational resources.

* **Allocate Slurm jobs automatically**

    HyperQueue can automatically submit Slurm jobs with workers on your behalf. This system is called
    [automatic allocation][c]. After the server is started, you can add a new automatic allocation
    queue using the `hq alloc add` command:

    ```console
    $ hq alloc add slurm -- -A<PROJECT-ID> -p qcpu_exp
    ```

    After you run this command, HQ will automatically start submitting Slurm jobs on your behalf
    once some HQ jobs are submitted.
* **Manually start Slurm jobs with HQ workers**

    With the following command, you can submit a Slurm job that will start a single HQ worker which
    will connect to a running HQ server (an `sbatch`-based variant is sketched right after this list).

    ```console
    $ salloc <salloc-params> -- /bin/bash -l -c "$(which hq) worker start"
    ```
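If you would rather submit a batch job than an interactive allocation, the same worker start command can be wrapped with `sbatch`; a minimal sketch assuming standard Slurm options (replace `<sbatch-params>` with your project and partition):

```console
$ sbatch <sbatch-params> --wrap "$(which hq) worker start"
```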
!!! tip
    For debugging purposes, you can also start the worker e.g. on a login node, simply by running
    `$ hq worker start`. Do not use such a worker for any long-running computations though!
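Whichever way you provide the workers, you can check that they have connected to the server and that any automatic allocation queues are active; a minimal sketch using the HyperQueue CLI:

```console
$ hq worker list
$ hq alloc list
```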
## Architecture
......
@@ -106,10 +106,10 @@ nav:
  - Resource Accounting Policy: general/resource-accounting.md
  - Job Priority: general/job-priority.md
  # - Slurm Job Submission and Execution: general/slurm-job-submission-and-execution.md
  - Capacity Computing:
    - Introduction: general/capacity-computing.md
    # - Job Arrays: general/job-arrays.md
    - HyperQueue: general/hyperqueue.md
    # - Parallel Computing and MPI: general/karolina-mpi.md
  - Other Services:
    - OpenCode: general/opencode.md
......