diff --git a/docs.it4i/general/capacity-computing.md b/docs.it4i/general/capacity-computing.md
index 83eed0879f908ef66febb9f6f9f84a99e117f36f..bf18e17c28588e61bef2ce940cccd1f7449442a0 100644
--- a/docs.it4i/general/capacity-computing.md
+++ b/docs.it4i/general/capacity-computing.md
@@ -10,7 +10,8 @@ However, executing a huge number of jobs via the PBS queue may strain the system
 Follow one of the procedures below, in case you wish to schedule more than 100 jobs at a time.
 
 * Use [Job arrays][1] when running a huge number of multithread (bound to one node only) or multinode (multithread across several nodes) jobs.
-* Use [HyperQueue][3] when running single core jobs.
+* Use [HyperQueue][3] when running a huge number of multithread jobs. HyperQueue can help overcome
+the limits of job arrays.
 
 ## Policy
 
@@ -150,9 +151,22 @@ $ qstat -u $USER -tJ
 
 For more information on job arrays, see the [PBSPro Users guide][6].
 
+### Examples
+
+Download the examples in [capacity.zip][9], illustrating the ways listed above to run a huge number of jobs. We recommend trying out the examples before using this approach for production jobs.
+
+Unzip the archive in an empty directory on the cluster and follow the instructions in the README file:
+
+```console
+$ unzip capacity.zip
+$ cat README
+```
+
 ## HyperQueue
 
-HyperQueue lets you build a computation plan consisting of a large amount of tasks and then execute it transparently over a system like SLURM/PBS. It dynamically groups jobs into SLURM/PBS jobs and distributes them to fully utilize allocated nodes. You thus do not have to manually aggregate your tasks into SLURM/PBS jobs. See the [project repository][a].
+HyperQueue lets you build a computation plan consisting of a large number of tasks and then execute it transparently over a system like SLURM/PBS.
+It dynamically groups tasks into PBS jobs and distributes them to fully utilize allocated nodes.
+You thus do not have to manually aggregate your tasks into PBS jobs. See the [project repository][a].
 
@@ -174,62 +188,74 @@ HyperQueue lets you build a computation plan consisting of a large amount of tas
 Single binary, no installation, depends only on *libc*<br>No elevated privileges required
 
-* **Open source**
-
-### Architecture
+### Installation
 
-
+* On Barbora and Karolina, you can simply load the HyperQueue module:
 
-### Installation
+    `$ ml HyperQueue`
 
-To install/compile HyperQueue, follow the steps on the [official webpage][b].
+* If you want to install/compile HyperQueue manually, follow the steps on the [official webpage][b].
 
-### Submiting a Simple Task
+### Usage
 
+#### Starting the Server
+
+To use HyperQueue, you first have to start the HyperQueue server. It is a long-lived process that
+is supposed to be running on a login node. You can start it with the following command:
 
-* Start server (e.g. on a login node or in a cluster partition)
+    $ hq server start
 
-    `$ hq server start &`
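+
+The server has to stay running for the whole duration of your computations. As a minimal sketch
+(an illustration, not an official recommendation), you can keep it alive after logging out by
+detaching it with standard tools such as `nohup`, or by running it inside a terminal multiplexer
+like `tmux` or `screen`:
+
+    $ nohup hq server start &
+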
+#### Submitting Computation
+
+Once the HyperQueue server is running, you can submit jobs into it. Here are a few examples of
+job submissions. You can find more information in the [documentation][2].
 
-* Submit a job (command `echo 'Hello world'` in this case)
+* Submit a simple job (command `echo 'Hello world'` in this case)
 
     `$ hq submit echo 'Hello world'`
 
-* Ask for computing resources
+* Submit a job with 10000 tasks
 
-    * Start worker manually
+    `$ hq submit --array 1-10000 my-script.sh`
 
-        `$ hq worker start &`
+Once you start some jobs, you can observe their status using the following commands:
 
-    * Automatic resource request
-
-        [Not implemented yet]
+```
+# Display status of a single job
+$ hq job <job-id>
 
-    * Manual request in PBS
+# Display status of all jobs
+$ hq jobs
+```
 
-    * Start worker on the first node of a PBS job
+!!! important
+    Before the jobs can start executing, you have to provide HyperQueue with some computational resources.
 
-        `$ qsub <your-params-of-qsub> -- hq worker start`
+#### Providing Computational Resources
+
+Before HyperQueue can execute your jobs, it needs to have access to some computational resources.
+You can provide these by starting HyperQueue *workers*, which connect to the server and execute
+your jobs. The workers should run on compute nodes, so you can start them using PBS.
 
-    * Start worker on all nodes of a PBS job
+* Start a worker on a single PBS node:
 
-        ``$ qsub <your-params-of-qsub> -- `which pbsdsh` hq worker start``
+    ``$ qsub <qsub-params> -- `which hq` worker start``
 
-* Monitor the state of jobs
+* Start a worker on all allocated PBS nodes:
 
-    `$ hq jobs`
+    ``$ qsub <qsub-params> -- `which pbsdsh` `which hq` worker start``
 
-## Examples
+In an upcoming version, HyperQueue will be able to automatically submit PBS jobs with workers
+on your behalf.
 
-Download the examples in [capacity.zip][9], illustrating the above listed ways to run a huge number of jobs. We recommend trying out the examples before using this for running production jobs.
+!!! tip
+    For debugging purposes, you can also start a worker on a login node simply by running
+    `$ hq worker start`. Do not use such a worker for any long-running computations.
 
-Unzip the archive in an empty directory on cluster and follow the instructions in the README file-
+### Architecture
 
+Here you can see the architecture of HyperQueue. The user submits jobs into the server, which
+schedules them onto a set of workers running on compute nodes.
 
-```console
-$ unzip capacity.zip
-$ cat README
-```
+
 
 [1]: #job-arrays
+[2]: https://it4innovations.github.io/hyperqueue/jobs/
 [3]: #hyperqueue
 [5]: #shared-jobscript
 [6]: ../pbspro.md
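+
+As a final illustrative sketch (a hypothetical session; `<qsub-params>` and `my-script.sh` are
+placeholders to adapt), the pieces above combine like this:
+
+    $ hq server start                                # on a login node
+    $ qsub <qsub-params> -- `which hq` worker start  # provide a worker via PBS
+    $ hq submit --array 1-1000 my-script.sh          # submit a job with 1000 tasks
+    $ hq jobs                                        # monitor job states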