Documented HyperQueue
3 files changed: +39 −129
@@ -10,8 +10,7 @@ However, executing a huge number of jobs via the PBS queue may strain the system
* Use [Job arrays][1] when running a huge number of [multithread][2] (bound to one node only) or multinode (multithread across several nodes) jobs.
@@ -151,162 +150,73 @@ $ qstat -u $USER -tJ
HyperQueue lets you build a computation plan consisting of a large number of tasks and then execute it transparently over a system like SLURM/PBS. It dynamically groups tasks into SLURM/PBS jobs and distributes them to fully utilize allocated nodes. You thus do not have to manually aggregate your tasks into SLURM/PBS jobs. See the [project repository][a].
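As a sketch, a HyperQueue workflow may look like the session below. The command names follow the HyperQueue CLI; the `--each-line` option and the `HQ_ENTRY` variable are assumptions based on current HyperQueue releases, so verify them with `hq submit --help` on your system. The `tasklist` file and `myprog.x` binary are illustrative placeholders.

```shell
# Start the HyperQueue server (e.g. on a login node)
hq server start &

# Provide workers, e.g. by starting one inside a PBS job
qsub -l select=1:ncpus=16 -- hq worker start

# Submit one task per line of a task list; each task reads its
# line from the HQ_ENTRY environment variable (assumed behavior)
hq submit --each-line tasklist bash -c './myprog.x "$HQ_ENTRY"'
```

HyperQueue then schedules the tasks onto whatever workers are available, so the PBS-level job count stays small regardless of the task count.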
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. GNU parallel is most useful when running single-core jobs via the queue systems.
GNU parallel executes multiple instances of the jobscript, using all cores on the node. The instances execute different work, controlled by the `$PARALLEL_SEQ` variable.
Assume we have 101 input files whose names begin with "file" (e.g. file001, ..., file101), and that we would like to process each of these input files with the myprog.x executable, each as a separate single-core job. We call these single-core jobs tasks.
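The task list is simply a text file with one input filename per line. A minimal sketch, using the filenames from the example (the dummy-file creation is for illustration only; in practice the input files already exist):

```shell
# Create the 101 dummy input files file001..file101 (illustration only)
for i in $(seq -w 1 101); do touch "file$i"; done

# Build the tasklist: one input filename per line
ls file??? | sort > tasklist

wc -l < tasklist   # 101 entries, one task per input file
```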
In this example, tasks from the tasklist are executed via GNU parallel. The jobscript executes multiple instances of itself in parallel, on all cores of the node. Once an instance of the jobscript finishes, a new instance starts, until all entries in the tasklist are processed. The currently processed entry of the tasklist may be retrieved via the `$1` variable. The `$TASK` variable expands to one of the input filenames from the tasklist. We copy the input file to local scratch, execute myprog.x, and copy the output file back to the submit directory under the $TASK.out name.
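A hedged sketch of such a jobscript follows. The `PROJECT_ID` placeholder, queue name `qprod`, and the local scratch path `/lscratch/$PBS_JOBID` are assumptions to be adapted to your site; the re-exec-through-parallel pattern matches the description above.

```shell
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=16,walltime=02:00:00

# On first invocation, re-execute this script once per tasklist entry,
# 16 instances at a time; each instance receives its entry as $1
[ -z "$PARALLEL_SEQ" ] && exec parallel -j 16 -a "$PBS_O_WORKDIR/tasklist" "$0"

TASK="$1"

# Work in local scratch (assumed path), then copy the result back
SCR="/lscratch/$PBS_JOBID/$TASK"
mkdir -p "$SCR" && cd "$SCR"
cp "$PBS_O_WORKDIR/$TASK" input
cp "$PBS_O_WORKDIR/myprog.x" .
./myprog.x < input > output
cp output "$PBS_O_WORKDIR/$TASK.out"
```

The guard on `$PARALLEL_SEQ` works because GNU parallel sets that variable in the instances it spawns, so only the initial invocation launches parallel.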
To submit the job, use the `qsub` command. The 101-task job of the [example above][7] may be submitted as follows:
In this example, we submit a job of 101 tasks. 16 input files will be processed in parallel. The 101 tasks on 16 cores are assumed to complete in less than 2 hours.
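For instance, with a valid `PROJECT_ID` and a production queue (the queue name `qprod` below is an assumption, substitute your site's queue), the submission may look like:

```shell
qsub -A PROJECT_ID -q qprod -l select=1:ncpus=16,walltime=02:00:00 jobscript
```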
Use #PBS directives at the beginning of the jobscript file; do not forget to set a valid `PROJECT_ID` and the desired queue.
While job arrays are able to utilize all available computational nodes, GNU parallel can be used to efficiently run multiple single-core jobs on a single node. The two approaches may be combined to utilize all available (current and future) resources to execute single-core jobs.
A combined approach, very similar to job arrays, can be taken. A job array is submitted to the queuing system. The subjobs run GNU parallel, which executes multiple instances of the jobscript using all of the cores on the node. The instances execute different work, controlled by the `$PBS_JOB_ARRAY` and `$PARALLEL_SEQ` variables.
Assume we have 992 input files whose names begin with "file" (e.g. file001, ..., file992), and that we would like to process each of these input files with the myprog.x executable, each as a separate single-core job. We call these single-core jobs tasks.
In this example, the jobscript executes multiple instances of itself in parallel, on all cores of a computing node. The `$TASK` variable expands to one of the input filenames from the tasklist. We copy the input file to local scratch, execute myprog.x, and copy the output file back to the submit directory, under the $TASK.out name. The numtasks file controls how many tasks will be run per subjob. Once a task finishes, a new task starts, until the number of tasks in the numtasks file is reached.
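A hedged sketch of the combined jobscript follows. The `PROJECT_ID`, queue name, and `/lscratch/$PBS_JOBID` scratch path are site-specific assumptions; `$PBS_ARRAY_INDEX` is the standard PBS Pro array-index variable, and the numtasks file is assumed to hold 32 as in the example below.

```shell
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=16,walltime=02:00:00

# On first invocation, run one instance per task of this subjob,
# 16 at a time; numtasks holds the tasks-per-subjob count (32 here)
[ -z "$PARALLEL_SEQ" ] && \
  exec parallel -j 16 "$0" ::: $(seq "$(cat "$PBS_O_WORKDIR/numtasks")")

# Map subjob index + instance number to one line of the tasklist
IDX=$(( PBS_ARRAY_INDEX + $1 - 1 ))
TASK=$(sed -n "${IDX}p" "$PBS_O_WORKDIR/tasklist")
[ -z "$TASK" ] && exit 0   # past the end of the tasklist

SCR="/lscratch/$PBS_JOBID/$TASK"   # assumed local scratch path
mkdir -p "$SCR" && cd "$SCR"
cp "$PBS_O_WORKDIR/$TASK" input
cp "$PBS_O_WORKDIR/myprog.x" .
./myprog.x < input > output
cp output "$PBS_O_WORKDIR/$TASK.out"
```

With `-J 1-992:32`, the array indices are 1, 33, 65, ..., so subjob 1 processes tasklist lines 1..32, subjob 33 processes lines 33..64, and so on.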
1. Let n = N/16, where N is the number of tasks per subjob, T is the expected single-task walltime, and W is the subjob walltime. The inequality (n+1) \* T < W should hold. A short subjob walltime improves scheduling and job throughput.
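As a quick check with the numbers from the example below (32 tasks per subjob on 16 cores; the single-task walltime of 30 minutes is an assumed figure for illustration):

```shell
N=32    # tasks per subjob
T=30    # expected single-task walltime in minutes (assumed)
W=120   # subjob walltime in minutes (2 hours)

n=$(( N / 16 ))                  # tasks per core: 2
echo "(n+1)*T = $(( (n + 1) * T )) min"   # 90 min, safely below W = 120 min
```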
To submit the job array, use the `qsub -J` command. The 992-task job array of the [example above][8] may be submitted like this:
In this example, we submit a job array of 31 subjobs. Note the -J 1-992:**32**; the step value must match the number written to the numtasks file. Each subjob will run on one full node and process 16 input files in parallel, 32 in total per subjob. Every subjob is assumed to complete in less than 2 hours.
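With placeholders from the example (the queue name `qprod` is an assumption, substitute your site's queue), the submission may look like:

```shell
qsub -A PROJECT_ID -q qprod -l select=1:ncpus=16,walltime=02:00:00 -J 1-992:32 jobscript
```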
Use #PBS directives at the beginning of the jobscript file; do not forget to set a valid PROJECT_ID and the desired queue.
@@ -321,10 +231,10 @@ $ cat README