# Capacity Computing

## Introduction
In many cases, it is useful to submit a huge (>100) number of computational jobs into the PBS queue system. A huge number of (small) jobs is one of the most effective ways to execute embarrassingly parallel calculations, achieving the best runtime, throughput, and computer utilization.
However, executing a huge number of jobs via the PBS queue may strain the system. This strain may result in slow response to commands, inefficient scheduling, and overall degradation of performance and user experience for all users. For this reason, the number of jobs is limited to 100 jobs per user, 4000 jobs and subjobs per user, and 1500 subjobs per job array.
!!! note
    Follow one of the procedures below if you wish to schedule more than 100 jobs at a time.
- Use job arrays when running a huge number of multithread (bound to one node only) or multinode (multithread across several nodes) jobs.
- Use GNU parallel when running single-core jobs.
- Combine GNU parallel with job arrays when running a huge number of single-core jobs.
## Policy
- A user is allowed to submit at most 100 jobs. Each job may be a job array.
- The array size is at most 1000 subjobs.
## Job Arrays
!!! note
    A huge number of jobs may easily be submitted and managed as a job array.
A job array is a compact representation of many jobs called subjobs. Subjobs share the same job script, and have the same values for all attributes and resources, with the following exceptions:
- each subjob has a unique index, `$PBS_ARRAY_INDEX`
- job identifiers of subjobs only differ by their indices
- the state of subjobs can differ (R, Q, etc.)
All subjobs within a job array have the same scheduling priority and schedule as independent jobs. An entire job array is submitted through a single `qsub` command and may be managed by the `qdel`, `qalter`, `qhold`, `qrls`, and `qsig` commands as a single job.
### Shared Jobscript
All subjobs in a job array use the very same single jobscript. Each subjob runs its own instance of the jobscript. The instances execute different work controlled by the `$PBS_ARRAY_INDEX` variable.
Example:
Assume we have 900 input files, each with a name beginning with "file" (e.g. file001, ..., file900). Assume we would like to use each of these input files with the myprog.x executable, each as a separate job.
First, we create a tasklist file (or subjobs list), listing all tasks (subjobs) - all input files in our example:
```console
$ find . -name 'file*' > tasklist
```
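Before submitting, it can help to confirm that the tasklist contains one line per input file, since each line becomes one subjob. A minimal sketch, using hypothetical stand-in file names in a temporary directory:

```shell
# Hypothetical sketch: build a tasklist from sample input files and count tasks.
cd "$(mktemp -d)"
touch file001 file002 file003     # stand-ins for the real input files
find . -name 'file*' > tasklist   # same command as in the example above
wc -l < tasklist                  # one line per task (subjob)
```

The task count should match the upper bound of the `-J` index range used at submission time.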
Then we create the jobscript for the Anselm cluster:

```bash
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=16,walltime=02:00:00

# change to local scratch directory
SCR=/lscratch/$PBS_JOBID
mkdir -p $SCR ; cd $SCR || exit

# get individual task from tasklist with index from PBS JOB ARRAY
TASK=$(sed -n "${PBS_ARRAY_INDEX}p" $PBS_O_WORKDIR/tasklist)

# copy input file and executable to scratch
cp $PBS_O_WORKDIR/$TASK input ; cp $PBS_O_WORKDIR/myprog.x .

# execute the calculation
./myprog.x < input > output

# copy output file to submit directory
cp output $PBS_O_WORKDIR/$TASK.out
```
Then we create the jobscript for the Salomon cluster:

```bash
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=24,walltime=02:00:00

# change to scratch directory
SCR=/scratch/work/user/$USER/$PBS_JOBID
mkdir -p $SCR ; cd $SCR || exit

# get individual task from tasklist with index from PBS JOB ARRAY
TASK=$(sed -n "${PBS_ARRAY_INDEX}p" $PBS_O_WORKDIR/tasklist)

# copy input file and executable to scratch
cp $PBS_O_WORKDIR/$TASK input ; cp $PBS_O_WORKDIR/myprog.x .

# execute the calculation
./myprog.x < input > output

# copy output file to submit directory
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, the submit directory holds the 900 input files, the myprog.x executable, and the jobscript file. As the input for each run, we take the filename of the input file from the created tasklist file. We copy the input file to the scratch directory (`/lscratch/$PBS_JOBID` on Anselm), execute `myprog.x`, and copy the output file back to the submit directory under the `$TASK.out` name. The myprog.x executable runs on one node only and must use threads to run in parallel. Be aware that if myprog.x is not multithreaded, then all the jobs run as single-thread programs in a sequential manner. Due to the allocation of the whole node, the accounted time is equal to the usage of the whole node, even though only 1/16 of the node is used.
If running a huge number of parallel multicore jobs (in the sense of multinode, multithreaded, e.g. MPI-enabled, jobs) is needed, then a job array approach should still be used. The main difference compared to the previous single-node examples is that the local scratch memory should not be used (as it is not shared between nodes) and MPI or other techniques for parallel multinode processing have to be used properly.
### Submit the Job Array
To submit the job array, use the `qsub -J` command. The 900 jobs of the example above may be submitted like this on Anselm:

```console
$ qsub -N JOBNAME -J 1-900 jobscript
12345[].dm2
```
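Before submitting all 900 subjobs, the jobscript's task-selection logic can be dry-run locally by looping over a few indices yourself. A hypothetical sketch, with no PBS involved:

```shell
# Hypothetical local dry-run: emulate what two subjobs would do, without PBS.
cd "$(mktemp -d)"
printf 'file001\nfile002\n' > tasklist
for PBS_ARRAY_INDEX in 1 2; do
  # same extraction step as in the jobscript
  TASK=$(sed -n "${PBS_ARRAY_INDEX}p" tasklist)
  echo "subjob $PBS_ARRAY_INDEX would process $TASK"
done
```

This catches mistakes in the tasklist or the index range cheaply, before any cluster time is spent.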