Commit 4dfc7ae8 authored by Lukáš Krupčík

Merge branch 'capacity-computing' into 'master'

Update information in capacity-computing.md to match Salomon cluster.

See merge request !190
parents e901280a 0dcf498c
@@ -16,7 +16,7 @@ However, executing huge number of jobs via the PBS queue may strain the system.
## Policy
1. A user is allowed to submit at most 100 jobs. Each job may be [a job array](capacity-computing/#job-arrays).
1. The array size is at most 1000 subjobs.
1. The array size is at most 1500 subjobs.
## Job Arrays
@@ -53,7 +53,7 @@ Then we create jobscript:
#PBS -q qprod
#PBS -l select=1:ncpus=24,walltime=02:00:00
# change to local scratch directory
# change to scratch directory
SCR=/scratch/work/user/$USER/$PBS_JOBID
mkdir -p $SCR ; cd $SCR || exit
@@ -70,7 +70,7 @@ cp $PBS_O_WORKDIR/$TASK input ; cp $PBS_O_WORKDIR/myprog.x .
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, the submit directory holds the 900 input files, executable myprog.x and the jobscript file. As input for each run, we take the filename of input file from created tasklist file. We copy the input file to scratch /scratch/work/user/$USER/$PBS_JOBID, execute the myprog.x and copy the output file back to the submit directory, under the $TASK.out name. The myprog.x runs on one node only and must use threads to run in parallel. Be aware, that if the myprog.x **is not multithreaded**, then all the **jobs are run as single thread programs in sequential** manner. Due to allocation of the whole node, the **accounted time is equal to the usage of whole node**, while using only 1/24 of the node!
In this example, the submit directory holds the 900 input files, executable myprog.x and the jobscript file. As input for each run, we take the filename of input file from created tasklist file. We copy the input file to scratch (/scratch/work/user/$USER/$PBS_JOBID), execute the myprog.x and copy the output file back to the submit directory, under the $TASK.out name. The myprog.x runs on one node only and must use threads to run in parallel. Be aware, that if the myprog.x **is not multithreaded**, then all the **jobs are run as single thread programs in sequential** manner. Due to allocation of the whole node, the **accounted time is equal to the usage of whole node**, while using only 1/24 of the node!
If a huge number of parallel multicore jobs (multinode, multithreaded, e.g. MPI-enabled) needs to be run, a job array approach should also be used. The main difference from the previous single-node example is that the local scratch should not be used (as it is not shared between nodes) and MPI or another technique for parallel multinode runs has to be used properly.
@@ -154,12 +154,12 @@ Read more on job arrays in the [PBSPro Users guide](../pbspro/).
!!! note
Use GNU parallel to run many single core tasks on one node.
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. GNU parallel is most useful in running single core jobs via the queue system on Anselm.
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. GNU parallel is most useful in running single core jobs via the queue system on the cluster.
For more information and examples see the parallel man page:
```console
$ module add parallel
$ ml parallel
$ man parallel
```
@@ -186,9 +186,9 @@ Then we create jobscript:
#PBS -l select=1:ncpus=24,walltime=02:00:00
[ -z "$PARALLEL_SEQ" ] &&
{ module add parallel ; exec parallel -a $PBS_O_WORKDIR/tasklist $0 ; }
{ ml parallel ; exec parallel -a $PBS_O_WORKDIR/tasklist $0 ; }
# change to local scratch directory
# change to scratch directory
SCR=/scratch/work/user/$USER/$PBS_JOBID/$PARALLEL_SEQ
mkdir -p $SCR ; cd $SCR || exit
@@ -205,7 +205,7 @@ cat input > output
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, tasks from tasklist are executed via GNU parallel. The jobscript executes multiple instances of itself in parallel, on all cores of the node. Once an instance of the jobscript finishes, a new instance starts, until all entries in tasklist are processed. The currently processed entry of the tasklist may be retrieved via the $1 variable. Variable $TASK expands to one of the input filenames from tasklist. We copy the input file to local scratch, execute myprog.x and copy the output file back to the submit directory, under the $TASK.out name.
In this example, tasks from tasklist are executed via GNU parallel. The jobscript executes multiple instances of itself in parallel, on all cores of the node. Once an instance of the jobscript finishes, a new instance starts, until all entries in tasklist are processed. The currently processed entry of the tasklist may be retrieved via the $1 variable. Variable $TASK expands to one of the input filenames from tasklist. We copy the input file to the scratch, execute myprog.x and copy the output file back to the submit directory, under the $TASK.out name.
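The guard at the top of the jobscript decides whether the script is being started by PBS or re-invoked by GNU parallel. A minimal sketch of that decision in isolation, assuming only that GNU parallel exports `PARALLEL_SEQ` (the job sequence number) to every command it starts; the function name `jobscript_role` is hypothetical and used purely for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the `[ -z "$PARALLEL_SEQ" ] && { ... }` guard.
jobscript_role() {
    if [ -z "$PARALLEL_SEQ" ]; then
        # First invocation (by PBS): the real jobscript would exec parallel here.
        echo "launcher"
    else
        # Re-invocation by GNU parallel: the real jobscript would process one task.
        echo "worker $PARALLEL_SEQ"
    fi
}

# Subshells keep the variable changes from leaking into the caller.
( unset PARALLEL_SEQ; jobscript_role )
( PARALLEL_SEQ=3; jobscript_role )
```

Because the launcher branch uses `exec`, the original shell is replaced by parallel, so the task-processing part of the script only ever runs in the worker instances.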
### Submit the Job
@@ -237,7 +237,7 @@ Combined approach, very similar to job arrays, can be taken. Job array is submit
Example:
Assume we have 992 input files with name beginning with "file" (e. g. file001, ..., file992). Assume we would like to use each of these input files with program executable myprog.x, each as a separate single core job. We call these single core jobs tasks.
Assume we have 960 input files with name beginning with "file" (e. g. file001, ..., file960). Assume we would like to use each of these input files with program executable myprog.x, each as a separate single core job. We call these single core jobs tasks.
First, we create a tasklist file, listing all tasks - all input files in our example:
@@ -248,7 +248,7 @@ $ find . -name 'file*' > tasklist
Next we create a file, controlling how many tasks will be executed in one subjob
```console
$ seq 32 > numtasks
$ seq 48 > numtasks
```
Then we create jobscript:
@@ -260,9 +260,9 @@ Then we create jobscript:
#PBS -l select=1:ncpus=24,walltime=02:00:00
[ -z "$PARALLEL_SEQ" ] &&
{ module add parallel ; exec parallel -a $PBS_O_WORKDIR/numtasks $0 ; }
{ ml parallel ; exec parallel -a $PBS_O_WORKDIR/numtasks $0 ; }
# change to local scratch directory
# change to scratch directory
SCR=/scratch/work/user/$USER/$PBS_JOBID/$PARALLEL_SEQ
mkdir -p $SCR ; cd $SCR || exit
@@ -281,7 +281,7 @@ cat input > output
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, the jobscript executes in multiple instances in parallel, on all cores of a computing node. Variable $TASK expands to one of the input filenames from tasklist. We copy the input file to local scratch, execute myprog.x and copy the output file back to the submit directory, under the $TASK.out name. The numtasks file controls how many tasks will be run per subjob. Once a task is finished, a new task starts, until the number of tasks in the numtasks file is reached.
In this example, the jobscript executes in multiple instances in parallel, on all cores of a computing node. Variable $TASK expands to one of the input filenames from tasklist. We copy the input file to the scratch, execute myprog.x and copy the output file back to the submit directory, under the $TASK.out name. The numtasks file controls how many tasks will be run per subjob. Once a task is finished, a new task starts, until the number of tasks in the numtasks file is reached.
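The jobscript lines that select the input file for each task are elided from this diff. One plausible mapping, shown here only as an assumption, combines the subjob's starting index (PBS exposes it as `PBS_ARRAY_INDEX`) with GNU parallel's `PARALLEL_SEQ` to obtain a line number in tasklist. The helper name `task_line` is hypothetical:

```bash
#!/usr/bin/env bash
# Assumed mapping (not shown in the diff): which tasklist line a task reads.
# array_index  - first index of the subjob, e.g. 1, 49, 97, ... for -J 1-960:48
# parallel_seq - position of the task within the subjob, 1..48 from numtasks
task_line() {
    local array_index=$1 parallel_seq=$2
    echo $(( array_index + parallel_seq - 1 ))
}

task_line 1 1      # first task of the first subjob  -> 1
task_line 49 1     # first task of the second subjob -> 49
task_line 913 48   # last task of the last subjob    -> 960
```

Under this assumption every line of tasklist from 1 to 960 is visited exactly once, which is why the array step and the length of numtasks have to agree.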
!!! note
Select subjob walltime and number of tasks per subjob carefully
@@ -294,14 +294,14 @@ When deciding this values, think about following guiding rules :
### Submit the Job Array (-J)
To submit the job array, use the qsub -J command. The 992 tasks' job of the [example above](capacity-computing/#combined_example) may be submitted like this:
To submit the job array, use the qsub -J command. The 960 tasks' job of the [example above](capacity-computing/#combined_example) may be submitted like this:
```console
$ qsub -N JOBNAME -J 1-992:32 jobscript
$ qsub -N JOBNAME -J 1-960:48 jobscript
12345[].dm2
```
In this example, we submit a job array of 31 subjobs. Note the -J 1-992:**48**, this must be the same as the number sent to numtasks file. Each subjob will run on full node and process 24 input files in parallel, 48 in total per subjob. Every subjob is assumed to complete in less than 2 hours.
In this example, we submit a job array of 20 subjobs. Note the -J 1-960:48, this must be the same as the number sent to numtasks file. Each subjob will run on full node and process 24 input files in parallel, 48 in total per subjob. Every subjob is assumed to complete in less than 2 hours.
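The subjob counts quoted here follow from the array range and step. A small sketch for checking the arithmetic before submitting; the helper name `subjob_count` is hypothetical and not part of PBS:

```bash
#!/usr/bin/env bash
# Hypothetical helper: number of subjobs created by `qsub -J 1-TOTAL:STEP`.
subjob_count() {
    local total=$1 step=$2
    # Ceiling division: a trailing partial chunk still needs its own subjob.
    echo $(( (total + step - 1) / step ))
}

subjob_count 960 48   # prints 20 (new values in this commit)
subjob_count 992 32   # prints 31 (previous values)
```

With 24 cores per node and 48 tasks per subjob, each subjob processes two consecutive rounds of 24 parallel tasks.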
!!! note
Use #PBS directives at the beginning of the jobscript file, and don't forget to set your valid PROJECT_ID and desired queue.
@@ -310,7 +310,7 @@ In this example, we submit a job array of 31 subjobs. Note the -J 1-992:**48**,
Download the examples in [capacity.zip](capacity.zip), illustrating the ways listed above to run a huge number of jobs. We recommend trying out the examples before using these approaches for production jobs.
Unzip the archive in an empty directory on Anselm and follow the instructions in the README file
Unzip the archive in an empty directory on the cluster and follow the instructions in the README file
```console
$ unzip capacity.zip
......