Commit f1c79e67 authored by Jan Siwiec

Update job-submission-and-execution.md

parent edd9973a
@@ -11,10 +11,7 @@ When allocating computational resources for the job, specify:

1. your Project ID
1. a Jobscript or interactive switch

Submit the job using the `qsub` command:

```console
$ qsub -A Project_ID -q queue -l select=x:ncpus=y,walltime=[[hh:]mm:]ss[.ms] jobscript
@@ -68,7 +65,7 @@ $ qsub -m n

### Salomon - Intel Xeon Phi Co-Processors
To allocate a node with a Xeon Phi co-processor, the user needs to specify it in the select statement. Currently, only allocation of whole nodes with both Phi cards as the smallest chunk is supported. The standard PBS Pro approach through the `accelerator`, `naccelerators`, and `accelerator_model` attributes is used. The `accelerator_model` attribute can be omitted, since only one accelerator type/model is available on Salomon.
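For instance, a whole node with both Phi cards might be requested as in the hedged sketch below; the `accelerator_model` value is an assumption, check `pbsnodes -a` for the value reported on your system:

```console
$ qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 ./myjob
```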
The absence of a specialized queue for accessing nodes with the cards means that the Phi cards can be utilized in any queue, including qexp for testing/experiments, qlong for longer jobs, and qfree after the project resources have been spent. The Phi cards are thus also available to PRACE users. There is no need to ask for permission to utilize the Phi cards in project proposals.

```console
@@ -111,7 +108,7 @@ exec_vnode = (r21u05n581-mic0:naccelerators=1:ncpus=0)

Per NUMA node allocation.
Jobs are isolated by cpusets.

The UV2000 (node uv1) offers 3TB of RAM and 104 cores, distributed in 13 NUMA nodes. A NUMA node packs 8 cores and approx. 247GB of RAM (with the exception of node 11, which has only 123GB of RAM). In PBS, the UV2000 provides 13 chunks, one chunk per NUMA node (see [Resource allocation policy][2]). Jobs on the UV2000 are isolated from each other by cpusets, so that a job by one user may not utilize CPU or memory allocated to a job by another user. Full chunks are always allocated; a job may only use resources of the NUMA nodes allocated to itself.
```console
$ qsub -A OPEN-0-0 -q qfat -l select=13 ./myjob

@@ -139,7 +136,7 @@ In this example, we allocate 2000GB of memory and 16 cores on the UV2000 for 48

### Useful Tricks

All qsub options may be [saved directly into the jobscript][1]. In such a case, no options to qsub are needed.
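For illustration, the options might be embedded as `#PBS` directives at the top of the jobscript, as in this hedged sketch (project ID, queue, job name, and resource values are placeholders):

```bash
#!/bin/bash
#PBS -N MYJOB
#PBS -A OPEN-0-0
#PBS -q qprod
#PBS -l select=4:ncpus=16,walltime=03:00:00

# the calculation itself follows here
```

With the directives in place, the plain `qsub ./myjob` invocation shown below needs no further options.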
```console
$ qsub ./myjob

@@ -186,9 +183,9 @@ In this example, we allocate 4 nodes, 16 cores per node, selecting only the node

### Anselm - Placement by IB Switch

Groups of computational nodes are connected to chassis-integrated InfiniBand switches. These switches form the leaf switch layer of the [InfiniBand network][3] fat tree topology. Nodes sharing a leaf switch can communicate most efficiently. Sharing the same switch prevents hops in the network and facilitates unbiased, highly efficient network communication.

Nodes sharing the same switch may be selected via the PBS resource attribute `ibswitch`. Values of this attribute are `iswXX`, where `XX` is the switch number. The node-switch mapping can be seen in the [Hardware Overview][4] section.

We recommend allocating compute nodes to a single switch when the best possible computational network performance is required to run the job efficiently.
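For instance, a request pinned to one switch might look like the hedged sketch below; the switch name `isw11` is illustrative only:

```console
$ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=16:ibswitch=isw11 ./myjob
```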
@@ -211,7 +208,7 @@ Nodes directly connected to the same InfiniBand switch can communicate most effic

!!! note
    We recommend allocating compute nodes of a single switch when the best possible computational network performance is required to run the job efficiently.

Nodes directly connected to the same InfiniBand switch can be allocated using node grouping on the PBS resource attribute `switch`.

In this example, we request all 9 nodes directly connected to the same switch using node grouping placement.
@@ -224,7 +221,7 @@ $ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24 -l place=group=switch ./myjob

!!! note
    Not useful for ordinary computing, suitable for testing and management tasks.

Nodes directly connected to a specific InfiniBand switch can be selected using the PBS resource attribute `switch`.

In this example, we request all 9 nodes directly connected to the r4i1s0sw1 switch.
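A hedged sketch of such a request, using the `switch` attribute named above (queue and core count follow the previous Salomon examples):

```console
$ qsub -A OPEN-0-0 -q qprod -l select=9:ncpus=24:switch=r4i1s0sw1 ./myjob
```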
@@ -273,7 +270,7 @@ Nodes located in the same dimension group may be allocated using node grouping o

| 6D | ehc_6d | 432,576 |
| 7D | ehc_7d | all |

In this example, we allocate 16 nodes in the same [hypercube dimension][5] 1 group.

```console
$ qsub -A OPEN-0-0 -q qprod -l select=16:ncpus=24 -l place=group=ehc_1d -I

@@ -335,7 +332,7 @@ Although this example is somewhat artificial, it demonstrates the flexibility of

## Job Management
!!! note
    Check the status of your jobs using the `qstat` and `check-pbs-jobs` commands.

```console
$ qstat -a

@@ -414,21 +411,21 @@ Run loop 3

In this example, we see the actual output (some iteration loops) of the job 35141.dm2.
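Such output might be requested with an invocation along the lines of the hedged sketch below; the `--jobid` and `--print-job-out` options are assumptions about the `check-pbs-jobs` interface:

```console
$ check-pbs-jobs --jobid 35141.dm2 --print-job-out
```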
!!! note
    Manage your queued or running jobs using the `qhold`, `qrls`, `qdel`, `qsig`, or `qalter` commands.

You may release your allocation at any time, using the `qdel` command:

```console
$ qdel 12345.srv11
```

You may kill a running job by force, using the `qsig` command:

```console
$ qsig -s 9 12345.srv11
```
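Attributes of a queued job can likewise be modified with `qalter`; for example, to reduce the requested walltime of a waiting job (a hedged sketch, the job ID is illustrative):

```console
$ qalter -l walltime=01:00:00 12345.srv11
```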
Learn more by reading the PBS man page:

```console
$ man pbs_professional

@@ -441,7 +438,7 @@ $ man pbs_professional

!!! note
    Prepare the jobscript to run batch jobs in the PBS queue system.

The jobscript is a user-made script controlling a sequence of commands for executing the calculation. It is often written in bash, though other scripting languages may be used as well. The jobscript is supplied to the PBS `qsub` command as an argument and is executed by the PBS Professional workload manager.
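A minimal jobscript might look like the hedged sketch below; the module name and the executable are placeholders, not part of the original examples:

```bash
#!/bin/bash

# load the required software environment (placeholder module)
module load intel

# run the calculation (placeholder executable)
./myprog.x
```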
!!! note
    The jobscript or interactive shell is executed on the first of the allocated nodes.

@@ -474,7 +471,7 @@ $ pwd

In this example, 4 nodes were allocated interactively for 1 hour via the qexp queue. The interactive shell is executed in the home directory.

!!! note
    All nodes within the allocation may be accessed via SSH. Unallocated nodes are not accessible to the user.

The allocated nodes are accessible via SSH from login nodes. The nodes may access each other via SSH as well.
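For illustration, the allocated nodes can be listed from within the job via $PBS_NODEFILE and then reached over SSH (a hedged sketch; the node names are illustrative):

```console
$ cat $PBS_NODEFILE
cn17.bullx
cn108.bullx
$ ssh cn108
```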
@@ -538,12 +535,12 @@ exit

In this example, a directory in /home holds the input file input and the mympiprog.x executable. We create the myjob directory on the /scratch filesystem, copy the input and executable files from the /home directory where qsub was invoked ($PBS_O_WORKDIR) to /scratch, execute the MPI program mympiprog.x, and copy the output file back to the /home directory. mympiprog.x is executed as one process per node, on all allocated nodes.

!!! note
    Consider preloading inputs and executables onto [shared scratch][6] memory before the calculation starts.

In some cases, it may be impractical to copy the inputs to the scratch memory and the outputs to the home directory. This is especially true when very large input and output files are expected, or when the files should be reused by a subsequent calculation. In such cases, it is the users' responsibility to preload the input files on the shared /scratch memory before the job submission, and to retrieve the outputs manually after all calculations are finished.
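For instance, preloading might be done from a login node along these lines, assuming the /scratch/$USER/myjob layout used in the example below:

```console
$ mkdir -p /scratch/$USER/myjob
$ cp input mympiprog.x /scratch/$USER/myjob/
```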
!!! note
    Store the qsub options within the jobscript. Use the `mpiprocs` and `ompthreads` qsub options to control the MPI job execution.

### Example Jobscript for MPI Calculation With Preloaded Inputs

@@ -570,16 +567,16 @@ mpirun ./mympiprog.x
exit
```

In this example, input and executable files are assumed to be preloaded manually in the /scratch/$USER/myjob directory. Note the `mpiprocs` and `ompthreads` qsub options controlling the behavior of the MPI execution. mympiprog.x is executed as one process per node, on all 100 allocated nodes. If mympiprog.x implements OpenMP threads, it will run 16 threads per node.
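The resource request producing this layout might be written as the following hedged jobscript directive; the node and core counts mirror the example above:

```bash
#PBS -l select=100:ncpus=16:mpiprocs=1:ompthreads=16
```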
More information can be found in the [Running OpenMPI][7] and [Running MPICH2][8] sections.

### Example Jobscript for Single Node Calculation

!!! note
    The local scratch directory is often useful for single node jobs. Local scratch memory will be deleted immediately after the job ends.

Example jobscript for single node calculation, using [local scratch][6] memory on the node:

```bash
#!/bin/bash
@@ -605,12 +602,14 @@ In this example, a directory in /home holds the input file input and executable

### Other Jobscript Examples

Further jobscript examples may be found in the software section and the [Capacity computing][9] section.

[1]: #example-jobscript-for-mpi-calculation-with-preloaded-inputs
[2]: resources-allocation-policy.md
[3]: ../anselm/network.md
[4]: ../anselm/hardware-overview.md
[5]: ../salomon/7d-enhanced-hypercube
[6]: ../anselm/storage.md
[7]: ../software/mpi/running_openmpi.md
[8]: ../software/mpi/running-mpich2.md
[9]: capacity-computing.md