...
 
Commits (1)
# Capacity Computing
## Introduction
In many cases, it is useful to submit a huge (>100+) number of computational jobs into the PBS queue system. A huge number of (small) jobs is one of the most effective ways to execute embarrassingly parallel calculations, achieving the best runtime, throughput, and computer utilization.
However, executing a huge number of jobs via the PBS queue may strain the system. This strain may result in slow response to commands, inefficient scheduling, and overall degradation of performance and user experience, for all users. For this reason, the number of jobs is **limited to 100 per user, 1000 per job array**
!!! note
Please follow one of the procedures below, in case you wish to schedule more than 100 jobs at a time.
* Use [Job arrays][1] when running a huge number of [multithread][2] (bound to one node only) or multinode (multithread across several nodes) jobs
* Use [GNU parallel][3] when running single core jobs
* Combine [GNU parallel with Job arrays][4] when running huge number of single core jobs
## Policy
1. A user is allowed to submit at most 100 jobs. Each job may be [a job array][1].
1. The array size is at most 1000 subjobs.
## Job Arrays
!!! note
A huge number of jobs may easily be submitted and managed as a job array.
A job array is a compact representation of many jobs, called subjobs. The subjobs share the same job script, and have the same values for all attributes and resources, with the following exceptions:
* each subjob has a unique index, $PBS_ARRAY_INDEX
* job Identifiers of subjobs only differ by their indices
* the state of subjobs can differ (R,Q,...etc.)
All subjobs within a job array have the same scheduling priority and schedule as independent jobs. An entire job array is submitted through a single qsub command and may be managed by qdel, qalter, qhold, qrls, and qsig commands as a single job.
### Shared Jobscript
All subjobs in a job array use the very same, single jobscript. Each subjob runs its own instance of the jobscript. The instances execute different work controlled by the $PBS_ARRAY_INDEX variable.
Example:
Assume we have 900 input files with the name of each beginning with "file" (e. g. file001, ..., file900). Assume we would like to use each of these input files with program executable myprog.x, each as a separate job.
First, we create a tasklist file (or subjobs list), listing all tasks (subjobs) - all input files in our example:
```console
$ find . -name 'file*' > tasklist
```
Then we create the jobscript:
```bash
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=16,walltime=02:00:00
# change to local scratch directory
SCR=/lscratch/$PBS_JOBID
mkdir -p $SCR ; cd $SCR || exit
# get individual tasks from tasklist with index from PBS JOB ARRAY
TASK=$(sed -n "${PBS_ARRAY_INDEX}p" $PBS_O_WORKDIR/tasklist)
# copy input file and executable to scratch
cp $PBS_O_WORKDIR/$TASK input ; cp $PBS_O_WORKDIR/myprog.x .
# execute the calculation
./myprog.x < input > output
# copy output file to submit directory
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, the submit directory holds the 900 input files, the executable myprog.x, and the jobscript file. As an input for each run, we take the filename of the input file from the created tasklist file. We copy the input file to the local scratch memory /lscratch/$PBS_JOBID, execute the myprog.x and copy the output file back to the submit directory, under the $TASK.out name. The myprog.x runs on one node only and must use threads to run in parallel. Be aware, that if the myprog.x **is not multithreaded**, then all the **jobs are run as single thread programs in a sequential** manner. Due to the allocation of the whole node, the accounted time is equal to the usage of the whole node, while using only 1/16 of the node!
If running a huge number of parallel multicore (in means of multinode multithread, e. g. MPI enabled) jobs is needed, then a job array approach should be used. The main difference as compared to previous examples using one node is that the local scratch memory should not be used (as it's not shared between nodes) and MPI or other techniques for parallel multinode processing has to be used properly.
### Submit the Job Array
To submit the job array, use the qsub -J command. The 900 jobs of the [example above][5] may be submitted like this:
```console
$ qsub -N JOBNAME -J 1-900 jobscript
12345[].dm2
```
In this example, we submit a job array of 900 subjobs. Each subjob will run on one full node and is assumed to take less than 2 hours (note the #PBS directives in the beginning of the jobscript file, don't forget to set your valid PROJECT_ID and desired queue).
Sometimes for testing purposes, you may need to submit a one-element only array. This is not allowed by PBSPro, but there's a workaround:
```console
$ qsub -N JOBNAME -J 9-10:2 jobscript
```
This will only choose the lower index (9 in this example) for submitting/running your job.
### Manage the Job Array
Check status of the job array using the qstat command.
```console
$ qstat -a 12345[].dm2
dm2:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -- |---|---| ------ --- --- ------ ----- - -----
12345[].dm2 user2 qprod xx 13516 1 16 -- 00:50 B 00:02
```
When the status is B it means that some subjobs are already running.
Check the status of the first 100 subjobs using the qstat command.
```console
$ qstat -a 12345[1-100].dm2
dm2:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -- |---|---| ------ --- --- ------ ----- - -----
12345[1].dm2 user2 qprod xx 13516 1 16 -- 00:50 R 00:02
12345[2].dm2 user2 qprod xx 13516 1 16 -- 00:50 R 00:02
12345[3].dm2 user2 qprod xx 13516 1 16 -- 00:50 R 00:01
12345[4].dm2 user2 qprod xx 13516 1 16 -- 00:50 Q --
. . . . . . . . . . .
, . . . . . . . . . .
12345[100].dm2 user2 qprod xx 13516 1 16 -- 00:50 Q --
```
Delete the entire job array. Running subjobs will be killed, queueing subjobs will be deleted.
```console
$ qdel 12345[].dm2
```
Deleting large job arrays may take a while.
Display status information for all user's jobs, job arrays, and subjobs.
```console
$ qstat -u $USER -t
```
Display status information for all user's subjobs.
```console
$ qstat -u $USER -tJ
```
Read more on job arrays in the [PBSPro Users guide][6].
## GNU Parallel
!!! note
Use GNU parallel to run many single core tasks on one node.
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. GNU parallel is most useful when running single core jobs via the queue system on Anselm.
For more information and examples see the parallel man page:
```console
$ module add parallel
$ man parallel
```
### GNU Parallel Jobscript
The GNU parallel shell executes multiple instances of the jobscript using all cores on the node. The instances execute different work, controlled by the $PARALLEL_SEQ variable.
Example:
Assume we have 101 input files with name beginning with "file" (e. g. file001, ..., file101). Assume we would like to use each of these input files with program executable myprog.x, each as a separate single core job. We call these single core jobs tasks.
First, we create a tasklist file, listing all tasks - all input files in our example:
```console
$ find . -name 'file*' > tasklist
```
Then we create a jobscript:
```bash
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=16,walltime=02:00:00
[ -z "$PARALLEL_SEQ" ] &&
{ module add parallel ; exec parallel -a $PBS_O_WORKDIR/tasklist $0 ; }
# change to local scratch directory
SCR=/lscratch/$PBS_JOBID/$PARALLEL_SEQ
mkdir -p $SCR ; cd $SCR || exit
# get individual task from tasklist
TASK=$1
# copy input file and executable to scratch
cp $PBS_O_WORKDIR/$TASK input
# execute the calculation
cat input > output
# copy output file to submit directory
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, tasks from the tasklist are executed via the GNU parallel. The jobscript executes multiple instances of itself in parallel, on all cores of the node. Once an instace of the jobscript is finished, a new instance starts until all entries in the tasklist are processed. Currently processed entries of the joblist may be retrieved via $1 variable. The variable $TASK expands to one of the input filenames from the tasklist. We copy the input file to local scratch memory, execute the myprog.x and copy the output file back to the submit directory, under the $TASK.out name.
### Submit the Job
To submit the job, use the qsub command. The 101 task job of the [example above][7] may be submitted as follows:
```console
$ qsub -N JOBNAME jobscript
12345.dm2
```
In this example, we submit a job of 101 tasks. 16 input files will be processed in parallel. The 101 tasks on 16 cores are assumed to complete in less than 2 hours.
!!! hint
Use #PBS directives at the beginning of the jobscript file, don't forget to set your valid PROJECT_ID and desired queue.
## Job Arrays and GNU Parallel
!!! note
Combine the Job arrays and GNU parallel for the best throughput of single core jobs
While job arrays are able to utilize all available computational nodes, the GNU parallel can be used to efficiently run multiple single-core jobs on a single node. The two approaches may be combined to utilize all available (current and future) resources to execute single core jobs.
!!! note
Every subjob in an array runs GNU parallel to utilize all cores on the node
### GNU Parallel, Shared jobscript
A combined approach, very similar to job arrays, can be taken. A job array is submitted to the queuing system. The subjobs run GNU parallel. The GNU parallel shell executes multiple instances of the jobscript using all of the cores on the node. The instances execute different work, controlled by the $PBS_JOB_ARRAY and $PARALLEL_SEQ variables.
Example:
Assume we have 992 input files with each name beginning with "file" (e. g. file001, ..., file992). Assume we would like to use each of these input files with program executable myprog.x, each as a separate single core job. We call these single core jobs tasks.
First, we create a tasklist file, listing all tasks - all input files in our example:
```console
$ find . -name 'file*' > tasklist
```
Next we create a file, controlling how many tasks will be executed in one subjob:
```console
$ seq 32 > numtasks
```
Then we create a jobscript:
```bash
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=16,walltime=02:00:00
[ -z "$PARALLEL_SEQ" ] &&
{ module add parallel ; exec parallel -a $PBS_O_WORKDIR/numtasks $0 ; }
# change to local scratch directory
SCR=/lscratch/$PBS_JOBID/$PARALLEL_SEQ
mkdir -p $SCR ; cd $SCR || exit
# get individual task from tasklist with index from PBS JOB ARRAY and index form Parallel
IDX=$(($PBS_ARRAY_INDEX + $PARALLEL_SEQ - 1))
TASK=$(sed -n "${IDX}p" $PBS_O_WORKDIR/tasklist)
[ -z "$TASK" ] && exit
# copy input file and executable to scratch
cp $PBS_O_WORKDIR/$TASK input
# execute the calculation
cat input > output
# copy output file to submit directory
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, the jobscript executes in multiple instances in parallel, on all cores of a computing node. The variable $TASK expands to one of the input filenames from the tasklist. We copy the input file to local scratch memory, execute the myprog.x and copy the output file back to the submit directory, under the $TASK.out name. The numtasks file controls how many tasks will be run per subjob. Once a task is finished, a new task starts, until the number of tasks in the numtasks file is reached.
!!! note
Select subjob walltime and number of tasks per subjob carefully
When deciding this values, keep in mind the following guiding rules:
1. Let n=N/16. Inequality (n+1) \* T < W should hold. N is the number of tasks per subjob, T is expected single task walltime and W is subjob walltime. Short subjob walltime improves scheduling and job throughput.
1. The number of tasks should be modulo 16.
1. These rules are valid only when all tasks have similar task walltimes T.
### Submit the Job Array (-J)
To submit the job array, use the qsub -J command. The 992 task job of the [example above][8] may be submitted like this:
```console
$ qsub -N JOBNAME -J 1-992:32 jobscript
12345[].dm2
```
In this example, we submit a job array of 31 subjobs. Note the -J 1-992:**32**, this must be the same as the number sent to numtasks file. Each subjob will run on one full node and process 16 input files in parallel, 32 in total per subjob. Every subjob is assumed to complete in less than 2 hours.
!!! hint
Use #PBS directives at the beginning of the jobscript file, don't forget to set your valid PROJECT_ID and desired queue.
## Examples
Download the examples in [capacity.zip][9], illustrating the above listed ways to run a huge number of jobs. We recommend trying out the examples before using this for running production jobs.
Unzip the archive in an empty directory on Anselm and follow the instructions in the README file
```console
$ unzip capacity.zip
$ cat README
```
[1]: #job-arrays
[2]: #shared-jobscript-on-one-node
[3]: #gnu-parallel
[4]: #job-arrays-and-gnu-parallel
[5]: #array_example
[6]: ../pbspro.md
[7]: #gp_example
[8]: #combined_example
[9]: capacity.zip
# Compute Nodes
## Node Configuration
Anselm is a cluster of x86-64 Intel based nodes built with Bull Extreme Computing bullx technology. The cluster contains four types of compute nodes.
### Compute Nodes Without Accelerators
* 180 nodes
* 2880 cores in total
* two Intel Sandy Bridge E5-2665, 8-core, 2.4GHz processors per node
* 64 GB of physical memory per node
* one 500GB SATA 2,5” 7,2 krpm HDD per node
* bullx B510 blade servers
* cn[1-180]
### Compute Nodes With a GPU Accelerator
* 23 nodes
* 368 cores in total
* two Intel Sandy Bridge E5-2470, 8-core, 2.3GHz processors per node
* 96 GB of physical memory per node
* one 500GB SATA 2,5” 7,2 krpm HDD per node
* GPU accelerator 1x NVIDIA Tesla Kepler K20m per node
* bullx B515 blade servers
* cn[181-203]
### Compute Nodes With a MIC Accelerator
* 4 nodes
* 64 cores in total
* two Intel Sandy Bridge E5-2470, 8-core, 2.3GHz processors per node
* 96 GB of physical memory per node
* one 500GB SATA 2,5” 7,2 krpm HDD per node
* MIC accelerator 1x Intel Phi 5110P per node
* bullx B515 blade servers
* cn[204-207]
### Fat Compute Nodes
* 2 nodes
* 32 cores in total
* 2 Intel Sandy Bridge E5-2665, 8-core, 2.4GHz processors per node
* 512 GB of physical memory per node
* two 300GB SAS 3,5” 15krpm HDD (RAID1) per node
* two 100GB SLC SSD per node
* bullx R423-E3 servers
* cn[208-209]
![](../img/bullxB510.png)
**Anselm bullx B510 servers**
### Compute Node Summary
| Node type | Count | Range | Memory | Cores | Queues |
| ---------------------------- | ----- | ----------- | ------ | ----------- | -------------------------------------- |
| Nodes without an accelerator | 180 | cn[1-180] | 64GB | 16 @ 2.4GHz | qexp, qprod, qlong, qfree, qprace, qatlas |
| Nodes with a GPU accelerator | 23 | cn[181-203] | 96GB | 16 @ 2.3GHz | qnvidia, qexp |
| Nodes with a MIC accelerator | 4 | cn[204-207] | 96GB | 16 @ 2.3GHz | qmic, qexp |
| Fat compute nodes | 2 | cn[208-209] | 512GB | 16 @ 2.4GHz | qfat, qexp |
## Processor Architecture
Anselm is equipped with Intel Sandy Bridge processors Intel Xeon E5-2665 (nodes without accelerators and fat nodes) and Intel Xeon E5-2470 (nodes with accelerators). The processors support Advanced Vector Extensions (AVX) 256-bit instruction set.
### Intel Sandy Bridge E5-2665 Processor
* eight-core
* speed: 2.4 GHz, up to 3.1 GHz using Turbo Boost Technology
* peak performance: 19.2 GFLOP/s per core
* caches:
* L2: 256 KB per core
* L3: 20 MB per processor
* memory bandwidth at the level of the processor: 51.2 GB/s
### Intel Sandy Bridge E5-2470 Processor
* eight-core
* speed: 2.3 GHz, up to 3.1 GHz using Turbo Boost Technology
* peak performance: 18.4 GFLOP/s per core
* caches:
* L2: 256 KB per core
* L3: 20 MB per processor
* memory bandwidth at the level of the processor: 38.4 GB/s
Nodes equipped with Intel Xeon E5-2665 CPU have a set PBS resource attribute cpu_freq = 24, nodes equipped with Intel Xeon E5-2470 CPU have set PBS resource attribute cpu_freq = 23.
```console
$ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=16:cpu_freq=24 -I
```
In this example, we allocate 4 nodes, 16 cores at 2.4GHhz per node.
Intel Turbo Boost Technology is used by default, you can disable it for all nodes of job by using resource attribute cpu_turbo_boost.
```console
$ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=16 -l cpu_turbo_boost=0 -I
```
## Memmory Architecture
The cluster contains three types of compute nodes.
### Compute Nodes Without Accelerators
* 2 sockets
* Memory Controllers are integrated into processors.
* 8 DDR3 DIMMs per node
* 4 DDR3 DIMMs per CPU
* 1 DDR3 DIMMs per channel
* Data rate support: up to 1600MT/s
* Populated memory: 8 x 8 GB DDR3 DIMM 1600 MHz
### Compute Nodes With a GPU or MIC Accelerator
* 2 sockets
* Memory Controllers are integrated into processors.
* 6 DDR3 DIMMs per node
* 3 DDR3 DIMMs per CPU
* 1 DDR3 DIMMs per channel
* Data rate support: up to 1600MT/s
* Populated memory: 6 x 16 GB DDR3 DIMM 1600 MHz
### Fat Compute Nodes
* 2 sockets
* Memory Controllers are integrated into processors.
* 16 DDR3 DIMMs per node
* 8 DDR3 DIMMs per CPU
* 2 DDR3 DIMMs per channel
* Data rate support: up to 1600MT/s
* Populated memory: 16 x 32 GB DDR3 DIMM 1600 MHz
# Hardware Overview
The Anselm cluster consists of 209 computational nodes named cn[1-209] of which 180 are regular compute nodes, 23 are GPU Kepler K20 accelerated nodes, 4 are MIC Xeon Phi 5110P accelerated nodes, and 2 are fat nodes. Each node is a powerful x86-64 computer, equipped with 16 cores (two eight-core Intel Sandy Bridge processors), at least 64 GB of RAM, and a local hard drive. User access to the Anselm cluster is provided by two login nodes login[1,2]. The nodes are interlinked through high speed InfiniBand and Ethernet networks. All nodes share a 320 TB /home disk for storage of user files. The 146 TB shared /scratch storage is available for scratch data.
The Fat nodes are equipped with a large amount (512 GB) of memory. Virtualization infrastructure provides resources to run long term servers and services in virtual mode. Fat nodes and virtual servers may access 45 TB of dedicated block storage. Accelerated nodes, fat nodes, and virtualization infrastructure are available [upon request][a] from a PI.
Schematic representation of the Anselm cluster. Each box represents a node (computer) or storage capacity:
![](../img/Anselm-Schematic-Representation.png)
The cluster compute nodes cn[1-207] are organized within 13 chassis.
There are four types of compute nodes:
* 180 compute nodes without an accelerator
* 23 compute nodes with a GPU accelerator - an NVIDIA Tesla Kepler K20m
* 4 compute nodes with a MIC accelerator - an Intel Xeon Phi 5110P
* 2 fat nodes - equipped with 512 GB of RAM and two 100 GB SSD drives
[More about Compute nodes][1].
GPU and accelerated nodes are available upon request, see the [Resources Allocation Policy][2].
All of these nodes are interconnected through fast InfiniBand and Ethernet networks. [More about the Network][3].
Every chassis provides an InfiniBand switch, marked **isw**, connecting all nodes in the chassis, as well as connecting the chassis to the upper level switches.
All of the nodes share a 360 TB /home disk for storage of user files. The 146 TB shared /scratch storage is available for scratch data. These file systems are provided by the Lustre parallel file system. There is also local disk storage available on all compute nodes in /lscratch. [More about Storage][4].
User access to the Anselm cluster is provided by two login nodes login1, login2, and data mover node dm1. [More about accessing the cluster][5].
The parameters are summarized in the following tables:
| **In general** | |
| ------------------------------------------- | -------------------------------------------- |
| Primary purpose | High Performance Computing |
| Architecture of compute nodes | x86-64 |
| Operating system | Linux (CentOS) |
| [**Compute nodes**][1] | |
| Total | 209 |
| Processor cores | 16 (2 x 8 cores) |
| RAM | min. 64 GB, min. 4 GB per core |
| Local disk drive | yes - usually 500 GB |
| Compute network | InfiniBand QDR, fully non-blocking, fat-tree |
| w/o accelerator | 180, cn[1-180] |
| GPU accelerated | 23, cn[181-203] |
| MIC accelerated | 4, cn[204-207] |
| Fat compute nodes | 2, cn[208-209] |
| **In total** | |
| Total theoretical peak performance (Rpeak) | 94 TFLOP/s |
| Total max. LINPACK performance (Rmax) | 73 TFLOP/s |
| Total amount of RAM | 15.136 TB |
| Node | Processor | Memory | Accelerator |
| ---------------- | --------------------------------------- | ------ | -------------------- |
| w/o accelerator | 2 x Intel Sandy Bridge E5-2665, 2.4 GHz | 64 GB | - |
| GPU accelerated | 2 x Intel Sandy Bridge E5-2470, 2.3 GHz | 96 GB | NVIDIA Kepler K20m |
| MIC accelerated | 2 x Intel Sandy Bridge E5-2470, 2.3 GHz | 96 GB | Intel Xeon Phi 5110P |
| Fat compute node | 2 x Intel Sandy Bridge E5-2665, 2.4 GHz | 512 GB | - |
For more details refer to [Compute nodes][1], [Storage][4], and [Network][3].
[1]: compute-nodes.md
[2]: resources-allocation-policy.md
[3]: network.md
[4]: storage.md
[5]: shell-and-data-access.md
[a]: https://support.it4i.cz/rt
# Introduction
Welcome to Anselm supercomputer cluster. The Anselm cluster consists of 209 compute nodes, totalling 3344 compute cores with 15 TB RAM, giving over 94 TFLOP/s theoretical peak performance. Each node is a powerful x86-64 computer, equipped with 16 cores, at least 64 GB of RAM, and a 500 GB hard disk drive. Nodes are interconnected through a fully non-blocking fat-tree InfiniBand network, and are equipped with Intel Sandy Bridge processors. A few nodes are also equipped with NVIDIA Kepler GPU or Intel Xeon Phi MIC accelerators. Read more in [Hardware Overview][1].
The cluster runs with an operating system which is compatible with the RedHat [Linux family][a]. We have installed a wide range of software packages targeted at different scientific domains. These packages are accessible via the [modules environment][2].
The user data shared file-system (HOME, 320 TB) and job data shared file-system (SCRATCH, 146 TB) are available to users.
The PBS Professional workload manager provides [computing resources allocations and job execution][3].
Read more on how to [apply for resources][4], [obtain login credentials][5] and [access the cluster][6].
[1]: hardware-overview.md
[2]: ../environment-and-modules.md
[3]: resources-allocation-policy.md
[4]: ../general/applying-for-resources.md
[5]: ../general/obtaining-login-credentials/obtaining-login-credentials.md
[6]: shell-and-data-access.md
[a]: http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg
# Job Scheduling
## Job Execution Priority
The scheduler gives each job an execution priority and then uses this job execution priority to select which job(s) to run.
Job execution priority on Anselm is determined by these job properties (in order of importance):
1. queue priority
1. fair-share priority
1. eligible time
### Queue Priority
Queue priority is the priority of the queue in which the job is waiting prior to execution.
Queue priority has the biggest impact on job execution priority. The execution priority of jobs in higher priority queues is always greater than the execution priority of jobs in lower priority queues. Other properties of jobs used for determining the job execution priority (fair-share priority, eligible time) cannot compete with queue priority.
Queue priorities can be seen [here][a].
### Fair-Share Priority
Fair-share priority is priority calculated on the basis of recent usage of resources. Fair-share priority is calculated per project, all members of a project sharing the same fair-share priority. Projects with higher recent usage have a lower fair-share priority than projects with lower or no recent usage.
Fair-share priority is used for ranking jobs with equal queue priority.
Fair-share priority is calculated as
---8<--- "fairshare_formula.md"
where MAX_FAIRSHARE has value 1E6,
usage<sub>Project</sub> is accumulated usage by all members of a selected project,
usage<sub>Total</sub> is total usage by all users, across all projects.
Usage counts allocated core-hours (`ncpus x walltime`). Usage decays, halving at intervals of 168 hours (one week).
Jobs queued in the queue qexp are not used to calculate the project's usage.
!!! note
Calculated usage and fair-share priority can be seen [here][b].
Calculated fair-share priority can be also be seen in the Resource_List.fairshare attribute of a job.
### Eligible Time
Eligible time is the amount (in seconds) of eligible time a job accrues while waiting to run. Jobs with higher eligible time gain higher priority.
Eligible time has the least impact on execution priority. Eligible time is used for sorting jobs with equal queue priority and fair-share priority. It is very, very difficult for eligible time to compete with fair-share priority.
Eligible time can be seen in the eligible_time attribute of job.
### Formula
Job execution priority (job sort formula) is calculated as:
---8<--- "job_sort_formula.md"
### Job Backfilling
The Anselm cluster uses job backfilling.
Backfilling means fitting smaller jobs around the higher-priority jobs that the scheduler is going to run next, in such a way that the higher-priority jobs are not delayed. Backfilling allows us to keep resources from becoming idle when the top job (the job with the highest execution priority) cannot run.
The scheduler makes a list of jobs to run in order of execution priority. The scheduler looks for smaller jobs that can fit into the usage gaps around the highest-priority jobs in the list. The scheduler looks in the prioritized list of jobs and chooses the highest-priority smaller jobs that fit. Filler jobs are run only if they will not delay the start time of top jobs.
This means that jobs with lower execution priority can be run before jobs with higher execution priority.
!!! note
It is **very beneficial to specify the walltime** when submitting jobs.
Specifying more accurate walltime enables better scheduling, better execution times, and better resource usage. Jobs with suitable (small) walltime can be backfilled - and overtake job(s) with a higher priority.
---8<--- "mathjax.md"
[a]: https://extranet.it4i.cz/anselm/queues
[b]: https://extranet.it4i.cz/anselm/projects
# Job Submission and Execution
## Job Submission
When allocating computational resources for the job, specify:
1. a suitable queue for your job (the default is qprod)
1. the number of computational nodes required
1. the number of cores per node required
1. the maximum wall time allocated to your calculation, note that jobs exceeding the maximum wall time will be killed
1. your Project ID
1. a Jobscript or interactive switch
!!! note
Use the **qsub** command to submit your job to a queue for allocation of computational resources.
Submit the job using the qsub command:
```console
$ qsub -A Project_ID -q queue -l select=x:ncpus=y,walltime=[[hh:]mm:]ss[.ms] jobscript
```
The qsub command submits the job to the queue, i.e. the qsub command creates a request to the PBS Job manager for allocation of specified resources. The resources will be allocated when available, subject to the above described policies and constraints. **After the resources are allocated, the jobscript or interactive shell is executed on the first of the allocated nodes.**
!!! note
PBS statement nodes (qsub -l nodes=nodespec) are not supported on the Anselm cluster.
### Job Submission Examples
```console
$ qsub -A OPEN-0-0 -q qprod -l select=64:ncpus=16,walltime=03:00:00 ./myjob
```
In this example, we allocate 64 nodes, 16 cores per node, for 3 hours. We allocate these resources via the qprod queue, consumed resources will be accounted to the Project identified by Project ID OPEN-0-0. The jobscript 'myjob' will be executed on the first node in the allocation.
```console
$ qsub -q qexp -l select=4:ncpus=16 -I
```
In this example, we allocate 4 nodes, 16 cores per node, for 1 hour. We allocate these resources via the qexp queue. The resources will be available interactively.
```console
$ qsub -A OPEN-0-0 -q qnvidia -l select=10:ncpus=16 ./myjob
```
In this example, we allocate 10 nvidia accelerated nodes, 16 cores per node, for 24 hours. We allocate these resources via the qnvidia queue. the jobscript 'myjob' will be executed on the first node in the allocation.
```console
$ qsub -A OPEN-0-0 -q qfree -l select=10:ncpus=16 ./myjob
```
In this example, we allocate 10 nodes, 16 cores per node, for 12 hours. We allocate these resources via the qfree queue. It is not required that the project OPEN-0-0 has any available resources left. Consumed resources are still accounted for. The jobscript myjob will be executed on the first node in the allocation.
All qsub options may be [saved directly into the jobscript][1]. In such cases, it is not necessary to specify any options for qsub.
```console
$ qsub ./myjob
```
By default, the PBS batch system sends an e-mail only when the job is aborted. Disabling mail events completely can be done as follows:
```console
$ qsub -m n
```
## Advanced Job Placement
### Placement by Name
Specific nodes may be allocated via the PBS
```console
$ qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=16:host=cn171+1:ncpus=16:host=cn172 -I
```
In this example, we allocate nodes cn171 and cn172, all 16 cores per node, for 24 hours. Consumed resources will be accounted to the Project identified by Project ID OPEN-0-0. The resources will be available interactively.
### Placement by CPU Type
Nodes equipped with an Intel Xeon E5-2665 CPU have a base clock frequency of 2.4GHz, nodes equipped with an Intel Xeon E5-2470 CPU have a base frequency of 2.3 GHz (see the section Compute Nodes for details). Nodes may be selected via the PBS resource attribute cpu_freq .
| CPU Type | base freq. | Nodes | cpu_freq attribute |
| ------------------ | ---------- | ---------------------- | ------------------ |
| Intel Xeon E5-2665 | 2.4GHz | cn[1-180], cn[208-209] | 24 |
| Intel Xeon E5-2470 | 2.3GHz | cn[181-207] | 23 |
```console
$ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=16:cpu_freq=24 -I
```
In this example, we allocate 4 nodes, 16 cores per node, selecting only the nodes with Intel Xeon E5-2665 CPU.
### Placement by IB Switch
Groups of computational nodes are connected to chassis integrated Infiniband switches. These switches form the leaf switch layer of the [Infiniband network][2] fat tree topology. Nodes sharing the leaf switch can communicate most efficiently. Sharing the same switch prevents hops in the network and facilitates unbiased, highly efficient network communication.
Nodes sharing the same switch may be selected via the PBS resource attribute ibswitch. Values of this attribute are iswXX, where XX is the switch number. The node-switch mapping can be seen in the [Hardware Overview][3] section.
We recommend allocating compute nodes to a single switch when best possible computational network performance is required to run the job efficiently:
```console
$ qsub -A OPEN-0-0 -q qprod -l select=18:ncpus=16:ibswitch=isw11 ./myjob
```
In this example, we request all of the 18 nodes sharing the isw11 switch for 24 hours. a full chassis will be allocated.
## Advanced Job Handling
### Selecting Turbo Boost Off
Intel Turbo Boost Technology is on by default. We strongly recommend keeping the default.
If necessary (such as in the case of benchmarking) you can disable the Turbo for all nodes of the job by using the PBS resource attribute cpu_turbo_boost:
```console
$ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=16 -l cpu_turbo_boost=0 -I
```
More information about the Intel Turbo Boost can be found in the TurboBoost section
### Advanced Examples
In the following example, we select an allocation for benchmarking a very special and demanding MPI program. We request Turbo off, and 2 full chassis of compute nodes (nodes sharing the same IB switches) for 30 minutes:
```console
$ qsub -A OPEN-0-0 -q qprod
-l select=18:ncpus=16:ibswitch=isw10:mpiprocs=1:ompthreads=16+18:ncpus=16:ibswitch=isw20:mpiprocs=16:ompthreads=1
-l cpu_turbo_boost=0,walltime=00:30:00
-N Benchmark ./mybenchmark
```
The MPI processes will be distributed differently on the nodes connected to the two switches. On the isw10 nodes, we will run 1 MPI process per node with 16 threads per process, on isw20 nodes we will run 16 plain MPI processes.
Although this example is somewhat artificial, it demonstrates the flexibility of the qsub command options.
## Job Management
!!! note
Check status of your jobs using the **qstat** and **check-pbs-jobs** commands
```console
$ qstat -a
$ qstat -a -u username
$ qstat -an -u username
$ qstat -f 12345.srv11
```
Example:
```console
$ qstat -a
srv11:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -- |---|---| ------ --- --- ------ ----- - -----
16287.srv11 user1 qlong job1 6183 4 64 -- 144:0 R 38:25
16468.srv11 user1 qlong job2 8060 4 64 -- 144:0 R 17:44
16547.srv11 user2 qprod job3x 13516 2 32 -- 48:00 R 00:58
```
In this example user1 and user2 are running jobs named job1, job2 and job3x. The jobs job1 and job2 are using 4 nodes, 16 cores per node each. job1 has already run for 38 hours and 25 minutes, and job2 for 17 hours 44 minutes. job1 has already consumed `64 x 38.41 = 2458.6` core hours. job3x has already consumed `0.96 x 32 = 30.93` core hours. These consumed core hours will be accounted for on the respective project accounts, regardless of whether the allocated cores were actually used for computations.
The following commands allow you to; check the status of your jobs using the check-pbs-jobs command; check for the presence of user's PBS jobs' processes on execution hosts; display load and processes; display job standard and error output; continuously display (tail -f) job standard or error output;
```console
$ check-pbs-jobs --check-all
$ check-pbs-jobs --print-load --print-processes
$ check-pbs-jobs --print-job-out --print-job-err
$ check-pbs-jobs --jobid JOBID --check-all --print-all
$ check-pbs-jobs --jobid JOBID --tailf-job-out
```
Examples:
```console
$ check-pbs-jobs --check-all
JOB 35141.dm2, session_id 71995, user user2, nodes cn164,cn165
Check session id: OK
Check processes
cn164: OK
cn165: No process
```
In this example we see that job 35141.dm2 is not currently running any processes on the allocated node cn165, which may indicate an execution error.
```console
$ check-pbs-jobs --print-load --print-processes
JOB 35141.dm2, session_id 71995, user user2, nodes cn164,cn165
Print load
cn164: LOAD: 16.01, 16.01, 16.00
cn165: LOAD: 0.01, 0.00, 0.01
Print processes
%CPU CMD
cn164: 0.0 -bash
cn164: 0.0 /bin/bash /var/spool/PBS/mom_priv/jobs/35141.dm2.SC
cn164: 99.7 run-task
...
```
In this example we see that job 35141.dm2 is currently running a process run-task on node cn164, using one thread only, while node cn165 is empty, which may indicate an execution error.
```console
$ check-pbs-jobs --jobid 35141.dm2 --print-job-out
JOB 35141.dm2, session_id 71995, user user2, nodes cn164,cn165
Print job standard output:
======================== Job start ==========================
Started at : Fri Aug 30 02:47:53 CEST 2013
Script name : script
Run loop 1
Run loop 2
Run loop 3
```
In this example, we see actual output (some iteration loops) of the job 35141.dm2
!!! note
Manage your queued or running jobs, using the **qhold**, **qrls**, **qdel**, **qsig** or **qalter** commands
You may release your allocation at any time, using the qdel command
```console
$ qdel 12345.srv11
```
You may kill a running job by force, using the qsig command
```console
$ qsig -s 9 12345.srv11
```
Learn more by reading the pbs man page
```console
$ man pbs_professional
```
## Job Execution
### Jobscript
!!! note
Prepare the jobscript to run batch jobs in the PBS queue system
The Jobscript is a user made script controlling a sequence of commands for executing the calculation. It is often written in bash, though other scripts may be used as well. The jobscript is supplied to the PBS **qsub** command as an argument, and is executed by the PBS Professional workload manager.
!!! note
The jobscript or interactive shell is executed on first of the allocated nodes.
```console
$ qsub -q qexp -l select=4:ncpus=16 -N Name0 ./myjob
$ qstat -n -u username
srv11:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -- |---|---| ------ --- --- ------ ----- - -----
15209.srv11 username qexp Name0 5530 4 64 -- 01:00 R 00:00
cn17/0*16+cn108/0*16+cn109/0*16+cn110/0*16
```
In this example, the nodes cn17, cn108, cn109, and cn110 were allocated for 1 hour via the qexp queue. The jobscript myjob will be executed on the node cn17, while the nodes cn108, cn109, and cn110 are available for use as well.
The jobscript or interactive shell is by default executed in the home directory
```console
$ qsub -q qexp -l select=4:ncpus=16 -I
qsub: waiting for job 15210.srv11 to start
qsub: job 15210.srv11 ready
$ pwd
/home/username
```
In this example, 4 nodes were allocated interactively for 1 hour via the qexp queue. The interactive shell is executed in the home directory.
!!! note
All nodes within the allocation may be accessed via ssh. Unallocated nodes are not accessible to the user.
The allocated nodes are accessible via ssh from login nodes. The nodes may access each other via ssh as well.
Calculations on allocated nodes may be executed remotely via the MPI, ssh, pdsh or clush. You may find out which nodes belong to the allocation by reading the $PBS_NODEFILE file
```console
qsub -q qexp -l select=4:ncpus=16 -I
qsub: waiting for job 15210.srv11 to start
qsub: job 15210.srv11 ready
$ pwd
/home/username
$ sort -u $PBS_NODEFILE
cn17.bullx
cn108.bullx
cn109.bullx
cn110.bullx
$ pdsh -w cn17,cn[108-110] hostname
cn17: cn17
cn108: cn108
cn109: cn109
cn110: cn110
```
In this example, the hostname program is executed via pdsh from the interactive shell. The execution runs on all four allocated nodes. The same result would be achieved if the pdsh is called from any of the allocated nodes or from the login nodes.
### Example Jobscript for MPI Calculation
!!! note
Production jobs must use the /scratch directory for I/O
The recommended way to run production jobs is to change to the /scratch directory early in the jobscript, copy all inputs to /scratch, execute the calculations and copy outputs to the home directory.
```bash
#!/bin/bash
# change to scratch directory, exit on failure
SCRDIR=/scratch/$USER/myjob
mkdir -p $SCRDIR
cd $SCRDIR || exit
# copy input file to scratch
cp $PBS_O_WORKDIR/input .
cp $PBS_O_WORKDIR/mympiprog.x .
# load the MPI module
ml OpenMPI
# execute the calculation
mpirun -pernode ./mympiprog.x
# copy output file to home
cp output $PBS_O_WORKDIR/.
#exit
exit
```
In this example, a directory in /home holds the input file input and executable mympiprog.x . We create the directory myjob on the /scratch filesystem, copy input and executable files from the /home directory where the qsub was invoked ($PBS_O_WORKDIR) to /scratch, execute the MPI program mympiprog.x and copy the output file back to the /home directory. mympiprog.x is executed as one process per node, on all allocated nodes.
!!! note
Consider preloading inputs and executables onto [shared scratch][4] memory before the calculation starts.
In some cases, it may be impractical to copy the inputs to the scratch memory and the outputs to the home directory. This is especially true when very large input and output files are expected, or when the files should be reused by a subsequent calculation. In such cases, it is the users' responsibility to preload the input files on shared /scratch memory before the job submission, and retrieve the outputs manually after all calculations are finished.
!!! note
Store the qsub options within the jobscript. Use **mpiprocs** and **ompthreads** qsub options to control the MPI job execution.
### Example Jobscript for MPI Calculation With Preloaded Inputs
Example jobscript for an MPI job with preloaded inputs and executables, options for qsub are stored within the script:
```bash
#!/bin/bash
#PBS -q qprod
#PBS -N MYJOB
#PBS -l select=100:ncpus=16:mpiprocs=1:ompthreads=16
#PBS -A OPEN-0-0
# change to scratch directory, exit on failure
SCRDIR=/scratch/$USER/myjob
cd $SCRDIR || exit
# load the MPI module
ml OpenMPI
# execute the calculation
mpirun ./mympiprog.x
#exit
exit
```
In this example, input and executable files are assumed to be preloaded manually in the /scratch/$USER/myjob directory. Note the **mpiprocs** and **ompthreads** qsub options controlling the behavior of the MPI execution. mympiprog.x is executed as one process per node, on all 100 allocated nodes. If mympiprog.x implements OpenMP threads, it will run 16 threads per node.
More information can be found in the [Running OpenMPI][5] and [Running MPICH2][6] sections.
### Example Jobscript for Single Node Calculation
!!! note
The local scratch directory is often useful for single node jobs. Local scratch memory will be deleted immediately after the job ends.
Example jobscript for single node calculation, using [local scratch][4] memory on the node:
```bash
#!/bin/bash
# change to local scratch directory
cd /lscratch/$PBS_JOBID || exit
# copy input file to scratch
cp $PBS_O_WORKDIR/input .
cp $PBS_O_WORKDIR/myprog.x .
# execute the calculation
./myprog.x
# copy output file to home
cp output $PBS_O_WORKDIR/.
#exit
exit
```
In this example, a directory in /home holds the input file input and executable myprog.x . We copy input and executable files from the home directory where the qsub was invoked ($PBS_O_WORKDIR) to local scratch memory /lscratch/$PBS_JOBID, execute myprog.x and copy the output file back to the /home directory. myprog.x runs on one node only and may use threads.
### Other Jobscript Examples
Further jobscript examples may be found in the software section and the [Capacity computing][7] section.
[1]: #example-jobscript-for-mpi-calculation-with-preloaded-inputs
[2]: network.md
[3]: hardware-overview.md
[4]: storage.md
[5]: ../software/mpi/running_openmpi.md
[6]: ../software/mpi/running-mpich2.md
[7]: capacity-computing.md
# Network
All of the compute and login nodes of Anselm are interconnected through an [InfiniBand][a] QDR network and a Gigabit [Ethernet][b] network. Both networks may be used to transfer user data.
## InfiniBand Network
All of the compute and login nodes of Anselm are interconnected through a high-bandwidth, low-latency [InfiniBand][a] QDR network (IB 4 x QDR, 40 Gbps). The network topology is a fully non-blocking fat-tree.
The compute nodes may be accessed via the InfiniBand network using ib0 network interface, in address range 10.2.1.1-209. The MPI may be used to establish native InfiniBand connection among the nodes.
!!! note
The network provides **2170 MB/s** transfer rates via the TCP connection (single stream) and up to **3600 MB/s** via the native InfiniBand protocol.
The Fat tree topology ensures that peak transfer rates are achieved between any two nodes, independent of network traffic exchanged among other nodes concurrently.
## Ethernet Network
The compute nodes may be accessed via the regular Gigabit Ethernet network interface eth0, in address range 10.1.1.1-209, or by using aliases cn1-cn209. The network provides **114 MB/s** transfer rates via the TCP connection.
## Example
In this example, we access the node cn110 through the InfiniBand network via the ib0 interface, then from cn110 to cn108 through the Ethernet network.
```console
$ qsub -q qexp -l select=4:ncpus=16 -N Name0 ./myjob
$ qstat -n -u username
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -- |---|---| ------ --- --- ------ ----- - -----
15209.srv11 username qexp Name0 5530 4 64 -- 01:00 R 00:00
cn17/0*16+cn108/0*16+cn109/0*16+cn110/0*16
$ ssh 10.2.1.110
$ ssh 10.1.1.108
```
[a]: http://en.wikipedia.org/wiki/InfiniBand
[b]: http://en.wikipedia.org/wiki/Ethernet
# Resources Allocation Policy
## Job Queue Policies
The resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue and the resources available to the Project. The Fair-share system of Anselm ensures that individual users may consume approximately equal amounts of resources per week. Detailed information can be found in the [Job scheduling][1] section. The resources are accessible via several queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources. The following table provides the queue partitioning overview:
!!! note
Check the queue status at <https://extranet.it4i.cz/anselm/>
| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime |
| ------------------- | -------------- | -------------------- | ---------------------------------------------------- | --------- | -------- | ------------- | -------- |
| qexp | no | none required | 209 nodes | 1 | 150 | no | 1 h |
| qprod | yes | > 0 | 180 nodes w/o accelerator | 16 | 0 | no | 24/48 h |
| qlong | yes | > 0 | 180 nodes w/o accelerator | 16 | 0 | no | 72/144 h |
| qnvidia, qmic | yes | > 0 | 23 nvidia nodes, 4 mic nodes | 16 | 200 | yes | 24/48 h |
| qfat | yes | > 0 | 2 fat nodes | 16 | 200 | yes | 24/144 h |
| qfree | yes | < 120% of allocation | 180 w/o accelerator | 16 | -1024 | no | 12 h |
!!! note
**The qfree queue is not free of charge**. [Normal accounting][2] applies. However, it allows for utilization of free resources, once a project has exhausted all its allocated computational resources. This does not apply to Director's Discretion projects (DD projects) by default. Usage of qfree after exhaustion of DD projects' computational resources is allowed after request for this queue.
**The qexp queue is equipped with nodes which do not have exactly the same CPU clock speed.** Should you need the nodes to have exactly the same CPU speed, you have to select the proper nodes during the PSB job submission.
* **qexp**, the Express queue: This queue is dedicated to testing and running very small jobs. It is not required to specify a project to enter the qexp. There are always 2 nodes reserved for this queue (w/o accelerators), a maximum 8 nodes are available via the qexp for a particular user, from a pool of nodes containing Nvidia accelerated nodes (cn181-203), MIC accelerated nodes (cn204-207) and Fat nodes with 512GB of RAM (cn208-209). This enables us to test and tune accelerated code and code with higher RAM requirements. The nodes may be allocated on a per core basis. No special authorization is required to use qexp. The maximum runtime in qexp is 1 hour.
* **qprod**, the Production queue: This queue is intended for normal production runs. It is required that an active project with nonzero remaining resources is specified to enter the qprod. All nodes may be accessed via the qprod queue, except the reserved ones. 178 nodes without accelerators are included. Full nodes, 16 cores per node, are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qprod is 48 hours.
* **qlong**, the Long queue: This queue is intended for long production runs. It is required that an active project with nonzero remaining resources is specified to enter the qlong. Only 60 nodes without acceleration may be accessed via the qlong queue. Full nodes, 16 cores per node, are allocated. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qlong is 144 hours (three times that of the standard qprod time - 3 x 48 h).
* **qnvidia**, qmic, qfat, the Dedicated queues: The queue qnvidia is dedicated to accessing the Nvidia accelerated nodes, the qmic to accessing MIC nodes and qfat the Fat nodes. It is required that an active project with nonzero remaining resources is specified to enter these queues. 23 nvidia, 4 mic, and 2 fat nodes are included. Full nodes, 16 cores per node, are allocated. The queues run with very high priority, the jobs will be scheduled before the jobs coming from the qexp queue. An PI needs to explicitly ask [support][a] for authorization to enter the dedicated queues for all users associated with her/his project.
* **qfree**, The Free resource queue: The queue qfree is intended for utilization of free resources, after a project has exhausted all of its allocated computational resources (Does not apply to DD projects by default; DD projects have to request persmission to use qfree after exhaustion of computational resources). It is required that active project is specified to enter the queue. Consumed resources will be accounted to the Project. Access to the qfree queue is automatically removed if consumed resources exceed 120% of the resources allocated to the Project. Only 180 nodes without accelerators may be accessed from this queue. Full nodes, 16 cores per node, are allocated. The queue runs with very low priority and no special authorization is required to use it. The maximum runtime in qfree is 12 hours.
## Queue Notes
The job wall clock time defaults to **half the maximum time**, see the table above. Longer wall time limits can be [set manually, see examples][3].
Jobs that exceed the reserved wall clock time (Req'd Time) get killed automatically. The wall clock time limit can be changed for queuing jobs (state Q) using the qalter command, however it cannot be changed for a running job (state R).
Anselm users may check the current queue configuration [here][b].
## Queue Status
!!! tip
Check the status of jobs, queues and compute nodes [here][c].
![rspbs web interface](../img/rsweb.png)
Display the queue status on Anselm:
```console
$ qstat -q
```
The PBS allocation overview may be obtained also using the rspbs command:
```console
$ rspbs
Usage: rspbs [options]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
--get-node-ncpu-chart
Print chart of allocated ncpus per node
--summary Print summary
--get-server-details Print server
--get-queues Print queues
--get-queues-details Print queues details
--get-reservations Print reservations
--get-reservations-details
Print reservations details
--get-nodes Print nodes of PBS complex
--get-nodeset Print nodeset of PBS complex
--get-nodes-details Print nodes details
--get-jobs Print jobs
--get-jobs-details Print jobs details
--get-jobs-check-params
Print jobid, job state, session_id, user, nodes
--get-users Print users of jobs
--get-allocated-nodes
Print allocated nodes of jobs
--get-allocated-nodeset
Print allocated nodeset of jobs
--get-node-users Print node users
--get-node-jobs Print node jobs
--get-node-ncpus Print number of ncpus per node
--get-node-allocated-ncpus
Print number of allocated ncpus per node
--get-node-qlist Print node qlist
--get-node-ibswitch Print node ibswitch
--get-user-nodes Print user nodes
--get-user-nodeset Print user nodeset
--get-user-jobs Print user jobs
--get-user-jobc Print number of jobs per user
--get-user-nodec Print number of allocated nodes per user
--get-user-ncpus Print number of allocated ncpus per user
--get-qlist-nodes Print qlist nodes
--get-qlist-nodeset Print qlist nodeset
--get-ibswitch-nodes Print ibswitch nodes
--get-ibswitch-nodeset
Print ibswitch nodeset
--state=STATE Only for given job state
--jobid=JOBID Only for given job ID
--user=USER Only for given user
--node=NODE Only for given node
--nodestate=NODESTATE
Only for given node state (affects only --get-node*
--get-qlist-* --get-ibswitch-* actions)
--incl-finished Include finished jobs
```
---8<--- "resource_accounting.md"
---8<--- "mathjax.md"
[1]: job-priority.md
[2]: #resources-accounting-policy
[3]: job-submission-and-execution.md
[a]: https://support.it4i.cz/rt/
[b]: https://extranet.it4i.cz/anselm/queues
[c]: https://extranet.it4i.cz/anselm/
# Accessing the Cluster
## Shell Access
The Anselm cluster is accessed by SSH protocol via login nodes login1 and login2 at the address anselm.it4i.cz. The login nodes may be addressed specifically, by prepending the login node name to the address.
| Login address | Port | Protocol | Login node |
| --------------------- | ---- | -------- | -------------------------------------------- |
| anselm.it4i.cz | 22 | ssh | round-robin DNS record for login1 and login2 |
| login1.anselm.it4i.cz | 22 | ssh | login1 |
| login2.anselm.it4i.cz | 22 | ssh | login2 |
Authentication is available by [private key][1] only.
!!! note
Please verify SSH fingerprints during the first logon. They are identical on all login nodes:
md5:
29:b3:f4:64:b0:73:f5:6f:a7:85:0f:e0:0d:be:76:bf (DSA)
d4:6f:5c:18:f4:3f:70:ef:bc:fc:cc:2b:fd:13:36:b7 (RSA)
sha256:
LX2034TYy6Lf0Q7Zf3zOIZuFlG09DaSGROGBz6LBUy4 (DSA)
+DcED3GDoA9piuyvQOho+ltNvwB9SJSYXbB639hbejY (RSA)
Private key authentication:
On **Linux** or **Mac**, use:
```console
$ ssh -i /path/to/id_rsa username@anselm.it4i.cz
```
If you see a warning message "UNPROTECTED PRIVATE KEY FILE!", use this command to set lower permissions to the private key file:
```console
$ chmod 600 /path/to/id_rsa
```
On **Windows**, use [PuTTY ssh client][2].
After logging in, you will see the command prompt:
```console
_
/\ | |
/ \ _ __ ___ ___| |_ __ ___
/ /\ \ | '_ \/ __|/ _ \ | '_ ` _ \
/ ____ \| | | \__ \ __/ | | | | | |
/_/ \_\_| |_|___/\___|_|_| |_| |_|
http://www.it4i.cz/?lang=en
Last login: Tue Jul 9 15:57:38 2013 from your-host.example.com
[username@login2.anselm ~]$
```
Example to the cluster login:
!!! note
The environment is **not** shared between login nodes, except for [shared filesystems][3].
## Data Transfer
Data in and out of the system may be transferred by the [scp][a] and sftp protocols. (Not available yet). In the case that large volumes of data are transferred, use the dedicated data mover node dm1.anselm.it4i.cz for increased performance.
| Address | Port | Protocol |
| --------------------- | ---- | --------- |
| anselm.it4i.cz | 22 | scp |
| login1.anselm.it4i.cz | 22 | scp |
| login2.anselm.it4i.cz | 22 | scp |
Authentication is by [private key][1] only.
!!! note
Data transfer rates of up to **160MB/s** can be achieved with scp or sftp.
1TB may be transferred in 1:50h.
To achieve 160MB/s transfer rates, the end user must be connected by 10G line all the way to IT4Innovations, and be using a computer with a fast processor for the transfer. When using a Gigabit ethernet connection, up to 110MB/s transfer rates may be expected. Fast cipher (aes128-ctr) should be used.
!!! note
If you experience degraded data transfer performance, consult your local network provider.
On linux or Mac, use an scp or sftp client to transfer data to Anselm:
```console
$ scp -i /path/to/id_rsa my-local-file username@anselm.it4i.cz:directory/file
```
```console
$ scp -i /path/to/id_rsa -r my-local-dir username@anselm.it4i.cz:directory
```
or
```console
$ sftp -o IdentityFile=/path/to/id_rsa username@anselm.it4i.cz
```
A very convenient way to transfer files in and out of Anselm is via the fuse filesystem [sshfs][b].
```console
$ sshfs -o IdentityFile=/path/to/id_rsa username@anselm.it4i.cz:. mountpoint
```
Using sshfs, the users Anselm home directory will be mounted on your local computer, just like an external disk.
Learn more about ssh, scp and sshfs by reading the manpages
```console
$ man ssh
$ man scp
$ man sshfs
```
On Windows, use the [WinSCP client][c] to transfer the data. The [win-sshfs client][d] provides a way to mount the Anselm filesystems directly as an external disc.
More information about the shared file systems is available [here][4].
## Connection Restrictions
Outgoing connections, from Anselm Cluster login nodes to the outside world, are restricted to the following ports:
| Port | Protocol |
| ---- | -------- |
| 22 | ssh |
| 80 | http |
| 443 | https |
| 9418 | git |
!!! note
Please use **ssh port forwarding** and proxy servers to connect from Anselm to all other remote ports.
Outgoing connections, from Anselm Cluster compute nodes are restricted to the internal network. Direct connections form compute nodes to the outside world are cut.
## Port Forwarding
### Port Forwarding From Login Nodes
!!! note
Port forwarding allows an application running on Anselm to connect to arbitrary remote hosts and ports.
It works by tunneling the connection from Anselm back to users' workstations and forwarding from the workstation to the remote host.
Pick some unused port on the Anselm login node (for example 6000) and establish the port forwarding:
```console
$ ssh -R 6000:remote.host.com:1234 anselm.it4i.cz
```
In this example, we establish port forwarding between port 6000 on Anselm and port 1234 on the remote.host.com. By accessing localhost:6000 on Anselm, an application will see the response of remote.host.com:1234. The traffic will run via the user's local workstation.
Port forwarding may be done **using PuTTY** as well. On the PuTTY Configuration screen, load your Anselm configuration first. Then go to Connection->SSH->Tunnels to set up the port forwarding. Click Remote radio button. Insert 6000 to theSource port textbox. Insert remote.host.com:1234. Click the Add button, then Open.
Port forwarding may be established directly to the remote host. However, this requires that the user has ssh access to remote.host.com
```console
$ ssh -L 6000:localhost:1234 remote.host.com
```
!!! note
Port number 6000 is chosen as an example only. Pick any free port.
### Port Forwarding From Compute Nodes
Remote port forwarding from compute nodes allows applications running on the compute nodes to access hosts outside the Anselm Cluster.
First, establish the remote port forwarding form the login node, as [described above][5].
Second, invoke port forwarding from the compute node to the login node. Insert the following line into your jobscript or interactive shell:
```console
$ ssh -TN -f -L 6000:localhost:6000 login1
```
In this example, we assume that port forwarding from `login1:6000` to `remote.host.com:1234` has been established beforehand. By accessing `localhost:6000`, an application running on a compute node will see the response of `remote.host.com:1234`.
### Using Proxy Servers
Port forwarding is static, each single port is mapped to a particular port on a remote host. Connection to another remote host requires a new forward.
!!! note
Applications with inbuilt proxy support experience unlimited access to remote hosts via a single proxy server.
To establish a local proxy server on your workstation, install and run SOCKS proxy server software. On Linux, sshd demon provides the functionality. To establish SOCKS proxy server listening on port 1080 run:
```console
$ ssh -D 1080 localhost
```
On Windows, install and run the free, open source [Sock Puppet][e] server.
Once the proxy server is running, establish ssh port forwarding from Anselm to the proxy server, port 1080, exactly as [described above][5]:
```console
$ ssh -R 6000:localhost:1080 anselm.it4i.cz
```
Now, configure the applications proxy settings to **localhost:6000**. Use port forwarding to access the [proxy server from compute nodes][5] as well.
## Graphical User Interface
* The [X Window system][6] is the principal way to get GUI access to the clusters.
* [Virtual Network Computing][7] is a graphical [desktop sharing][f] system that uses the [Remote Frame Buffer protocol][g] to remotely control another [computer][h].
## VPN Access
* Access IT4Innovations internal resources via [VPN][8].
[1]: ../general/accessing-the-clusters/shell-access-and-data-transfer/ssh-keys.md
[2]: ../general/accessing-the-clusters/shell-access-and-data-transfer/putty.md
[3]: storage.md#shared-filesystems
[4]: storage.md
[5]: #port-forwarding-from-login-nodes
[6]: ../general/accessing-the-clusters/graphical-user-interface/x-window-system.md
[7]: ../general/accessing-the-clusters/graphical-user-interface/vnc.md
[8]: ../general/accessing-the-clusters/vpn-access.md
[a]: http://en.wikipedia.org/wiki/Secure_copy
[b]: http://linux.die.net/man/1/sshfs
[c]: http://winscp.net/eng/download.php
[d]: http://code.google.com/p/win-sshfs/
[e]: http://sockspuppet.com/
[f]: http://en.wikipedia.org/wiki/Desktop_sharing
[g]: http://en.wikipedia.org/wiki/RFB_protocol
[h]: http://en.wikipedia.org/wiki/Computer
# NVIDIA CUDA
Guide to NVIDIA CUDA Programming and GPU Usage
## CUDA Programming on Anselm
The default programming model for GPU accelerators on Anselm is Nvidia CUDA. To set up the environment for CUDA use;
```console
$ ml av cuda
$ ml cuda **or** ml CUDA
```
If the user code is hybrid and uses both CUDA and MPI, the MPI environment has to be set up as well. One way to do this is to use the PrgEnv-gnu module, which sets up the correct combination of the GNU compiler and MPI library;
```console
$ ml PrgEnv-gnu
```
CUDA code can be compiled directly on login1 or login2 nodes. The user does not have to use compute nodes with GPU accelerators for compilation. To compile CUDA source code, use an nvcc compiler;
```console
$ nvcc --version
```
The CUDA Toolkit comes with large number of examples which can be a helpful reference to start with. To compile and test these examples, users should copy them to their home directory;
```console
$ cd ~
$ mkdir cuda-samples
$ cp -R /apps/nvidia/cuda/6.5.14/samples/* ~/cuda-samples/
```
To compile examples, change directory to the particular example (here the example used is deviceQuery) and run "make" to start the compilation;
```console
$ cd ~/cuda-samples/1_Utilities/deviceQuery
$ make
```
To run the code, the user can use PBS interactive session to get access to a node from qnvidia queue (note: use your project name with parameter -A in the qsub command) and execute the binary file;
```console
$ qsub -I -q qnvidia -A OPEN-0-0
$ ml cuda
$ ~/cuda-samples/1_Utilities/deviceQuery/deviceQuery
```
The expected output of the deviceQuery example executed on a node with a Tesla K20m is;
```console
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Tesla K20m"
CUDA Driver Version / Runtime Version 5.0 / 5.0
CUDA Capability Major/Minor version number: 3.5
Total amount of global memory: 4800 MBytes (5032706048 bytes)
(13) Multiprocessors x (192) CUDA Cores/MP: 2496 CUDA Cores
GPU Clock rate: 706 MHz (0.71 GHz)
Memory Clock rate: 2600 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 1310720 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.0, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = Tesla K20m
```
### Code Example
In this section we provide a basic CUDA based vector addition code example. You can directly copy and paste the code to test it.
```cpp
$ vim test.cu
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
#include <stdio.h>
#include <stdlib.h>
// GPU kernel function to add two vectors
__global__ void add_gpu( int *a, int *b, int *c, int n){
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
c[index] = a[index] + b[index];
}
// CPU function to add two vectors
void add_cpu (int *a, int *b, int *c, int n) {
for (int i=0; i < n; i++)
c[i] = a[i] + b[i];
}
// CPU function to generate a vector of random integers
void random_ints (int *a, int n) {
for (int i = 0; i < n; i++)
a[i] = rand() % 10000; // random number between 0 and 9999
}
// CPU function to compare two vectors
int compare_ints( int *a, int *b, int n ){
int pass = 0;
for (int i = 0; i < N; i++){
if (a[i] != b[i]) {
printf("Value mismatch at location %d, values %d and %dn",i, a[i], b[i]);
pass = 1;
}
}
if (pass == 0) printf ("Test passedn"); else printf ("Test Failedn");
return pass;
}
int main( void ) {
int *a, *b, *c; // host copies of a, b, c
int *dev_a, *dev_b, *dev_c; // device copies of a, b, c
int size = N * sizeof( int ); // we need space for N integers
// Allocate GPU/device copies of dev_a, dev_b, dev_c
cudaMalloc( (void**)&dev_a, size );
cudaMalloc( (void**)&dev_b, size );
cudaMalloc( (void**)&dev_c, size );
// Allocate CPU/host copies of a, b, c
a = (int*)malloc( size );
b = (int*)malloc( size );
c = (int*)malloc( size );
// Fill input vectors with random integer numbers
random_ints( a, N );
random_ints( b, N );
// copy inputs to device
cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice );
cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice );
// launch add_gpu() kernel with blocks and threads
add_gpu<<< N/THREADS_PER_BLOCK, THREADS_PER_BLOCK >>( dev_a, dev_b, dev_c, N );
// copy device result back to host copy of c
cudaMemcpy( c, dev_c, size, cudaMemcpyDeviceToHost );
//Check the results with CPU implementation
int *c_h; c_h = (int*)malloc( size );
add_cpu (a, b, c_h, N);
compare_ints(c, c_h, N);
// Clean CPU memory allocations
free( a ); free( b ); free( c ); free (c_h);
// Clean GPU memory allocations
cudaFree( dev_a );
cudaFree( dev_b );
cudaFree( dev_c );
return 0;
}
```
This code can be compiled using the following command;
```console
$ nvcc test.cu -o test_cuda
```
To run the code, use an interactive PBS session to get access to one of the GPU accelerated nodes;
```console
$ qsub -I -q qnvidia -A OPEN-0-0
$ ml cuda
$ ./test.cuda
```
## CUDA Libraries
### cuBLAS
The NVIDIA CUDA Basic Linear Algebra Subroutines (cuBLAS) library is a GPU-accelerated version of the complete standard BLAS library with 152 standard BLAS routines. A basic description of the library together with basic performance comparisons with MKL can be found [here][a].
#### cuBLAS Example: SAXPY
The SAXPY function multiplies the vector x by the scalar alpha, and adds it to the vector y, overwriting the latest vector with the result. A description of the cuBLAS function can be found in [NVIDIA CUDA documentation][b]. Code can be pasted in the file and compiled without any modification.
```cpp
/* Includes, system */
#include <stdio.h>
#include <stdlib.h>
/* Includes, cuda */
#include <cuda_runtime.h>
#include <cublas_v2.h>
/* Vector size */
#define N (32)
/* Host implementation of a simple version of saxpi */
void saxpy(int n, float alpha, const float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] = alpha*x[i] + y[i];
}
/* Main */
int main(int argc, char **argv)
{
float *h_X, *h_Y, *h_Y_ref;
float *d_X = 0;
float *d_Y = 0;
const float alpha = 1.0f;
int i;
cublasHandle_t handle;
/* Initialize CUBLAS */
printf("simpleCUBLAS test running..n");
cublasCreate(&handle);
/* Allocate host memory for the matrices */
h_X = (float *)malloc(N * sizeof(h_X[0]));
h_Y = (float *)malloc(N * sizeof(h_Y[0]));
h_Y_ref = (float *)malloc(N * sizeof(h_Y_ref[0]));
/* Fill the matrices with test data */
for (i = 0; i < N; i++)
{
h_X[i] = rand() / (float)RAND_MAX;
h_Y[i] = rand() / (float)RAND_MAX;
h_Y_ref[i] = h_Y[i];
}
/* Allocate device memory for the matrices */
cudaMalloc((void **)&d_X, N * sizeof(d_X[0]));
cudaMalloc((void **)&d_Y, N * sizeof(d_Y[0]));
/* Initialize the device matrices with the host matrices */
cublasSetVector(N, sizeof(h_X[0]), h_X, 1, d_X, 1);
cublasSetVector(N, sizeof(h_Y[0]), h_Y, 1, d_Y, 1);
/* Performs operation using plain C code */
saxpy(N, alpha, h_X, h_Y_ref);
/* Performs operation using cublas */
cublasSaxpy(handle, N, &alpha, d_X, 1, d_Y, 1);
/* Read the result back */
cublasGetVector(N, sizeof(h_Y[0]), d_Y, 1, h_Y, 1);
/* Check result against reference */
for (i = 0; i < N; ++i)
printf("CPU res = %f t GPU res = %f t diff = %f n", h_Y_ref[i], h_Y[i], h_Y_ref[i] - h_Y[i]);
/* Memory clean up */
free(h_X); free(h_Y); free(h_Y_ref);