Commit c87b1044 authored by Lukáš Krupčík's avatar Lukáš Krupčík

fix

parents 76c18542 9457dacf
......@@ -2,14 +2,14 @@
## Introduction
In many cases, it is useful to submit huge (>100+) number of computational jobs into the PBS queue system. Huge number of (small) jobs is one of the most effective ways to execute embarrassingly parallel calculations, achieving best runtime, throughput and computer utilization.
In many cases, it is useful to submit a huge (>100+) number of computational jobs into the PBS queue system. A huge number of (small) jobs is one of the most effective ways to execute embarrassingly parallel calculations, achieving the best runtime, throughput, and computer utilization.
However, executing huge number of jobs via the PBS queue may strain the system. This strain may result in slow response to commands, inefficient scheduling and overall degradation of performance and user experience, for all users. For this reason, the number of jobs is **limited to 100 per user, 1000 per job array**
However, executing a huge number of jobs via the PBS queue may strain the system. This strain may result in slow response to commands, inefficient scheduling, and overall degradation of performance and user experience for all users. For this reason, the number of jobs is **limited to 100 per user and 1000 per job array**.
!!! note
Please follow one of the procedures below in case you wish to schedule more than 100 jobs at a time.
* Use [Job arrays](capacity-computing/#job-arrays) when running huge number of [multithread](capacity-computing/#shared-jobscript-on-one-node) (bound to one node only) or multinode (multithread across several nodes) jobs
* Use [Job arrays](capacity-computing/#job-arrays) when running a huge number of [multithread](capacity-computing/#shared-jobscript-on-one-node) (bound to one node only) or multinode (multithread across several nodes) jobs
* Use [GNU parallel](capacity-computing/#gnu-parallel) when running single core jobs
* Combine [GNU parallel with Job arrays](capacity-computing/#job-arrays-and-gnu-parallel) when running a huge number of single core jobs
......@@ -21,7 +21,7 @@ However, executing huge number of jobs via the PBS queue may strain the system.
## Job Arrays
!!! note
Huge number of jobs may be easily submitted and managed as a job array.
A huge number of jobs may easily be submitted and managed as a job array.
A job array is a compact representation of many jobs, called subjobs. The subjobs share the same job script, and have the same values for all attributes and resources, with the following exceptions:
......@@ -29,15 +29,15 @@ A job array is a compact representation of many jobs, called subjobs. The subjob
* job Identifiers of subjobs only differ by their indices
* the state of subjobs can differ (R,Q,...etc.)
All subjobs within a job array have the same scheduling priority and schedule as independent jobs. Entire job array is submitted through a single qsub command and may be managed by qdel, qalter, qhold, qrls and qsig commands as a single job.
All subjobs within a job array have the same scheduling priority and schedule as independent jobs. An entire job array is submitted through a single qsub command and may be managed by qdel, qalter, qhold, qrls, and qsig commands as a single job.
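For illustration (using the array ID 12345[].dm2 that appears in the examples below), the whole array can be held, released, or deleted with a single command:

```console
$ qhold 12345[].dm2   # hold all queued subjobs
$ qrls 12345[].dm2    # release the held subjobs
$ qdel 12345[].dm2    # delete the entire job array
```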
### Shared Jobscript
All subjobs in job array use the very same, single jobscript. Each subjob runs its own instance of the jobscript. The instances execute different work controlled by $PBS_ARRAY_INDEX variable.
All subjobs in a job array use the very same, single jobscript. Each subjob runs its own instance of the jobscript. The instances execute different work controlled by the $PBS_ARRAY_INDEX variable.
Example:
Assume we have 900 input files with name beginning with "file" (e. g. file001, ..., file900). Assume we would like to use each of these input files with program executable myprog.x, each as a separate job.
Assume we have 900 input files, with the name of each beginning with "file" (e.g. file001, ..., file900). Assume we would like to use each of these input files with the program executable myprog.x, each as a separate job.
First, we create a tasklist file (or subjobs list), listing all tasks (subjobs) - all input files in our example:
......@@ -45,7 +45,7 @@ First, we create a tasklist file (or subjobs list), listing all tasks (subjobs)
$ find . -name 'file*' > tasklist
```
Then we create jobscript:
Then we create the jobscript:
```bash
#!/bin/bash
......@@ -70,9 +70,9 @@ cp $PBS_O_WORKDIR/$TASK input ; cp $PBS_O_WORKDIR/myprog.x .
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, the submit directory holds the 900 input files, executable myprog.x and the jobscript file. As input for each run, we take the filename of input file from created tasklist file. We copy the input file to local scratch /lscratch/$PBS_JOBID, execute the myprog.x and copy the output file back to the submit directory, under the $TASK.out name. The myprog.x runs on one node only and must use threads to run in parallel. Be aware, that if the myprog.x **is not multithreaded**, then all the **jobs are run as single thread programs in sequential** manner. Due to allocation of the whole node, the accounted time is equal to the usage of whole node, while using only 1/16 of the node!
In this example, the submit directory holds the 900 input files, the executable myprog.x, and the jobscript file. As input for each run, we take the filename of the input file from the created tasklist file. We copy the input file to the local scratch directory /lscratch/$PBS_JOBID, execute myprog.x, and copy the output file back to the submit directory, under the $TASK.out name. The myprog.x runs on one node only and must use threads to run in parallel. Be aware that if myprog.x **is not multithreaded**, then all the **jobs are run as single thread programs in a sequential** manner. Due to the allocation of the whole node, the accounted time is equal to the usage of the whole node, while using only 1/16 of the node!
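A minimal sketch of a complete jobscript of this kind is shown below (the #PBS resource values and the `./myprog.x < input > output` invocation are illustrative assumptions):

```bash
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=16,walltime=02:00:00

# change to the node-local scratch directory
SCR=/lscratch/$PBS_JOBID
mkdir -p $SCR ; cd $SCR || exit

# pick the task (input filename) for this subjob from the tasklist
TASK=$(sed -n "${PBS_ARRAY_INDEX}p" $PBS_O_WORKDIR/tasklist)

# copy the input file and the executable to scratch
cp $PBS_O_WORKDIR/$TASK input ; cp $PBS_O_WORKDIR/myprog.x .

# execute the calculation (assumed to read stdin and write stdout)
./myprog.x < input > output

# copy the output file back to the submit directory
cp output $PBS_O_WORKDIR/$TASK.out
```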
If huge number of parallel multicore (in means of multinode multithread, e. g. MPI enabled) jobs is needed to run, then a job array approach should also be used. The main difference compared to previous example using one node is that the local scratch should not be used (as it's not shared between nodes) and MPI or other technique for parallel multinode run has to be used properly.
If a huge number of parallel multicore jobs (i.e. multinode, multithreaded, e.g. MPI-enabled) needs to be run, then the job array approach should also be used. The main difference compared to the previous single-node example is that the local scratch should not be used (as it is not shared between nodes) and MPI or another technique for parallel multinode execution has to be used properly, as sketched below.
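For illustration only, the lines that would change in such a multinode variant might look as follows (the shared scratch path, the module name, and the MPI launcher are assumptions):

```bash
#PBS -l select=4:ncpus=16,walltime=02:00:00   # several nodes per subjob

# use a shared scratch directory instead of the node-local /lscratch
SCR=/scratch/$USER/$PBS_JOBID
mkdir -p $SCR ; cd $SCR || exit

# load an MPI environment and launch across all allocated nodes
module load PrgEnv-gnu                        # bullx MPI, as mentioned in the modules section
mpirun ./myprog.x < input > output
```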
### Submit the Job Array
......@@ -83,9 +83,9 @@ $ qsub -N JOBNAME -J 1-900 jobscript
12345[].dm2
```
In this example, we submit a job array of 900 subjobs. Each subjob will run on full node and is assumed to take less than 2 hours (please note the #PBS directives in the beginning of the jobscript file, dont' forget to set your valid PROJECT_ID and desired queue).
In this example, we submit a job array of 900 subjobs. Each subjob will run on one full node and is assumed to take less than 2 hours (note the #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and the desired queue).
Sometimes for testing purposes, you may need to submit only one-element array. This is not allowed by PBSPro, but there's a workaround:
Sometimes, for testing purposes, you may need to submit an array with only one element. This is not allowed by PBSPro, but there is a workaround:
```console
$ qsub -N JOBNAME -J 9-10:2 jobscript
......@@ -95,7 +95,7 @@ This will only choose the lower index (9 in this example) for submitting/running
### Manage the Job Array
Check status of the job array by the qstat command.
Check the status of the job array using the qstat command.
```console
$ qstat -a 12345[].dm2
......@@ -107,8 +107,8 @@ Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
12345[].dm2 user2 qprod xx 13516 1 16 -- 00:50 B 00:02
```
The status B means that some subjobs are already running.
Check status of the first 100 subjobs by the qstat command.
When the status is B, it means that some subjobs are already running.
Check the status of the first 100 subjobs using the qstat command.
```console
$ qstat -a 12345[1-100].dm2
......@@ -152,7 +152,7 @@ Read more on job arrays in the [PBSPro Users guide](../pbspro/).
!!! note
Use GNU parallel to run many single core tasks on one node.
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. GNU parallel is most useful in running single core jobs via the queue system on Anselm.
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. GNU parallel is most useful when running single core jobs via the queue system on Anselm.
For more information and examples see the parallel man page:
......@@ -175,7 +175,7 @@ First, we create a tasklist file, listing all tasks - all input files in our exa
$ find . -name 'file*' > tasklist
```
Then we create jobscript:
Then we create a jobscript:
```bash
#!/bin/bash
......@@ -203,11 +203,11 @@ cat input > output
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, tasks from tasklist are executed via the GNU parallel. The jobscript executes multiple instances of itself in parallel, on all cores of the node. Once an instace of jobscript is finished, new instance starts until all entries in tasklist are processed. Currently processed entry of the joblist may be retrieved via $1 variable. Variable $TASK expands to one of the input filenames from tasklist. We copy the input file to local scratch, execute the myprog.x and copy the output file back to the submit directory, under the $TASK.out name.
In this example, tasks from the tasklist are executed via GNU parallel. The jobscript executes multiple instances of itself in parallel, on all cores of the node. Once an instance of the jobscript is finished, a new instance starts, until all entries in the tasklist are processed. The currently processed entry of the tasklist may be retrieved via the $1 variable. The variable $TASK expands to one of the input filenames from the tasklist. We copy the input file to the local scratch, execute myprog.x, and copy the output file back to the submit directory, under the $TASK.out name.
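A sketch of how such a GNU-parallel-driven jobscript might be structured (a minimal sketch; the `parallel` module name and the placeholder `cat input > output` calculation are assumptions):

```bash
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=16,walltime=02:00:00

# on first invocation, re-execute this script via GNU parallel, one instance per tasklist entry
[ -z "$PARALLEL_SEQ" ] &&
{ module add parallel ; exec parallel -a $PBS_O_WORKDIR/tasklist $0 ; }

# change to a per-instance local scratch directory
SCR=/lscratch/$PBS_JOBID/$PARALLEL_SEQ
mkdir -p $SCR ; cd $SCR || exit

# the currently processed tasklist entry is passed as the first argument
TASK=$1

# copy the input file to scratch
cp $PBS_O_WORKDIR/$TASK input

# execute the calculation (placeholder)
cat input > output

# copy the output file back to the submit directory
cp output $PBS_O_WORKDIR/$TASK.out
```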
### Submit the Job
To submit the job, use the qsub command. The 101 tasks' job of the [example above](capacity-computing/#gp_example) may be submitted like this:
To submit the job, use the qsub command. The 101 task job of the [example above](capacity-computing/#gp_example) may be submitted as follows:
```console
$ qsub -N JOBNAME jobscript
......@@ -217,25 +217,25 @@ $ qsub -N JOBNAME jobscript
In this example, we submit a job of 101 tasks. 16 input files will be processed in parallel. The 101 tasks on 16 cores are assumed to complete in less than 2 hours.
!!! hint
Use #PBS directives in the beginning of the jobscript file, dont' forget to set your valid PROJECT_ID and desired queue.
Use #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and the desired queue.
## Job Arrays and GNU Parallel
!!! note
Combine the Job arrays and GNU parallel for best throughput of single core jobs
Combine the Job arrays and GNU parallel for the best throughput of single core jobs
While job arrays are able to utilize all available computational nodes, the GNU parallel can be used to efficiently run multiple single-core jobs on single node. The two approaches may be combined to utilize all available (current and future) resources to execute single core jobs.
While job arrays are able to utilize all available computational nodes, the GNU parallel can be used to efficiently run multiple single-core jobs on a single node. The two approaches may be combined to utilize all available (current and future) resources to execute single core jobs.
!!! note
Every subjob in an array runs GNU parallel to utilize all cores on the node
### GNU Parallel, Shared jobscript
Combined approach, very similar to job arrays, can be taken. Job array is submitted to the queuing system. The subjobs run GNU parallel. The GNU parallel shell executes multiple instances of the jobscript using all cores on the node. The instances execute different work, controlled by the $PBS_JOB_ARRAY and $PARALLEL_SEQ variables.
A combined approach, very similar to job arrays, can be taken. A job array is submitted to the queuing system. The subjobs run GNU parallel. The GNU parallel shell executes multiple instances of the jobscript using all of the cores on the node. The instances execute different work, controlled by the $PBS_ARRAY_INDEX and $PARALLEL_SEQ variables.
Example:
Assume we have 992 input files with name beginning with "file" (e. g. file001, ..., file992). Assume we would like to use each of these input files with program executable myprog.x, each as a separate single core job. We call these single core jobs tasks.
Assume we have 992 input files, with each name beginning with "file" (e.g. file001, ..., file992). Assume we would like to use each of these input files with the program executable myprog.x, each as a separate single core job. We call these single core jobs tasks.
First, we create a tasklist file, listing all tasks - all input files in our example:
......@@ -243,13 +243,13 @@ First, we create a tasklist file, listing all tasks - all input files in our exa
$ find . -name 'file*' > tasklist
```
Next we create a file, controlling how many tasks will be executed in one subjob
Next we create a file, controlling how many tasks will be executed in one subjob:
```console
$ seq 32 > numtasks
```
Then we create jobscript:
Then we create a jobscript:
```bash
#!/bin/bash
......@@ -279,34 +279,34 @@ cat input > output
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, the jobscript executes in multiple instances in parallel, on all cores of a computing node. Variable $TASK expands to one of the input filenames from tasklist. We copy the input file to local scratch, execute the myprog.x and copy the output file back to the submit directory, under the $TASK.out name. The numtasks file controls how many tasks will be run per subjob. Once an task is finished, new task starts, until the number of tasks in numtasks file is reached.
In this example, the jobscript executes in multiple instances in parallel, on all cores of a computing node. The variable $TASK expands to one of the input filenames from the tasklist. We copy the input file to the local scratch, execute myprog.x, and copy the output file back to the submit directory, under the $TASK.out name. The numtasks file controls how many tasks will be run per subjob. Once a task is finished, a new task starts, until the number of tasks in the numtasks file is reached.
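For orientation, a sketch of how the combined jobscript might be structured (again a sketch: the `parallel` module name and the placeholder `cat input > output` calculation are assumptions):

```bash
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=16,walltime=02:00:00

# on first invocation, re-execute this script via GNU parallel, once per line of numtasks
[ -z "$PARALLEL_SEQ" ] &&
{ module add parallel ; exec parallel -a $PBS_O_WORKDIR/numtasks $0 ; }

# change to a per-instance local scratch directory
SCR=/lscratch/$PBS_JOBID/$PARALLEL_SEQ
mkdir -p $SCR ; cd $SCR || exit

# combine the job array index and the GNU parallel sequence number to pick the task
IDX=$(($PBS_ARRAY_INDEX + $PARALLEL_SEQ - 1))
TASK=$(sed -n "${IDX}p" $PBS_O_WORKDIR/tasklist)
[ -z "$TASK" ] && exit

# copy the input file, run the (placeholder) calculation, copy the output back
cp $PBS_O_WORKDIR/$TASK input
cat input > output
cp output $PBS_O_WORKDIR/$TASK.out
```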
!!! note
Select subjob walltime and number of tasks per subjob carefully
When deciding this values, think about following guiding rules:
When deciding on these values, keep in mind the following guiding rules:
1. Let n=N/16. Inequality (n+1) \* T < W should hold. The N is number of tasks per subjob, T is expected single task walltime and W is subjob walltime. Short subjob walltime improves scheduling and job throughput.
1. Number of tasks should be modulo 16.
1. Let n=N/16. The inequality (n+1) \* T < W should hold, where N is the number of tasks per subjob, T is the expected single-task walltime, and W is the subjob walltime. A short subjob walltime improves scheduling and job throughput (see the worked example after this list).
1. The number of tasks should be a multiple of 16.
1. These rules are valid only when all tasks have similar task walltimes T.
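For example, assuming N = 32 tasks per subjob (so n = 2) and an expected single-task walltime of T = 30 minutes, the subjob walltime should satisfy (2+1) \* 30 min < W, so the 2-hour walltime assumed in these examples is a safe choice.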
### Submit the Job Array (-J)
To submit the job array, use the qsub -J command. The 992 tasks' job of the [example above](capacity-computing/#combined_example) may be submitted like this:
To submit the job array, use the qsub -J command. The 992 task job of the [example above](capacity-computing/#combined_example) may be submitted like this:
```console
$ qsub -N JOBNAME -J 1-992:32 jobscript
12345[].dm2
```
In this example, we submit a job array of 31 subjobs. Note the -J 1-992:**32**, this must be the same as the number sent to numtasks file. Each subjob will run on full node and process 16 input files in parallel, 32 in total per subjob. Every subjob is assumed to complete in less than 2 hours.
In this example, we submit a job array of 31 subjobs. Note the -J 1-992:**32**; this must be the same as the number written to the numtasks file. Each subjob will run on one full node and process 16 input files in parallel, 32 in total per subjob. Every subjob is assumed to complete in less than 2 hours.
!!! hint
Use #PBS directives in the beginning of the jobscript file, dont' forget to set your valid PROJECT_ID and desired queue.
Use #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and the desired queue.
## Examples
Download the examples in [capacity.zip](capacity.zip), illustrating the above listed ways to run huge number of jobs. We recommend to try out the examples, before using this for running production jobs.
Download the examples in [capacity.zip](capacity.zip), illustrating the ways to run a huge number of jobs listed above. We recommend trying out the examples before using these approaches for production jobs.
Unzip the archive in an empty directory on Anselm and follow the instructions in the README file.
......
# Compute Nodes
## Nodes Configuration
## Node Configuration
Anselm is cluster of x86-64 Intel based nodes built on Bull Extreme Computing bullx technology. The cluster contains four types of compute nodes.
Anselm is a cluster of x86-64 Intel-based nodes built with Bull Extreme Computing bullx technology. The cluster contains four types of compute nodes.
### Compute Nodes Without Accelerator
### Compute Nodes Without Accelerators
* 180 nodes
* 2880 cores in total
......@@ -14,7 +14,7 @@ Anselm is cluster of x86-64 Intel based nodes built on Bull Extreme Computing bu
* bullx B510 blade servers
* cn[1-180]
### Compute Nodes With GPU Accelerator
### Compute Nodes With a GPU Accelerator
* 23 nodes
* 368 cores in total
......@@ -25,7 +25,7 @@ Anselm is cluster of x86-64 Intel based nodes built on Bull Extreme Computing bu
* bullx B515 blade servers
* cn[181-203]
### Compute Nodes With MIC Accelerator
### Compute Nodes With a MIC Accelerator
* 4 nodes
* 64 cores in total
......@@ -42,26 +42,26 @@ Anselm is cluster of x86-64 Intel based nodes built on Bull Extreme Computing bu
* 32 cores in total
* 2 Intel Sandy Bridge E5-2665, 8-core, 2.4GHz processors per node
* 512 GB of physical memory per node
* two 300GB SAS 3,5”15krpm HDD (RAID1) per node
* two 300GB SAS 3.5” 15krpm HDDs (RAID1) per node
* two 100GB SLC SSD per node
* bullx R423-E3 servers
* cn[208-209]
![](../img/bullxB510.png)
**Figure Anselm bullx B510 servers**
**Anselm bullx B510 servers**
### Compute Nodes Summary
### Compute Node Summary
| Node type | Count | Range | Memory | Cores | [Access](resources-allocation-policy/) |
| -------------------------- | ----- | ----------- | ------ | ----------- | -------------------------------------- |
| Nodes without accelerator | 180 | cn[1-180] | 64GB | 16 @ 2.4GHz | qexp, qprod, qlong, qfree, qatlas, qprace |
| Nodes with GPU accelerator | 23 | cn[181-203] | 96GB | 16 @ 2.3GHz | qnvidia, qexp, qatlas |
| Nodes with MIC accelerator | 4 | cn[204-207] | 96GB | 16 @ 2.3GHz | qmic, qexp |
| Fat compute nodes | 2 | cn[208-209] | 512GB | 16 @ 2.4GHz | qfat, qexp |
| Node type | Count | Range | Memory | Cores | [Access](resources-allocation-policy/) |
| ---------------------------- | ----- | ----------- | ------ | ----------- | -------------------------------------- |
| Nodes without an accelerator | 180 | cn[1-180] | 64GB | 16 @ 2.4GHz | qexp, qprod, qlong, qfree, qprace, qatlas |
| Nodes with a GPU accelerator | 23 | cn[181-203] | 96GB | 16 @ 2.3GHz | qgpu, qexp |
| Nodes with a MIC accelerator | 4 | cn[204-207] | 96GB | 16 @ 2.3GHz | qmic, qexp |
| Fat compute nodes | 2 | cn[208-209] | 512GB | 16 @ 2.4GHz | qfat, qexp |
## Processor Architecture
Anselm is equipped with Intel Sandy Bridge processors Intel Xeon E5-2665 (nodes without accelerator and fat nodes) and Intel Xeon E5-2470 (nodes with accelerator). Processors support Advanced Vector Extensions (AVX) 256-bit instruction set.
Anselm is equipped with Intel Sandy Bridge processors: Intel Xeon E5-2665 (nodes without accelerators and fat nodes) and Intel Xeon E5-2470 (nodes with accelerators). The processors support the Advanced Vector Extensions (AVX) 256-bit instruction set.
### Intel Sandy Bridge E5-2665 Processor
......@@ -83,7 +83,7 @@ Anselm is equipped with Intel Sandy Bridge processors Intel Xeon E5-2665 (nodes
* L3: 20 MB per processor
* memory bandwidth at the level of the processor: 38.4 GB/s
Nodes equipped with Intel Xeon E5-2665 CPU have set PBS resource attribute cpu_freq = 24, nodes equipped with Intel Xeon E5-2470 CPU have set PBS resource attribute cpu_freq = 23.
Nodes equipped with the Intel Xeon E5-2665 CPU have the PBS resource attribute cpu_freq = 24 set, while nodes equipped with the Intel Xeon E5-2470 CPU have cpu_freq = 23 set.
```console
$ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=16:cpu_freq=24 -I
......@@ -99,7 +99,7 @@ $ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=16 -l cpu_turbo_boost=0 -I
## Memory Architecture
### Compute Node Without Accelerator
### Compute Nodes Without Accelerators
* 2 sockets
* Memory Controllers are integrated into processors.
......@@ -109,7 +109,7 @@ $ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=16 -l cpu_turbo_boost=0 -I
* Data rate support: up to 1600MT/s
* Populated memory: 8 x 8 GB DDR3 DIMM 1600 MHz
### Compute Node With GPU or MIC Accelerator
### Compute Nodes With a GPU or MIC Accelerator
* 2 sockets
* Memory Controllers are integrated into processors.
......@@ -119,7 +119,7 @@ $ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=16 -l cpu_turbo_boost=0 -I
* Data rate support: up to 1600MT/s
* Populated memory: 6 x 16 GB DDR3 DIMM 1600 MHz
### Fat Compute Node
### Fat Compute Nodes
* 2 sockets
* Memory Controllers are integrated into processors.
......
# Environment and Modules
## Environment Customization
After logging in, you may want to configure the environment. Write your preferred path definitions, aliases, functions, and module loads in the .bashrc file:
```console
$ cat .bashrc
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# User specific aliases and functions
alias qs='qstat -a'
module load PrgEnv-gnu
# Display information to standard output - only in interactive ssh session
if [ -n "$SSH_TTY" ]
then
module list # Display loaded modules
fi
```
!!! note
Do not run commands outputting to standard output (echo, module list, etc.) in .bashrc for non-interactive SSH sessions. It breaks the fundamental functionality (scp, PBS) of your account! Consider guarding such commands with a check for SSH session interactivity, as shown in the example above.
## Application Modules
In order to configure your shell for running a particular application on Anselm, we use the Module package interface.
!!! note
The modules set up the application paths, library paths, and environment variables for running a particular application.
A second module repository is also available. This repository is created using a tool called EasyBuild. On the Salomon cluster, all modules are built with this tool. If you want to use software from this module repository, please follow the instructions in the section [Application Modules Path Expansion](environment-and-modules/#application-modules-path-expansion).
The modules may be loaded, unloaded, and switched as required.
To check the available modules, use:
```console
$ module avail **or** ml av
```
To load a module, for example the octave module, use:
```console
$ module load octave **or** ml octave
```
Loading the octave module will set up the paths and environment variables of your active shell such that you are ready to run the octave software.
To check the loaded modules, use:
```console
$ module list **or** ml
```
To unload a module, for example the octave module, use:
```console
$ module unload octave **or** ml -octave
```
Learn more about modules by reading the module man page:
```console
$ man module
```
The following modules set up the development environment:
PrgEnv-gnu sets up the GNU development environment in conjunction with the bullx MPI library.
PrgEnv-intel sets up the INTEL development environment in conjunction with the Intel MPI library.
## Application Modules Path Expansion
All application modules on the Salomon cluster (and further) are built using a tool called [EasyBuild](http://hpcugent.github.io/easybuild/ "EasyBuild"). If you want to use applications that have already been built by EasyBuild, you have to modify your MODULEPATH environment variable.
```console
export MODULEPATH=$MODULEPATH:/apps/easybuild/modules/all/
```
This command expands the set of paths searched for modules. You can also add this command to the .bashrc file to expand the paths permanently. After running this command, you can use the same commands to list/add/remove modules as described above.
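For example, to make the expansion permanent, you might append the export to your .bashrc (a sketch; adjust the path if your setup differs):

```console
$ echo 'export MODULEPATH=$MODULEPATH:/apps/easybuild/modules/all/' >> ~/.bashrc
$ ml av    # the EasyBuild-built modules now appear in the listing
```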
# Hardware Overview
The Anselm cluster consists of 209 computational nodes named cn[1-209] of which 180 are regular compute nodes, 23 GPU Kepler K20m accelerated nodes, 4 MIC Xeon Phi 5110P accelerated nodes and 2 fat nodes. Each node is a powerful x86-64 computer, equipped with 16 cores (two eight-core Intel Sandy Bridge processors), at least 64 GB RAM, and local hard drive. The user access to the Anselm cluster is provided by two login nodes login[1,2]. The nodes are interlinked by high speed InfiniBand and Ethernet networks. All nodes share 320 TB /home disk storage to store the user files. The 146 TB shared /scratch storage is available for the scratch data.
The Anselm cluster consists of 209 computational nodes named cn[1-209] of which 180 are regular compute nodes, 23 are GPU Kepler K20 accelerated nodes, 4 are MIC Xeon Phi 5110P accelerated nodes, and 2 are fat nodes. Each node is a powerful x86-64 computer, equipped with 16 cores (two eight-core Intel Sandy Bridge processors), at least 64 GB of RAM, and a local hard drive. User access to the Anselm cluster is provided by two login nodes login[1,2]. The nodes are interlinked through high speed InfiniBand and Ethernet networks. All nodes share a 320 TB /home disk for storage of user files. The 146 TB shared /scratch storage is available for scratch data.
The Fat nodes are equipped with large amount (512 GB) of memory. Fat nodes may access 45 TB of dedicated block storage. Accelerated nodes, fat nodes are available [upon request](https://support.it4i.cz/rt) made by a PI.
The Fat nodes are equipped with a large amount (512 GB) of memory. Virtualization infrastructure provides resources to run long-term servers and services in virtual mode. Fat nodes and virtual servers may access 45 TB of dedicated block storage. Accelerated nodes, fat nodes, and virtualization infrastructure are available [upon request](https://support.it4i.cz/rt) made by a PI.
Schematic representation of the Anselm cluster. Each box represents a node (computer) or storage capacity:
......@@ -12,21 +12,21 @@ The cluster compute nodes cn[1-207] are organized within 13 chassis.
There are four types of compute nodes:
* 180 compute nodes without the accelerator
* 23 compute nodes with GPU accelerator - equipped with NVIDIA Tesla Kepler K20m
* 4 compute nodes with MIC accelerator - equipped with Intel Xeon Phi 5110P
* 2 fat nodes - equipped with 512 GB RAM and two 100 GB SSD drives
* 180 compute nodes without an accelerator
* 23 compute nodes with a GPU accelerator - an NVIDIA Tesla Kepler K20m
* 4 compute nodes with a MIC accelerator - an Intel Xeon Phi 5110P
* 2 fat nodes - equipped with 512 GB of RAM and two 100 GB SSD drives
[More about Compute nodes](compute-nodes/).
GPU and accelerated nodes are available upon request, see the [Resources Allocation Policy](resources-allocation-policy/).
All these nodes are interconnected by fast InfiniBand network and Ethernet network. [More about the Network](network/).
Every chassis provides InfiniBand switch, marked **isw**, connecting all nodes in the chassis, as well as connecting the chassis to the upper level switches.
All of these nodes are interconnected through fast InfiniBand and Ethernet networks. [More about the Network](network/).
Every chassis provides an InfiniBand switch, marked **isw**, connecting all nodes in the chassis, as well as connecting the chassis to the upper level switches.
All nodes share 360 TB /home disk storage to store user files. The 146 TB shared /scratch storage is available for the scratch data. These file systems are provided by Lustre parallel file system. There is also local disk storage available on all compute nodes in /lscratch. [More about Storage](storage/).
All of the nodes share a 360 TB /home disk for storage of user files. The 146 TB shared /scratch storage is available for scratch data. These file systems are provided by the Lustre parallel file system. There is also local disk storage available on all compute nodes in /lscratch. [More about Storage](storage/).
The user access to the Anselm cluster is provided by two login nodes login1, login2, and data mover node dm1. [More about accessing cluster.](shell-and-data-access/)
User access to the Anselm cluster is provided by two login nodes, login1 and login2, and the data mover node dm1. [More about accessing the cluster.](shell-and-data-access/)
The parameters are summarized in the following tables:
......@@ -36,7 +36,7 @@ The parameters are summarized in the following tables:
| Architecture of compute nodes | x86-64 |
| Operating system | Linux (CentOS) |
| [**Compute nodes**](compute-nodes/) | |
| Totally | 209 |
| Total | 209 |
| Processor cores | 16 (2 x 8 cores) |
| RAM | min. 64 GB, min. 4 GB per core |
| Local disk drive | yes - usually 500 GB |
......@@ -57,4 +57,4 @@ The parameters are summarized in the following tables:
| MIC accelerated | 2 x Intel Sandy Bridge E5-2470, 2.3 GHz | 96 GB | Intel Xeon Phi 5110P |
| Fat compute node | 2 x Intel Sandy Bridge E5-2665, 2.4 GHz | 512 GB | - |
For more details please refer to the [Compute nodes](compute-nodes/), [Storage](storage/), and [Network](network/).
For more details please refer to [Compute nodes](compute-nodes/), [Storage](storage/), and [Network](network/).
# Introduction
Welcome to Anselm supercomputer cluster. The Anselm cluster consists of 209 compute nodes, totaling 3344 compute cores with 15 TB RAM and giving over 94 TFLOP/s theoretical peak performance. Each node is a powerful x86-64 computer, equipped with 16 cores, at least 64 GB RAM, and 500 GB hard disk drive. Nodes are interconnected by fully non-blocking fat-tree InfiniBand network and equipped with Intel Sandy Bridge processors. A few nodes are also equipped with NVIDIA Kepler GPU or Intel Xeon Phi MIC accelerators. Read more in [Hardware Overview](hardware-overview/).
Welcome to the Anselm supercomputer cluster. The Anselm cluster consists of 209 compute nodes, totalling 3344 compute cores with 15 TB RAM, giving over 94 TFLOP/s theoretical peak performance. Each node is a powerful x86-64 computer, equipped with 16 cores, at least 64 GB of RAM, and a 500 GB hard disk drive. Nodes are interconnected through a fully non-blocking fat-tree InfiniBand network, and are equipped with Intel Sandy Bridge processors. A few nodes are also equipped with NVIDIA Kepler GPU or Intel Xeon Phi MIC accelerators. Read more in [Hardware Overview](hardware-overview/).
The cluster runs operating system, which is compatible with the RedHat [Linux family.](http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg) We have installed a wide range of software packages targeted at different scientific domains. These packages are accessible via the [modules environment](../environment-and-modules/).
The cluster runs an [operating system](software/operating-system/) which is compatible with the RedHat [Linux family.](http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg) We have installed a wide range of software packages targeted at different scientific domains. These packages are accessible via the [modules environment](environment-and-modules/).
User data shared file-system (HOME, 320 TB) and job data shared file-system (SCRATCH, 146 TB) are available to users.
The user data shared file-system (HOME, 320 TB) and job data shared file-system (SCRATCH, 146 TB) are available to users.
The PBS Professional workload manager provides [computing resources allocations and job execution](resources-allocation-policy/).
......
......@@ -2,7 +2,7 @@
## Job Execution Priority
Scheduler gives each job an execution priority and then uses this job execution priority to select which job(s) to run.
The scheduler gives each job an execution priority and then uses this job execution priority to select which job(s) to run.
Job execution priority on Anselm is determined by these job properties (in order of importance):
......@@ -12,15 +12,15 @@ Job execution priority on Anselm is determined by these job properties (in order
### Queue Priority
Queue priority is priority of queue where job is queued before execution.
Queue priority is the priority of the queue in which the job is waiting prior to execution.
Queue priority has the biggest impact on job execution priority. Execution priority of jobs in higher priority queues is always greater than execution priority of jobs in lower priority queues. Other properties of job used for determining job execution priority (fair-share priority, eligible time) cannot compete with queue priority.
Queue priority has the biggest impact on job execution priority. The execution priority of jobs in higher priority queues is always greater than the execution priority of jobs in lower priority queues. Other properties of jobs used for determining the job execution priority (fair-share priority, eligible time) cannot compete with queue priority.
Queue priorities can be seen at <https://extranet.it4i.cz/anselm/queues>
### Fair-Share Priority
Fair-share priority is priority calculated on recent usage of resources. Fair-share priority is calculated per project, all members of project share same fair-share priority. Projects with higher recent usage have lower fair-share priority than projects with lower or none recent usage.
Fair-share priority is a priority calculated on the basis of recent usage of resources. It is calculated per project; all members of a project share the same fair-share priority. Projects with higher recent usage have a lower fair-share priority than projects with lower or no recent usage.
Fair-share priority is used for ranking jobs with equal queue priority.
......@@ -29,24 +29,24 @@ Fair-share priority is calculated as
---8<--- "fairshare_formula.md"
where MAX_FAIRSHARE has value 1E6,
usage<sub>Project</sub> is cumulated usage by all members of selected project,
usage<sub>Total</sub> is total usage by all users, by all projects.
usage<sub>Project</sub> is accumulated usage by all members of a selected project,
usage<sub>Total</sub> is total usage by all users, across all projects.
Usage counts allocated core-hours (`ncpus x walltime`). Usage is decayed, or cut in half periodically, at the interval 168 hours (one week).
Jobs queued in queue qexp are not calculated to project's usage.
Usage counts allocated core-hours (`ncpus x walltime`). Usage decays, halving at intervals of 168 hours (one week).
Jobs queued in the queue qexp are not used to calculate the project's usage.
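For example, under this decay, core-hours consumed three weeks ago are weighted by roughly one eighth when calculating usage<sub>Project</sub>.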
!!! note
Calculated usage and fair-share priority can be seen at <https://extranet.it4i.cz/anselm/projects>.
Calculated fair-share priority can be also seen as Resource_List.fairshare attribute of a job.
The calculated fair-share priority can also be seen in the Resource_List.fairshare attribute of a job.
### Eligible Time
Eligible time is amount (in seconds) of eligible time job accrued while waiting to run. Jobs with higher eligible time gains higher priority.
Eligible time is the amount (in seconds) of eligible time a job accrues while waiting to run. Jobs with higher eligible time gain higher priority.
Eligible time has the least impact on execution priority. Eligible time is used for sorting jobs with equal queue priority and fair-share priority. It is very, very difficult for eligible time to compete with fair-share priority.
Eligible time can be seen as eligible_time attribute of job.
Eligible time can be seen in the eligible_time attribute of a job.
### Formula
......@@ -56,17 +56,17 @@ Job execution priority (job sort formula) is calculated as:
### Job Backfilling
Anselm cluster uses job backfilling.
The Anselm cluster uses job backfilling.
Backfilling means fitting smaller jobs around the higher-priority jobs that the scheduler is going to run next, in such a way that the higher-priority jobs are not delayed. Backfilling allows us to keep resources from becoming idle when the top job (job with the highest execution priority) cannot run.
Backfilling means fitting smaller jobs around the higher-priority jobs that the scheduler is going to run next, in such a way that the higher-priority jobs are not delayed. Backfilling allows us to keep resources from becoming idle when the top job (the job with the highest execution priority) cannot run.
The scheduler makes a list of jobs to run in order of execution priority. Scheduler looks for smaller jobs that can fit into the usage gaps around the highest-priority jobs in the list. The scheduler looks in the prioritized list of jobs and chooses the highest-priority smaller jobs that fit. Filler jobs are run only if they will not delay the start time of top jobs.
The scheduler makes a list of jobs to run in order of execution priority. The scheduler looks for smaller jobs that can fit into the usage gaps around the highest-priority jobs in the list. The scheduler looks in the prioritized list of jobs and chooses the highest-priority smaller jobs that fit. Filler jobs are run only if they will not delay the start time of top jobs.
It means, that jobs with lower execution priority can be run before jobs with higher execution priority.
This means that jobs with lower execution priority can be run before jobs with higher execution priority.
!!! note
It is **very beneficial to specify the walltime** when submitting jobs.
Specifying more accurate walltime enables better scheduling, better execution times and better resource usage. Jobs with suitable (small) walltime could be backfilled - and overtake job(s) with higher priority.
Specifying more accurate walltime enables better scheduling, better execution times, and better resource usage. Jobs with suitable (small) walltime can be backfilled - and overtake job(s) with a higher priority.
---8<--- "mathjax.md"
# Network
All compute and login nodes of Anselm are interconnected by [InfiniBand](http://en.wikipedia.org/wiki/InfiniBand) QDR network and by Gigabit [Ethernet](http://en.wikipedia.org/wiki/Ethernet) network. Both networks may be used to transfer user data.
All of the compute and login nodes of Anselm are interconnected through an [InfiniBand](http://en.wikipedia.org/wiki/InfiniBand) QDR network and a Gigabit [Ethernet](http://en.wikipedia.org/wiki/Ethernet) network. Both networks may be used to transfer user data.
## InfiniBand Network
All compute and login nodes of Anselm are interconnected by a high-bandwidth, low-latency [InfiniBand](http://en.wikipedia.org/wiki/InfiniBand) QDR network (IB 4 x QDR, 40 Gbps). The network topology is a fully non-blocking fat-tree.
All of the compute and login nodes of Anselm are interconnected through a high-bandwidth, low-latency [InfiniBand](http://en.wikipedia.org/wiki/InfiniBand) QDR network (IB 4 x QDR, 40 Gbps). The network topology is a fully non-blocking fat-tree.
The compute nodes may be accessed via the InfiniBand network using the ib0 network interface, in the address range 10.2.1.1-209. MPI may be used to establish a native InfiniBand connection among the nodes.
!!! note
The network provides **2170 MB/s** transfer rates via the TCP connection (single stream) and up to **3600 MB/s** via native InfiniBand protocol.
The network provides **2170 MB/s** transfer rates via the TCP connection (single stream) and up to **3600 MB/s** via the native InfiniBand protocol.
The Fat tree topology ensures that peak transfer rates are achieved between any two nodes, independent of network traffic exchanged among other nodes concurrently.
......@@ -32,4 +32,4 @@ $ ssh 10.2.1.110
$ ssh 10.1.1.108
```
In this example, we access the node cn110 by InfiniBand network via the ib0 interface, then from cn110 to cn108 by Ethernet network.
In this example, we access the node cn110 through the InfiniBand network via the ib0 interface, then from cn110 to cn108 through the Ethernet network.
# PRACE User Support
## Intro
PRACE users coming to Anselm as a TIER-1 system offered through the DECI calls are in general treated as standard users, so most of the general documentation applies to them as well. This section shows the main differences for quicker orientation, but often refers to the original documentation. PRACE users who don't undergo the full procedure (including signing the IT4I AuP on top of the PRACE AuP) will not have a password and thus will not have access to some services intended for regular users. This is inconvenient, but otherwise they should be able to use the TIER-1 system as intended. Please see the [Obtaining Login Credentials section](../general/obtaining-login-credentials/obtaining-login-credentials/) if the same level of access is required.
All general [PRACE User Documentation](http://www.prace-ri.eu/user-documentation/) should be read before continuing to read the local documentation here.
## Help and Support
If you have any trouble, need information, require support, or want to install additional software, please use the [PRACE Helpdesk](http://www.prace-ri.eu/helpdesk-guide264/).
Information about the local services is provided in the [introduction of the general user documentation](introduction/). Please keep in mind that standard PRACE accounts don't have a password for accessing the web interface of the local (IT4Innovations) request tracker, and thus a new ticket should be created by sending an e-mail to support[at]it4i.cz.
## Obtaining Login Credentials
In general, PRACE users already have a PRACE account set up through their HOMESITE (an institution from their country) as a result of a rewarded PRACE project proposal. This includes a signed PRACE AuP, generated and registered certificates, etc.
If there is a special need, a PRACE user can get a standard (local) account at IT4Innovations. To get an account on the Anselm cluster, the user needs to obtain the login credentials. The procedure is the same as for general users of the cluster, so please see the corresponding section of the general documentation here.
## Accessing the Cluster
### Access With GSI-SSH
For all PRACE users, the method for interactive access (login) and data transfer based on grid services from the Globus Toolkit (GSI SSH and GridFTP) is supported.
The user will need a valid certificate and to be present in the PRACE LDAP (please contact your HOME SITE or the primary investigator of your project for LDAP account creation).
Most of the information needed by PRACE users accessing the Anselm TIER-1 system can be found here:
* [General user's FAQ](http://www.prace-ri.eu/Users-General-FAQs)
* [Certificates FAQ](http://www.prace-ri.eu/Certificates-FAQ)
* [Interactive access using GSISSH](http://www.prace-ri.eu/Interactive-Access-Using-gsissh)
* [Data transfer with GridFTP](http://www.prace-ri.eu/Data-Transfer-with-GridFTP-Details)
* [Data transfer with gtransfer](http://www.prace-ri.eu/Data-Transfer-with-gtransfer)
Before you start to use any of the services, don't forget to create a proxy certificate from your certificate:
```console
$ grid-proxy-init