site/
stages:
  - test
  - build
  - deploy
  - production

docs:
  stage: test
  image: davidhrbac/docker-mdcheck:latest
  allow_failure: true
  script:
    - mdl -r ~MD013 *.md docs.it4i/

mkdocs:
  stage: build
  image: davidhrbac/docker-mkdocscheck:latest
  script:
    #- apt-get update
    #- apt-get -y install git
    - bash add_version.sh
    - bash get_modules.sh
    - mkdocs build
    - bash clean_json.sh site/mkdocs/search_index.json
  artifacts:
    paths:
      - site
    expire_in: 1 week

shellcheck:
  stage: test
  image: davidhrbac/docker-shellcheck:latest
  allow_failure: true
  script:
    - which shellcheck || apt-get update && apt-get install -y shellcheck
    - shellcheck *.sh

deploy to stage:
  environment: stage
  stage: deploy
  image: davidhrbac/docker-mkdocscheck:latest
  before_script:
    # install ssh-agent
    - 'which ssh-agent || ( apt-get update -y && apt-get install openssh-client -y )'
    - 'which rsync || ( apt-get update -y && apt-get install rsync -y )'
    # run ssh-agent
    - eval $(ssh-agent -s)
    # add the ssh key stored in the SSH_PRIVATE_KEY variable to the agent store
    - ssh-add <(echo "$SSH_PRIVATE_KEY")
    # disable host key checking (NOTE: makes you susceptible to man-in-the-middle attacks)
    # WARNING: use only in a docker container; if you use it with shell you will overwrite your user's ssh config
    - mkdir -p ~/.ssh
    - echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config
    - useradd -lM nginx
  script:
    - chown nginx:nginx site -R
    - rsync -a --delete site/ root@"$SSH_HOST_STAGE":/srv/docs.it4i.cz/devel/$CI_BUILD_REF_NAME/

deploy to production:
  environment: production
  stage: deploy
  image: davidhrbac/docker-mkdocscheck:latest
  before_script:
    # install ssh-agent
    - 'which ssh-agent || ( apt-get update -y && apt-get install openssh-client -y )'
    - 'which rsync || ( apt-get update -y && apt-get install rsync -y )'
    # run ssh-agent
    - eval $(ssh-agent -s)
    # add the ssh key stored in the SSH_PRIVATE_KEY variable to the agent store
    - ssh-add <(echo "$SSH_PRIVATE_KEY")
    # disable host key checking (NOTE: makes you susceptible to man-in-the-middle attacks)
    # WARNING: use only in a docker container; if you use it with shell you will overwrite your user's ssh config
    - mkdir -p ~/.ssh
    - echo -e "Host *\n\tStrictHostKeyChecking no\n\n" > ~/.ssh/config
    - useradd -lM nginx
  script:
    - chown nginx:nginx site -R
    - rsync -a --delete site/ root@"$SSH_HOST_STAGE":/srv/docs.it4i.cz/site/
  only:
    - master
  when: manual
# User documentation - TEST
* https://docs-new.it4i.cz - master branch
* https://docs-new.it4i.cz/devel/$BRANCH_NAME - maps the branches
# URLs
* http://facelessuser.github.io/pymdown-extensions/
* http://squidfunk.github.io/mkdocs-material/
#!/bin/bash
# stamp mkdocs.yml with the current git revision and date in place of the __VERSION__ placeholder
VER=$(git log --pretty=format:'/ ver. %h / %ai' -n 1)
sed "s,__VERSION__, $VER," mkdocs.yml -i
#!/bin/bash
# clean up the generated search index: replace non-breaking spaces, collapse escaped newlines
# and strip leading whitespace in the files given as arguments
sed 's/\\u00a0/ /g' -i "$@"
sed 's,\\n\\n, ,g' -i "$@"
sed 's,\\n , ,g' -i "$@"
sed 's, \\n, ,g' -i "$@"
sed -e 's/^[ \t]*//' -i "$@"
#!/usr/bin/python
import csv


def get_data(filename):
    '''Read the data from the input csv file for use in the analysis.'''
    reader = []  # just in case the file open fails
    with open(filename, 'rb') as f:
        reader = csv.reader(f, delimiter=',')
        # return all the data from the csv file in list form
        return list(reader)


# collect module usage records from both clusters
your_list = []
your_list += get_data('modules-anselm.csv')
your_list += get_data('modules-salomon.csv')

# sum the counts per module name
counts = dict()
for i in your_list:
    counts[i[0]] = counts.get(i[0], 0) + int(i[1])

# cluster availability flags indexed by the summed count
c = [
    "---",
    "--A",
    "-S-",
    "-SA",
    "U--",
    "U-A",
    "US-",
    "USA",
]

print "| Module | Versions | Clusters |"
print "| ------ | -------- | -------- |"

# group versions and cluster flags by module name
software = dict()
prev = ''
for m, i in sorted(counts.items()):
    split = m.split('/')
    if len(split) > 1:
        name = split[0]
        version = split[1]
        if name != prev:
            software[name] = {}
        software[name][version] = '`' + c[i] + '`'
        prev = name

# print one markdown table row per module
for m in sorted(software.items(), key=lambda i: i[0].lower()):
    print "| %s | %s | %s |" % (m[0], '</br>'.join(m[1].keys()), '</br>'.join(m[1].values()))
#!/bin/bash
# regenerate one branch per material theme color, each with the primary color set in mkdocs.yml
for i in {"red","pink","purple","deep purple","indigo","blue","light blue","cyan","teal","green","light green","lime","yellow","amber","orange","deep orange","brown","grey","blue grey"}
do
  echo "Setting color to: $i, ${i/ /_}, color_${i/ /_}"
  #git checkout -b color_${i/ /_}
  git checkout color_${i/ /_}
  #rm -fr "/home/hrb33/Dokumenty/dev/it4i/docs.it4i.git/.git/rebase-apply"
  git pull --rebase origin colors
  git pull --rebase
  sed -ri "s/(primary: ').*'/\1$i'/" mkdocs.yml
  #git cherry-pick bd8e358
  git commit -am "Setting color to: $i"
  git push --set-upstream origin color_${i/ /_}
  git checkout colors
done
Capacity computing
==================
Introduction
------------
In many cases, it is useful to submit a huge number (100+) of computational jobs into the PBS queue system. A huge number of (small) jobs is one of the most effective ways to execute embarrassingly parallel calculations, achieving the best runtime, throughput and computer utilization.
However, executing a huge number of jobs via the PBS queue may strain the system. This strain may result in slow responses to commands, inefficient scheduling and overall degradation of performance and user experience for all users. For this reason, the number of jobs is **limited to 100 per user, 1000 per job array**.
!!! Note "Note"
    Please follow one of the procedures below in case you wish to schedule more than 100 jobs at a time.
- Use [Job arrays](capacity-computing/#job-arrays) when running a huge number of [multithread](capacity-computing/#shared-jobscript-on-one-node) (bound to one node only) or multinode (multithread across several nodes) jobs
- Use [GNU parallel](capacity-computing/#gnu-parallel) when running single core jobs
- Combine [GNU parallel with Job arrays](capacity-computing/#combining-job-arrays-and-gnu-parallel) when running a huge number of single core jobs
Policy
------
1. A user is allowed to submit at most 100 jobs. Each job may be [a job array](capacity-computing/#job-arrays).
2. The array size is at most 1000 subjobs.
Job arrays
--------------
!!! Note "Note"
    A huge number of jobs may be easily submitted and managed as a job array.
A job array is a compact representation of many jobs, called subjobs. The subjobs share the same job script, and have the same values for all attributes and resources, with the following exceptions:
- each subjob has a unique index, $PBS_ARRAY_INDEX
- job identifiers of subjobs only differ by their indices
- the state of subjobs can differ (R, Q, etc.)
All subjobs within a job array have the same scheduling priority and schedule as independent jobs. The entire job array is submitted through a single qsub command and may be managed by the qdel, qalter, qhold, qrls and qsig commands as a single job, as shown below.
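For instance, a whole array may be held, released, or have one of its resources altered with a single command (a minimal sketch; the job ID 12345[].dm2 is an illustrative placeholder matching the examples further below):
```bash
$ qhold 12345[].dm2                         # hold all queued subjobs of the array
$ qrls 12345[].dm2                          # release the held subjobs
$ qalter -l walltime=01:00:00 12345[].dm2   # alter a resource of the whole array
```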
### Shared jobscript
All subjobs in a job array use the very same single jobscript. Each subjob runs its own instance of the jobscript. The instances execute different work controlled by the $PBS_ARRAY_INDEX variable.
Example:
Assume we have 900 input files with names beginning with "file" (e.g. file001, ..., file900). Assume we would like to use each of these input files with the program executable myprog.x, each as a separate job.
First, we create a tasklist file (or subjobs list), listing all tasks (subjobs) - all input files in our example:
```bash
$ find . -name 'file*' > tasklist
```
Then we create jobscript:
```bash
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=16,walltime=02:00:00
# change to local scratch directory
SCR=/lscratch/$PBS_JOBID
mkdir -p $SCR ; cd $SCR || exit
# get individual tasks from tasklist with index from PBS JOB ARRAY
TASK=$(sed -n "${PBS_ARRAY_INDEX}p" $PBS_O_WORKDIR/tasklist)
# copy input file and executable to scratch
cp $PBS_O_WORKDIR/$TASK input ; cp $PBS_O_WORKDIR/myprog.x .
# execute the calculation
./myprog.x < input > output
# copy output file to submit directory
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, the submit directory holds the 900 input files, the executable myprog.x and the jobscript file. As input for each run, we take the filename of the input file from the created tasklist file. We copy the input file to the local scratch /lscratch/$PBS_JOBID, execute the myprog.x and copy the output file back to the submit directory, under the $TASK.out name. The myprog.x runs on one node only and must use threads to run in parallel. Be aware that if the myprog.x **is not multithreaded**, then all the **jobs are run as single thread programs in a sequential** manner. Due to the allocation of the whole node, the accounted time is equal to the usage of the **whole node**, while using only 1/16 of the node!
If a huge number of parallel multicore jobs (in the sense of multinode multithread, e.g. MPI enabled jobs) needs to be run, then a job array approach should also be used. The main difference compared to the previous example using one node is that the local scratch must not be used (as it is not shared between nodes) and MPI or another technique for parallel multinode runs has to be used properly.
### Submit the job array
To submit the job array, use the qsub -J command. The 900 jobs of the [example above](capacity-computing/#array_example) may be submitted like this:
```bash
$ qsub -N JOBNAME -J 1-900 jobscript
12345[].dm2
```
In this example, we submit a job array of 900 subjobs. Each subjob will run on a full node and is assumed to take less than 2 hours (please note the #PBS directives at the beginning of the jobscript file, don't forget to set your valid PROJECT_ID and desired queue).
Sometimes for testing purposes, you may need to submit a one-element array only. This is not allowed by PBSPro, but there's a workaround:
```bash
$ qsub -N JOBNAME -J 9-10:2 jobscript
```
This will only choose the lower index (9 in this example) for submitting/running your job.
### Manage the job array
Check status of the job array by the qstat command.
```bash
$ qstat -a 12345[].dm2
dm2:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
12345[].dm2     user2    qprod    xx          13516   1  16    --  00:50 B 00:02
```
The status B means that some subjobs are already running.
Check status of the first 100 subjobs by the qstat command.
```bash
$ qstat -a 12345[1-100].dm2
dm2:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
12345[1].dm2    user2    qprod    xx          13516   1  16    --  00:50 R 00:02
12345[2].dm2    user2    qprod    xx          13516   1  16    --  00:50 R 00:02
12345[3].dm2    user2    qprod    xx          13516   1  16    --  00:50 R 00:01
12345[4].dm2    user2    qprod    xx          13516   1  16    --  00:50 Q   --
     .             .        .       .           .     .   .     .    .   .   .
     .             .        .       .           .     .   .     .    .   .   .
12345[100].dm2  user2    qprod    xx          13516   1  16    --  00:50 Q   --
```
Delete the entire job array. Running subjobs will be killed, queued subjobs will be deleted.
```bash
$ qdel 12345[].dm2
```
Deleting large job arrays may take a while.
Display status information for all user's jobs, job arrays, and subjobs.
```bash
$ qstat -u $USER -t
```
Display status information for all user's subjobs.
```bash
$ qstat -u $USER -tJ
```
Read more on job arrays in the [PBSPro Users guide](../../pbspro-documentation/sitemap/).
GNU parallel
----------------
!!! Note "Note"
    Use GNU parallel to run many single core tasks on one node.
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. GNU parallel is most useful when running single core jobs via the queue system on Anselm.
For more information and examples see the parallel man page:
```bash
$ module add parallel
$ man parallel
```
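As a quick illustration of what GNU parallel does on its own (a minimal sketch run outside the queue system; gzip and the file name pattern are just assumed for the example):
```bash
$ module add parallel
# compress each matching file, running one gzip per core in parallel
$ find . -name 'file*' | parallel gzip
```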
### GNU parallel jobscript
The GNU parallel shell executes multiple instances of the jobscript using all cores on the node. The instances execute different work, controlled by the $PARALLEL_SEQ variable.
Example:
Assume we have 101 input files with names beginning with "file" (e.g. file001, ..., file101). Assume we would like to use each of these input files with the program executable myprog.x, each as a separate single core job. We call these single core jobs tasks.
First, we create a tasklist file, listing all tasks - all input files in our example:
```bash
$ find . -name 'file*' > tasklist
```
Then we create jobscript:
```bash
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=16,walltime=02:00:00
[ -z "$PARALLEL_SEQ" ] &&
{ module add parallel ; exec parallel -a $PBS_O_WORKDIR/tasklist $0 ; }
# change to local scratch directory
SCR=/lscratch/$PBS_JOBID/$PARALLEL_SEQ
mkdir -p $SCR ; cd $SCR || exit
# get individual task from tasklist
TASK=$1
# copy input file and executable to scratch
cp $PBS_O_WORKDIR/$TASK input
# execute the calculation
cat input > output
# copy output file to submit directory
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, tasks from the tasklist are executed via GNU parallel. The jobscript executes multiple instances of itself in parallel, on all cores of the node. Once an instance of the jobscript is finished, a new instance starts until all entries in the tasklist are processed. The currently processed entry of the tasklist may be retrieved via the $1 variable. The variable $TASK expands to one of the input filenames from the tasklist. We copy the input file to local scratch, execute the calculation and copy the output file back to the submit directory, under the $TASK.out name.
### Submit the job
To submit the job, use the qsub command. The job of 101 tasks from the [example above](capacity-computing/#gp_example) may be submitted as follows:
```bash
$ qsub -N JOBNAME jobscript
12345.dm2
```
In this example, we submit a job of 101 tasks. 16 input files will be processed in parallel. The 101 tasks on 16 cores are assumed to complete in less than 2 hours.
Please note the #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue.
Job arrays and GNU parallel
-------------------------------
!!! Note "Note"
    Combine the job arrays and GNU parallel for the best throughput of single core jobs.
While job arrays are able to utilize all available computational nodes, GNU parallel can be used to efficiently run multiple single-core jobs on a single node. The two approaches may be combined to utilize all available (current and future) resources to execute single core jobs.
!!! Note "Note"
    Every subjob in an array runs GNU parallel to utilize all cores on the node.
### GNU parallel, shared jobscript
A combined approach, very similar to job arrays, can be taken. A job array is submitted to the queuing system. The subjobs run GNU parallel. GNU parallel executes multiple instances of the jobscript using all cores on the node. The instances execute different work, controlled by the $PBS_ARRAY_INDEX and $PARALLEL_SEQ variables.
Example:
Assume we have 992 input files with names beginning with "file" (e.g. file001, ..., file992). Assume we would like to use each of these input files with the program executable myprog.x, each as a separate single core job. We call these single core jobs tasks.
First, we create a tasklist file, listing all tasks - all input files in our example:
```bash
$ find . -name 'file*' > tasklist
```
Next we create a file controlling how many tasks will be executed in one subjob:
```bash
$ seq 32 > numtasks
```
Then we create jobscript:
```bash
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=16,walltime=02:00:00
[ -z "$PARALLEL_SEQ" ] &&
{ module add parallel ; exec parallel -a $PBS_O_WORKDIR/numtasks $0 ; }
# change to local scratch directory
SCR=/lscratch/$PBS_JOBID/$PARALLEL_SEQ
mkdir -p $SCR ; cd $SCR || exit
# get individual task from tasklist with index from PBS JOB ARRAY and index form Parallel
IDX=$(($PBS_ARRAY_INDEX + $PARALLEL_SEQ - 1))
TASK=$(sed -n "${IDX}p" $PBS_O_WORKDIR/tasklist)
[ -z "$TASK" ] && exit
# copy input file and executable to scratch
cp $PBS_O_WORKDIR/$TASK input
# execute the calculation
cat input > output
# copy output file to submit directory
cp output $PBS_O_WORKDIR/$TASK.out
```
In this example, the jobscript executes in multiple instances in parallel, on all cores of a computing node. The variable $TASK expands to one of the input filenames from the tasklist. We copy the input file to local scratch, execute the calculation and copy the output file back to the submit directory, under the $TASK.out name. The numtasks file controls how many tasks will be run per subjob. Once a task is finished, a new task starts, until the number of tasks in the numtasks file is reached.
!!! Note "Note"
    Select the subjob walltime and the number of tasks per subjob carefully.
When deciding on these values, keep in mind the following guiding rules (a worked example follows the list):
1. Let n = N/16. The inequality (n+1) * T < W should hold. N is the number of tasks per subjob, T is the expected single task walltime and W is the subjob walltime. A short subjob walltime improves scheduling and job throughput.
2. The number of tasks should be a multiple of 16.
3. These rules are valid only when all tasks have similar task walltimes T.
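For illustration only (all numbers are assumed), the check from rule 1 can be done with simple shell arithmetic:
```bash
# assumed values: N=32 tasks per subjob, T=30 minutes per task, W=2 hours subjob walltime
N=32; T=30; W=120                 # T and W in minutes
n=$(( N / 16 ))                   # tasks processed per core
echo "need $(( (n + 1) * T )) min, walltime is $W min"   # prints: need 90 min, walltime is 120 min
```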
### Submit the job array
To submit the job array, use the qsub -J command. The job of 992 tasks from the [example above](capacity-computing/#combined_example) may be submitted as follows:
```bash
$ qsub -N JOBNAME -J 1-992:32 jobscript
12345[].dm2
```
In this example, we submit a job array of 31 subjobs. Note the -J 1-992:**32**; this step must be the same as the number written to the numtasks file. Each subjob will run on a full node and process 16 input files in parallel, 32 in total per subjob. Every subjob is assumed to complete in less than 2 hours.
Please note the #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue.
Examples
--------
Download the examples in [capacity.zip](capacity.zip), illustrating the above listed ways to run a huge number of jobs. We recommend trying out the examples before using this approach for running production jobs.
Unzip the archive in an empty directory on Anselm and follow the instructions in the README file.
```bash
$ unzip capacity.zip
$ cat README
```
Compute Nodes
=============
Nodes Configuration
-------------------
Anselm is a cluster of x86-64 Intel based nodes built on the Bull Extreme Computing bullx technology. The cluster contains four types of compute nodes.
### Compute Nodes Without Accelerator
- 180 nodes
- 2880 cores in total
- two Intel Sandy Bridge E5-2665, 8-core, 2.4GHz processors per node
- 64 GB of physical memory per node
- one 500 GB SATA 2.5” 7200 RPM HDD per node
- bullx B510 blade servers
- cn[1-180]
### Compute Nodes With GPU Accelerator
- 23 nodes
- 368 cores in total
- two Intel Sandy Bridge E5-2470, 8-core, 2.3GHz processors per node
- 96 GB of physical memory per node
- one 500 GB SATA 2.5” 7200 RPM HDD per node
- GPU accelerator 1x NVIDIA Tesla Kepler K20 per node
- bullx B515 blade servers
- cn[181-203]
### Compute Nodes With MIC Accelerator
- 4 nodes
- 64 cores in total
- two Intel Sandy Bridge E5-2470, 8-core, 2.3GHz processors per node
- 96 GB of physical memory per node
- one 500 GB SATA 2.5” 7200 RPM HDD per node
- MIC accelerator 1x Intel Xeon Phi 5110P per node
- bullx B515 blade servers
- cn[204-207]
### Fat Compute Nodes
- 2 nodes
- 32 cores in total
- two Intel Sandy Bridge E5-2665, 8-core, 2.4GHz processors per node
- 512 GB of physical memory per node
- two 300 GB SAS 3.5” 15000 RPM HDDs (RAID1) per node
- two 100GB SLC SSD per node
- bullx R423-E3 servers
- cn[208-209]
![](../img/bullxB510.png)
**Figure Anselm bullx B510 servers**
### Compute Nodes Summary
|Node type|Count|Range|Memory|Cores|[Access](resources-allocation-policy/)|
|---|---|---|---|---|---|
|Nodes without accelerator|180|cn[1-180]|64 GB|16 @ 2.4 GHz|qexp, qprod, qlong, qfree|
|Nodes with GPU accelerator|23|cn[181-203]|96 GB|16 @ 2.3 GHz|qgpu, qprod|
|Nodes with MIC accelerator|4|cn[204-207]|96 GB|16 @ 2.3 GHz|qmic, qprod|
|Fat compute nodes|2|cn[208-209]|512 GB|16 @ 2.4 GHz|qfat, qprod|
Processor Architecture
----------------------
Anselm is equipped with Intel Sandy Bridge processors: Intel Xeon E5-2665 (nodes without accelerator and fat nodes) and Intel Xeon E5-2470 (nodes with accelerator). The processors support the Advanced Vector Extensions (AVX) 256-bit instruction set.
### Intel Sandy Bridge E5-2665 Processor
- eight-core
- speed: 2.4 GHz, up to 3.1 GHz using Turbo Boost Technology
- peak performance: 19.2 Gflop/s per core
- caches:
  - L2: 256 KB per core
  - L3: 20 MB per processor
- memory bandwidth at the level of the processor: 51.2 GB/s
### Intel Sandy Bridge E5-2470 Processor
- eight-core
- speed: 2.3 GHz, up to 3.1 GHz using Turbo Boost Technology
- peak performance: 18.4 Gflop/s per core
- caches:
  - L2: 256 KB per core
  - L3: 20 MB per processor
- memory bandwidth at the level of the processor: 38.4 GB/s
Nodes equipped with the Intel Xeon E5-2665 CPU have the PBS resource attribute cpu_freq = 24 set, nodes equipped with the Intel Xeon E5-2470 CPU have the PBS resource attribute cpu_freq = 23 set.
```bash
$ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=16:cpu_freq=24 -I
```
In this example, we allocate 4 nodes, 16 cores at 2.4 GHz per node.
Intel Turbo Boost Technology is used by default; you can disable it for all nodes of a job by using the resource attribute cpu_turbo_boost:
```bash
$ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=16 -l cpu_turbo_boost=0 -I
```
Memory Architecture
-------------------
### Compute Node Without Accelerator
- 2 sockets
- Memory Controllers are integrated into processors.
  - 8 DDR3 DIMMs per node
  - 4 DDR3 DIMMs per CPU
  - 1 DDR3 DIMM per channel
  - Data rate support: up to 1600 MT/s
- Populated memory: 8 x 8 GB DDR3 DIMM 1600 MHz
### Compute Node With GPU or MIC Accelerator
- 2 sockets
- Memory Controllers are integrated into processors.
  - 6 DDR3 DIMMs per node
  - 3 DDR3 DIMMs per CPU
  - 1 DDR3 DIMM per channel
  - Data rate support: up to 1600 MT/s
- Populated memory: 6 x 16 GB DDR3 DIMM 1600 MHz
### Fat Compute Node
- 2 sockets
- Memory Controllers are integrated into processors.
  - 16 DDR3 DIMMs per node
  - 8 DDR3 DIMMs per CPU
  - 2 DDR3 DIMMs per channel
  - Data rate support: up to 1600 MT/s
- Populated memory: 16 x 32 GB DDR3 DIMM 1600 MHz
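One way to verify the memory layout a job actually sees on an allocated node (a minimal sketch, assuming the standard Linux numactl tool is available on the compute nodes):
```bash
$ numactl --hardware   # shows the two NUMA nodes (one per socket) and the memory attached to each
$ free -g              # total and available memory in GB
```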
Environment and Modules
=======================
### Environment Customization
After logging in, you may want to configure the environment. Write your preferred path definitions, aliases, functions and module loads in the .bashrc file.
```bash
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi
# User specific aliases and functions
alias qs='qstat -a'
module load PrgEnv-gnu
# Display information to standard output - only in interactive ssh session
if [ -n "$SSH_TTY" ]
then
module list # Display loaded modules
fi
```
!!! Note "Note"
    Do not run commands outputting to standard output (echo, module list, etc.) in .bashrc for non-interactive SSH sessions. It breaks the fundamental functionality (scp, PBS) of your account! Guard such commands with a test for SSH session interactivity, as shown in the example above.
### Application Modules
In order to configure your shell for running a particular application on Anselm, we use the Module package interface.
!!! Note "Note"
    The modules set up the application paths, library paths and environment variables needed for running a particular application.
We also have a second modules repository, created using a tool called EasyBuild. On the Salomon cluster, all modules will be built by this tool. If you want to use software from this modules repository, please follow the instructions in section [Application Modules Path Expansion](environment-and-modules/#EasyBuild).
The modules may be loaded, unloaded and switched, according to momentary needs.
To check available modules use
```bash
$ module avail
```
To load a module, for example the octave module use
```bash
$ module load octave
```
Loading the octave module will set up the paths and environment variables of your active shell so that you are ready to run the octave software.
To check loaded modules use
```bash
$ module list
```
To unload a module, for example the octave module use
```bash
$ module unload octave
```
Learn more about modules by reading the module man page:
```bash
$ man module
```
The following modules set up the development environment:
PrgEnv-gnu sets up the GNU development environment in conjunction with the bullx MPI library.
PrgEnv-intel sets up the INTEL development environment in conjunction with the Intel MPI library.
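A minimal sanity check after loading one of these environments (just a sketch; the exact modules and compiler version shown depend on the loaded environment):
```bash
$ module load PrgEnv-gnu
$ module list            # verify which modules the environment pulled in
$ gcc --version          # the GNU compiler made available by the environment
```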
Examples of how to use modules:
<tty-player controls src=/src/anselm/modules_anselm.ttyrec></tty-player>
### Application Modules Path Expansion
All application modules on the Salomon cluster (and on future clusters) will be built using a tool called [EasyBuild](http://hpcugent.github.io/easybuild/ "EasyBuild"). In case you want to use some applications that are already built by EasyBuild, you have to modify your MODULEPATH environment variable.
```bash
export MODULEPATH=$MODULEPATH:/apps/easybuild/modules/all/
```
This command expands the paths searched for modules. You can also add this command to the .bashrc file to expand the paths permanently. After this command, you can use the same commands to list/add/remove modules as described above.
Hardware Overview
=================
The Anselm cluster consists of 209 computational nodes named cn[1-209], of which 180 are regular compute nodes, 23 are GPU Kepler K20 accelerated nodes, 4 are MIC Xeon Phi 5110 accelerated nodes and 2 are fat nodes. Each node is a powerful x86-64 computer, equipped with 16 cores (two eight-core Intel Sandy Bridge processors), at least 64 GB of RAM, and a local hard drive. User access to the Anselm cluster is provided by two login nodes, login[1,2]. The nodes are interlinked by high speed InfiniBand and Ethernet networks. All nodes share the 320 TB /home disk storage for storing user files. The 146 TB shared /scratch storage is available for scratch data.
The fat nodes are equipped with a large amount (512 GB) of memory. A virtualization infrastructure provides resources to run long term servers and services in virtual mode. The fat nodes and virtual servers may access 45 TB of dedicated block storage. Accelerated nodes, fat nodes, and the virtualization infrastructure are available [upon request](https://support.it4i.cz/rt) made by a PI.
Schematic representation of the Anselm cluster. Each box represents a node (computer) or storage capacity:
![](../img/Anselm-Schematic-Representation.png)
The cluster compute nodes cn[1-207] are organized within 13 chassis.
There are four types of compute nodes:
- 180 compute nodes without the accelerator
- 23 compute nodes with GPU accelerator - equipped with NVIDIA Tesla Kepler K20
- 4 compute nodes with MIC accelerator - equipped with Intel Xeon Phi 5110P
- 2 fat nodes - equipped with 512GB RAM and two 100GB SSD drives
[More about Compute nodes](compute-nodes/).
GPU and accelerated nodes are available upon request, see the [Resources Allocation Policy](resources-allocation-policy/).
All these nodes are interconnected by fast InfiniBand network and Ethernet network. [More about the Network](network/).
Every chassis provides an InfiniBand switch, marked **isw**, connecting all nodes in the chassis, as well as connecting the chassis to the upper level switches.
All nodes share the 320 TB /home disk storage for storing user files. The 146 TB shared /scratch storage is available for scratch data. These file systems are provided by the Lustre parallel file system. There is also local disk storage /lscratch available on all compute nodes. [More about Storage](storage/).
User access to the Anselm cluster is provided by two login nodes, login1 and login2, and the data mover node dm1. [More about accessing the cluster.](shell-and-data-access/)
The parameters are summarized in the following tables:
|**In general**||
|---|---|
|Primary purpose|High Performance Computing|
|Architecture of compute nodes|x86-64|
|Operating system|Linux|
|[**Compute nodes**](compute-nodes/)||
|Total|209|
|Processor cores|16 (2x8 cores)|
|RAM|min. 64 GB, min. 4 GB per core|
|Local disk drive|yes - usually 500 GB|
|Compute network|InfiniBand QDR, fully non-blocking, fat-tree|
|w/o accelerator|180, cn[1-180]|
|GPU accelerated|23, cn[181-203]|
|MIC accelerated|4, cn[204-207]|
|Fat compute nodes|2, cn[208-209]|
|**In total**||
|Total theoretical peak performance (Rpeak)|94 Tflop/s|
|Total max. LINPACK performance (Rmax)|73 Tflop/s|
|Total amount of RAM|15.136 TB|
|Node|Processor|Memory|Accelerator|
|---|---|---|---|
|w/o accelerator|2x Intel Sandy Bridge E5-2665, 2.4GHz|64GB|-|
|GPU accelerated|2x Intel Sandy Bridge E5-2470, 2.3GHz|96GB|NVIDIA Kepler K20|
|MIC accelerated|2x Intel Sandy Bridge E5-2470, 2.3GHz|96GB|Intel Xeon Phi 5110P|
|Fat compute node|2x Intel Sandy Bridge E5-2665, 2.4GHz|512GB|-|
For more details please refer to the [Compute nodes](compute-nodes/), [Storage](storage/), and [Network](network/).
Introduction
============
Welcome to the Anselm supercomputer cluster. The Anselm cluster consists of 209 compute nodes, totaling 3344 compute cores with 15 TB RAM, giving over 94 Tflop/s theoretical peak performance. Each node is a powerful x86-64 computer, equipped with 16 cores, at least 64 GB of RAM, and a 500 GB hard drive. The nodes are interconnected through a fully non-blocking fat-tree InfiniBand network and equipped with Intel Sandy Bridge processors. A few nodes are also equipped with NVIDIA Kepler GPU or Intel Xeon Phi MIC accelerators. Read more in [Hardware Overview](hardware-overview/).
The cluster runs the bullx Linux ([bull](http://www.bull.com/bullx-logiciels/systeme-exploitation.html)) [operating system](software/operating-system/), which is compatible with the RedHat [Linux family](http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg). We have installed a wide range of software packages targeted at different scientific domains. These packages are accessible via the [modules environment](environment-and-modules/).
A user data shared file system (HOME, 320 TB) and a job data shared file system (SCRATCH, 146 TB) are available to users.
The PBS Professional workload manager provides [computing resources allocations and job execution](resources-allocation-policy/).
Read more on how to [apply for resources](../get-started-with-it4innovations/applying-for-resources/), [obtain login credentials,](../get-started-with-it4innovations/obtaining-login-credentials/obtaining-login-credentials/) and [access the cluster](shell-and-data-access/).
Job scheduling
==============
Job execution priority
----------------------
The scheduler gives each job an execution priority and then uses this job execution priority to select which job(s) to run.
Job execution priority on Anselm is determined by these job properties (in order of importance):
1. queue priority
2. fairshare priority
3. eligible time
### Queue priority
Queue priority is the priority of the queue in which the job is queued before execution.
Queue priority has the biggest impact on job execution priority. The execution priority of jobs in higher priority queues is always greater than the execution priority of jobs in lower priority queues. Other properties of a job used for determining its execution priority (fairshare priority, eligible time) cannot compete with queue priority.
Queue priorities can be seen at <https://extranet.it4i.cz/anselm/queues>.
### Fairshare priority
Fairshare priority is a priority calculated from recent usage of resources. It is calculated per project; all members of a project share the same fairshare priority. Projects with higher recent usage have a lower fairshare priority than projects with lower or no recent usage.
Fairshare priority is used for ranking jobs with equal queue priority.
Fairshare priority is calculated as
![](../../img/fairshare_formula.png)
where MAX_FAIRSHARE has the value 1E6, usage~Project~ is the cumulated usage by all members of the selected project, and usage~Total~ is the total usage by all users, over all projects.
Usage counts allocated core-hours (ncpus x walltime). Usage is decayed, or cut in half periodically, at the interval of 168 hours (one week). Jobs queued in the queue qexp are not counted toward the project's usage.
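As an illustration only (the numbers are assumed, not taken from the scheduler): usage accrued one full decay interval ago contributes half of its original value, two intervals ago a quarter, and so on:
```bash
# assumed example: 1000 core-hours accrued; each elapsed 168-hour interval halves the contribution
usage=1000
intervals=2
echo $(( usage / (1 << intervals) ))   # prints 250, the core-hours still counted toward fairshare
```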
Calculated usage and fairshare priority can be seen at <https://extranet.it4i.cz/anselm/projects>.
The calculated fairshare priority can also be seen as the Resource_List.fairshare attribute of a job.
### Eligible time
Eligible time is the amount (in seconds) of eligible time a job accrues while waiting to run. Jobs with higher eligible time gain higher priority.
Eligible time has the least impact on execution priority. Eligible time is used for sorting jobs with equal queue priority and fairshare priority. It is very, very difficult for eligible time to compete with fairshare priority.
Eligible time can be seen as the eligible_time attribute of a job, for example as shown below.
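A quick way to inspect both the fairshare and eligible time attributes of a queued job (a minimal sketch; the job ID is a placeholder):
```bash
$ qstat -f 12345.dm2 | grep -E 'fairshare|eligible_time'
```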
### Formula
Job execution priority (job sort formula) is calculated as:
![](../../img/job_sort_formula.png)
### Job backfilling
The Anselm cluster uses job backfilling.
Backfilling means fitting smaller jobs around the higher-priority jobs that the scheduler is going to run next, in such a way that the higher-priority jobs are not delayed. Backfilling allows us to keep resources from becoming idle when the top job (job with the highest execution priority) cannot run.
The scheduler makes a list of jobs to run in order of execution priority. Scheduler looks for smaller jobs that can fit into the usage gaps around the highest-priority jobs in the list. The scheduler looks in the prioritized list of jobs and chooses the highest-priority smaller jobs that fit. Filler jobs are run only if they will not delay the start time of top jobs.
This means that jobs with lower execution priority can be run before jobs with higher execution priority.
!!! Note "Note"
    It is **very beneficial to specify the walltime** when submitting jobs.
Specifying a more accurate walltime enables better scheduling, better execution times and better resource usage. Jobs with a suitable (small) walltime can be backfilled - and overtake job(s) with higher priority.
Job submission and execution
============================
Job Submission
--------------
When allocating computational resources for the job, please specify:
1. a suitable queue for your job (the default is qprod)
2. the number of computational nodes required
3. the number of cores per node required
4. the maximum wall time allocated to your calculation; note that jobs exceeding the maximum wall time will be killed
5. your Project ID
6. a jobscript or interactive switch
!!! Note "Note"
    Use the **qsub** command to submit your job to a queue for allocation of the computational resources.
Submit the job using the qsub command:
```bash
$ qsub -A Project_ID -q queue -l select=x:ncpus=y,walltime=[[hh:]mm:]ss[.ms] jobscript
```
The qsub command submits the job into the queue, in other words it creates a request to the PBS Job manager for allocation of the specified resources. The resources will be allocated when available, subject to the above described policies and constraints. **After the resources are allocated, the jobscript or interactive shell is executed on the first of the allocated nodes.**
### Job Submission Examples
```bash
$ qsub -A OPEN-0-0 -q qprod -l select=64:ncpus=16,walltime=03:00:00 ./myjob
```
In this example, we allocate 64 nodes, 16 cores per node, for 3 hours. We allocate these resources via the qprod queue, consumed resources will be accounted to the Project identified by Project ID OPEN-0-0. Jobscript myjob will be executed on the first node in the allocation.
```bash
$ qsub -q qexp -l select=4:ncpus=16 -I
```
In this example, we allocate 4 nodes, 16 cores per node, for 1 hour. We allocate these resources via the qexp queue. The resources will be available interactively.
```bash
$ qsub -A OPEN-0-0 -q qnvidia -l select=10:ncpus=16 ./myjob
```
In this example, we allocate 10 NVIDIA accelerated nodes, 16 cores per node, for 24 hours. We allocate these resources via the qnvidia queue. The jobscript myjob will be executed on the first node in the allocation.
```bash
$ qsub -A OPEN-0-0 -q qfree -l select=10:ncpus=16 ./myjob
```
In this example, we allocate 10 nodes, 16 cores per node, for 12 hours. We allocate these resources via the qfree queue. It is not required that the project OPEN-0-0 has any available resources left. Consumed resources are still accounted for. Jobscript myjob will be executed on the first node in the allocation.
All qsub options may be [saved directly into the jobscript](job-submission-and-execution/#PBSsaved). In such a case, no options to qsub are needed.
```bash
$ qsub ./myjob
```
By default, the PBS batch system sends an e-mail only when the job is aborted. Disabling mail events completely can be done like this:
```bash
$ qsub -m n
```
Advanced job placement
----------------------
### Placement by name
Specific nodes may be allocated via PBS:
```bash
qsub -A OPEN-0-0 -q qprod -l select=1:ncpus=16:host=cn171+1:ncpus=16:host=cn172 -I
```
In this example, we allocate nodes cn171 and cn172, all 16 cores per node, for 24 hours. Consumed resources will be accounted to the Project identified by Project ID OPEN-0-0. The resources will be available interactively.
### Placement by CPU type
Nodes equipped with the Intel Xeon E5-2665 CPU have a base clock frequency of 2.4 GHz, nodes equipped with the Intel Xeon E5-2470 CPU have a base frequency of 2.3 GHz (see the Compute Nodes section for details). Nodes may be selected via the PBS resource attribute cpu_freq.
|CPU Type|base freq.|Nodes|cpu_freq attribute|
|---|---|---|---|
|Intel Xeon E5-2665|2.4GHz|cn[1-180], cn[208-209]|24|
|Intel Xeon E5-2470|2.3GHz|cn[181-207]|23|
```bash
$ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=16:cpu_freq=24 -I
```
In this example, we allocate 4 nodes, 16 cores per node, selecting only the nodes with the Intel Xeon E5-2665 CPU.
### Placement by IB switch
Groups of computational nodes are connected to chassis integrated InfiniBand switches. These switches form the leaf switch layer of the [InfiniBand network](../network/) fat tree topology. Nodes sharing the same leaf switch can communicate most efficiently. Sharing the same switch prevents hops in the network and provides for unbiased, most efficient network communication.
Nodes sharing the same switch may be selected via the PBS resource attribute ibswitch. Values of this attribute are iswXX, where XX is the switch number. The node-switch mapping can be seen in the [Hardware Overview](../hardware-overview/) section.
We recommend allocating compute nodes from a single switch when the best possible computational network performance is required to run the job efficiently:
```bash
qsub -A OPEN-0-0 -q qprod -l select=18:ncpus=16:ibswitch=isw11 ./myjob
```
In this example, we request all 18 nodes sharing the isw11 switch for 24 hours. A full chassis will be allocated.
Advanced job handling
---------------------
### Selecting Turbo Boost off
Intel Turbo Boost Technology is on by default. We strongly recommend keeping the default.
If necessary (such as in the case of benchmarking), you can disable Turbo for all nodes of the job by using the PBS resource attribute cpu_turbo_boost:
```bash
$ qsub -A OPEN-0-0 -q qprod -l select=4:ncpus=16 -l cpu_turbo_boost=0 -I
```
More about Intel Turbo Boost can be found in the TurboBoost section.
### Advanced examples
In the following example, we select an allocation for benchmarking a very special and demanding MPI program. We request Turbo off, 2 full chassis of compute nodes (nodes sharing the same IB switches) for 30 minutes:
```bash
$ qsub -A OPEN-0-0 -q qprod
-l select=18:ncpus=16:ibswitch=isw10:mpiprocs=1:ompthreads=16+18:ncpus=16:ibswitch=isw20:mpiprocs=16:ompthreads=1
-l cpu_turbo_boost=0,walltime=00:30:00
-N Benchmark ./mybenchmark
```
The MPI processes will be distributed differently on the nodes connected to the two switches. On the isw10 nodes, we will run 1 MPI process per node with 16 threads per process; on the isw20 nodes we will run 16 plain MPI processes.
Although this example is somewhat artificial, it demonstrates the flexibility of the qsub command options.
Job Management
--------------
!!! Note "Note"
    Check the status of your jobs using the **qstat** and **check-pbs-jobs** commands.
```bash
$ qstat -a
$ qstat -a -u username
$ qstat -an -u username
$ qstat -f 12345.srv11
```
Example:
```bash
$ qstat -a
srv11:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
16287.srv11     user1    qlong    job1         6183   4  64    --  144:0 R 38:25
16468.srv11     user1    qlong    job2         8060   4  64    --  144:0 R 17:44
16547.srv11     user2    qprod    job3x       13516   2  32    --  48:00 R 00:58
```
In this example, user1 and user2 are running jobs named job1, job2 and job3x. The jobs job1 and job2 use 4 nodes, 16 cores per node each. The job1 has already run for 38 hours and 25 minutes, job2 for 17 hours 44 minutes. The job1 has already consumed 64*38.41 = 2458.6 core-hours. The job3x has already consumed 0.96*32 = 30.93 core-hours. These consumed core-hours will be accounted for on the respective project accounts, regardless of whether the allocated cores were actually used for computations.
Check the status of your jobs using the check-pbs-jobs command. It checks the presence of the user's PBS jobs' processes on the execution hosts, displays load and processes, displays the job's standard and error output, and can continuously display (tail -f) the job's standard or error output.
```bash
$ check-pbs-jobs --check-all
$ check-pbs-jobs --print-load --print-processes
$ check-pbs-jobs --print-job-out --print-job-err
$ check-pbs-jobs --jobid JOBID --check-all --print-all
$ check-pbs-jobs --jobid JOBID --tailf-job-out
```
Examples:
```bash
$ check-pbs-jobs --check-all
JOB 35141.dm2, session_id 71995, user user2, nodes cn164,cn165
Check session id: OK
Check processes
cn164: OK
cn165: No process
```
In this example we see that the job 35141.dm2 currently runs no process on the allocated node cn165, which may indicate an execution error.
```bash
$ check-pbs-jobs --print-load --print-processes
JOB 35141.dm2, session_id 71995, user user2, nodes cn164,cn165
Print load
cn164: LOAD: 16.01, 16.01, 16.00
cn165: LOAD: 0.01, 0.00, 0.01
Print processes
%CPU CMD
cn164: 0.0 -bash
cn164: 0.0 /bin/bash /var/spool/PBS/mom_priv/jobs/35141.dm2.SC
cn164: 99.7 run-task
...
```
In this example we see that the job 35141.dm2 currently runs the process run-task on node cn164, using one thread only, while node cn165 is empty, which may indicate an execution error.
```bash
$ check-pbs-jobs --jobid 35141.dm2 --print-job-out
JOB 35141.dm2, session_id 71995, user user2, nodes cn164,cn165
Print job standard output:
======================== Job start ==========================
Started at : Fri Aug 30 02:47:53 CEST 2013
Script name : script
Run loop 1
Run loop 2
Run loop 3
```
In this example, we see the actual output (some iteration loops) of the job 35141.dm2.
!!! Note "Note"
    Manage your queued or running jobs using the **qhold**, **qrls**, **qdel**, **qsig** or **qalter** commands.
You may release your allocation at any time using the qdel command:
```bash
$ qdel 12345.srv11
```
You may kill a running job by force using the qsig command:
```bash
$ qsig -s 9 12345.srv11
```
Learn more by reading the pbs man page:
```bash
$ man pbs_professional
```
Job Execution
-------------
### Jobscript
!!! Note "Note"
    Prepare the jobscript to run batch jobs in the PBS queue system.
The jobscript is a user-made script controlling a sequence of commands for executing the calculation. It is often written in bash, though other scripting languages may be used as well. The jobscript is supplied to the PBS **qsub** command as an argument and executed by the PBS Professional workload manager.
!!! Note "Note"
    The jobscript or interactive shell is executed on the first of the allocated nodes.
```bash
$ qsub -q qexp -l select=4:ncpus=16 -N Name0 ./myjob
$ qstat -n -u username
srv11:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
15209.srv11     username qexp     Name0        5530   4  64    --  01:00 R 00:00
   cn17/0*16+cn108/0*16+cn109/0*16+cn110/0*16
```
In this example, the nodes cn17, cn108, cn109 and cn110 were allocated for 1 hour via the qexp queue. The jobscript myjob will be executed on the node cn17, while the nodes cn108, cn109 and cn110 are available for use as well.
The jobscript or interactive shell is by default executed in the home directory:
```bash
$ qsub -q qexp -l select=4:ncpus=16 -I
qsub: waiting for job 15210.srv11 to start
qsub: job 15210.srv11 ready
$ pwd
/home/username
```
In this example, 4 nodes were allocated interactively for 1 hour via the qexp queue. The interactive shell is executed in the home directory.
!!! Note "Note"
    All nodes within the allocation may be accessed via ssh. Unallocated nodes are not accessible to the user.
The allocated nodes are accessible via ssh from the login nodes. The nodes may access each other via ssh as well.
Calculations on the allocated nodes may be executed remotely via MPI, ssh, pdsh or clush. You may find out which nodes belong to the allocation by reading the $PBS_NODEFILE file:
```bash
qsub -q qexp -l select=4:ncpus=16 -I
qsub: waiting for job 15210.srv11 to start
qsub: job 15210.srv11 ready
$ pwd
/home/username
$ sort -u $PBS_NODEFILE
cn17.bullx
cn108.bullx
cn109.bullx
cn110.bullx
$ pdsh -w cn17,cn[108-110] hostname
cn17: cn17
cn108: cn108
cn109: cn109
cn110: cn110
```
In this example, the hostname program is executed via pdsh from the interactive shell. The execution runs on all four allocated nodes. The same result would be achieved if pdsh were called from any of the allocated nodes or from the login nodes.
### Example Jobscript for MPI Calculation
!!! Note "Note"
    Production jobs must use the /scratch directory for I/O.
The recommended way to run production jobs is to change to the /scratch directory early in the jobscript, copy all inputs to /scratch, execute the calculations and copy the outputs to the home directory.
```bash
#!/bin/bash
# change to scratch directory, exit on failure
SCRDIR=/scratch/$USER/myjob
mkdir -p $SCRDIR
cd $SCRDIR || exit
# copy input file to scratch
cp $PBS_O_WORKDIR/input .
cp $PBS_O_WORKDIR/mympiprog.x .
# load the mpi module
module load openmpi
# execute the calculation
mpiexec -pernode ./mympiprog.x
# copy output file to home
cp output $PBS_O_WORKDIR/.
#exit
exit
```
In this example, a directory in /home holds the input file input and the executable mympiprog.x. We create the directory myjob on the /scratch filesystem, copy the input and executable files from the /home directory where the qsub was invoked ($PBS_O_WORKDIR) to /scratch, execute the MPI program mympiprog.x and copy the output file back to the /home directory. The mympiprog.x is executed as one process per node, on all allocated nodes.
!!! Note "Note"
    Consider preloading inputs and executables onto [shared scratch](storage/) before the calculation starts.
In some cases, it may be impractical to copy the inputs to scratch and the outputs to home. This is especially true when very large input and output files are expected, or when the files should be reused by a subsequent calculation. In such a case, it is the user's responsibility to preload the input files on shared /scratch before the job submission and retrieve the outputs manually after all calculations are finished.
!!! Note "Note"
    Store the qsub options within the jobscript. Use the **mpiprocs** and **ompthreads** qsub options to control the MPI job execution.
Example jobscript for an MPI job with preloaded inputs and executables; the options for qsub are stored within the script:
```bash
#!/bin/bash
#PBS -q qprod
#PBS -N MYJOB
#PBS -l select=100:ncpus=16:mpiprocs=1:ompthreads=16
#PBS -A OPEN-0-0
# change to scratch directory, exit on failure
SCRDIR=/scratch/$USER/myjob
cd $SCRDIR || exit
# load the mpi module
module load openmpi
# execute the calculation
mpiexec ./mympiprog.x
#exit
exit
```
In this example, the input and executable files are assumed to be preloaded manually into the /scratch/$USER/myjob directory. Note the **mpiprocs** and **ompthreads** qsub options, controlling the behavior of the MPI execution. The mympiprog.x is executed as one process per node, on all 100 allocated nodes. If mympiprog.x implements OpenMP threads, it will run 16 threads per node.
More information can be found in the [Running OpenMPI](../software/mpi/Running_OpenMPI/) and [Running MPICH2](../software/mpi/running-mpich2/) sections.
### Example Jobscript for Single Node Calculation
!!! Note "Note"
    The local scratch directory is often useful for single node jobs. Local scratch will be deleted immediately after the job ends.
Example jobscript for a single node calculation, using [local scratch](storage/) on the node:
```bash
#!/bin/bash
# change to local scratch directory
cd /lscratch/$PBS_JOBID || exit
# copy input file to scratch
cp $PBS_O_WORKDIR/input .
cp $PBS_O_WORKDIR/myprog.x .
# execute the calculation
./myprog.x
# copy output file to home
cp output $PBS_O_WORKDIR/.
#exit
exit
```
In this example, a directory in the home directory holds the input file input and the executable myprog.x. We copy the input and executable files from the home directory where the qsub was invoked ($PBS_O_WORKDIR) to the local scratch /lscratch/$PBS_JOBID, execute the myprog.x and copy the output file back to the /home directory. The myprog.x runs on one node only and may use threads.
### Other Jobscript Examples
Further jobscript examples may be found in the software section and the [Capacity computing](capacity-computing/) section.
Network
=======
All compute and login nodes of Anselm are interconnected by an [InfiniBand](http://en.wikipedia.org/wiki/InfiniBand) QDR network and by a Gigabit [Ethernet](http://en.wikipedia.org/wiki/Ethernet) network. Both networks may be used to transfer user data.
Infiniband Network
------------------
All compute and login nodes of Anselm are interconnected by a high-bandwidth, low-latency [Infiniband](http://en.wikipedia.org/wiki/InfiniBand) QDR network (IB 4x QDR, 40 Gbps). The network topology is a fully non-blocking fat-tree.
The compute nodes may be accessed via the InfiniBand network using the ib0 network interface, in the address range 10.2.1.1-209. MPI may be used to establish native InfiniBand connections among the nodes.
!!! Note "Note"
    The network provides **2170 MB/s** transfer rates via the TCP connection (single stream) and up to **3600 MB/s** via the native InfiniBand protocol.
The fat tree topology ensures that peak transfer rates are achieved between any two nodes, independent of network traffic exchanged among other nodes concurrently.
Ethernet Network
----------------
The compute nodes may be accessed via the regular Gigabit Ethernet network interface eth0, in the address range 10.1.1.1-209, or by using the aliases cn1-cn209. The network provides **114 MB/s** transfer rates via the TCP connection.
Example
-------
```bash
$ qsub -q qexp -l select=4:ncpus=16 -N Name0 ./myjob
$ qstat -n -u username
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
15209.srv11     username qexp     Name0        5530   4  64    --  01:00 R 00:00
   cn17/0*16+cn108/0*16+cn109/0*16+cn110/0*16
$ ssh 10.2.1.110
$ ssh 10.1.1.108
```
In this example, we access the node cn110 through the InfiniBand network via the ib0 interface, and then go from cn110 to cn108 over the Ethernet network.
PRACE User Support
==================
Intro
-----
PRACE users coming to Anselm as a TIER-1 system offered through the DECI calls are, in general, treated as standard users, so most of the general documentation applies to them as well. This section shows the main differences for quicker orientation, but often refers to the original documentation. PRACE users who don't undergo the full procedure (including signing the IT4I AuP on top of the PRACE AuP) will not have a password and thus no access to some services intended for regular users. This can lower their comfort, but otherwise they should be able to use the TIER-1 system as intended. Please see the [Obtaining Login Credentials section](../get-started-with-it4innovations/obtaining-login-credentials/obtaining-login-credentials/), if the same level of access is required.
All general [PRACE User Documentation](http://www.prace-ri.eu/user-documentation/) should be read before continuing reading the local documentation here.
Help and Support
--------------------
If you have any trouble, need information, want to request support or want to install additional software, please use the [PRACE Helpdesk](http://www.prace-ri.eu/helpdesk-guide264/).
Information about the local services is provided in the [introduction of the general user documentation](introduction/). Please keep in mind that standard PRACE accounts don't have a password to access the web interface of the local (IT4Innovations) request tracker, and thus a new ticket should be created by sending an e-mail to support[at]it4i.cz.
Obtaining Login Credentials
---------------------------
In general, PRACE users already have a PRACE account set up through their HOMESITE (institution from their country) as a result of an awarded PRACE project proposal. This includes a signed PRACE AuP, generated and registered certificates, etc.
If there is a special need, a PRACE user can get a standard (local) account at IT4Innovations. To get an account on the Anselm cluster, the user needs to obtain the login credentials. The procedure is the same as for general users of the cluster, so please see the corresponding section of the general documentation here.
Accessing the cluster
---------------------
### Access with GSI-SSH
For all PRACE users, the method for interactive access (login) and data transfer based on grid services from the Globus Toolkit (GSI SSH and GridFTP) is supported.
The user will need a valid certificate and to be present in the PRACE LDAP (please contact your HOME SITE or the primary investigator of your project for LDAP account creation).
Most of the information needed by PRACE users accessing the Anselm TIER-1 system can be found here:
- [General user's FAQ](http://www.prace-ri.eu/Users-General-FAQs)
- [Certificates FAQ](http://www.prace-ri.eu/Certificates-FAQ)
- [Interactive access using GSISSH](http://www.prace-ri.eu/Interactive-Access-Using-gsissh)
- [Data transfer with GridFTP](http://www.prace-ri.eu/Data-Transfer-with-GridFTP-Details)
- [Data transfer with gtransfer](http://www.prace-ri.eu/Data-Transfer-with-gtransfer)
Before you start to use any of the services, don't forget to create a proxy certificate from your certificate:
```bash
$ grid-proxy-init
```
To check whether your proxy certificate is still valid (by default it is valid for 12 hours), use:
```bash
$ grid-proxy-info
```
To access the Anselm cluster, two login nodes running the GSI SSH service are available. The service is available from the public Internet as well as from the internal PRACE network (accessible only from other PRACE partners).
**Access from PRACE network:**
It is recommended to use the single DNS name anselm-prace.it4i.cz, which is distributed between the two login nodes. If needed, users can log in directly to one of the login nodes. The addresses are:
|Login address|Port|Protocol|Login node|
|---|---|---|---|
|anselm-prace.it4i.cz|2222|gsissh|login1 or login2|
|login1-prace.anselm.it4i.cz|2222|gsissh|login1|
|login2-prace.anselm.it4i.cz|2222|gsissh|login2|
```bash
$ gsissh -p 2222 anselm-prace.it4i.cz
```
When logging in from another PRACE system, the prace_service script can be used:
```bash
$ gsissh `prace_service -i -s anselm`
```
**Access from public Internet:**
It is recommended to use the single DNS name anselm.it4i.cz, which is distributed between the two login nodes. If needed, users can log in directly to one of the login nodes. The addresses are:
|Login address|Port|Protocol|Login node|
|---|---|---|---|
|anselm.it4i.cz|2222|gsissh|login1 or login2|
|login1.anselm.it4i.cz|2222|gsissh|login1|
|login2.anselm.it4i.cz|2222|gsissh|login2|
```bash
$ gsissh -p 2222 anselm.it4i.cz
```
When logging in from another PRACE system, the prace_service script can be used:
```bash
$ gsissh `prace_service -e -s anselm`
```
Although the preferred and recommended file transfer mechanism is [using GridFTP](prace/#file-transfers), the GSI SSH implementation on Anselm also supports SCP, so gsiscp can be used for transferring small files:
```bash
$ gsiscp -P 2222 _LOCAL_PATH_TO_YOUR_FILE_ anselm.it4i.cz:_ANSELM_PATH_TO_YOUR_FILE_
$ gsiscp -P 2222 anselm.it4i.cz:_ANSELM_PATH_TO_YOUR_FILE_ _LOCAL_PATH_TO_YOUR_FILE_
$ gsiscp -P 2222 _LOCAL_PATH_TO_YOUR_FILE_ anselm-prace.it4i.cz:_ANSELM_PATH_TO_YOUR_FILE_
$ gsiscp -P 2222 anselm-prace.it4i.cz:_ANSELM_PATH_TO_YOUR_FILE_ _LOCAL_PATH_TO_YOUR_FILE_
```
### Access to X11 applications (VNC)
If the user needs to run an X11-based graphical application and does not have an X11 server, the application can be run using the VNC service. If the user is using regular SSH-based access, please see the section in the general documentation.
If the user uses GSI SSH-based access, the procedure is similar to the SSH-based access, only the port forwarding must be done using GSI SSH:
```bash
$ gsissh -p 2222 anselm.it4i.cz -L 5961:localhost:5961
```
### Access with SSH
After successfully obtaining login credentials for the local IT4Innovations account, PRACE users can access the cluster as regular users using SSH. For more information please see the section in the general documentation.
File transfers
------------------
PRACE users can use the same transfer mechanisms as regular users (if they have undergone the full registration procedure). For information about this, please see the section in the general documentation.
Apart from the standard mechanisms, a GridFTP server running the Globus Toolkit GridFTP service is available for PRACE users to transfer data to/from the Anselm cluster. The service is available from the public Internet as well as from the internal PRACE network (accessible only from other PRACE partners).
There is one control server and three backend servers for striping and/or backup in case one of them fails.
**Access from PRACE network:**
|Login address|Port|Node role|
|---|---|---|
|gridftp-prace.anselm.it4i.cz|2812|Front end / control server|
|login1-prace.anselm.it4i.cz|2813|Backend / data mover server|
|login2-prace.anselm.it4i.cz|2813|Backend / data mover server|
|dm1-prace.anselm.it4i.cz|2813|Backend / data mover server|
Copy files **to** Anselm by running the following commands on your local machine:
```bash
$ globus-url-copy file://_LOCAL_PATH_TO_YOUR_FILE_ gsiftp://gridftp-prace.anselm.it4i.cz:2812/home/prace/_YOUR_ACCOUNT_ON_ANSELM_/_PATH_TO_YOUR_FILE_
```
Or by using the prace_service script:
```bash
$ globus-url-copy file://_LOCAL_PATH_TO_YOUR_FILE_ gsiftp://`prace_service -i -f anselm`/home/prace/_YOUR_ACCOUNT_ON_ANSELM_/_PATH_TO_YOUR_FILE_
```
Copy files **from** Anselm:
```bash
$ globus-url-copy gsiftp://gridftp-prace.anselm.it4i.cz:2812/home/prace/_YOUR_ACCOUNT_ON_ANSELM_/_PATH_TO_YOUR_FILE_ file://_LOCAL_PATH_TO_YOUR_FILE_
```
Or by using the prace_service script:
```bash
$ globus-url-copy gsiftp://`prace_service -i -f anselm`/home/prace/_YOUR_ACCOUNT_ON_ANSELM_/_PATH_TO_YOUR_FILE_ file://_LOCAL_PATH_TO_YOUR_FILE_
```
**Access from public Internet:**
|Login address|Port|Node role|
|---|---|---|
|gridftp.anselm.it4i.cz|2812|Front end / control server|
|login1.anselm.it4i.cz|2813|Backend / data mover server|
|login2.anselm.it4i.cz|2813|Backend / data mover server|
|dm1.anselm.it4i.cz|2813|Backend / data mover server|
Copy files **to** Anselm by running the following commands on your local machine:
```bash
$ globus-url-copy file://_LOCAL_PATH_TO_YOUR_FILE_ gsiftp://gridftp.anselm.it4i.cz:2812/home/prace/_YOUR_ACCOUNT_ON_ANSELM_/_PATH_TO_YOUR_FILE_
```
Or by using the prace_service script:
```bash
$ globus-url-copy file://_LOCAL_PATH_TO_YOUR_FILE_ gsiftp://`prace_service -e -f anselm`/home/prace/_YOUR_ACCOUNT_ON_ANSELM_/_PATH_TO_YOUR_FILE_
```
Copy files **from** Anselm:
```bash
$ globus-url-copy gsiftp://gridftp.anselm.it4i.cz:2812/home/prace/_YOUR_ACCOUNT_ON_ANSELM_/_PATH_TO_YOUR_FILE_ file://_LOCAL_PATH_TO_YOUR_FILE_
```
Or by using the prace_service script:
```bash
$ globus-url-copy gsiftp://`prace_service -e -f anselm`/home/prace/_YOUR_ACCOUNT_ON_ANSELM_/_PATH_TO_YOUR_FILE_ file://_LOCAL_PATH_TO_YOUR_FILE_
```
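For larger transfers it may help to use several parallel TCP streams and striping across the backend data movers. A minimal sketch, assuming the standard Globus Toolkit globus-url-copy options `-p` (parallel data streams) and `-stripe` are usable on both ends:
```bash
$ globus-url-copy -p 4 -stripe file://_LOCAL_PATH_TO_YOUR_FILE_ gsiftp://gridftp.anselm.it4i.cz:2812/home/prace/_YOUR_ACCOUNT_ON_ANSELM_/_PATH_TO_YOUR_FILE_
```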
Generally both shared file systems are available through GridFTP:
|File system mount point|Filesystem|Comment|
|---|---|---|
|/home|Lustre|Default HOME directories of users in format /home/prace/login/|
|/scratch|Lustre|Shared SCRATCH mounted on the whole cluster|
More information about the shared file systems is available [here](storage/).
Usage of the cluster
--------------------
There are some limitations for PRACE users when using the cluster. By default, PRACE users aren't allowed to access the special queues in PBS Pro that give high priority or exclusive access to special equipment such as accelerated nodes and high memory (fat) nodes. There may also be restrictions on obtaining a working license for the commercial software installed on the cluster, mostly because of the license agreement or an insufficient number of licenses.
For production runs always use scratch file systems, either the global shared or the local ones. The available file systems are described [here](hardware-overview/).
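As an illustration only (a sketch, assuming per-user directories under the shared /scratch), a jobscript may create a per-job working directory on the scratch file system before starting the computation:
```bash
# illustrative sketch: create and enter a per-job directory on the shared scratch
SCRDIR=/scratch/$USER/$PBS_JOBID
mkdir -p $SCRDIR
cd $SCRDIR || exit
# ... run the computation here, then copy the results back to /home
```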
### Software, Modules and PRACE Common Production Environment
All system-wide installed software on the cluster is made available to users via modules. Information about the environment and module usage is in this [section of the general documentation](environment-and-modules/).
PRACE users can load the "prace" module to use the [PRACE Common Production Environment](http://www.prace-ri.eu/PRACE-common-production):
```bash
$ module load prace
```
### Resource Allocation and Job Execution
General information about the resource allocation, job queuing and job execution is in this [section of general documentation](resources-allocation-policy/).
For PRACE users, the default production run queue is "qprace". PRACE users can also use two other queues, "qexp" and "qfree".
|queue|Active project|Project resources|Nodes|priority|authorization|walltime|
|---|---|---|---|---|---|---|
|**qexp** Express queue|no|none required|2 reserved, 8 total|high|no|1 / 1h|
|**qprace** Production queue|yes|> 0|178 w/o accelerator|medium|no|24 / 48h|
|**qfree** Free resource queue|yes|none required|178 w/o accelerator|very low|no| 12 / 12h|
**qprace**, the PRACE production queue: This queue is intended for normal production runs. An active project with nonzero remaining resources must be specified to enter the qprace queue. The queue runs with medium priority and no special authorization is required to use it. The maximum runtime in qprace is 48 hours (see the table above); if the job needs more time, it must use the checkpoint/restart functionality.
### Accounting & Quota
The resources that are currently subject to accounting are the core hours. The core hours are accounted on a wall clock basis. The accounting runs whenever the computational cores are allocated or blocked via the PBS Pro workload manager (the qsub command), regardless of whether the cores are actually used for any calculation. See the [example in the general documentation](resources-allocation-policy/).
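For example, a job that holds 4 nodes (4 x 16 = 64 cores) for 10 hours of wall clock time is accounted 64 x 10 = 640 core hours, regardless of how much the cores were actually loaded.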
PRACE users should check their project accounting using the [PRACE Accounting Tool (DART)](http://www.prace-ri.eu/accounting-report-tool/).
Users who have undergone the full local registration procedure (including signing the IT4Innovations Acceptable Use Policy) and who have received a local password may check at any time how many core hours they and their projects have consumed using the command "it4ifree". Please note that you need to know your user password to use the command and that the displayed core hours are "system core hours", which differ from PRACE "standardized core hours".
!!! Note "Note"
The **it4ifree** command is a part of it4i.portal.clients package, located here: <https://pypi.python.org/pypi/it4i.portal.clients>
```bash
$ it4ifree
Password:
PID Total Used ...by me Free
-------- ------- ------ -------- -------
OPEN-0-0 1500000 400644 225265 1099356
DD-13-1 10000 2606 2606 7394
```
By default, a file system quota is applied. To check the current status of your quota, use:
```bash
$ lfs quota -u USER_LOGIN /home
$ lfs quota -u USER_LOGIN /scratch
```
If the quota is insufficient, please contact [support](prace/#help-and-support) and request an increase.
Remote visualization service
============================
Introduction
------------
The goal of this service is to provide users with GPU-accelerated use of OpenGL applications, especially for pre- and post-processing work, where not only GPU performance is needed but also fast access to the shared file systems of the cluster and a reasonable amount of RAM.
The service is based on the integration of the open source tools VirtualGL and TurboVNC with the cluster's job scheduler PBS Professional.
Currently, two compute nodes are dedicated to this service with the following configuration for each node:
|[**Visualization node configuration**](compute-nodes/)||
|---|---|
|CPU|2x Intel Sandy Bridge E5-2670, 2.6GHz|
|Processor cores|16 (2x8 cores)|
|RAM|64 GB, min. 4 GB per core|
|GPU|NVIDIA Quadro 4000, 2GB RAM|
|Local disk drive|yes - 500 GB|
|Compute network|InfiniBand QDR|
Schematic overview
------------------
![rem_vis_scheme](../img/scheme.png "rem_vis_scheme")
![rem_vis_legend](../img/legend.png "rem_vis_legend")
How to use the service
----------------------
### Set up and start your own TurboVNC server.
TurboVNC is designed and implemented for cooperation with VirtualGL and is available for free for all major platforms. For more information and downloads, please refer to: <http://sourceforge.net/projects/turbovnc/>
**Always use TurboVNC on both sides** (server and client) and **don't mix TurboVNC with other VNC implementations** (TightVNC, TigerVNC, ...) as the VNC protocol implementations may slightly differ and diminish your user experience by introducing picture artifacts, etc.
The procedure is:
#### 1. Connect to a login node.
Please follow the documentation.
#### 2. Run your own instance of TurboVNC server.
To have OpenGL acceleration, a **24 bit color depth must be used**. Otherwise only the geometry (desktop size) definition is needed.
*At the first VNC server run you need to define a password.*
This example defines a desktop with dimensions of 1200x700 pixels and 24 bit color depth.
```bash
$ module load turbovnc/1.2.2
$ vncserver -geometry 1200x700 -depth 24
Desktop 'TurboVNC: login2:1 (username)' started on display login2:1
Starting applications specified in /home/username/.vnc/xstartup.turbovnc
Log file is /home/username/.vnc/login2:1.log
```
#### 3. Remember which display number your VNC server runs on (you will need it later to stop the server).
```bash
$ vncserver -list
TurboVNC server sessions:
X DISPLAY # PROCESS ID
:1 23269
```
In this example the VNC server runs on display **:1**.
#### 4. Remember the exact login node where your VNC server runs.
```bash
$ uname -n
login2
```
In this example the VNC server runs on **login2**.
#### 5. Remember on which TCP port your own VNC server is running.
To get the port, look at the log file of your VNC server.
```bash
$ grep -E "VNC.*port" /home/username/.vnc/login2:1.log
20/02/2015 14:46:41 Listening for VNC connections on TCP port 5901
```
In this example the VNC server listens on TCP port **5901**.
#### 6. Connect to the login node where your VNC server runs with SSH to tunnel your VNC session.
Tunnel the TCP port on which your VNC server is listening.
```bash
$ ssh login2.anselm.it4i.cz -L 5901:localhost:5901
```
*If you use Windows and PuTTY, please refer to the port forwarding setup in the documentation:*
[x-window-and-vnc#section-12](../get-started-with-it4innovations/accessing-the-clusters/graphical-user-interface/x-window-system/)
#### 7. If you don't have TurboVNC installed on your workstation.
Get it from: <http://sourceforge.net/projects/turbovnc/>
#### 8. Run TurboVNC Viewer from your workstation.
Note that you should connect through the SSH-tunneled port. In this example it is 5901 on your workstation (localhost).
```bash
$ vncviewer localhost:5901
```
*If you use the Windows version of TurboVNC Viewer, just run the Viewer and use the address **localhost:5901**.*
#### 9. Proceed to the chapter "Access the visualization node."
*Now you should have working TurboVNC session connected to your workstation.*
#### 10. After you end your visualization session.
*Don't forget to correctly shut down your own VNC server on the login node!*
```bash
$ vncserver -kill :1
```
Access the visualization node
-----------------------------
**To access the node use the dedicated PBS Professional scheduler queue qviz**. The queue has the following properties:
|queue|active project|project resources|nodes|min ncpus*|priority|authorization|walltime|
|---|---|---|---|---|---|---|---|
|**qviz** Visualization queue|yes|none required|2|4|150|no|1 hour / 8 hours|
Currently, when accessing the node, each user gets 4 cores of a CPU allocated, thus approximately 16 GB of RAM and 1/4 of the GPU capacity. *If more GPU power or RAM is required, it is recommended to allocate one whole node per user, so that all 16 cores, the whole RAM and the whole GPU are exclusive. This is currently also the maximum allowed allocation per user. One hour of work is allocated by default; the user may ask for 2 hours at maximum.*
To access the visualization node, follow these steps:
#### 1. In your VNC session, open a terminal and allocate a node using the PBSPro qsub command.
*This step is necessary to allow you to proceed with the next steps.*
```bash
$ qsub -I -q qviz -A PROJECT_ID
```
In this example the default values for CPU cores and usage time are used.
```bash
$ qsub -I -q qviz -A PROJECT_ID -l select=1:ncpus=16 -l walltime=02:00:00
```
In this example a whole node is requested for 2 hours.
*Substitute **PROJECT_ID** with the assigned project identification string.*
If there are free resources for your request, you will have a shell running on an assigned node. Please remember the name of the node.
```bash
$ uname -n
srv8
```
In this example the visualization session was assigned to node **srv8**.
#### 2. In your VNC session open another terminal (keep the one with the interactive PBSPro job open).
Set up the VirtualGL connection to the node that PBSPro allocated for your job.
```bash
$ vglconnect srv8
```
You will be connected through the created VirtualGL tunnel to the visualization node, where you will have a shell.
#### 3. Load the VirtualGL module.
```bash
$ module load virtualgl/2.4
```
#### 4. Run your desired OpenGL accelerated application using the VirtualGL script "vglrun".
```bash
$ vglrun glxgears
```
Please note that if you want to run an OpenGL application which is available through modules, you need to first load the respective module, e.g. to run the **Mentat** OpenGL application from the **MARC** software package use:
```bash
$ module load marc/2013.1
$ vglrun mentat
```
#### 5. After you end your work with the OpenGL application.
Just log out from the visualization node, exit both opened terminals and end your VNC server session as described above.
Tips and Tricks
---------------
If you want to increase the responsiveness of the visualization, please adjust your TurboVNC client settings in this way:
![rem_vis_settings](../img/turbovncclientsetting.png "rem_vis_settings")
To get an idea of how the settings affect the resulting picture quality, three levels of "JPEG image quality" are demonstrated:
1. JPEG image quality = 30
![rem_vis_q3](../img/quality3.png "rem_vis_q3")
2. JPEG image quality = 15
![rem_vis_q2](../img/quality2.png "rem_vis_q2")
3. JPEG image quality = 10
![rem_vis_q1](../img/quality1.png "rem_vis_q1")
Resource Allocation and Job Execution
=====================================
To run a [job](../introduction/), [computational resources](../introduction/) for this particular job must be allocated. This is done via the PBS Pro job workload manager software, which efficiently distributes workloads across the supercomputer. Extensive information about PBS Pro can be found in the [official documentation here](../pbspro-documentation/pbspro/), especially in the PBS Pro User's Guide.
Resources Allocation Policy
---------------------------
The resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue and the resources available to the Project. [The Fairshare](job-priority/) at Anselm ensures that individual users may consume approximately equal amounts of resources per week. The resources are accessible via several queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources. The following queues are available to Anselm users:
- **qexp**, the Express queue
- **qprod**, the Production queue
- **qlong**, the Long queue
- **qnvidia**, **qmic**, **qfat**, the Dedicated queues
- **qfree**, the Free resource utilization queue
!!! Note "Note"
Check the queue status at <https://extranet.it4i.cz/anselm/>
Read more on the [Resource AllocationPolicy](resources-allocation-policy/) page.
Job submission and execution
----------------------------
!!! Note "Note"
Use the **qsub** command to submit your jobs.
The qsub command submits the job into the queue, i.e. it creates a request to the PBS Job manager for the allocation of the specified resources. The **smallest allocation unit is an entire node, 16 cores**, with the exception of the qexp queue. The resources will be allocated when available, subject to allocation policies and constraints. **After the resources are allocated, the jobscript or interactive shell is executed on the first of the allocated nodes.**
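For example (a sketch only; PROJECT_ID and the jobscript name myjob.sh are placeholders), a submission requesting 2 whole nodes for 4 hours in the production queue could look like this:
```bash
$ qsub -A PROJECT_ID -q qprod -l select=2:ncpus=16 -l walltime=04:00:00 ./myjob.sh
```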
Read more on the [Job submission and execution](job-submission-and-execution/) page.
Capacity computing
------------------
!!! Note "Note"
Use Job arrays when running huge number of jobs.
Use GNU Parallel and/or Job arrays when running (many) single core jobs.
In many cases, it is useful to submit a huge number (100+) of computational jobs into the PBS queue system. A huge number of (small) jobs is one of the most effective ways to execute embarrassingly parallel calculations, achieving the best runtime, throughput and computer utilization. In this chapter, we discuss the recommended way to run a huge number of jobs, including **ways to run a huge number of single core jobs**.
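As a minimal sketch (PROJECT_ID and the jobscript name are placeholders), a PBS Pro job array of 100 sub-jobs can be submitted with a single qsub call; inside the jobscript, the PBS_ARRAY_INDEX environment variable identifies each sub-job so it can pick its own input:
```bash
$ qsub -A PROJECT_ID -q qprod -J 1-100 ./myarrayjob.sh
```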
Read more on [Capacity computing](capacity-computing/) page.