ANSYS CFX
=========
[ANSYS CFX](http://www.ansys.com/Products/Simulation+Technology/Fluid+Dynamics/Fluid+Dynamics+Products/ANSYS+CFX) software is a high-performance, general purpose fluid dynamics program that has been applied to solve wide-ranging fluid flow problems for over 20 years. At the heart of ANSYS CFX is its advanced solver technology, the key to achieving reliable and accurate solutions quickly and robustly. The modern, highly parallelized solver is the foundation for an abundant choice of physical models to capture virtually any type of phenomena related to fluid flow. The solver and its many physical models are wrapped in a modern, intuitive, and flexible GUI and user environment, with extensive capabilities for customization and automation using session files, scripting and a powerful expression language.
To run ANSYS CFX in batch mode you can utilize/modify the default cfx.pbs script and execute it via the qsub command.
```bash
#!/bin/bash
#PBS -l nodes=2:ppn=16
#PBS -q qprod
#PBS -N $USER-CFX-Project
#PBS -A XX-YY-ZZ
#! Mail to user when the job terminates or aborts
#PBS -m ae
#!change the working directory (default is home directory)
#cd <working directory> (working directory must exist)
WORK_DIR="/scratch/$USER/work"
cd $WORK_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo `cat $PBS_NODEFILE`
module load ansys
#### Set number of processors per host listing
#### (set to 1 as $PBS_NODEFILE lists each node twice if :ppn=2)
procs_per_host=1
#### Create host list
hl=""
for host in `cat $PBS_NODEFILE`
do
if [ "$hl" = "" ]
then hl="$host:$procs_per_host"
else hl="${hl}:$host:$procs_per_host"
fi
done
echo Machines: $hl
#-def input.def includes the input of the CFX analysis in DEF format
#-P the name of the preferred license feature (aa_r=ANSYS Academic Research, ane3fl=Multiphysics(commercial))
/ansys_inc/v145/CFX/bin/cfx5solve -def input.def -size 4 -size-ni 4x -part-large -start-method "Platform MPI Distributed Parallel" -par-dist $hl -P aa_r
```
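Assuming the script above is saved as cfx.pbs (the name used in this section) and the /scratch working directory already exists, the job is submitted in the usual way:
```bash
$ qsub cfx.pbs
```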
The header of the PBS file (above) is common; its description can be found on [this site](../../resource-allocation-and-job-execution/job-submission-and-execution/). SVS FEM recommends allocating resources with the keywords nodes and ppn. These keywords directly specify the number of nodes (computers) and cores per node (ppn) that will be utilized by the job, and the rest of the script assumes this structure of allocated resources.
The working directory has to be created before submitting the PBS job into the queue. The input file should be in the working directory, or the full path to the input file has to be specified. The input file has to be a common CFX .def file, which is passed to the CFX solver via the -def parameter.
The **license** should be selected by the -P parameter (capital **P**). Licensed products are the following: aa_r (ANSYS **Academic** Research), ane3fl (ANSYS Multiphysics, **commercial**).
[More about licensing here](licensing/)
ANSYS Fluent
============
[ANSYS Fluent](http://www.ansys.com/Products/Simulation+Technology/Fluid+Dynamics/Fluid+Dynamics+Products/ANSYS+Fluent)
software contains the broad physical modeling capabilities needed to model flow, turbulence, heat transfer, and reactions for industrial applications ranging from air flow over an aircraft wing to combustion in a furnace, from bubble columns to oil platforms, from blood flow to semiconductor manufacturing, and from clean room design to wastewater treatment plants. Special models that give the software the ability to model in-cylinder combustion, aeroacoustics, turbomachinery, and multiphase systems have served to broaden its reach.
1. Common way to run Fluent via a PBS file
-----------------------------------------
To run ANSYS Fluent in batch mode you can utilize/modify the default fluent.pbs script and execute it via the qsub command.
```bash
#!/bin/bash
#PBS -S /bin/bash
#PBS -l nodes=2:ppn=16
#PBS -q qprod
#PBS -N $USER-Fluent-Project
#PBS -A XX-YY-ZZ
#! Mail to user when the job terminates or aborts
#PBS -m ae
#!change the working directory (default is home directory)
#cd <working directory> (working directory must exist)
WORK_DIR="/scratch/$USER/work"
cd $WORK_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo `cat $PBS_NODEFILE`
#### Load the ansys module so that we find the fluent command
module load ansys
# Determine the total number of allocated cores from $PBS_NODEFILE
NCORES=`wc -l $PBS_NODEFILE | awk '{print $1}'`
/ansys_inc/v145/fluent/bin/fluent 3d -t$NCORES -cnf=$PBS_NODEFILE -g -i fluent.jou
```
The header of the PBS file (above) is common; its description can be found on [this site](../../resource-allocation-and-job-execution/job-submission-and-execution.md). [SVS FEM](http://www.svsfem.cz) recommends allocating resources with the keywords nodes and ppn. These keywords directly specify the number of nodes (computers) and cores per node (ppn) that will be utilized by the job, and the rest of the script assumes this structure of allocated resources.
The working directory has to be created before submitting the PBS job into the queue. The input file should be in the working directory, or the full path to the input file has to be specified. The input file has to be a common Fluent journal file, which is passed to the Fluent solver via the -i fluent.jou parameter.
A journal file defining the input geometry, the boundary conditions, and the solution process may, for example, have the following structure:
```bash
/file/read-case aircraft_2m.cas.gz
/solve/init
init
/solve/iterate
10
/file/write-case-dat aircraft_2m-solution
/exit yes
```
The appropriate dimension of the problem has to be set by the solver version parameter (2d or 3d).
2. Fast way to run Fluent from the command line
--------------------------------------------------------
```bash
fluent solver_version [FLUENT_options] -i journal_file -pbs
```
This syntax will start the ANSYS FLUENT job under PBS Professional using the qsub command in a batch manner. When resources are available, PBS Professional will start the job and return a job ID, usually in the form of *job_ID.hostname*. This job ID can then be used to query, control, or stop the job using standard PBS Professional commands, such as qstat or qdel. The job will be run out of the current working directory, and all output will be written to the file fluent.o *job_ID*.
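For example, a 3D case driven by the journal file from the previous section could be submitted directly as follows (a sketch only; the core count of 16 is illustrative):
```bash
fluent 3d -t16 -g -i fluent.jou -pbs
```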
3. Running Fluent via user's config file
----------------------------------------
The sample script uses a configuration file called pbs_fluent.conf if no command line arguments are present. This configuration file should be present in the directory from which the jobs are submitted (which is also the directory in which the jobs are executed). The following is an example of what the content of pbs_fluent.conf can be:
```bash
input="example_small.flin"
case="Small-1.65m.cas"
fluent_args="3d -pmyrinet"
outfile="fluent_test.out"
mpp="true"
```
The following is an explanation of the parameters:
- input is the name of the input file.
- case is the name of the .cas file that the input file will utilize.
- fluent_args are extra ANSYS FLUENT arguments. As shown in the previous example, you can specify the interconnect by using the -p interconnect command. The available interconnects include ethernet (the default), myrinet, infiniband, vendor, altix, and crayx. The MPI is selected automatically, based on the specified interconnect.
- outfile is the name of the file to which the standard output will be sent.
- mpp="true" will tell the job script to execute the job across multiple processors.
To run ANSYS Fluent in batch mode with the user's configuration file, you can utilize/modify the following script and execute it via the qsub command.
```bash
#!/bin/sh
#PBS -l nodes=2:ppn=4
#PBS -q qprod
#PBS -N $USER-Fluent-Project
#PBS -A XX-YY-ZZ
cd $PBS_O_WORKDIR
#We assume that if they didn't specify arguments then they should use the
#config file
if [ "xx${input}${case}${mpp}${fluent_args}zz" = "xxzz" ]; then
if [ -f pbs_fluent.conf ]; then
. pbs_fluent.conf
else
printf "No command line arguments specified, "
printf "and no configuration file found. Exiting n"
fi
fi
#Augment the ANSYS FLUENT command line arguments
case "$mpp" in
true)
#MPI job execution scenario
num_nodes=`cat $PBS_NODEFILE | sort -u | wc -l`
cpus=`expr $num_nodes \* $NCPUS`
#Default arguments for mpp jobs, these should be changed to suit your
#needs.
fluent_args="-t${cpus} $fluent_args -cnf=$PBS_NODEFILE"
;;
*)
#SMP case
#Default arguments for smp jobs, should be adjusted to suit your
#needs.
fluent_args="-t$NCPUS $fluent_args"
;;
esac
#Default arguments for all jobs
fluent_args="-ssh -g -i $input $fluent_args"
echo "---------- Going to start a fluent job with the following settings:
Input: $input
Case: $case
Output: $outfile
Fluent arguments: $fluent_args"
#run the solver
/ansys_inc/v145/fluent/bin/fluent $fluent_args > $outfile
```
It runs the jobs out of the directory from which they are submitted (PBS_O_WORKDIR).
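With pbs_fluent.conf present in the submission directory, the job is then submitted as usual (the script name fluent_config.pbs is used here only for illustration):
```bash
$ qsub fluent_config.pbs
```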
4. Running Fluent in parallel
-----------------------------
Fluent can be run in parallel only under the Academic Research license. To do so, the ANSYS Academic Research license must be placed before the ANSYS CFD license in the user preferences. To make this change, run the anslic_admin utility:
```bash
/ansys_inc/shared_les/licensing/lic_admin/anslic_admin
```
The ANSLIC_ADMIN utility will start:
![](../../../img/Fluent_Licence_1.jpg)
![](../../../img/Fluent_Licence_2.jpg)
![](../../../img/Fluent_Licence_3.jpg)
ANSYS Academic Research license should be moved up to the top of the list.
![](../../../img/Fluent_Licence_4.jpg)
ANSYS LS-DYNA
=============
**[ANSYS LS-DYNA](http://www.ansys.com/Products/Simulation+Technology/Structural+Mechanics/Explicit+Dynamics/ANSYS+LS-DYNA)** software provides convenient and easy-to-use access to the technology-rich, time-tested explicit solver without the need to contend with the complex input requirements of this sophisticated program. Introduced in 1996, ANSYS LS-DYNA capabilities have helped customers in numerous industries to resolve highly intricate design issues. ANSYS Mechanical users have been able to take advantage of complex explicit solutions for a long time utilizing the traditional ANSYS Parametric Design Language (APDL) environment. These explicit capabilities are available to ANSYS Workbench users as well. The Workbench platform is a powerful, comprehensive, easy-to-use environment for engineering simulation. CAD import from all sources, geometry cleanup, automatic meshing, solution, parametric optimization, result visualization and comprehensive report generation are all available within a single fully interactive modern graphical user environment.
To run ANSYS LS-DYNA in batch mode you can utilize/modify the default ansysdyna.pbs script and execute it via the qsub command.
```bash
#!/bin/bash
#PBS -l nodes=2:ppn=16
#PBS -q qprod
#PBS -N $USER-DYNA-Project
#PBS -A XX-YY-ZZ
#! Mail to user when the job terminates or aborts
#PBS -m ae
#!change the working directory (default is home directory)
#cd <working directory>
WORK_DIR="/scratch/$USER/work"
cd $WORK_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo `cat $PBS_NODEFILE`
#! Counts the number of processors
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS processors
module load ansys
#### Set number of processors per host listing
#### (set to 1 as $PBS_NODEFILE lists each node twice if :ppn=2)
procs_per_host=1
#### Create host list
hl=""
for host in `cat $PBS_NODEFILE`
do
if [ "$hl" = "" ]
then hl="$host:$procs_per_host"
else hl="${hl}:$host:$procs_per_host"
fi
done
echo Machines: $hl
/ansys_inc/v145/ansys/bin/ansys145 -dis -lsdynampp i=input.k -machines $hl
```
The header of the PBS file (above) is common; its description can be found on [this site](../../resource-allocation-and-job-execution/job-submission-and-execution/). [SVS FEM](http://www.svsfem.cz) recommends allocating resources with the keywords nodes and ppn. These keywords directly specify the number of nodes (computers) and cores per node (ppn) that will be utilized by the job, and the rest of the script assumes this structure of allocated resources.
The working directory has to be created before submitting the PBS job into the queue. The input file should be in the working directory, or the full path to the input file has to be specified. The input file has to be a common LS-DYNA .**k** file, which is passed to the ANSYS solver via the i= parameter.
ANSYS MAPDL
===========
**[ANSYS Multiphysics](http://www.ansys.com/Products/Simulation+Technology/Structural+Mechanics/ANSYS+Multiphysics)**
software offers a comprehensive product solution for both multiphysics and single-physics analysis. The product includes structural, thermal, fluid and both high- and low-frequency electromagnetic analysis. The product also contains solutions for both direct and sequentially coupled physics problems including direct coupled-field elements and the ANSYS multi-field solver.
To run ANSYS MAPDL in batch mode you can utilize/modify the default mapdl.pbs script and execute it via the qsub command.
```bash
#!/bin/bash
#PBS -l nodes=2:ppn=16
#PBS -q qprod
#PBS -N $USER-ANSYS-Project
#PBS -A XX-YY-ZZ
#! Mail to user when the job terminates or aborts
#PBS -m ae
#!change the working directory (default is home directory)
#cd <working directory> (working directory must exist)
WORK_DIR="/scratch/$USER/work"
cd $WORK_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo `cat $PBS_NODEFILE`
module load ansys
#### Set number of processors per host listing
#### (set to 1 as $PBS_NODEFILE lists each node twice if :ppn=2)
procs_per_host=1
#### Create host list
hl=""
for host in `cat $PBS_NODEFILE`
do
if [ "$hl" = "" ]
then hl="$host:$procs_per_host"
else hl="${hl}:$host:$procs_per_host"
fi
done
echo Machines: $hl
#-i input.dat includes the input of analysis in APDL format
#-o file.out is output file from ansys where all text outputs will be redirected
#-p the name of license feature (aa_r=ANSYS Academic Research, ane3fl=Multiphysics(commercial), aa_r_dy=Academic AUTODYN)
/ansys_inc/v145/ansys/bin/ansys145 -b -dis -p aa_r -i input.dat -o file.out -machines $hl -dir $WORK_DIR
```
The header of the PBS file (above) is common; its description can be found on [this site](../../resource-allocation-and-job-execution/job-submission-and-execution.md). [SVS FEM](http://www.svsfem.cz) recommends allocating resources with the keywords nodes and ppn. These keywords directly specify the number of nodes (computers) and cores per node (ppn) that will be utilized by the job, and the rest of the script assumes this structure of allocated resources.
The working directory has to be created before submitting the PBS job into the queue. The input file should be in the working directory, or the full path to the input file has to be specified. The input file has to be a common APDL file, which is passed to the ANSYS solver via the -i parameter.
The **license** should be selected by the -p parameter. Licensed products are the following: aa_r (ANSYS **Academic** Research), ane3fl (ANSYS Multiphysics, **commercial**), aa_r_dy (ANSYS **Academic** AUTODYN).
[More about licensing here](licensing/)
Overview of ANSYS Products
==========================
**[SVS FEM](http://www.svsfem.cz/)**, as the **[ANSYS Channel partner](http://www.ansys.com/)** for the Czech Republic, provided all ANSYS licenses for the ANSELM cluster and provides support for all ANSYS products (Multiphysics, Mechanical, MAPDL, CFX, Fluent, Maxwell, LS-DYNA...) to IT staff and ANSYS users. If you run into a problem with ANSYS functionality, please contact [hotline@svsfem.cz](mailto:hotline@svsfem.cz?subject=Ostrava%20-%20ANSELM).
Anselm provides both commercial and academic variants. Academic variants are distinguished by the word "**Academic...**" in the license name or by the two-letter prefix "**aa_**" in the license feature name. The license is selected on the command line or directly in the user's PBS file (see the individual products). [More about licensing here](ansys/licensing/)
To load the latest version of any ANSYS product (Mechanical, Fluent, CFX, MAPDL,...) load the module:
```bash
$ module load ansys
```
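If you need a particular version or product variant, you can first list the installed ANSYS modules:
```bash
$ module avail ansys
```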
ANSYS supports an interactive regime, but since it is assumed to be used for extremely demanding tasks, interactive use is not recommended.
If you need to work interactively, we recommend configuring the RSM service on the client machine, which allows forwarding the solution to Anselm directly from the client's Workbench project (see ANSYS RSM service).
LS-DYNA
=======
[LS-DYNA](http://www.lstc.com/) is a multi-purpose, explicit and implicit finite element program used to analyze the nonlinear dynamic response of structures. Its fully automated contact analysis capability, a wide range of constitutive models to simulate a whole range of engineering materials (steels, composites, foams, concrete, etc.), error-checking features and the high scalability have enabled users worldwide to solve successfully many complex problems. Additionally LS-DYNA is extensively used to simulate impacts on structures from drop tests, underwater shock, explosions or high-velocity impacts. Explosive forming, process engineering, accident reconstruction, vehicle dynamics, thermal brake disc analysis or nuclear safety are further areas in the broad range of possible applications. In leading-edge research LS-DYNA is used to investigate the behaviour of materials like composites, ceramics, concrete, or wood. Moreover, it is used in biomechanics, human modelling, molecular structures, casting, forging, or virtual testing.
Anselm currently provides **1 commercial license of LS-DYNA without HPC support**.
To run LS-DYNA in batch mode you can utilize/modify the default lsdyna.pbs script and execute it via the qsub command.
```bash
#!/bin/bash
#PBS -l nodes=1:ppn=16
#PBS -q qprod
#PBS -N $USER-LSDYNA-Project
#PBS -A XX-YY-ZZ
#! Mail to user when the job terminates or aborts
#PBS -m ae
#!change the working directory (default is home directory)
#cd <working directory> (working directory must exist)
WORK_DIR="/scratch/$USER/work"
cd $WORK_DIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
module load lsdyna
/apps/engineering/lsdyna/lsdyna700s i=input.k
```
The header of the PBS file (above) is common; its description can be found on [this site](../../resource-allocation-and-job-execution/job-submission-and-execution.html). [SVS FEM](http://www.svsfem.cz) recommends allocating resources with the keywords nodes and ppn. These keywords directly specify the number of nodes (computers) and cores per node (ppn) that will be utilized by the job, and the rest of the script assumes this structure of allocated resources.
The working directory has to be created before submitting the PBS job into the queue. The input file should be in the working directory, or the full path to the input file has to be specified. The input file has to be a common LS-DYNA **.k** file, which is passed to the LS-DYNA solver via the i= parameter.
Molpro
======
Molpro is a complete system of ab initio programs for molecular electronic structure calculations.
About Molpro
------------
Molpro is a software package used for accurate ab-initio quantum chemistry calculations. More information can be found at the [official webpage](http://www.molpro.net/).
License
-------
The Molpro software package is available only to users that have a valid license. Please contact support to enable access to Molpro if you have a valid license appropriate for running on our cluster (e.g. an academic research group licence with parallel execution).
To run Molpro, you need to have a valid license token present in $HOME/.molpro/token. You can download the token from the [Molpro website](https://www.molpro.net/licensee/?portal=licensee).
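A minimal sketch of placing the downloaded token (the source path is just a placeholder):
```bash
$ mkdir -p $HOME/.molpro
$ cp /path/to/downloaded/token $HOME/.molpro/token
$ chmod 600 $HOME/.molpro/token
```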
Installed version
-----------------
Currently, version 2010.1, patch level 45, is installed on Anselm: the parallel version, compiled with Intel compilers and Intel MPI.
Compilation parameters are default:
|Parameter|Value|
|---|---|
|max number of atoms|200|
|max number of valence orbitals|300|
|max number of basis functions|4095|
|max number of states per symmetry|20|
|max number of state symmetries|16|
|max number of records|200|
|max number of primitives|maxbfn x [2]|
Running
------
Molpro is compiled for parallel execution using MPI and OpenMP. By default, Molpro reads the number of allocated nodes from PBS and launches a data server on one node. On the remaining allocated nodes, compute processes are launched, one process per node, each with 16 threads. You can modify this behavior by using -n, -t and helper-server options. Please refer to the [Molpro documentation](http://www.molpro.net/info/2010.1/doc/manual/node9.html) for more details.
!!! Note "Note"
The OpenMP parallelization in Molpro is limited and has been observed to produce limited scaling. We therefore recommend using MPI parallelization only. This can be achieved by passing the option mpiprocs=16:ompthreads=1 to PBS.
You are advised to use the -d option to point to a directory in [SCRATCH filesystem](../../storage/storage/). Molpro can produce a large amount of temporary data during its run, and it is important that these are placed in the fast scratch filesystem.
### Example jobscript
```bash
#PBS -A IT4I-0-0
#PBS -q qprod
#PBS -l select=1:ncpus=16:mpiprocs=16:ompthreads=1
cd $PBS_O_WORKDIR
# load Molpro module
module add molpro
# create a directory in the SCRATCH filesystem
mkdir -p /scratch/$USER/$PBS_JOBID
# copy an example input
cp /apps/chem/molpro/2010.1/molprop_2010_1_Linux_x86_64_i8/examples/caffeine_opt_diis.com .
# run Molpro with default options
molpro -d /scratch/$USER/$PBS_JOBID caffeine_opt_diis.com
# delete scratch directory
rm -rf /scratch/$USER/$PBS_JOBID
```
NWChem
======
**High-Performance Computational Chemistry**
Introduction
-------------------------
NWChem aims to provide its users with computational chemistry tools that are scalable both in their ability to treat large scientific computational chemistry problems efficiently, and in their use of available parallel computing resources from high-performance parallel supercomputers to conventional workstation clusters.
[Homepage](http://www.nwchem-sw.org/index.php/Main_Page)
Installed versions
------------------
The following versions are currently installed:
- 6.1.1, not recommended, problems have been observed with this version
- 6.3-rev2-patch1, current release with QMD patch applied. Compiled with Intel compilers, MKL and Intel MPI
- 6.3-rev2-patch1-openmpi, same as above, but compiled with OpenMPI and NWChem provided BLAS instead of MKL. This version is expected to be slower
- 6.3-rev2-patch1-venus, this version contains only libraries for VENUS interface linking. Does not provide standalone NWChem executable
For a current list of installed versions, execute:
```bash
module avail nwchem
```
Running
-------
NWChem is compiled for parallel MPI execution. Normal procedure for MPI jobs applies. Sample jobscript:
```bash
#PBS -A IT4I-0-0
#PBS -q qprod
#PBS -l select=1:ncpus=16
module add nwchem/6.3-rev2-patch1
mpirun -np 16 nwchem h2o.nw
```
Options
--------------------
Please refer to [the documentation](http://www.nwchem-sw.org/index.php/Release62:Top-level) and set the following directives in the input file (a short example input follows the list):
- MEMORY : controls the amount of memory NWChem will use
- SCRATCH_DIR : set this to a directory in the [SCRATCH filesystem](../../storage/storage/#scratch) (or run the calculation completely in a scratch directory). For certain calculations, it might be advisable to reduce I/O by forcing "direct" mode, e.g. "scf direct"
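For illustration, a minimal h2o.nw input (matching the jobscript above) using these directives could look as follows; this is a sketch only, and the memory value, scratch path, geometry, and basis set are placeholders rather than recommendations:
```bash
# h2o.nw - illustrative input only
memory 1000 mb
scratch_dir /scratch/your_username/nwchem
geometry
  O  0.000  0.000  0.000
  H  0.000  0.757  0.587
  H  0.000 -0.757  0.587
end
basis
  * library 6-31g
end
scf
  direct
end
task scf
```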
Compilers
=========
##Available compilers, including GNU, INTEL and UPC compilers
Currently there are several compilers for different programming languages available on the Anselm cluster:
- C/C++
- Fortran 77/90/95
- Unified Parallel C
- Java
- nVidia CUDA
The C/C++ and Fortran compilers are divided into two main groups: GNU and Intel.
Intel Compilers
---------------
For information about the usage of Intel Compilers and other Intel products, please read the [Intel Parallel studio](intel-suite/) page.
GNU C/C++ and Fortran Compilers
-------------------------------
For compatibility reasons, the original (old 4.4.6-4) versions of the GNU compilers are still available as part of the OS. These are accessible in the search path by default.
It is strongly recommended to use the up-to-date version (4.8.1), which comes with the module gcc:
```bash
$ module load gcc
$ gcc -v
$ g++ -v
$ gfortran -v
```
With the module loaded, two environment variables are predefined: one for maximum optimization on the Anselm cluster architecture, and the other for debugging purposes:
```bash
$ echo $OPTFLAGS
-O3 -march=corei7-avx
$ echo $DEBUGFLAGS
-O0 -g
```
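These variables can be passed directly to the compiler; for example (myprog.c is just an illustrative source file name):
```bash
$ module load gcc
$ gcc $OPTFLAGS -o myprog myprog.c        # optimized build tuned for Anselm
$ gcc $DEBUGFLAGS -o myprog_dbg myprog.c  # debug build
```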
For more information about the capabilities of the compilers, please see the man pages.
Unified Parallel C
------------------
UPC is supported by two compiler/runtime implementations:
- GNU - SMP/multi-threading support only
- Berkeley - multi-node support as well as SMP/multi-threading support
### GNU UPC Compiler
To use the GNU UPC compiler and run the compiled binaries use the module gupc
```bash
$ module add gupc
$ gupc -v
$ g++ -v
```
Simple program to test the compiler
```bash
$ cat count.upc
/* count.upc - a simple UPC example */
#include <upc.h>
#include <stdio.h>
int main() {
if (MYTHREAD == 0) {
printf("Welcome to GNU UPC!!!n");
}
upc_barrier;
printf(" - Hello from thread %in", MYTHREAD);
return 0;
}
```
To compile the example use
```bash
$ gupc -o count.upc.x count.upc
```
To run the example with 5 threads issue
```bash
$ ./count.upc.x -fupc-threads-5
```
For more information, see the man pages.
### Berkeley UPC Compiler
To use the Berkeley UPC compiler and runtime environment to run the binaries, use the module bupc:
```bash
$ module add bupc
$ upcc -version
```
By default, the "smp" UPC network is used. This is a very quick and easy way for testing/debugging, but it is limited to one node only.
For production runs, it is recommended to use the native InfiniBand implementation of the UPC network, "ibv". For testing/debugging on multiple nodes, the "mpi" UPC network is recommended. Please note that **the selection of the network is done at compile time** and not at runtime (as one might expect)!
Example UPC code:
```bash
$ cat hello.upc
/* hello.upc - a simple UPC example */
#include <upc.h>
#include <stdio.h>
int main() {
if (MYTHREAD == 0) {
printf("Welcome to Berkeley UPC!!!n");
}
upc_barrier;
printf(" - Hello from thread %in", MYTHREAD);
return 0;
}
```
To compile the example with the "ibv" UPC network use
```bash
$ upcc -network=ibv -o hello.upc.x hello.upc
```
To run the example with 5 threads issue
```bash
$ upcrun -n 5 ./hello.upc.x
```
To run the example on two compute nodes using all 32 cores, with 32 threads, issue
```bash
$ qsub -I -q qprod -A PROJECT_ID -l select=2:ncpus=16
$ module add bupc
$ upcrun -n 32 ./hello.upc.x
```
For more information, see the man pages.
Java
----
For information how to use Java (runtime and/or compiler), please read the [Java page](java/).
nVidia CUDA
-----------
For information how to work with nVidia CUDA, please read the [nVidia CUDA page](nvidia-cuda/).
COMSOL Multiphysics®
====================
Introduction
-------------------------
[COMSOL](http://www.comsol.com) is a powerful environment for modelling and solving various engineering and scientific problems based on partial differential equations. COMSOL is designed to solve coupled or multiphysics phenomena. For many
standard engineering problems COMSOL provides add-on products such as electrical, mechanical, fluid flow, and chemical
applications.
- [Structural Mechanics Module](http://www.comsol.com/structural-mechanics-module),
- [Heat Transfer Module](http://www.comsol.com/heat-transfer-module),
- [CFD Module](http://www.comsol.com/cfd-module),
- [Acoustics Module](http://www.comsol.com/acoustics-module),
- and [many others](http://www.comsol.com/products)
COMSOL also provides an interface for equation-based modelling of partial differential equations.
Execution
----------------------
On the Anselm cluster COMSOL is available in the latest stable version. There are two variants of the release:
- **Non-commercial**, or so called **EDU variant**, which can be used for research and educational purposes.
- **Commercial**, or so called **COM variant**, which can also be used for commercial activities. The **COM variant** has only a subset of the features available in the **EDU variant**. More about licensing will be posted here soon.
To load COMSOL, load the module:
```bash
$ module load comsol
```
By default, the **EDU variant** will be loaded. If you need another version or variant, load the particular version. To obtain the list of available versions, use
```bash
$ module avail comsol
```
If you need to prepare COMSOL jobs in interactive mode, it is recommended to use COMSOL on the compute nodes via the PBS Pro scheduler. To run the COMSOL Desktop GUI on Windows clients, it is recommended to use Virtual Network Computing (VNC).
```bash
$ xhost +
$ qsub -I -X -A PROJECT_ID -q qprod -l select=1:ncpus=16
$ module load comsol
$ comsol
```
To run COMSOL in batch mode, without the COMSOL Desktop GUI environment, you can utilize/modify the default (comsol.pbs) job script and execute it via the qsub command.
```bash
#!/bin/bash
#PBS -l select=3:ncpus=16
#PBS -q qprod
#PBS -N JOB_NAME
#PBS -A PROJECT_ID
cd /scratch/$USER/ || exit
echo Time is `date`
echo Directory is `pwd`
echo '**PBS_NODEFILE***START*******'
cat $PBS_NODEFILE
echo '**PBS_NODEFILE***END*********'
module load comsol
# module load comsol/43b-COM
ntask=$(wc -l < $PBS_NODEFILE)
comsol -nn ${ntask} batch -configuration /tmp -mpiarg -rmk -mpiarg pbs -tmpdir /scratch/$USER/ -inputfile name_input_f.mph -outputfile name_output_f.mph -batchlog name_log_f.log
```
The working directory has to be created before submitting the (comsol.pbs) job script into the queue. The input file (name_input_f.mph) has to be in the working directory, or the full path to the input file has to be specified. The appropriate path to the job's temp directory has to be set by the -tmpdir command option.
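Assuming the script above is saved as comsol.pbs and the input file is in place, submit it as usual:
```bash
$ qsub comsol.pbs
```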
LiveLink™ for MATLAB®
-------------------------
COMSOL is a software package for the numerical solution of partial differential equations. LiveLink for MATLAB allows connecting to the COMSOL® API (Application Programming Interface) with the benefits of the MATLAB programming language and computing environment.
LiveLink for MATLAB is available in both the **EDU** and **COM** **variants** of the COMSOL release. On Anselm, 1 commercial (**COM**) license and 5 educational (**EDU**) licenses of LiveLink for MATLAB are available (please see the [ISV Licenses](../isv_licenses/)).
The following example shows how to start a COMSOL model from MATLAB via LiveLink in interactive mode.
```bash
$ xhost +
$ qsub -I -X -A PROJECT_ID -q qexp -l select=1:ncpus=16
$ module load matlab
$ module load comsol
$ comsol server matlab
```
The first time you launch LiveLink for MATLAB (client MATLAB / server COMSOL connection), a login and password are requested; this information is not requested again.
To run LiveLink for MATLAB in batch mode with the (comsol_matlab.pbs) job script, you can utilize/modify the following script and execute it via the qsub command.
```bash
#!/bin/bash
#PBS -l select=3:ncpus=16
#PBS -q qprod
#PBS -N JOB_NAME
#PBS -A PROJECT_ID
cd /scratch/$USER || exit
echo Time is `date`
echo Directory is `pwd`
echo '**PBS_NODEFILE***START*******'
cat $PBS_NODEFILE
echo '**PBS_NODEFILE***END*********'
module load matlab
module load comsol/43b-EDU
ntask=$(wc -l < $PBS_NODEFILE)
comsol -nn ${ntask} server -configuration /tmp -mpiarg -rmk -mpiarg pbs -tmpdir /scratch/$USER &
cd /apps/engineering/comsol/comsol43b/mli
matlab -nodesktop -nosplash -r "mphstart; addpath /scratch/$USER; test_job"
```
This example shows how to run LiveLink for MATLAB with the following configuration: 3 nodes and 16 cores per node. The working directory has to be created before submitting the (comsol_matlab.pbs) job script into the queue. The input file (test_job.m) has to be in the working directory, or the full path to the input file has to be specified. The MATLAB command option (-r "mphstart") creates a connection with a COMSOL server using the default port number.
Allinea Forge (DDT,MAP)
=======================
Allinea Forge consists of two tools: the debugger DDT and the profiler MAP.
Allinea DDT is a commercial debugger primarily for debugging parallel MPI or OpenMP programs. It also has support for GPU (CUDA) and Intel Xeon Phi accelerators. DDT provides all the standard debugging features (stack trace, breakpoints, watches, view variables, threads etc.) for every thread running as part of your program, or for every process - even if these processes are distributed across a cluster using an MPI implementation.
Allinea MAP is a profiler for C/C++/Fortran HPC codes. It is designed for profiling parallel code, which uses pthreads, OpenMP or MPI.
License and Limitations for Anselm Users
----------------------------------------
On Anselm, users can debug OpenMP or MPI code that runs up to 64 parallel processes. In case of debugging GPU or Xeon Phi accelerated codes, the limit is 8 accelerators. These limitations mean that:
- 1 user can debug up to 64 processes, or
- 32 users can debug 2 processes, etc.
In case of debugging on accelerators:
- 1 user can debug on up to 8 accelerators, or
- 8 users can debug on a single accelerator each.
Compiling Code to run with DDT
------------------------------
### Modules
Load all necessary modules to compile the code. For example:
```bash
$ module load intel
$ module load impi ... or ... module load openmpi/X.X.X-icc
```
Load the Allinea DDT module:
```bash
$ module load Forge
```
Compile the code:
```bash
$ mpicc -g -O0 -o test_debug test.c
$ mpif90 -g -O0 -o test_debug test.f
```
### Compiler flags
Before debugging, you need to compile your code with these flags:
!!! Note "Note"
- **-g** : Generates extra debugging information usable by GDB. -g3 includes even more debugging information. This option is available for the GNU and Intel C/C++ and Fortran compilers.
- **-O0** : Suppresses all optimizations.
Starting a Job with DDT
-----------------------
Be sure to log in with X window forwarding enabled. This could mean using the -X option with ssh:
```bash
$ ssh -X username@anselm.it4i.cz
```
Another option is to access the login node using VNC. Please see the detailed information on how to [use the graphical user interface on Anselm](https://docs.it4i.cz/anselm-cluster-documentation/software/debuggers/resolveuid/11e53ad0d2fd4c5187537f4baeedff33).
From the login node, an interactive session **with X window forwarding** (the -X option) can be started by the following command:
```bash
$ qsub -I -X -A NONE-0-0 -q qexp -lselect=1:ncpus=16:mpiprocs=16,walltime=01:00:00
```
Then launch the debugger with the ddt command followed by the name of the executable to debug:
```bash
$ ddt test_debug
```
The submission window that appears has a prefilled path to the executable to debug. You can select the number of MPI processes and/or OpenMP threads on which to run and press Run. Command line arguments to the program can be entered in the "Arguments" box.
![](../../../img/ddt1.png)
To start debugging directly without the submission window, you can specify the debugging and execution parameters from the command line. For example, the number of MPI processes is set by the "-np 4" option, and skipping the dialog is done by the "-start" option. To see the list of "ddt" command line parameters, run "ddt --help".
```bash
ddt -start -np 4 ./hello_debug_impi
```
Documentation
-------------
Users can find the original User Guide after loading the DDT module:
```bash
$DDTPATH/doc/userguide.pdf
```
[1] Discipline, Magic, Inspiration and Science: Best Practice Debugging with Allinea DDT, Workshop conducted at LLNL by Allinea on May 10, 2013, [link](https://computing.llnl.gov/tutorials/allineaDDT/index.html)
Allinea Performance Reports
===========================
##quick application profiling
Introduction
------------
Allinea Performance Reports characterize the performance of HPC application runs. After executing your application through the tool, a synthetic HTML report is generated automatically, containing information about several metrics along with clear behavior statements and hints to help you improve the efficiency of your runs.
Allinea Performance Reports is most useful for profiling MPI programs.
Our license is limited to 64 MPI processes.
Modules
-------
Allinea Performance Reports version 6.0 is available
```bash
$ module load PerformanceReports/6.0
```
The module sets up the environment variables required for using Allinea Performance Reports. This particular command loads Performance Reports version 6.0.
Usage
-----
!!! Note "Note"
Use the perf-report wrapper on your (MPI) program.
Instead of [running your MPI program the usual way](../mpi/), use the perf-report wrapper:
```bash
$ perf-report mpirun ./mympiprog.x
```
The MPI program will run as usual. The perf-report wrapper creates two additional files, in *.txt and *.html format, containing the performance report. Note that [demanding MPI codes should be run within the queue system](../../resource-allocation-and-job-execution/job-submission-and-execution/).
Example
-------
In this example, we will profile the mympiprog.x MPI program using Allinea Performance Reports. Assume that the code is compiled with Intel compilers and linked against the Intel MPI library.
First, we allocate some nodes via the express queue:
```bash
$ qsub -q qexp -l select=2:ncpus=16:mpiprocs=16:ompthreads=1 -I
qsub: waiting for job 262197.dm2 to start
qsub: job 262197.dm2 ready
```
Then we load the modules and run the program the usual way:
```bash
$ module load intel impi allinea-perf-report/4.2
$ mpirun ./mympiprog.x
```
Now let's profile the code:
```bash
$ perf-report mpirun ./mympiprog.x
```
Performance report files [mympiprog_32p*.txt](mympiprog_32p_2014-10-15_16-56.txt) and [mympiprog_32p*.html](mympiprog_32p_2014-10-15_16-56.html) were created. We can see that the code is very efficient on MPI and is CPU bound.
CUBE
====
Introduction
------------
CUBE is a graphical performance report explorer for displaying data from Score-P and Scalasca (and other compatible tools). The name comes from the fact that it displays performance data in three dimensions:
- **performance metric**, where a number of metrics are available, such as communication time or cache misses,
- **call path**, which contains the call tree of your program
- **system resource**, which contains the system's nodes, processes and threads, depending on the parallel programming model.
Each dimension is organized in a tree; for example, the time performance metric is divided into Execution time and Overhead time, the call path dimension is organized by files and routines in your source code, etc.
![](../../../img/Snmekobrazovky20141204v12.56.36.png)
*Figure 1. Screenshot of CUBE displaying data from Scalasca.*
Each node in the tree is colored by severity (the color scheme is displayed at the bottom of the window, ranging from the least severe blue to the most severe being red). For example in Figure 1, we can see that most of the point-to-point MPI communication happens in routine exch_qbc, colored red.
Installed versions
------------------
Currently, there are two versions of CUBE 4.2.3 available as [modules](../../environment-and-modules/):
- cube/4.2.3-gcc, compiled with GCC
- cube/4.2.3-icc, compiled with Intel compiler
Usage
-----
CUBE is a graphical application. Refer to Graphical User Interface documentation for a list of methods to launch graphical applications on Anselm.
!!! Note "Note"
Analyzing large data sets can consume a large amount of CPU time and RAM. Do not perform large analyses on the login nodes.
After loading the appropriate module, simply launch the cube command, or alternatively use the scalasca -examine command to launch the GUI. Note that for Scalasca datasets, if you do not analyze the data with scalasca -examine before opening them with CUBE, not all performance data will be available.
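For example, to open a profile with the Intel-compiled version (the experiment directory name is illustrative; Score-P typically writes a profile.cubex file into it):
```bash
$ module load cube/4.2.3-icc
$ cube scorep_myrun/profile.cubex
```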
References
----------
1. <http://www.scalasca.org/software/cube-4.x/download.html>
Debuggers and profilers summary
===============================
Introduction
------------
We provide state-of-the-art programs and tools to develop, profile and debug HPC codes at IT4Innovations. On these pages, we provide an overview of the profiling and debugging tools available on Anselm at IT4I.
Intel debugger
--------------
The Intel debugger version 13.0 is available via the module intel. The debugger works for applications compiled with the Intel C and C++ compilers and the ifort Fortran 77/90/95 compiler. The debugger provides a Java GUI environment. Use an X display for running the GUI.
```bash
$ module load intel
$ idb
```
Read more at the [Intel Debugger](intel-suite/intel-debugger/) page.
Allinea Forge (DDT/MAP)
-----------------------
Allinea DDT is a commercial debugger primarily for debugging parallel MPI or OpenMP programs. It also has support for GPU (CUDA) and Intel Xeon Phi accelerators. DDT provides all the standard debugging features (stack trace, breakpoints, watches, view variables, threads etc.) for every thread running as part of your program, or for every process, even if these processes are distributed across a cluster using an MPI implementation.
```bash
$ module load Forge
$ forge
```
Read more at the [Allinea DDT](debuggers/allinea-ddt/) page.
Allinea Performance Reports
---------------------------
Allinea Performance Reports characterize the performance of HPC application runs. After executing your application through the tool, a synthetic HTML report is generated automatically, containing information about several metrics along with clear behavior statements and hints to help you improve the efficiency of your runs. Our license is limited to 64 MPI processes.
```bash
$ module load PerformanceReports/6.0
$ perf-report mpirun -n 64 ./my_application argument01 argument02
```
Read more at the [Allinea Performance Reports](debuggers/allinea-performance-reports/) page.
RogueWave TotalView
-------------------
TotalView is a source- and machine-level debugger for multi-process, multi-threaded programs. Its wide range of tools provides ways to analyze, organize, and test programs, making it easy to isolate and identify problems in individual threads and processes in programs of great complexity.
```bash
$ module load totalview
$ totalview
```
Read more at the [Totalview](debuggers/total-view/) page.
Vampir trace analyzer
---------------------
Vampir is a GUI trace analyzer for traces in OTF format.
```bash
$ module load Vampir/8.5.0
$ vampir
```
Read more at the [Vampir](../../salomon/software/debuggers/vampir/) page.
Intel Performance Counter Monitor
=================================
Introduction
------------
Intel PCM (Performance Counter Monitor) is a tool to monitor hardware performance counters on Intel® processors, similar to [PAPI](papi/). The difference between PCM and PAPI is that PCM supports only Intel hardware, but PCM can also monitor uncore metrics, like memory controllers and QuickPath Interconnect links.
Installed version
------------------------------
Version 2.6 is currently installed. To load the [module](../../environment-and-modules/), issue:
```bash
$ module load intelpcm
```
Command line tools
------------------
PCM provides a set of tools to monitor the system and/or applications.
### pcm-memory
Measures memory bandwidth of your application or the whole system. Usage:
```bash
$ pcm-memory.x <delay>|[external_program parameters]
```
Specify either a delay of updates in seconds or an external program to monitor. If you get an error about PMU in use, respond "y" and relaunch the program.
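For example, to refresh the measurements every 5 seconds (the delay value is arbitrary):
```bash
$ pcm-memory.x 5
```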
Sample output:
```bash
---------------------------------------||---------------------------------------
-- Socket 0 --||-- Socket 1 --
---------------------------------------||---------------------------------------
---------------------------------------||---------------------------------------
---------------------------------------||---------------------------------------
-- Memory Performance Monitoring --||-- Memory Performance Monitoring --
---------------------------------------||---------------------------------------
-- Mem Ch 0: Reads (MB/s): 2.44 --||-- Mem Ch 0: Reads (MB/s): 0.26 --
-- Writes(MB/s): 2.16 --||-- Writes(MB/s): 0.08 --
-- Mem Ch 1: Reads (MB/s): 0.35 --||-- Mem Ch 1: Reads (MB/s): 0.78 --
-- Writes(MB/s): 0.13 --||-- Writes(MB/s): 0.65 --
-- Mem Ch 2: Reads (MB/s): 0.32 --||-- Mem Ch 2: Reads (MB/s): 0.21 --
-- Writes(MB/s): 0.12 --||-- Writes(MB/s): 0.07 --
-- Mem Ch 3: Reads (MB/s): 0.36 --||-- Mem Ch 3: Reads (MB/s): 0.20 --
-- Writes(MB/s): 0.13 --||-- Writes(MB/s): 0.07 --
-- NODE0 Mem Read (MB/s): 3.47 --||-- NODE1 Mem Read (MB/s): 1.45 --
-- NODE0 Mem Write (MB/s): 2.55 --||-- NODE1 Mem Write (MB/s): 0.88 --
-- NODE0 P. Write (T/s) : 31506 --||-- NODE1 P. Write (T/s): 9099 --
-- NODE0 Memory (MB/s): 6.02 --||-- NODE1 Memory (MB/s): 2.33 --
---------------------------------------||---------------------------------------
-- System Read Throughput(MB/s): 4.93 --
-- System Write Throughput(MB/s): 3.43 --
-- System Memory Throughput(MB/s): 8.35 --
---------------------------------------||---------------------------------------
```
### pcm-msr
The pcm-msr.x command can be used to read/write model-specific registers of the CPU.
### pcm-numa
NUMA monitoring utility does not work on Anselm.
### pcm-pcie
Can be used to monitor PCI Express bandwidth. Usage: pcm-pcie.x <delay>
### pcm-power
Displays energy usage and thermal headroom for CPU and DRAM sockets. Usage: pcm-power.x <delay> | <external program>
### pcm
This command provides an overview of performance counters and memory usage. Usage: pcm.x <delay> | <external program>
Sample output:
```bash
$ pcm.x ./matrix
Intel(r) Performance Counter Monitor V2.6 (2013-11-04 13:43:31 +0100 ID=db05e43)
Copyright (c) 2009-2013 Intel Corporation
Number of physical cores: 16
Number of logical cores: 16
Threads (logical cores) per physical core: 1
Num sockets: 2
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 8
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 2400000000 Hz
Package thermal spec power: 115 Watt; Package minimum power: 51 Watt; Package maximum power: 180 Watt;
Socket 0: 1 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
Socket 1: 1 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
Number of PCM instances: 2
Max QPI link speed: 16.0 GBytes/second (8.0 GT/second)
Detected Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz "Intel(r) microarchitecture codename Sandy Bridge-EP/Jaketown"
Executing "./matrix" command:
Exit code: 0
EXEC : instructions per nominal CPU cycle
IPC : instructions per CPU cycle
FREQ : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state' (includes Intel Turbo Boost)
L3MISS: L3 cache misses
L2MISS: L2 cache misses (including other core's L2 cache *hits*)
L3HIT : L3 cache hit ratio (0.00-1.00)
L2HIT : L2 cache hit ratio (0.00-1.00)
L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency
L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)
READ : bytes read from memory controller (in GBytes)
WRITE : bytes written to memory controller (in GBytes)
TEMP : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature
Core (SKT) | EXEC | IPC | FREQ | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK | READ | WRITE | TEMP
0 0 0.00 0.64 0.01 0.80 5592 11 K 0.49 0.13 0.32 0.06 N/A N/A 67
1 0 0.00 0.18 0.00 0.69 3086 5552 0.44 0.07 0.48 0.08 N/A N/A 68
2 0 0.00 0.23 0.00 0.81 300 562 0.47 0.06 0.43 0.08 N/A N/A 67
3 0 0.00 0.21 0.00 0.99 437 862 0.49 0.06 0.44 0.09 N/A N/A 73
4 0 0.00 0.23 0.00 0.93 293 559 0.48 0.07 0.42 0.09 N/A N/A 73
5 0 0.00 0.21 0.00 1.00 423 849 0.50 0.06 0.43 0.10 N/A N/A 69
6 0 0.00 0.23 0.00 0.94 285 558 0.49 0.06 0.41 0.09 N/A N/A 71
7 0 0.00 0.18 0.00 0.81 674 1130 0.40 0.05 0.53 0.08 N/A N/A 65
8 1 0.00 0.47 0.01 1.26 6371 13 K 0.51 0.35 0.31 0.07 N/A N/A 64
9 1 2.30 1.80 1.28 1.29 179 K 15 M 0.99 0.59 0.04 0.71 N/A N/A 60
10 1 0.00 0.22 0.00 1.26 315 570 0.45 0.06 0.43 0.08 N/A N/A 67
11 1 0.00 0.23 0.00 0.74 321 579 0.45 0.05 0.45 0.07 N/A N/A 66
12 1 0.00 0.22 0.00 1.25 305 570 0.46 0.05 0.42 0.07 N/A N/A 68
13 1 0.00 0.22 0.00 1.26 336 581 0.42 0.04 0.44 0.06 N/A N/A 69
14 1 0.00 0.22 0.00 1.25 314 565 0.44 0.06 0.43 0.07 N/A N/A 69
15 1 0.00 0.29 0.00 1.19 2815 6926 0.59 0.39 0.29 0.08 N/A N/A 69
-------------------------------------------------------------------------------------------------------------------
SKT 0 0.00 0.46 0.00 0.79 11 K 21 K 0.47 0.10 0.38 0.07 0.00 0.00 65
SKT 1 0.29 1.79 0.16 1.29 190 K 15 M 0.99 0.59 0.05 0.70 0.01 0.01 61
-------------------------------------------------------------------------------------------------------------------
TOTAL * 0.14 1.78 0.08 1.28 201 K 15 M 0.99 0.59 0.05 0.70 0.01 0.01 N/A
Instructions retired: 1345 M ; Active cycles: 755 M ; Time (TSC): 582 Mticks ; C0 (active,non-halted) core residency: 6.30 %
C1 core residency: 0.14 %; C3 core residency: 0.20 %; C6 core residency: 0.00 %; C7 core residency: 93.36 %;
C2 package residency: 48.81 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;
PHYSICAL CORE IPC : 1.78 => corresponds to 44.50 % utilization for cores in active state
Instructions per nominal CPU cycle: 0.14 => corresponds to 3.60 % core utilization over time interval
Intel(r) QPI data traffic estimation in bytes (data traffic coming to CPU/socket through QPI links):
QPI0 QPI1 | QPI0 QPI1
----------------------------------------------------------------------------------------------
SKT 0 0 0 | 0% 0%
SKT 1 0 0 | 0% 0%
----------------------------------------------------------------------------------------------
Total QPI incoming data traffic: 0 QPI data traffic/Memory controller traffic: 0.00
Intel(r) QPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through QPI links):
QPI0 QPI1 | QPI0 QPI1
----------------------------------------------------------------------------------------------
SKT 0 0 0 | 0% 0%
SKT 1 0 0 | 0% 0%
----------------------------------------------------------------------------------------------
Total QPI outgoing data and non-data traffic: 0
----------------------------------------------------------------------------------------------
SKT 0 package consumed 4.06 Joules
SKT 1 package consumed 9.40 Joules
----------------------------------------------------------------------------------------------
TOTAL: 13.46 Joules
----------------------------------------------------------------------------------------------
SKT 0 DIMMs consumed 4.18 Joules
SKT 1 DIMMs consumed 4.28 Joules
----------------------------------------------------------------------------------------------
TOTAL: 8.47 Joules
Cleaning up
```
### pcm-sensor
Can be used as a sensor for ksysguard GUI, which is currently not installed on Anselm.
API
---
In a similar fashion to PAPI, PCM provides a C++ API to access the performance counter from within your application. Refer to the [doxygen documentation](http://intel-pcm-api-documentation.github.io/classPCM.html) for details of the API.
!!! Note "Note"
Due to security limitations, using PCM API to monitor your applications is currently not possible on Anselm. (The application must be run as root user)
Sample program using the API :
```cpp
#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include "cpucounters.h"
#define SIZE 1000
using namespace std;
int main(int argc, char **argv) {
float matrixa[SIZE][SIZE], matrixb[SIZE][SIZE], mresult[SIZE][SIZE];
int i,j,k;
PCM * m = PCM::getInstance();
if (m->program() != PCM::Success) return 1;
SystemCounterState before_sstate = getSystemCounterState();
/* Initialize the Matrix arrays */
for ( i=0; i<SIZE*SIZE; i++ ){
mresult[0][i] = 0.0;
matrixa[0][i] = matrixb[0][i] = rand()*(float)1.1; }
/* A naive Matrix-Matrix multiplication */
for (i=0;i<SIZE;i++)
for(j=0;j<SIZE;j++)
for(k=0;k<SIZE;k++)
mresult[i][j]=mresult[i][j] + matrixa[i][k]*matrixb[k][j];
SystemCounterState after_sstate = getSystemCounterState();
cout << "Instructions per clock:" << getIPC(before_sstate,after_sstate)
<< "L3 cache hit ratio:" << getL3CacheHitRatio(before_sstate,after_sstate)
<< "Bytes read:" << getBytesReadFromMC(before_sstate,after_sstate);
for (i=0; i<SIZE;i++)
for (j=0; j<SIZE; j++)
if (mresult[i][j] == -1) printf("x");
return 0;
}
```
Compile it with :
```bash
$ icc matrix.cpp -o matrix -lpthread -lpcm
```
Sample output:
```bash
$ ./matrix
Number of physical cores: 16
Number of logical cores: 16
Threads (logical cores) per physical core: 1
Num sockets: 2
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 8
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 2400000000 Hz
Package thermal spec power: 115 Watt; Package minimum power: 51 Watt; Package maximum power: 180 Watt;
Socket 0: 1 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
Socket 1: 1 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
Number of PCM instances: 2
Max QPI link speed: 16.0 GBytes/second (8.0 GT/second)
Instructions per clock:1.7
L3 cache hit ratio:1.0
Bytes read:12513408
```
References
----------
1. <https://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization>
2. <https://software.intel.com/sites/default/files/m/3/2/2/xeon-e5-2600-uncore-guide.pdf> Intel® Xeon® Processor E5-2600 Product Family Uncore Performance Monitoring Guide.
3. <http://intel-pcm-api-documentation.github.io/classPCM.html> API Documentation
Intel VTune Amplifier
=====================
Introduction
------------
Intel® VTune™ Amplifier, part of Intel Parallel Studio, is a GUI profiling tool designed for Intel processors. It offers a graphical performance analysis of single-core and multithreaded applications. A highlight of the features:
- Hotspot analysis
- Locks and waits analysis
- Low level specific counters, such as branch analysis and memory bandwidth
- Power usage analysis - frequency and sleep states.
![screenshot](../../../img/vtune-amplifier.png)
Usage
-----
To launch the GUI, first load the module:
```bash
$ module add VTune/2016_update1
```
and launch the GUI :
```bash
$ amplxe-gui
```
!!! Note "Note"
To profile an application with VTune Amplifier, special kernel modules need to be loaded. The modules are not loaded on Anselm login nodes, thus direct profiling on login nodes is not possible. Use VTune on compute nodes and refer to the documentation on using GUI applications.
The GUI will open in a new window. Click on "*New Project...*" to create a new project. After clicking *OK*, a new window with the project properties will appear. At "*Application:*", select the path to the binary you want to profile (the binary should be compiled with the -g flag). Some additional options, such as command line arguments, can be selected. At "*Managed code profiling mode:*", select "*Native*" (unless you want to profile managed mode .NET/Mono applications). After clicking *OK*, your project is created.
To run a new analysis, click "*New analysis...*". You will see a list of possible analyses. Some of them will not be possible on the current CPU (e.g. Intel Atom analysis is not possible on a Sandy Bridge CPU); the GUI will show an error box if you select the wrong analysis. For example, select "*Advanced Hotspots*". Clicking on *Start* will start profiling of the application.
Remote Analysis
---------------
VTune Amplifier also allows a form of remote analysis. In this mode, data for analysis is collected from the command line without the GUI, and the results are then loaded into the GUI on another machine. This allows profiling without interactive graphical jobs. To perform a remote analysis, launch the GUI somewhere, open the new analysis window, and then click the "*Command line*" button in the bottom right corner. It will show the command line needed to perform the selected analysis.
The command line will look like this:
```bash
/apps/all/VTune/2016_update1/vtune_amplifier_xe_2016.1.1.434111/bin64/amplxe-cl -collect advanced-hotspots -knob collection-detail=stack-and-callcount -mrte-mode=native -target-duration-type=veryshort -app-working-dir /home/sta545/test -- /home/sta545/test_pgsesv
```
Copy the line to the clipboard; you can then paste it into your jobscript or on the command line. After the collection has run, open the GUI again, click the menu button in the upper right corner, and select "*Open > Result...*". The GUI will load the results from the run.
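The generated command can also be wrapped in a jobscript so that the collection runs inside a batch job. A minimal sketch, assuming the paths and analysis type from the example above and that loading the VTune module puts amplxe-cl on the PATH (queue, resources and project ID are placeholders):
```bash
#!/bin/bash
#PBS -q qprod
#PBS -l select=1:ncpus=16:mpiprocs=16
#PBS -A NONE-0-0

# load the same VTune module that was used for the GUI
module add VTune/2016_update1

# run the collection; this command line was generated by the GUI "Command line" button
amplxe-cl -collect advanced-hotspots -knob collection-detail=stack-and-callcount \
  -mrte-mode=native -target-duration-type=veryshort \
  -app-working-dir /home/sta545/test -- /home/sta545/test_pgsesv
```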
Xeon Phi
--------
!!! Note "Note"
This section is outdated. It will be updated with new information soon.
It is possible to analyze both native and offload Xeon Phi applications. For offload mode, just specify the path to the binary. For native mode, you need to specify in project properties:
- Application: ssh
- Application parameters: mic0 source ~/.profile && /path/to/your/bin
Note that we include source ~/.profile in the command to set up the environment paths [as described here](../intel-xeon-phi/).
!!! Note "Note"
If the analysis is interrupted or aborted, further analysis on the card might be impossible and you will get errors like "ERROR connecting to MIC card". In this case please contact our support to reboot the MIC card.
You may also use remote analysis to collect data from the MIC and then analyze it in the GUI later:
```bash
$ amplxe-cl -collect knc-hotspots -no-auto-finalize -- ssh mic0
"export LD_LIBRARY_PATH=/apps/intel/composer_xe_2015.2.164/compiler/lib/mic/:/apps/intel/composer_xe_2015.2.164/mkl/lib/mic/; export KMP_AFFINITY=compact; /tmp/app.mic"
```
References
----------
1. <https://www.rcac.purdue.edu/tutorials/phi/PerformanceTuningXeonPhi-Tullos.pdf> Performance Tuning for Intel® Xeon Phi™ Coprocessors
PAPI
====
Introduction
------------
Performance Application Programming Interface (PAPI) is a portable interface to access hardware performance counters (such as instruction counts and cache misses) found in most modern architectures. With the new component framework, PAPI is not limited only to CPU counters, but offers also components for CUDA, network, Infiniband etc.
PAPI provides two levels of interface - a simpler, high level interface and more detailed low level interface.
PAPI can be used with parallel as well as serial programs.
Usage
-----
To use PAPI, load [module](../../environment-and-modules/) papi:
```bash
$ module load papi
```
This will load the default version. Execute module avail papi for a list of installed versions.
Utilities
---------
The bin directory of PAPI (which is automatically added to $PATH upon loading the module) contains various utilities.
### papi_avail
Prints which preset events are available on the current CPU. The third column indicates whether the preset event is available on the current CPU.
```bash
$ papi_avail
Available events and hardware information.
--------------------------------------------------------------------------------
PAPI Version : 5.3.2.0
Vendor string and code : GenuineIntel (1)
Model string and code : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (45)
CPU Revision : 7.000000
CPUID Info : Family: 6 Model: 45 Stepping: 7
CPU Max Megahertz : 2601
CPU Min Megahertz : 1200
Hdw Threads per core : 1
Cores per Socket : 8
Sockets : 2
NUMA Nodes : 2
CPUs per Node : 8
Total CPUs : 16
Running in a VM : no
Number Hardware Counters : 11
Max Multiplex Counters : 32
--------------------------------------------------------------------------------
Name Code Avail Deriv Description (Note)
PAPI_L1_DCM 0x80000000 Yes No Level 1 data cache misses
PAPI_L1_ICM 0x80000001 Yes No Level 1 instruction cache misses
PAPI_L2_DCM 0x80000002 Yes Yes Level 2 data cache misses
PAPI_L2_ICM 0x80000003 Yes No Level 2 instruction cache misses
PAPI_L3_DCM 0x80000004 No No Level 3 data cache misses
PAPI_L3_ICM 0x80000005 No No Level 3 instruction cache misses
PAPI_L1_TCM 0x80000006 Yes Yes Level 1 cache misses
PAPI_L2_TCM 0x80000007 Yes No Level 2 cache misses
PAPI_L3_TCM 0x80000008 Yes No Level 3 cache misses
....
```
### papi_native_avail
Prints which native events are available on the current CPU.
### papi_cost
Measures the cost (in cycles) of basic PAPI operations.
### papi_mem_info
Prints information about the memory architecture of the current CPU.
PAPI API
--------
PAPI provides two kinds of events:
- **Preset events** are a set of predefined common CPU events, standardized across platforms.
- **Native events** are all events supported by the current hardware; this is a larger set than the presets. For components other than the CPU, usually only native events are available.
To use PAPI in your application, you need to include the appropriate header file.
- papi.h for C
- f77papi.h for Fortran 77
- f90papi.h for Fortran 90
- fpapi.h for Fortran with preprocessor
The include path is automatically added by papi module to $INCLUDE.
### High level API
Please refer to <http://icl.cs.utk.edu/projects/papi/wiki/PAPIC:High_Level> for a description of the High level API.
### Low level API
Please refer to <http://icl.cs.utk.edu/projects/papi/wiki/PAPIC:Low_Level> for a description of the Low level API.
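As an illustration only (not a substitute for the documentation linked above), a minimal low-level API sketch that counts two preset events around a code region might look like this; compile it with -lpapi as in the example below:
```cpp
#include <stdio.h>
#include <stdlib.h>
#include "papi.h"

int main(void) {
    int event_set = PAPI_NULL;
    long long values[2];

    /* Initialize the PAPI library */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) exit(1);

    /* Create an event set with two preset events */
    if (PAPI_create_eventset(&event_set) != PAPI_OK) exit(1);
    if (PAPI_add_event(event_set, PAPI_TOT_INS) != PAPI_OK) exit(1);
    if (PAPI_add_event(event_set, PAPI_L1_DCM) != PAPI_OK) exit(1);

    /* Start counting */
    if (PAPI_start(event_set) != PAPI_OK) exit(1);

    /* ... the code region to be measured goes here ... */

    /* Stop counting and read the counter values */
    if (PAPI_stop(event_set, values) != PAPI_OK) exit(1);

    printf("Total instructions: %lld\n", values[0]);
    printf("L1 data cache misses: %lld\n", values[1]);

    PAPI_shutdown();
    return 0;
}
```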
### Timers
PAPI provides the most accurate timers the platform can support. See <http://icl.cs.utk.edu/projects/papi/wiki/PAPIC:Timers>
### System information
PAPI can be used to query some system information, such as the CPU name and MHz. See <http://icl.cs.utk.edu/projects/papi/wiki/PAPIC:System_Information>
Example
-------
The following example prints MFLOPS rate of a naive matrix-matrix multiplication:
```cpp
#include <stdlib.h>
#include <stdio.h>
#include "papi.h"
#define SIZE 1000
int main(int argc, char **argv) {
float matrixa[SIZE][SIZE], matrixb[SIZE][SIZE], mresult[SIZE][SIZE];
float real_time, proc_time, mflops;
long long flpins;
int retval;
int i,j,k;
/* Initialize the Matrix arrays */
for ( i=0; i<SIZE*SIZE; i++ ){
mresult[0][i] = 0.0;
matrixa[0][i] = matrixb[0][i] = rand()*(float)1.1;
}
/* Setup PAPI library and begin collecting data from the counters */
if((retval=PAPI_flops( &real_time, &proc_time, &flpins, &mflops))<PAPI_OK)
printf("Error!");
/* A naive Matrix-Matrix multiplication */
for (i=0;i<SIZE;i++)
for(j=0;j<SIZE;j++)
for(k=0;k<SIZE;k++)
mresult[i][j]=mresult[i][j] + matrixa[i][k]*matrixb[k][j];
/* Collect the data into the variables passed in */
if((retval=PAPI_flops( &real_time, &proc_time, &flpins, &mflops))<PAPI_OK)
printf("Error!");
printf("Real_time:t%fnProc_time:t%fnTotal flpins:t%lldnMFLOPS:tt%fn", real_time, proc_time, flpins, mflops);
PAPI_shutdown();
return 0;
}
```
Now compile and run the example:
```bash
$ gcc matrix.c -o matrix -lpapi
$ ./matrix
Real_time: 8.852785
Proc_time: 8.850000
Total flpins: 6012390908
MFLOPS: 679.366211
```
Let's try with optimizations enabled:
```bash
$ gcc -O3 matrix.c -o matrix -lpapi
$ ./matrix
Real_time: 0.000020
Proc_time: 0.000000
Total flpins: 6
MFLOPS: inf
```
Now we see a seemingly strange result - the multiplication took no time and only 6 floating point instructions were issued. This is because the compiler optimizations have completely removed the multiplication loop, as the result is actually not used anywhere in the program. We can fix this by adding some "dummy" code at the end of the Matrix-Matrix multiplication routine:
```cpp
for (i=0; i<SIZE;i++)
for (j=0; j<SIZE; j++)
if (mresult[i][j] == -1.0) printf("x");
```
Now the compiler won't remove the multiplication loop. (However, it is still not smart enough to see that the result can never be negative.) Now run the code again:
```bash
$ gcc -O3 matrix.c -o matrix -lpapi
$ ./matrix
Real_time: 8.795956
Proc_time: 8.790000
Total flpins: 18700983160
MFLOPS: 2127.529297
```
### Intel Xeon Phi
!!! Note "Note"
PAPI currently supports only a subset of counters on the Intel Xeon Phi processor compared to Intel Xeon, for example the floating point operations counter is missing.
To use PAPI in [Intel Xeon Phi](../intel-xeon-phi/) native applications, you need to load a module with the "-mic" suffix, for example "papi/5.3.2-mic":
```bash
$ module load papi/5.3.2-mic
```
Then, compile your application in the following way:
```bash
$ module load intel
$ icc -mmic -Wl,-rpath,/apps/intel/composer_xe_2013.5.192/compiler/lib/mic matrix-mic.c -o matrix-mic -lpapi -lpfm
```
To execute the application on MIC, you need to manually set LD_LIBRARY_PATH:
```bash
$ qsub -q qmic -A NONE-0-0 -I
$ ssh mic0
$ export LD_LIBRARY_PATH=/apps/tools/papi/5.4.0-mic/lib/
$ ./matrix-mic
```
Alternatively, you can link PAPI statically (with the -static flag); then LD_LIBRARY_PATH does not need to be set.
You can also execute the PAPI tools on the MIC:
```bash
$ /apps/tools/papi/5.4.0-mic/bin/papi_native_avail
```
To use PAPI in offload mode, you need to provide both host and MIC versions of PAPI:
```bash
$ module load papi/5.4.0
$ icc matrix-offload.c -o matrix-offload -offload-option,mic,compiler,"-L$PAPI_HOME-mic/lib -lpapi" -lpapi
```
References
----------
1. <http://icl.cs.utk.edu/papi/> Main project page
2. <http://icl.cs.utk.edu/projects/papi/wiki/Main_Page> Wiki
3. <http://icl.cs.utk.edu/papi/docs/> API Documentation
Scalasca
========
Introduction
------------
[Scalasca](http://www.scalasca.org/) is a software tool that supports the performance optimization of parallel programs by measuring and analyzing their runtime behavior. The analysis identifies potential performance bottlenecks – in particular those concerning communication and synchronization – and offers guidance in exploring their causes.
Scalasca supports profiling of MPI, OpenMP and hybrid MPI+OpenMP applications.
Installed versions
------------------
There are currently two versions of Scalasca 2.0 [modules](../../environment-and-modules/) installed on Anselm:
- scalasca2/2.0-gcc-openmpi, for usage with [GNU Compiler](../compilers/) and [OpenMPI](../mpi/Running_OpenMPI/),
- scalasca2/2.0-icc-impi, for usage with [Intel Compiler](../compilers.html) and [Intel MPI](../mpi/running-mpich2/).
Usage
-----
Profiling a parallel application with Scalasca consists of three steps:
1. Instrumentation: compiling the application in such a way that profiling data can be generated.
2. Runtime measurement: running the application with the Scalasca profiler to collect performance data.
3. Analysis of reports.
### Instrumentation
Instrumentation via "scalasca -instrument" is discouraged. Use [Score-P instrumentation](score-p/) instead.
### Runtime measurement
After the application is instrumented, runtime measurement can be performed with the "scalasca -analyze" command. The syntax is:
scalasca -analyze [scalasca options] [launcher] [launcher options] [program] [program options]
An example:
```bash
$ scalasca -analyze mpirun -np 4 ./mympiprogram
```
Some notable Scalasca options are:
- **-t** Enable trace data collection. By default, only summary data are collected.
- **-e <directory>** Specify a directory to save the collected data to. By default, Scalasca saves the data to a directory with the prefix scorep_, followed by the name of the executable and the launch configuration.
!!! Note "Note"
Scalasca can generate a huge amount of data, especially if tracing is enabled. Please consider saving the data to a [scratch directory](../../storage/storage/).
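For instance, a trace collection saved to the scratch filesystem could be launched like this (the application name and process count are illustrative):
```bash
$ scalasca -analyze -t -e /scratch/$USER/scorep_mympiprogram_trace mpirun -np 16 ./mympiprogram
```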
### Analysis of reports
For the analysis, you must have the [Score-P](score-p/) and [CUBE](cube/) modules loaded. The analysis is done in two steps: first, the data is preprocessed, and then the CUBE GUI tool is launched.
To launch the analysis, run:
```bash
scalasca -examine [options] <experiment_directory>
```
If you do not wish to launch the GUI tool, use the "-s" option:
```bash
scalasca -examine -s <experiment_directory>
```
Alternatively, you can open CUBE and load the data directly from the experiment directory. Keep in mind that in that case the preprocessing is not done and not all metrics will be shown in the viewer.
Refer to [CUBE documentation](cube/) on usage of the GUI viewer.
References
----------
1. <http://www.scalasca.org/>
Score-P
=======
Introduction
------------
The [Score-P measurement infrastructure](http://www.vi-hps.org/projects/score-p/) is a highly scalable and easy-to-use tool suite for profiling, event tracing, and online analysis of HPC applications.
Score-P can be used as an instrumentation tool for [Scalasca](scalasca/).
Installed versions
------------------
There are currently two versions of Score-P 1.2.3 [modules](../../environment-and-modules/) installed on Anselm:
- scorep/1.2.3-gcc-openmpi, for usage with [GNU Compiler](../compilers/) and [OpenMPI](../mpi/Running_OpenMPI/),
- scorep/1.2.3-icc-impi, for usage with [Intel Compiler](../compilers.html) and [Intel MPI](../mpi/running-mpich2/).
Instrumentation
---------------
There are three ways to instrument your parallel applications in order to enable performance data collection:
1. Automated instrumentation using compiler
2. Manual instrumentation using API calls
3. Manual instrumentation using directives
### Automated instrumentation
is the easiest method. Score-P will automatically add instrumentation to every routine entry and exit using compiler hooks, and will intercept MPI calls and OpenMP regions. This method might, however, produce a large amount of data. If you want to focus the profiler on specific regions of your code, consider using the manual instrumentation methods. To use automated instrumentation, simply prepend scorep to your compilation command. For example, replace:
```bash
$ mpif90 -c foo.f90
$ mpif90 -c bar.f90
$ mpif90 -o myapp foo.o bar.o
```
with:
```bash
$ scorep mpif90 -c foo.f90
$ scorep mpif90 -c bar.f90
$ scorep mpif90 -o myapp foo.o bar.o
```
Usually your program is compiled using a Makefile or similar script, so it is advisable to add the scorep command to your definitions of variables such as CC, CXX, FC, etc.
It is important that scorep is also prepended to the linking command, in order to link with the Score-P instrumentation libraries.
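If your application is built with make, one possible approach (shown here only as a sketch; the variable names depend on your Makefile) is to override the compiler variables on the command line instead of editing the Makefile:
```bash
$ make CC="scorep mpicc" CXX="scorep mpicxx" FC="scorep mpif90"
```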
### Manual instrumentation using API calls
To use this kind of instrumentation, use scorep with the --user switch. You then mark the regions to be instrumented by inserting API calls.
An example in C/C++:
```cpp
#include <scorep/SCOREP_User.h>
void foo()
{
SCOREP_USER_REGION_DEFINE( my_region_handle )
// more declarations
SCOREP_USER_REGION_BEGIN( my_region_handle, "foo", SCOREP_USER_REGION_TYPE_COMMON )
// do something
SCOREP_USER_REGION_END( my_region_handle )
}
```
and in Fortran:
```cpp
#include "scorep/SCOREP_User.inc"
subroutine foo
SCOREP_USER_REGION_DEFINE( my_region_handle )
! more declarations
SCOREP_USER_REGION_BEGIN( my_region_handle, "foo", SCOREP_USER_REGION_TYPE_COMMON )
! do something
SCOREP_USER_REGION_END( my_region_handle )
end subroutine foo
```
Please refer to the [documentation for description of the API](https://silc.zih.tu-dresden.de/scorep-current/pdf/scorep.pdf).
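As with automated instrumentation, the sources containing these API calls are built through the scorep wrapper, here with the --user switch (file names are illustrative):
```bash
$ scorep --user mpicc -c foo.c
$ scorep --user mpicc -o myapp foo.o
```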
### Manual instrumentation using directives
This method uses POMP2 directives to mark the regions to be instrumented. To use this method, use the command scorep --pomp.
Example directives in C/C++:
```cpp
void foo(...)
{
/* declarations */
#pragma pomp inst begin(foo)
...
if (<condition>)
{
#pragma pomp inst altend(foo)
return;
}
...
#pragma pomp inst end(foo)
}
```
and in Fortran:
```cpp
subroutine foo(...)
!declarations
!POMP$ INST BEGIN(foo)
...
if (<condition>) then
!POMP$ INST ALTEND(foo)
return
end if
...
!POMP$ INST END(foo)
end subroutine foo
```
The directives are ignored if the program is compiled without Score-P. Again, please refer to the [documentation](https://silc.zih.tu-dresden.de/scorep-current/pdf/scorep.pdf) for a more elaborate description.
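The POMP2 regions above are only measured when the application is built through Score-P with the --pomp switch, for example (file names are illustrative):
```bash
$ scorep --pomp mpif90 -c foo.f90
$ scorep --pomp mpif90 -o myapp foo.o
```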
Total View
==========
TotalView is a GUI-based source code multi-process, multi-thread debugger.
License and Limitations for Anselm Users
----------------------------------------
On Anselm, users can debug OpenMP or MPI code that runs up to 64 parallel processes. This limitation means that:
```bash
1 user can debug up to 64 processes, or
32 users can debug 2 processes, etc.
```
Debugging of GPU accelerated codes is also supported.
You can check the status of the licenses here:
```bash
cat /apps/user/licenses/totalview_features_state.txt
# totalview
# -------------------------------------------------
# FEATURE TOTAL USED AVAIL
# -------------------------------------------------
TotalView_Team 64 0 64
Replay 64 0 64
CUDA 64 0 64
```
Compiling Code to run with TotalView
------------------------------------
### Modules
Load all necessary modules to compile the code. For example:
```bash
module load intel
module load impi ... or ... module load openmpi/X.X.X-icc
```
Load the TotalView module:
```bash
module load totalview/8.12
```
Compile the code:
```bash
mpicc -g -O0 -o test_debug test.c
mpif90 -g -O0 -o test_debug test.f
```
### Compiler flags
Before debugging, you need to compile your code with these flags:
!!! Note "Note"
**-g** : Generates extra debugging information usable by GDB. **-g3** includes even more debugging information. This option is available for GNU and INTEL C/C++ and Fortran compilers.
**-O0** : Suppress all optimizations.
Starting a Job with TotalView
-----------------------------
Be sure to log in with X window forwarding enabled. This could mean using the -X option with ssh:
```bash
ssh -X username@anselm.it4i.cz
```
Another option is to access the login node using VNC. Please see the detailed information on how to use the graphical user interface on Anselm.
From the login node, an interactive session with X windows forwarding (the -X option) can be started with the following command:
```bash
qsub -I -X -A NONE-0-0 -q qexp -lselect=1:ncpus=16:mpiprocs=16,walltime=01:00:00
```
Then launch the debugger with the totalview command followed by the name of the executable to debug.
### Debugging a serial code
To debug a serial code use:
```bash
totalview test_debug
```
### Debugging a parallel code - option 1
To debug a parallel code compiled with **OpenMPI** you need to setup your TotalView environment:
!!! Note "Note"
**Please note:** To be able to run the parallel debugging procedure from the command line, without stopping the debugger in the mpiexec source code, you have to add the following function to your **~/.tvdrc** file:
```bash
proc mpi_auto_run_starter {loaded_id} {
set starter_programs {mpirun mpiexec orterun}
set executable_name [TV::symbol get $loaded_id full_pathname]
set file_component [file tail $executable_name]
if {[lsearch -exact $starter_programs $file_component] != -1} {
puts "*************************************"
puts "Automatically starting $file_component"
puts "*************************************"
dgo
}
}
# Append this function to TotalView's image load callbacks so that
# TotalView runs this program automatically.
dlappend TV::image_load_callbacks mpi_auto_run_starter
```
The source code of this function can also be found in
```bash
/apps/mpi/openmpi/intel/1.6.5/etc/openmpi-totalview.tcl
```
!!! Note "Note"
You can also add only the following line to your ~/.tvdrc file instead of the entire function:
**source /apps/mpi/openmpi/intel/1.6.5/etc/openmpi-totalview.tcl**
You need to do this step only once.
Now you can run the parallel debugger using:
```bash
mpirun -tv -n 5 ./test_debug
```
When the following dialog appears, click "Yes":
![](../../../img/totalview1.png)
At this point the main TotalView GUI window will appear and you can insert the breakpoints and start debugging:
![](../../../img/totalview2.png)
### Debugging a parallel code - option 2
Another option to start a new parallel debugging session from the command line is to let TotalView execute mpirun by itself. In this case, the user has to specify the MPI implementation used to compile the source code.
The following example shows how to start a debugging session with Intel MPI:
```bash
module load intel/13.5.192 impi/4.1.1.036 totalview/8.13
totalview -mpi "Intel MPI-Hydra" -np 8 ./hello_debug_impi
```
After running the previous command, you will see the same window as shown in the screenshot above.
More information regarding the command line parameters of TotalView can be found in the TotalView Reference Guide, Chapter 7: TotalView Command Syntax.
Documentation
-------------
[1] The [TotalView documentation](http://www.roguewave.com/support/product-documentation/totalview-family.aspx#totalview) web page is a good resource for learning more about some of the advanced TotalView features.