Valgrind
========
Valgrind is a tool for memory debugging and profiling.
About Valgrind
--------------
Valgrind is an open-source tool used mainly for debugging memory-related problems, such as memory leaks and use of uninitialized memory, in C/C++ applications. The toolchain has however been extended over time with more functionality, such as debugging of threaded applications and cache profiling, and is not limited to C/C++ only.
Valgrind is an extremely useful tool for debugging memory errors such as [off-by-one](http://en.wikipedia.org/wiki/Off-by-one_error) accesses. Valgrind uses a virtual machine and dynamic recompilation of the binary code; because of that, you can expect programs being debugged by Valgrind to run 5-100 times slower.
The main tools available in Valgrind are:
- **Memcheck**, the original, most used and default tool. Verifies memory access in your program and can detect use of uninitialized memory, out-of-bounds memory access, memory leaks, double free, etc.
- **Massif**, a heap profiler.
- **Helgrind** and **DRD**, which can detect race conditions in multi-threaded applications.
- **Cachegrind**, a cache profiler.
- **Callgrind**, a call graph analyzer.
- For a full list and detailed documentation, please refer to the [official Valgrind documentation](http://valgrind.org/docs/). A specific tool is selected with the --tool option, as illustrated below.
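For illustration, a few invocations using the --tool option (myprog is a placeholder for your binary; the tool names are the standard Valgrind ones):
```bash
$ valgrind --tool=memcheck ./myprog    # memory error detection (default)
$ valgrind --tool=massif ./myprog      # heap profiling
$ valgrind --tool=callgrind ./myprog   # call graph profiling
```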
Installed versions
------------------
There are two versions of Valgrind available on Anselm.
- Version 3.6.0, installed by the operating system vendor in /usr/bin/valgrind. This version is available by default, without the need to load any module. However, this version does not provide additional MPI support.
- Version 3.9.0 with support for Intel MPI, available in the [module](../../environment-and-modules/) valgrind/3.9.0-impi. After loading the module, this version replaces the default Valgrind.
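For example, to switch to the MPI-enabled version (the module name is taken from the list above), load the module and verify the active version:
```bash
$ module load valgrind/3.9.0-impi
$ valgrind --version
valgrind-3.9.0
```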
Usage
-----
Compile the application which you want to debug as usual. It is advisable to add the compilation flags -g (to add debugging information to the binary, so that you will see the original source code lines in the output) and -O0 (to disable compiler optimizations).
For example, let's look at this C code, which has two problems:
```cpp
#include <stdlib.h>

void f(void)
{
    int* x = malloc(10 * sizeof(int));
    x[10] = 0;  // problem 1: heap block overrun
}               // problem 2: memory leak -- x not freed

int main(void)
{
    f();
    return 0;
}
```
Now, compile it with the Intel compiler:
```bash
$ module add intel
$ icc -g valgrind-example.c -o valgrind-example
```
Now, let's run it with Valgrind. The syntax is:
*valgrind [valgrind options] <your program binary> [your program options]*
If no Valgrind options are specified, Valgrind defaults to running the Memcheck tool. Please refer to the Valgrind documentation for a full description of the command line options.
```bash
$ valgrind ./valgrind-example
==12652== Memcheck, a memory error detector
==12652== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==12652== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==12652== Command: ./valgrind-example
==12652==
==12652== Invalid write of size 4
==12652== at 0x40053E: f (valgrind-example.c:6)
==12652== by 0x40054E: main (valgrind-example.c:11)
==12652== Address 0x5861068 is 0 bytes after a block of size 40 alloc'd
==12652== at 0x4C27AAA: malloc (vg_replace_malloc.c:291)
==12652== by 0x400528: f (valgrind-example.c:5)
==12652== by 0x40054E: main (valgrind-example.c:11)
==12652==
==12652==
==12652== HEAP SUMMARY:
==12652== in use at exit: 40 bytes in 1 blocks
==12652== total heap usage: 1 allocs, 0 frees, 40 bytes allocated
==12652==
==12652== LEAK SUMMARY:
==12652== definitely lost: 40 bytes in 1 blocks
==12652== indirectly lost: 0 bytes in 0 blocks
==12652== possibly lost: 0 bytes in 0 blocks
==12652== still reachable: 0 bytes in 0 blocks
==12652== suppressed: 0 bytes in 0 blocks
==12652== Rerun with --leak-check=full to see details of leaked memory
==12652==
==12652== For counts of detected and suppressed errors, rerun with: -v
==12652== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 6 from 6)
```
In the output we can see that Valgrind has detected both errors: the off-by-one memory access at line 6 and a memory leak of 40 bytes. If we want a detailed analysis of the memory leak, we need to run Valgrind with the --leak-check=full option:
```bash
$ valgrind --leak-check=full ./valgrind-example
==23856== Memcheck, a memory error detector
==23856== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==23856== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==23856== Command: ./valgrind-example
==23856==
==23856== Invalid write of size 4
==23856== at 0x40067E: f (valgrind-example.c:6)
==23856== by 0x40068E: main (valgrind-example.c:11)
==23856== Address 0x66e7068 is 0 bytes after a block of size 40 alloc'd
==23856== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
==23856== by 0x400668: f (valgrind-example.c:5)
==23856== by 0x40068E: main (valgrind-example.c:11)
==23856==
==23856==
==23856== HEAP SUMMARY:
==23856== in use at exit: 40 bytes in 1 blocks
==23856== total heap usage: 1 allocs, 0 frees, 40 bytes allocated
==23856==
==23856== 40 bytes in 1 blocks are definitely lost in loss record 1 of 1
==23856== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
==23856== by 0x400668: f (valgrind-example.c:5)
==23856== by 0x40068E: main (valgrind-example.c:11)
==23856==
==23856== LEAK SUMMARY:
==23856== definitely lost: 40 bytes in 1 blocks
==23856== indirectly lost: 0 bytes in 0 blocks
==23856== possibly lost: 0 bytes in 0 blocks
==23856== still reachable: 0 bytes in 0 blocks
==23856== suppressed: 0 bytes in 0 blocks
==23856==
==23856== For counts of detected and suppressed errors, rerun with: -v
==23856== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 6 from 6)
```
Now we can see that the memory leak is due to the malloc() at line 5.
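Besides --leak-check=full, a few other Memcheck options are frequently useful; a sketch using the example binary from above (the options are standard Memcheck flags): --track-origins=yes reports where uninitialised values originate, and --show-reachable=yes also lists memory that is still reachable at exit.
```bash
$ valgrind --leak-check=full --show-reachable=yes --track-origins=yes ./valgrind-example
```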
Usage with MPI
---------------------------
Although Valgrind is not primarily a parallel debugger, it can be used to debug parallel applications as well. When launching your parallel application, prepend the valgrind command. For example:
```bash
$ mpirun -np 4 valgrind myapplication
```
The default version without MPI support will however report a large number of false errors in the MPI library, such as:
```bash
==30166== Conditional jump or move depends on uninitialised value(s)
==30166== at 0x4C287E8: strlen (mc_replace_strmem.c:282)
==30166== by 0x55443BD: I_MPI_Processor_model_number (init_interface.c:427)
==30166== by 0x55439E0: I_MPI_Processor_arch_code (init_interface.c:171)
==30166== by 0x558D5AE: MPID_nem_impi_init_shm_configuration (mpid_nem_impi_extensions.c:1091)
==30166== by 0x5598F4C: MPID_nem_init_ckpt (mpid_nem_init.c:566)
==30166== by 0x5598B65: MPID_nem_init (mpid_nem_init.c:489)
==30166== by 0x539BD75: MPIDI_CH3_Init (ch3_init.c:64)
==30166== by 0x5578743: MPID_Init (mpid_init.c:193)
==30166== by 0x554650A: MPIR_Init_thread (initthread.c:539)
==30166== by 0x553369F: PMPI_Init (init.c:195)
==30166== by 0x4008BD: main (valgrind-example-mpi.c:18)
```
It is therefore better to use the MPI-enabled Valgrind from the module. The MPI version requires the library /apps/tools/valgrind/3.9.0/impi/lib/valgrind/libmpiwrap-amd64-linux.so, which must be included in the LD_PRELOAD environment variable.
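Alternatively to passing the variable per rank with -env (the documented invocation further below), you may export the wrapper library in the launching shell; a sketch using the path above, noting that whether the export is propagated to the ranks depends on the MPI launcher:
```bash
$ export LD_PRELOAD=/apps/tools/valgrind/3.9.0/impi/lib/valgrind/libmpiwrap-amd64-linux.so
$ mpirun -np 4 valgrind ./myapplication
```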
Let's look at this MPI example:
```cpp
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int *data = malloc(sizeof(int) * 99);

    MPI_Init(&argc, &argv);
    MPI_Bcast(data, 100, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize();

    return 0;
}
```
There are two errors: use of uninitialized memory and an invalid length of the buffer. Let's debug it with Valgrind:
```bash
$ module add intel impi
$ mpicc -g valgrind-example-mpi.c -o valgrind-example-mpi
$ module add valgrind/3.9.0-impi
$ mpirun -np 2 -env LD_PRELOAD /apps/tools/valgrind/3.9.0/impi/lib/valgrind/libmpiwrap-amd64-linux.so valgrind ./valgrind-example-mpi
```
This prints the following output (note that output is printed for every launched MPI process):
```bash
==31318== Memcheck, a memory error detector
==31318== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==31318== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==31318== Command: ./valgrind-example-mpi
==31318==
==31319== Memcheck, a memory error detector
==31319== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==31319== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==31319== Command: ./valgrind-example-mpi
==31319==
valgrind MPI wrappers 31319: Active for pid 31319
valgrind MPI wrappers 31319: Try MPIWRAP_DEBUG=help for possible options
valgrind MPI wrappers 31318: Active for pid 31318
valgrind MPI wrappers 31318: Try MPIWRAP_DEBUG=help for possible options
==31319== Unaddressable byte(s) found during client check request
==31319== at 0x4E35974: check_mem_is_addressable_untyped (libmpiwrap.c:960)
==31319== by 0x4E5D0FE: PMPI_Bcast (libmpiwrap.c:908)
==31319== by 0x400911: main (valgrind-example-mpi.c:20)
==31319== Address 0x69291cc is 0 bytes after a block of size 396 alloc'd
==31319== at 0x4C27AAA: malloc (vg_replace_malloc.c:291)
==31319== by 0x4007BC: main (valgrind-example-mpi.c:8)
==31319==
==31318== Uninitialised byte(s) found during client check request
==31318== at 0x4E3591D: check_mem_is_defined_untyped (libmpiwrap.c:952)
==31318== by 0x4E5D06D: PMPI_Bcast (libmpiwrap.c:908)
==31318== by 0x400911: main (valgrind-example-mpi.c:20)
==31318== Address 0x6929040 is 0 bytes inside a block of size 396 alloc'd
==31318== at 0x4C27AAA: malloc (vg_replace_malloc.c:291)
==31318== by 0x4007BC: main (valgrind-example-mpi.c:8)
==31318==
==31318== Unaddressable byte(s) found during client check request
==31318== at 0x4E3591D: check_mem_is_defined_untyped (libmpiwrap.c:952)
==31318== by 0x4E5D06D: PMPI_Bcast (libmpiwrap.c:908)
==31318== by 0x400911: main (valgrind-example-mpi.c:20)
==31318== Address 0x69291cc is 0 bytes after a block of size 396 alloc'd
==31318== at 0x4C27AAA: malloc (vg_replace_malloc.c:291)
==31318== by 0x4007BC: main (valgrind-example-mpi.c:8)
==31318==
==31318==
==31318== HEAP SUMMARY:
==31318== in use at exit: 3,172 bytes in 67 blocks
==31318== total heap usage: 191 allocs, 124 frees, 81,203 bytes allocated
==31318==
==31319==
==31319== HEAP SUMMARY:
==31319== in use at exit: 3,172 bytes in 67 blocks
==31319== total heap usage: 175 allocs, 108 frees, 48,435 bytes allocated
==31319==
==31318== LEAK SUMMARY:
==31318== definitely lost: 408 bytes in 3 blocks
==31318== indirectly lost: 256 bytes in 1 blocks
==31318== possibly lost: 0 bytes in 0 blocks
==31318== still reachable: 2,508 bytes in 63 blocks
==31318== suppressed: 0 bytes in 0 blocks
==31318== Rerun with --leak-check=full to see details of leaked memory
==31318==
==31318== For counts of detected and suppressed errors, rerun with: -v
==31318== Use --track-origins=yes to see where uninitialised values come from
==31318== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 4 from 4)
==31319== LEAK SUMMARY:
==31319== definitely lost: 408 bytes in 3 blocks
==31319== indirectly lost: 256 bytes in 1 blocks
==31319== possibly lost: 0 bytes in 0 blocks
==31319== still reachable: 2,508 bytes in 63 blocks
==31319== suppressed: 0 bytes in 0 blocks
==31319== Rerun with --leak-check=full to see details of leaked memory
==31319==
==31319== For counts of detected and suppressed errors, rerun with: -v
==31319== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
```
We can see that Valgrind has reported use of uninitialised memory on the master process (which reads the array to be broadcast) and use of unaddressable memory on both processes.
Vampir
======
Vampir is a commercial trace analysis and visualisation tool. It can work with traces in the OTF and OTF2 formats. It does not have the functionality to collect traces; you need to use a trace collection tool (such as [Score-P](../../../salomon/software/debuggers/score-p/)) first to collect the traces.
![](../../../img/Snmekobrazovky20160708v12.33.35.png)
Installed versions
------------------
Version 8.5.0 is currently installed as the module Vampir/8.5.0:
```bash
$ module load Vampir/8.5.0
$ vampir &
```
User manual
-----------
You can find the detailed user manual in PDF format in $EBROOTVAMPIR/doc/vampir-manual.pdf.
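For example, after loading the module you can open the manual with any PDF viewer available to you (evince is used here purely as an illustration):
```bash
$ module load Vampir/8.5.0
$ evince $EBROOTVAMPIR/doc/vampir-manual.pdf
```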
References
----------
[1]. <https://www.vampir.eu>
GPI-2
=====
## A library that implements the GASPI specification
Introduction
------------
Programming Next Generation Supercomputers: GPI-2 is an API library for asynchronous interprocess, cross-node communication. It provides a flexible, scalable and fault tolerant interface for parallel applications.
The GPI-2 library ([www.gpi-site.com/gpi2/](http://www.gpi-site.com/gpi2/)) implements the GASPI specification (Global Address Space Programming Interface, [www.gaspi.de](http://www.gaspi.de/en/project.html)). GASPI is a Partitioned Global Address Space (PGAS) API. It aims at scalable, flexible and failure tolerant computing in massively parallel environments.
Modules
-------
GPI-2 version 1.0.2 is available on Anselm via the module gpi2:
```bash
$ module load gpi2
```
The module sets up environment variables required for linking and running GPI-2 enabled applications. This particular command loads the default module, which is gpi2/1.0.2.
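If you prefer to be explicit about the version (the version number is taken from above), you may list the available modules and load it directly:
```bash
$ module avail gpi2
$ module load gpi2/1.0.2
```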
Linking
-------
!!! Note "Note"
Link with -lGPI2 -libverbs
Load the gpi2 module. Link using the **-lGPI2** and **-libverbs** switches to link your code against GPI-2. GPI-2 requires the OFED InfiniBand communication library ibverbs.
### Compiling and linking with Intel compilers
```bash
$ module load intel
$ module load gpi2
$ icc myprog.c -o myprog.x -Wl,-rpath=$LIBRARY_PATH -lGPI2 -libverbs
```
### Compiling and linking with GNU compilers
```bash
$ module load gcc
$ module load gpi2
$ gcc myprog.c -o myprog.x -Wl,-rpath=$LIBRARY_PATH -lGPI2 -libverbs
```
Running the GPI-2 codes
-----------------------
!!! Note "Note"
gaspi_run starts the GPI-2 application
The gaspi_run utility is used to start and run GPI-2 applications:
```bash
$ gaspi_run -m machinefile ./myprog.x
```
A machine file (**machinefile**) with the hostnames of the nodes where the application will run must be provided. The machinefile lists all nodes on which to run, one entry per node per process. This file may be hand-created or obtained from the standard $PBS_NODEFILE:
```bash
$ cut -f1 -d"." $PBS_NODEFILE > machinefile
```
machinefile:
```bash
cn79
cn80
```
This machinefile will run 2 GPI-2 processes, one on node cn79 and the other on node cn80.
machinefile:
```bash
cn79
cn79
cn80
cn80
```
This machinefile will run 4 GPI-2 processes, 2 on node cn79 and 2 on node cn80.
!!! Note "Note"
Use the **mpiprocs** parameter to control how many GPI-2 processes will run per node
Example:
```bash
$ qsub -A OPEN-0-0 -q qexp -l select=2:ncpus=16:mpiprocs=16 -I
```
This example will produce $PBS_NODEFILE with 16 entries per node.
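Inside such a job, the machinefile is then derived from $PBS_NODEFILE exactly as shown above, yielding 16 GPI-2 processes per node (myprog.x is a placeholder for your binary):
```bash
$ cut -f1 -d"." $PBS_NODEFILE > machinefile
$ gaspi_run -m machinefile ./myprog.x
```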
### gaspi_logger
!!! Note "Note"
gaspi_logger views the output from GPI-2 application ranks
The gaspi_logger utility is used to view the output from all nodes except the master node (rank 0). The gaspi_logger is started in another session on the master node, i.e. the node where gaspi_run is executed. The output of the application, when called with gaspi_printf(), will be redirected to the gaspi_logger. Other I/O routines (e.g. printf) will not.
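A minimal sketch of the workflow (the node name is illustrative; a complete session is shown in the Example section below):
```bash
$ ssh cn79            # the node where gaspi_run was started
cn79 $ gaspi_logger
```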
Example
-------
Following is an example GPI-2 enabled code:
```cpp
#include <GASPI.h>
#include <stdlib.h>

void success_or_exit ( const char* file, const int line, const int ec)
{
  if (ec != GASPI_SUCCESS)
    {
      gaspi_printf ("Assertion failed in %s[%i]:%d\n", file, line, ec);
      exit (1);
    }
}

#define ASSERT(ec) success_or_exit (__FILE__, __LINE__, ec);

int main(int argc, char *argv[])
{
  gaspi_rank_t rank, num;
  gaspi_return_t ret;

  /* Initialize GPI-2 */
  ASSERT( gaspi_proc_init(GASPI_BLOCK) );

  /* Get ranks information */
  ASSERT( gaspi_proc_rank(&rank) );
  ASSERT( gaspi_proc_num(&num) );

  gaspi_printf("Hello from rank %d of %d\n",
               rank, num);

  /* Terminate */
  ASSERT( gaspi_proc_term(GASPI_BLOCK) );

  return 0;
}
```
Load modules and compile:
```bash
$ module load gcc gpi2
$ gcc helloworld_gpi.c -o helloworld_gpi.x -Wl,-rpath=$LIBRARY_PATH -lGPI2 -libverbs
```
Submit the job and run the GPI-2 application
```bash
$ qsub -q qexp -l select=2:ncpus=1:mpiprocs=1,place=scatter,walltime=00:05:00 -I
qsub: waiting for job 171247.dm2 to start
qsub: job 171247.dm2 ready
cn79 $ module load gpi2
cn79 $ cut -f1 -d"." $PBS_NODEFILE > machinefile
cn79 $ gaspi_run -m machinefile ./helloworld_gpi.x
Hello from rank 0 of 2
```
At the same time, in another session, you may start the gaspi logger:
```bash
$ ssh cn79
cn79 $ gaspi_logger
GASPI Logger (v1.1)
[cn80:0] Hello from rank 1 of 2
```
In this example, we compile the helloworld_gpi.c code using the **GNU compiler** (gcc) and link it to the GPI-2 and ibverbs libraries. The library search path is compiled in. For execution, we use the qexp queue, 2 nodes, 1 core each. The GPI module must be loaded on the master compute node (in this example cn79); gaspi_logger is used from a different session to view the output of the second process.
Intel Compilers
===============
The Intel compilers version 13.1.1 are available via the module intel. The compilers include the icc C and C++ compiler and the ifort Fortran 77/90/95 compiler.
```bash
$ module load intel
$ icc -v
$ ifort -v
```
The Intel compilers provide vectorization of the code via the AVX instructions and support thread parallelization via OpenMP.
For maximum performance on the Anselm cluster, compile your programs using the AVX instructions, with reporting of where the vectorization was used. We recommend the following compilation options for high performance:
```bash
$ icc -ipo -O3 -vec -xAVX -vec-report1 myprog.c mysubroutines.c -o myprog.x
$ ifort -ipo -O3 -vec -xAVX -vec-report1 myprog.f mysubroutines.f -o myprog.x
```
In this example, we compile the program enabling interprocedural optimizations between source files (-ipo), aggressive loop optimizations (-O3) and vectorization (-vec -xAVX).
The compiler recognizes the omp, simd, vector and ivdep pragmas for OpenMP parallelization and AVX vectorization. Enable the OpenMP parallelization with the **-openmp** compiler switch:
```bash
$ icc -ipo -O3 -vec -xAVX -vec-report1 -openmp myprog.c mysubroutines.c -o myprog.x
$ ifort -ipo -O3 -vec -xAVX -vec-report1 -openmp myprog.f mysubroutines.f -o myprog.x
```
Read more at <http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-lin/index.htm>
Sandy Bridge/Haswell binary compatibility
-----------------------------------------
Anselm nodes are currently equipped with Sandy Bridge CPUs, while Salomon will use the Haswell architecture. The new processors are backward compatible with the Sandy Bridge nodes, so all programs that ran on the Sandy Bridge processors should also run on the new Haswell nodes. To get optimal performance out of the Haswell processors, a program should make use of the special AVX2 instructions for this processor. One can do this by recompiling codes with the compiler flags designated to invoke these instructions. For the Intel compiler suite, there are two ways of doing this:
- Using the compiler flag (both for Fortran and C): -xCORE-AVX2. This will create a binary with AVX2 instructions, specifically for the Haswell processors. Note that the executable will not run on Sandy Bridge nodes.
- Using the compiler flags (both for Fortran and C): -xAVX -axCORE-AVX2. This will generate multiple, feature-specific auto-dispatch code paths for Intel® processors, if there is a performance benefit. This binary will therefore run on both Sandy Bridge and Haswell processors. At runtime it is decided which path to follow, dependent on which processor you are running on. In general this will result in larger binaries (see the sketch below).
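For example, a build of the portable variant might look as follows (file names are placeholders; the optimization flags are those recommended earlier, combined with the dispatch flags listed above):
```bash
$ icc   -ipo -O3 -xAVX -axCORE-AVX2 myprog.c mysubroutines.c -o myprog.x
$ ifort -ipo -O3 -xAVX -axCORE-AVX2 myprog.f mysubroutines.f -o myprog.x
```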
Intel Debugger
==============
Debugging serial applications
-----------------------------
The Intel debugger version 13.0 is available via the module intel. The debugger works for applications compiled with the C and C++ compiler and the ifort Fortran 77/90/95 compiler. The debugger provides a Java GUI environment. Use an X display for running the GUI.
```bash
$ module load intel
$ idb
```
The debugger may run in text mode. To debug in text mode, use:
```bash
$ idbc
```
To debug on the compute nodes, the module intel must be loaded. The GUI on the compute nodes may be accessed in the same way as described in the GUI section.
Example:
```bash
$ qsub -q qexp -l select=1:ncpus=16 -X -I
qsub: waiting for job 19654.srv11 to start
qsub: job 19654.srv11 ready
$ module load intel
$ module load java
$ icc -O0 -g myprog.c -o myprog.x
$ idb ./myprog.x
```
In this example, we allocate 1 full compute node, compile the program myprog.c with the debugging options -O0 -g and run the idb debugger interactively on the myprog.x executable. GUI access is via X11 port forwarding provided by the PBS workload manager.
Debugging parallel applications
-------------------------------
The Intel debugger is capable of debugging multithreaded and MPI parallel programs as well.
### Small number of MPI ranks
For debugging a small number of MPI ranks, you may execute and debug each rank in a separate xterm terminal (do not forget the X display). Using Intel MPI, this may be done in the following way:
```bash
$ qsub -q qexp -l select=2:ncpus=16 -X -I
qsub: waiting for job 19654.srv11 to start
qsub: job 19655.srv11 ready
$ module load intel impi
$ mpirun -ppn 1 -hostfile $PBS_NODEFILE --enable-x xterm -e idbc ./mympiprog.x
```
In this example, we allocate 2 full compute nodes, run xterm on each node and start the idb debugger in command line mode, debugging two ranks of the mympiprog.x application. An xterm will pop up for each rank, with the idb prompt ready. The example is not limited to the use of Intel MPI.
### Large number of MPI ranks
Run the idb debugger using the MPI debug option of mpirun. This will cause the debugger to bind to all ranks and provide aggregated outputs across the ranks, pausing execution automatically just after startup. You may then set break points and step the execution manually. Using Intel MPI:
```bash
$ qsub -q qexp -l select=2:ncpus=16 -X -I
qsub: waiting for job 19654.srv11 to start
qsub: job 19655.srv11 ready
$ module load intel impi
$ mpirun -n 32 -idb ./mympiprog.x
```
### Debugging multithreaded application
Run the idb debugger in GUI mode. The Parallel menu contains a number of tools for debugging multiple threads. One of the most useful is the **Serialize Execution** tool, which serializes the execution of concurrent threads for easy orientation and identification of concurrency-related bugs.
Further information
-------------------
An exhaustive manual on idb features and usage is published on the [Intel website](http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/debugger/user_guide/index.htm).
Intel IPP
=========
Intel Integrated Performance Primitives
---------------------------------------
Intel Integrated Performance Primitives, version 7.1.1, compiled for AVX vector instructions, is available via the module ipp. IPP is a very rich library of highly optimized algorithmic building blocks for media and data applications. This includes signal, image and frame processing algorithms, such as FFT, FIR, Convolution, Optical Flow, Hough transform, Sum, MinMax, as well as cryptographic functions, linear algebra functions and many more.
!!! Note "Note"
Check out IPP before implementing your own math functions for data processing; it is likely already there.
```bash
$ module load ipp
```
The module sets up environment variables required for linking and running IPP enabled applications.
IPP example
-----------
```cpp
#include "ipp.h"
#include <stdio.h>

int main(int argc, char* argv[])
{
    const IppLibraryVersion *lib;
    Ipp64u fm;
    IppStatus status;

    status = ippInit();   // IPP initialization with the best optimization layer
    if( status != ippStsNoErr ) {
        printf("IppInit() Error:\n");
        printf("%s\n", ippGetStatusString(status) );
        return -1;
    }

    // Get version info
    lib = ippiGetLibVersion();
    printf("%s %s\n", lib->Name, lib->Version);

    // Get CPU features enabled with selected library level
    fm = ippGetEnabledCpuFeatures();
    printf("SSE    :%c\n", (fm>>1)&1  ? 'Y' : 'N');
    printf("SSE2   :%c\n", (fm>>2)&1  ? 'Y' : 'N');
    printf("SSE3   :%c\n", (fm>>3)&1  ? 'Y' : 'N');
    printf("SSSE3  :%c\n", (fm>>4)&1  ? 'Y' : 'N');
    printf("SSE41  :%c\n", (fm>>6)&1  ? 'Y' : 'N');
    printf("SSE42  :%c\n", (fm>>7)&1  ? 'Y' : 'N');
    printf("AVX    :%c\n", (fm>>8)&1  ? 'Y' : 'N');
    printf("AVX2   :%c\n", (fm>>15)&1 ? 'Y' : 'N');
    printf("----------\n");
    printf("OS Enabled AVX :%c\n", (fm>>9)&1  ? 'Y' : 'N');
    printf("AES            :%c\n", (fm>>10)&1 ? 'Y' : 'N');
    printf("CLMUL          :%c\n", (fm>>11)&1 ? 'Y' : 'N');
    printf("RDRAND         :%c\n", (fm>>13)&1 ? 'Y' : 'N');
    printf("F16C           :%c\n", (fm>>14)&1 ? 'Y' : 'N');
    return 0;
}
```
Compile the above example using any compiler, with the ipp module loaded.
```bash
$ module load intel
$ module load ipp
$ icc testipp.c -o testipp.x -lippi -lipps -lippcore
```
You will need the ipp module loaded to run the IPP enabled executable. This may be avoided by compiling the library search paths into the executable:
```bash
$ module load intel
$ module load ipp
$ icc testipp.c -o testipp.x -Wl,-rpath=$LIBRARY_PATH -lippi -lipps -lippcore
```
Code samples and documentation
------------------------------
Intel provides a number of [Code Samples for IPP](https://software.intel.com/en-us/articles/code-samples-for-intel-integrated-performance-primitives-library), illustrating the use of IPP.
Read the full documentation on IPP [on the Intel website](http://software.intel.com/sites/products/search/search.php?q=&x=15&y=6&product=ipp&version=7.1&docos=lin), in particular the [IPP Reference Manual](http://software.intel.com/sites/products/documentation/doclib/ipp_sa/71/ipp_manual/index.htm).
Intel MKL
=========
Intel Math Kernel Library
-------------------------
Intel Math Kernel Library (Intel MKL) is a library of math kernel subroutines, extensively threaded and optimized for maximum performance. Intel MKL provides these basic math kernels:
- BLAS (level 1, 2, and 3) and LAPACK linear algebra routines, offering vector, vector-matrix, and matrix-matrix operations.
- The PARDISO direct sparse solver, an iterative sparse solver, and supporting sparse BLAS (level 1, 2, and 3) routines for solving sparse systems of equations.
- ScaLAPACK distributed processing linear algebra routines for Linux* and Windows* operating systems, as well as the Basic Linear Algebra Communications Subprograms (BLACS) and the Parallel Basic Linear Algebra Subprograms (PBLAS).
- Fast Fourier transform (FFT) functions in one, two, or three dimensions with support for mixed radices (not limited to sizes that are powers of 2), as well as distributed versions of these functions.
- Vector Math Library (VML) routines for optimized mathematical operations on vectors.
- Vector Statistical Library (VSL) routines, which offer high-performance vectorized random number generators (RNG) for several probability distributions, convolution and correlation routines, and summary statistics functions.
- Data Fitting Library, which provides capabilities for spline-based approximation of functions, derivatives and integrals of functions, and search.
- Extended Eigensolver, a shared memory version of an eigensolver based on the Feast Eigenvalue Solver.
For details see the [Intel MKL Reference Manual](http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mklman/index.htm).
Intel MKL version 13.5.192 is available on Anselm:
```bash
$ module load mkl
```
The module sets up environment variables required for linking and running MKL enabled applications. The most important variables are $MKLROOT, $MKL_INC_DIR, $MKL_LIB_DIR and $MKL_EXAMPLES.
!!! Note "Note"
The MKL library may be linked using any compiler. With the Intel compiler, use the -mkl option to link the default threaded MKL.
### Interfaces
The MKL library provides a number of interfaces. The fundamental ones are LP64 and ILP64. The Intel MKL ILP64 libraries use the 64-bit integer type (necessary for indexing large arrays, with more than 2^31 - 1 elements), whereas the LP64 libraries index arrays with the 32-bit integer type.
|Interface|Integer type|
|---|---|
|LP64|32-bit, int, integer(kind=4), MPI_INT|
|ILP64|64-bit, long int, integer(kind=8), MPI_INT64|
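A sketch of linking against the ILP64 interface, based on the LP64 link line used in the examples below (myprog.c is a placeholder; -DMKL_ILP64 and mkl_intel_ilp64 are the standard MKL macro and library names):
```bash
$ icc -DMKL_ILP64 myprog.c -I$MKL_INC_DIR -L$MKL_LIB_DIR -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -o myprog.x
```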
### Linking
Linking MKL libraries may be complex. Intel [mkl link line advisor](http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor) helps. See also [examples](intel-mkl/#examples) below.
You will need the mkl module loaded to run the MKL enabled executable. This may be avoided by compiling the library search paths into the executable. Include rpath on the compile line:
```bash
$ icc .... -Wl,-rpath=$LIBRARY_PATH ...
```
### Threading
!!! Note "Note"
An advantage of using the MKL library is that it brings threaded parallelization to applications that are otherwise not parallel.
For this to work, the application must link the threaded MKL library (default). The number and behaviour of MKL threads may be controlled via the OpenMP environment variables, such as OMP_NUM_THREADS and KMP_AFFINITY. MKL_NUM_THREADS takes precedence over OMP_NUM_THREADS.
```bash
$ export OMP_NUM_THREADS=16
$ export KMP_AFFINITY=granularity=fine,compact,1,0
```
The application will run with 16 threads with affinity optimized for fine grain parallelization.
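Since MKL_NUM_THREADS takes precedence over OMP_NUM_THREADS (see above), the MKL thread count may also be set explicitly:
```bash
$ export MKL_NUM_THREADS=16
```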
Examples
------------
A number of examples demonstrating the use of the MKL library and its linking are available on Anselm in the $MKL_EXAMPLES directory. In the examples below, we demonstrate linking MKL to Intel and GNU compiled programs for multi-threaded matrix multiplication.
### Working with examples
```bash
$ module load intel
$ module load mkl
$ cp -a $MKL_EXAMPLES/cblas /tmp/
$ cd /tmp/cblas
$ make sointel64 function=cblas_dgemm
```
In this example, we compile, link and run the cblas_dgemm example, demonstrating use of MKL example suite installed on Anselm.
### Example: MKL and Intel compiler
```bash
$ module load intel
$ module load mkl
$ cp -a $MKL_EXAMPLES/cblas /tmp/
$ cd /tmp/cblas
$
$ icc -w source/cblas_dgemmx.c source/common_func.c -mkl -o cblas_dgemmx.x
$ ./cblas_dgemmx.x data/cblas_dgemmx.d
```
In this example, we compile, link and run the cblas_dgemm example, demonstrating use of MKL with icc -mkl option. Using the -mkl option is equivalent to:
```bash
$ icc -w source/cblas_dgemmx.c source/common_func.c -o cblas_dgemmx.x
-I$MKL_INC_DIR -L$MKL_LIB_DIR -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5
```
In this example, we compile and link the cblas_dgemm example, using LP64 interface to threaded MKL and Intel OMP threads implementation.
### Example: MKL and GNU compiler
```bash
$ module load gcc
$ module load mkl
$ cp -a $MKL_EXAMPLES/cblas /tmp/
$ cd /tmp/cblas
$ gcc -w source/cblas_dgemmx.c source/common_func.c -o cblas_dgemmx.x
-lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lm
$ ./cblas_dgemmx.x data/cblas_dgemmx.d
```
In this example, we compile, link and run the cblas_dgemm example, using LP64 interface to threaded MKL and gnu OMP threads implementation.
MKL and MIC accelerators
------------------------
MKL is capable of automatically offloading the computations to the MIC accelerator. See the section [Intel Xeon Phi](../intel-xeon-phi/) for details.
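A minimal sketch of enabling the automatic offload from the environment (the MKL_MIC_ENABLE variable is described in the Intel Xeon Phi section):
```bash
$ export MKL_MIC_ENABLE=1
```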
Further reading
---------------
Read more on [Intel website](http://software.intel.com/en-us/intel-mkl), in particular the [MKL users guide](https://software.intel.com/en-us/intel-mkl/documentation/linux).
Intel TBB
=========
Intel Threading Building Blocks
-------------------------------
Intel Threading Building Blocks (Intel TBB) is a library that supports scalable parallel programming using standard ISO C++ code. It does not require special languages or compilers. To use the library, you specify tasks, not threads, and let the library map tasks onto threads in an efficient manner. The tasks are executed by a runtime scheduler and may
be offloaded to [MIC accelerator](../intel-xeon-phi/).
Intel TBB version 4.1 is available on Anselm:
```bash
$ module load tbb
```
The module sets up environment variables required for linking and running TBB enabled applications.
!!! Note "Note"
Link the tbb library, using -ltbb
Examples
--------
A number of examples demonstrating the use of TBB and its built-in scheduler are available on Anselm, in the $TBB_EXAMPLES directory.
```bash
$ module load intel
$ module load tbb
$ cp -a $TBB_EXAMPLES/common $TBB_EXAMPLES/parallel_reduce /tmp/
$ cd /tmp/parallel_reduce/primes
$ icc -O2 -DNDEBUG -o primes.x main.cpp primes.cpp -ltbb
$ ./primes.x
```
In this example, we compile, link and run the primes example, demonstrating the use of parallel task-based reduction in the computation of prime numbers.
You will need the tbb module loaded to run the TBB enabled executable. This may be avoided by compiling the library search paths into the executable.
```bash
$ icc -O2 -o primes.x main.cpp primes.cpp -Wl,-rpath=$LIBRARY_PATH -ltbb
```
Further reading
---------------
Read more on Intel website, <http://software.intel.com/sites/products/documentation/doclib/tbb_sa/help/index.htm>
Intel Parallel Studio
=====================
The Anselm cluster provides the following elements of the Intel Parallel Studio XE:
|Intel Parallel Studio XE|
|-------------------------------------------------|
|Intel Compilers|
|Intel Debugger|
|Intel MKL Library|
|Intel Integrated Performance Primitives Library|
|Intel Threading Building Blocks Library|
Intel compilers
---------------
The Intel compilers version 13.1.3 are available via the module intel. The compilers include the icc C and C++ compiler and the ifort Fortran 77/90/95 compiler.
```bash
$ module load intel
$ icc -v
$ ifort -v
```
Read more at the [Intel Compilers](intel-compilers/) page.
Intel debugger
--------------
The Intel debugger version 13.0 is available via the module intel. The debugger works for applications compiled with the C and C++ compiler and the ifort Fortran 77/90/95 compiler. The debugger provides a Java GUI environment. Use an X display for running the GUI.
```bash
$ module load intel
$ idb
```
Read more at the [Intel Debugger](intel-debugger/) page.
Intel Math Kernel Library
-------------------------
Intel Math Kernel Library (Intel MKL) is a library of math kernel subroutines, extensively threaded and optimized for maximum performance. Intel MKL unites and provides these basic components: BLAS, LAPACK, ScaLAPACK, PARDISO, FFT, VML, VSL, Data Fitting, Feast Eigensolver and many more.
```bash
$ module load mkl
```
Read more at the [Intel MKL](intel-mkl/) page.
Intel Integrated Performance Primitives
---------------------------------------
Intel Integrated Performance Primitives, version 7.1.1, compiled for AVX, is available via the module ipp. IPP is a library of highly optimized algorithmic building blocks for media and data applications. This includes signal, image and frame processing algorithms, such as FFT, FIR, Convolution, Optical Flow, Hough transform, Sum, MinMax and many more.
```bash
$ module load ipp
```
Read more at the [Intel IPP](intel-integrated-performance-primitives/) page.
Intel Threading Building Blocks
-------------------------------
Intel Threading Building Blocks (Intel TBB) is a library that supports scalable parallel programming using standard ISO C++ code. It does not require special languages or compilers. It is designed to promote scalable data parallel programming. Additionally, it fully supports nested parallelism, so you can build larger parallel components from smaller parallel components. To use the library, you specify tasks, not threads, and let the library map tasks onto threads in an efficient manner.
```bash
$ module load tbb
```
Read more at the [Intel TBB](intel-tbb/) page.
Intel Xeon Phi
==============
## A guide to Intel Xeon Phi usage
Intel Xeon Phi can be programmed in several modes. The default mode on Anselm is offload mode, but all modes described in this document are supported.
Intel Utilities for Xeon Phi
----------------------------
To get access to a compute node with an Intel Xeon Phi accelerator, use a PBS interactive session:
```bash
$ qsub -I -q qmic -A NONE-0-0
```
To set up the environment, the module intel has to be loaded:
```bash
$ module load intel/13.5.192
```
Information about the hardware can be obtained by running the micinfo program on the host.
```bash
$ /usr/bin/micinfo
```
The output of the "micinfo" utility executed on one of the Anselm node is as follows. (note: to get PCIe related details the command has to be run with root privileges)
```bash
MicInfo Utility Log
Created Mon Jul 22 00:23:50 2013
System Info
HOST OS : Linux
OS Version : 2.6.32-279.5.2.bl6.Bull.33.x86_64
Driver Version : 6720-15
MPSS Version : 2.1.6720-15
Host Physical Memory : 98843 MB
Device No: 0, Device Name: mic0
Version
Flash Version : 2.1.03.0386
SMC Firmware Version : 1.15.4830
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8-g2593b11
Device Serial Number : ADKC30102482
Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS
Cores
Total No of Active Cores : 60
Voltage : 1032000 uV
Frequency : 1052631 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 49 C
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
```
Offload Mode
------------
To compile code for the Intel Xeon Phi, the MPSS stack has to be installed on the machine where the compilation is executed. Currently the MPSS stack is only installed on compute nodes equipped with accelerators.
```bash
$ qsub -I -q qmic -A NONE-0-0
$ module load intel/13.5.192
```
For debugging purposes it is also recommended to set the environment variable "OFFLOAD_REPORT". The value can be set from 0 to 3, where a higher number means more debugging information.
```bash
export OFFLOAD_REPORT=3
```
A very basic example of code that employs the offload programming technique is shown in the next listing. Please note that this code is sequential and utilizes only a single core of the accelerator.
```bash
$ vim source-offload.cpp

#include <iostream>

int main(int argc, char* argv[])
{
    const int niter = 100000;
    double result = 0;

    #pragma offload target(mic)
    for (int i = 0; i < niter; ++i) {
        const double t = (i + 0.5) / niter;
        result += 4.0 / (t * t + 1.0);
    }
    result /= niter;
    std::cout << "Pi ~ " << result << '\n';
}
```
To compile the code using the Intel compiler, run:
```bash
$ icc source-offload.cpp -o bin-offload
```
To execute the code, run the following command on the host
```bash
./bin-offload
```
### Parallelization in Offload Mode Using OpenMP
One way of parallelizing a code for the Xeon Phi is using OpenMP directives. The following example shows code for parallel vector addition.
```bash
$ vim ./vect-add.c

#include <stdio.h>
#include <stdlib.h>

typedef int T;

#define SIZE 1000

#pragma offload_attribute(push, target(mic))
T in1[SIZE];
T in2[SIZE];
T res[SIZE];
#pragma offload_attribute(pop)

// MIC function to add two vectors
__attribute__((target(mic))) void add_mic(T *a, T *b, T *c, int size) {
  int i = 0;
  #pragma omp parallel for
  for (i = 0; i < size; i++)
    c[i] = a[i] + b[i];
}

// CPU function to add two vectors
void add_cpu (T *a, T *b, T *c, int size) {
  int i;
  for (i = 0; i < size; i++)
    c[i] = a[i] + b[i];
}

// CPU function to generate a vector of random numbers
void random_T (T *a, int size) {
  int i;
  for (i = 0; i < size; i++)
    a[i] = rand() % 10000; // random number between 0 and 9999
}

// CPU function to compare two vectors
int compare(T *a, T *b, T size ){
  int pass = 0;
  int i;
  for (i = 0; i < size; i++){
    if (a[i] != b[i]) {
      printf("Value mismatch at location %d, values %d and %d\n", i, a[i], b[i]);
      pass = 1;
    }
  }
  if (pass == 0) printf ("Test passed\n"); else printf ("Test Failed\n");
  return pass;
}

int main()
{
  int i;
  random_T(in1, SIZE);
  random_T(in2, SIZE);

  #pragma offload target(mic) in(in1,in2) inout(res)
  {
    // Parallel loop from main function
    #pragma omp parallel for
    for (i = 0; i < SIZE; i++)
      res[i] = in1[i] + in2[i];

    // or the parallel loop is called inside the function
    add_mic(in1, in2, res, SIZE);
  }

  // Check the results with the CPU implementation
  T res_cpu[SIZE];
  add_cpu(in1, in2, res_cpu, SIZE);
  compare(res, res_cpu, SIZE);
}
```
During the compilation, the Intel compiler reports which loops have been vectorized on both the host and the accelerator. This can be enabled with the compiler option "-vec-report2". To compile and execute the code, run:
```bash
$ icc vect-add.c -openmp_report2 -vec-report2 -o vect-add
$ ./vect-add
```
Some interesting compiler flags useful not only for code debugging are:
!!! Note "Note"
Debugging
openmp_report[0|1|2] - controls the OpenMP parallelizer diagnostic level
vec-report[0|1|2] - controls the compiler based vectorization diagnostic level
Performance optimization
xhost - FOR HOST ONLY - to generate AVX (Advanced Vector Extensions) instructions.
Automatic Offload using Intel MKL Library
-----------------------------------------
Intel MKL includes an Automatic Offload (AO) feature that enables computationally intensive MKL functions called in user code to benefit from attached Intel Xeon Phi coprocessors automatically and transparently.
The behaviour of the automatic offload mode is controlled by functions called within the program or by environment variables. The complete list of controls is available [here](http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/GUID-3DC4FC7D-A1E4-423D-9C0C-06AB265FFA86.htm).
The Automatic Offload may be enabled by either an MKL function call within the code:
```cpp
mkl_mic_enable();
```
or by setting environment variable
```bash
$ export MKL_MIC_ENABLE=1
```
To get more information about automatic offload, please refer to the "[Using Intel® MKL Automatic Offload on Intel® Xeon Phi™ Coprocessors](http://software.intel.com/sites/default/files/11MIC42_How_to_Use_MKL_Automatic_Offload_0.pdf)" white paper or the [Intel MKL documentation](https://software.intel.com/en-us/articles/intel-math-kernel-library-documentation).
### Automatic offload example
First, get an interactive PBS session on a node with a MIC accelerator and load the "intel" module, which automatically loads the "mkl" module as well.
```bash
$ qsub -I -q qmic -A OPEN-0-0 -l select=1:ncpus=16
$ module load intel
```
The following example shows how to automatically offload an SGEMM (single precision general matrix multiply) function to the MIC coprocessor. The code can be copied to a file and compiled without any necessary modification.
```bash
$ vim sgemm-ao-short.c

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <stdint.h>

#include "mkl.h"

int main(int argc, char **argv)
{
    float *A, *B, *C;       /* Matrices */
    MKL_INT N = 2560;       /* Matrix dimensions */
    MKL_INT LD = N;         /* Leading dimension */
    int matrix_bytes;       /* Matrix size in bytes */
    int matrix_elements;    /* Matrix size in elements */

    float alpha = 1.0, beta = 1.0;      /* Scaling factors */
    char transa = 'N', transb = 'N';    /* Transposition options */
    int i, j;                           /* Counters */

    matrix_elements = N * N;
    matrix_bytes = sizeof(float) * matrix_elements;

    /* Allocate the matrices */
    A = malloc(matrix_bytes); B = malloc(matrix_bytes); C = malloc(matrix_bytes);

    /* Initialize the matrices */
    for (i = 0; i < matrix_elements; i++) {
        A[i] = 1.0; B[i] = 2.0; C[i] = 0.0;
    }

    printf("Computing SGEMM on the host\n");
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);

    printf("Enabling Automatic Offload\n");
    /* Alternatively, set environment variable MKL_MIC_ENABLE=1 */
    mkl_mic_enable();

    int ndevices = mkl_mic_get_device_count(); /* Number of MIC devices */
    printf("Automatic Offload enabled: %d MIC devices present\n", ndevices);

    printf("Computing SGEMM with automatic workdivision\n");
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);

    /* Free the matrix memory */
    free(A); free(B); free(C);

    printf("Done\n");

    return 0;
}
```
!!! Note "Note"
Please note: This example is a simplified version of an example from MKL. The expanded version can be found here: **$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c**
To compile a code using Intel compiler use:
```bash
$ icc -mkl sgemm-ao-short.c -o sgemm
```
For debugging purposes enable the offload report to see more information about automatic offloading.
```bash
$ export OFFLOAD_REPORT=2
```
The output of the code should look similar to the following listing, where lines starting with [MKL] are generated by the offload reporting:
```bash
Computing SGEMM on the host
Enabling Automatic Offload
Automatic Offload enabled: 1 MIC devices present
Computing SGEMM with automatic workdivision
[MKL] [MIC --] [AO Function] SGEMM
[MKL] [MIC --] [AO SGEMM Workdivision] 0.00 1.00
[MKL] [MIC 00] [AO SGEMM CPU Time] 0.463351 seconds
[MKL] [MIC 00] [AO SGEMM MIC Time] 0.179608 seconds
[MKL] [MIC 00] [AO SGEMM CPU->MIC Data] 52428800 bytes
[MKL] [MIC 00] [AO SGEMM MIC->CPU Data] 26214400 bytes
Done
```
Native Mode
-----------
In native mode a program is executed directly on the Intel Xeon Phi, without involvement of the host machine. Similarly to offload mode, the code is compiled on the host computer with the Intel compilers.
To compile the code, the user has to be connected to a compute node with a MIC and load the Intel compilers module. To get an interactive session on a compute node with an Intel Xeon Phi and load the module, use the following commands:
```bash
$ qsub -I -q qmic -A NONE-0-0
$ module load intel/13.5.192
```
!!! Note "Note"
Please note that a particular version of the Intel module is specified. This information is used later to specify the correct library paths.
To produce a binary compatible with the Intel Xeon Phi architecture, the user has to specify the "-mmic" compiler flag. Two compilation examples are shown below. The first example shows how to compile the OpenMP parallel code "vect-add.c" for the host only:
```bash
$ icc -xhost -no-offload -fopenmp vect-add.c -o vect-add-host
```
To run this code on host, use:
```bash
$ ./vect-add-host
```
The second example shows how to compile the same code for Intel Xeon Phi:
```bash
$ icc -mmic -fopenmp vect-add.c -o vect-add-mic
```
### Execution of the Program in Native Mode on Intel Xeon Phi
User access to the Intel Xeon Phi is through SSH. Since user home directories are mounted using NFS on the accelerator, users do not have to copy binary files or libraries between the host and the accelerator.
To connect to the accelerator run:
```bash
$ ssh mic0
```
If the code is sequential, it can be executed directly:
```bash
mic0 $ ~/path_to_binary/vect-add-seq-mic
```
If the code is parallelized using OpenMP, a set of additional libraries is required for execution. To locate these libraries, a new path has to be added to the LD_LIBRARY_PATH environment variable prior to the execution:
```bash
mic0 $ export LD_LIBRARY_PATH=/apps/intel/composer_xe_2013.5.192/compiler/lib/mic:$LD_LIBRARY_PATH
```
!!! Note "Note"
Please note that the path exported in the previous example contains the path to a specific compiler version (here 2013.5.192). This version number has to match the version number of the Intel compiler module that was used to compile the code on the host computer.
For your information the list of libraries and their location required for execution of an OpenMP parallel code on Intel Xeon Phi is:
!!! Note "Note"
/apps/intel/composer_xe_2013.5.192/compiler/lib/mic
- libiomp5.so
- libimf.so
- libsvml.so
- libirng.so
- libintlc.so.5
Finally, to run the compiled code use:
```bash
$ ~/path_to_binary/vect-add-mic
```
OpenCL
-------------------
OpenCL (Open Computing Language) is an open standard for general-purpose parallel programming for diverse mix of multi-core CPUs, GPU coprocessors, and other parallel processors. OpenCL provides a flexible execution model and uniform programming environment for software developers to write portable code for systems running on both the CPU and graphics processors or accelerators like the Intel® Xeon Phi.
On Anselm, OpenCL is installed only on compute nodes with a MIC accelerator, therefore OpenCL code can be compiled only on these nodes.
```bash
module load opencl-sdk opencl-rt
```
Always load "opencl-sdk" (providing devel files like headers) and "opencl-rt" (providing dynamic library libOpenCL.so) modules to compile and link OpenCL code. Load "opencl-rt" for running your compiled code.
There are two basic examples of OpenCL code in the following directory:
```bash
/apps/intel/opencl-examples/
```
First example "CapsBasic" detects OpenCL compatible hardware, here CPU and MIC, and prints basic information about the capabilities of it.
```bash
/apps/intel/opencl-examples/CapsBasic/capsbasic
```
To compile and run the example, copy it to your home directory, get a PBS interactive session on one of the nodes with a MIC and run make for compilation. The make files are very basic and show how the OpenCL code can be compiled on Anselm.
```bash
$ cp /apps/intel/opencl-examples/CapsBasic/* .
$ qsub -I -q qmic -A NONE-0-0
$ make
```
The compilation command for this example is:
```bash
$ g++ capsbasic.cpp -lOpenCL -o capsbasic -I/apps/intel/opencl/include/
```
After executing the compiled binary file, the following output should be displayed:
```bash
./capsbasic
Number of available platforms: 1
Platform names:
[0] Intel(R) OpenCL [Selected]
Number of devices available for each type:
CL_DEVICE_TYPE_CPU: 1
CL_DEVICE_TYPE_GPU: 0
CL_DEVICE_TYPE_ACCELERATOR: 1
** Detailed information for each device ***
CL_DEVICE_TYPE_CPU[0]
CL_DEVICE_NAME: Intel(R) Xeon(R) CPU E5-2470 0 @ 2.30GHz
CL_DEVICE_AVAILABLE: 1
...
CL_DEVICE_TYPE_ACCELERATOR[0]
CL_DEVICE_NAME: Intel(R) Many Integrated Core Acceleration Card
CL_DEVICE_AVAILABLE: 1
...
```
!!! Note "Note"
More information about this example can be found on Intel website: <http://software.intel.com/en-us/vcsource/samples/caps-basic/>
The second example that can be found in the "/apps/intel/opencl-examples" directory is a General Matrix Multiply. You can follow the same procedure to download the example to your directory and compile it.
```bash
$ cp -r /apps/intel/opencl-examples/* .
$ qsub -I -q qmic -A NONE-0-0
$ cd GEMM
$ make
```
The compilation command for this example is:
```bash
$ g++ cmdoptions.cpp gemm.cpp ../common/basic.cpp ../common/cmdparser.cpp ../common/oclobject.cpp -I../common -lOpenCL -o gemm -I/apps/intel/opencl/include/
```
To see the performance of the Intel Xeon Phi performing the DGEMM, run the example as follows:
```bash
./gemm -d 1
Platforms (1):
[0] Intel(R) OpenCL [Selected]
Devices (2):
[0] Intel(R) Xeon(R) CPU E5-2470 0 @ 2.30GHz
[1] Intel(R) Many Integrated Core Acceleration Card [Selected]
Build program options: "-DT=float -DTILE_SIZE_M=1 -DTILE_GROUP_M=16 -DTILE_SIZE_N=128 -DTILE_GROUP_N=1 -DTILE_SIZE_K=8"
Running gemm_nn kernel with matrix size: 3968x3968
Memory row stride to ensure necessary alignment: 15872 bytes
Size of memory region for one matrix: 62980096 bytes
Using alpha = 0.57599 and beta = 0.872412
...
Host time: 0.292953 sec.
Host perf: 426.635 GFLOPS
Host time: 0.293334 sec.
Host perf: 426.081 GFLOPS
...
```
!!! Note "Note"
Please note: GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load Intel compiler module.
MPI
-----------------
### Environment setup and compilation
Again, an MPI code for the Intel Xeon Phi has to be compiled on a compute node with an accelerator and the MPSS software stack installed. To get to a compute node with an accelerator, use:
```bash
$ qsub -I -q qmic -A NONE-0-0
```
The only supported implementation of the MPI standard for the Intel Xeon Phi is Intel MPI. To set up a fully functional development environment, a combination of the Intel compiler and Intel MPI has to be used. On the host, load the following modules before compilation:
```bash
$ module load intel/13.5.192 impi/4.1.1.036
```
To compile an MPI code for host use:
```bash
$ mpiicc -xhost -o mpi-test mpi-test.c
```
To compile the same code for Intel Xeon Phi architecture use:
```bash
$ mpiicc -mmic -o mpi-test-mic mpi-test.c
```
An example of a basic MPI version of the "hello world" example in C, which can be executed on both the host and the Xeon Phi, is below (it can be directly copied and pasted to a .c file):
```cpp
#include <stdio.h>
#include <mpi.h>

int main (int argc, char *argv[])
{
  int rank, size;

  int len;
  char node[MPI_MAX_PROCESSOR_NAME];

  MPI_Init (&argc, &argv);                  /* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);    /* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size);    /* get number of processes */

  MPI_Get_processor_name(node, &len);

  printf( "Hello world from process %d of %d on host %s \n", rank, size, node );
  MPI_Finalize();
  return 0;
}
```
### MPI programming models
Intel MPI for the Xeon Phi coprocessors offers different MPI programming models:
!!! Note "Note"
**Host-only model** - all MPI ranks reside on the host. The coprocessors can be used by using offload pragmas. (Using MPI calls inside offloaded code is not supported.)
**Coprocessor-only model** - all MPI ranks reside only on the coprocessors.
**Symmetric model** - the MPI ranks reside on both the host and the coprocessor. Most general MPI case.
###Host-only model
In this case all environment variables are set by modules, so to execute the compiled MPI program on a single node, use:
```bash
$ mpirun -np 4 ./mpi-test
```
The output should be similar to:
```bash
Hello world from process 1 of 4 on host cn207
Hello world from process 3 of 4 on host cn207
Hello world from process 2 of 4 on host cn207
Hello world from process 0 of 4 on host cn207
```
### Coprocessor-only model
There are two ways to execute an MPI code on a single coprocessor: 1) launch the program using **mpirun** from the coprocessor, or 2) launch the task using **mpiexec.hydra** from the host.
**Execution on coprocessor**
Similarly to the execution of OpenMP programs in native mode, since the environment modules are not supported on the MIC, the user has to set up the paths to the Intel MPI libraries and binaries manually. A one-time setup can be done by creating a "**.profile**" file in the user's home directory. This file sets up the environment on the MIC automatically once the user accesses the accelerator through SSH.
```bash
$ vim ~/.profile
PS1='[\u@\h \W]$ '
export PATH=/usr/bin:/usr/sbin:/bin:/sbin
#OpenMP
export LD_LIBRARY_PATH=/apps/intel/composer_xe_2013.5.192/compiler/lib/mic:$LD_LIBRARY_PATH
#Intel MPI
export LD_LIBRARY_PATH=/apps/intel/impi/4.1.1.036/mic/lib/:$LD_LIBRARY_PATH
export PATH=/apps/intel/impi/4.1.1.036/mic/bin/:$PATH
```
!!! Note "Note"
Please note:
- this file sets up the environment variables for both the MPI and OpenMP libraries.
- this file sets up paths to a particular version of the Intel MPI library and a particular version of the Intel compiler. These versions have to match the loaded modules.
To access a MIC accelerator located on the node that the user is currently connected to, use:
```bash
$ ssh mic0
```
or, in case you need to specify a MIC accelerator on a particular node, use:
```bash
$ ssh cn207-mic0
```
To run the MPI code in parallel on multiple cores of the accelerator, use:
```bash
$ mpirun -np 4 ./mpi-test-mic
```
The output should be similar to:
```bash
Hello world from process 1 of 4 on host cn207-mic0
Hello world from process 2 of 4 on host cn207-mic0
Hello world from process 3 of 4 on host cn207-mic0
Hello world from process 0 of 4 on host cn207-mic0
```
**Execution on host**
If the MPI program is launched from the host instead of the coprocessor, the environment variables are not set using the ".profile" file. Therefore the user has to specify the library paths from the command line when calling "mpiexec".
The first step is to tell mpiexec that the MPI should be executed on a local accelerator by setting the environment variable "I_MPI_MIC":
```bash
$ export I_MPI_MIC=1
```
Now the MPI program can be executed as:
```bash
$ mpiexec.hydra -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ -host mic0 -n 4 ~/mpi-test-mic
```
or using mpirun
```bash
$ mpirun -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ -host mic0 -n 4 ~/mpi-test-mic
```
!!! Note "Note"
Please note:
- the full path to the binary has to be specified (here: "**~/mpi-test-mic**")
- the LD_LIBRARY_PATH has to match the Intel MPI module used to compile the MPI code
The output should be again similar to:
```bash
Hello world from process 1 of 4 on host cn207-mic0
Hello world from process 2 of 4 on host cn207-mic0
Hello world from process 3 of 4 on host cn207-mic0
Hello world from process 0 of 4 on host cn207-mic0
```
!!! Note "Note"
Please note that the **"mpiexec.hydra"** requires a file the MIC filesystem. If the file is missing please contact the system administrators. A simple test to see if the file is present is to execute:
```bash
$ ssh mic0 ls /bin/pmi_proxy
/bin/pmi_proxy
```
**Execution on host - MPI processes distributed over multiple accelerators on multiple nodes**
To get access to multiple nodes with MIC accelerators, the user has to use PBS to allocate the resources. To start an interactive session that allocates 2 compute nodes = 2 MIC accelerators, run the qsub command with the following parameters:
```bash
$ qsub -I -q qmic -A NONE-0-0 -l select=2:ncpus=16
$ module load intel/13.5.192 impi/4.1.1.036
```
This command immediately connects the user through ssh to one of the nodes. To see the other nodes that have been allocated, use:
```bash
$ cat $PBS_NODEFILE
```
For example:
```bash
cn204.bullx
cn205.bullx
```
This output means that PBS allocated nodes cn204 and cn205, which means that the user has direct access to the "**cn204-mic0**" and "**cn205-mic0**" accelerators.
!!! Note "Note"
Please note: At this point the user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:
- to connect to the second node: **$ ssh cn205**
- to connect to the accelerator on the first node from the first node: **$ ssh cn204-mic0** or **$ ssh mic0**
- to connect to the accelerator on the second node from the first node: **$ ssh cn205-mic0**
At this point we expect that the correct modules are loaded and the binary is compiled. For parallel execution mpiexec.hydra is used. Again, the first step is to tell mpiexec that the MPI can be executed on MIC accelerators by setting the environment variable "I_MPI_MIC":
```bash
$ export I_MPI_MIC=1
```
To launch the MPI program, use:
```bash
$ mpiexec.hydra -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ \
  -genv I_MPI_FABRICS_LIST tcp \
  -genv I_MPI_FABRICS shm:tcp \
  -genv I_MPI_TCP_NETMASK=10.1.0.0/16 \
  -host cn204-mic0 -n 4 ~/mpi-test-mic \
  : -host cn205-mic0 -n 6 ~/mpi-test-mic
```
or using mpirun:
```bash
$ mpirun -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ \
  -genv I_MPI_FABRICS_LIST tcp \
  -genv I_MPI_FABRICS shm:tcp \
  -genv I_MPI_TCP_NETMASK=10.1.0.0/16 \
  -host cn204-mic0 -n 4 ~/mpi-test-mic \
  : -host cn205-mic0 -n 6 ~/mpi-test-mic
```
In this case four MPI processes are executed on the accelerator cn204-mic0 and six processes are executed on the accelerator cn205-mic0. The sample output (sorted after execution) is:
```bash
Hello world from process 0 of 10 on host cn204-mic0
Hello world from process 1 of 10 on host cn204-mic0
Hello world from process 2 of 10 on host cn204-mic0
Hello world from process 3 of 10 on host cn204-mic0
Hello world from process 4 of 10 on host cn205-mic0
Hello world from process 5 of 10 on host cn205-mic0
Hello world from process 6 of 10 on host cn205-mic0
Hello world from process 7 of 10 on host cn205-mic0
Hello world from process 8 of 10 on host cn205-mic0
Hello world from process 9 of 10 on host cn205-mic0
```
In the same way, an MPI program can be executed on multiple hosts:
```bash
$ mpiexec.hydra -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ \
  -genv I_MPI_FABRICS_LIST tcp \
  -genv I_MPI_FABRICS shm:tcp \
  -genv I_MPI_TCP_NETMASK=10.1.0.0/16 \
  -host cn204 -n 4 ~/mpi-test \
  : -host cn205 -n 6 ~/mpi-test
```
### Symmetric model
In the symmetric mode MPI programs are executed on both the host computer(s) and the MIC accelerator(s). Since the MIC has a different architecture and requires a different binary file produced by the Intel compiler, two different files have to be compiled before the MPI program is executed.
In the previous section we have compiled two binary files, one for hosts "**mpi-test**" and one for MIC accelerators "**mpi-test-mic**". These two binaries can be executed at once using mpiexec.hydra:
```bash
$ mpiexec.hydra \
  -genv I_MPI_FABRICS_LIST tcp \
  -genv I_MPI_FABRICS shm:tcp \
  -genv I_MPI_TCP_NETMASK=10.1.0.0/16 \
  -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ \
  -host cn205 -n 2 ~/mpi-test \
  : -host cn205-mic0 -n 2 ~/mpi-test-mic
```
In this example the -genv parameters set up the required environment variables for execution. The "-host cn205" part specifies the binary that is executed on the host (here cn205) and the part after the colon specifies the binary that is executed on the accelerator (here cn205-mic0).
The output of the program is:
```bash
Hello world from process 0 of 4 on host cn205
Hello world from process 1 of 4 on host cn205
Hello world from process 2 of 4 on host cn205-mic0
Hello world from process 3 of 4 on host cn205-mic0
```
The execution procedure can be simplified by using the mpirun command with a machine file as a parameter. The machine file contains a list of all nodes and accelerators that should be used to execute MPI processes.
An example of a machine file that uses 2 hosts (**cn205** and **cn206**) and 2 accelerators (**cn205-mic0** and **cn206-mic0**) to run 2 MPI processes on each of them:
```bash
$ cat hosts_file_mix
cn205:2
cn205-mic0:2
cn206:2
cn206-mic0:2
```
In addition, if a naming convention is set in a way that the name of the binary for the host is **"bin_name"** and the name of the binary for the accelerator is **"bin_name-mic"**, then by setting the environment variable **I_MPI_MIC_POSTFIX** to **"-mic"** the user does not have to specify the names of both binaries. In this case mpirun needs just the name of the host binary file (i.e. "mpi-test") and uses the suffix to get the name of the binary for the accelerator (i.e. "mpi-test-mic").
```bash
$ export I_MPI_MIC_POSTFIX=-mic
```
To run the MPI code using mpirun and the machine file "hosts_file_mix" use:
```bash
$ mpirun \
  -genv I_MPI_FABRICS shm:tcp \
  -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ \
  -genv I_MPI_FABRICS_LIST tcp \
  -genv I_MPI_FABRICS shm:tcp \
  -genv I_MPI_TCP_NETMASK=10.1.0.0/16 \
  -machinefile hosts_file_mix \
  ~/mpi-test
```
A possible output of the MPI "hello-world" example executed on two hosts and two accelerators is:
```bash
Hello world from process 0 of 8 on host cn204
Hello world from process 1 of 8 on host cn204
Hello world from process 2 of 8 on host cn204-mic0
Hello world from process 3 of 8 on host cn204-mic0
Hello world from process 4 of 8 on host cn205
Hello world from process 5 of 8 on host cn205
Hello world from process 6 of 8 on host cn205-mic0
Hello world from process 7 of 8 on host cn205-mic0
```
!!! Note "Note"
Please note: At this point the MPI communication between MIC accelerators on different nodes uses 1Gb Ethernet only.
**Using the PBS automatically generated node-files**
PBS also generates a set of node-files that can be used instead of manually creating a new one every time. Three node-files are generated:
!!! Note "Note"
**Host only node-file:**
- /lscratch/${PBS_JOBID}/nodefile-cn
**MIC only node-file:**
- /lscratch/${PBS_JOBID}/nodefile-mic
**Host and MIC node-file:**
- /lscratch/${PBS_JOBID}/nodefile-mix
Please note that each host or accelerator is listed only once per file. The user has to specify how many processes should be executed per node using the "-n" parameter of the mpirun command.
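For illustration, a hedged sketch of launching the symmetric-model binaries from the examples above with the automatically generated mixed node-file could look like this (the total process count is only an example value, adjust it to your allocation):

```bash
$ export I_MPI_MIC=1
$ export I_MPI_MIC_POSTFIX=-mic
$ mpirun -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ \
    -genv I_MPI_FABRICS_LIST tcp \
    -genv I_MPI_FABRICS shm:tcp \
    -genv I_MPI_TCP_NETMASK=10.1.0.0/16 \
    -machinefile /lscratch/${PBS_JOBID}/nodefile-mix \
    -n 8 ~/mpi-test
```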
Optimization
------------
For more details about optimization techniques please read Intel document [Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors](http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization "http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization")
ISV Licenses
============
##A guide to managing Independent Software Vendor licences
On the Anselm cluster there are also installed commercial software applications, known as ISV (Independent Software Vendor) software, which are subject to licensing. The licenses are limited and their usage may be restricted only to some users or user groups.
Currently Flex License Manager based licensing is supported on the cluster for products Ansys, Comsol and Matlab. More information about the applications can be found in the general software section.
If an ISV application was purchased for educational (research) purposes as well as for commercial purposes, then two separate versions are maintained and the suffix "edu" is used in the name of the non-commercial version.
Overview of the licenses usage
------------------------------
!!! Note "Note"
The overview is generated every minute and is accessible from web or command line interface.
### Web interface
For each license there is a table which provides information about the name, the number of available (purchased/licensed), used and free license features:
<https://extranet.it4i.cz/anselm/licenses>
### Text interface
For each license there is a unique text file, which provides the information about the name, number of available (purchased/licensed), number of used and number of free license features. The text files are accessible from the Anselm command prompt.
|Product|File with license state|Note|
|---|---|---|
|ansys|/apps/user/licenses/ansys_features_state.txt|Commercial|
|comsol|/apps/user/licenses/comsol_features_state.txt|Commercial|
|comsol-edu|/apps/user/licenses/comsol-edu_features_state.txt|Non-commercial only|
|matlab|/apps/user/licenses/matlab_features_state.txt|Commercial|
|matlab-edu|/apps/user/licenses/matlab-edu_features_state.txt|Non-commercial only|
The file has a header which serves as a legend. All the info in the legend starts with a hash (#) so it can be easily filtered when parsing the file via a script.
Example of the Commercial Matlab license state:
```bash
$ cat /apps/user/licenses/matlab_features_state.txt
# matlab
# -------------------------------------------------
# FEATURE TOTAL USED AVAIL
# -------------------------------------------------
MATLAB 1 1 0
SIMULINK 1 0 1
Curve_Fitting_Toolbox 1 0 1
Signal_Blocks 1 0 1
GADS_Toolbox 1 0 1
Image_Toolbox 1 0 1
Compiler 1 0 1
Neural_Network_Toolbox 1 0 1
Optimization_Toolbox 1 0 1
Signal_Toolbox 1 0 1
Statistics_Toolbox 1 0 1
```
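Because the legend lines start with a hash, they are easy to filter out when scripting. A minimal sketch (the awk filter and the MATLAB feature are just example choices) that prints the number of free seats of one feature:

```bash
$ grep -v "#" /apps/user/licenses/matlab_features_state.txt | awk '$1 == "MATLAB" { print $4 }'
0
```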
License tracking in PBS Pro scheduler and users usage
-----------------------------------------------------
Each feature of each license is accounted and checked by the PBS Pro scheduler. If you ask for certain licenses, the scheduler won't start the job until the requested licenses are free (available). This prevents batch jobs from crashing just because of the unavailability of the needed licenses.
The general format of the name is:
**feature__APP__FEATURE**
Names of applications (APP):
- ansys
- comsol
- comsol-edu
- matlab
- matlab-edu
To get the FEATUREs of a license take a look into the corresponding state file ([see above](isv_licenses/#Licence)), or use:
**Application and List of provided features**
- **ansys** $ grep -v "#" /apps/user/licenses/ansys_features_state.txt | cut -f1 -d' '
- **comsol** $ grep -v "#" /apps/user/licenses/comsol_features_state.txt | cut -f1 -d' '
- **comsol-edu** $ grep -v "#" /apps/user/licenses/comsol-edu_features_state.txt | cut -f1 -d' '
- **matlab** $ grep -v "#" /apps/user/licenses/matlab_features_state.txt | cut -f1 -d' '
- **matlab-edu** $ grep -v "#" /apps/user/licenses/matlab-edu_features_state.txt | cut -f1 -d' '
Example of PBS Pro resource name, based on APP and FEATURE name:
|Application |Feature |PBS Pro resource name |
| --- | --- | --- |
|ansys |acfd |feature_ansys_acfd |
|ansys |aa_r |feature_ansys_aa_r |
|comsol |COMSOL |feature_comsol_COMSOL |
|comsol |HEATTRANSFER |feature_comsol_HEATTRANSFER |
|comsol-edu |COMSOLBATCH |feature_comsol-edu_COMSOLBATCH |
|comsol-edu |STRUCTURALMECHANICS |feature_comsol-edu_STRUCTURALMECHANICS |
|matlab |MATLAB |feature_matlab_MATLAB |
|matlab |Image_Toolbox |feature_matlab_Image_Toolbox |
|matlab-edu |MATLAB_Distrib_Comp_Engine |feature_matlab-edu_MATLAB_Distrib_Comp_Engine |
|matlab-edu |Image_Acquisition_Toolbox |feature_matlab-edu_Image_Acquisition_Toolbox |
**Be aware, that the resource names in PBS Pro are CASE SENSITIVE!**
### Example of qsub statement
Run an interactive PBS job with 1 Matlab EDU license, 1 Distributed Computing Toolbox and 32 Distributed Computing Engines (running on 32 cores):
```bash
$ qsub -I -q qprod -A PROJECT_ID -l select=2:ncpus=16 -l feature__matlab-edu__MATLAB=1 -l feature__matlab-edu__Distrib_Computing_Toolbox=1 -l feature__matlab-edu__MATLAB_Distrib_Comp_Engine=32
```
The license is used and accounted only with the real usage of the product. So in this example, the general Matlab license is used only after Matlab is run by the user and not at the time when the shell of the interactive job is started. Also, the Distributed Computing licenses are used at the time when the user uses distributed parallel computation in Matlab (e.g. issues pmode start, matlabpool, etc.).
Java
====
##Java on ANSELM
Java is available on the Anselm cluster. Activate java by loading the java module:
```bash
$ module load java
```
Note that the java module must be loaded on the compute nodes as well, in order to run java on compute nodes.
Check for java version and path
```bash
$ java -version
$ which java
```
With the module loaded, not only the runtime environment (JRE), but also the development environment (JDK) with the compiler is available.
```bash
$ javac -version
$ which javac
```
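To quickly verify that both the compiler and the runtime work, a minimal hello-world sketch (the file and class names are arbitrary examples) can be compiled and run as follows:

```bash
$ cat > HelloJava.java <<'EOF'
public class HelloJava {
    public static void main(String[] args) {
        System.out.println("Hello from Java");
    }
}
EOF
$ javac HelloJava.java
$ java HelloJava
Hello from Java
```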
Java applications may use MPI for interprocess communication, in conjunction with OpenMPI. Read more on <http://www.open-mpi.org/faq/?category=java>. This functionality is currently not supported on Anselm cluster. In case you require the java interface to MPI, please contact [Anselm support](https://support.it4i.cz/rt/).
Virtualization
==============
##Running virtual machines on compute nodes
Introduction
------------
There are situations when Anselm's environment is not suitable for user needs.
- Application requires a different operating system (e.g. Windows), or the application is not available for Linux
- Application requires different versions of base system libraries and tools
- Application requires specific setup (installation, configuration) of complex software stack
- Application requires privileged access to operating system
- ... and combinations of above cases
We offer a solution for these cases - **virtualization**. Anselm's environment gives users the possibility to run virtual machines on compute nodes. Users can create their own images of an operating system with a specific software stack and run instances of these images as virtual machines on compute nodes. Running virtual machines is provided by the standard mechanism of [Resource Allocation and Job Execution](../../resource-allocation-and-job-execution/introduction/).
The solution is based on the QEMU-KVM software stack and provides hardware-assisted x86 virtualization.
Limitations
-----------
Anselm's infrastructure was not designed for virtualization. Anselm's environment is not intended primarily for virtualization; the compute nodes, storage and all of Anselm's infrastructure are intended and optimized for running HPC jobs. This implies a suboptimal configuration of virtualization and some limitations.
Anselm's virtualization does not provide the performance and all the features of a native environment. There is a significant performance hit (degradation) in I/O performance (storage, network). Anselm's virtualization is not suitable for I/O (disk, network) intensive workloads.
Virtualization also has some drawbacks; it is not so easy to set up an efficient solution.
The solution described in chapter [HOWTO](virtualization/#howto) is suitable for single node tasks; it does not introduce virtual machine clustering.
!!! Note "Note"
Please consider virtualization as last resort solution for your needs.
Please consult use of virtualization with IT4Innovation's support.
For running Windows applications (when the source code and a native Linux application are not available), consider the use of Wine, the Windows compatibility layer. Many Windows applications can be run using Wine with less effort and better performance than when using virtualization.
Licensing
---------
IT4Innovations does not provide any licenses for operating systems and software of virtual machines. Users are (in accordance with the [Acceptable use policy document](http://www.it4i.cz/acceptable-use-policy.pdf)) fully responsible for licensing all software running in virtual machines on Anselm. Be aware of the complex conditions of licensing software in virtual environments.
!!! Note "Note"
Users are responsible for licensing OS e.g. MS Windows and all software running in their virtual machines.
HOWTO
----------
### Virtual Machine Job Workflow
We propose this job workflow:
![Workflow](../../img/virtualization-job-workflow "Virtualization Job Workflow")
Our recommended solution is that the job script creates a distinct shared job directory, which makes a central point for data exchange between Anselm's environment, the compute node (host) (e.g. HOME, SCRATCH, local scratch and other local or cluster filesystems) and the virtual machine (guest). The job script links or copies the input data and the instructions on what to do (the run script) for the virtual machine into the job directory; the virtual machine processes the input data according to the instructions in the job directory and stores the output back to the job directory. We recommend that the virtual machine runs in so called [snapshot mode](virtualization/#snapshot-mode), where the image is immutable - the image does not change, so one image can be used for many concurrent jobs.
### Procedure
1. Prepare image of your virtual machine
2. Optimize image of your virtual machine for Anselm's virtualization
3. Modify your image for running jobs
4. Create job script for executing virtual machine
5. Run jobs
### Prepare image of your virtual machine
You can either use your existing image or create new image from scratch.
QEMU currently supports these image types or formats:
- raw
- cloop
- cow
- qcow
- qcow2
- vmdk - VMware 3 & 4, or 6 image format, for exchanging images with that product
- vdi - VirtualBox 1.1 compatible image format, for exchanging images with VirtualBox.
You can convert your existing image using the qemu-img convert command. Supported formats of this command are: blkdebug blkverify bochs cloop cow dmg file ftp ftps host_cdrom host_device host_floppy http https nbd parallels qcow qcow2 qed raw sheepdog tftp vdi vhdx vmdk vpc vvfat.
We recommend using the advanced QEMU native image format qcow2.
[More about QEMU Images](http://en.wikibooks.org/wiki/QEMU/Images)
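For illustration, two typical qemu-img invocations (the file names and the 40G size are placeholders, not prescribed values):

```bash
# create a new, empty qcow2 image for a fresh installation
$ qemu-img create -f qcow2 win.img 40G

# convert an existing VMware image to the qcow2 format
$ qemu-img convert -f vmdk -O qcow2 image.vmdk image.qcow2
```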
### Optimize image of your virtual machine
Use virtio devices (for disk/drive and network adapter) and install virtio drivers (paravirtualized drivers) into virtual machine. There is significant performance gain when using virtio drivers. For more information see [Virtio Linux](http://www.linux-kvm.org/page/Virtio) and [Virtio Windows](http://www.linux-kvm.org/page/WindowsGuestDrivers/Download_Drivers).
Disable all unnecessary services and tasks. Restrict all unnecessary operating system operations.
Remove all unnecessary software and files.
Remove all paging space, swap files, partitions, etc.
Shrink your image. (It is recommended to zero all free space and reconvert image using qemu-img.)
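A possible way to do the shrinking, sketched under the assumption of a Linux guest and a qcow2 image (file names are placeholders):

```bash
# inside the guest: zero the free space, then remove the fill file
#   dd if=/dev/zero of=/zerofile bs=1M; rm -f /zerofile; sync
# on the host: reconvert the image, zeroed blocks are not stored in the new file
$ qemu-img convert -O qcow2 image.qcow2 image-small.qcow2
$ mv image-small.qcow2 image.qcow2
```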
### Modify your image for running jobs
Your image should run some kind of operating system startup script. The startup script should run the application and, when the application exits, shut down or quit the virtual machine.
We recommend that the startup script:
- maps the Job Directory from the host (from the compute node)
- runs the script (we call it the "run script") from the Job Directory and waits for the application's exit
- for management purposes, if the run script does not exist, waits for some time period (a few minutes)
- shuts down/quits the OS
For Windows operating systems we suggest using a Local Group Policy Startup script, for Linux operating systems rc.local, a runlevel init script or a similar service.
Example startup script for Windows virtual machine:
```bash
@echo off
set LOG=c:\startup.log
set MAPDRIVE=z:
set SCRIPT=%MAPDRIVE%\run.bat
set TIMEOUT=300
echo %DATE% %TIME% Running startup script>%LOG%
rem Mount share
echo %DATE% %TIME% Mounting shared drive>%LOG%
net use z: \\10.0.2.4\qemu >%LOG% 2>&1
dir z: >%LOG% 2>&1
echo. >%LOG%
if exist %MAPDRIVE% (
echo %DATE% %TIME% The drive "%MAPDRIVE%" exists>%LOG%
if exist %SCRIPT% (
echo %DATE% %TIME% The script file "%SCRIPT%"exists>%LOG%
echo %DATE% %TIME% Running script %SCRIPT%>%LOG%
set TIMEOUT=0
call %SCRIPT%
) else (
echo %DATE% %TIME% The script file "%SCRIPT%"does not exist>%LOG%
)
) else (
echo %DATE% %TIME% The drive "%MAPDRIVE%" does not exist>%LOG%
)
echo. >%LOG%
timeout /T %TIMEOUT%
echo %DATE% %TIME% Shut down>%LOG%
shutdown /s /t 0
```
The example startup script maps the shared job directory as drive z: and looks for a run script called run.bat. If the run script is found, it is executed; otherwise the script waits for 5 minutes and then shuts the virtual machine down.
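For a Linux guest, an analogous rc.local-style startup script might look like the sketch below. It assumes the SMB share of the user network backend described in the Networking and data sharing section and a run script named run.sh; both names are assumptions, not a prescribed interface:

```bash
#!/bin/sh
# sketch of a Linux guest startup script (e.g. called from rc.local)
LOG=/var/log/startup.log
MNT=/mnt/job                      # mount point for the shared job directory

mkdir -p $MNT
# mount the SMB share exported by the QEMU user network backend (10.0.2.4)
mount -t cifs //10.0.2.4/qemu $MNT -o guest >>$LOG 2>&1

if [ -x $MNT/run.sh ]; then
    $MNT/run.sh >>$LOG 2>&1       # run the job and wait for it to finish
else
    sleep 300                     # no run script: wait a few minutes for management access
fi
poweroff                          # shut the virtual machine down
```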
### Create job script for executing virtual machine
Create the job script according to the recommended [Virtual Machine Job Workflow](virtualization.html#virtual-machine-job-workflow).
Example job for Windows virtual machine:
```bash
#!/bin/sh
JOB_DIR=/scratch/$USER/win/${PBS_JOBID}
#Virtual machine settings
VM_IMAGE=~/work/img/win.img
VM_MEMORY=49152
VM_SMP=16
# Prepare job dir
mkdir -p ${JOB_DIR} && cd ${JOB_DIR} || exit 1
ln -s ~/work/win .
ln -s /scratch/$USER/data .
ln -s ~/work/win/script/run/run-appl.bat run.bat
# Run virtual machine
export TMPDIR=/lscratch/${PBS_JOBID}
module add qemu
qemu-system-x86_64 \
  -enable-kvm \
  -cpu host \
  -smp ${VM_SMP} \
  -m ${VM_MEMORY} \
  -vga std \
  -localtime \
  -usb -usbdevice tablet \
  -device virtio-net-pci,netdev=net0 \
  -netdev user,id=net0,smb=${JOB_DIR},hostfwd=tcp::3389-:3389 \
  -drive file=${VM_IMAGE},media=disk,if=virtio \
  -snapshot \
  -nographic
```
The job script links the application data (win), the input data (data) and the run script (run.bat) into the job directory and runs the virtual machine.
Example run script (run.bat) for Windows virtual machine:
```bash
z:
cd winappl
call application.bat z:data z:output
```
The run script runs the application from the shared job directory (mapped as drive z:), processes the input data (z:data) from the job directory and stores the output to the job directory (z:output).
### Run jobs
Run jobs as usual, see [Resource Allocation and Job Execution](../../resource-allocation-and-job-execution/introduction/). Use only full node allocation for virtualization jobs.
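For example, a full single-node allocation for the job script above could be submitted like this (PROJECT_ID is a placeholder):

```bash
$ qsub -A PROJECT_ID -q qprod -l select=1:ncpus=16 ./jobscript
```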
### Running Virtual Machines
Virtualization is enabled only on compute nodes, virtualization does not work on login nodes.
Load QEMU environment module:
```bash
$ module add qemu
```
Get help
```bash
$ man qemu
```
Run virtual machine (simple)
```bash
$ qemu-system-x86_64 -hda linux.img -enable-kvm -cpu host -smp 16 -m 32768 -vga std -vnc :0
$ qemu-system-x86_64 -hda win.img -enable-kvm -cpu host -smp 16 -m 32768 -vga std -localtime -usb -usbdevice tablet -vnc :0
```
You can access the virtual machine with a VNC viewer (option -vnc) by connecting to the IP address of the compute node. For VNC you must use the [VPN network](../../accessing-the-cluster/vpn-access/).
Install virtual machine from iso file
```bash
$ qemu-system-x86_64 -hda linux.img -enable-kvm -cpu host -smp 16 -m 32768 -vga std -cdrom linux-install.iso -boot d -vnc :0
$ qemu-system-x86_64 -hda win.img -enable-kvm -cpu host -smp 16 -m 32768 -vga std -localtime -usb -usbdevice tablet -cdrom win-install.iso -boot d -vnc :0
```
Run virtual machine using optimized devices, user network backend with sharing and port forwarding, in snapshot mode
```bash
$ qemu-system-x86_64 -drive file=linux.img,media=disk,if=virtio -enable-kvm -cpu host -smp 16 -m 32768 -vga std -device virtio-net-pci,netdev=net0 -netdev user,id=net0,smb=/scratch/$USER/tmp,hostfwd=tcp::2222-:22 -vnc :0 -snapshot
$ qemu-system-x86_64 -drive file=win.img,media=disk,if=virtio -enable-kvm -cpu host -smp 16 -m 32768 -vga std -localtime -usb -usbdevice tablet -device virtio-net-pci,netdev=net0 -netdev user,id=net0,smb=/scratch/$USER/tmp,hostfwd=tcp::3389-:3389 -vnc :0 -snapshot
```
Thanks to port forwarding you can access the virtual machine via SSH (Linux) or RDP (Windows) by connecting to the IP address of the compute node (and port 2222 for SSH). You must use the [VPN network](../../accessing-the-cluster/vpn-access/).
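For example, a Linux guest forwarded as above could be reached from a VPN-connected machine like this (the user name and node address are placeholders):

```bash
$ ssh -p 2222 guest_user@<compute-node-address>
```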
!!! Note "Note"
Keep in mind, that if you use virtio devices, you must have virtio drivers installed on your virtual machine.
### Networking and data sharing
For networking the virtual machine we suggest using the (default) user network backend (sometimes called slirp). This network backend NATs virtual machines and provides useful services for them, such as DHCP, DNS, SMB sharing and port forwarding.
In the default configuration the IP network 10.0.2.0/24 is used, the host has IP address 10.0.2.2, the DNS server 10.0.2.3, the SMB server 10.0.2.4 and virtual machines obtain addresses from the range 10.0.2.15-10.0.2.31. Virtual machines have access to Anselm's network via NAT on the compute node (host).
Simple network setup
```bash
$ qemu-system-x86_64 ... -net nic -net user
```
(This is the default when no -net options are given.)
Simple network setup with sharing and port forwarding (obsolete but simpler syntax, lower performance)
```bash
$ qemu-system-x86_64 ... -net nic -net user,smb=/scratch/$USER/tmp,hostfwd=tcp::3389-:3389
```
Optimized network setup with sharing and port forwarding
```bash
$ qemu-system-x86_64 ... -device virtio-net-pci,netdev=net0 -netdev user,id=net0,smb=/scratch/$USER/tmp,hostfwd=tcp::2222-:22
```
### Advanced networking
**Internet access**
Sometimes your virtual machine needs access to the internet (to install software, updates, software activation, etc). We suggest a solution using Virtual Distributed Ethernet (VDE) enabled QEMU with SLIRP running on the login node, tunnelled to the compute node. Be aware that this setup has very low performance, the worst performance of all the described solutions.
Load VDE enabled QEMU environment module (unload standard QEMU module first if necessary).
```bash
$ module add qemu/2.1.2-vde2
```
Create virtual network switch.
```bash
$ vde_switch -sock /tmp/sw0 -mgmt /tmp/sw0.mgmt -daemon
```
Run SLIRP daemon over SSH tunnel on login node and connect it to virtual network switch.
```bash
$ dpipe vde_plug /tmp/sw0 = ssh login1 $VDE2_DIR/bin/slirpvde -s - --dhcp &
```
Run qemu using vde network backend, connect to created virtual switch.
Basic setup (obsolete syntax)
```bash
$ qemu-system-x86_64 ... -net nic -net vde,sock=/tmp/sw0
```
Setup using virtio device (obsolete syntax)
```bash
$ qemu-system-x86_64 ... -net nic,model=virtio -net vde,sock=/tmp/sw0
```
Optimized setup
```bash
$ qemu-system-x86_64 ... -device virtio-net-pci,netdev=net0 -netdev vde,id=net0,sock=/tmp/sw0
```
**TAP interconnect**
Both the user and vde network backends have low performance. For a fast interconnect (10Gbps and more) of the compute node (host) and the virtual machine (guest) we suggest using the Linux kernel TAP device.
The Anselm cluster provides the TAP device tap0 for your job. The TAP interconnect does not provide any services (like NAT, DHCP, DNS, SMB, etc.), just raw networking, so you should provide your own services if you need them.
Run qemu with TAP network backend:
```bash
$ qemu-system-x86_64 ... -device virtio-net-pci,netdev=net1 \
  -netdev tap,id=net1,ifname=tap0,script=no,downscript=no
```
The interface tap0 has IP address 192.168.1.1 and network mask 255.255.255.0 (/24). In the virtual machine use an IP address from the range 192.168.1.2-192.168.1.254. For your convenience, some ports on the tap0 interface are redirected to higher numbered ports, so that you as a non-privileged user can provide services on these ports.
Redirected ports:
- DNS: udp/53 -> udp/3053, tcp/53 -> tcp/3053
- DHCP: udp/67 -> udp/3067
- SMB: tcp/139 -> tcp/3139, tcp/445 -> tcp/3445
You can configure the IP address of the virtual machine statically or dynamically. For dynamic addressing, provide your DHCP server on port 3067 of the tap0 interface; you can also provide your DNS server on port 3053 of the tap0 interface, for example:
```bash
$ dnsmasq --interface tap0 --bind-interfaces -p 3053 --dhcp-alternate-port=3067,68 --dhcp-range=192.168.1.15,192.168.1.32 --dhcp-leasefile=/tmp/dhcp.leasefile
```
You can also provide your SMB services (on ports 3139, 3445) to obtain high performance data sharing.
Example smb.conf (not optimized)
```bash
[global]
socket address=192.168.1.1
smb ports = 3445 3139
private dir=/tmp/qemu-smb
pid directory=/tmp/qemu-smb
lock directory=/tmp/qemu-smb
state directory=/tmp/qemu-smb
ncalrpc dir=/tmp/qemu-smb/ncalrpc
log file=/tmp/qemu-smb/log.smbd
smb passwd file=/tmp/qemu-smb/smbpasswd
security = user
map to guest = Bad User
unix extensions = no
load printers = no
printing = bsd
printcap name = /dev/null
disable spoolss = yes
log level = 1
guest account = USER
[qemu]
path=/scratch/USER/tmp
read only=no
guest ok=yes
writable=yes
follow symlinks=yes
wide links=yes
force user=USER
```
(Replace USER with your login name.)
Run SMB services
```bash
smbd -s /tmp/qemu-smb/smb.conf
```
A virtual machine can of course have more than one network interface controller and can use more than one network backend. So you can, for example, combine the user network backend and the TAP interconnect.
### Snapshot mode
In snapshot mode the image is not written to; changes are written to a temporary file (and discarded after the virtual machine exits). **It is the strongly recommended mode for running your jobs.** Set the TMPDIR environment variable to the local scratch directory for placement of the temporary files.
```bash
$ export TMPDIR=/lscratch/${PBS_JOBID}
$ qemu-system-x86_64 ... -snapshot
```
### Windows guests
For Windows guests we recommend these options to make life easier:
```bash
$ qemu-system-x86_64 ... -localtime -usb -usbdevice tablet
```
Running OpenMPI
==============
OpenMPI program execution
-------------------------
The OpenMPI programs may be executed only via the PBS Workload manager, by entering an appropriate queue. On Anselm, the **bullxmpi-1.2.4.1** and **OpenMPI 1.6.5** are OpenMPI based MPI implementations.
### Basic usage
!!! Note "Note"
Use the mpiexec to run the OpenMPI code.
Example:
```bash
$ qsub -q qexp -l select=4:ncpus=16 -I
qsub: waiting for job 15210.srv11 to start
qsub: job 15210.srv11 ready
$ pwd
/home/username
$ module load openmpi
$ mpiexec -pernode ./helloworld_mpi.x
Hello world! from rank 0 of 4 on host cn17
Hello world! from rank 1 of 4 on host cn108
Hello world! from rank 2 of 4 on host cn109
Hello world! from rank 3 of 4 on host cn110
```
!!! Note "Note"
Please be aware, that in this example, the directive **-pernode** is used to run only **one task per node**, which is normally an unwanted behaviour (unless you want to run hybrid code with just one MPI and 16 OpenMP tasks per node). In normal MPI programs **omit the -pernode directive** to run up to 16 MPI tasks per each node.
In this example, we allocate 4 nodes via the express queue interactively. We set up the openmpi environment and interactively run the helloworld_mpi.x program. Note that the executable helloworld_mpi.x must be available within the
same path on all nodes. This is automatically fulfilled on the /home and /scratch filesystem.
You need to preload the executable, if running on the local scratch /lscratch filesystem
```bash
$ pwd
/lscratch/15210.srv11
$ mpiexec -pernode --preload-binary ./helloworld_mpi.x
Hello world! from rank 0 of 4 on host cn17
Hello world! from rank 1 of 4 on host cn108
Hello world! from rank 2 of 4 on host cn109
Hello world! from rank 3 of 4 on host cn110
```
In this example, we assume the executable helloworld_mpi.x is present on compute node cn17 on the local scratch. We call mpiexec with the **--preload-binary** argument (valid for openmpi). The mpiexec will copy the executable from cn17 to the /lscratch/15210.srv11 directory on cn108, cn109 and cn110 and execute the program.
!!! Note "Note"
MPI process mapping may be controlled by PBS parameters.
The mpiprocs and ompthreads parameters allow for selection of number of running MPI processes per node as well as number of OpenMP threads per MPI process.
### One MPI process per node
Follow this example to run one MPI process per node, 16 threads per process.
```bash
$ qsub -q qexp -l select=4:ncpus=16:mpiprocs=1:ompthreads=16 -I
$ module load openmpi
$ mpiexec --bind-to-none ./helloworld_mpi.x
```
In this example, we demonstrate the recommended way to run an MPI application, using 1 MPI process per node and 16 threads per process, on 4 nodes.
### Two MPI processes per node
Follow this example to run two MPI processes per node, 8 threads per process. Note the options to mpiexec.
```bash
$ qsub -q qexp -l select=4:ncpus=16:mpiprocs=2:ompthreads=8 -I
$ module load openmpi
$ mpiexec -bysocket -bind-to-socket ./helloworld_mpi.x
```
In this example, we demonstrate the recommended way to run an MPI application, using 2 MPI processes per node and 8 threads per process, each process and its threads bound to a separate processor socket of the node, on 4 nodes.
### 16 MPI processes per node
Follow this example to run 16 MPI processes per node, 1 thread per process. Note the options to mpiexec.
```bash
$ qsub -q qexp -l select=4:ncpus=16:mpiprocs=16:ompthreads=1 -I
$ module load openmpi
$ mpiexec -bycore -bind-to-core ./helloworld_mpi.x
```
In this example, we demonstrate the recommended way to run an MPI application, using 16 MPI processes per node, single threaded. Each process is bound to a separate processor core, on 4 nodes.
### OpenMP thread affinity
!!! Note "Note"
Important! Bind every OpenMP thread to a core!
In the previous two examples with one or two MPI processes per node, the operating system might still migrate OpenMP threads between cores. You might want to avoid this by setting these environment variable for GCC OpenMP:
```bash
$ export GOMP_CPU_AFFINITY="0-15"
```
or this one for Intel OpenMP:
```bash
$ export KMP_AFFINITY=granularity=fine,compact,1,0
```
As of OpenMP 4.0 (supported by GCC 4.9 and later and Intel 14.0 and later) the following variables may be used for Intel or GCC:
```bash
$ export OMP_PROC_BIND=true
$ export OMP_PLACES=cores
```
OpenMPI Process Mapping and Binding
------------------------------------------------
The mpiexec allows for precise selection of how the MPI processes will be mapped to the computational nodes and how these processes will bind to particular processor sockets and cores.
MPI process mapping may be specified by a hostfile or rankfile input to the mpiexec program. Although all implementations of MPI provide means for process mapping and binding, the following examples are valid for openmpi only.
### Hostfile
Example hostfile
```bash
cn110.bullx
cn109.bullx
cn108.bullx
cn17.bullx
```
Use the hostfile to control process placement
```bash
$ mpiexec -hostfile hostfile ./helloworld_mpi.x
Hello world! from rank 0 of 4 on host cn110
Hello world! from rank 1 of 4 on host cn109
Hello world! from rank 2 of 4 on host cn108
Hello world! from rank 3 of 4 on host cn17
```
In this example, we see that the ranks have been mapped onto the nodes according to the order in which the nodes appear in the hostfile.
### Rankfile
Exact control of MPI process placement and resource binding is provided by specifying a rankfile
!!! Note "Note"
Appropriate binding may boost performance of your application.
Example rankfile
```bash
rank 0=cn110.bullx slot=1:0,1
rank 1=cn109.bullx slot=0:*
rank 2=cn108.bullx slot=1:1-2
rank 3=cn17.bullx slot=0:1,1:0-2
rank 4=cn109.bullx slot=0:*,1:*
```
This rankfile assumes 5 ranks will be running on 4 nodes and provides exact mapping and binding of the processes to the processor sockets and cores.
Explanation:
rank 0 will be bound to cn110, socket1 core0 and core1
rank 1 will be bound to cn109, socket0, all cores
rank 2 will be bound to cn108, socket1, core1 and core2
rank 3 will be bound to cn17, socket0 core1, socket1 core0, core1, core2
rank 4 will be bound to cn109, all cores on both sockets
```bash
$ mpiexec -n 5 -rf rankfile --report-bindings ./helloworld_mpi.x
[cn17:11180] MCW rank 3 bound to socket 0[core 1] socket 1[core 0-2]: [. B . . . . . .][B B B . . . . .] (slot list 0:1,1:0-2)
[cn110:09928] MCW rank 0 bound to socket 1[core 0-1]: [. . . . . . . .][B B . . . . . .] (slot list 1:0,1)
[cn109:10395] MCW rank 1 bound to socket 0[core 0-7]: [B B B B B B B B][. . . . . . . .] (slot list 0:*)
[cn108:10406] MCW rank 2 bound to socket 1[core 1-2]: [. . . . . . . .][. B B . . . . .] (slot list 1:1-2)
[cn109:10406] MCW rank 4 bound to socket 0[core 0-7] socket 1[core 0-7]: [B B B B B B B B][B B B B B B B B] (slot list 0:*,1:*)
Hello world! from rank 3 of 5 on host cn17
Hello world! from rank 1 of 5 on host cn109
Hello world! from rank 0 of 5 on host cn110
Hello world! from rank 4 of 5 on host cn109
Hello world! from rank 2 of 5 on host cn108
```
In this example we run 5 MPI processes (5 ranks) on four nodes. The rankfile defines how the processes will be mapped on the nodes, sockets and cores. The **--report-bindings** option was used to print out the actual process location and bindings. Note that ranks 1 and 4 run on the same node and their core binding overlaps.
It is the user's responsibility to provide the correct number of ranks, sockets and cores.
### Bindings verification
In all cases, binding and threading may be verified by executing for example:
```bash
$ mpiexec -bysocket -bind-to-socket --report-bindings echo
$ mpiexec -bysocket -bind-to-socket numactl --show
$ mpiexec -bysocket -bind-to-socket echo $OMP_NUM_THREADS
```
Changes in OpenMPI 1.8
----------------------
Some options have changed in OpenMPI version 1.8.
|version 1.6.5 |version 1.8.1 |
| --- | --- |
|--bind-to-none |--bind-to none |
|--bind-to-core |--bind-to core |
|--bind-to-socket |--bind-to socket |
|-bysocket |--map-by socket |
|-bycore |--map-by core |
|-pernode |--map-by ppr:1:node |
MPI
===
Setting up MPI Environment
--------------------------
The Anselm cluster provides several implementations of the MPI library:
|MPI Library |Thread support |
| --- | --- |
|The highly optimized and stable **bullxmpi 1.2.4.1** |Partial thread support up to MPI_THREAD_SERIALIZED |
|The **Intel MPI 4.1** |Full thread support up to MPI_THREAD_MULTIPLE |
|The [OpenMPI 1.6.5](http://www.open-mpi.org/)| Full thread support up to MPI_THREAD_MULTIPLE, BLCR c/r support |
|The OpenMPI 1.8.1 |Full thread support up to MPI_THREAD_MULTIPLE, MPI-3.0 support |
|The **mpich2 1.9** |Full thread support up to MPI_THREAD_MULTIPLE, BLCR c/r support |
MPI libraries are activated via the environment modules.
Look up section modulefiles/mpi in module avail
```bash
$ module avail
------------------------- /opt/modules/modulefiles/mpi -------------------------
bullxmpi/bullxmpi-1.2.4.1 mvapich2/1.9-icc
impi/4.0.3.008 openmpi/1.6.5-gcc(default)
impi/4.1.0.024 openmpi/1.6.5-gcc46
impi/4.1.0.030 openmpi/1.6.5-icc
impi/4.1.1.036(default) openmpi/1.8.1-gcc
openmpi/1.8.1-gcc46
mvapich2/1.9-gcc(default) openmpi/1.8.1-gcc49
mvapich2/1.9-gcc46 openmpi/1.8.1-icc
```
There are default compilers associated with any particular MPI implementation. The defaults may be changed; the MPI libraries may be used in conjunction with any compiler. The defaults are selected via the modules in the following way:
|Module|MPI|Compiler suite|
|-------- |---|---|
|PrgEnv-gnu|bullxmpi-1.2.4.1|bullx GNU 4.4.6|
|PrgEnv-intel|Intel MPI 4.1.1|Intel 13.1.1|
|bullxmpi|bullxmpi-1.2.4.1|none, select via module|
|impi|Intel MPI 4.1.1|none, select via module|
|openmpi|OpenMPI 1.6.5|GNU compilers 4.8.1, GNU compilers 4.4.6, Intel Compilers|
|openmpi|OpenMPI 1.8.1|GNU compilers 4.8.1, GNU compilers 4.4.6, GNU compilers 4.9.0, Intel Compilers|
|mvapich2|MPICH2 1.9|GNU compilers 4.8.1, GNU compilers 4.4.6, Intel Compilers|
Examples:
```bash
$ module load openmpi
```
In this example, we activate the latest openmpi with the latest GNU compilers.
To use openmpi with the intel compiler suite, use
```bash
$ module load intel
$ module load openmpi/1.6.5-icc
```
In this example, openmpi 1.6.5 using the intel compilers is activated.
Compiling MPI Programs
----------------------
!!! Note "Note"
After setting up your MPI environment, compile your program using one of the mpi wrappers
```bash
$ mpicc -v
$ mpif77 -v
$ mpif90 -v
```
Example program:
```cpp
// helloworld_mpi.c
#include <stdio.h>
#include <mpi.h>
int main(int argc, char **argv) {
int len;
int rank, size;
char node[MPI_MAX_PROCESSOR_NAME];
// Initiate MPI
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD,&rank);
MPI_Comm_size(MPI_COMM_WORLD,&size);
// Get hostname and print
MPI_Get_processor_name(node,&len);
printf("Hello world! from rank %d of %d on host %s\n",rank,size,node);
// Finalize and exit
MPI_Finalize();
return 0;
}
```
Compile the above example with
```bash
$ mpicc helloworld_mpi.c -o helloworld_mpi.x
```
Running MPI Programs
--------------------
!!! Note "Note"
The MPI program executable must be compatible with the loaded MPI module.
Always compile and execute using the very same MPI module.
It is strongly discouraged to mix MPI implementations. Linking an application with one MPI implementation and running mpirun/mpiexec from another implementation may result in unexpected errors.
The MPI program executable must be available within the same path on all nodes. This is automatically fulfilled on the /home and /scratch filesystem. You need to preload the executable, if running on the local scratch /lscratch filesystem.
### Ways to run MPI programs
Optimal way to run an MPI program depends on its memory requirements, memory access pattern and communication pattern.
!!! Note "Note"
Consider these ways to run an MPI program:
1. One MPI process per node, 16 threads per process
2. Two MPI processes per node, 8 threads per process
3. 16 MPI processes per node, 1 thread per process.
**One MPI** process per node, using 16 threads, is most useful for memory demanding applications, that make good use of processor cache memory and are not memory bound. This is also a preferred way for communication intensive applications as one process per node enjoys full bandwidth access to the network interface.
**Two MPI** processes per node, using 8 threads each, bound to processor socket is most useful for memory bandwidth bound applications such as BLAS1 or FFT, with scalable memory demand. However, note that the two processes will share access to the network interface. The 8 threads and socket binding should ensure maximum memory access bandwidth and minimize communication, migration and numa effect overheads.
!!! Note "Note"
Important! Bind every OpenMP thread to a core!
In the previous two cases with one or two MPI processes per node, the operating system might still migrate OpenMP threads between cores. You want to avoid this by setting the KMP_AFFINITY or GOMP_CPU_AFFINITY environment variables.
**16 MPI** processes per node, using 1 thread each bound to processor core is most suitable for highly scalable applications with low communication demand.
### Running OpenMPI
The **bullxmpi-1.2.4.1** and [**OpenMPI 1.6.5**](http://www.open-mpi.org/) are both based on OpenMPI. Read more on [how to run OpenMPI](Running_OpenMPI/) based MPI.
### Running MPICH2
The **Intel MPI** and **mpich2 1.9** are MPICH2 based implementations. Read more on [how to run MPICH2](running-mpich2/) based MPI.
The Intel MPI may run on the Intel Xeon Phi accelerators as well. Read more on [how to run Intel MPI on accelerators](../intel-xeon-phi/).
MPI4Py (MPI for Python)
=======================
Introduction
------------
MPI for Python provides bindings of the Message Passing Interface (MPI) standard for the Python programming language, allowing any Python program to exploit multiple processors.
This package is constructed on top of the MPI-1/2 specifications and provides an object oriented interface which closely follows MPI-2 C++ bindings. It supports point-to-point (sends, receives) and collective (broadcasts, scatters, gathers) communications of any picklable Python object, as well as optimized communications of Python object exposing the single-segment buffer interface (NumPy arrays, builtin bytes/string/array objects).
On Anselm MPI4Py is available in standard Python modules.
Modules
-------
MPI4Py is built for OpenMPI. Before you start with MPI4Py, you need to load the Python and OpenMPI modules.
```bash
$ module load python
$ module load openmpi
```
Execution
---------
You need to import MPI to your python program. Include the following line to the python script:
```python
from mpi4py import MPI
```
The MPI4Py enabled python programs [execute as any other OpenMPI](Running_OpenMPI/) code. The simplest way is to run
```bash
$ mpiexec python <script>.py
```
For example
```bash
$ mpiexec python hello_world.py
```
Examples
--------
### Hello world!
```python
from mpi4py import MPI
comm = MPI.COMM_WORLD
print "Hello! I'm rank %d from %d running in total..." % (comm.rank, comm.size)
comm.Barrier() # wait for everybody to synchronize
```
### Collective Communication with NumPy arrays
```python
from __future__ import division
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
print("-"*78)
print(" Running on %d cores" % comm.size)
print("-"*78)
comm.Barrier()
# Prepare a vector of N=5 elements to be broadcasted...
N = 5
if comm.rank == 0:
A = np.arange(N, dtype=np.float64) # rank 0 has proper data
else:
A = np.empty(N, dtype=np.float64) # all other just an empty array
# Broadcast A from rank 0 to everybody
comm.Bcast( [A, MPI.DOUBLE] )
# Everybody should now have the same...
print "[%02d] %s" % (comm.rank, A)
```
Execute the above code as:
```bash
$ qsub -q qexp -l select=4:ncpus=16:mpiprocs=16:ompthreads=1 -I
$ module load python openmpi
$ mpiexec -bycore -bind-to-core python hello_world.py
```
In this example, we run MPI4Py enabled code on 4 nodes, 16 cores per node (total of 64 processes), each python process is bound to a different core. More examples and documentation can be found on [MPI for Python webpage](https://pythonhosted.org/mpi4py/usrman/index.html).
Running MPICH2
==============
MPICH2 program execution
------------------------
The MPICH2 programs use the mpd daemon or an ssh connection to spawn processes; no PBS support is needed. However, the PBS allocation is required to access compute nodes. On Anselm, the **Intel MPI** and **mpich2 1.9** are MPICH2 based MPI implementations.
### Basic usage
!!! Note "Note"
Use the mpirun to execute the MPICH2 code.
Example:
```bash
$ qsub -q qexp -l select=4:ncpus=16 -I
qsub: waiting for job 15210.srv11 to start
qsub: job 15210.srv11 ready
$ module load impi
$ mpirun -ppn 1 -hostfile $PBS_NODEFILE ./helloworld_mpi.x
Hello world! from rank 0 of 4 on host cn17
Hello world! from rank 1 of 4 on host cn108
Hello world! from rank 2 of 4 on host cn109
Hello world! from rank 3 of 4 on host cn110
```
In this example, we allocate 4 nodes via the express queue interactively. We set up the intel MPI environment and interactively run the helloworld_mpi.x program. We request MPI to spawn 1 process per node.
Note that the executable helloworld_mpi.x must be available within the same path on all nodes. This is automatically fulfilled on the /home and /scratch filesystem.
You need to preload the executable, if running on the local scratch /lscratch filesystem
```bash
$ pwd
/lscratch/15210.srv11
$ mpirun -ppn 1 -hostfile $PBS_NODEFILE cp /home/username/helloworld_mpi.x .
$ mpirun -ppn 1 -hostfile $PBS_NODEFILE ./helloworld_mpi.x
Hello world! from rank 0 of 4 on host cn17
Hello world! from rank 1 of 4 on host cn108
Hello world! from rank 2 of 4 on host cn109
Hello world! from rank 3 of 4 on host cn110
```
In this example, we assume the executable helloworld_mpi.x is present in the shared home directory. We run the cp command via mpirun, copying the executable from the shared home to the local scratch. The second mpirun will execute the binary in the /lscratch/15210.srv11 directory on nodes cn17, cn108, cn109 and cn110, one process per node.
!!! Note "Note"
MPI process mapping may be controlled by PBS parameters.
The mpiprocs and ompthreads parameters allow for selection of number of running MPI processes per node as well as number of OpenMP threads per MPI process.
### One MPI process per node
Follow this example to run one MPI process per node, 16 threads per process. Note that no options to mpirun are needed
```bash
$ qsub -q qexp -l select=4:ncpus=16:mpiprocs=1:ompthreads=16 -I
$ module load mvapich2
$ mpirun ./helloworld_mpi.x
```
In this example, we demonstrate the recommended way to run an MPI application, using 1 MPI process per node and 16 threads per process, on 4 nodes.
### Two MPI processes per node
Follow this example to run two MPI processes per node, 8 threads per process. Note the options to mpirun for mvapich2. No options are needed for impi.
```bash
$ qsub -q qexp -l select=4:ncpus=16:mpiprocs=2:ompthreads=8 -I
$ module load mvapich2
$ mpirun -bind-to numa ./helloworld_mpi.x
```
In this example, we demonstrate recommended way to run an MPI application, using 2 MPI processes per node and 8 threads per socket, each process and its threads bound to a separate processor socket of the node, on 4 nodes
### 16 MPI processes per node
Follow this example to run 16 MPI processes per node, 1 thread per process. Note the options to mpirun for mvapich2. No options are needed for impi.
```bash
$ qsub -q qexp -l select=4:ncpus=16:mpiprocs=16:ompthreads=1 -I
$ module load mvapich2
$ mpirun -bind-to core ./helloworld_mpi.x
```
In this example, we demonstrate recommended way to run an MPI application, using 16 MPI processes per node, single threaded. Each process is bound to separate processor core, on 4 nodes.
### OpenMP thread affinity
!!! Note "Note"
Important! Bind every OpenMP thread to a core!
In the previous two examples with one or two MPI processes per node, the operating system might still migrate OpenMP threads between cores. You might want to avoid this by setting these environment variable for GCC OpenMP:
```bash
$ export GOMP_CPU_AFFINITY="0-15"
```
or this one for Intel OpenMP:
```bash
$ export KMP_AFFINITY=granularity=fine,compact,1,0
```
As of OpenMP 4.0 (supported by GCC 4.9 and later and Intel 14.0 and later) the following variables may be used for Intel or GCC:
```bash
$ export OMP_PROC_BIND=true
$ export OMP_PLACES=cores
```
MPICH2 Process Mapping and Binding
----------------------------------
The mpirun allows for precise selection of how the MPI processes will be mapped to the computational nodes and how these processes will bind to particular processor sockets and cores.
### Machinefile
Process mapping may be controlled by specifying a machinefile input to the mpirun program. Although all implementations of MPI provide means for process mapping and binding, the following examples are valid for impi and mvapich2 only.
Example machinefile
```bash
cn110.bullx
cn109.bullx
cn108.bullx
cn17.bullx
cn108.bullx
```
Use the machinefile to control process placement
```bash
$ mpirun -machinefile machinefile helloworld_mpi.x
Hello world! from rank 0 of 5 on host cn110
Hello world! from rank 1 of 5 on host cn109
Hello world! from rank 2 of 5 on host cn108
Hello world! from rank 3 of 5 on host cn17
Hello world! from rank 4 of 5 on host cn108
```
In this example, we see that the ranks have been mapped onto the nodes according to the order in which the nodes appear in the machinefile.
### Process Binding
The Intel MPI automatically binds each process and its threads to the corresponding portion of cores on the processor socket of the node; no options are needed. The binding is primarily controlled by environment variables. Read more about MPI process binding on the [Intel website](https://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/Reference_Manual/Environment_Variables_Process_Pinning.htm). The MPICH2 uses the -bind-to option. Use -bind-to numa or -bind-to core to bind the process to an entire socket or a single core.
### Bindings verification
In all cases, binding and threading may be verified by executing
```bash
$ mpirun -bind-to numa numactl --show
$ mpirun -bind-to numa echo $OMP_NUM_THREADS
```
Intel MPI on Xeon Phi
---------------------
The [MPI section of the Intel Xeon Phi chapter](../intel-xeon-phi/) provides details on how to run Intel MPI code on the Xeon Phi architecture.
Numerical languages
===================
Interpreted languages for numerical computations and analysis
Introduction
------------
This section contains a collection of high-level interpreted languages, primarily intended for numerical computations.
Matlab
------
MATLAB® is a high-level language and interactive environment for numerical computation, visualization, and programming.
```bash
$ module load MATLAB/2015b-EDU
$ matlab
```
Read more at the [Matlab page](matlab/).
Octave
------
GNU Octave is a high-level interpreted language, primarily intended for numerical computations. The Octave language is quite similar to Matlab so that most programs are easily portable.
```bash
$ module load Octave
$ octave
```
Read more at the [Octave page](octave/).
R
---
R is an interpreted language and environment for statistical computing and graphics.
```bash
$ module load R
$ R
```
Read more at the [R page](r/).
Matlab
======
Introduction
------------
Matlab is available in versions R2015a and R2015b. There are always two variants of the release:
- Non-commercial or so called EDU variant, which can be used for common research and educational purposes.
- Commercial or so called COM variant, which can be used also for commercial activities. The licenses for the commercial variant are much more expensive, so usually the commercial variant has only a subset of the features available in the EDU variant.
To load the latest version of Matlab load the module
```bash
$ module load MATLAB
```
The EDU variant is marked as the default. If you need another version or variant, load the particular version. To obtain the list of available versions use
```bash
$ module avail MATLAB
```
If you need to use the Matlab GUI to prepare your Matlab programs, you can use Matlab directly on the login nodes. But for all computations use Matlab on the compute nodes via PBS Pro scheduler.
If you require the Matlab GUI, please follow the general information about [running graphical applications](../../../get-started-with-it4innovations/accessing-the-clusters/graphical-user-interface/x-window-system/).
The Matlab GUI is quite slow using the X forwarding built into PBS (qsub -X), so using X11 display redirection either via SSH or directly by xauth (please see the "GUI Applications on Compute Nodes over VNC" part [here](../../../get-started-with-it4innovations/accessing-the-clusters/graphical-user-interface/x-window-system/x-window-system/)) is recommended.
To run Matlab with GUI, use
```bash
$ matlab
```
To run Matlab in text mode, without the Matlab Desktop GUI environment, use
```bash
$ matlab -nodesktop -nosplash
```
Plots, images, etc. will still be available.
Running parallel Matlab using Distributed Computing Toolbox / Engine
------------------------------------------------------------------------
!!! Note "Note"
Distributed toolbox is available only for the EDU variant
The MPIEXEC mode available in previous versions is no longer available in MATLAB 2015. Also, the programming interface has changed. Refer to [Release Notes](http://www.mathworks.com/help/distcomp/release-notes.html#buanp9e-1).
Delete the previously used file mpiLibConf.m; we have observed crashes when using Intel MPI.
To use Distributed Computing, you first need to set up a parallel profile. We have provided the profile for you; you can either import it on the MATLAB command line:
```bash
>> parallel.importProfile('/apps/all/MATLAB/2015a-EDU/SalomonPBSPro.settings')
ans =
SalomonPBSPro
```
Or in the GUI, go to tab HOME -> Parallel -> Manage Cluster Profiles..., click Import and navigate to:
/apps/all/MATLAB/2015a-EDU/SalomonPBSPro.settings
With the new mode, MATLAB itself launches the workers via PBS, so you can either use interactive mode or a batch mode on one node, but the actual parallel processing will be done in a separate job started by MATLAB itself. Alternatively, you can use "local" mode to run parallel code on just a single node.
!!! Note "Note"
The profile is confusingly named Salomon, but you can use it also on Anselm.
### Parallel Matlab interactive session
The following example shows how to start an interactive session with support for the Matlab GUI. For more information about GUI based applications on Anselm see [this page](../../../get-started-with-it4innovations/accessing-the-clusters/graphical-user-interface/x-window-system/x-window-system/).
```bash
$ xhost +
$ qsub -I -v DISPLAY=$(uname -n):$(echo $DISPLAY | cut -d ':' -f 2) -A NONE-0-0 -q qexp -l select=1 -l walltime=00:30:00 \
  -l feature__matlab__MATLAB=1
```
This qsub command example shows how to run Matlab on a single node.
The second part of the command shows how to request the necessary licenses; in this case, 1 Matlab-EDU license and the corresponding number of Distributed Computing Engine licenses (one per worker).
Once access to the compute nodes is granted by PBS, the user can load the following modules and start Matlab:
```bash
r1i0n17$ module load MATLAB/2015b-EDU
r1i0n17$ matlab &
```
### Parallel Matlab batch job in Local mode
To run Matlab in batch mode, write a Matlab script, then write a bash jobscript and submit it via the qsub command. By default, Matlab will execute one Matlab worker instance per allocated core.
```bash
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=1:ncpus=16:mpiprocs=16:ompthreads=1
# change to shared scratch directory
SCR=/scratch/work/user/$USER/$PBS_JOBID
mkdir -p $SCR ; cd $SCR || exit
# copy input file to scratch
cp $PBS_O_WORKDIR/matlabcode.m .
# load modules
module load MATLAB/2015a-EDU
# execute the calculation
matlab -nodisplay -r matlabcode > output.out
# copy output file to home
cp output.out $PBS_O_WORKDIR/.
```
This script may be submitted directly to the PBS workload manager via the qsub command. The inputs and the Matlab script are in the matlabcode.m file, the outputs in the output.out file. Note the missing .m extension in the matlab -r matlabcode call, **the .m must not be included**. Note that the **shared /scratch must be used**. Further, it is **important to include a quit** statement at the end of the matlabcode.m script.
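A minimal sketch of such a matlabcode.m is shown below (the actual computation is only a placeholder); note the quit statement at the end:
```bash
% matlabcode.m -- minimal skeleton for non-interactive batch execution
disp('Job started');
A = rand(1000);      % placeholder workload
disp(sum(A(:)));
quit                 % important: terminate MATLAB so the PBS job can finish
```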
Submit the jobscript using qsub
```bash
$ qsub ./jobscript
```
### Parallel Matlab Local mode program example
The last part of the configuration is done directly in the user's Matlab script before the Distributed Computing Toolbox is started.
```bash
cluster = parcluster('local')
```
This command creates a scheduler object "cluster" of type "local" that starts workers locally.
!!! Note "Note"
Please note: Every Matlab script that needs to initialize/use the parallel pool has to contain this line prior to calling the parpool(cluster, ...) function.
The last step is to start the parallel pool with the "cluster" object and the correct number of workers. Anselm nodes have 16 cores, so we start 16 workers.
```bash
pool = parpool(cluster,16);
... parallel code ...
delete(pool)
```
The complete example showing how to use Distributed Computing Toolbox in local mode is shown here.
```bash
cluster = parcluster('local');
cluster
pool = parpool(cluster,16);   % one worker per core on a single 16-core node
n=2000;
W = rand(n,n);
W = distributed(W);
x = (1:n)';
x = distributed(x);
spmd
[~, name] = system('hostname')
T = W*x; % Calculation performed on labs, in parallel.
% T and W are both codistributed arrays here.
end
T;
whos % T and W are both distributed arrays here.
delete(pool)
quit
```
You can copy and paste the example into a .m file and execute it. Note that the parpool size should correspond to the **total number of cores** available on the allocated nodes.
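If you prefer not to hard-code the pool size, it can be read from an environment variable exported in the jobscript; NWORKERS below is a hypothetical variable name, not something set by PBS:
```bash
% Hypothetical: export NWORKERS=16 in the jobscript before starting MATLAB
nworkers = str2double(getenv('NWORKERS'));
if isnan(nworkers)
    nworkers = 16;   % fall back to one full 16-core node
end
pool = parpool(cluster, nworkers);
```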
### Parallel Matlab Batch job using PBS mode (workers spawned in a separate job)
This mode uses the PBS scheduler to launch the parallel pool. It uses the SalomonPBSPro profile that needs to be imported into the Cluster Manager, as mentioned before. This method uses MATLAB's PBS Scheduler interface - it spawns the workers in a separate job submitted by MATLAB using qsub.
This is an example of an m-script using PBS mode:
```bash
cluster = parcluster('SalomonPBSPro');
set(cluster, 'SubmitArguments', '-A OPEN-0-0');
set(cluster, 'ResourceTemplate', '-q qprod -l select=10:ncpus=16');
set(cluster, 'NumWorkers', 160);
pool = parpool(cluster, 160);
n=2000;
W = rand(n,n);
W = distributed(W);
x = (1:n)';
x = distributed(x);
spmd
[~, name] = system('hostname')
T = W*x; % Calculation performed on labs, in parallel.
% T and W are both codistributed arrays here.
end
whos % T and W are both distributed arrays here.
% shut down parallel pool
delete(pool)
```
Note that we first construct a cluster object using the imported profile, then set some important options, namely SubmitArguments, where you need to specify the accounting ID, and ResourceTemplate, where you need to specify the number of nodes for the job.
You can start this script in batch mode the same way as in the Local mode example.
### Parallel Matlab Batch with direct launch (workers spawned within the existing job)
This method is a "hack" invented by us to emulate the mpiexec functionality found in previous MATLAB versions. We leverage the MATLAB Generic Scheduler interface, but instead of submitting the workers to PBS, we launch the workers directly within the running job, thus we avoid the issues with master script and workers running in separate jobs (issues with license not available, waiting for the worker's job to spawn etc.)
Please note that this method is experimental.
For this method, you need to use the SalomonDirect profile; import it [the same way as SalomonPBSPro](matlab/#running-parallel-matlab-using-distributed-computing-toolbox---engine).
This is an example of an m-script using direct mode:
```bash
parallel.importProfile('/apps/all/MATLAB/2015a-EDU/SalomonDirect.settings')
cluster = parcluster('SalomonDirect');
set(cluster, 'NumWorkers', 48);
pool = parpool(cluster, 48);
n=2000;
W = rand(n,n);
W = distributed(W);
x = (1:n)';
x = distributed(x);
spmd
[~, name] = system('hostname')
T = W*x; % Calculation performed on labs, in parallel.
% T and W are both codistributed arrays here.
end
whos % T and W are both distributed arrays here.
% shut down parallel pool
delete(pool)
```
### Non-interactive Session and Licenses
If you want to run batch jobs with Matlab, be sure to request the appropriate license features with the PBS Pro scheduler, at least "-l feature__matlab__MATLAB=1" for the EDU variant of Matlab. For more information about how to check the license feature states and how to request them with PBS Pro, please [look here](../isv_licenses/).
In the case of a non-interactive session, please read the [following information](../isv_licenses/) on how to modify the qsub command to test for available licenses prior to getting the resource allocation.
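For illustration, a batch submission requesting the EDU license feature together with the compute resources might look like this (the project ID and resource selection are placeholders):
```bash
$ qsub -A PROJECT_ID -q qprod -l select=1:ncpus=16 -l feature__matlab__MATLAB=1 ./jobscript
```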
### Matlab Distributed Computing Engines start up time
Starting Matlab workers is an expensive process that requires a certain amount of time. For your information, please see the following table:
|compute nodes|number of workers|start-up time [s]|
|---|---|---|
|16|384|831|
|8|192|807|
|4|96|483|
|2|48|16|
MATLAB on UV2000
-----------------
The UV2000 machine, available in the "qfat" queue, can be used for MATLAB computations. It is an SMP NUMA machine with a large amount of RAM, which can be beneficial for certain types of MATLAB jobs. CPU cores on this machine are allocated in chunks of 8.
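For example, an interactive allocation of 16 cores (two chunks of 8) on the UV2000 might be requested as follows (the project ID is a placeholder):
```bash
$ qsub -I -A PROJECT_ID -q qfat -l select=2:ncpus=8 -l walltime=01:00:00
```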
You can use MATLAB on UV2000 in two parallel modes:
### Threaded mode
Since this is an SMP machine, you can completely avoid using the Parallel Computing Toolbox and rely only on MATLAB's built-in threading. MATLAB will automatically detect the number of cores you have allocated and set maxNumCompThreads accordingly, and certain operations, such as fft, eig, svd, etc., will automatically run in threads. The advantage of this mode is that you don't need to modify your existing sequential codes.
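A quick, purely illustrative check inside MATLAB shows the current thread limit and how to adjust it manually if needed:
```bash
>> maxNumCompThreads        % report the current maximum number of computational threads
>> maxNumCompThreads(16)    % optionally set the limit explicitly, e.g. to 16 threads
```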
### Local cluster mode
You can also use the Parallel Computing Toolbox on the UV2000. Use the [local cluster mode](matlab/#parallel-matlab-batch-job-in-local-mode); the "SalomonPBSPro" profile will not work.
Matlab 2013-2014
================
Introduction
------------
!!! Note "Note"
This document relates to the old versions R2013 and R2014. For MATLAB 2015, please use [this documentation instead](matlab/).
Matlab is available in the latest stable version. There are always two variants of the release:
- Non-commercial, or so-called EDU variant, which can be used for common research and educational purposes.
- Commercial, or so-called COM variant, which can also be used for commercial activities. The licenses for the commercial variant are much more expensive, so the commercial variant usually provides only a subset of the features available in the EDU variant.
To load the latest version of Matlab load the module
```bash
$ module load matlab
```
The EDU variant is loaded by default. If you need another version or variant, load that particular module. To obtain the list of available versions, use
```bash
$ module avail matlab
```
If you need to use the Matlab GUI to prepare your Matlab programs, you can use Matlab directly on the login nodes. For all computations, however, use Matlab on the compute nodes via the PBS Pro scheduler.
If you require the Matlab GUI, please follow the general information about running graphical applications.
The Matlab GUI is quite slow when using the X forwarding built into PBS (qsub -X), so X11 display redirection either via SSH or directly by xauth is recommended (please see the "GUI Applications on Compute Nodes over VNC" part).
To run Matlab with GUI, use
```bash
$ matlab
```
To run Matlab in text mode, without the Matlab Desktop GUI environment, use
```bash
$ matlab -nodesktop -nosplash
```
Plots, images, etc. will still be available.
Running parallel Matlab using Distributed Computing Toolbox / Engine
--------------------------------------------------------------------
The recommended parallel mode for running parallel Matlab on Anselm is the MPIEXEC mode. In this mode, the user allocates resources through PBS prior to starting Matlab. Once the resources are granted, the main Matlab instance is started on the first compute node assigned to the job by PBS and workers are started on all remaining nodes. The user can use both interactive and non-interactive PBS sessions. This mode guarantees that no data processing is performed on the login nodes; all processing happens on the compute nodes.
![Parallel Matlab](../../../img/Matlab.png "Parallel Matlab")
For performance reasons, Matlab should use the system MPI. On Anselm, the supported MPI implementation for Matlab is Intel MPI. To switch to the system MPI, the user has to override the default Matlab setting by creating a new configuration file in their home directory. The path and file name have to be exactly the same as in the following listing:
```bash
$ vim ~/matlab/mpiLibConf.m
function [lib, extras] = mpiLibConf
%MATLAB MPI Library overloading for Infiniband Networks
mpich = '/opt/intel/impi/4.1.1.036/lib64/';
disp('Using Intel MPI 4.1.1.036 over Infiniband')
lib = strcat(mpich, 'libmpich.so');
mpl = strcat(mpich, 'libmpl.so');
opa = strcat(mpich, 'libopa.so');
extras = {};
```
The system MPI library allows Matlab to communicate through the 40 Gbps InfiniBand QDR interconnect instead of the slower 1 Gb Ethernet network.
!!! Note "Note"
Please note: The path to the MPI library in "mpiLibConf.m" has to match the version of the loaded Intel MPI module. In this example, version 4.1.1.036 of Intel MPI is used by Matlab, and therefore the module impi/4.1.1.036 has to be loaded prior to starting Matlab.
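As a simple sanity check (not required), you can verify from within MATLAB that your configuration file is the one being picked up:
```bash
>> which mpiLibConf         % should point to ~/matlab/mpiLibConf.m
```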
### Parallel Matlab interactive session
Once this file is in place, the user can request resources from PBS. The following example shows how to start an interactive session with support for the Matlab GUI. For more information about GUI-based applications on Anselm, see the [X Window System documentation](../../../get-started-with-it4innovations/accessing-the-clusters/graphical-user-interface/x-window-system/x-window-system/).
```bash
$ xhost +
$ qsub -I -v DISPLAY=$(uname -n):$(echo $DISPLAY | cut -d ':' -f 2) -A NONE-0-0 -q qexp -l select=2:ncpus=16:mpiprocs=16 -l walltime=00:30:00 -l feature__matlab__MATLAB=1
```
This qsub command example shows how to run Matlab with 32 workers in the following configuration: 2 nodes (all 16 cores per node used) and 16 workers = mpiprocs per node (-l select=2:ncpus=16:mpiprocs=16). If you need to run a smaller number of workers per node, change the "mpiprocs" parameter accordingly, as shown in the example below.
The second part of the command shows how to request all the necessary licenses; in this case, 1 Matlab-EDU license and 32 Distributed Computing Engine licenses.
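For example, a hypothetical request for only 8 workers per node (16 workers in total on 2 nodes) changes only the mpiprocs value:
```bash
$ qsub -I -v DISPLAY=$(uname -n):$(echo $DISPLAY | cut -d ':' -f 2) -A NONE-0-0 -q qexp -l select=2:ncpus=16:mpiprocs=8 -l walltime=00:30:00 -l feature__matlab__MATLAB=1
```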
Once access to the compute nodes is granted by PBS, the user can load the following modules and start Matlab:
```bash
cn79$ module load matlab/R2013a-EDU
cn79$ module load impi/4.1.1.036
cn79$ matlab &
```
### Parallel Matlab batch job
To run Matlab in batch mode, write a Matlab script, then write a bash jobscript and submit it via the qsub command. By default, Matlab will execute one Matlab worker instance per allocated core.
```bash
#!/bin/bash
#PBS -A PROJECT_ID
#PBS -q qprod
#PBS -l select=2:ncpus=16:mpiprocs=16:ompthreads=1
# change to shared scratch directory
SCR=/scratch/$USER/$PBS_JOBID
mkdir -p $SCR ; cd $SCR || exit
# copy input file to scratch
cp $PBS_O_WORKDIR/matlabcode.m .
# load modules
module load matlab/R2013a-EDU
module load impi/4.1.1.036
# execute the calculation
matlab -nodisplay -r matlabcode > output.out
# copy output file to home
cp output.out $PBS_O_WORKDIR/.
```
This script may be submitted directly to the PBS workload manager via the qsub command. The inputs and the Matlab script are in the matlabcode.m file, the outputs in the output.out file. Note the missing .m extension in the matlab -r matlabcode call, **the .m must not be included**. Note that the **shared /scratch must be used**. Further, it is **important to include a quit** statement at the end of the matlabcode.m script.
Submit the jobscript using qsub
```bash
$ qsub ./jobscript
```
### Parallel Matlab program example
The last part of the configuration is done directly in the user's Matlab script before the Distributed Computing Toolbox is started.
```bash
sched = findResource('scheduler', 'type', 'mpiexec');
set(sched, 'MpiexecFileName', '/apps/intel/impi/4.1.1/bin/mpirun');
set(sched, 'EnvironmentSetMethod', 'setenv');
```
This script creates a scheduler object "sched" of type "mpiexec" that starts workers using the mpirun tool. To use the correct version of mpirun, the second line specifies the path to the correct version of the system Intel MPI library.
!!! Note "Note"
Please note: Every Matlab script that needs to initialize/use matlabpool has to contain these three lines prior to calling the matlabpool(sched, ...) function.
The last step is to start matlabpool with the "sched" object and the correct number of workers. In this case, qsub asked for a total of 32 cores, therefore the number of workers is also set to 32.
```bash
matlabpool(sched,32);
... parallel code ...
matlabpool close
```
The complete example showing how to use the Distributed Computing Toolbox is shown here.
```bash
sched = findResource('scheduler', 'type', 'mpiexec');
set(sched, 'MpiexecFileName', '/apps/intel/impi/4.1.1/bin/mpirun')
set(sched, 'EnvironmentSetMethod', 'setenv')
set(sched, 'SubmitArguments', '')
sched
matlabpool(sched,32);
n=2000;
W = rand(n,n);
W = distributed(W);
x = (1:n)';
x = distributed(x);
spmd
[~, name] = system('hostname')
T = W*x; % Calculation performed on labs, in parallel.
% T and W are both codistributed arrays here.
end
T;
whos % T and W are both distributed arrays here.
matlabpool close
quit
```
You can copy and paste the example into a .m file and execute it. Note that the matlabpool size should correspond to the **total number of cores** available on the allocated nodes.
### Non-interactive Session and Licenses
If you want to run batch jobs with Matlab, be sure to request the appropriate license features with the PBS Pro scheduler, at least "-l feature__matlab__MATLAB=1" for the EDU variant of Matlab. For more information about how to check the license feature states and how to request them with PBS Pro, please [look here](../isv_licenses/).
In the case of a non-interactive session, please read the [following information](../isv_licenses/) on how to modify the qsub command to test for available licenses prior to getting the resource allocation.
### Matlab Distributed Computing Engines start up time
Starting Matlab workers is an expensive process that requires a certain amount of time. For your information, please see the following table:
|compute nodes|number of workers|start-up time [s]|
|---|---|---|
|16|256|1008|
|8|128|534|
|4|64|333|
|2|32|210|