# Score-P
## Introduction
The [Score-P measurement infrastructure](http://www.vi-hps.org/projects/score-p/) is a highly scalable and easy-to-use tool suite for profiling, event tracing, and online analysis of HPC applications.
Score-P can be used as an instrumentation tool for [Scalasca](/software/debuggers/scalasca/).
## Installed Versions
There are currently two Score-P version 1.2.3 [modules](modules-matrix/) installed on Anselm:
* scorep/1.2.3-gcc-openmpi, for usage with the [GNU Compiler](/software/compilers/) and [OpenMPI](software/mpi/Running_OpenMPI/)
* scorep/1.2.3-icc-impi, for usage with the [Intel Compiler](/software/compilers/) and [Intel MPI](software/mpi/running-mpich2/).
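To use Score-P, load one of the modules first, for example:
```console
$ ml scorep/1.2.3-gcc-openmpi
```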
## Instrumentation
There are three ways to instrument your parallel applications in order to enable performance data collection:
1. Automated instrumentation using compiler
1. Manual instrumentation using API calls
1. Manual instrumentation using directives
### Automated Instrumentation
Automated instrumentation is the easiest method. Score-P will automatically add instrumentation to every routine entry and exit using compiler hooks, and will intercept MPI calls and OpenMP regions. This method may, however, produce a large amount of data. If you want to focus on profiling specific regions of your code, consider using the manual instrumentation methods. To use automated instrumentation, simply prepend scorep to your compilation command. For example, replace:
```console
$ mpif90 -c foo.f90
$ mpif90 -c bar.f90
$ mpif90 -o myapp foo.o bar.o
```
with:
```console
$ scorep mpif90 -c foo.f90
$ scorep mpif90 -c bar.f90
$ scorep mpif90 -o myapp foo.o bar.o
```
Usually your program is compiled using a Makefile or a similar script, so it is advisable to add the scorep command to your definitions of the variables CC, CXX, FC, etc.
It is important to prepend scorep to the linking command as well, in order to link with the Score-P instrumentation libraries.
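For example, with a typical Makefile you can often prepend scorep without editing the file at all, by overriding the compiler variables on the make command line (a sketch; the variable names depend on your build system):
```console
$ make CC="scorep mpicc" CXX="scorep mpicxx" FC="scorep mpif90"
```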
### Manual Instrumentation Using API Calls
To use this kind of instrumentation, invoke scorep with the --user switch. You then mark the regions to be instrumented by inserting API calls.
An example in C/C++:
```cpp
#include <scorep/SCOREP_User.h>
void foo()
{
SCOREP_USER_REGION_DEFINE( my_region_handle )
// more declarations
SCOREP_USER_REGION_BEGIN( my_region_handle, "foo", SCOREP_USER_REGION_TYPE_COMMON )
// do something
SCOREP_USER_REGION_END( my_region_handle )
}
```
and Fortran:
```fortran
#include "scorep/SCOREP_User.inc"
subroutine foo
SCOREP_USER_REGION_DEFINE( my_region_handle )
! more declarations
SCOREP_USER_REGION_BEGIN( my_region_handle, "foo", SCOREP_USER_REGION_TYPE_COMMON )
! do something
SCOREP_USER_REGION_END( my_region_handle )
end subroutine foo
```
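A build with user instrumentation enabled might then look like this (file names are illustrative):
```console
$ scorep --user mpicc -c foo.c
$ scorep --user mpicc -o myapp foo.o
```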
Please refer to the [documentation for a description of the API](https://silc.zih.tu-dresden.de/scorep-current/pdf/scorep.pdf).
### Manual Instrumentation Using Directives
This method uses POMP2 directives to mark regions to be instrumented. To use this method, invoke scorep with the --pomp switch.
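For example, a build with POMP2 instrumentation enabled might look like this (file names are illustrative):
```console
$ scorep --pomp mpicc -c foo.c
$ scorep --pomp mpicc -o myapp foo.o
```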
Example directives in C/C++:
```cpp
void foo(...)
{
/* declarations */
#pragma pomp inst begin(foo)
...
if (<condition>)
{
#pragma pomp inst altend(foo)
return;
}
...
#pragma pomp inst end(foo)
}
```
and in Fortran:
```fortran
subroutine foo(...)
!declarations
!POMP$ INST BEGIN(foo)
...
if (<condition>) then
!POMP$ INST ALTEND(foo)
return
end if
...
!POMP$ INST END(foo)
end subroutine foo
```
The directives are ignored if the program is compiled without Score-P. Again, refer to the [documentation](https://silc.zih.tu-dresden.de/scorep-current/pdf/scorep.pdf) for a more elaborate description.
# Total View
TotalView is a GUI-based source-code debugger for multi-process, multi-threaded applications.
## License and Limitations for Cluster Users
On the cluster, users can debug OpenMP or MPI code that runs up to 64 parallel processes. This limitation means that:
```console
1 user can debug up to 64 processes, or
32 users can debug 2 processes each, etc.
```
Debugging of GPU accelerated codes is also supported.
You can check the status of the licenses [here (Salomon)](https://extranet.it4i.cz/rsweb/anselm/license/Totalview) or type (Anselm):
```console
$ cat /apps/user/licenses/totalview_features_state.txt
# totalview
# -------------------------------------------------
# FEATURE TOTAL USED AVAIL
# -------------------------------------------------
TotalView_Team 64 0 64
Replay 64 0 64
CUDA 64 0 64
```
## Compiling Code to Run With TotalView
### Modules
Load all necessary modules to compile the code. For example:
```console
ml intel
```
Load the TotalView module:
```console
ml TotalView
ml totalview
```
Compile the code:
```console
mpicc -g -O0 -o test_debug test.c
mpif90 -g -O0 -o test_debug test.f
```
### Compiler Flags
Before debugging, you need to compile your code with these flags:
!!! note
**-g** : Generates extra debugging information usable by GDB. -g3 includes even more debugging information. This option is available for GNU and INTEL C/C++ and Fortran compilers.
**-O0** : Suppress all optimizations.
## Starting a Job With TotalView
Be sure to log in with X window forwarding enabled. This could mean using the -X option with ssh:
```console
ssh -X username@salomon.it4i.cz
```
Another option is to access the login node using VNC. Please see the detailed information on how to use the graphical user interface on Anselm.
From the login node, an interactive session with X window forwarding (the -X option) can be started with the following command (for Anselm, use ncpus=16):
```console
$ qsub -I -X -A NONE-0-0 -q qexp -lselect=1:ncpus=24:mpiprocs=24,walltime=01:00:00
```
Then launch the debugger with the totalview command followed by the name of the executable to debug.
### Debugging a Serial Code
To debug a serial code use:
```console
totalview test_debug
```
### Debugging a Parallel Code - Option 1
To debug a parallel code compiled with **OpenMPI**, you need to set up your TotalView environment:
!!! hint
To be able to run the parallel debugging procedure from the command line without stopping the debugger in the mpiexec source code, you have to add the following function to your **~/.tvdrc** file.
```tcl
proc mpi_auto_run_starter {loaded_id} {
    set starter_programs {mpirun mpiexec orterun}
    set executable_name [TV::symbol get $loaded_id full_pathname]
    set file_component [file tail $executable_name]
    if {[lsearch -exact $starter_programs $file_component] != -1} {
        puts "*************************************"
        puts "Automatically starting $file_component"
        puts "*************************************"
        dgo
    }
}

# Append this function to TotalView's image load callbacks so that
# TotalView runs this program automatically.
dlappend TV::image_load_callbacks mpi_auto_run_starter
```
The source code of this function can also be found in
```console
$ /apps/all/OpenMPI/1.10.1-GNU-4.9.3-2.25/etc/openmpi-totalview.tcl #Salomon
$ /apps/mpi/openmpi/intel/1.6.5/etc/openmpi-totalview.tcl #Anselm
```
You can also add only the following line to your ~/.tvdrc file instead of the entire function:
```console
$ source /apps/all/OpenMPI/1.10.1-GNU-4.9.3-2.25/etc/openmpi-totalview.tcl #Salomon
$ source /apps/mpi/openmpi/intel/1.6.5/etc/openmpi-totalview.tcl #Anselm
```
You need to do this step only once. See also the [OpenMPI FAQ entry](https://www.open-mpi.org/faq/?category=running#run-with-tv).
Now you can run the parallel debugger using:
```console
$ mpirun -tv -n 5 ./test_debug
```
When the following dialog appears, click "Yes":
![](../../img/totalview1.png)
At this point the main TotalView GUI window will appear and you can insert the breakpoints and start debugging:
![](../../img/totalview2.png)
### Debugging a Parallel Code - Option 2
Another option to start a new parallel debugging session from the command line is to let TotalView execute mpirun by itself. In this case, the user has to specify the MPI implementation used to compile the source code.
The following example shows how to start a debugging session with Intel MPI:
```console
$ ml intel
$ ml TotalView/8.15.4-6-linux-x86-64
$ totalview -mpi "Intel MPI-Hydra" -np 8 ./hello_debug_impi
```
After running the previous command, you will see the same window as shown in the screenshot above.
More information regarding the command line parameters of TotalView can be found in the TotalView Reference Guide, Chapter 7: TotalView Command Syntax.
## Documentation
[1] The [TotalView documentation](http://www.roguewave.com/support/product-documentation/totalview-family.aspx#totalview) web page is a good resource for learning more about some of the advanced TotalView features.
# Valgrind
Valgrind is a tool for memory debugging and profiling.
## About Valgrind
Valgrind is an open-source tool used mainly for debugging memory-related problems, such as memory leaks and use of uninitialized memory, in C/C++ applications. The toolchain has, however, been extended over time with more functionality, such as debugging of threaded applications and cache profiling, and is not limited to C/C++.
Valgrind is an extremely useful tool for debugging memory errors such as [off-by-one](http://en.wikipedia.org/wiki/Off-by-one_error) errors. Valgrind uses a virtual machine and dynamic recompilation of binary code; because of that, you can expect programs being debugged with Valgrind to run 5-100 times slower.
The main tools available in Valgrind are:
* **Memcheck**, the original, most used, and default tool. Verifies memory access in your program and can detect use of uninitialized memory, out-of-bounds memory access, memory leaks, double frees, etc.
* **Massif**, a heap profiler.
* **Helgrind** and **DRD** can detect race conditions in multi-threaded applications.
* **Cachegrind**, a cache profiler.
* **Callgrind**, a callgraph analyzer.
* For a full list and detailed documentation, refer to the [official Valgrind documentation](http://valgrind.org/docs/).
## Installed Versions
There are two versions of Valgrind available on Anselm.
* Version 3.6.0, installed by the operating system vendor in /usr/bin/valgrind. This version is available by default, without the need to load any module. This version, however, does not provide additional MPI support.
* Version 3.9.0 with support for Intel MPI, available in the [module](modules-matrix/) valgrind/3.9.0-impi. After loading the module, this version replaces the default valgrind.
There are three versions of Valgrind available on Salomon.
* Version 3.8.1, installed by the operating system vendor in /usr/bin/valgrind. This version is available by default, without the need to load any module. It does not provide additional MPI support and does not support AVX2 instructions; debugging of an AVX2-enabled executable with this version will fail.
* Version 3.11.0 built with ICC with support for Intel MPI, available in the module Valgrind/3.11.0-intel-2015b. After loading the module, this version replaces the default valgrind.
* Version 3.11.0 built with GCC with support for OpenMPI, available in the module Valgrind/3.11.0-foss-2015b.
## Usage
Compile the application which you want to debug as usual. It is advisable to add the compilation flags -g (to add debugging information to the binary, so that you will see original source code lines in the output) and -O0 (to disable compiler optimizations).
For example, let's look at this C code, which has two problems:
```cpp
#include <stdlib.h>

void f(void)
{
    int* x = malloc(10 * sizeof(int));
    x[10] = 0;  // problem 1: heap block overrun
}               // problem 2: memory leak -- x not freed

int main(void)
{
    f();
    return 0;
}
```
Now, compile it with the Intel compiler:
```console
$ module add intel
$ icc -g valgrind-example.c -o valgrind-example
```
Now, let's run it with Valgrind. The syntax is:
`valgrind [valgrind options] <your program binary> [your program options]`
If no Valgrind options are specified, Valgrind defaults to running the Memcheck tool. Please refer to the Valgrind documentation for a full description of the command line options.
```console
$ valgrind ./valgrind-example
==12652== Memcheck, a memory error detector
==12652== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==12652== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==12652== Command: ./valgrind-example
==12652==
==12652== Invalid write of size 4
==12652== at 0x40053E: f (valgrind-example.c:6)
==12652== by 0x40054E: main (valgrind-example.c:11)
==12652== Address 0x5861068 is 0 bytes after a block of size 40 alloc'd
==12652== at 0x4C27AAA: malloc (vg_replace_malloc.c:291)
==12652== by 0x400528: f (valgrind-example.c:5)
==12652== by 0x40054E: main (valgrind-example.c:11)
==12652==
==12652==
==12652== HEAP SUMMARY:
==12652== in use at exit: 40 bytes in 1 blocks
==12652== total heap usage: 1 allocs, 0 frees, 40 bytes allocated
==12652==
==12652== LEAK SUMMARY:
==12652== definitely lost: 40 bytes in 1 blocks
==12652== indirectly lost: 0 bytes in 0 blocks
==12652== possibly lost: 0 bytes in 0 blocks
==12652== still reachable: 0 bytes in 0 blocks
==12652== suppressed: 0 bytes in 0 blocks
==12652== Rerun with --leak-check=full to see details of leaked memory
==12652==
==12652== For counts of detected and suppressed errors, rerun with: -v
==12652== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 6 from 6)
```
In the output, we can see that Valgrind has detected both errors: the off-by-one memory access at line 6 and the memory leak of 40 bytes. If we want a detailed analysis of the memory leak, we need to run Valgrind with the --leak-check=full option:
```console
$ valgrind --leak-check=full ./valgrind-example
==23856== Memcheck, a memory error detector
==23856== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==23856== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==23856== Command: ./valgrind-example
==23856==
==23856== Invalid write of size 4
==23856== at 0x40067E: f (valgrind-example.c:6)
==23856== by 0x40068E: main (valgrind-example.c:11)
==23856== Address 0x66e7068 is 0 bytes after a block of size 40 alloc'd
==23856== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
==23856== by 0x400668: f (valgrind-example.c:5)
==23856== by 0x40068E: main (valgrind-example.c:11)
==23856==
==23856==
==23856== HEAP SUMMARY:
==23856== in use at exit: 40 bytes in 1 blocks
==23856== total heap usage: 1 allocs, 0 frees, 40 bytes allocated
==23856==
==23856== 40 bytes in 1 blocks are definitely lost in loss record 1 of 1
==23856== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
==23856== by 0x400668: f (valgrind-example.c:5)
==23856== by 0x40068E: main (valgrind-example.c:11)
==23856==
==23856== LEAK SUMMARY:
==23856== definitely lost: 40 bytes in 1 blocks
==23856== indirectly lost: 0 bytes in 0 blocks
==23856== possibly lost: 0 bytes in 0 blocks
==23856== still reachable: 0 bytes in 0 blocks
==23856== suppressed: 0 bytes in 0 blocks
==23856==
==23856== For counts of detected and suppressed errors, rerun with: -v
==23856== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 6 from 6)
```
Now we can see that the memory leak is due to the malloc() at line 5.
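Memcheck is only one of the available tools; to run another one, select it with the --tool option, for example the Massif heap profiler mentioned above:
```console
$ valgrind --tool=massif ./valgrind-example
```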
## Usage With MPI
Although Valgrind is not primarily a parallel debugger, it can be used to debug parallel applications as well. When launching your parallel application, prepend the valgrind command. For example:
```console
$ mpirun -np 4 valgrind myapplication
```
The default version without MPI support will, however, report a large number of false errors in the MPI library, such as:
```console
==30166== Conditional jump or move depends on uninitialised value(s)
==30166== at 0x4C287E8: strlen (mc_replace_strmem.c:282)
==30166== by 0x55443BD: I_MPI_Processor_model_number (init_interface.c:427)
==30166== by 0x55439E0: I_MPI_Processor_arch_code (init_interface.c:171)
==30166== by 0x558D5AE: MPID_nem_impi_init_shm_configuration (mpid_nem_impi_extensions.c:1091)
==30166== by 0x5598F4C: MPID_nem_init_ckpt (mpid_nem_init.c:566)
==30166== by 0x5598B65: MPID_nem_init (mpid_nem_init.c:489)
==30166== by 0x539BD75: MPIDI_CH3_Init (ch3_init.c:64)
==30166== by 0x5578743: MPID_Init (mpid_init.c:193)
==30166== by 0x554650A: MPIR_Init_thread (initthread.c:539)
==30166== by 0x553369F: PMPI_Init (init.c:195)
==30166== by 0x4008BD: main (valgrind-example-mpi.c:18)
```
so it is better to use the MPI-enabled Valgrind from the module. The MPI version requires the library:
* Anselm: /apps/tools/valgrind/3.9.0/impi/lib/valgrind/libmpiwrap-amd64-linux.so
* Salomon: $EBROOTVALGRIND/lib/valgrind/libmpiwrap-amd64-linux.so
which must be included in the LD_PRELOAD environment variable.
Let's look at this MPI example:
```cpp
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int *data = malloc(sizeof(int) * 99);

    MPI_Init(&argc, &argv);
    MPI_Bcast(data, 100, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize();

    return 0;
}
```
There are two errors: use of uninitialized memory and an invalid length of the buffer. Let's debug it with Valgrind:
```console
$ module add intel impi
$ mpicc -g valgrind-example-mpi.c -o valgrind-example-mpi
$ module add valgrind/3.9.0-impi
$ mpirun -np 2 -env LD_PRELOAD /apps/tools/valgrind/3.9.0/impi/lib/valgrind/libmpiwrap-amd64-linux.so valgrind ./valgrind-example-mpi
```
This prints the following output (note that output is printed for every launched MPI process):
```console
==31318== Memcheck, a memory error detector
==31318== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==31318== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==31318== Command: ./valgrind-example-mpi
==31318==
==31319== Memcheck, a memory error detector
==31319== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==31319== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==31319== Command: ./valgrind-example-mpi
==31319==
valgrind MPI wrappers 31319: Active for pid 31319
valgrind MPI wrappers 31319: Try MPIWRAP_DEBUG=help for possible options
valgrind MPI wrappers 31318: Active for pid 31318
valgrind MPI wrappers 31318: Try MPIWRAP_DEBUG=help for possible options
==31319== Unaddressable byte(s) found during client check request
==31319== at 0x4E35974: check_mem_is_addressable_untyped (libmpiwrap.c:960)
==31319== by 0x4E5D0FE: PMPI_Bcast (libmpiwrap.c:908)
==31319== by 0x400911: main (valgrind-example-mpi.c:20)
==31319== Address 0x69291cc is 0 bytes after a block of size 396 alloc'd
==31319== at 0x4C27AAA: malloc (vg_replace_malloc.c:291)
==31319== by 0x4007BC: main (valgrind-example-mpi.c:8)
==31319==
==31318== Uninitialised byte(s) found during client check request
==31318== at 0x4E3591D: check_mem_is_defined_untyped (libmpiwrap.c:952)
==31318== by 0x4E5D06D: PMPI_Bcast (libmpiwrap.c:908)
==31318== by 0x400911: main (valgrind-example-mpi.c:20)
==31318== Address 0x6929040 is 0 bytes inside a block of size 396 alloc'd
==31318== at 0x4C27AAA: malloc (vg_replace_malloc.c:291)
==31318== by 0x4007BC: main (valgrind-example-mpi.c:8)
==31318==
==31318== Unaddressable byte(s) found during client check request
==31318== at 0x4E3591D: check_mem_is_defined_untyped (libmpiwrap.c:952)
==31318== by 0x4E5D06D: PMPI_Bcast (libmpiwrap.c:908)
==31318== by 0x400911: main (valgrind-example-mpi.c:20)
==31318== Address 0x69291cc is 0 bytes after a block of size 396 alloc'd
==31318== at 0x4C27AAA: malloc (vg_replace_malloc.c:291)
==31318== by 0x4007BC: main (valgrind-example-mpi.c:8)
==31318==
==31318==
==31318== HEAP SUMMARY:
==31318== in use at exit: 3,172 bytes in 67 blocks
==31318== total heap usage: 191 allocs, 124 frees, 81,203 bytes allocated
==31318==
==31319==
==31319== HEAP SUMMARY:
==31319== in use at exit: 3,172 bytes in 67 blocks
==31319== total heap usage: 175 allocs, 108 frees, 48,435 bytes allocated
==31319==
==31318== LEAK SUMMARY:
==31318== definitely lost: 408 bytes in 3 blocks
==31318== indirectly lost: 256 bytes in 1 blocks
==31318== possibly lost: 0 bytes in 0 blocks
==31318== still reachable: 2,508 bytes in 63 blocks
==31318== suppressed: 0 bytes in 0 blocks
==31318== Rerun with --leak-check=full to see details of leaked memory
==31318==
==31318== For counts of detected and suppressed errors, rerun with: -v
==31318== Use --track-origins=yes to see where uninitialised values come from
==31318== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 4 from 4)
==31319== LEAK SUMMARY:
==31319== definitely lost: 408 bytes in 3 blocks
==31319== indirectly lost: 256 bytes in 1 blocks
==31319== possibly lost: 0 bytes in 0 blocks
==31319== still reachable: 2,508 bytes in 63 blocks
==31319== suppressed: 0 bytes in 0 blocks
==31319== Rerun with --leak-check=full to see details of leaked memory
==31319==
==31319== For counts of detected and suppressed errors, rerun with: -v
==31319== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
```
We can see that Valgrind has reported use of uninitialized memory on the master process (which reads the array to be broadcast) and use of unaddressable memory on both processes.
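To find out where the uninitialised value reported on the master process originates, you can rerun the example with the --track-origins=yes option suggested in the output, for example (using the same Anselm paths as above):
```console
$ mpirun -np 2 -env LD_PRELOAD /apps/tools/valgrind/3.9.0/impi/lib/valgrind/libmpiwrap-amd64-linux.so valgrind --track-origins=yes ./valgrind-example-mpi
```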
# Vampir
Vampir is a commercial trace analysis and visualization tool. It can work with traces in OTF and OTF2 formats. It does not have the functionality to collect traces; you need to use a trace collection tool (such as [Score-P](/software/debuggers/score-p/)) first to collect the traces.
![](../../img/Snmekobrazovky20160708v12.33.35.png)
## Installed Versions
```console
$ ml av Vampir
```
To launch the GUI, load the module and run:
```console
$ ml Vampir
$ vampir &
```
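A trace file produced by Score-P can typically also be passed directly on the command line (the file name below is illustrative):
```console
$ ml Vampir
$ vampir traces.otf2 &
```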
## User Manual
You can find the detailed user manual in PDF format in $EBROOTVAMPIR/doc/vampir-manual.pdf.
## References
1. [https://www.vampir.eu](https://www.vampir.eu)
# Intel Advisor
Intel Advisor is a tool that assists you with vectorization and threading of your code. You can use it to profile your application and identify loops that could benefit from vectorization and/or threading parallelism.
## Installed Versions
The following versions are currently available on Salomon as modules:
2016 Update 2 - Advisor/2016_update2
## Usage
Your program should be compiled with the -g switch to include symbol names. You should compile with -O2 or higher to see code that is already vectorized by the compiler.
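For example, with the Intel compiler (the file name is illustrative):
```console
$ icc -g -O2 myprog.c -o myprog.x
```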
Profiling is possible either directly from the GUI, or from the command line.
To profile from the GUI, launch Advisor:
```console
$ advixe-gui
```
Then select File -> New -> Project from the menu. Choose a directory to save the project data to. After clicking OK, the Project properties window will appear, where you can configure the path to your binary, launch arguments, working directory, etc. After clicking OK, the project is ready.
In the left pane, you can switch between the Vectorization and Threading workflows. Each has several possible steps which you can execute by clicking the Collect button. Alternatively, you can click on Command Line to see the command line required to run the analysis directly from the command line.
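A command produced this way typically looks similar to the following sketch, which collects a survey analysis into a project directory (the project directory and binary names are illustrative):
```console
$ advixe-cl -collect survey -project-dir ./advi_project -- ./myprog.x
```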
## References
1. [Intel® Advisor 2015 Tutorial: Find Where to Add Parallelism - C++ Sample](https://software.intel.com/en-us/intel-advisor-tutorial-vectorization-windows-cplusplus)
1. [Product page](https://software.intel.com/en-us/intel-advisor-xe)
1. [Documentation](https://software.intel.com/en-us/intel-advisor-2016-user-guide-linux)
# Intel Compilers
Multiple versions of the Intel compilers are available via the intel module. The compilers include the icc C and C++ compiler and the ifort Fortran 77/90/95 compiler.
```console
$ ml intel
$ icc -v
$ ifort -v
```
The Intel compilers provide vectorization of the code via the AVX2 instructions and support threading parallelization via OpenMP.
For maximum performance on the Salomon cluster compute nodes, compile your programs using the AVX2 instructions, with reporting where vectorization was used. We recommend the following compilation options for high performance:
```console
$ icc -ipo -O3 -xCORE-AVX2 -qopt-report1 -qopt-report-phase=vec myprog.c mysubroutines.c -o myprog.x
$ ifort -ipo -O3 -xCORE-AVX2 -qopt-report1 -qopt-report-phase=vec myprog.f mysubroutines.f -o myprog.x
```
In this example, we compile the program enabling interprocedural optimizations between source files (-ipo), aggressive loop optimizations (-O3), and vectorization (-xCORE-AVX2).
The compiler recognizes the omp, simd, vector, and ivdep pragmas for OpenMP parallelization and AVX2 vectorization. Enable the OpenMP parallelization with the **-openmp** compiler switch.
```console
$ icc -ipo -O3 -xCORE-AVX2 -qopt-report1 -qopt-report-phase=vec -openmp myprog.c mysubroutines.c -o myprog.x
$ ifort -ipo -O3 -xCORE-AVX2 -qopt-report1 -qopt-report-phase=vec -openmp myprog.f mysubroutines.f -o myprog.x
```
Read more at <https://software.intel.com/en-us/intel-cplusplus-compiler-16.0-user-and-reference-guide>
## Sandy Bridge/Ivy Bridge/Haswell Binary Compatibility
Anselm nodes are currently equipped with Sandy Bridge CPUs, while Salomon compute nodes are equipped with the Haswell-based architecture. The UV1 SMP compute server has Ivy Bridge CPUs, which are equivalent to Sandy Bridge (only a smaller manufacturing technology). The new processors are backward compatible with the Sandy Bridge nodes, so all programs that ran on the Sandy Bridge processors should also run on the new Haswell nodes. To get optimal performance out of the Haswell processors, a program should make use of the special AVX2 instructions for this processor. One can do this by recompiling codes with the compiler flags designated to invoke these instructions. For the Intel compiler suite, there are two ways of doing this:
* Using the compiler flag (both for Fortran and C): **-xCORE-AVX2**. This will create a binary with AVX2 instructions, specifically for the Haswell processors. Note that the executable will not run on Sandy Bridge/Ivy Bridge nodes.
* Using the compiler flags (both for Fortran and C): **-xAVX -axCORE-AVX2**. This will generate multiple, feature-specific auto-dispatch code paths for Intel® processors, if there is a performance benefit. Such a binary will run both on Sandy Bridge/Ivy Bridge and Haswell processors. At runtime it will be decided which path to follow, depending on which processor you are running on. In general, this will result in larger binaries; see the example below.
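A minimal sketch of such a multi-path build, based on the compilation options recommended above (source file names are illustrative):
```console
$ icc -ipo -O3 -xAVX -axCORE-AVX2 -qopt-report1 -qopt-report-phase=vec myprog.c mysubroutines.c -o myprog.x
```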
# Intel Debugger
IDB is no longer available since Intel Parallel Studio 2015.
## Debugging Serial Applications
The Intel debugger is available via the module intel/13.5.192. The debugger works for applications compiled with the C and C++ compiler and the ifort Fortran 77/90/95 compiler. The debugger provides a Java GUI environment. Use the [X display](/general/accessing-the-clusters/graphical-user-interface/x-window-system/) for running the GUI.
```console
$ ml intel/13.5.192
$ ml Java
$ idb
```
The debugger may run in text mode. To debug in text mode, use:
```console
$ idbc
```
To debug on the compute nodes, the intel module must be loaded. The GUI on compute nodes may be accessed in the same way as described in [the GUI section](/general/accessing-the-clusters/graphical-user-interface/x-window-system/).
Example:
```console
$ qsub -q qexp -l select=1:ncpus=24 -X -I # use 16 threads for Anselm
qsub: waiting for job 19654.srv11 to start
qsub: job 19654.srv11 ready
$ ml intel
$ ml Java
$ icc -O0 -g myprog.c -o myprog.x
$ idb ./myprog.x
```
In this example, we allocate 1 full compute node, compile program myprog.c with debugging options -O0 -g and run the idb debugger interactively on the myprog.x executable. The GUI access is via X11 port forwarding provided by the PBS workload manager.
## Debugging Parallel Applications
Intel debugger is capable of debugging multithreaded and MPI parallel programs as well.
### Small Number of MPI Ranks
For debugging a small number of MPI ranks, you may execute and debug each rank in a separate xterm terminal (do not forget the [X display](/general/accessing-the-clusters/graphical-user-interface/x-window-system/)). Using Intel MPI, this may be done in the following way:
```console
$ qsub -q qexp -l select=2:ncpus=24 -X -I
qsub: waiting for job 19654.srv11 to start
qsub: job 19655.srv11 ready
$ ml intel
$ mpirun -ppn 1 -hostfile $PBS_NODEFILE --enable-x xterm -e idbc ./mympiprog.x
```
In this example, we allocate 2 full compute nodes, run xterm on each node, and start the idb debugger in command line mode, debugging two ranks of the mympiprog.x application. The xterm will pop up for each rank, with the idb prompt ready. The example is not limited to the use of Intel MPI.
### Large Number of MPI Ranks
Run the idb debugger from within the MPI debug option. This will cause the debugger to bind to all ranks and provide aggregated outputs across the ranks, pausing execution automatically just after startup. You may then set breakpoints and step the execution manually. Using Intel MPI:
```console
$ qsub -q qexp -l select=2:ncpus=24 -X -I
qsub: waiting for job 19654.srv11 to start
qsub: job 19655.srv11 ready
$ ml intel
$ mpirun -n 48 -idb ./mympiprog.x
```
### Debugging Multithreaded Applications
Run the idb debugger in GUI mode. The Parallel menu contains a number of tools for debugging multiple threads. One of the most useful tools is the **Serialize Execution** tool, which serializes execution of concurrent threads for easy orientation and identification of concurrency-related bugs.
## Further Information
An exhaustive manual on IDB features and usage is published on the [Intel website](https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/).
# Intel Inspector
Intel Inspector is a dynamic memory and threading error checking tool for C/C++/Fortran applications. It can detect issues such as memory leaks, invalid memory references, uninitialized variables, race conditions, deadlocks, etc.
## Installed Versions
The following versions are currently available on Salomon as modules:
2016 Update 1 - Inspector/2016_update1
## Usage
Your program should be compiled with the -g switch to include symbol names. Optimizations can be turned on.
Debugging is possible either directly from the GUI, or from the command line.
### GUI Mode
To debug from GUI, launch Inspector:
```console
$ inspxe-gui &
```
Then select File -> New -> Project from the menu. Choose a directory to save the project data to. After clicking OK, the Project properties window will appear, where you can configure the path to your binary, launch arguments, working directory, etc. After clicking OK, the project is ready.
In the main pane, you can start a predefined analysis type or define your own. Click Start to start the analysis. Alternatively, you can click on Command Line to see the command line required to run the analysis directly from the command line.
### Batch Mode
Analysis can also be run from the command line in batch mode. Batch mode analysis is run with the command inspxe-cl. To obtain the required parameters, either consult the documentation or configure the analysis in the GUI and then click the "Command Line" button in the lower right corner to see the respective command line.
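For example, a basic memory-error analysis in batch mode might be run like this (a sketch; the analysis type and binary name are illustrative):
```console
$ inspxe-cl -collect mi1 -- ./myprog.x
```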
Results obtained from batch mode can then be viewed in the GUI by selecting File -> Open -> Result...
## References
1. [Product page](https://software.intel.com/en-us/intel-inspector-xe)
1. [Documentation and Release Notes](https://software.intel.com/en-us/intel-inspector-xe-support/documentation)
1. [Tutorials](https://software.intel.com/en-us/articles/inspectorxe-tutorials)
# Intel IPP
## Intel Integrated Performance Primitives
Intel Integrated Performance Primitives, version 9.0.1, compiled for AVX2 vector instructions, is available via the ipp module. IPP is a very rich library of highly optimized algorithmic building blocks for media and data applications. This includes signal, image, and frame processing algorithms, such as FFT, FIR, convolution, optical flow, Hough transform, sum, min/max, as well as cryptographic functions, linear algebra functions, and many more.
Check out IPP before implementing your own math functions for data processing; they are likely already there.
```console
$ ml ipp
```
The module sets up the environment variables required for linking and running IPP-enabled applications.
## IPP Example
```cpp
#include "ipp.h"
#include <stdio.h>
int main(int argc, char* argv[])
{
const IppLibraryVersion *lib;
Ipp64u fm;
IppStatus status;
status= ippInit(); //IPP initialization with the best optimization layer
if( status != ippStsNoErr ) {
printf("IppInit() Error:n");
printf("%sn", ippGetStatusString(status) );
return -1;
}
//Get version info
lib = ippiGetLibVersion();
printf("%s %sn", lib->Name, lib->Version);
//Get CPU features enabled with selected library level
fm=ippGetEnabledCpuFeatures();
printf("SSE :%cn",(fm>1)&1?'Y':'N');
printf("SSE2 :%cn",(fm>2)&1?'Y':'N');
printf("SSE3 :%cn",(fm>3)&1?'Y':'N');
printf("SSSE3 :%cn",(fm>4)&1?'Y':'N');
printf("SSE41 :%cn",(fm>6)&1?'Y':'N');
printf("SSE42 :%cn",(fm>7)&1?'Y':'N');
printf("AVX :%cn",(fm>8)&1 ?'Y':'N');
printf("AVX2 :%cn", (fm>15)&1 ?'Y':'N' );
printf("----------n");
printf("OS Enabled AVX :%cn", (fm>9)&1 ?'Y':'N');
printf("AES :%cn", (fm>10)&1?'Y':'N');
printf("CLMUL :%cn", (fm>11)&1?'Y':'N');
printf("RDRAND :%cn", (fm>13)&1?'Y':'N');
printf("F16C :%cn", (fm>14)&1?'Y':'N');
return 0;
}
```
Compile the above example using any compiler and the ipp module.
```console
$ ml intel
$ ml ipp
$ icc testipp.c -o testipp.x -lippi -lipps -lippcore
```
You will need the ipp module loaded to run the IPP-enabled executable. This may be avoided by compiling the library search paths into the executable:
```console
$ ml intel
$ ml ipp
$ icc testipp.c -o testipp.x -Wl,-rpath=$LIBRARY_PATH -lippi -lipps -lippcore
```
## Code Samples and Documentation
Intel provides a number of [Code Samples for IPP](https://software.intel.com/en-us/articles/code-samples-for-intel-integrated-performance-primitives-library), illustrating the use of IPP.
Read the full documentation on IPP [on the Intel website](http://software.intel.com/sites/products/search/search.php?q=&x=15&y=6&product=ipp&version=7.1&docos=lin), in particular the [IPP Reference manual](http://software.intel.com/sites/products/documentation/doclib/ipp_sa/71/ipp_manual/index.htm).
# Intel MKL
## Intel Math Kernel Library
Intel Math Kernel Library (Intel MKL) is a library of math kernel subroutines, extensively threaded and optimized for maximum performance. Intel MKL provides these basic math kernels:
* BLAS (level 1, 2, and 3) and LAPACK linear algebra routines, offering vector, vector-matrix, and matrix-matrix operations.
* The PARDISO direct sparse solver, an iterative sparse solver, and supporting sparse BLAS (level 1, 2, and 3) routines for solving sparse systems of equations.
* ScaLAPACK distributed processing linear algebra routines for Linux and Windows operating systems, as well as the Basic Linear Algebra Communications Subprograms (BLACS) and the Parallel Basic Linear Algebra Subprograms (PBLAS).
* Fast Fourier transform (FFT) functions in one, two, or three dimensions with support for mixed radices (not limited to sizes that are powers of 2), as well as distributed versions of these functions.
* Vector Math Library (VML) routines for optimized mathematical operations on vectors.
* Vector Statistical Library (VSL) routines, which offer high-performance vectorized random number generators (RNG) for several probability distributions, convolution and correlation routines, and summary statistics functions.
* Data Fitting Library, which provides capabilities for spline-based approximation of functions, derivatives and integrals of functions, and search.
* Extended Eigensolver, a shared memory version of an eigensolver based on the Feast Eigenvalue Solver.
For details see the [Intel MKL Reference Manual](http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mklman/index.htm).
Intel MKL is available on the cluster:
```console
$ ml av imkl
$ ml imkl
```
The module sets up the environment variables required for linking and running MKL-enabled applications. The most important variables are $MKLROOT, $CPATH, $LD_LIBRARY_PATH, and $MKL_EXAMPLES.
The Intel MKL library may be linked using any compiler. With the Intel compiler, use the -mkl option to link the default threaded MKL.
### Interfaces
The Intel MKL library provides a number of interfaces. The fundamental ones are LP64 and ILP64. The Intel MKL ILP64 libraries use the 64-bit integer type (necessary for indexing large arrays with more than 2^31-1 elements), whereas the LP64 libraries index arrays with the 32-bit integer type.
| Interface | Integer type |
| --------- | -------------------------------------------- |
| LP64 | 32-bit, int, integer(kind=4), MPI_INT |
| ILP64 | 64-bit, long int, integer(kind=8), MPI_INT64 |
### Linking
Linking Intel MKL libraries may be complex. Intel [mkl link line advisor](http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor) helps. See also [examples](#examples) below.
You will need the mkl module loaded to run the MKL-enabled executable. This may be avoided by compiling the library search paths into the executable. Include rpath on the compile line:
```console
$ icc .... -Wl,-rpath=$LIBRARY_PATH ...
```
### Threading
An advantage of using the Intel MKL library is that it brings threaded parallelization to applications that are otherwise not parallel.
For this to work, the application must link the threaded MKL library (the default). The number and behaviour of MKL threads may be controlled via the OpenMP environment variables, such as OMP_NUM_THREADS and KMP_AFFINITY. MKL_NUM_THREADS takes precedence over OMP_NUM_THREADS.
```console
$ export OMP_NUM_THREADS=24 # 16 for Anselm
$ export KMP_AFFINITY=granularity=fine,compact,1,0
```
The application will run with 24 threads with affinity optimized for fine grain parallelization.
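If you need to control the number of MKL threads independently of other OpenMP code, you may set MKL_NUM_THREADS explicitly, for example:
```console
$ export MKL_NUM_THREADS=24
```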
## Examples
A number of examples demonstrating the use of the Intel MKL library and its linking are available on the clusters, in the $MKL_EXAMPLES directory. In the examples below, we demonstrate linking Intel MKL to an Intel and a GNU compiled program for multi-threaded matrix multiplication.
### Working With Examples
```console
$ ml intel
$ ml imkl
$ cp -a $MKL_EXAMPLES/cblas /tmp/
$ cd /tmp/cblas
$ make sointel64 function=cblas_dgemm
```
In this example, we compile, link, and run the cblas_dgemm example, demonstrating the use of the MKL example suite installed on the clusters.
### Example: MKL and Intel Compiler
```console
$ ml intel
$ ml imkl
$ cp -a $MKL_EXAMPLES/cblas /tmp/
$ cd /tmp/cblas
$
$ icc -w source/cblas_dgemmx.c source/common_func.c -mkl -o cblas_dgemmx.x
$ ./cblas_dgemmx.x data/cblas_dgemmx.d
```
In this example, we compile, link, and run the cblas_dgemm example, demonstrating the use of MKL with the icc -mkl option. Using the -mkl option is equivalent to:
```console
$ icc -w source/cblas_dgemmx.c source/common_func.c -o cblas_dgemmx.x -I$MKL_INC_DIR -L$MKL_LIB_DIR -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5
```
In this example, we compile and link the cblas_dgemm example, using the LP64 interface to threaded MKL and the Intel OMP threads implementation.
### Example: Intel MKL and GNU Compiler
```console
$ ml GCC
$ ml imkl
$ cp -a $MKL_EXAMPLES/cblas /tmp/
$ cd /tmp/cblas
$ gcc -w source/cblas_dgemmx.c source/common_func.c -o cblas_dgemmx.x -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lm
$ ./cblas_dgemmx.x data/cblas_dgemmx.d
```
In this example, we compile, link, and run the cblas_dgemm example, using the LP64 interface to threaded MKL and the GNU OMP threads implementation.
## MKL and MIC Accelerators
The Intel MKL is capable of automatically offloading the computations to the MIC accelerator. See the [Intel Xeon Phi](/software/intel/intel-xeon-phi-salomon/) section for details.
## LAPACKE C Interface
MKL includes the LAPACKE C interface to LAPACK. For some reason, although Intel is the author of LAPACKE, the LAPACKE header files are not present in MKL. For this reason, we have prepared a LAPACKE module, which includes Intel's LAPACKE headers from the official LAPACK and which you can use to compile code using the LAPACKE interface against MKL.
## Further Reading
Read more on the [Intel website](http://software.intel.com/en-us/intel-mkl), in particular the [MKL user's guide](https://software.intel.com/en-us/intel-mkl/documentation/linux).
# Intel Parallel Studio
The Salomon cluster provides the following elements of the Intel Parallel Studio XE:
* Intel Compilers
* Intel Debugger
* Intel MKL Library
* Intel Integrated Performance Primitives Library
* Intel Threading Building Blocks Library
* Intel Trace Analyzer and Collector
* Intel Advisor
* Intel Inspector
## Intel Compilers
The Intel compilers are available via the intel module. The compilers include the icc C and C++ compiler and the ifort Fortran 77/90/95 compiler.
```console
$ ml intel
$ icc -v
$ ifort -v
```
Read more at the [Intel Compilers](/software/intel/intel-suite/intel-compilers/) page.
## Intel Debugger
IDB is no longer available since Intel Parallel Studio 2015.
The Intel debugger version 13.0 is available via the intel module. The debugger works for applications compiled with the C and C++ compiler and the ifort Fortran 77/90/95 compiler. The debugger provides a Java GUI environment.
```console
$ ml intel
$ idb
```
Read more at the [Intel Debugger](/software/intel/intel-suite/intel-debugger/) page.
## Intel Math Kernel Library
Intel Math Kernel Library (Intel MKL) is a library of math kernel subroutines, extensively threaded and optimized for maximum performance. Intel MKL unites and provides these basic components: BLAS, LAPACK, ScaLapack, PARDISO, FFT, VML, VSL, Data fitting, Feast Eigensolver and many more.
```console
$ ml imkl
```
Read more at the [Intel MKL](/software/intel/intel-suite/intel-mkl/) page.
## Intel Integrated Performance Primitives
Intel Integrated Performance Primitives, version 7.1.1, compiled for AVX, is available via the ipp module. IPP is a library of highly optimized algorithmic building blocks for media and data applications. This includes signal, image, and frame processing algorithms, such as FFT, FIR, convolution, optical flow, Hough transform, sum, min/max, and many more.
```console
$ ml ipp
```
Read more at the [Intel IPP](/software/intel/intel-suite/intel-integrated-performance-primitives/) page.
## Intel Threading Building Blocks
Intel Threading Building Blocks (Intel TBB) is a library that supports scalable parallel programming using standard ISO C++ code. It does not require special languages or compilers. It is designed to promote scalable data parallel programming. Additionally, it fully supports nested parallelism, so you can build larger parallel components from smaller parallel components. To use the library, you specify tasks, not threads, and let the library map tasks onto threads in an efficient manner.
```console
$ ml tbb
```
Read more at the [Intel TBB](/software/intel/intel-suite/intel-tbb/) page.
# Intel TBB
## Intel Threading Building Blocks
Intel Threading Building Blocks (Intel TBB) is a library that supports scalable parallel programming using standard ISO C++ code. It does not require special languages or compilers. To use the library, you specify tasks, not threads, and let the library map tasks onto threads in an efficient manner. The tasks are executed by a runtime scheduler and may be offloaded to [MIC accelerator](/software/intel//intel-xeon-phi-salomon/).
Intel TBB is available on the cluster:
```console
$ ml av tbb
```
The module sets up the environment variables required for linking and running TBB-enabled applications.
Link the TBB library using -ltbb.
## Examples
A number of examples demonstrating the use of TBB and its built-in scheduler are available on Anselm, in the $TBB_EXAMPLES directory.
```console
$ ml intel
$ ml tbb
$ cp -a $TBB_EXAMPLES/common $TBB_EXAMPLES/parallel_reduce /tmp/
$ cd /tmp/parallel_reduce/primes
$ icc -O2 -DNDEBUG -o primes.x main.cpp primes.cpp -ltbb
$ ./primes.x
```
In this example, we compile, link, and run the primes example, demonstrating the use of parallel task-based reduction in the computation of prime numbers.
You will need the tbb module loaded to run the TBB-enabled executable. This may be avoided by compiling the library search paths into the executable.
```console
$ icc -O2 -o primes.x main.cpp primes.cpp -Wl,-rpath=$LIBRARY_PATH -ltbb
```
## Further Reading
Read more on the Intel website: [http://software.intel.com/sites/products/documentation/doclib/tbb_sa/help/index.htm](http://software.intel.com/sites/products/documentation/doclib/tbb_sa/help/index.htm)
# Intel Trace Analyzer and Collector
Intel Trace Analyzer and Collector (ITAC) is a tool to collect and graphically analyze the behavior of MPI applications. It helps you to analyze the communication patterns of your application, identify hotspots, perform correctness checking (identify deadlocks, data corruption, etc.), and simulate how your application would run on a different interconnect.
ITAC is an offline analysis tool: first you run your application to collect a trace file, then you can open the trace in a GUI analyzer to view it.
## Installed Version
Version 9.1.2.024 is currently available on Salomon as the module itac/9.1.2.024.
## Collecting Traces
ITAC can collect traces from applications that are using Intel MPI. To generate a trace, simply add the -trace option to your mpirun command:
```console
$ ml itac/9.1.2.024
$ mpirun -trace myapp
```
The trace will be saved in the file myapp.stf in the current directory.
## Viewing Traces
To view and analyze the trace, open ITAC GUI in a [graphical environment](/general/accessing-the-clusters/graphical-user-interface/x-window-system/):
```console
$ ml itac/9.1.2.024
$ traceanalyzer
```
The GUI will launch and you can open the produced `*`.stf file.
![](../../../img/Snmekobrazovky20151204v15.35.12.png)
Please refer to the Intel documentation about usage of the GUI tool.
## References
1. [Getting Started with Intel® Trace Analyzer and Collector](https://software.intel.com/en-us/get-started-with-itac-for-linux)
1. [Intel® Trace Analyzer and Collector - Documentation](https://software.intel.com/en-us/intel-trace-analyzer)
# Intel Xeon Phi
## Guide to Intel Xeon Phi Usage
Intel Xeon Phi can be programmed in several modes. The default mode on Anselm is offload mode, but all modes described in this document are supported.
## Intel Utilities for Xeon Phi
To get access to a compute node with an Intel Xeon Phi accelerator, use the PBS interactive session:
```console
$ qsub -I -q qmic -A NONE-0-0
```
To set up the environment, the module "intel" has to be loaded:
```console
$ ml intel
```
Information about the hardware can be obtained by running the micinfo program on the host.
```console
$ /usr/bin/micinfo
```
The output of the "micinfo" utility executed on one of the Anselm node is as follows. (note: to get PCIe related details the command has to be run with root privileges)
```console
MicInfo Utility Log
Created Wed Sep 13 13:44:14 2017
System Info
HOST OS : Linux
OS Version : 2.6.32-696.3.2.el6.Bull.120.x86_64
Driver Version : 3.4.9-1
MPSS Version : 3.4.9
Host Physical Memory : 98836 MB
Device No: 0, Device Name: mic0
Version
Flash Version : 2.1.02.0391
SMC Firmware Version : 1.17.6900
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.4.9
Device Serial Number : ADKC30102489
Board
Vendor ID : 0x8086
Device ID : 0x2250
Subsystem ID : 0x2500
Coprocessor Stepping ID : 3
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : B1
Board SKU : B1PRQ-5110P/5120D
ECC Mode : Enabled
SMC HW Revision : Product 225W Passive CS
Cores
Total No of Active Cores : 60
Voltage : 1009000 uV
Frequency : 1052631 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 53 C
GDDR
GDDR Vendor : Elpida
GDDR Version : 0x1
GDDR Density : 2048 Mb
GDDR Size : 7936 MB
GDDR Technology : GDDR5
GDDR Speed : 5.000000 GT/s
GDDR Frequency : 2500000 kHz
GDDR Voltage : 1501000 uV
```
## Offload Mode
To compile code for the Intel Xeon Phi, the MPSS stack has to be installed on the machine where the compilation is executed. Currently, the MPSS stack is only installed on compute nodes equipped with accelerators.
```console
$ qsub -I -q qmic -A NONE-0-0
$ ml intel
```
For debugging purposes, it is also recommended to set the environment variable "OFFLOAD_REPORT". The value can be set from 0 to 3, where a higher number means more debugging information.
```console
export OFFLOAD_REPORT=3
```
A very basic example of code that employs the offload programming technique is shown in the next listing.
!!! note
This code is sequential and utilizes only a single core of the accelerator.
```cpp
$ vim source-offload.cpp

#include <iostream>

int main(int argc, char* argv[])
{
    const int niter = 100000;
    double result = 0;

    #pragma offload target(mic)
    for (int i = 0; i < niter; ++i) {
        const double t = (i + 0.5) / niter;
        result += 4.0 / (t * t + 1.0);
    }
    result /= niter;
    std::cout << "Pi ~ " << result << '\n';
}
```
To compile the code using the Intel compiler, run:
```console
$ icc source-offload.cpp -o bin-offload
```
To execute the code, run the following command on the host:
```console
$ ./bin-offload
```
### Parallelization in Offload Mode Using OpenMP
One way of parallelizing a code for the Xeon Phi is using OpenMP directives. The following example shows code for parallel vector addition.
```cpp
$ vim ./vect-add

#include <stdio.h>
#include <stdlib.h>

typedef int T;

#define SIZE 1000

#pragma offload_attribute(push, target(mic))
T in1[SIZE];
T in2[SIZE];
T res[SIZE];
#pragma offload_attribute(pop)

// MIC function to add two vectors
__attribute__((target(mic))) void add_mic(T *a, T *b, T *c, int size) {
    int i = 0;
    #pragma omp parallel for
    for (i = 0; i < size; i++)
        c[i] = a[i] + b[i];
}

// CPU function to add two vectors
void add_cpu(T *a, T *b, T *c, int size) {
    int i;
    for (i = 0; i < size; i++)
        c[i] = a[i] + b[i];
}

// CPU function to generate a vector of random numbers
void random_T(T *a, int size) {
    int i;
    for (i = 0; i < size; i++)
        a[i] = rand() % 10000; // random number between 0 and 9999
}

// CPU function to compare two vectors
int compare(T *a, T *b, T size) {
    int pass = 0;
    int i;
    for (i = 0; i < size; i++) {
        if (a[i] != b[i]) {
            printf("Value mismatch at location %d, values %d and %d\n", i, a[i], b[i]);
            pass = 1;
        }
    }
    if (pass == 0) printf("Test passed\n"); else printf("Test Failed\n");
    return pass;
}

int main()
{
    int i;
    random_T(in1, SIZE);
    random_T(in2, SIZE);

    #pragma offload target(mic) in(in1,in2) inout(res)
    {
        // Parallel loop from main function
        #pragma omp parallel for
        for (i = 0; i < SIZE; i++)
            res[i] = in1[i] + in2[i];

        // or parallel loop is called inside the function
        add_mic(in1, in2, res, SIZE);
    }

    // Check the results with the CPU implementation
    T res_cpu[SIZE];
    add_cpu(in1, in2, res_cpu, SIZE);
    compare(res, res_cpu, SIZE);
}
```
During the compilation, the Intel compiler shows which loops have been vectorized in both the host and the accelerator code. This can be enabled with the compiler option "-vec-report2". To compile and execute the code, run:
```console
$ icc vect-add.c -openmp_report2 -vec-report2 -o vect-add
$ ./vect-add
```
Some interesting compiler flags useful not only for code debugging are:
!!! note
Debugging
openmp_report[0|1|2] - controls the OpenMP parallelizer diagnostic level
vec-report[0|1|2] - controls the compiler-based vectorization diagnostic level
Performance optimization
xhost - FOR HOST ONLY - generates AVX (Advanced Vector Extensions) instructions.
## Automatic Offload Using Intel MKL Library
Intel MKL includes an Automatic Offload (AO) feature that enables computationally intensive MKL functions called in user code to benefit from attached Intel Xeon Phi coprocessors automatically and transparently.
The behavior of the automatic offload mode is controlled by functions called within the program or by environment variables. A complete list of controls is listed [here](http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/GUID-3DC4FC7D-A1E4-423D-9C0C-06AB265FFA86.htm).
The Automatic Offload may be enabled by either an MKL function call within the code:
```cpp
mkl_mic_enable();
```
or by setting an environment variable:
```console
$ export MKL_MIC_ENABLE=1
```
To get more information about automatic offload refer to "[Using Intel® MKL Automatic Offload on Intel ® Xeon Phi™ Coprocessors](http://software.intel.com/sites/default/files/11MIC42_How_to_Use_MKL_Automatic_Offload_0.pdf)" white paper or [Intel MKL documentation](https://software.intel.com/en-us/articles/intel-math-kernel-library-documentation).
### Automatic Offload Example
First, get an interactive PBS session on a node with a MIC accelerator and load the "intel" module, which automatically loads the "mkl" module as well.
```console
$ qsub -I -q qmic -A OPEN-0-0 -l select=1:ncpus=16
$ ml intel
```
The following example shows how to automatically offload an SGEMM (single precision general matrix multiply) function to the MIC coprocessor. The code can be copied to a file and compiled without any necessary modification.
```cpp
$ vim sgemm-ao-short.c

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <stdint.h>

#include "mkl.h"

int main(int argc, char **argv)
{
    float *A, *B, *C;                /* Matrices */
    MKL_INT N = 2560;                /* Matrix dimensions */
    MKL_INT LD = N;                  /* Leading dimension */
    int matrix_bytes;                /* Matrix size in bytes */
    int matrix_elements;             /* Matrix size in elements */

    float alpha = 1.0, beta = 1.0;   /* Scaling factors */
    char transa = 'N', transb = 'N'; /* Transposition options */

    int i, j;                        /* Counters */

    matrix_elements = N * N;
    matrix_bytes = sizeof(float) * matrix_elements;

    /* Allocate the matrices */
    A = malloc(matrix_bytes); B = malloc(matrix_bytes); C = malloc(matrix_bytes);

    /* Initialize the matrices */
    for (i = 0; i < matrix_elements; i++) {
        A[i] = 1.0; B[i] = 2.0; C[i] = 0.0;
    }

    printf("Computing SGEMM on the host\n");
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);

    printf("Enabling Automatic Offload\n");
    /* Alternatively, set environment variable MKL_MIC_ENABLE=1 */
    mkl_mic_enable();

    int ndevices = mkl_mic_get_device_count(); /* Number of MIC devices */
    printf("Automatic Offload enabled: %d MIC devices present\n", ndevices);

    printf("Computing SGEMM with automatic workdivision\n");
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);

    /* Free the matrix memory */
    free(A); free(B); free(C);

    printf("Done\n");

    return 0;
}
```
!!! note
This example is a simplified version of an example from MKL. The expanded version can be found here: `$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c`.
To compile the code using the Intel compiler, use:
```console
$ icc -mkl sgemm-ao-short.c -o sgemm
```
For debugging purposes, enable the offload report to see more information about automatic offloading:
```console
$ export OFFLOAD_REPORT=2
```
The output of the code should look similar to the following listing, where lines starting with [MKL] are generated by offload reporting:
```console
Computing SGEMM on the host
Enabling Automatic Offload
Automatic Offload enabled: 1 MIC devices present
Computing SGEMM with automatic workdivision
[MKL] [MIC --] [AO Function] SGEMM
[MKL] [MIC --] [AO SGEMM Workdivision] 0.00 1.00
[MKL] [MIC 00] [AO SGEMM CPU Time] 0.463351 seconds
[MKL] [MIC 00] [AO SGEMM MIC Time] 0.179608 seconds
[MKL] [MIC 00] [AO SGEMM CPU->MIC Data] 52428800 bytes
[MKL] [MIC 00] [AO SGEMM MIC->CPU Data] 26214400 bytes
Done
```
## Native Mode
In native mode, a program is executed directly on the Intel Xeon Phi without involvement of the host machine. Similarly to offload mode, the code is compiled on the host computer with the Intel compilers.
To compile the code, the user has to be connected to a compute node with a MIC and load the Intel compilers module. To get an interactive session on a compute node with an Intel Xeon Phi and load the module, use the following commands:
```console
$ qsub -I -q qmic -A NONE-0-0
$ ml intel
```
!!! note
A particular version of the Intel module is specified. This information is used later to specify the correct library paths.
To produce a binary compatible with the Intel Xeon Phi architecture, the user has to specify the "-mmic" compiler flag. Two compilation examples are shown below. The first example shows how to compile the OpenMP parallel code "vect-add.c" for the host only:
```console
$ icc -xhost -no-offload -fopenmp vect-add.c -o vect-add-host
```
To run this code on host, use:
```console
$ ./vect-add-host
```
The second example shows how to compile the same code for Intel Xeon Phi:
```console
$ icc -mmic -fopenmp vect-add.c -o vect-add-mic
```
### Execution of the Program in Native Mode on Intel Xeon Phi
The user access to the Intel Xeon Phi is through the SSH. Since user home directories are mounted using NFS on the accelerator, users do not have to copy binary files or libraries between the host and accelerator.
To connect to the accelerator run:
```console
$ ssh mic0
```
If the code is sequential, it can be executed directly:
```console
mic0 $ ~/path_to_binary/vect-add-seq-mic
```
If the code is parallelized using OpenMP, a set of additional libraries is required for execution. To locate these libraries, a new path has to be added to the LD_LIBRARY_PATH environment variable prior to execution:
```console
mic0 $ export LD_LIBRARY_PATH=/apps/intel/composer_xe_2013.5.192/compiler/lib/mic:$LD_LIBRARY_PATH
```
!!! note
The path exported in the previous example contains the path to a specific compiler version (here 2013.5.192). This version has to match the version of the Intel compiler module that was used to compile the code on the host computer.
For reference, the libraries and their location required for execution of an OpenMP parallel code on Intel Xeon Phi are:
!!! note
/apps/intel/composer_xe_2013.5.192/compiler/lib/mic
- libiomp5.so
- libimf.so
- libsvml.so
- libirng.so
- libintlc.so.5
Finally, to run the compiled code use:
```console
$ ~/path_to_binary/vect-add-mic
```
## OpenCL
OpenCL (Open Computing Language) is an open standard for general-purpose parallel programming for diverse mix of multi-core CPUs, GPU coprocessors, and other parallel processors. OpenCL provides a flexible execution model and uniform programming environment for software developers to write portable code for systems running on both the CPU and graphics processors or accelerators like the Intel® Xeon Phi.
On Anselm OpenCL is installed only on compute nodes with MIC accelerator, therefore OpenCL code can be compiled only on these nodes.
```console
ml opencl-sdk opencl-rt
```
Always load "opencl-sdk" (providing devel files like headers) and "opencl-rt" (providing dynamic library libOpenCL.so) modules to compile and link OpenCL code. Load "opencl-rt" for running your compiled code.
There are two basic examples of OpenCL code in the following directory:
```console
/apps/intel/opencl-examples/
```
First example "CapsBasic" detects OpenCL compatible hardware, here CPU and MIC, and prints basic information about the capabilities of it.
```console
/apps/intel/opencl-examples/CapsBasic/capsbasic
```
To compile and run the example, copy it to your home directory, get a PBS interactive session on one of the nodes with MIC and run make for compilation. The Makefiles are very basic and show how the OpenCL code can be compiled on Anselm.
```console
$ cp /apps/intel/opencl-examples/CapsBasic/* .
$ qsub -I -q qmic -A NONE-0-0
$ make
```
The compilation command for this example is:
```console
$ g++ capsbasic.cpp -lOpenCL -o capsbasic -I/apps/intel/opencl/include/
```
After executing the compiled binary file, the following output should be displayed.
```console
$ ./capsbasic
Number of available platforms: 1
Platform names:
[0] Intel(R) OpenCL [Selected]
Number of devices available for each type:
CL_DEVICE_TYPE_CPU: 1
CL_DEVICE_TYPE_GPU: 0
CL_DEVICE_TYPE_ACCELERATOR: 1
** Detailed information for each device ***
CL_DEVICE_TYPE_CPU[0]
CL_DEVICE_NAME: Intel(R) Xeon(R) CPU E5-2470 0 @ 2.30GHz
CL_DEVICE_AVAILABLE: 1
...
CL_DEVICE_TYPE_ACCELERATOR[0]
CL_DEVICE_NAME: Intel(R) Many Integrated Core Acceleration Card
CL_DEVICE_AVAILABLE: 1
...
```
!!! note
More information about this example can be found on Intel website: <http://software.intel.com/en-us/vcsource/samples/caps-basic/>
The second example that can be found in the "/apps/intel/opencl-examples" directory is a General Matrix Multiply. You can follow the same procedure to copy the example to your directory and compile it.
```console
$ cp -r /apps/intel/opencl-examples/* .
$ qsub -I -q qmic -A NONE-0-0
$ cd GEMM
$ make
```
The compilation command for this example is:
```console
$ g++ cmdoptions.cpp gemm.cpp ../common/basic.cpp ../common/cmdparser.cpp ../common/oclobject.cpp -I../common -lOpenCL -o gemm -I/apps/intel/opencl/include/
```
To see the performance of Intel Xeon Phi performing the DGEMM run the example as follows:
```console
./gemm -d 1
Platforms (1):
[0] Intel(R) OpenCL [Selected]
Devices (2):
[0] Intel(R) Xeon(R) CPU E5-2470 0 @ 2.30GHz
[1] Intel(R) Many Integrated Core Acceleration Card [Selected]
Build program options: "-DT=float -DTILE_SIZE_M=1 -DTILE_GROUP_M=16 -DTILE_SIZE_N=128 -DTILE_GROUP_N=1 -DTILE_SIZE_K=8"
Running gemm_nn kernel with matrix size: 3968x3968
Memory row stride to ensure necessary alignment: 15872 bytes
Size of memory region for one matrix: 62980096 bytes
Using alpha = 0.57599 and beta = 0.872412
...
Host time: 0.292953 sec.
Host perf: 426.635 GFLOPS
Host time: 0.293334 sec.
Host perf: 426.081 GFLOPS
...
```
!!! warning
GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load Intel compiler module.
## MPI
### Environment Setup and Compilation
Again an MPI code for Intel Xeon Phi has to be compiled on a compute node with accelerator and MPSS software stack installed. To get to a compute node with accelerator use:
```console
$ qsub -I -q qmic -A NONE-0-0
```
The only supported implementation of the MPI standard for Intel Xeon Phi is Intel MPI. To set up a fully functional development environment, a combination of the Intel compiler and Intel MPI has to be used. On the host, load the following modules before compilation:
```console
$ ml intel
```
To compile an MPI code for host use:
```console
$ mpiicc -xhost -o mpi-test mpi-test.c
```
To compile the same code for Intel Xeon Phi architecture use:
```console
$ mpiicc -mmic -o mpi-test-mic mpi-test.c
```
An example of a basic MPI "hello-world" program in C, which can be executed on both the host and the Xeon Phi, is shown below (it can be directly copied and pasted into a .c file):
```cpp
#include <stdio.h>
#include <mpi.h>
int main (int argc, char *argv[])
{
int rank, size;
int len;
char node[MPI_MAX_PROCESSOR_NAME];
MPI_Init (&argc, &argv); /* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
MPI_Get_processor_name(node,&len);
printf( "Hello world from process %d of %d on host %s n", rank, size, node );
MPI_Finalize();
return 0;
}
```
### MPI Programming Models
Intel MPI for the Xeon Phi coprocessors offers different MPI programming models:
!!! note
**Host-only model** - all MPI ranks reside on the host. The coprocessors can be used by using offload pragmas. (Using MPI calls inside offloaded code is not supported.)
**Coprocessor-only model** - all MPI ranks reside only on the coprocessors.
**Symmetric model** - the MPI ranks reside on both the host and the coprocessor. Most general MPI case.
### Host-Only Model
In this case all environment variables are set by modules, so to execute the compiled MPI program on a single node, use:
```console
$ mpirun -np 4 ./mpi-test
```
The output should be similar to:
```console
Hello world from process 1 of 4 on host cn207
Hello world from process 3 of 4 on host cn207
Hello world from process 2 of 4 on host cn207
Hello world from process 0 of 4 on host cn207
```
### Coprocessor-Only Model
There are two ways to execute an MPI code on a single coprocessor: 1.) launch the program using "**mpirun**" from the coprocessor; or 2.) launch the task using "**mpiexec.hydra**" from the host.
#### Execution on Coprocessor
Similarly to the execution of OpenMP programs in native mode, since environment modules are not supported on the MIC, the user has to set up paths to Intel MPI libraries and binaries manually. A one-time setup can be done by creating a "**.profile**" file in the user's home directory. This file sets up the environment on the MIC automatically once the user accesses the accelerator through SSH.
```console
$ vim ~/.profile
PS1='[\u@\h \W]\$ '
export PATH=/usr/bin:/usr/sbin:/bin:/sbin
#OpenMP
export LD_LIBRARY_PATH=/apps/intel/composer_xe_2013.5.192/compiler/lib/mic:$LD_LIBRARY_PATH
#Intel MPI
export LD_LIBRARY_PATH=/apps/intel/impi/4.1.1.036/mic/lib/:$LD_LIBRARY_PATH
export PATH=/apps/intel/impi/4.1.1.036/mic/bin/:$PATH
```
!!! note
\* this file sets up environment variables for both the MPI and OpenMP libraries.
\* this file sets up the paths to a particular version of the Intel MPI library and a particular version of the Intel compiler. These versions have to match the loaded modules.
To access a MIC accelerator located on a node that user is currently connected to, use:
```console
$ ssh mic0
```
or, in case you need to specify a MIC accelerator on a particular node, use:
```console
$ ssh cn207-mic0
```
To run the MPI code in parallel on multiple cores of the accelerator, use:
```console
$ mpirun -np 4 ./mpi-test-mic
```
The output should be similar to:
```console
Hello world from process 1 of 4 on host cn207-mic0
Hello world from process 2 of 4 on host cn207-mic0
Hello world from process 3 of 4 on host cn207-mic0
Hello world from process 0 of 4 on host cn207-mic0
```
#### Execution on Host
If the MPI program is launched from the host instead of the coprocessor, the environment variables are not set using the ".profile" file. Therefore the user has to specify library paths from the command line when calling "mpiexec".
The first step is to tell mpiexec that the MPI program should be executed on a local accelerator, by setting the environment variable "I_MPI_MIC":
```console
$ export I_MPI_MIC=1
```
Now the MPI program can be executed as:
```console
$ mpiexec.hydra -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ -host mic0 -n 4 ~/mpi-test-mic
```
or using mpirun
```console
$ mpirun -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ -host mic0 -n 4 ~/mpi-test-mic
```
!!! note
\* the full path to the binary has to be specified (here: `~/mpi-test-mic`)
\* the `LD_LIBRARY_PATH` has to match the Intel MPI module used to compile the MPI code
The output should be again similar to:
```console
Hello world from process 1 of 4 on host cn207-mic0
Hello world from process 2 of 4 on host cn207-mic0
Hello world from process 3 of 4 on host cn207-mic0
Hello world from process 0 of 4 on host cn207-mic0
```
!!! note
`mpiexec.hydra` requires a file on the MIC filesystem. If the file is missing, contact the system administrators.
A simple test to see if the file is present is to execute:
```console
$ ssh mic0 ls /bin/pmi_proxy
/bin/pmi_proxy
```
#### Execution on Host - MPI Processes Distributed Over Multiple Accelerators on Multiple Nodes
To get access to multiple nodes with MIC accelerators, the user has to use PBS to allocate the resources. To start an interactive session that allocates 2 compute nodes = 2 MIC accelerators, run the qsub command with the following parameters:
```console
$ qsub -I -q qmic -A NONE-0-0 -l select=2:ncpus=16
$ ml intel/13.5.192 impi/4.1.1.036
```
This command connects user through ssh to one of the nodes immediately. To see the other nodes that have been allocated use:
```console
$ cat $PBS_NODEFILE
```
For example:
```console
cn204.bullx
cn205.bullx
```
This output means that PBS allocated nodes cn204 and cn205, so the user has direct access to the "**cn204-mic0**" and "**cn205-mic0**" accelerators.
!!! note
At this point user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:
- to connect to the second node : `$ ssh cn205`
- to connect to the accelerator on the first node from the first node: `$ ssh cn204-mic0` or `$ ssh mic0`
- to connect to the accelerator on the second node from the first node: `$ ssh cn205-mic0`
At this point we expect that the correct modules are loaded and the binary is compiled. For parallel execution mpiexec.hydra is used. Again, the first step is to tell mpiexec that the MPI program can be executed on MIC accelerators by setting the environment variable "I_MPI_MIC":
```console
$ export I_MPI_MIC=1
```
To launch the MPI program use:
```console
$ mpiexec.hydra -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ \
-genv I_MPI_FABRICS_LIST tcp \
-genv I_MPI_FABRICS shm:tcp \
-genv I_MPI_TCP_NETMASK=10.1.0.0/16 \
-host cn204-mic0 -n 4 ~/mpi-test-mic \
: -host cn205-mic0 -n 6 ~/mpi-test-mic
```
or using mpirun:
```console
$ mpirun -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ \
-genv I_MPI_FABRICS_LIST tcp \
-genv I_MPI_FABRICS shm:tcp \
-genv I_MPI_TCP_NETMASK=10.1.0.0/16 \
-host cn204-mic0 -n 4 ~/mpi-test-mic \
: -host cn205-mic0 -n 6 ~/mpi-test-mic
```
In this case four MPI processes are executed on accelerator cn204-mic0 and six processes are executed on accelerator cn205-mic0. The sample output (sorted after execution) is:
```console
Hello world from process 0 of 10 on host cn204-mic0
Hello world from process 1 of 10 on host cn204-mic0
Hello world from process 2 of 10 on host cn204-mic0
Hello world from process 3 of 10 on host cn204-mic0
Hello world from process 4 of 10 on host cn205-mic0
Hello world from process 5 of 10 on host cn205-mic0
Hello world from process 6 of 10 on host cn205-mic0
Hello world from process 7 of 10 on host cn205-mic0
Hello world from process 8 of 10 on host cn205-mic0
Hello world from process 9 of 10 on host cn205-mic0
```
The same way MPI program can be executed on multiple hosts:
```console
$ mpiexec.hydra -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ \
-genv I_MPI_FABRICS_LIST tcp \
-genv I_MPI_FABRICS shm:tcp \
-genv I_MPI_TCP_NETMASK=10.1.0.0/16 \
-host cn204 -n 4 ~/mpi-test \
: -host cn205 -n 6 ~/mpi-test
```
### Symmetric Model
In the symmetric mode MPI programs are executed on both the host computer(s) and the MIC accelerator(s). Since the MIC has a different architecture and requires a different binary file produced by the Intel compiler, two different files have to be compiled before the MPI program is executed.
In the previous section we have compiled two binary files, one for hosts "**mpi-test**" and one for MIC accelerators "**mpi-test-mic**". These two binaries can be executed at once using mpiexec.hydra:
```console
$ mpiexec.hydra \
-genv I_MPI_FABRICS_LIST tcp \
-genv I_MPI_FABRICS shm:tcp \
-genv I_MPI_TCP_NETMASK=10.1.0.0/16 \
-genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ \
-host cn205 -n 2 ~/mpi-test \
: -host cn205-mic0 -n 2 ~/mpi-test-mic
```
In this example the `-genv` parameters set up the required environment variables for execution. The line with `-host cn205` specifies the binary that is executed on the host (here cn205) and the last line specifies the binary that is executed on the accelerator (here cn205-mic0).
The output of the program is:
```console
Hello world from process 0 of 4 on host cn205
Hello world from process 1 of 4 on host cn205
Hello world from process 2 of 4 on host cn205-mic0
Hello world from process 3 of 4 on host cn205-mic0
```
The execution procedure can be simplified by using the mpirun command with a machine file as a parameter. The machine file contains a list of all nodes and accelerators that should be used to execute MPI processes.
An example of a machine file that uses 2 hosts (**cn205** and **cn206**) and 2 accelerators (**cn205-mic0** and **cn206-mic0**) to run 2 MPI processes on each of them:
```console
$ cat hosts_file_mix
cn205:2
cn205-mic0:2
cn206:2
cn206-mic0:2
```
In addition, if the naming convention is such that the name of the binary for the host is **"bin_name"** and the name of the binary for the accelerator is **"bin_name-mic"**, then by setting the environment variable **I_MPI_MIC_POSTFIX** to **"-mic"** the user does not have to specify the names of both binaries. In this case mpirun needs just the name of the host binary file (i.e. "mpi-test") and uses the suffix to get the name of the binary for the accelerator (i.e. "mpi-test-mic").
```console
$ export I_MPI_MIC_POSTFIX=-mic
```
To run the MPI code using mpirun and the machine file "hosts_file_mix" use:
```console
$ mpirun \
-genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ \
-genv I_MPI_FABRICS_LIST tcp \
-genv I_MPI_FABRICS shm:tcp \
-genv I_MPI_TCP_NETMASK=10.1.0.0/16 \
-machinefile hosts_file_mix \
~/mpi-test
```
A possible output of the MPI "hello-world" example executed on two hosts and two accelerators is:
```console
Hello world from process 0 of 8 on host cn204
Hello world from process 1 of 8 on host cn204
Hello world from process 2 of 8 on host cn204-mic0
Hello world from process 3 of 8 on host cn204-mic0
Hello world from process 4 of 8 on host cn205
Hello world from process 5 of 8 on host cn205
Hello world from process 6 of 8 on host cn205-mic0
Hello world from process 7 of 8 on host cn205-mic0
```
!!! note
At this point the MPI communication between MIC accelerators on different nodes uses 1Gb Ethernet only.
### Using Automatically Generated Node-Files
A set of node-files, which can be used instead of manually creating a new one every time, is generated for the user's convenience. Six node-files are generated:
!!! note
**Node-files:**
- /lscratch/${PBS_JOBID}/nodefile-cn Hosts only node-file
- /lscratch/${PBS_JOBID}/nodefile-mic MICs only node-file
- /lscratch/${PBS_JOBID}/nodefile-mix Hosts and MICs node-file
- /lscratch/${PBS_JOBID}/nodefile-cn-sn Hosts only node-file, using short names
- /lscratch/${PBS_JOBID}/nodefile-mic-sn MICs only node-file, using short names
- /lscratch/${PBS_JOBID}/nodefile-mix-sn Hosts and MICs node-file, using short names
Each host or accelerator is listed only once per file. The user has to specify how many processes should be executed per node using the `-n` parameter of the mpirun command.
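A minimal sketch, reusing the fabric and library options from the machine-file example above together with the generated mixed node-file (the `I_MPI_MIC_POSTFIX` convention is as described earlier; the `-n` value here is illustrative):

```console
$ export I_MPI_MIC=1
$ export I_MPI_MIC_POSTFIX=-mic
$ mpirun \
-genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ \
-genv I_MPI_FABRICS_LIST tcp \
-genv I_MPI_FABRICS shm:tcp \
-genv I_MPI_TCP_NETMASK=10.1.0.0/16 \
-machinefile /lscratch/${PBS_JOBID}/nodefile-mix \
-n 2 ~/mpi-test
```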
## Optimization
For more details about optimization techniques read Intel document [Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors](http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization "http&#x3A;//software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization")
# Intel Xeon Phi
## Guide to Intel Xeon Phi Usage
Intel Xeon Phi accelerator can be programmed in several modes. The default mode on the cluster is offload mode, but all modes described in this document are supported.
## Intel Utilities for Xeon Phi
To get access to a compute node with Intel Xeon Phi accelerator, use the PBS interactive session
```console
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
```
To set up the environment, the module "intel" has to be loaded. When no version is specified, the default version is loaded (at the time of writing, 2015b).
```console
$ ml intel
```
Information about the hardware can be obtained by running the micinfo program on the host.
```console
$ /usr/bin/micinfo
```
The output of the "micinfo" utility executed on one of the cluster node is as follows. (note: to get PCIe related details the command has to be run with root privileges)
```console
MicInfo Utility Log
Created Wed Sep 13 13:39:28 2017
System Info
HOST OS : Linux
OS Version : 2.6.32-696.3.2.el6.x86_64
Driver Version : 3.8.2-1
MPSS Version : 3.8.2
Host Physical Memory : 128838 MB
Device No: 0, Device Name: mic0
Version
Flash Version : 2.1.02.0391
SMC Firmware Version : 1.17.6900
SMC Boot Loader Version : 1.8.4326
Coprocessor OS Version : 2.6.38.8+mpss3.8.2
Device Serial Number : ADKC44601725
Board
Vendor ID : 0x8086
Device ID : 0x225c
Subsystem ID : 0x7d95
Coprocessor Stepping ID : 2
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : C0
Board SKU : C0PRQ-7120 P/A/X/D
ECC Mode : Enabled
SMC HW Revision : Product 300W Passive CS
Cores
Total No of Active Cores : 61
Voltage : 1041000 uV
Frequency : 1238095 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 50 C
GDDR
GDDR Vendor : Samsung
GDDR Version : 0x6
GDDR Density : 4096 Mb
GDDR Size : 15872 MB
GDDR Technology : GDDR5
GDDR Speed : 5.500000 GT/s
GDDR Frequency : 2750000 kHz
GDDR Voltage : 1501000 uV
Device No: 1, Device Name: mic1
Version
Flash Version : 2.1.02.0391
SMC Firmware Version : 1.17.6900
SMC Boot Loader Version : 1.8.4326
Coprocessor OS Version : 2.6.38.8+mpss3.8.2
Device Serial Number : ADKC44601893
Board
Vendor ID : 0x8086
Device ID : 0x225c
Subsystem ID : 0x7d95
Coprocessor Stepping ID : 2
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : C0
Board SKU : C0PRQ-7120 P/A/X/D
ECC Mode : Enabled
SMC HW Revision : Product 300W Passive CS
Cores
Total No of Active Cores : 61
Voltage : 1053000 uV
Frequency : 1238095 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 48 C
GDDR
GDDR Vendor : Samsung
GDDR Version : 0x6
GDDR Density : 4096 Mb
GDDR Size : 15872 MB
GDDR Technology : GDDR5
GDDR Speed : 5.500000 GT/s
GDDR Frequency : 2750000 kHz
GDDR Voltage : 1501000 uV
```
## Offload Mode
To compile a code for the Intel Xeon Phi, an MPSS stack has to be installed on the machine where the compilation is executed. Currently the MPSS stack is only installed on compute nodes equipped with accelerators.
```console
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
$ ml intel
```
For debugging purposes it is also recommended to set the environment variable "OFFLOAD_REPORT". The value can be set from 0 to 3, where a higher number means more debugging information.
```console
export OFFLOAD_REPORT=3
```
A very basic example of code that employs the offload programming technique is shown in the next listing. Please note that this code is sequential and utilizes only a single core of the accelerator.
```cpp
$ cat source-offload.cpp
#include <iostream>
int main(int argc, char* argv[])
{
const int niter = 100000;
double result = 0;
#pragma offload target(mic)
for (int i = 0; i < niter; ++i) {
const double t = (i + 0.5) / niter;
result += 4.0 / (t * t + 1.0);
}
result /= niter;
std::cout << "Pi ~ " << result << '\n';
}
```
To compile a code using Intel compiler run
```console
$ icc source-offload.cpp -o bin-offload
```
To execute the code, run the following command on the host
```console
$ ./bin-offload
```
### Parallelization in Offload Mode Using OpenMP
One way of parallelizing a code for the Xeon Phi is using OpenMP directives. The following example shows code for parallel vector addition.
```cpp
$ cat ./vect-add
#include <stdio.h>
typedef int T;
#define SIZE 1000
#pragma offload_attribute(push, target(mic))
T in1[SIZE];
T in2[SIZE];
T res[SIZE];
#pragma offload_attribute(pop)
// MIC function to add two vectors
__attribute__((target(mic))) void add_mic(T *a, T *b, T *c, int size) {
int i = 0;
#pragma omp parallel for
for (i = 0; i < size; i++)
c[i] = a[i] + b[i];
}
// CPU function to add two vectors
void add_cpu (T *a, T *b, T *c, int size) {
int i;
for (i = 0; i < size; i++)
c[i] = a[i] + b[i];
}
// CPU function to generate a vector of random numbers
void random_T (T *a, int size) {
int i;
for (i = 0; i < size; i++)
a[i] = rand() % 10000; // random number between 0 and 9999
}
// CPU function to compare two vectors
int compare(T *a, T *b, T size ){
int pass = 0;
int i;
for (i = 0; i < size; i++){
if (a[i] != b[i]) {
printf("Value mismatch at location %d, values %d and %dn",i, a[i], b[i]);
pass = 1;
}
}
if (pass == 0) printf ("Test passedn"); else printf ("Test Failedn");
return pass;
}
int main()
{
int i;
random_T(in1, SIZE);
random_T(in2, SIZE);
#pragma offload target(mic) in(in1,in2) inout(res)
{
// Parallel loop from main function
#pragma omp parallel for
for (i=0; i<SIZE; i++)
res[i] = in1[i] + in2[i];
// or parallel loop is called inside the function
add_mic(in1, in2, res, SIZE);
}
//Check the results with CPU implementation
T res_cpu[SIZE];
add_cpu(in1, in2, res_cpu, SIZE);
compare(res, res_cpu, SIZE);
}
```
During the compilation the Intel compiler reports which loops have been vectorized, for both the host and the accelerator. This is enabled with the compiler option "-vec-report2". To compile and execute the code, run:
```console
$ icc vect-add.c -openmp_report2 -vec-report2 -o vect-add
$ ./vect-add
```
Some interesting compiler flags useful not only for code debugging are:
!!! note
Debugging
openmp_report[0|1|2] - controls the OpenMP parallelizer diagnostic level
vec-report[0|1|2] - controls the compiler based vectorization diagnostic level
Performance optimization
xhost - FOR HOST ONLY - to generate AVX (Advanced Vector Extensions) instructions.
## Automatic Offload Using Intel MKL Library
Intel MKL includes an Automatic Offload (AO) feature that enables computationally intensive MKL functions called in user code to benefit from attached Intel Xeon Phi coprocessors automatically and transparently.
!!! note
The behavior of the automatic offload mode is controlled by functions called within the program or by environment variables. A complete list of controls is available [here](http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/GUID-3DC4FC7D-A1E4-423D-9C0C-06AB265FFA86.htm).
The Automatic Offload may be enabled by either an MKL function call within the code:
```cpp
mkl_mic_enable();
```
or by setting environment variable
```console
$ export MKL_MIC_ENABLE=1
```
To get more information about automatic offload refer to "[Using Intel® MKL Automatic Offload on Intel ® Xeon Phi™ Coprocessors](http://software.intel.com/sites/default/files/11MIC42_How_to_Use_MKL_Automatic_Offload_0.pdf)" white paper or [Intel MKL documentation](https://software.intel.com/en-us/articles/intel-math-kernel-library-documentation).
### Automatic Offload Example
First, get an interactive PBS session on a node with a MIC accelerator and load the "intel" module, which automatically loads the "mkl" module as well.
```console
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
$ ml intel
```
The code can be copied to a file and compiled without any necessary modification.
```cpp
$ vim sgemm-ao-short.c
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <stdint.h>
#include "mkl.h"
int main(int argc, char **argv)
{
float *A, *B, *C; /* Matrices */
MKL_INT N = 2560; /* Matrix dimensions */
MKL_INT LD = N; /* Leading dimension */
int matrix_bytes; /* Matrix size in bytes */
int matrix_elements; /* Matrix size in elements */
float alpha = 1.0, beta = 1.0; /* Scaling factors */
char transa = 'N', transb = 'N'; /* Transposition options */
int i, j; /* Counters */
matrix_elements = N * N;
matrix_bytes = sizeof(float) * matrix_elements;
/* Allocate the matrices */
A = malloc(matrix_bytes); B = malloc(matrix_bytes); C = malloc(matrix_bytes);
/* Initialize the matrices */
for (i = 0; i < matrix_elements; i++) {
A[i] = 1.0; B[i] = 2.0; C[i] = 0.0;
}
printf("Computing SGEMM on the host\n");
sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
printf("Enabling Automatic Offload\n");
/* Alternatively, set environment variable MKL_MIC_ENABLE=1 */
mkl_mic_enable();
int ndevices = mkl_mic_get_device_count(); /* Number of MIC devices */
printf("Automatic Offload enabled: %d MIC devices present\n", ndevices);
printf("Computing SGEMM with automatic workdivision\n");
sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
/* Free the matrix memory */
free(A); free(B); free(C);
printf("Done\n");
return 0;
}
```
!!! note
This example is a simplified version of an example from MKL. The expanded version can be found here: **$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c**
To compile a code using Intel compiler use:
```console
$ icc -mkl sgemm-ao-short.c -o sgemm
```
For debugging purposes enable the offload report to see more information about automatic offloading.
```console
$ export OFFLOAD_REPORT=2
```
The output of the code should look similar to the following listing, where lines starting with [MKL] are generated by offload reporting:
```console
[user@r31u03n799 ~]$ ./sgemm
Computing SGEMM on the host
Enabling Automatic Offload
Automatic Offload enabled: 2 MIC devices present
Computing SGEMM with automatic workdivision
[MKL] [MIC --] [AO Function] SGEMM
[MKL] [MIC --] [AO SGEMM Workdivision] 0.44 0.28 0.28
[MKL] [MIC 00] [AO SGEMM CPU Time] 0.252427 seconds
[MKL] [MIC 00] [AO SGEMM MIC Time] 0.091001 seconds
[MKL] [MIC 00] [AO SGEMM CPU->MIC Data] 34078720 bytes
[MKL] [MIC 00] [AO SGEMM MIC->CPU Data] 7864320 bytes
[MKL] [MIC 01] [AO SGEMM CPU Time] 0.252427 seconds
[MKL] [MIC 01] [AO SGEMM MIC Time] 0.094758 seconds
[MKL] [MIC 01] [AO SGEMM CPU->MIC Data] 34078720 bytes
[MKL] [MIC 01] [AO SGEMM MIC->CPU Data] 7864320 bytes
Done
```
!!! note ""
The behavior of the automatic offload mode is controlled by functions called within the program or by environment variables. A complete list of controls is available [here](http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/GUID-3DC4FC7D-A1E4-423D-9C0C-06AB265FFA86.htm).
### Automatic Offload Example #2
In this example, we will demonstrate automatic offload control via the environment variable MKL_MIC_ENABLE. The function DGEMM will be offloaded.
At first get an interactive PBS session on a node with MIC accelerator.
```console
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
```
Once in, we enable the offload and run the Octave software. In Octave, we generate two large random matrices and multiply them together.
```console
$ export MKL_MIC_ENABLE=1
$ export OFFLOAD_REPORT=2
$ ml Octave/3.8.2-intel-2015b
$ octave -q
octave:1> A=rand(10000);
octave:2> B=rand(10000);
octave:3> C=A*B;
[MKL] [MIC --] [AO Function] DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision] 0.14 0.43 0.43
[MKL] [MIC 00] [AO DGEMM CPU Time] 3.814714 seconds
[MKL] [MIC 00] [AO DGEMM MIC Time] 2.781595 seconds
[MKL] [MIC 00] [AO DGEMM CPU->MIC Data] 1145600000 bytes
[MKL] [MIC 00] [AO DGEMM MIC->CPU Data] 1382400000 bytes
[MKL] [MIC 01] [AO DGEMM CPU Time] 3.814714 seconds
[MKL] [MIC 01] [AO DGEMM MIC Time] 2.843016 seconds
[MKL] [MIC 01] [AO DGEMM CPU->MIC Data] 1145600000 bytes
[MKL] [MIC 01] [AO DGEMM MIC->CPU Data] 1382400000 bytes
octave:4> exit
```
In the example above we observe that the DGEMM function workload was split over the CPU, MIC 0 and MIC 1 in the ratio 0.14 : 0.43 : 0.43. The matrix multiplication was done on the CPU, accelerated by the two Xeon Phi accelerators.
## Native Mode
In the native mode a program is executed directly on Intel Xeon Phi without involvement of the host machine. Similarly to offload mode, the code is compiled on the host computer with Intel compilers.
To compile a code, the user has to be connected to a compute node with a MIC accelerator and load the Intel compilers module. To get an interactive session on a compute node with an Intel Xeon Phi and load the module, use the following commands:
```console
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
$ ml intel
```
!!! note
Note the particular version of the Intel module that is loaded. This information is used later to specify the correct library paths.
To produce a binary compatible with the Intel Xeon Phi architecture, the user has to specify the "-mmic" compiler flag. Two compilation examples are shown below. The first example shows how to compile the OpenMP parallel code "vect-add.c" for the host only:
```console
$ icc -xhost -no-offload -fopenmp vect-add.c -o vect-add-host
```
To run this code on host, use:
```console
$ ./vect-add-host
```
The second example shows how to compile the same code for Intel Xeon Phi:
```console
$ icc -mmic -fopenmp vect-add.c -o vect-add-mic
```
### Execution of the Program in Native Mode on Intel Xeon Phi
The user access to the Intel Xeon Phi is through the SSH. Since user home directories are mounted using NFS on the accelerator, users do not have to copy binary files or libraries between the host and accelerator.
Get the path of the MIC-enabled libraries for the currently used Intel compiler (here icc/2015.3.187-GNU-5.1.0-2.25 was used):
```console
$ echo $MIC_LD_LIBRARY_PATH
/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic
```
To connect to the accelerator run:
```console
$ ssh mic0
```
If the code is sequential, it can be executed directly:
```console
mic0 $ ~/path_to_binary/vect-add-seq-mic
```
If the code is parallelized using OpenMP, a set of additional libraries is required for execution. To locate these libraries, a new path has to be added to the LD_LIBRARY_PATH environment variable prior to execution:
```console
mic0 $ export LD_LIBRARY_PATH=/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic:$LD_LIBRARY_PATH
```
!!! note
Please note that the path exported in the previous example contains path to a specific compiler (here the version is 2015.3.187-GNU-5.1.0-2.25). This version number has to match with the version number of the Intel compiler module that was used to compile the code on the host computer.
For reference, the libraries and their location required for execution of an OpenMP parallel code on Intel Xeon Phi are:
!!! note
/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic
libiomp5.so
libimf.so
libsvml.so
libirng.so
libintlc.so.5
Finally, run the compiled code on the accelerator.
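A minimal sketch, assuming the OpenMP binary was compiled as `vect-add-mic` above and that the command is issued from the accelerator (after the `ssh mic0` step):

```console
mic0 $ ~/path_to_binary/vect-add-mic
```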
## OpenCL
OpenCL (Open Computing Language) is an open standard for general-purpose parallel programming for diverse mix of multi-core CPUs, GPU coprocessors, and other parallel processors. OpenCL provides a flexible execution model and uniform programming environment for software developers to write portable code for systems running on both the CPU and graphics processors or accelerators like the Intel® Xeon Phi.
On Salomon OpenCL is installed only on compute nodes with MIC accelerator, therefore OpenCL code can be compiled only on these nodes.
```console
ml opencl-sdk opencl-rt
```
Always load "opencl-sdk" (providing devel files like headers) and "opencl-rt" (providing dynamic library libOpenCL.so) modules to compile and link OpenCL code. Load "opencl-rt" for running your compiled code.
There are two basic examples of OpenCL code in the following directory:
```console
/apps/intel/opencl-examples/
```
First example "CapsBasic" detects OpenCL compatible hardware, here CPU and MIC, and prints basic information about the capabilities of it.
```console
/apps/intel/opencl-examples/CapsBasic/capsbasic
```
To compile and run the example, copy it to your home directory, get a PBS interactive session on one of the nodes with MIC and run make for compilation. The Makefiles are very basic and show how the OpenCL code can be compiled on Salomon.
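A sketch of these steps, using the queue and resource parameters shown elsewhere in this section:

```console
$ cp /apps/intel/opencl-examples/CapsBasic/* .
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
$ make
```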
The compilation command for this example is:
```console
$ g++ capsbasic.cpp -lOpenCL -o capsbasic -I/apps/intel/opencl/include/
```
After executing the compiled binary file, the following output should be displayed.
```console
./capsbasic
Number of available platforms: 1
Platform names:
[0] Intel(R) OpenCL [Selected]
Number of devices available for each type:
CL_DEVICE_TYPE_CPU: 1
CL_DEVICE_TYPE_GPU: 0
CL_DEVICE_TYPE_ACCELERATOR: 1
** Detailed information for each device ***
CL_DEVICE_TYPE_CPU[0]
CL_DEVICE_NAME: Intel(R) Xeon(R) CPU E5-2470 0 @ 2.30GHz
CL_DEVICE_AVAILABLE: 1
...
CL_DEVICE_TYPE_ACCELERATOR[0]
CL_DEVICE_NAME: Intel(R) Many Integrated Core Acceleration Card
CL_DEVICE_AVAILABLE: 1
...
```
!!! note
More information about this example can be found on Intel website: <http://software.intel.com/en-us/vcsource/samples/caps-basic/>
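The second example in the `/apps/intel/opencl-examples` directory is a General Matrix Multiply (GEMM). It can be copied and built the same way as CapsBasic; a sketch of the steps (the GEMM subdirectory is assumed to be present in the examples directory):

```console
$ cp -r /apps/intel/opencl-examples/* .
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
$ cd GEMM
$ make
```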
To see the performance of Intel Xeon Phi performing the DGEMM run the example as follows:
```console
./gemm -d 1
Platforms (1):
[0] Intel(R) OpenCL [Selected]
Devices (2):
[0] Intel(R) Xeon(R) CPU E5-2470 0 @ 2.30GHz
[1] Intel(R) Many Integrated Core Acceleration Card [Selected]
Build program options: "-DT=float -DTILE_SIZE_M=1 -DTILE_GROUP_M=16 -DTILE_SIZE_N=128 -DTILE_GROUP_N=1 -DTILE_SIZE_K=8"
Running gemm_nn kernel with matrix size: 3968x3968
Memory row stride to ensure necessary alignment: 15872 bytes
Size of memory region for one matrix: 62980096 bytes
Using alpha = 0.57599 and beta = 0.872412
...
Host time: 0.292953 sec.
Host perf: 426.635 GFLOPS
Host time: 0.293334 sec.
Host perf: 426.081 GFLOPS
...
```
!!! hint
GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load Intel compiler module.
## MPI
### Environment Setup and Compilation
To achieve the best MPI performance, always use the following setup for Intel MPI on Xeon Phi accelerated nodes:
```console
$ export I_MPI_FABRICS=shm:dapl
$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0,ofa-v2-mcm-1
```
This ensures that MPI inside a node uses SHMEM communication, that IB SCIF is used between the host and the Phi, and that a CCL-Direct proxy is used between different nodes or between Phis on different nodes.
!!! note
Other FABRICS like tcp or ofa may be used (even combined with shm), but there is a severe loss of performance (an order of magnitude).
Usage of a single DAPL PROVIDER (e.g. I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u) will cause failure of Host<->Phi and/or Phi<->Phi communication.
Usage of I_MPI_DAPL_PROVIDER_LIST on a non-accelerated node will cause failure of any MPI communication, since those nodes do not have a SCIF device and there is no CCL-Direct proxy running.
Again an MPI code for Intel Xeon Phi has to be compiled on a compute node with accelerator and MPSS software stack installed. To get to a compute node with accelerator use:
```console
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
```
The only supported implementation of the MPI standard for Intel Xeon Phi is Intel MPI. To set up a fully functional development environment, a combination of the Intel compiler and Intel MPI has to be used. On the host, load the following modules before compilation:
```console
$ ml intel
```
To compile an MPI code for host use:
```console
$ mpiicc -xhost -o mpi-test mpi-test.c
```
To compile the same code for Intel Xeon Phi architecture use:
```console
$ mpiicc -mmic -o mpi-test-mic mpi-test.c
```
Or, if you are using Fortran :
```console
$ mpiifort -mmic -o mpi-test-mic mpi-test.f90
```
An example of a basic MPI "hello-world" program in C, which can be executed on both the host and the Xeon Phi, is shown below (it can be directly copied and pasted into a .c file):
```cpp
#include <stdio.h>
#include <mpi.h>
int main (int argc, char *argv[])
{
int rank, size;
int len;
char node[MPI_MAX_PROCESSOR_NAME];
MPI_Init (&argc, &argv); /* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
MPI_Get_processor_name(node,&len);
printf( "Hello world from process %d of %d on host %s n", rank, size, node );
MPI_Finalize();
return 0;
}
```
### MPI Programming Models
Intel MPI for the Xeon Phi coprocessors offers different MPI programming models:
!!! note
**Host-only model** - all MPI ranks reside on the host. The coprocessors can be used by using offload pragmas. (Using MPI calls inside offloaded code is not supported.)
**Coprocessor-only model** - all MPI ranks reside only on the coprocessors.
**Symmetric model** - the MPI ranks reside on both the host and the coprocessor. Most general MPI case.
### Host-Only Model
In this case all environment variables are set by modules, so to execute the compiled MPI program on a single node, use:
```console
$ mpirun -np 4 ./mpi-test
```
The output should be similar to:
```console
Hello world from process 1 of 4 on host r38u31n1000
Hello world from process 3 of 4 on host r38u31n1000
Hello world from process 2 of 4 on host r38u31n1000
Hello world from process 0 of 4 on host r38u31n1000
```
### Coprocessor-Only Model
There are two ways to execute an MPI code on a single coprocessor: 1.) launch the program using "**mpirun**" from the coprocessor; or 2.) launch the task using "**mpiexec.hydra**" from the host.
#### Execution on Coprocessor
Similarly to the execution of OpenMP programs in native mode, since environment modules are not supported on the MIC, the user has to set up paths to Intel MPI libraries and binaries manually. A one-time setup can be done by creating a "**.profile**" file in the user's home directory. This file sets up the environment on the MIC automatically once the user accesses the accelerator through SSH.
First, get the LD_LIBRARY_PATH for the currently used Intel compiler and Intel MPI:
```console
$ echo $MIC_LD_LIBRARY_PATH
/apps/all/imkl/11.2.3.187-iimpi-7.3.5-GNU-5.1.0-2.25/mkl/lib/mic:/apps/all/imkl/11.2.3.187-iimpi-7.3.5-GNU-5.1.0-2.25/lib/mic:/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic/
```
Use it in your ~/.profile:
```console
$ cat ~/.profile
PS1='[\u@\h \W]\$ '
export PATH=/usr/bin:/usr/sbin:/bin:/sbin
#IMPI
export PATH=/apps/all/impi/5.0.3.048-iccifort-2015.3.187-GNU-5.1.0-2.25/mic/bin/:$PATH
#OpenMP (ICC, IFORT), IMKL and IMPI
export LD_LIBRARY_PATH=/apps/all/imkl/11.2.3.187-iimpi-7.3.5-GNU-5.1.0-2.25/mkl/lib/mic:/apps/all/imkl/11.2.3.187-iimpi-7.3.5-GNU-5.1.0-2.25/lib/mic:/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic:$LD_LIBRARY_PATH
```
!!! note
\* this file sets up environment variables for both the MPI and OpenMP libraries.
\* this file sets up the paths to a particular version of the Intel MPI library and a particular version of the Intel compiler. These versions have to match the loaded modules.
To access a MIC accelerator located on a node that user is currently connected to, use:
```console
$ ssh mic0
```
or, in case you need to specify a MIC accelerator on a particular node, use:
```console
$ ssh r38u31n1000-mic0
```
To run the MPI code in parallel on multiple cores of the accelerator, use:
```console
$ mpirun -np 4 ./mpi-test-mic
```
The output should be similar to:
```console
Hello world from process 1 of 4 on host r38u31n1000-mic0
Hello world from process 2 of 4 on host r38u31n1000-mic0
Hello world from process 3 of 4 on host r38u31n1000-mic0
Hello world from process 0 of 4 on host r38u31n1000-mic0
```
#### Execution on Host
If the MPI program is launched from the host instead of the coprocessor, the environment variables are not set using the ".profile" file. Therefore the user has to specify library paths from the command line when calling "mpiexec".
The first step is to tell mpiexec that the MPI program should be executed on a local accelerator, by setting the environment variable "I_MPI_MIC":
```console
$ export I_MPI_MIC=1
```
Now the MPI program can be executed as:
```console
$ mpiexec.hydra -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH -host mic0 -n 4 ~/mpi-test-mic
```
or using mpirun
```console
$ mpirun -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH -host mic0 -n 4 ~/mpi-test-mic
```
!!! note
\* the full path to the binary has to be specified (here: "**~/mpi-test-mic**")
\* the LD_LIBRARY_PATH has to match the Intel MPI module used to compile the MPI code
The output should be again similar to:
```console
Hello world from process 1 of 4 on host r38u31n1000-mic0
Hello world from process 2 of 4 on host r38u31n1000-mic0
Hello world from process 3 of 4 on host r38u31n1000-mic0
Hello world from process 0 of 4 on host r38u31n1000-mic0
```
!!! hint
**"mpiexec.hydra"** requires a file the MIC filesystem. If the file is missing contact the system administrators.
A simple test to see if the file is present is to execute:
```console
$ ssh mic0 ls /bin/pmi_proxy
/bin/pmi_proxy
```
#### Execution on Host - MPI Processes Distributed Over Multiple Accelerators on Multiple Nodes
To get access to multiple nodes with MIC accelerators, the user has to use PBS to allocate the resources. To start an interactive session that allocates 2 compute nodes = 2 MIC accelerators, run the qsub command with the following parameters:
```console
$ qsub -I -q qprod -l select=2:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
$ ml intel impi
```
This command connects user through ssh to one of the nodes immediately. To see the other nodes that have been allocated use:
```console
$ cat $PBS_NODEFILE
```
For example:
```console
r25u25n710.ib0.smc.salomon.it4i.cz
r25u26n711.ib0.smc.salomon.it4i.cz
```
This output means that PBS allocated nodes r25u25n710 and r25u26n711, so the user has direct access to the "**r25u25n710-mic0**" and "**r25u26n711-mic0**" accelerators.
!!! note
At this point user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:
- to connect to the second node : `$ ssh r25u26n711`
- to connect to the accelerator on the first node from the first node: `$ ssh r25u25n710-mic0` or `$ ssh mic0`
- to connect to the accelerator on the second node from the first node: `$ ssh r25u26n711-mic0`
At this point we expect that the correct modules are loaded and the binary is compiled. For parallel execution mpiexec.hydra is used. Again, the first step is to tell mpiexec that the MPI program can be executed on MIC accelerators by setting the environment variable "I_MPI_MIC"; do not forget to have the correct FABRIC and PROVIDER defined.
```console
$ export I_MPI_MIC=1
$ export I_MPI_FABRICS=shm:dapl
$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0,ofa-v2-mcm-1
```
To launch the MPI program use:
```console
$ mpiexec.hydra -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH \
-host r25u25n710-mic0 -n 4 ~/mpi-test-mic \
: -host r25u26n711-mic0 -n 6 ~/mpi-test-mic
```
or using mpirun:
```console
$ mpirun -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH \
-host r25u25n710-mic0 -n 4 ~/mpi-test-mic \
: -host r25u26n711-mic0 -n 6 ~/mpi-test-mic
```
In this case four MPI processes are executed on accelerator r25u25n710-mic0 and six processes are executed on accelerator r25u26n711-mic0. The sample output (sorted after execution) is:
```console
Hello world from process 0 of 10 on host r25u25n710-mic0
Hello world from process 1 of 10 on host r25u25n710-mic0
Hello world from process 2 of 10 on host r25u25n710-mic0
Hello world from process 3 of 10 on host r25u25n710-mic0
Hello world from process 4 of 10 on host r25u26n711-mic0
Hello world from process 5 of 10 on host r25u26n711-mic0
Hello world from process 6 of 10 on host r25u26n711-mic0
Hello world from process 7 of 10 on host r25u26n711-mic0
Hello world from process 8 of 10 on host r25u26n711-mic0
Hello world from process 9 of 10 on host r25u26n711-mic0
```
The same way MPI program can be executed on multiple hosts:
```console
$ mpirun -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH \
-host r25u25n710 -n 4 ~/mpi-test \
: -host r25u26n711 -n 6 ~/mpi-test
```
### Symmetric Model
In the symmetric mode MPI programs are executed on both the host computer(s) and the MIC accelerator(s). Since the MIC has a different architecture and requires a different binary file produced by the Intel compiler, two different files have to be compiled before the MPI program is executed.
In the previous section we have compiled two binary files, one for hosts "**mpi-test**" and one for MIC accelerators "**mpi-test-mic**". These two binaries can be executed at once using mpiexec.hydra:
```console
$ mpirun \
-genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH \
-host r38u32n1001 -n 2 ~/mpi-test \
: -host r38u32n1001-mic0 -n 2 ~/mpi-test-mic
```
In this example the `-genv` parameter sets up the required environment variable for execution. The line with `-host r38u32n1001` specifies the binary that is executed on the host (here r38u32n1001) and the last line specifies the binary that is executed on the accelerator (here r38u32n1001-mic0).
The output of the program is:
```console
Hello world from process 0 of 4 on host r38u32n1001
Hello world from process 1 of 4 on host r38u32n1001
Hello world from process 2 of 4 on host r38u32n1001-mic0
Hello world from process 3 of 4 on host r38u32n1001-mic0
```
The execution procedure can be simplified by using the mpirun command with a machine file as a parameter. The machine file contains a list of all nodes and accelerators that should be used to execute MPI processes.
An example of a machine file that uses 2 hosts (**r38u32n1001** and **r38u33n1002**) and 2 accelerators (**r38u32n1001-mic0** and **r38u33n1002-mic0**) to run 2 MPI processes on each of them:
```console
$ cat hosts_file_mix
r38u32n1001:2
r38u32n1001-mic0:2
r38u33n1002:2
r38u33n1002-mic0:2
```
In addition, if the naming convention is such that the name of the binary for the host is **"bin_name"** and the name of the binary for the accelerator is **"bin_name-mic"**, then by setting the environment variable **I_MPI_MIC_POSTFIX** to **"-mic"** the user does not have to specify the names of both binaries. In this case mpirun needs just the name of the host binary file (i.e. "mpi-test") and uses the suffix to get the name of the binary for the accelerator (i.e. "mpi-test-mic").
```console
$ export I_MPI_MIC_POSTFIX=-mic
```
To run the MPI code using mpirun and the machine file "hosts_file_mix" use:
```console
$ mpirun \
-genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH \
-machinefile hosts_file_mix \
~/mpi-test
```
A possible output of the MPI "hello-world" example executed on two hosts and two accelerators is:
```console
Hello world from process 0 of 8 on host r38u31n1000
Hello world from process 1 of 8 on host r38u31n1000
Hello world from process 2 of 8 on host r38u31n1000-mic0
Hello world from process 3 of 8 on host r38u31n1000-mic0
Hello world from process 4 of 8 on host r38u32n1001
Hello world from process 5 of 8 on host r38u32n1001
Hello world from process 6 of 8 on host r38u32n1001-mic0
Hello world from process 7 of 8 on host r38u32n1001-mic0
```
!!! note
At this point the MPI communication between MIC accelerators on different nodes uses 1Gb Ethernet only.
### Using Automatically Generated Node-Files
A set of node-files, which can be used instead of manually creating a new one every time, is generated for the user's convenience. Six node-files are generated:
!!! note
**Node-files:**
- /lscratch/${PBS_JOBID}/nodefile-cn Hosts only node-file
- /lscratch/${PBS_JOBID}/nodefile-mic MICs only node-file
- /lscratch/${PBS_JOBID}/nodefile-mix Hosts and MICs node-file
- /lscratch/${PBS_JOBID}/nodefile-cn-sn Hosts only node-file, using short names
- /lscratch/${PBS_JOBID}/nodefile-mic-sn MICs only node-file, using short names
- /lscratch/${PBS_JOBID}/nodefile-mix-sn Hosts and MICs node-file, using short names
Each host or accelerator is listed only once per file. The user has to specify how many processes should be executed per node using the `-n` parameter of the mpirun command.
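A minimal sketch combining the generated mixed node-file with the library path and postfix convention used above (the `-n` value here is illustrative):

```console
$ export I_MPI_MIC=1
$ export I_MPI_MIC_POSTFIX=-mic
$ mpirun \
  -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH \
  -machinefile /lscratch/${PBS_JOBID}/nodefile-mix \
  -n 2 ~/mpi-test
```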
## Optimization
For more details about optimization techniques read Intel document [Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors](http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization "http&#x3A;//software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization")
# ISV Licenses
## Guide to Managing Independent Software Vendor Licenses
On the Anselm cluster, commercial software applications, also known as ISV (Independent Software Vendor) software, are installed; these are subject to licensing. The licenses are limited and their usage may be restricted to certain users or user groups.
Currently Flex License Manager based licensing is supported on the cluster for products ANSYS, Comsol and MATLAB. More information about the applications can be found in the general software section.
If an ISV application was purchased for educational (research) purposes and also for commercial purposes, then there are always two separate versions maintained and suffix "edu" is used in the name of the non-commercial version.
## Overview of the Licenses Usage
!!! note
The overview is generated every minute and is accessible from web or command line interface.
### Web Interface
For each license there is a table, which provides the information about the name, number of available (purchased/licensed), number of used and number of free license features [https://extranet.it4i.cz/anselm/licenses](https://extranet.it4i.cz/anselm/licenses)
### Text Interface
For each license there is a unique text file, which provides the information about the name, number of available (purchased/licensed), number of used and number of free license features. The text files are accessible from the Anselm command prompt.
| Product | File with license state | Note |
| ---------- | ------------------------------------------------- | ------------------- |
| ansys | /apps/user/licenses/ansys_features_state.txt | Commercial |
| comsol | /apps/user/licenses/comsol_features_state.txt | Commercial |
| comsol-edu | /apps/user/licenses/comsol-edu_features_state.txt | Non-commercial only |
| matlab | /apps/user/licenses/matlab_features_state.txt | Commercial |
| matlab-edu | /apps/user/licenses/matlab-edu_features_state.txt | Non-commercial only |
The file has a header which serves as a legend. All the info in the legend starts with a hash (#) so it can be easily filtered when parsing the file via a script.
Example of the Commercial Matlab license state:
```console
$ cat /apps/user/licenses/matlab_features_state.txt
# matlab
# -------------------------------------------------
# FEATURE TOTAL USED AVAIL
# -------------------------------------------------
MATLAB 1 1 0
SIMULINK 1 0 1
Curve_Fitting_Toolbox 1 0 1
Signal_Blocks 1 0 1
GADS_Toolbox 1 0 1
Image_Toolbox 1 0 1
Compiler 1 0 1
Neural_Network_Toolbox 1 0 1
Optimization_Toolbox 1 0 1
Signal_Toolbox 1 0 1
Statistics_Toolbox 1 0 1
```
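Because the legend lines start with a hash, the state files are easy to process from a script. A minimal sketch that prints the number of available licenses of a given feature (here MATLAB, matching the listing above; the awk fields correspond to the FEATURE/TOTAL/USED/AVAIL columns):

```console
$ grep -v "#" /apps/user/licenses/matlab_features_state.txt | awk '$1 == "MATLAB" { print $4 }'
0
```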
## License Tracking in PBS Pro Scheduler and Users Usage
Each feature of each license is accounted and checked by the PBS Pro scheduler. If you ask for certain licenses, the scheduler will not start the job until the requested licenses are free (available). This prevents batch jobs from crashing just because the needed licenses are unavailable.
The general format of the name is `feature__APP__FEATURE`.
Names of applications (APP):
```bash
ansys
comsol
comsol-edu
matlab
matlab-edu
```
To get the FEATUREs of a license take a look into the corresponding state file ([see above](/software/isv_licenses/#Licence)), or use:
### Application and List of Provided Features
* **ansys** $ grep -v "#" /apps/user/licenses/ansys_features_state.txt | cut -f1 -d' '
* **comsol** $ grep -v "#" /apps/user/licenses/comsol_features_state.txt | cut -f1 -d' '
* **comsol-edu** $ grep -v "#" /apps/user/licenses/comsol-edu_features_state.txt | cut -f1 -d' '
* **matlab** $ grep -v "#" /apps/user/licenses/matlab_features_state.txt | cut -f1 -d' '
* **matlab-edu** $ grep -v "#" /apps/user/licenses/matlab-edu_features_state.txt | cut -f1 -d' '
Example of PBS Pro resource name, based on APP and FEATURE name:
| Application | Feature | PBS Pro resource name |
| ----------- | -------------------------- | ----------------------------------------------- |
| ansys       | acfd                       | feature__ansys__acfd                            |
| ansys       | aa_r                       | feature__ansys__aa_r                            |
| comsol      | COMSOL                     | feature__comsol__COMSOL                         |
| comsol      | HEATTRANSFER               | feature__comsol__HEATTRANSFER                   |
| comsol-edu  | COMSOLBATCH                | feature__comsol-edu__COMSOLBATCH                |
| comsol-edu  | STRUCTURALMECHANICS        | feature__comsol-edu__STRUCTURALMECHANICS        |
| matlab      | MATLAB                     | feature__matlab__MATLAB                         |
| matlab      | Image_Toolbox              | feature__matlab__Image_Toolbox                  |
| matlab-edu  | MATLAB_Distrib_Comp_Engine | feature__matlab-edu__MATLAB_Distrib_Comp_Engine |
| matlab-edu  | Image_Acquisition_Toolbox  | feature__matlab-edu__Image_Acquisition_Toolbox  |
!!! warning
Resource names in PBS Pro are case sensitive.
### Example of qsub Statement
Run an interactive PBS job with 1 Matlab EDU license, 1 Distributed Computing Toolbox and 32 Distributed Computing Engines (running on 32 cores):
```console
$ qsub -I -q qprod -A PROJECT_ID -l select=2:ncpus=16 -l feature__matlab-edu__MATLAB=1 -l feature__matlab-edu__Distrib_Computing_Toolbox=1 -l feature__matlab-edu__MATLAB_Distrib_Comp_Engine=32
```
The license is used and accounted only with the real usage of the product. So in this example, the general MATLAB license is used only after MATLAB is run by the user, not at the time when the shell of the interactive job is started. Likewise, the Distributed Computing licenses are used only when the user actually uses distributed parallel computation in MATLAB (e.g. issues pmode start, matlabpool, etc.).
# Conda (Anaconda)
Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux. Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer. It was created for Python programs, but it can package and distribute software for any language.
Conda as a package manager helps you find and install packages. If you need a package that requires a different version of Python, you do not need to switch to a different environment manager, because conda is also an environment manager. With just a few commands, you can set up a totally separate environment to run that different version of Python, while continuing to run your usual version of Python in your normal environment.
Conda treats Python the same as any other package, so it is easy to manage and update multiple installations.
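As a minimal sketch of this workflow (the environment name `myenv` and the package selection are illustrative; on the clusters, first load one of the Anaconda modules described below):
```console
$ conda create -n myenv python=3.5 numpy
$ source activate myenv
(myenv) ~]$
```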
Anaconda supports Python 2.7, 3.4, 3.5 and 3.6. The default is Python 2.7 or 3.6, depending on which installer you used:
* For the installers “Anaconda” and “Miniconda,” the default is 2.7.
* For the installers “Anaconda3” or “Miniconda3,” the default is 3.6.
## Conda on the IT4Innovations Clusters
On the clusters we have the Anaconda2 and Anaconda3 software installed. How to use these modules is shown below.
!!! note
Use the command `ml av conda` to get up-to-date versions of the modules.
```console
$ ml av conda
------------- /apps/modules/lang ---------------------------------
Anaconda2/4.4.0 Anaconda3/4.4.0
```
## Anaconda2
The default version is Python 2.7.
### First Use of the Anaconda2 Module
```console
$ ml Anaconda2/4.4.0
$ python --version
Python 2.7.13 :: Anaconda 4.4.0 (64-bit)
$ conda install numpy
Fetching package metadata .........
Solving package specifications: .
Package plan for installation in environment /apps/all/Anaconda2/4.4.0:
The following packages will be UPDATED:
anaconda: 4.4.0-np112py27_0 --> custom-py27_0
...
...
...
CondaIOError: Missing write permissions in: /apps/all/Anaconda2/4.4.0
#
# You don't appear to have the necessary permissions to install packages
# into the install area '/apps/all/Anaconda2/4.4.0'.
# However you can clone this environment into your home directory and
# then make changes to it.
# This may be done using the command:
#
# $ conda create -n my_root --clone="/apps/all/Anaconda2/4.4.0"
$
$ conda create -n anaconda2 --clone="/apps/all/Anaconda2/4.4.0"
Source: /apps/all/Anaconda2/4.4.0
Destination: /home/svi47/.conda/envs/anaconda2
The following packages cannot be cloned out of the root environment:
- conda-4.3.21-py27_0
- conda-env-2.6.0-0
Packages: 213
...
...
...
#
# To activate this environment, use:
# > source activate anaconda2
#
# To deactivate this environment, use:
# > source deactivate anaconda2
#
```
### Using the Anaconda2 Module
```console
$ ml Anaconda2/4.4.0
$ source activate anaconda2
(anaconda2) ~]$
```
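With the cloned environment activated, `conda install` now writes into your home directory (`~/.conda/envs/anaconda2`) instead of the read-only system installation, so installing packages should succeed. An illustrative sketch:
```console
(anaconda2) ~]$ conda install numpy
(anaconda2) ~]$ conda list numpy
```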
## Anaconda3
The default version is Python 3.6.
### First Use of the Anaconda3 Module
```console
$ ml Anaconda3/4.4.0
$ python --version
Python 3.6.1 :: Anaconda 4.4.0 (64-bit)
$ conda install numpy
Fetching package metadata .........
Solving package specifications: .
Package plan for installation in environment /apps/all/Anaconda3/4.4.0:
The following packages will be UPDATED:
anaconda: 4.4.0-np112py36_0 --> custom-py36_0
...
...
...
CondaIOError: Missing write permissions in: /apps/all/Anaconda3/4.4.0
#
# You don't appear to have the necessary permissions to install packages
# into the install area '/apps/all/Anaconda3/4.4.0'.
# However you can clone this environment into your home directory and
# then make changes to it.
# This may be done using the command:
#
# $ conda create -n my_root --clone="/apps/all/Anaconda3/4.4.0"
$
$ conda create -n anaconda3 --clone="/apps/all/Anaconda3/4.4.0"
Source: /apps/all/Anaconda3/4.4.0
Destination: /home/svi47/.conda/envs/anaconda3
The following packages cannot be cloned out of the root environment:
- conda-4.3.21-py36_0
- conda-env-2.6.0-0
Packages: 200
Files: 6
...
...
...
#
# To activate this environment, use:
# > source activate anaconda3
#
# To deactivate this environment, use:
# > source deactivate anaconda3
#
$ source activate anaconda3
(anaconda3) ~]$
```
### Using the Anaconda3 Module
```console
$ ml Anaconda3/4.4.0
$ source activate anaconda3
(anaconda3) ~]$
```
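Instead of cloning the whole root environment, you may also build a small, purpose-specific environment on top of the Anaconda3 module. A minimal sketch (the environment name `myproject` and the package selection are illustrative):
```console
$ ml Anaconda3/4.4.0
$ conda create -n myproject python=3.6 numpy pandas
$ source activate myproject
(myproject) ~]$
```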
# CSharp
C# is available on the cluster.
```console
$ ml av mono
-------------------- /apps/modules/lang ---------------
Mono/5.0.0.100
```
!!! note
Use the command `ml av mono` to get up-to-date versions of the modules.
Activate C# by loading the Mono module:
```console
$ ml Mono
```
## Examples
### Hello World
Copy this code to a new file `hello.cs`:
```csc
using System;
class HelloWorld {
static void Main() {
Console.WriteLine("Hello world!!!");
}
}
```
Compile the program into a *Windows executable*:
```console
$ mcs -out:hello.exe hello.cs
```
Now run the program:
```console
$ mono hello.exe
Hello world!!!
```
### Interactive Console
Type:
```console
$ csharp
Mono C# Shell, type "help;" for help
Enter statements below.
csharp>
```
Now you are in interactive mode. You can try the following example.
```csc
csharp> using System;
csharp> int a = 5;
csharp> double b = 1.5;
csharp> Console.WriteLine("{0}*{1} is equal to {2}", a,b,a*b);
5*1.5 is equal to 7.5
csharp> a == b
false
```
Show all files modified in the last 5 days:
```csc
csharp> using System.IO;
csharp> from f in Directory.GetFiles ("mydirectory")
> let fi = new FileInfo (f)
> where fi.LastWriteTime > DateTime.Now-TimeSpan.FromDays(5) select f;
{ "mydirectory/mynewfile.cs", "mydirectory/script.sh" }
```
## MPI.NET
MPI is available for Mono through the MPI.NET bindings.
```csc
using System;
using MPI;
class MPIHello
{
static void Main(string[] args)
{
using (new MPI.Environment(ref args))
{
Console.WriteLine("Greetings from node {0} of {1} running on {2}",
Communicator.world.Rank, Communicator.world.Size,
MPI.Environment.ProcessorName);
}
}
}
```
Save the program as `csc.cs`, then compile and run it on Anselm:
```console
$ qsub -I -A DD-13-5 -q qexp -l select=2:ncpus=16,walltime=00:30:00
$ ml mpi.net
$ mcs -out:csc.exe -reference:/apps/tools/mpi.net/1.0.0-mono-3.12.1/lib/MPI.dll csc.cs
$ mpirun -n 4 mono csc.exe
Greetings from node 2 of 4 running on cn204
Greetings from node 0 of 4 running on cn204
Greetings from node 3 of 4 running on cn199
Greetings from node 1 of 4 running on cn199
```
For more information, see the [Mono documentation page](http://www.mono-project.com/docs/).
# Java
Java is available on the cluster. Activate Java by loading the Java module:
```console
$ ml Java
```
Note that the Java module must also be loaded inside your jobs, so that Java is available on the compute nodes (see the job script sketch below).
Check the Java version and path:
```console
$ java -version
$ which java
```
With the module loaded, not only the Java runtime environment (JRE) but also the development kit (JDK) with the `javac` compiler is available:
```console
$ javac -version
$ which javac
```
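A minimal sketch of compiling a plain Java program and running it in a batch job follows; the source file `HelloWorld.java` (containing a standard `main` method), the script name `java_job.sh`, and the project ID are placeholders:
```console
$ ml Java
$ javac HelloWorld.java
$ cat java_job.sh
#!/bin/bash
cd $PBS_O_WORKDIR
ml Java
java HelloWorld
$ qsub -A PROJECT_ID -q qexp -l select=1:ncpus=16 ./java_job.sh
```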
Java applications may use MPI for inter-process communication in conjunction with OpenMPI. Read more [here](http://www.open-mpi.org/faq/?category=java). This functionality is currently not supported on the Anselm cluster. If you require the Java interface to MPI, contact [cluster support](https://support.it4i.cz/rt/).
## Java With OpenMPI
There is an increasing interest in using Java for HPC. MPI can also benefit from Java, because its widespread use makes it likely to find new uses beyond traditional HPC applications.
The Java bindings are integrated into OpenMPI starting from the v1.7 series. Beginning with the v2.0 series, the Java bindings include coverage of MPI-3.1.
### Example (Hello.java)
```java
import mpi.*;
class Hello {
static public void main(String[] args) throws MPIException {
MPI.Init(args);
int myrank = MPI.COMM_WORLD.getRank();
int size = MPI.COMM_WORLD.getSize() ;
System.out.println("Hello world from rank " + myrank + " of " + size);
MPI.Finalize();
}
}
```
```console
$ ml Java/1.8.0_144
$ ml OpenMPI
$ mpijavac Hello.java
$ mpirun java Hello
Hello world from rank 23 of 28
Hello world from rank 25 of 28
Hello world from rank 0 of 28
Hello world from rank 4 of 28
Hello world from rank 7 of 28
Hello world from rank 8 of 28
Hello world from rank 11 of 28
Hello world from rank 12 of 28
Hello world from rank 13 of 28
Hello world from rank 18 of 28
Hello world from rank 17 of 28
Hello world from rank 24 of 28
Hello world from rank 27 of 28
Hello world from rank 2 of 28
Hello world from rank 3 of 28
Hello world from rank 1 of 28
Hello world from rank 10 of 28
Hello world from rank 14 of 28
Hello world from rank 16 of 28
Hello world from rank 19 of 28
Hello world from rank 26 of 28
Hello world from rank 6 of 28
Hello world from rank 9 of 28
Hello world from rank 15 of 28
Hello world from rank 20 of 28
Hello world from rank 5 of 28
Hello world from rank 21 of 28
Hello world from rank 22 of 28
```
# Python
Python is a widely used high-level programming language for general-purpose programming, created by Guido van Rossum and first released in 1991. An interpreted language, Python has a design philosophy that emphasizes code readability (notably using whitespace indentation to delimit code blocks rather than curly brackets or keywords), and a syntax that allows programmers to express concepts in fewer lines of code than might be used in languages such as C++ or Java. The language provides constructs intended to enable writing clear programs on both a small and large scale.
Python features a dynamic type system and automatic memory management and supports multiple programming paradigms, including object-oriented, imperative, functional programming, and procedural styles. It has a large and comprehensive standard library.
* [Documentation for Python 3.X](http://docs.python.org/3/)
* [Documentation for Python 2.X](http://docs.python.org/2/)
## Python on the IT4Innovations Clusters
On the clusters we have the Python 2.X and Python 3.X software installed. How to use these modules is shown below.
!!! note
Use the command `ml av python/` to get up-to-date versions of the modules.
```console
$ ml av python/
-------------------------- /apps/modules/lang --------------------------
Python/2.7.8-intel-2015b Python/2.7.9-gompi-2015e Python/2.7.10-GCC-4.9.3-2.25-bare Python/2.7.11-intel-2016a Python/3.4.3-intel-2015b Python/3.5.2-intel-2017.00
Python/2.7.8-intel-2016.01 Python/2.7.9-ictce-7.3.5 Python/2.7.10-GNU-4.9.3-2.25-bare Python/2.7.11-intel-2017a Python/3.5.1-intel-2016.01 Python/3.5.2
Python/2.7.9-foss-2015b Python/2.7.9-intel-2015b Python/2.7.11-foss-2016a Python/2.7.11-intel-2017.00 Python/3.5.1-intel-2017.00 Python/3.6.1
Python/2.7.9-foss-2015g Python/2.7.9-intel-2016.01 Python/2.7.11-GCC-4.9.3-2.25-bare Python/2.7.13-base Python/3.5.1 Python/3.6.2-base (D)
Python/2.7.9-GNU-5.1.0-2.25 Python/2.7.9 Python/2.7.11-intel-2015b Python/2.7.13 Python/3.5.2-foss-2016a
-------------------------- /apps/modules/math ---------------------------
ScientificPython/2.9.4-intel-2015b-Python-2.7.9 ScientificPython/2.9.4-intel-2015b-Python-2.7.11 ScientificPython/2.9.4-intel-2016.01-Python-2.7.9 (D)
Where:
D: Default Module
If you need software that is not listed, request it at support@it4i.cz.
```
## Python 2.X
Python 2.7 is scheduled to be the last major version in the 2.x series before it moves into an extended maintenance period. This release contains many of the features that were first released in Python 3.1.
```console
$ ml av python/2
----------------------------------------------------------------------------------------------- /apps/modules/lang ------------------------------------------------------------------------------------------------
Python/2.7.8-intel-2015b Python/2.7.9-GNU-5.1.0-2.25 Python/2.7.9-intel-2016.01 Python/2.7.11-foss-2016a Python/2.7.11-intel-2017a
Python/2.7.8-intel-2016.01 Python/2.7.9-gompi-2015e Python/2.7.9 Python/2.7.11-GCC-4.9.3-2.25-bare Python/2.7.11-intel-2017.00
Python/2.7.9-foss-2015b Python/2.7.9-ictce-7.3.5 Python/2.7.10-GCC-4.9.3-2.25-bare Python/2.7.11-intel-2015b Python/2.7.13-base
Python/2.7.9-foss-2015g Python/2.7.9-intel-2015b Python/2.7.10-GNU-4.9.3-2.25-bare Python/2.7.11-intel-2016a Python/2.7.13
----------------------------------------------------------------------------------------------- /apps/modules/math ------------------------------------------------------------------------------------------------
ScientificPython/2.9.4-intel-2015b-Python-2.7.9 ScientificPython/2.9.4-intel-2015b-Python-2.7.11 ScientificPython/2.9.4-intel-2016.01-Python-2.7.9 (D)
Where:
D: Default Module
If you need software that is not listed, request it at support@it4i.cz.
```
### Using the Python/2.x Module
```console
$ python --version
Python 2.6.6
$ ml Python/2.7.13
$ python --version
Python 2.7.13
```
### Packages in Python/2.x
```console
$ pip list
appdirs (1.4.3)
asn1crypto (0.22.0)
backports-abc (0.5)
backports.shutil-get-terminal-size (1.0.0)
backports.ssl-match-hostname (3.5.0.1)
BeautifulSoup (3.2.1)
beautifulsoup4 (4.5.3)
...
```
### How to Install a New Package in Python/2.x?
```console
$ ml Python/2.7.13
$ python --version
$ pip install wheel --user
Collecting wheel
Downloading wheel-0.30.0-py2.py3-none-any.whl (49kB)
100% |████████████████████████████████| 51kB 835kB/s
Installing collected packages: wheel
Successfully installed wheel-0.30.0
```
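Packages installed with `--user` end up under `~/.local` and take precedence over the system-wide site-packages. A quick, illustrative check that the freshly installed package is picked up:
```console
$ python -c "import wheel; print(wheel.__version__)"
0.30.0
```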
### How to Update a Package in Python/2.x?
```console
$ ml Python/2.7.13
$ python --version
$ pip install scipy --upgrade --user
Collecting scipy
Downloading scipy-0.19.1-cp27-cp27mu-manylinux1_x86_64.whl (45.0MB)
100% |████████████████████████████████| 45.0MB 5.8kB/s
Requirement already up-to-date: numpy>=1.8.2 in /apps/all/Python/2.7.13/lib/python2.7/site-packages (from scipy)
Installing collected packages: scipy
Successfully installed scipy-0.19.1
```
## Python 3.X
Python 3.0 (a.k.a. "Python 3000" or "Py3k") is a new version of the language that is incompatible with the 2.x line of releases. The language is mostly the same, but many details, especially how built-in objects like dictionaries and strings work, have changed considerably, and a lot of deprecated features have finally been removed. Also, the standard library has been reorganized in a few prominent places.
```console
$ ml av python/3
---------------------- /apps/modules/lang ----------------------
Python/3.4.3-intel-2015b Python/3.5.1-intel-2017.00 Python/3.5.2-foss-2016a Python/3.5.2 Python/3.6.2-base (D)
Python/3.5.1-intel-2016.01 Python/3.5.1 Python/3.5.2-intel-2017.00 Python/3.6.1
Where:
D: Default Module
If you need software that is not listed, request it at support@it4i.cz.
```
### Using the Python/3.x Module
```console
$ python --version
Python 2.6.6
$ ml Python/3.6.2-base
$ python --version
Python 3.6.2
```
### Packages in Python/3.x
```console
$ pip3 list
nose (1.3.7)
pip (8.0.2)
setuptools (20.1.1)
```
### How to Install a New Package in Python/3.x
```console
$ ml Python/3.6.2-base
$ python --version
Python 3.6.2
$ pip3 install pandas --user
Collecting pandas
Downloading pandas-0.20.3.tar.gz (10.4MB)
100% |████████████████████████████████| 10.4MB 42kB/s
Collecting python-dateutil>=2 (from pandas)
Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
100% |████████████████████████████████| 196kB 1.3MB/s
Collecting pytz>=2011k (from pandas)
Downloading pytz-2017.2-py2.py3-none-any.whl (484kB)
100% |████████████████████████████████| 487kB 757kB/s
Collecting numpy>=1.7.0 (from pandas)
Using cached numpy-1.13.1.zip
Collecting six>=1.5 (from python-dateutil>=2->pandas)
Downloading six-1.11.0-py2.py3-none-any.whl
Building wheels for collected packages: pandas, numpy
Running setup.py bdist_wheel for pandas ... done
Stored in directory: /home/kru0052/.cache/pip/wheels/dd/17/6c/a1c7e8d855f3a700b21256329fd396d105b533c5ed3e20c5e9
Running setup.py bdist_wheel for numpy ... done
Stored in directory: /home/kru0052/.cache/pip/wheels/94/44/90/4ce81547e3e5f4398b1601d0051e828b8160f8d3f3dd5a0c8c
Successfully built pandas numpy
Installing collected packages: six, python-dateutil, pytz, numpy, pandas
Successfully installed numpy-1.13.1 pandas-0.20.3 python-dateutil-2.6.1 pytz-2017.2 six-1.11.0
```
### How to Update a Package in Python/3.x?
```console
$ pip3 install scipy --upgrade --user
Collecting scipy
Downloading scipy-0.19.1-cp36-cp36m-manylinux1_x86_64.whl (45.0MB)
100% |████████████████████████████████| 45.0MB 5.8kB/s
Requirement already up-to-date: numpy>=1.8.2 in /apps/all/Python/3.6.2/lib/python3.6/site-packages (from scipy)
Installing collected packages: scipy
Successfully installed scipy-0.19.1
```