Valgrind
========
Valgrind is a tool for memory debugging and profiling.
About Valgrind
--------------
Valgrind is an open-source tool used mainly for debugging memory-related problems, such as memory leaks and use of uninitialized memory, in C/C++ applications. Over time, however, the toolchain has been extended with further functionality, such as debugging of threaded applications and cache profiling, and it is no longer limited to C/C++.
Valgrind is an extremely useful tool for debugging memory errors such as [off-by-one](http://en.wikipedia.org/wiki/Off-by-one_error). Valgrind uses a virtual machine and dynamic recompilation of binary code; because of that, you can expect programs run under Valgrind to be 5-100 times slower.
The main tools available in Valgrind are:
- **Memcheck**, the original, most used, and default tool. It verifies memory accesses in your program and can detect use of uninitialized memory, out-of-bounds memory access, memory leaks, double free, etc.
- **Massif**, a heap profiler.
- **Helgrind** and **DRD** can detect race conditions in multi-threaded applications.
- **Cachegrind**, a cache profiler.
- **Callgrind**, a callgraph analyzer.
- For a full list and detailed documentation, please refer to the [official Valgrind documentation](http://valgrind.org/docs/).
Installed versions
------------------
There are two versions of Valgrind available on Anselm.
- Version 3.6.0, installed by the operating system vendor in /usr/bin/valgrind. This version is available by default, without the need to load any module. However, it does not provide additional MPI support.
- Version 3.9.0 with support for Intel MPI, available in the [module](../../environment-and-modules/) valgrind/3.9.0-impi. After loading the module, this version replaces the default valgrind.
Usage
-----
Compile the application which you want to debug as usual. It is advisable to add compilation flags -g (to add debugging information to the binary so that you will see original source code lines in the output) and -O0 (to disable compiler optimizations).
For example, let's look at this C code, which has two problems:
```cpp
#include <stdlib.h>

void f(void)
{
    int* x = malloc(10 * sizeof(int));
    x[10] = 0; // problem 1: heap block overrun
} // problem 2: memory leak -- x not freed

int main(void)
{
    f();
    return 0;
}
```
Now, compile it with the Intel compiler:
```bash
$ module add intel
$ icc -g valgrind-example.c -o valgrind-example
```
Now, let's run it with Valgrind. The syntax is:
`valgrind [valgrind options] <your program binary> [your program options]`
If no Valgrind options are specified, Valgrind defaults to running the Memcheck tool. Please refer to the Valgrind documentation for a full description of command line options.
```bash
$ valgrind ./valgrind-example
==12652== Memcheck, a memory error detector
==12652== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==12652== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==12652== Command: ./valgrind-example
==12652==
==12652== Invalid write of size 4
==12652== at 0x40053E: f (valgrind-example.c:6)
==12652== by 0x40054E: main (valgrind-example.c:11)
==12652== Address 0x5861068 is 0 bytes after a block of size 40 alloc'd
==12652== at 0x4C27AAA: malloc (vg_replace_malloc.c:291)
==12652== by 0x400528: f (valgrind-example.c:5)
==12652== by 0x40054E: main (valgrind-example.c:11)
==12652==
==12652==
==12652== HEAP SUMMARY:
==12652== in use at exit: 40 bytes in 1 blocks
==12652== total heap usage: 1 allocs, 0 frees, 40 bytes allocated
==12652==
==12652== LEAK SUMMARY:
==12652== definitely lost: 40 bytes in 1 blocks
==12652== indirectly lost: 0 bytes in 0 blocks
==12652== possibly lost: 0 bytes in 0 blocks
==12652== still reachable: 0 bytes in 0 blocks
==12652== suppressed: 0 bytes in 0 blocks
==12652== Rerun with --leak-check=full to see details of leaked memory
==12652==
==12652== For counts of detected and suppressed errors, rerun with: -v
==12652== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 6 from 6)
```
In the output, we can see that Valgrind has detected both errors: the off-by-one memory access, reported at valgrind-example.c:6, and a memory leak of 40 bytes. If we want a detailed analysis of the memory leak, we need to run Valgrind with the --leak-check=full option:
```bash
$ valgrind --leak-check=full ./valgrind-example
==23856== Memcheck, a memory error detector
==23856== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==23856== Using Valgrind-3.6.0 and LibVEX; rerun with -h for copyright info
==23856== Command: ./valgrind-example
==23856==
==23856== Invalid write of size 4
==23856== at 0x40067E: f (valgrind-example.c:6)
==23856== by 0x40068E: main (valgrind-example.c:11)
==23856== Address 0x66e7068 is 0 bytes after a block of size 40 alloc'd
==23856== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
==23856== by 0x400668: f (valgrind-example.c:5)
==23856== by 0x40068E: main (valgrind-example.c:11)
==23856==
==23856==
==23856== HEAP SUMMARY:
==23856== in use at exit: 40 bytes in 1 blocks
==23856== total heap usage: 1 allocs, 0 frees, 40 bytes allocated
==23856==
==23856== 40 bytes in 1 blocks are definitely lost in loss record 1 of 1
==23856== at 0x4C26FDE: malloc (vg_replace_malloc.c:236)
==23856== by 0x400668: f (valgrind-example.c:5)
==23856== by 0x40068E: main (valgrind-example.c:11)
==23856==
==23856== LEAK SUMMARY:
==23856== definitely lost: 40 bytes in 1 blocks
==23856== indirectly lost: 0 bytes in 0 blocks
==23856== possibly lost: 0 bytes in 0 blocks
==23856== still reachable: 0 bytes in 0 blocks
==23856== suppressed: 0 bytes in 0 blocks
==23856==
==23856== For counts of detected and suppressed errors, rerun with: -v
==23856== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 6 from 6)
```
Now we can see that the memory leak is due to the malloc() call reported at valgrind-example.c:5.
Usage with MPI
---------------------------
Although Valgrind is not primarily a parallel debugger, it can be used to debug parallel applications as well. When launching your parallel application, prepend the valgrind command. For example:
```bash
$ mpirun -np 4 valgrind myapplication
```
However, the default version without MPI support will report a large number of false errors in the MPI library, such as:
```bash
==30166== Conditional jump or move depends on uninitialised value(s)
==30166== at 0x4C287E8: strlen (mc_replace_strmem.c:282)
==30166== by 0x55443BD: I_MPI_Processor_model_number (init_interface.c:427)
==30166== by 0x55439E0: I_MPI_Processor_arch_code (init_interface.c:171)
==30166== by 0x558D5AE: MPID_nem_impi_init_shm_configuration (mpid_nem_impi_extensions.c:1091)
==30166== by 0x5598F4C: MPID_nem_init_ckpt (mpid_nem_init.c:566)
==30166== by 0x5598B65: MPID_nem_init (mpid_nem_init.c:489)
==30166== by 0x539BD75: MPIDI_CH3_Init (ch3_init.c:64)
==30166== by 0x5578743: MPID_Init (mpid_init.c:193)
==30166== by 0x554650A: MPIR_Init_thread (initthread.c:539)
==30166== by 0x553369F: PMPI_Init (init.c:195)
==30166== by 0x4008BD: main (valgrind-example-mpi.c:18)
```
It is therefore better to use the MPI-enabled Valgrind from the module. The MPI version requires the library /apps/tools/valgrind/3.9.0/impi/lib/valgrind/libmpiwrap-amd64-linux.so, which must be included in the LD_PRELOAD environment variable.
Let's look at this MPI example:
```cpp
#include <stdlib.h>
#include <mpi.h>
int main(int argc, char *argv[])
{
    int *data = malloc(sizeof(int)*99);
    MPI_Init(&argc, &argv);
    MPI_Bcast(data, 100, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```
There are two errors: use of uninitialized memory and an invalid buffer length (the buffer holds 99 integers, but 100 are broadcast). Let's debug it with Valgrind:
```bash
$ module add intel impi
$ mpicc -g valgrind-example-mpi.c -o valgrind-example-mpi
$ module add valgrind/3.9.0-impi
$ mpirun -np 2 -env LD_PRELOAD /apps/tools/valgrind/3.9.0/impi/lib/valgrind/libmpiwrap-amd64-linux.so valgrind ./valgrind-example-mpi
```
This prints the following output (note that output is printed for every launched MPI process):
```bash
==31318== Memcheck, a memory error detector
==31318== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==31318== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==31318== Command: ./valgrind-example-mpi
==31318==
==31319== Memcheck, a memory error detector
==31319== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==31319== Using Valgrind-3.9.0 and LibVEX; rerun with -h for copyright info
==31319== Command: ./valgrind-example-mpi
==31319==
valgrind MPI wrappers 31319: Active for pid 31319
valgrind MPI wrappers 31319: Try MPIWRAP_DEBUG=help for possible options
valgrind MPI wrappers 31318: Active for pid 31318
valgrind MPI wrappers 31318: Try MPIWRAP_DEBUG=help for possible options
==31319== Unaddressable byte(s) found during client check request
==31319== at 0x4E35974: check_mem_is_addressable_untyped (libmpiwrap.c:960)
==31319== by 0x4E5D0FE: PMPI_Bcast (libmpiwrap.c:908)
==31319== by 0x400911: main (valgrind-example-mpi.c:20)
==31319== Address 0x69291cc is 0 bytes after a block of size 396 alloc'd
==31319== at 0x4C27AAA: malloc (vg_replace_malloc.c:291)
==31319== by 0x4007BC: main (valgrind-example-mpi.c:8)
==31319==
==31318== Uninitialised byte(s) found during client check request
==31318== at 0x4E3591D: check_mem_is_defined_untyped (libmpiwrap.c:952)
==31318== by 0x4E5D06D: PMPI_Bcast (libmpiwrap.c:908)
==31318== by 0x400911: main (valgrind-example-mpi.c:20)
==31318== Address 0x6929040 is 0 bytes inside a block of size 396 alloc'd
==31318== at 0x4C27AAA: malloc (vg_replace_malloc.c:291)
==31318== by 0x4007BC: main (valgrind-example-mpi.c:8)
==31318==
==31318== Unaddressable byte(s) found during client check request
==31318== at 0x4E3591D: check_mem_is_defined_untyped (libmpiwrap.c:952)
==31318== by 0x4E5D06D: PMPI_Bcast (libmpiwrap.c:908)
==31318== by 0x400911: main (valgrind-example-mpi.c:20)
==31318== Address 0x69291cc is 0 bytes after a block of size 396 alloc'd
==31318== at 0x4C27AAA: malloc (vg_replace_malloc.c:291)
==31318== by 0x4007BC: main (valgrind-example-mpi.c:8)
==31318==
==31318==
==31318== HEAP SUMMARY:
==31318== in use at exit: 3,172 bytes in 67 blocks
==31318== total heap usage: 191 allocs, 124 frees, 81,203 bytes allocated
==31318==
==31319==
==31319== HEAP SUMMARY:
==31319== in use at exit: 3,172 bytes in 67 blocks
==31319== total heap usage: 175 allocs, 108 frees, 48,435 bytes allocated
==31319==
==31318== LEAK SUMMARY:
==31318== definitely lost: 408 bytes in 3 blocks
==31318== indirectly lost: 256 bytes in 1 blocks
==31318== possibly lost: 0 bytes in 0 blocks
==31318== still reachable: 2,508 bytes in 63 blocks
==31318== suppressed: 0 bytes in 0 blocks
==31318== Rerun with --leak-check=full to see details of leaked memory
==31318==
==31318== For counts of detected and suppressed errors, rerun with: -v
==31318== Use --track-origins=yes to see where uninitialised values come from
==31318== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 4 from 4)
==31319== LEAK SUMMARY:
==31319== definitely lost: 408 bytes in 3 blocks
==31319== indirectly lost: 256 bytes in 1 blocks
==31319== possibly lost: 0 bytes in 0 blocks
==31319== still reachable: 2,508 bytes in 63 blocks
==31319== suppressed: 0 bytes in 0 blocks
==31319== Rerun with --leak-check=full to see details of leaked memory
==31319==
==31319== For counts of detected and suppressed errors, rerun with: -v
==31319== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 4 from 4)
```
We can see that Valgrind has reported use of uninitialised memory on the master process (which reads the array to be broadcast) and use of unaddressable memory on both processes.
Vampir
======
Vampir is a commercial trace analysis and visualisation tool. It can work with traces in OTF and OTF2 formats. It cannot collect traces itself; you first need to use a trace collection tool (such as [Score-P](../../../salomon/software/debuggers/score-p/)) to collect them.
![](../../../img/Snmekobrazovky20160708v12.33.35.png)
Installed versions
------------------
Version 8.5.0 is currently installed as module Vampir/8.5.0:
```bash
$ module load Vampir/8.5.0
$ vampir &
```
User manual
-----------
You can find the detailed user manual in PDF format in $EBROOTVAMPIR/doc/vampir-manual.pdf.
References
----------
[1]. <https://www.vampir.eu>
Intel Compilers
===============
Version 13.1.1 of the Intel compilers is available via the module intel. The compilers include the icc C and C++ compiler and the ifort Fortran 77/90/95 compiler.
```bash
$ module load intel
$ icc -v
$ ifort -v
```
The Intel compilers provide vectorization of the code via the AVX instructions and support threading parallelization via OpenMP.
For maximum performance on the Anselm cluster, compile your programs using the AVX instructions, with reporting of where the vectorization was used. We recommend the following compilation options for high performance:
```bash
$ icc -ipo -O3 -vec -xAVX -vec-report1 myprog.c mysubroutines.c -o myprog.x
$ ifort -ipo -O3 -vec -xAVX -vec-report1 myprog.f mysubroutines.f -o myprog.x
```
In this example, we compile the program enabling interprocedural optimizations between source files (-ipo), aggressive loop optimizations (-O3), and vectorization (-vec -xAVX).
The compiler recognizes the omp, simd, vector, and ivdep pragmas for OpenMP parallelization and AVX vectorization. Enable OpenMP parallelization with the **-openmp** compiler switch.
```bash
$ icc -ipo -O3 -vec -xAVX -vec-report1 -openmp myprog.c mysubroutines.c -o myprog.x
$ ifort -ipo -O3 -vec -xAVX -vec-report1 -openmp myprog.f mysubroutines.f -o myprog.x
```
Read more at <http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/compiler/cpp-lin/index.htm>
Sandy Bridge/Haswell binary compatibility
-----------------------------------------
Anselm nodes are currently equipped with Sandy Bridge CPUs, while Salomon will use Haswell architecture. The new processors are backward compatible with the Sandy Bridge nodes, so all programs that ran on the Sandy Bridge processors should also run on the new Haswell nodes. To get optimal performance out of the Haswell processors, a program should make use of the special AVX2 instructions for this processor. One can do this by recompiling codes with the compiler flags designated to invoke these instructions. For the Intel compiler suite, there are two ways of doing this:
- Using compiler flag (both for Fortran and C): -xCORE-AVX2. This will create a binary with AVX2 instructions, specifically for the Haswell processors. Note that the executable will not run on Sandy Bridge nodes.
- Using compiler flags (both for Fortran and C): -xAVX -axCORE-AVX2. This will generate multiple, feature specific auto-dispatch code paths for Intel® processors, if there is a performance benefit. So this binary will run both on Sandy Bridge and Haswell processors. During runtime it will be decided which path to follow, dependent on which processor you are running on. In general this will result in larger binaries.
Intel Debugger
==============
Debugging serial applications
-----------------------------
The Intel debugger version 13.0 is available via the module intel. The debugger works for applications compiled with the C and C++ compiler and the ifort Fortran 77/90/95 compiler. The debugger provides a Java GUI environment. Use an X display for running the GUI.
```bash
$ module load intel
$ idb
```
The debugger may run in text mode. To debug in text mode, use
```bash
$ idbc
```
To debug on the compute nodes, the module intel must be loaded. The GUI on compute nodes may be accessed in the same way as described in the GUI section.
Example:
```bash
$ qsub -q qexp -l select=1:ncpus=16 -X -I
qsub: waiting for job 19654.srv11 to start
qsub: job 19654.srv11 ready
$ module load intel
$ module load java
$ icc -O0 -g myprog.c -o myprog.x
$ idb ./myprog.x
```
In this example, we allocate 1 full compute node, compile program myprog.c with debugging options -O0 -g and run the idb debugger interactively on the myprog.x executable. The GUI access is via X11 port forwarding provided by the PBS workload manager.
Debugging parallel applications
-------------------------------
The Intel debugger is capable of debugging multithreaded and MPI parallel programs as well.
### Small number of MPI ranks
For debugging a small number of MPI ranks, you may execute and debug each rank in a separate xterm terminal (do not forget the X display). Using Intel MPI, this may be done in the following way:
```bash
$ qsub -q qexp -l select=2:ncpus=16 -X -I
qsub: waiting for job 19654.srv11 to start
qsub: job 19655.srv11 ready
$ module load intel impi
$ mpirun -ppn 1 -hostfile $PBS_NODEFILE --enable-x xterm -e idbc ./mympiprog.x
```
In this example, we allocate 2 full compute nodes, run xterm on each node, and start the idb debugger in command line mode, debugging two ranks of the mympiprog.x application. An xterm will pop up for each rank, with the idb prompt ready. The example is not limited to the use of Intel MPI.
### Large number of MPI ranks
Run the idb debugger from within the MPI debug option. This will cause the debugger to bind to all ranks and provide aggregated outputs across the ranks, pausing execution automatically just after startup. You may then set break points and step the execution manually. Using Intel MPI:
```bash
$ qsub -q qexp -l select=2:ncpus=16 -X -I
qsub: waiting for job 19654.srv11 to start
qsub: job 19655.srv11 ready
$ module load intel impi
$ mpirun -n 32 -idb ./mympiprog.x
```
### Debugging multithreaded application
Run the idb debugger in GUI mode. The Parallel menu contains a number of tools for debugging multiple threads. One of the most useful tools is the **Serialize Execution** tool, which serializes execution of concurrent threads for easy orientation and identification of concurrency-related bugs.
Further information
-------------------
An exhaustive manual on idb features and usage is published on the [Intel website](http://software.intel.com/sites/products/documentation/doclib/stdxe/2013/composerxe/debugger/user_guide/index.htm).
Intel TBB
=========
Intel Threading Building Blocks
-------------------------------
Intel Threading Building Blocks (Intel TBB) is a library that supports scalable parallel programming using standard ISO C++ code. It does not require special languages or compilers. To use the library, you specify tasks, not threads, and let the library map tasks onto threads in an efficient manner. The tasks are executed by a runtime scheduler and may be offloaded to [MIC accelerator](../intel-xeon-phi/).
Intel TBB version 4.1 is available on Anselm.
```bash
$ module load tbb
```
The module sets up the environment variables required for linking and running TBB-enabled applications.
!!! Note "Note"
    Link the tbb library using -ltbb.
Examples
--------
A number of examples demonstrating the use of TBB and its built-in scheduler are available on Anselm in the $TBB_EXAMPLES directory.
```bash
$ module load intel
$ module load tbb
$ cp -a $TBB_EXAMPLES/common $TBB_EXAMPLES/parallel_reduce /tmp/
$ cd /tmp/parallel_reduce/primes
$ icc -O2 -DNDEBUG -o primes.x main.cpp primes.cpp -ltbb
$ ./primes.x
```
In this example, we compile, link, and run the primes example, which demonstrates the use of a parallel task-based reduce in the computation of prime numbers.
You will need the tbb module loaded to run the TBB-enabled executable. This may be avoided by compiling the library search paths into the executable.
```bash
$ icc -O2 -o primes.x main.cpp primes.cpp -Wl,-rpath=$LIBRARY_PATH -ltbb
```
Further reading
---------------
Read more on the Intel website: <http://software.intel.com/sites/products/documentation/doclib/tbb_sa/help/index.htm>