Commit e30f7196 authored by Pavel Gajdušek's avatar Pavel Gajdušek
Browse files

structure changed, to do: fix link

parent e6f2bcad
# Debuggers and profilers summary
## Introduction
We provide state of the art programms and tools to develop, profile and debug HPC codes at IT4Innovations. On these pages, we provide an overview of the profiling and debugging tools available on Anslem at IT4I.
## Intel Debugger
Intel debugger is no longer available since Parallel Studio version 2015
The intel debugger version 13.0 is available, via module intel. The debugger works for applications compiled with C and C++ compiler and the ifort fortran 77/90/95 compiler. The debugger provides java GUI environment.
$ ml intel
$ idb
Read more at the [Intel Debugger](../../anselm/software/intel-suite/intel-debugger/) page.
## Allinea Forge (DDT/MAP)
Allinea DDT, is a commercial debugger primarily for debugging parallel MPI or OpenMP programs. It also has a support for GPU (CUDA) and Intel Xeon Phi accelerators. DDT provides all the standard debugging features (stack trace, breakpoints, watches, view variables, threads etc.) for every thread running as part of your program, or for every process - even if these processes are distributed across a cluster using an MPI implementation.
$ ml Forge
$ forge
Read more at the [Allinea DDT](allinea-ddt/) page.
## Allinea Performance Reports
Allinea Performance Reports characterize the performance of HPC application runs. After executing your application through the tool, a synthetic HTML report is generated automatically, containing information about several metrics along with clear behavior statements and hints to help you improve the efficiency of your runs. Our license is limited to 64 MPI processes.
$ ml PerformanceReports/6.0
$ perf-report mpirun -n 64 ./my_application argument01 argument02
Read more at the [Allinea Performance Reports](allinea-performance-reports/) page.
## RougeWave Totalview
TotalView is a source- and machine-level debugger for multi-process, multi-threaded programs. Its wide range of tools provides ways to analyze, organize, and test programs, making it easy to isolate and identify problems in individual threads and processes in programs of great complexity.
$ ml TotalView/8.15.4-6-linux-x86-64
$ totalview
Read more at the [Totalview](total-view/) page.
## Vampir Trace Analyzer
Vampir is a GUI trace analyzer for traces in OTF format.
$ ml Vampir/8.5.0
$ vampir
Read more at the [Vampir](vampir/) page.
# Aislinn
* Aislinn is a dynamic verifier for MPI programs. For a fixed input it covers all possible runs with respect to nondeterminism introduced by MPI. It allows to detect bugs (for sure) that occurs very rare in normal runs.
* Aislinn detects problems like invalid memory accesses, deadlocks, misuse of MPI, and resource leaks.
* Aislinn is open-source software; you can use it without any licensing limitations.
* Web page of the project: <>
!!! note
Aislinn is software developed at IT4Innovations and some parts are still considered experimental. If you have any questions or experienced any problems, please contact the author: <>.
## Usage
Let us have the following program that contains a bug that is not manifested in all runs:
#include <mpi.h>
#include <stdlib.h>
int main(int argc, char **argv) {
int rank;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
int *mem1 = (int*) malloc(sizeof(int) * 2);
int *mem2 = (int*) malloc(sizeof(int) * 3);
int data;
MPI_Recv(&data, 1, MPI_INT, MPI_ANY_SOURCE, 1,
mem1[data] = 10; // <---------- Possible invalid memory write
MPI_Recv(&data, 1, MPI_INT, MPI_ANY_SOURCE, 1,
mem2[data] = 10;
if (rank == 1 || rank == 2) {
MPI_Send(&rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
return 0;
The program does the following: process 0 receives two messages from anyone and processes 1 and 2 send a message to process 0. If a message from process 1 is received first, then the run does not expose the error. If a message from process 2 is received first, then invalid memory write occurs at line 16.
To verify this program by Aislinn, we first load Aislinn itself:
$ ml aislinn
Now we compile the program by Aislinn implementation of MPI. There are `mpicc` for C programs and `mpicxx` for C++ programs. Only MPI parts of the verified application has to be recompiled; non-MPI parts may remain untouched. Let us assume that our program is in `test.cpp`.
$ mpicc -g test.cpp -o test
The `-g` flag is not necessary, but it puts more debugging information into the program, hence Aislinn may provide more detailed report. The command produces executable file `test`.
Now we run the Aislinn itself. The argument `-p 3` specifies that we want to verify our program for the case of three MPI processes
$ aislinn -p 3 ./test
==AN== INFO: Aislinn v0.3.0
==AN== INFO: Found error 'Invalid write'
==AN== INFO: 1 error(s) found
==AN== INFO: Report written into 'report.html'
Aislinn found an error and produced HTML report. To view it, we can use any browser, e.g.:
$ firefox report.html
At the beginning of the report there are some basic summaries of the verification. In the second part (depicted in the following picture), the error is described.
It shows us:
* Error occurs in process 0 in test.cpp on line 16.
* Stdout and stderr streams are empty. (The program does not write anything).
* The last part shows MPI calls for each process that occurs in the invalid run. The more detailed information about each call can be obtained by mouse cursor.
### Limitations
Since the verification is a non-trivial process there are some of limitations.
* The verified process has to terminate in all runs, i.e. we cannot answer the halting problem.
* The verification is a computationally and memory demanding process. We put an effort to make it efficient and it is an important point for further research. However covering all runs will be always more demanding than techniques that examines only a single run. The good practise is to start with small instances and when it is feasible, make them bigger. The Aislinn is good to find bugs that are hard to find because they occur very rarely (only in a rare scheduling). Such bugs often do not need big instances.
* Aislinn expects that your program is a "standard MPI" program, i.e. processes communicate only through MPI, the verified program does not interacts with the system in some unusual ways (e.g. opening sockets).
There are also some limitations bounded to the current version and they will be removed in the future:
* All files containing MPI calls have to be recompiled by MPI implementation provided by Aislinn. The files that does not contain MPI calls, they do not have to recompiled. Aislinn MPI implementation supports many commonly used calls from MPI-2 and MPI-3 related to point-to-point communication, collective communication, and communicator management. Unfortunately, MPI-IO and one-side communication is not implemented yet.
* Each MPI can use only one thread (if you use OpenMP, set OMP_NUM_THREADS to 1).
* There are some limitations for using files, but if the program just reads inputs and writes results, it is ok.
# Allinea Forge (DDT,MAP)
Allinea Forge consist of two tools - debugger DDT and profiler MAP.
Allinea DDT, is a commercial debugger primarily for debugging parallel MPI or OpenMP programs. It also has a support for GPU (CUDA) and Intel Xeon Phi accelerators. DDT provides all the standard debugging features (stack trace, breakpoints, watches, view variables, threads etc.) for every thread running as part of your program, or for every process - even if these processes are distributed across a cluster using an MPI implementation.
Allinea MAP is a profiler for C/C++/Fortran HPC codes. It is designed for profiling parallel code, which uses pthreads, OpenMP or MPI.
## License and Limitations for Anselm Users
On Anselm users can debug OpenMP or MPI code that runs up to 64 parallel processes. In case of debugging GPU or Xeon Phi accelerated codes the limit is 8 accelerators. These limitation means that:
* 1 user can debug up 64 processes, or
* 32 users can debug 2 processes, etc.
In case of debugging on accelerators:
* 1 user can debug on up to 8 accelerators, or
* 8 users can debug on single accelerator.
## Compiling Code to Run With DDT
### Modules
Load all necessary modules to compile the code. For example:
$ ml intel
$ ml impi **or** ml OpenMPI/X.X.X-icc
Load the Allinea DDT module:
$ ml Forge
Compile the code:
$ mpicc -g -O0 -o test_debug test.c
$ mpif90 -g -O0 -o test_debug test.f
### Compiler Flags
Before debugging, you need to compile your code with theses flags:
!!! note
\- **g** : Generates extra debugging information usable by GDB. -g3 includes even more debugging information. This option is available for GNU and INTEL C/C++ and Fortran compilers.
- - **O0** : Suppress all optimizations.
## Starting a Job With DDT
Be sure to log in with an X window forwarding enabled. This could mean using the -X in the ssh:
$ ssh -X
Other options is to access login node using VNC. Please see the detailed information on how to [use graphic user interface on Anselm](/general/accessing-the-clusters/graphical-user-interface/x-window-system/)
From the login node an interactive session **with X windows forwarding** (-X option) can be started by following command:
$ qsub -I -X -A NONE-0-0 -q qexp -lselect=1:ncpus=16:mpiprocs=16,walltime=01:00:00
Then launch the debugger with the ddt command followed by the name of the executable to debug:
$ ddt test_debug
A submission window that appears have a prefilled path to the executable to debug. You can select the number of MPI processors and/or OpenMP threads on which to run and press run. Command line arguments to a program can be entered to the "Arguments " box.
To start the debugging directly without the submission window, user can specify the debugging and execution parameters from the command line. For example the number of MPI processes is set by option "-np 4". Skipping the dialog is done by "-start" option. To see the list of the "ddt" command line parameters, run "ddt --help".
ddt -start -np 4 ./hello_debug_impi
## Documentation
Users can find original User Guide after loading the DDT module:
[1] Discipline, Magic, Inspiration and Science: Best Practice Debugging with Allinea DDT, Workshop conducted at LLNL by Allinea on May 10, 2013, [link](
# Allinea Performance Reports
## Introduction
Allinea Performance Reports characterize the performance of HPC application runs. After executing your application through the tool, a synthetic HTML report is generated automatically, containing information about several metrics along with clear behavior statements and hints to help you improve the efficiency of your runs.
The Allinea Performance Reports is most useful in profiling MPI rograms.
Our license is limited to 64 MPI processes.
## Modules
Allinea Performance Reports version 6.0 is available
$ ml PerformanceReports/6.0
The module sets up environment variables, required for using the Allinea Performance Reports.
## Usage
Use the the perf-report wrapper on your (MPI) program.
Instead of [running your MPI program the usual way](../mpi/mpi/), use the the perf report wrapper:
$ perf-report mpirun ./mympiprog.x
The mpi program will run as usual. The perf-report creates two additional files, in \*.txt and \*.html format, containing the performance report. Note that demanding MPI codes should be run within [the queue system](../../anselm/job-submission-and-execution/).
## Example
In this example, we will be profiling the mympiprog.x MPI program, using Allinea performance reports. Assume that the code is compiled with intel compilers and linked against intel MPI library:
First, we allocate some nodes via the express queue:
$ qsub -q qexp -l select=2:ppn=24:mpiprocs=24:ompthreads=1 -I
qsub: waiting for job 262197.dm2 to start
qsub: job 262197.dm2 ready
Then we load the modules and run the program the usual way:
$ ml intel
$ ml PerfReports/6.0
$ mpirun ./mympiprog.x
Now lets profile the code:
$ perf-report mpirun ./mympiprog.x
Performance report files [mympiprog_32p\*.txt](mympiprog_32p_2014-10-15_16-56.txt) and [mympiprog_32p\*.html](mympiprog_32p_2014-10-15_16-56.html) were created. We can see that the code is very efficient on MPI and is CPU bounded.
## Introduction
CUBE is a graphical performance report explorer for displaying data from Score-P and Scalasca (and other compatible tools). The name comes from the fact that it displays performance data in a three-dimensions :
* **performance metric**, where a number of metrics are available, such as communication time or cache misses,
* **call path**, which contains the call tree of your program
* **system resource**, which contains system's nodes, processes and threads, depending on the parallel programming model.
Each dimension is organized in a tree, for example the time performance metric is divided into Execution time and Overhead time, call path dimension is organized by files and routines in your source code etc.
\*Figure 1. Screenshot of CUBE displaying data from Scalasca.\*
Each node in the tree is colored by severity (the color scheme is displayed at the bottom of the window, ranging from the least severe blue to the most severe being red). For example in Figure 1, we can see that most of the point-to-point MPI communication happens in routine exch_qbc, colored red.
## Installed Versions
Currently, there are two versions of CUBE 4.2.3 available as [modules](../../modules-matrix/):
* cube/4.2.3-gcc, compiled with GCC
* cube/4.2.3-icc, compiled with Intel compiler
## Usage
CUBE is a graphical application. Refer to Graphical User Interface documentation for a list of methods to launch graphical applications on Anselm.
!!! note
Analyzing large data sets can consume large amount of CPU and RAM. Do not perform large analysis on login nodes.
After loading the appropriate module, simply launch cube command, or alternatively you can use scalasca -examine command to launch the GUI. Note that for Scalasca datasets, if you do not analyze the data with scalasca -examine before to opening them with CUBE, not all performance data will be available.
1\. <>
# Intel Performance Counter Monitor
## Introduction
Intel PCM (Performance Counter Monitor) is a tool to monitor performance hardware counters on Intel>® processors, similar to [PAPI](papi/). The difference between PCM and PAPI is that PCM supports only Intel hardware, but PCM can monitor also uncore metrics, like memory controllers and >QuickPath Interconnect links.
## Installed Version
Currently installed version 2.6. To load the [module](../../modules-matrix/) issue:
$ ml intelpcm
## Command Line Tools
PCM provides a set of tools to monitor system/or application.
### Pcm-Memory
Measures memory bandwidth of your application or the whole system. Usage:
$ pcm-memory.x <delay>|[external_program parameters]
Specify either a delay of updates in seconds or an external program to monitor. If you get an error about PMU in use, respond "y" and relaunch the program.
Sample output:
-- Socket 0 --||-- Socket 1 --
-- Memory Performance Monitoring --||-- Memory Performance Monitoring --
-- Mem Ch 0: Reads (MB/s): 2.44 --||-- Mem Ch 0: Reads (MB/s): 0.26 --
-- Writes(MB/s): 2.16 --||-- Writes(MB/s): 0.08 --
-- Mem Ch 1: Reads (MB/s): 0.35 --||-- Mem Ch 1: Reads (MB/s): 0.78 --
-- Writes(MB/s): 0.13 --||-- Writes(MB/s): 0.65 --
-- Mem Ch 2: Reads (MB/s): 0.32 --||-- Mem Ch 2: Reads (MB/s): 0.21 --
-- Writes(MB/s): 0.12 --||-- Writes(MB/s): 0.07 --
-- Mem Ch 3: Reads (MB/s): 0.36 --||-- Mem Ch 3: Reads (MB/s): 0.20 --
-- Writes(MB/s): 0.13 --||-- Writes(MB/s): 0.07 --
-- NODE0 Mem Read (MB/s): 3.47 --||-- NODE1 Mem Read (MB/s): 1.45 --
-- NODE0 Mem Write (MB/s): 2.55 --||-- NODE1 Mem Write (MB/s): 0.88 --
-- NODE0 P. Write (T/s) : 31506 --||-- NODE1 P. Write (T/s): 9099 --
-- NODE0 Memory (MB/s): 6.02 --||-- NODE1 Memory (MB/s): 2.33 --
-- System Read Throughput(MB/s): 4.93 --
-- System Write Throughput(MB/s): 3.43 --
-- System Memory Throughput(MB/s): 8.35 --
### Pcm-Msr
Command pcm-msr.x can be used to read/write model specific registers of the CPU.
### Pcm-Numa
NUMA monitoring utility does not work on Anselm.
### Pcm-Pcie
Can be used to monitor PCI Express bandwith. Usage: pcm-pcie.x &lt;delay>
### Pcm-Power
Displays energy usage and thermal headroom for CPU and DRAM sockets. Usage: `pcm-power.x <delay> | <external program>`
### Pcm
This command provides an overview of performance counters and memory usage. Usage: `pcm.x <delay> | <external program>`
Sample output :
$ pcm.x ./matrix
Intel(r) Performance Counter Monitor V2.6 (2013-11-04 13:43:31 +0100 ID=db05e43)
Copyright (c) 2009-2013 Intel Corporation
Number of physical cores: 16
Number of logical cores: 16
Threads (logical cores) per physical core: 1
Num sockets: 2
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 8
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 2400000000 Hz
Package thermal spec power: 115 Watt; Package minimum power: 51 Watt; Package maximum power: 180 Watt;
Socket 0: 1 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
Socket 1: 1 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
Number of PCM instances: 2
Max QPI link speed: 16.0 GBytes/second (8.0 GT/second)
Detected Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz "Intel(r) microarchitecture codename Sandy Bridge-EP/Jaketown"
Executing "./matrix" command:
Exit code: 0
EXEC : instructions per nominal CPU cycle
IPC : instructions per CPU cycle
FREQ : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state' (includes Intel Turbo Boost)
L3MISS: L3 cache misses
L2MISS: L2 cache misses (including other core's L2 cache *hits*)
L3HIT : L3 cache hit ratio (0.00-1.00)
L2HIT : L2 cache hit ratio (0.00-1.00)
L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency
L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)
READ : bytes read from memory controller (in GBytes)
WRITE : bytes written to memory controller (in GBytes)
TEMP : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature
0 0 0.00 0.64 0.01 0.80 5592 11 K 0.49 0.13 0.32 0.06 N/A N/A 67
1 0 0.00 0.18 0.00 0.69 3086 5552 0.44 0.07 0.48 0.08 N/A N/A 68
2 0 0.00 0.23 0.00 0.81 300 562 0.47 0.06 0.43 0.08 N/A N/A 67
3 0 0.00 0.21 0.00 0.99 437 862 0.49 0.06 0.44 0.09 N/A N/A 73
4 0 0.00 0.23 0.00 0.93 293 559 0.48 0.07 0.42 0.09 N/A N/A 73
5 0 0.00 0.21 0.00 1.00 423 849 0.50 0.06 0.43 0.10 N/A N/A 69
6 0 0.00 0.23 0.00 0.94 285 558 0.49 0.06 0.41 0.09 N/A N/A 71
7 0 0.00 0.18 0.00 0.81 674 1130 0.40 0.05 0.53 0.08 N/A N/A 65
8 1 0.00 0.47 0.01 1.26 6371 13 K 0.51 0.35 0.31 0.07 N/A N/A 64
9 1 2.30 1.80 1.28 1.29 179 K 15 M 0.99 0.59 0.04 0.71 N/A N/A 60
10 1 0.00 0.22 0.00 1.26 315 570 0.45 0.06 0.43 0.08 N/A N/A 67
11 1 0.00 0.23 0.00 0.74 321 579 0.45 0.05 0.45 0.07 N/A N/A 66
12 1 0.00 0.22 0.00 1.25 305 570 0.46 0.05 0.42 0.07 N/A N/A 68
13 1 0.00 0.22 0.00 1.26 336 581 0.42 0.04 0.44 0.06 N/A N/A 69
14 1 0.00 0.22 0.00 1.25 314 565 0.44 0.06 0.43 0.07 N/A N/A 69
15 1 0.00 0.29 0.00 1.19 2815 6926 0.59 0.39 0.29 0.08 N/A N/A 69
SKT 0 0.00 0.46 0.00 0.79 11 K 21 K 0.47 0.10 0.38 0.07 0.00 0.00 65
SKT 1 0.29 1.79 0.16 1.29 190 K 15 M 0.99 0.59 0.05 0.70 0.01 0.01 61
TOTAL * 0.14 1.78 0.08 1.28 201 K 15 M 0.99 0.59 0.05 0.70 0.01 0.01 N/A
Instructions retired: 1345 M ; Active cycles: 755 M ; Time (TSC): 582 Mticks ; C0 (active,non-halted) core residency: 6.30 %
C1 core residency: 0.14 %; C3 core residency: 0.20 %; C6 core residency: 0.00 %; C7 core residency: 93.36 %;
C2 package residency: 48.81 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;
PHYSICAL CORE IPC : 1.78 => corresponds to 44.50 % utilization for cores in active state
Instructions per nominal CPU cycle: 0.14 => corresponds to 3.60 % core utilization over time interval
Intel(r) QPI data traffic estimation in bytes (data traffic coming to CPU/socket through QPI links):
SKT 0 0 0 | 0% 0%
SKT 1 0 0 | 0% 0%
Total QPI incoming data traffic: 0 QPI data traffic/Memory controller traffic: 0.00
Intel(r) QPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through QPI links):
SKT 0 0 0 | 0% 0%
SKT 1 0 0 | 0% 0%
Total QPI outgoing data and non-data traffic: 0
SKT 0 package consumed 4.06 Joules
SKT 1 package consumed 9.40 Joules
TOTAL: 13.46 Joules
SKT 0 DIMMs consumed 4.18 Joules
SKT 1 DIMMs consumed 4.28 Joules
TOTAL: 8.47 Joules
Cleaning up
### Pcm-Sensor
Can be used as a sensor for ksysguard GUI, which is currently not installed on Anselm.
## API
In a similar fashion to PAPI, PCM provides a C++ API to access the performance counter from within your application. Refer to the [Doxygen documentation]( for details of the API.
!!! note
Due to security limitations, using PCM API to monitor your applications is currently not possible on Anselm. (The application must be run as root user)
Sample program using the API :
#include <stdlib.h>
#include <stdio.h>
#include "cpucounters.h"
#define SIZE 1000
using namespace std;
int main(int argc, char **argv) {
float matrixa[SIZE][SIZE], matrixb[SIZE][SIZE], mresult[SIZE][SIZE];
float real_time, proc_time, mflops;
long long flpins;
int retval;
int i,j,k;
PCM * m = PCM::getInstance();
if (m->program() != PCM::Success) return 1;
SystemCounterState before_sstate = getSystemCounterState();
/* Initialize the Matrix arrays */
for ( i=0; i<SIZE*SIZE; i++ ){
mresult[0][i] = 0.0;
matrixa[0][i] = matrixb[0][i] = rand()*(float)1.1; }
/* A naive Matrix-Matrix multiplication */
for (i=0;i<SIZE;i++)
mresult[i][j]=mresult[i][j] + matrixa[i][k]*matrixb[k][j];
SystemCounterState after_sstate = getSystemCounterState();
cout << "Instructions per clock:" << getIPC(before_sstate,after_sstate)
<< "L3 cache hit ratio:" << getL3CacheHitRatio(before_sstate,after_sstate)
<< "Bytes read:" << getBytesReadFromMC(before_sstate,after_sstate);
for (i=0; i<SIZE;i++)
for (j=0; j<SIZE; j++)
if (mresult[i][j] == -1) printf("x");
return 0;