Commit e30f7196 authored by Pavel Gajdušek

structure changed, to do: fix link

parent e6f2bcad
# Debuggers and profilers summary
## Introduction
We provide state-of-the-art programs and tools to develop, profile, and debug HPC codes at IT4Innovations. These pages give an overview of the profiling and debugging tools available on Anselm at IT4I.
## Intel Debugger
The Intel debugger is no longer available in Parallel Studio 2015 and later.
The older Intel debugger version 13.0 is available via the intel module. The debugger works for applications compiled with the C and C++ compilers and the ifort Fortran 77/90/95 compiler. The debugger provides a Java GUI environment.
$ ml intel
$ idb
Read more at the [Intel Debugger](../../anselm/software/intel-suite/intel-debugger/) page.
## Allinea Forge (DDT/MAP)
Allinea DDT is a commercial debugger primarily for debugging parallel MPI or OpenMP programs. It also supports GPU (CUDA) and Intel Xeon Phi accelerators. DDT provides all the standard debugging features (stack trace, breakpoints, watches, view variables, threads, etc.) for every thread running as part of your program, or for every process, even if these processes are distributed across a cluster using an MPI implementation.
$ ml Forge
$ forge
Read more at the [Allinea DDT](allinea-ddt/) page.
## Allinea Performance Reports
Allinea Performance Reports characterize the performance of HPC application runs. After executing your application through the tool, a synthetic HTML report is generated automatically, containing information about several metrics along with clear behavior statements and hints to help you improve the efficiency of your runs. Our license is limited to 64 MPI processes.
$ ml PerformanceReports/6.0
$ perf-report mpirun -n 64 ./my_application argument01 argument02
Read more at the [Allinea Performance Reports](allinea-performance-reports/) page.
## RogueWave TotalView
TotalView is a source- and machine-level debugger for multi-process, multi-threaded programs. Its wide range of tools provides ways to analyze, organize, and test programs, making it easy to isolate and identify problems in individual threads and processes in programs of great complexity.
$ ml TotalView/8.15.4-6-linux-x86-64
$ totalview
Read more at the [Totalview](total-view/) page.
## Vampir Trace Analyzer
Vampir is a GUI trace analyzer for traces in OTF format.
$ ml Vampir/8.5.0
$ vampir
Read more at the [Vampir](vampir/) page.
# Aislinn
* Aislinn is a dynamic verifier for MPI programs. For a fixed input, it covers all possible runs with respect to the nondeterminism introduced by MPI. This makes it possible to reliably detect bugs that occur only very rarely in normal runs.
* Aislinn detects problems like invalid memory accesses, deadlocks, misuse of MPI, and resource leaks.
* Aislinn is open-source software; you can use it without any licensing limitations.
* Web page of the project: <>
!!! note
Aislinn is software developed at IT4Innovations and some parts are still considered experimental. If you have any questions or experience any problems, please contact the author: <>.
## Usage
Let us have the following program that contains a bug that is not manifested in all runs:
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int *mem1 = (int*) malloc(sizeof(int) * 2);
        int *mem2 = (int*) malloc(sizeof(int) * 3);
        int data;
        MPI_Recv(&data, 1, MPI_INT, MPI_ANY_SOURCE, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        mem1[data] = 10; // <---------- Possible invalid memory write
        MPI_Recv(&data, 1, MPI_INT, MPI_ANY_SOURCE, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        mem2[data] = 10;
        free(mem1);
        free(mem2);
    }

    if (rank == 1 || rank == 2) {
        MPI_Send(&rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
The program does the following: process 0 receives two messages from any source, and processes 1 and 2 each send one message to process 0. If the message from process 1 is received first, the run does not expose the error. If the message from process 2 is received first, `data` is 2, which is out of bounds for the two-element `mem1`, so an invalid memory write occurs at line 16.
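The two receive orders described above can be replayed without MPI. The following is a minimal sketch in plain C, not a description of how Aislinn works internally. The helper names `write_is_valid`, `order_is_safe`, and `count_invalid_orders` are hypothetical; the buffer lengths 2 and 3 are taken from the example program.

```c
#include <stddef.h>

/* Lengths of mem1 and mem2 in the example program (in ints). */
#define MEM1_LEN 2
#define MEM2_LEN 3

/* Returns 1 if index idx lies inside a buffer of len elements. */
static int write_is_valid(int idx, size_t len) {
    return idx >= 0 && (size_t)idx < len;
}

/*
 * Replays one arrival order. first and second are the ranks whose
 * messages process 0 receives first and second. The received rank is
 * used as the index, as in the example (mem1[data], then mem2[data]).
 * Returns 1 if both writes stay in bounds, 0 otherwise.
 */
static int order_is_safe(int first, int second) {
    return write_is_valid(first, MEM1_LEN) && write_is_valid(second, MEM2_LEN);
}

/* Checks both possible orders of the two senders (ranks 1 and 2) and
 * returns how many of them lead to an out-of-bounds write. */
static int count_invalid_orders(void) {
    int invalid = 0;
    if (!order_is_safe(1, 2)) invalid++;
    if (!order_is_safe(2, 1)) invalid++;
    return invalid;
}
```

Calling `count_invalid_orders()` reports one failing order out of two, which is the run where the message from process 2 arrives first.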
To verify this program by Aislinn, we first load Aislinn itself:
$ ml aislinn
Now we compile the program with the Aislinn implementation of MPI. It provides `mpicc` for C programs and `mpicxx` for C++ programs. Only the MPI parts of the verified application have to be recompiled; non-MPI parts may remain untouched. Let us assume that our program is in `test.cpp`.
$ mpicc -g test.cpp -o test
The `-g` flag is not necessary, but it puts more debugging information into the program, so Aislinn can provide a more detailed report. The command produces the executable file `test`.
Now we run Aislinn itself. The argument `-p 3` specifies that we want to verify our program for the case of three MPI processes:
$ aislinn -p 3 ./test
==AN== INFO: Aislinn v0.3.0
==AN== INFO: Found error 'Invalid write'
==AN== INFO: 1 error(s) found
==AN== INFO: Report written into 'report.html'
Aislinn found an error and produced an HTML report. To view it, we can use any browser, e.g.:
$ firefox report.html
At the beginning of the report there are some basic summaries of the verification. In the second part (depicted in the following picture), the error is described.
It shows us:
* The error occurs in process 0 in test.cpp on line 16.
* The stdout and stderr streams are empty (the program does not write anything).
* The last part shows the MPI calls of each process that occur in the invalid run. More detailed information about each call can be obtained by hovering the mouse cursor over it.
### Limitations
Since verification is a non-trivial process, there are some limitations.
* The verified process has to terminate in all runs, i.e. we cannot answer the halting problem.
* Verification is computationally and memory demanding. We put effort into making it efficient, and it is an important point for further research. However, covering all runs will always be more demanding than techniques that examine only a single run. Good practice is to start with small instances and, when feasible, make them bigger. Aislinn is good at finding bugs that are hard to find because they occur very rarely (only in a rare scheduling). Such bugs often do not need big instances.
* Aislinn expects that your program is a "standard MPI" program, i.e. processes communicate only through MPI and the verified program does not interact with the system in unusual ways (e.g. opening sockets).
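As a rough illustration of why covering all runs is demanding, consider the simplest wildcard pattern: one process posts n receives with `MPI_ANY_SOURCE`, and n different processes each send one message. Under that assumption there are n! possible matching orders. The sketch below only counts them; the function name `count_receive_orders` is hypothetical, and the count ignores any state-space reduction Aislinn may apply.

```c
#include <stdint.h>

/*
 * Number of distinct orders in which `senders` single messages can be
 * matched by wildcard receives, i.e. senders!.
 * Returns 0 if the result does not fit into 64 bits.
 */
static uint64_t count_receive_orders(unsigned senders) {
    uint64_t orders = 1;
    unsigned i;
    for (i = 2; i <= senders; i++) {
        if (orders > UINT64_MAX / i) {
            return 0; /* overflow */
        }
        orders *= i;
    }
    return orders;
}
```

With 3 senders there are 6 orders, with 10 senders already 3628800, which is why starting with small instances is recommended.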
There are also some limitations bound to the current version; they will be removed in the future:
* All files containing MPI calls have to be recompiled with the MPI implementation provided by Aislinn. Files that do not contain MPI calls do not have to be recompiled. The Aislinn MPI implementation supports many commonly used calls from MPI-2 and MPI-3 related to point-to-point communication, collective communication, and communicator management. Unfortunately, MPI-IO and one-sided communication are not implemented yet.
* Each MPI process can use only one thread (if you use OpenMP, set OMP_NUM_THREADS to 1).
* There are some limitations on using files, but if the program just reads inputs and writes results, it is OK.
# Allinea Forge (DDT,MAP)
Allinea Forge consists of two tools: the debugger DDT and the profiler MAP.
Allinea DDT is a commercial debugger primarily for debugging parallel MPI or OpenMP programs. It also supports GPU (CUDA) and Intel Xeon Phi accelerators. DDT provides all the standard debugging features (stack trace, breakpoints, watches, view variables, threads, etc.) for every thread running as part of your program, or for every process, even if these processes are distributed across a cluster using an MPI implementation.
Allinea MAP is a profiler for C/C++/Fortran HPC codes. It is designed for profiling parallel code, which uses pthreads, OpenMP or MPI.
## License and Limitations for Anselm Users
On Anselm, users can debug OpenMP or MPI code that runs up to 64 parallel processes. When debugging GPU or Xeon Phi accelerated codes, the limit is 8 accelerators. These limitations mean that:
* 1 user can debug up to 64 processes, or
* 32 users can debug 2 processes each, etc.
When debugging on accelerators:
* 1 user can debug on up to 8 accelerators, or
* 8 users can debug on a single accelerator each.
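The examples above suggest that the limit is a pool shared by all concurrent debugging sessions. Under that assumption, checking whether a set of sessions fits is a simple sum. The following sketch is illustrative only: `sessions_fit_license` and the two macros are hypothetical names, and the license server itself is what actually enforces the limit.

```c
#include <stddef.h>

#define LICENSE_MAX_PROCESSES    64  /* process limit stated above */
#define LICENSE_MAX_ACCELERATORS  8  /* accelerator limit stated above */

/*
 * Returns 1 if the sum of the per-user requests stays within limit,
 * 0 otherwise. per_user[i] is the number of processes (or
 * accelerators) requested by user i.
 */
static int sessions_fit_license(const int *per_user, size_t users, int limit) {
    long total = 0;
    size_t i;
    for (i = 0; i < users; i++) {
        if (per_user[i] < 0) {
            return 0; /* negative requests are invalid */
        }
        total += per_user[i];
    }
    return total <= limit;
}
```

For example, 32 users with 2 processes each fit exactly into `LICENSE_MAX_PROCESSES`, while one 64-process session plus any other session does not.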
## Compiling Code to Run With DDT
### Modules
Load all necessary modules to compile the code. For example:
$ ml intel
$ ml impi    # or: ml OpenMPI/X.X.X-icc
Load the Allinea DDT module:
$ ml Forge
Compile the code:
$ mpicc -g -O0 -o test_debug test.c
$ mpif90 -g -O0 -o test_debug test.f
### Compiler Flags
Before debugging, you need to compile your code with these flags:
!!! note
* **-g**: Generates extra debugging information usable by GDB. -g3 includes even more debugging information. This option is available for GNU and Intel C/C++ and Fortran compilers.
* **-O0**: Suppresses all optimizations.
## Starting a Job With DDT
Be sure to log in with X window forwarding enabled. This could mean using the -X option in ssh:
$ ssh -X
Another option is to access the login node using VNC. Please see the detailed information on how to [use the graphical user interface on Anselm](/general/accessing-the-clusters/graphical-user-interface/x-window-system/).
From the login node, an interactive session **with X windows forwarding** (-X option) can be started with the following command:
$ qsub -I -X -A NONE-0-0 -q qexp -lselect=1:ncpus=16:mpiprocs=16,walltime=01:00:00
Then launch the debugger with the ddt command followed by the name of the executable to debug:
$ ddt test_debug
The submission window that appears has a prefilled path to the executable to debug. You can select the number of MPI processes and/or OpenMP threads on which to run and press Run. Command line arguments to the program can be entered in the "Arguments" box.
To start debugging directly without the submission window, you can specify the debugging and execution parameters on the command line. For example, the number of MPI processes is set by the option "-np 4". The dialog is skipped with the "-start" option. To see the list of "ddt" command line parameters, run "ddt --help".
$ ddt -start -np 4 ./hello_debug_impi
## Documentation
Users can find original User Guide after loading the DDT module:
[1] Discipline, Magic, Inspiration and Science: Best Practice Debugging with Allinea DDT, Workshop conducted at LLNL by Allinea on May 10, 2013, [link](
# Allinea Performance Reports
## Introduction
Allinea Performance Reports characterize the performance of HPC application runs. After executing your application through the tool, a synthetic HTML report is generated automatically, containing information about several metrics along with clear behavior statements and hints to help you improve the efficiency of your runs.
Allinea Performance Reports is most useful for profiling MPI programs.
Our license is limited to 64 MPI processes.
## Modules
Allinea Performance Reports version 6.0 is available:
$ ml PerformanceReports/6.0
The module sets up the environment variables required for using Allinea Performance Reports.
## Usage
Instead of [running your MPI program the usual way](../mpi/mpi/), use the perf-report wrapper on your (MPI) program:
$ perf-report mpirun ./mympiprog.x
The MPI program will run as usual. perf-report creates two additional files, in \*.txt and \*.html format, containing the performance report. Note that demanding MPI codes should be run within [the queue system](../../anselm/job-submission-and-execution/).
## Example
In this example, we will profile the mympiprog.x MPI program using Allinea Performance Reports. Assume that the code is compiled with Intel compilers and linked against the Intel MPI library.
First, we allocate some nodes via the express queue:
$ qsub -q qexp -l select=2:ppn=24:mpiprocs=24:ompthreads=1 -I
qsub: waiting for job 262197.dm2 to start
qsub: job 262197.dm2 ready
Then we load the modules and run the program the usual way:
$ ml intel
$ ml PerformanceReports/6.0
$ mpirun ./mympiprog.x
Now let's profile the code:
$ perf-report mpirun ./mympiprog.x
Performance report files [mympiprog_32p\*.txt](mympiprog_32p_2014-10-15_16-56.txt) and [mympiprog_32p\*.html](mympiprog_32p_2014-10-15_16-56.html) were created. We can see that the code is very efficient on MPI and is CPU-bound.
# CUBE
## Introduction
CUBE is a graphical performance report explorer for displaying data from Score-P and Scalasca (and other compatible tools). The name comes from the fact that it displays performance data in three dimensions:
* **performance metric**, where a number of metrics are available, such as communication time or cache misses,
* **call path**, which contains the call tree of your program,
* **system resource**, which contains the system's nodes, processes, and threads, depending on the parallel programming model.
Each dimension is organized in a tree. For example, the time performance metric is divided into Execution time and Overhead time, and the call path dimension is organized by files and routines in your source code.
*Figure 1. Screenshot of CUBE displaying data from Scalasca.*
Each node in the tree is colored by severity (the color scheme is displayed at the bottom of the window, ranging from the least severe, blue, to the most severe, red). For example, in Figure 1 we can see that most of the point-to-point MPI communication happens in the routine exch_qbc, colored red.
## Installed Versions
Currently, there are two builds of CUBE 4.2.3 available as [modules](../../modules-matrix/):
* cube/4.2.3-gcc, compiled with GCC
* cube/4.2.3-icc, compiled with Intel compiler
## Usage
CUBE is a graphical application. Refer to the Graphical User Interface documentation for a list of methods to launch graphical applications on Anselm.
!!! note
Analyzing large data sets can consume a large amount of CPU and RAM. Do not perform large analyses on login nodes.
After loading the appropriate module, simply launch the cube command, or alternatively use the scalasca -examine command to launch the GUI. Note that for Scalasca data sets, if you do not analyze the data with scalasca -examine before opening them with CUBE, not all performance data will be available.
# Intel VTune Amplifier XE
## Introduction
Intel® VTune™ Amplifier, part of Intel Parallel Studio, is a GUI profiling tool designed for Intel processors. It offers graphical performance analysis of single-core and multithreaded applications. A highlight of the features:
* Hotspot analysis
* Locks and waits analysis
* Low level specific counters, such as branch analysis and memory bandwidth
* Power usage analysis - frequency and sleep states.
## Usage
To profile an application with VTune Amplifier, special kernel modules need to be loaded. The modules are not loaded on the login nodes, so direct profiling on login nodes is not possible. By default, the kernel modules are not loaded on compute nodes either. To have the modules loaded, you need to specify the vtune=version PBS resource at job submission. The version is the same as for the environment module. For example, to use VTune/2016_update1:
$ qsub -q qexp -A OPEN-0-0 -I -l select=1,vtune=2016_update1
After that, you can verify that the modules sep\*, pax, and vtsspp are present in the kernel:
$ lsmod | grep -e sep -e pax -e vtsspp
vtsspp 362000 0
sep3_15 546657 0
pax 4312 0
To launch the GUI, first load the module:
$ module add VTune/2016_update1
and launch the GUI:
$ amplxe-gui
The GUI will open in a new window. Click on "New Project..." to create a new project. After clicking OK, a new window with project properties will appear. At "Application:", select the path to the binary you want to profile (the binary should be compiled with the -g flag). Some additional options, such as command line arguments, can be selected. At "Managed code profiling mode:", select "Native" (unless you want to profile managed mode .NET/Mono applications). After clicking OK, your project is created.
To run a new analysis, click "New analysis...". You will see a list of possible analyses. Some of them will not be possible on the current CPU (e.g. Intel Atom analysis is not possible on a Sandy Bridge CPU); the GUI will show an error box if you select the wrong analysis. For example, select "Advanced Hotspots". Clicking Start will start profiling the application.
## Remote Analysis
VTune Amplifier also allows a form of remote analysis. In this mode, data for analysis is collected from the command line without GUI, and the results are then loaded to GUI on another machine. This allows profiling without interactive graphical jobs. To perform a remote analysis, launch a GUI somewhere, open the new analysis window and then click the button "Command line" in bottom right corner. It will show the command line needed to perform the selected analysis.
The command line will look like this:
/apps/all/VTune/2016_update1/vtune_amplifier_xe_2016.1.1.434111/bin64/amplxe-cl -collect advanced-hotspots -app-working-dir /home/sta545/tmp -- /home/sta545/tmp/sgemm
Copy the line to clipboard and then you can paste it in your jobscript or in command line. After the collection is run, open the GUI once again, click the menu button in the upper right corner, and select "Open > Result...". The GUI will load the results from the run.
## Xeon Phi
It is possible to analyze both native and offloaded Xeon Phi applications.
### Native Mode
This mode is useful for native Xeon Phi applications launched directly on the card. In *Analysis Target* window, select *Intel Xeon Phi coprocessor (native)*, choose path to the binary and MIC card to run on.
### Offload Mode
This mode is useful for applications that are launched from the host and use offload, OpenCL, or mpirun. In the *Analysis Target* window, select *Intel Xeon Phi coprocessor (host launch)*, choose the path to the binary and the MIC card to run on.
!!! note
If the analysis is interrupted or aborted, further analysis on the card might be impossible and you will get errors like "ERROR connecting to MIC card". In this case please contact our support to reboot the MIC card.
You may also use remote analysis to collect data from the MIC and then analyze it in the GUI later:
Native launch:
$ /apps/all/VTune/2016_update1/vtune_amplifier_xe_2016.1.1.434111/bin64/amplxe-cl -target-system mic-native:0 -collect advanced-hotspots -- /home/sta545/tmp/vect-add-mic
Host launch:
$ /apps/all/VTune/2016_update1/vtune_amplifier_xe_2016.1.1.434111/bin64/amplxe-cl -target-system mic-host-launch:0 -collect advanced-hotspots -- /home/sta545/tmp/sgemm
You can obtain this command line by pressing the "Command line..." button on Analysis Type screen.
## References
1. [Performance Tuning for Intel® Xeon Phi™ Coprocessors](
1. [Intel® VTune™ Amplifier Support](
Executable: mympiprog.x
Resources: 32 processes, 2 nodes
Machine: cn182
Started on: Wed Oct 15 16:56:23 2014
Total time: 7 seconds (0 minutes)
Full path: /home/user
Summary: mympiprog.x is CPU-bound in this configuration
CPU: 88.6% |========|
MPI: 11.4% ||
I/O: 0.0% |
This application run was CPU-bound. A breakdown of this time and advice for investigating further is found in the CPU section below.
As very little time is spent in MPI calls, this code may also benefit from running at larger scales.
A breakdown of how the 88.6% total CPU time was spent:
Scalar numeric ops: 50.0% |====|
Vector numeric ops: 50.0% |====|
Memory accesses: 0.0% |
Other: 0.0% |
The per-core performance is arithmetic-bound. Try to increase the amount of time spent in vectorized instructions by analyzing the compiler's vectorization reports.
A breakdown of how the 11.4% total MPI time was spent:
Time in collective calls: 100.0% |=========|
Time in point-to-point calls: 0.0% |
Effective collective rate: 1.65e+02 bytes/s
Effective point-to-point rate: 0.00e+00 bytes/s
Most of the time is spent in collective calls with a very low transfer rate. This suggests load imbalance is causing synchonization overhead; use an MPI profiler to investigate further.
A breakdown of how the 0.0% total I/O time was spent:
Time in reads: 0.0% |
Time in writes: 0.0% |
Effective read rate: 0.00e+00 bytes/s
Effective write rate: 0.00e+00 bytes/s
No time is spent in I/O operations. There's nothing to optimize here!
Per-process memory usage may also affect scaling:
Mean process memory usage: 2.33e+07 bytes
Peak process memory usage: 2.35e+07 bytes
Peak node memory usage: 2.8% |
The peak node memory usage is very low. You may be able to reduce the amount of allocation time used by running with fewer MPI processes and more data on each process.
# PAPI
## Introduction
Performance Application Programming Interface (PAPI) is a portable interface to access hardware performance counters (such as instruction counts and cache misses) found in most modern architectures. With the new component framework, PAPI is not limited to CPU counters, but also offers components for CUDA, network, InfiniBand, etc.
PAPI provides two levels of interface: a simpler high-level interface and a more detailed low-level interface.
PAPI can be used with parallel as well as serial programs.
## Usage
To use PAPI, load [module](../../modules-matrix/) papi:
$ ml papi
This will load the default version. Execute `module avail papi` for a list of installed versions.
## Utilities
The bin directory of PAPI (which is automatically added to $PATH upon loading the module) contains various utilities.
### Papi_avail
Prints which preset events are available on the current CPU. The third column indicates whether the preset event is available on the current CPU.
$ papi_avail
Available events and hardware information.
PAPI Version :
Vendor string and code : GenuineIntel (1)
Model string and code : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (45)
CPU Revision : 7.000000
CPUID Info : Family: 6 Model: 45 Stepping: 7
CPU Max Megahertz : 2601
CPU Min Megahertz : 1200
Hdw Threads per core : 1
Cores per Socket : 8
Sockets : 2
NUMA Nodes : 2
CPUs per Node : 8
Total CPUs : 16
Running in a VM : no
Number Hardware Counters : 11
Max Multiplex Counters : 32
Name Code Avail Deriv Description (Note)
PAPI_L1_DCM 0x80000000 Yes No Level 1 data cache misses
PAPI_L1_ICM 0x80000001 Yes No Level 1 instruction cache misses
PAPI_L2_DCM 0x80000002 Yes Yes Level 2 data cache misses
PAPI_L2_ICM 0x80000003 Yes No Level 2 instruction cache misses
PAPI_L3_DCM 0x80000004 No No Level 3 data cache misses
PAPI_L3_ICM 0x80000005 No No Level 3 instruction cache misses
PAPI_L1_TCM 0x80000006 Yes Yes Level 1 cache misses
PAPI_L2_TCM 0x80000007 Yes No Level 2 cache misses
PAPI_L3_TCM 0x80000008 Yes No Level 3 cache misses
### Papi_native_avail
Prints which native events are available on the current CPU.
### Papi_cost
Measures the cost (in cycles) of basic PAPI operations.
### Papi_mem_info
Prints information about the memory architecture of the current CPU.
PAPI provides two kinds of events:
* **Preset events** is a set of predefined common CPU events, standardized across platforms.
* **Native events** is a set of all events supported by the current hardware. This is a larger set than the preset events. For components other than the CPU, usually only native events are available.
To use PAPI in your application, you need to include the appropriate header file:
* papi.h for C
* f77papi.h for Fortran 77
* f90papi.h for Fortran 90
* fpapi.h for Fortran with preprocessor
The include path is automatically added to $INCLUDE by the papi module.
### High Level API
Please refer to [this description of the High level API](
### Low Level API
Please refer to [this description of the Low level API](
### Timers
PAPI provides the most accurate timers the platform can support. [See](
### System Information
PAPI can be used to query some system information, such as CPU name and MHz. [See](
## Example
The following example prints the MFLOPS rate of a naive matrix-matrix multiplication:
#include <stdlib.h>
#include <stdio.h>
#include "papi.h"
#define SIZE 1000
int main(int argc, char **argv) {
    float matrixa[SIZE][SIZE], matrixb[SIZE][SIZE], mresult[SIZE][SIZE];
    float real_time, proc_time, mflops;
    long long flpins;
    int retval;
    int i,j,k;
    /* Initialize the Matrix arrays */
    for ( i=0; i<SIZE*SIZE; i++ ){
        mresult[0][i] = 0.0;
        matrixa[0][i] = matrixb[0][i] = rand()*(float)1.1;
    }
    /* Setup PAPI library and begin collecting data from the counters */
    if((retval=PAPI_flops( &real_time, &proc_time, &flpins, &mflops))<PAPI_OK) {
        fprintf(stderr, "PAPI_flops failed with error %d\n", retval);
        return 1;
    }
    /* A naive Matrix-Matrix multiplication */
    for (i=0;i<SIZE;i++)
        for (j=0;j<SIZE;j++)
            for (k=0;k<SIZE;k++)
                mresult[i][j]=mresult[i][j] + matrixa[i][k]*matrixb[k][j];
    /* Collect the data into the variables passed in */
    if((retval=PAPI_flops( &real_time, &proc_time, &flpins, &mflops))<PAPI_OK) {
        fprintf(stderr, "PAPI_flops failed with error %d\n", retval);
        return 1;
    }
    printf("Real_time:\t%f\nProc_time:\t%f\nTotal flpins:\t%lld\nMFLOPS:\t\t%f\n", real_time, proc_time, flpins, mflops);
    return 0;
}
Now compile and run the example:
$ gcc matrix.c -o matrix -lpapi
$ ./matrix
Real_time: 8.852785
Proc_time: 8.850000
Total flpins: 6012390908
MFLOPS: 679.366211
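The reported rate is consistent with dividing the instruction count by the process time. The sketch below reproduces that arithmetic for the printed values; it is an assumption about how the numbers relate, not a statement about the internals of PAPI, and `mflops_from_counters` is a hypothetical helper name.

```c
/*
 * MFLOPS = floating point instructions / process time in seconds / 10^6.
 * Returns 0.0 when the measured time is not positive.
 */
static double mflops_from_counters(long long flpins, double proc_time_s) {
    if (proc_time_s <= 0.0) {
        return 0.0;
    }
    return (double)flpins / proc_time_s / 1.0e6;
}
```

For the run above, `mflops_from_counters(6012390908LL, 8.85)` gives a value of about 679.37, in line with the MFLOPS line in the output.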
Let's try with optimizations enabled:
$ gcc -O3 matrix.c -o matrix -lpapi
$ ./matrix
Real_time: 0.000020
Proc_time: 0.000000
Total flpins: 6
Now we see a seemingly strange result: the multiplication took no time and only 6 floating point instructions were issued. This is because the compiler optimizations have completely removed the multiplication loop, as the result is not used anywhere in the program. We can fix this by adding some "dummy" code at the end of the matrix-matrix multiplication routine:
for (i=0; i<SIZE;i++)
for (j=0; j<SIZE; j++)
if (mresult[i][j] == -1.0) printf("x");
Now the compiler won't remove the multiplication loop. (However, it is still not smart enough to see that the result will never be negative.) Now run the code again:
$ gcc -O3 matrix.c -o matrix -lpapi
$ ./matrix
Real_time: 8.795956
Proc_time: 8.790000
Total flpins: 18700983160
MFLOPS: 2127.529297
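The dummy comparison loop is one way to keep the result alive. A different technique, not used in the example above, is to store a value derived from the result in a `volatile` variable: the store is an observable side effect, so the compiler has to produce the stored value, although it remains free to optimize how it is computed. The following is a minimal sketch without PAPI; the names `sink` and `multiply_and_sum` and the reduced size `N` are assumptions made for illustration.

```c
#include <stddef.h>

#define N 64  /* small size chosen for this sketch; the example above uses 1000 */

/* A store to a volatile object counts as an observable side effect. */
static volatile float sink;

/*
 * Fills a with a_val and b with b_val, computes c = a * b with the naive
 * triple loop, and returns the sum of all elements of c. The sum is also
 * stored in sink, so the result of the multiplication is actually used.
 */
static float multiply_and_sum(float a_val, float b_val) {
    static float a[N][N], b[N][N], c[N][N];
    float checksum = 0.0f;
    size_t i, j, k;

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            a[i][j] = a_val;
            b[i][j] = b_val;
            c[i][j] = 0.0f;
        }

    /* The naive matrix-matrix multiplication being measured. */
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            checksum += c[i][j];

    sink = checksum;
    return checksum;
}
```

In a PAPI measurement, the two `PAPI_flops` calls would surround the multiplication loop, with the checksum and the store to `sink` placed after the second call.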
### Intel Xeon Phi
!!! note
PAPI currently supports only a subset of counters on the Intel Xeon Phi processor compared to Intel Xeon; for example, the floating point operations counter is missing.
To use PAPI in [Intel Xeon Phi](../../anselm/software/intel-xeon-phi/) native applications, you need to load a module with the "-mic" suffix, for example "papi/5.3.2-mic":
$ ml papi/5.3.2-mic
Then, compile your application in the following way:
$ ml intel
$ icc -mmic -Wl,-rpath,/apps/intel/composer_xe_2013.5.192/compiler/lib/mic matrix-mic.c -o matrix-mic -lpapi -lpfm
To execute the application on MIC, you need to manually set LD_LIBRARY_PATH:
$ qsub -q qmic -A NONE-0-0 -I
$ ssh mic0
$ export LD_LIBRARY_PATH="/apps/tools/papi/5.4.0-mic/lib/"
$ ./matrix-mic