# PAPI

## Introduction

Performance Application Programming Interface (PAPI) is a portable interface for accessing hardware performance counters (such as instruction counts and cache misses) found in most modern architectures. With its component framework, PAPI is not limited to CPU counters; it also offers components for CUDA, network, InfiniBand, etc.

PAPI provides two levels of interface: a simpler high-level interface and a more detailed low-level interface.

PAPI can be used with parallel as well as serial programs.

## Usage

To use PAPI, load the `papi` [module](../../environment-and-modules/):

```bash
    $ module load papi
```

This loads the default version. Run `module avail papi` to list the installed versions.

## Utilities

The bin directory of PAPI (which is automatically added to $PATH upon loading the module) contains various utilities.

### Papi_avail

Prints hardware information and the list of preset events for the current CPU. The third column (`Avail`) indicates whether each preset event is available on the current CPU.

```bash
    $ papi_avail
    Available events and hardware information.
    --------------------------------------------------------------------------------
    PAPI Version : 5.3.2.0
    Vendor string and code : GenuineIntel (1)
    Model string and code : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (45)
    CPU Revision : 7.000000
    CPUID Info : Family: 6 Model: 45 Stepping: 7
    CPU Max Megahertz : 2601
    CPU Min Megahertz : 1200
    Hdw Threads per core : 1
    Cores per Socket : 8
    Sockets : 2
    NUMA Nodes : 2
    CPUs per Node : 8
    Total CPUs : 16
    Running in a VM : no
    Number Hardware Counters : 11
    Max Multiplex Counters : 32
    --------------------------------------------------------------------------------
    Name Code Avail Deriv Description (Note)
    PAPI_L1_DCM 0x80000000 Yes No Level 1 data cache misses
    PAPI_L1_ICM 0x80000001 Yes No Level 1 instruction cache misses
    PAPI_L2_DCM 0x80000002 Yes Yes Level 2 data cache misses
    PAPI_L2_ICM 0x80000003 Yes No Level 2 instruction cache misses
    PAPI_L3_DCM 0x80000004 No No Level 3 data cache misses
    PAPI_L3_ICM 0x80000005 No No Level 3 instruction cache misses
    PAPI_L1_TCM 0x80000006 Yes Yes Level 1 cache misses
    PAPI_L2_TCM 0x80000007 Yes No Level 2 cache misses
    PAPI_L3_TCM 0x80000008 Yes No Level 3 cache misses
    ....
```
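
If the table needs to be processed by a program, its rows can be split into the columns named in the header line. The following is a minimal sketch, assuming the row layout shown above (name, hexadecimal code, `Avail`, `Deriv`, description); `struct avail_row` and `parse_avail_row` are hypothetical names for illustration and are not part of PAPI.

```c
#include <stdio.h>
#include <string.h>

/* One row of the preset event table printed by papi_avail
   (hypothetical type, for illustration only). */
struct avail_row {
    char name[32];
    unsigned int code;
    int available;   /* 1 if the Avail column reads "Yes" */
    int derived;     /* 1 if the Deriv column reads "Yes" */
};

/* Parses a row such as
   "PAPI_L3_DCM 0x80000004 No No Level 3 data cache misses".
   Returns 1 on success, 0 if the line is not a preset event row.
   The description column is ignored. */
int parse_avail_row(const char *line, struct avail_row *row)
{
    char avail[8], deriv[8];

    /* %s skips leading whitespace, so indented rows are accepted */
    if (sscanf(line, "%31s %x %7s %7s", row->name, &row->code, avail, deriv) != 4)
        return 0;

    /* Reject the header line and anything else that is not a preset name */
    if (strncmp(row->name, "PAPI_", 5) != 0)
        return 0;

    row->available = (avail[0] == 'Y');
    row->derived = (deriv[0] == 'Y');
    return 1;
}
```

Feeding each line of the captured output to `parse_avail_row` and keeping the rows with `available` set gives the presets that can be counted on the node. Other PAPI versions may print the table differently, so the format string may need adjusting.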

### Papi_native_avail

Prints which native events are available on the current CPU.

### Papi_cost

Measures the cost (in cycles) of basic PAPI operations.

### Papi_mem_info

Prints information about the memory architecture of the current CPU.

## PAPI API

PAPI provides two kinds of events:

* **Preset events** are a set of predefined common CPU events, standardized across platforms.
* **Native events** are the set of all events supported by the current hardware. This is a larger set than the preset events. For components other than the CPU, usually only native events are available.

To use PAPI in your application, you need to include the appropriate header file:

* papi.h for C
* f77papi.h for Fortran 77
* f90papi.h for Fortran 90
* fpapi.h for Fortran with preprocessor

The include path is automatically added to $INCLUDE by the papi module.

### High Level API

Please refer to [this description of the High level API](http://icl.cs.utk.edu/projects/papi/wiki/PAPIC:High_Level).

### Low Level API

Please refer to [this description of the Low level API](http://icl.cs.utk.edu/projects/papi/wiki/PAPIC:Low_Level).

### Timers

PAPI provides the most accurate timers the platform can support. See the [PAPI timers documentation](http://icl.cs.utk.edu/projects/papi/wiki/PAPIC:Timers).

### System Information

PAPI can be used to query some system information, such as the CPU name and clock frequency in MHz. See the [PAPI system information documentation](http://icl.cs.utk.edu/projects/papi/wiki/PAPIC:System_Information).

## Example

The following example prints the MFLOPS rate of a naive matrix-matrix multiplication:

```cpp
    #include <stdlib.h>
    #include <stdio.h>
    #include "papi.h"
    #define SIZE 1000

    int main(int argc, char **argv) {
      float matrixa[SIZE][SIZE], matrixb[SIZE][SIZE], mresult[SIZE][SIZE];
      float real_time, proc_time, mflops;
      long long flpins;
      int retval;
      int i, j, k;

      /* Initialize the matrices, addressing each one as a flat array */
      for (i = 0; i < SIZE * SIZE; i++) {
        mresult[0][i] = 0.0;
        matrixa[0][i] = matrixb[0][i] = rand() * (float)1.1;
      }

      /* Set up the PAPI library and begin collecting data from the counters */
      if ((retval = PAPI_flops(&real_time, &proc_time, &flpins, &mflops)) < PAPI_OK)
        printf("Error!");

      /* A naive matrix-matrix multiplication */
      for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
          for (k = 0; k < SIZE; k++)
            mresult[i][j] = mresult[i][j] + matrixa[i][k] * matrixb[k][j];

      /* Collect the data into the variables passed in */
      if ((retval = PAPI_flops(&real_time, &proc_time, &flpins, &mflops)) < PAPI_OK)
        printf("Error!");

      printf("Real_time:\t%f\nProc_time:\t%f\nTotal flpins:\t%lld\nMFLOPS:\t\t%f\n", real_time, proc_time, flpins, mflops);
      PAPI_shutdown();
      return 0;
    }
```

Now compile and run the example:

```bash
    $ gcc matrix.c -o matrix -lpapi
    $ ./matrix
    Real_time: 8.852785
    Proc_time: 8.850000
    Total flpins: 6012390908
    MFLOPS: 679.366211
```
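
The reported MFLOPS value is consistent with dividing the floating point instruction count by the process time and scaling to millions. The sketch below reproduces that arithmetic; `mflops_rate` is a hypothetical helper written for illustration, not a PAPI function, and the formula is inferred from the figures in the output rather than taken from the PAPI sources.

```c
/* Hypothetical helper (not part of PAPI): converts a floating point
   instruction count and a process time in seconds to a rate in MFLOPS.
   The caller must pass a non-zero time. */
double mflops_rate(long long flpins, double proc_time_s)
{
    return (double)flpins / proc_time_s / 1.0e6;
}
```

With the values from the run above, `mflops_rate(6012390908LL, 8.85)` evaluates to roughly 679.37, which agrees with the printed rate to within the rounding of the displayed process time.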

Let's try with optimizations enabled:

```bash
    $ gcc -O3 matrix.c -o matrix -lpapi
    $ ./matrix
    Real_time: 0.000020
    Proc_time: 0.000000
    Total flpins: 6
    MFLOPS: inf
```

Now we see a seemingly strange result: the multiplication took no time and only 6 floating point instructions were issued. This is because the compiler optimizations have completely removed the multiplication loop, as the result is not used anywhere in the program. We can fix this by adding some "dummy" code at the end of the matrix-matrix multiplication routine:

```cpp
    for (i=0; i<SIZE;i++)
     for (j=0; j<SIZE; j++)
       if (mresult[i][j] == -1.0) printf("x");
```

Now the compiler won't remove the multiplication loop. (However, it is still not smart enough to see that the result will never be negative.) Now run the code again:

```bash
    $ gcc -O3 matrix.c -o matrix -lpapi
    $ ./matrix
    Real_time: 8.795956
    Proc_time: 8.790000
    Total flpins: 18700983160
    MFLOPS: 2127.529297
```
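
The dummy comparison loop is one way to keep the result in use. Another common approach is to reduce the result matrix to a single value and print it, so that every element contributes to the visible output. The following is a minimal sketch of that idea, not the method used in the example above; the names `N`, `matmul`, and `checksum` are hypothetical, and a small matrix size is used to keep the illustration short.

```c
#define N 4  /* small illustrative size; the example above uses SIZE 1000 */

/* Naive matrix-matrix multiplication: c = a * b. */
void matmul(float a[N][N], float b[N][N], float c[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            c[i][j] = 0.0f;
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}

/* Reduces the result to one value. When the caller prints this value,
   every element of c affects the program output, so the multiplication
   that produced c can no longer be treated as unused. */
double checksum(float c[N][N])
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += c[i][j];
    return sum;
}
```

In the PAPI example, the equivalent step would be to print the checksum of `mresult` after the second `PAPI_flops` call, so that the reduction itself is not included in the measurement.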

### Intel Xeon Phi

!!! note
    PAPI currently supports only a subset of counters on the Intel Xeon Phi processor compared to Intel Xeon; for example, the floating point operations counter is missing.

To use PAPI in [Intel Xeon Phi](../intel-xeon-phi/) native applications, you need to load a module with the `-mic` suffix, for example `papi/5.3.2-mic`:

```bash
    $ module load papi/5.3.2-mic
```

Then, compile your application in the following way:

```bash
    $ module load intel
    $ icc -mmic -Wl,-rpath,/apps/intel/composer_xe_2013.5.192/compiler/lib/mic matrix-mic.c -o matrix-mic -lpapi -lpfm
```

To execute the application on MIC, you need to manually set LD_LIBRARY_PATH:

```bash
    $ qsub -q qmic -A NONE-0-0 -I
    $ ssh mic0
    $ export LD_LIBRARY_PATH=/apps/tools/papi/5.4.0-mic/lib/
    $ ./matrix-mic
```

Alternatively, you can link PAPI statically (`-static` flag); LD_LIBRARY_PATH then does not need to be set.

You can also execute the PAPI tools on MIC:

```bash
    $ /apps/tools/papi/5.4.0-mic/bin/papi_native_avail
```

To use PAPI in offload mode, you need to provide both host and MIC versions of PAPI:

```bash
    $ module load papi/5.4.0
    $ icc matrix-offload.c -o matrix-offload -offload-option,mic,compiler,"-L$PAPI_HOME-mic/lib -lpapi" -lpapi
```

## References

1. [Main project page](http://icl.cs.utk.edu/papi/)
1. [Wiki](http://icl.cs.utk.edu/projects/papi/wiki/Main_Page)
1. [API Documentation](http://icl.cs.utk.edu/papi/docs/)