Intel Performance Counter Monitor
=================================

Introduction
------------
Intel PCM (Performance Counter Monitor) is a tool for monitoring hardware performance counters on Intel® processors, similar to [PAPI](papi/). Unlike PAPI, PCM supports only Intel hardware, but it can also monitor uncore metrics, such as memory controllers and QuickPath Interconnect links.

Installed version
-----------------
The currently installed version is 2.6. To load the [module](../../environment-and-modules/), issue:

```bash
    $ module load intelpcm
```
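
After loading the module, it can be worth confirming that the tools are reachable from your shell. The following is a minimal sketch: the function name `check_pcm_tools` is hypothetical, and the tool names are simply the ones described on this page, so a given installation may ship a different set.

```bash
# Hypothetical helper: report which of the PCM tools named on this page
# are reachable through PATH in the current shell.
check_pcm_tools() {
  for tool in pcm.x pcm-memory.x pcm-msr.x pcm-pcie.x pcm-power.x; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: not found (is the intelpcm module loaded?)"
    fi
  done
}

check_pcm_tools
```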

Command line tools
------------------
PCM provides a set of tools to monitor the whole system or a single application.

### pcm-memory

Measures the memory bandwidth of your application or of the whole system. Usage:

```bash
    $ pcm-memory.x <delay>|[external_program parameters]
```

Specify either the delay between updates in seconds or an external program to monitor. If you get an error about the PMU being in use, respond "y" and relaunch the program.

Sample output:

```bash
    ---------------------------------------||---------------------------------------
    --             Socket 0              --||--             Socket 1              --
    ---------------------------------------||---------------------------------------
    ---------------------------------------||---------------------------------------
    ---------------------------------------||---------------------------------------
    --   Memory Performance Monitoring   --||--   Memory Performance Monitoring   --
    ---------------------------------------||---------------------------------------
    --  Mem Ch 0: Reads (MB/s):    2.44  --||--  Mem Ch 0: Reads (MB/s):    0.26  --
    --            Writes(MB/s):    2.16  --||--            Writes(MB/s):    0.08  --
    --  Mem Ch 1: Reads (MB/s):    0.35  --||--  Mem Ch 1: Reads (MB/s):    0.78  --
    --            Writes(MB/s):    0.13  --||--            Writes(MB/s):    0.65  --
    --  Mem Ch 2: Reads (MB/s):    0.32  --||--  Mem Ch 2: Reads (MB/s):    0.21  --
    --            Writes(MB/s):    0.12  --||--            Writes(MB/s):    0.07  --
    --  Mem Ch 3: Reads (MB/s):    0.36  --||--  Mem Ch 3: Reads (MB/s):    0.20  --
    --            Writes(MB/s):    0.13  --||--            Writes(MB/s):    0.07  --
    -- NODE0 Mem Read (MB/s):      3.47  --||-- NODE1 Mem Read (MB/s):      1.45  --
    -- NODE0 Mem Write (MB/s):     2.55  --||-- NODE1 Mem Write (MB/s):     0.88  --
    -- NODE0 P. Write (T/s) :     31506  --||-- NODE1 P. Write (T/s):       9099  --
    -- NODE0 Memory (MB/s):        6.02  --||-- NODE1 Memory (MB/s):        2.33  --
    ---------------------------------------||---------------------------------------
    --                   System Read Throughput(MB/s):      4.93                  --
    --                  System Write Throughput(MB/s):      3.43                  --
    --                 System Memory Throughput(MB/s):      8.35                  --
    ---------------------------------------||--------------------------------------- 
```
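
When the output is captured to a file or sent through a pipe, the system-wide totals can be pulled out with a short filter. This is a sketch only: the helper name `pcm_memory_summary` is hypothetical, and the parsing assumes the summary lines look exactly like those in the sample above, which may differ between PCM versions.

```bash
# Hypothetical helper: read pcm-memory output on stdin and print the three
# system-wide summary values as "<kind> <MB/s>" pairs.
pcm_memory_summary() {
  awk '$2 == "System" && $4 == "Throughput(MB/s):" { print tolower($3), $5 }'
}

# Demonstration on the summary lines from the sample output above.
pcm_memory_summary <<'EOF'
--                   System Read Throughput(MB/s):      4.93                  --
--                  System Write Throughput(MB/s):      3.43                  --
--                 System Memory Throughput(MB/s):      8.35                  --
EOF
```

On the three lines above this prints `read 4.93`, `write 3.43` and `memory 8.35`, one pair per line. Per-channel and per-node lines are ignored because their second field is not `System`.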

### pcm-msr

The pcm-msr.x command can be used to read and write model-specific registers (MSRs) of the CPU.

### pcm-numa

The NUMA monitoring utility does not work on Anselm.

### pcm-pcie

Can be used to monitor PCI Express bandwidth. Usage: pcm-pcie.x &lt;delay&gt;

### pcm-power

Displays energy usage and thermal headroom for CPU and DRAM sockets. Usage: pcm-power.x &lt;delay&gt; | &lt;external program&gt;

### pcm

This command provides an overview of performance counters and memory usage. Usage: pcm.x &lt;delay&gt; | &lt;external program&gt;

Sample output:

```bash
    $ pcm.x ./matrix

     Intel(r) Performance Counter Monitor V2.6 (2013-11-04 13:43:31 +0100 ID=db05e43)

     Copyright (c) 2009-2013 Intel Corporation

    Number of physical cores: 16
    Number of logical cores: 16
    Threads (logical cores) per physical core: 1
    Num sockets: 2
    Core PMU (perfmon) version: 3
    Number of core PMU generic (programmable) counters: 8
    Width of generic (programmable) counters: 48 bits
    Number of core PMU fixed counters: 3
    Width of fixed counters: 48 bits
    Nominal core frequency: 2400000000 Hz
    Package thermal spec power: 115 Watt; Package minimum power: 51 Watt; Package maximum power: 180 Watt; 
    Socket 0: 1 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
    Socket 1: 1 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
    Number of PCM instances: 2
    Max QPI link speed: 16.0 GBytes/second (8.0 GT/second)

    Detected Intel(R) Xeon(R) CPU E5-2665 0 @ 2.40GHz "Intel(r) microarchitecture codename Sandy Bridge-EP/Jaketown"

     Executing "./matrix" command:

    Exit code: 0

     EXEC  : instructions per nominal CPU cycle
     IPC   : instructions per CPU cycle
     FREQ  : relation to nominal CPU frequency='unhalted clock ticks'/'invariant timer ticks' (includes Intel Turbo Boost)
     AFREQ : relation to nominal CPU frequency while in active state (not in power-saving C state)='unhalted clock ticks'/'invariant timer ticks while in C0-state'  (includes Intel Turbo Boost)
     L3MISS: L3 cache misses
     L2MISS: L2 cache misses (including other core's L2 cache *hits*)
     L3HIT : L3 cache hit ratio (0.00-1.00)
     L2HIT : L2 cache hit ratio (0.00-1.00)
     L3CLK : ratio of CPU cycles lost due to L3 cache misses (0.00-1.00), in some cases could be >1.0 due to a higher memory latency
     L2CLK : ratio of CPU cycles lost due to missing L2 cache but still hitting L3 cache (0.00-1.00)
     READ  : bytes read from memory controller (in GBytes)
     WRITE : bytes written to memory controller (in GBytes)
     TEMP  : Temperature reading in 1 degree Celsius relative to the TjMax temperature (thermal headroom): 0 corresponds to the max temperature

     Core (SKT) | EXEC | IPC  | FREQ  | AFREQ | L3MISS | L2MISS | L3HIT | L2HIT | L3CLK | L2CLK  | READ  | WRITE | TEMP

       0    0     0.00   0.64   0.01    0.80    5592       11 K    0.49    0.13    0.32    0.06     N/A     N/A     67
       1    0     0.00   0.18   0.00    0.69    3086     5552      0.44    0.07    0.48    0.08     N/A     N/A     68
       2    0     0.00   0.23   0.00    0.81     300      562      0.47    0.06    0.43    0.08     N/A     N/A     67
       3    0     0.00   0.21   0.00    0.99     437      862      0.49    0.06    0.44    0.09     N/A     N/A     73
       4    0     0.00   0.23   0.00    0.93     293      559      0.48    0.07    0.42    0.09     N/A     N/A     73
       5    0     0.00   0.21   0.00    1.00     423      849      0.50    0.06    0.43    0.10     N/A     N/A     69
       6    0     0.00   0.23   0.00    0.94     285      558      0.49    0.06    0.41    0.09     N/A     N/A     71
       7    0     0.00   0.18   0.00    0.81     674     1130      0.40    0.05    0.53    0.08     N/A     N/A     65
       8    1     0.00   0.47   0.01    1.26    6371       13 K    0.51    0.35    0.31    0.07     N/A     N/A     64
       9    1     2.30   1.80   1.28    1.29     179 K     15 M    0.99    0.59    0.04    0.71     N/A     N/A     60
      10    1     0.00   0.22   0.00    1.26     315      570      0.45    0.06    0.43    0.08     N/A     N/A     67
      11    1     0.00   0.23   0.00    0.74     321      579      0.45    0.05    0.45    0.07     N/A     N/A     66
      12    1     0.00   0.22   0.00    1.25     305      570      0.46    0.05    0.42    0.07     N/A     N/A     68
      13    1     0.00   0.22   0.00    1.26     336      581      0.42    0.04    0.44    0.06     N/A     N/A     69
      14    1     0.00   0.22   0.00    1.25     314      565      0.44    0.06    0.43    0.07     N/A     N/A     69
      15    1     0.00   0.29   0.00    1.19    2815     6926      0.59    0.39    0.29    0.08     N/A     N/A     69
    -------------------------------------------------------------------------------------------------------------------
     SKT    0     0.00   0.46   0.00    0.79      11 K     21 K    0.47    0.10    0.38    0.07    0.00    0.00     65
     SKT    1     0.29   1.79   0.16    1.29     190 K     15 M    0.99    0.59    0.05    0.70    0.01    0.01     61
    -------------------------------------------------------------------------------------------------------------------
     TOTAL  *     0.14   1.78   0.08    1.28     201 K     15 M    0.99    0.59    0.05    0.70    0.01    0.01     N/A

     Instructions retired: 1345 M ; Active cycles:  755 M ; Time (TSC):  582 Mticks ; C0 (active,non-halted) core residency: 6.30 %

     C1 core residency: 0.14 %; C3 core residency: 0.20 %; C6 core residency: 0.00 %; C7 core residency: 93.36 %;
     C2 package residency: 48.81 %; C3 package residency: 0.00 %; C6 package residency: 0.00 %; C7 package residency: 0.00 %;

     PHYSICAL CORE IPC                 : 1.78 => corresponds to 44.50 % utilization for cores in active state
     Instructions per nominal CPU cycle: 0.14 => corresponds to 3.60 % core utilization over time interval

    Intel(r) QPI data traffic estimation in bytes (data traffic coming to CPU/socket through QPI links):

                   QPI0     QPI1    |  QPI0   QPI1
    ----------------------------------------------------------------------------------------------
     SKT    0        0        0     |    0%     0%
     SKT    1        0        0     |    0%     0%
    ----------------------------------------------------------------------------------------------
    Total QPI incoming data traffic:    0       QPI data traffic/Memory controller traffic: 0.00

    Intel(r) QPI traffic estimation in bytes (data and non-data traffic outgoing from CPU/socket through QPI links):

                   QPI0     QPI1    |  QPI0   QPI1
    ----------------------------------------------------------------------------------------------
     SKT    0        0        0     |    0%     0%
     SKT    1        0        0     |    0%     0%
    ----------------------------------------------------------------------------------------------
    Total QPI outgoing data and non-data traffic:    0

    ----------------------------------------------------------------------------------------------
     SKT    0 package consumed 4.06 Joules
     SKT    1 package consumed 9.40 Joules
    ----------------------------------------------------------------------------------------------
     TOTAL:                    13.46 Joules

    ----------------------------------------------------------------------------------------------
     SKT    0 DIMMs consumed 4.18 Joules
     SKT    1 DIMMs consumed 4.28 Joules
    ----------------------------------------------------------------------------------------------
     TOTAL:                  8.47 Joules
    Cleaning up
```
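
The summary figures near the end of the report can be reproduced from the raw totals, which helps when reading them. The sketch below redoes the arithmetic with the rounded values printed above, so the last digit can differ slightly from what PCM reports. The helper names are hypothetical, and the peak of 4 instructions per cycle is an assumption inferred from the printed percentages (1.78 corresponds to 44.50 %), not a value stated in the output.

```bash
# Hypothetical helpers, using awk for floating-point arithmetic.

# ratio A B: print A / B with two decimals.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# percent_of_peak VALUE PEAK: print VALUE as a percentage of PEAK.
percent_of_peak() {
  awk -v v="$1" -v p="$2" 'BEGIN { printf "%.2f\n", v / p * 100 }'
}

# IPC: instructions retired / active cycles (both in millions).
ratio 1345 755            # prints 1.78

# EXEC: the TOTAL row is consistent with instructions retired divided by
# the TSC ticks multiplied by the 16 cores.
ratio 1345 $((582 * 16))  # prints 0.14

# Utilization of cores in active state, assuming a peak of 4 IPC.
percent_of_peak 1.78 4    # prints 44.50
```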

### pcm-sensor

Can be used as a sensor for ksysguard GUI, which is currently not installed on Anselm.

API
---
In a similar fashion to PAPI, PCM provides a C++ API to access the performance counters from within your application. Refer to the [doxygen documentation](http://intel-pcm-api-documentation.github.io/classPCM.html)![external](../../../img/external.png) for details of the API.

>Due to security limitations, using PCM API to monitor your applications is currently not possible on Anselm. (The application must be run as root user)

Sample program using the API:

```cpp
    #include <stdlib.h>
    #include <stdio.h>
    #include <iostream>
    #include "cpucounters.h"

    #define SIZE 1000

    using namespace std;

    /* Static storage: three 1000x1000 float matrices (about 12 MB)
       can exceed the default stack size limit if declared inside main() */
    static float matrixa[SIZE][SIZE], matrixb[SIZE][SIZE], mresult[SIZE][SIZE];

    int main(int argc, char **argv) {
      int i,j,k;

      /* Obtain the PCM instance and program the counters */
      PCM * m = PCM::getInstance();

      if (m->program() != PCM::Success) return 1;

      /* Snapshot of the counters before the measured region */
      SystemCounterState before_sstate = getSystemCounterState();

      /* Initialize the Matrix arrays */
      for (i=0; i<SIZE; i++)
        for (j=0; j<SIZE; j++) {
          mresult[i][j] = 0.0;
          matrixa[i][j] = matrixb[i][j] = rand()*(float)1.1; }

      /* A naive Matrix-Matrix multiplication */
      for (i=0;i<SIZE;i++)
        for(j=0;j<SIZE;j++)
          for(k=0;k<SIZE;k++)
            mresult[i][j]=mresult[i][j] + matrixa[i][k]*matrixb[k][j];

      /* Snapshot of the counters after the measured region */
      SystemCounterState after_sstate = getSystemCounterState();

      cout << "Instructions per clock:" << getIPC(before_sstate,after_sstate) << endl
      << "L3 cache hit ratio:" << getL3CacheHitRatio(before_sstate,after_sstate) << endl
      << "Bytes read:" << getBytesReadFromMC(before_sstate,after_sstate) << endl;

      /* Read the result so the compiler cannot optimize the multiplication away */
      for (i=0; i<SIZE;i++)
        for (j=0; j<SIZE; j++)
           if (mresult[i][j] == -1) printf("x");

      return 0;
    }
```

Compile it with:

```bash
    $ icc matrix.cpp -o matrix -lpthread -lpcm
```

Sample output:

```bash
    $ ./matrix
    Number of physical cores: 16
    Number of logical cores: 16
    Threads (logical cores) per physical core: 1
    Num sockets: 2
    Core PMU (perfmon) version: 3
    Number of core PMU generic (programmable) counters: 8
    Width of generic (programmable) counters: 48 bits
    Number of core PMU fixed counters: 3
    Width of fixed counters: 48 bits
    Nominal core frequency: 2400000000 Hz
    Package thermal spec power: 115 Watt; Package minimum power: 51 Watt; Package maximum power: 180 Watt; 
    Socket 0: 1 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
    Socket 1: 1 memory controllers detected with total number of 4 channels. 2 QPI ports detected.
    Number of PCM instances: 2
    Max QPI link speed: 16.0 GBytes/second (8.0 GT/second)
    Instructions per clock:1.7
    L3 cache hit ratio:1.0
    Bytes read:12513408
```
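
The byte count is easier to compare with the pcm-memory figures once it is converted to megabytes. A small sketch with a hypothetical helper name; the divisor assumes binary megabytes (1 MiB = 1048576 bytes), which may not be the convention PCM itself uses for its MB/s values.

```bash
# Hypothetical helper: convert a byte count to MiB with two decimals.
bytes_to_mib() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1048576 }'
}

bytes_to_mib 12513408   # prints 11.93
```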

References
----------
1.  <https://software.intel.com/en-us/articles/intel-performance-counter-monitor-a-better-way-to-measure-cpu-utilization>![external](../../../img/external.png)
2.  <https://software.intel.com/sites/default/files/m/3/2/2/xeon-e5-2600-uncore-guide.pdf>![external](../../../img/external.png) Intel® Xeon® Processor E5-2600 Product Family Uncore Performance Monitoring Guide.
3.  <http://intel-pcm-api-documentation.github.io/classPCM.html>![external](../../../img/external.png) API Documentation