# MERIC #

MERIC is a lightweight C/C++ library with a Fortran interface that detects the dynamic behavior of HPC applications with the goal of reducing their energy consumption, applying the [READEX](https://www.readex.eu/) approach.

The library was originally developed for x86 systems (tested on Haswell, Broadwell and Knights Landing) and additionally supports the OpenPOWER8+ CINECA D.A.V.I.D.E. system and selected BSC ARM systems.

--------------------------------------------------------------------------------
#      README Content                                                          #
--------------------------------------------------------------------------------
<!--> master branch links <!-->
 1. [System parameters tuning](https://code.it4i.cz/vys0053/meric#1-system-parameters-tuning)
 2. [Supported energy measurement systems](https://code.it4i.cz/vys0053/meric#2-supported-energy-measurement-systems)
 3. [MERIC and TIMEPROF interface and Shared Score-P/MERIC API](https://code.it4i.cz/vys0053/meric#3-meric-and-timeprof-interface-and-shared-score-pmeric-api)
 4. [MERIC binary instrumentation](https://code.it4i.cz/vys0053/meric#4-meric-binary-instrumentation)
 5. [Compilation](https://code.it4i.cz/vys0053/meric#5-compilation)
 6. [MERIC input parameters](https://code.it4i.cz/vys0053/meric#6-meric-input-parameters)
 7. [Content of test folder and example application run](https://code.it4i.cz/vys0053/meric#7-content-of-test-folder-and-example-application-run)
 8. [Code dynamism investigation](https://code.it4i.cz/vys0053/meric#8-code-dynamism-investigation)
 9. [MERIC with a Fortran code](https://code.it4i.cz/vys0053/meric#9-meric-with-a-fortran-code)
10. [Tool for static tuning](https://code.it4i.cz/vys0053/meric#10-tool-for-static-tuning)
11. [Acknowledgement](https://code.it4i.cz/vys0053/meric#11-acknowledgement)

<!--> dev branch links 
 1. [System parameters tuning](https://code.it4i.cz/vys0053/meric/tree/dev#1-system-parameters-tuning)
 2. [Supported energy measurement systems](https://code.it4i.cz/vys0053/meric/tree/dev#2-supported-energy-measurement-systems)
 3. [MERIC and TIMEPROF interface and Shared Score-P/MERIC API](https://code.it4i.cz/vys0053/meric/tree/dev#3-meric-and-timeprof-interface-and-shared-score-pmeric-api)
 4. [MERIC binary instrumentation](https://code.it4i.cz/vys0053/meric/tree/dev#4-meric-binary-instrumentation)
 5. [Compilation](https://code.it4i.cz/vys0053/meric/tree/dev#5-compilation)
 6. [MERIC input parameters](https://code.it4i.cz/vys0053/meric/tree/dev#6-meric-input-parameters)
 7. [Content of test folder and example application run](https://code.it4i.cz/vys0053/meric/tree/dev#7-content-of-test-folder-and-example-application-run)
 8. [Code dynamism investigation](https://code.it4i.cz/vys0053/meric/tree/dev#8-code-dynamism-investigation)
 9. [MERIC with a Fortran code](https://code.it4i.cz/vys0053/meric/tree/dev#9-meric-with-a-fortran-code)
10. [Tool for static tuning](https://code.it4i.cz/vys0053/meric/tree/dev#10-tool-for-static-tuning)
11. [Acknowledgement](https://code.it4i.cz/vys0053/meric/tree/dev#11-acknowledgement)
<!-->

--------------------------------------------------------------------------------
#     1] System parameters tuning                                              #
--------------------------------------------------------------------------------
MERIC can set selected system parameters for the whole application run or for each instrumented part of the analyzed application. The following system parameters can be tuned using MERIC. The default and valid values of these parameters can be read using the `systemInfo` application (part of the MERIC repository, located in the `tools/` directory).


### CPU core frequency ###
MERIC provides several ways to change the CPU core frequency:
 *  Under Intel P-State using [libmsr](https://github.com/LLNL/libmsr)+[msr-safe](https://github.com/LLNL/msr-safe); requires write access to `/sys/devices/system/cpu/intel_pstate/no_turbo` and `/sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq`.
 *  Using [x86_adapt](https://github.com/tud-zih-energy/x86_adapt).
 *  Using cpufreq/[cpupower](https://github.com/torvalds/linux/tree/master/tools/power/cpupower).
 *  Using a direct write to `/sys/devices/system/cpu/cpu<ID>/cpufreq/scaling_setspeed`.


### CPU uncore frequency ###
Intel uncore frequency refers to the frequency of subsystems in the physical processor package that are shared by multiple processor cores, e.g., the L3 cache or the on-chip ring interconnect.

MERIC requires [libmsr](https://github.com/LLNL/libmsr)+[msr-safe](https://github.com/LLNL/msr-safe) or [x86_adapt](https://github.com/tud-zih-energy/x86_adapt) to be able to change the system uncore frequency.

When using msr-safe to tune the uncore frequency, the `MSR_UNCORE_RATIO_LIMIT` register may not be available. In that case add `0x620 0x7F7F` to the system's whitelist.


### Intel RAPL power capping ###
Intel processors with the RAPL energy measurement system make it possible to limit the average power consumption of the PKG or DRAM domain over a specified time window.

MERIC requires [libmsr](https://github.com/LLNL/libmsr)+[msr-safe](https://github.com/LLNL/msr-safe) or [x86_adapt](https://github.com/tud-zih-energy/x86_adapt) to be able to change the power limits. Currently only the PKG power cap is supported.

Please note that the RAPL power capping system encodes the window size. For PKG the window is `PKGTimeLimit = 2^Y * (1.0 + Z/4.0) * TimeUnit`, where Y=<0, 31>, Z=<0, 3> and TimeUnit is a processor-specific fraction of a second. For DRAM the window is `DRAMTimeLimit = 2^Y * F * TimeUnit`, where F={1.0, 1.1, 1.2, 1.3} and Y=<0, 31>. As a result, it might be impossible to set exactly the window size you have specified.
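
The effect of this encoding can be illustrated with a short C sketch that searches all (Y, Z) pairs of the PKG formula for the representable window closest to a requested one. The names `rapl_window_t` and `rapl_nearest_pkg_window` are hypothetical and not part of MERIC, and `time_unit` is assumed to have been read from the processor beforehand.

```c
/* One representable PKG time window: 2^Y * (1.0 + Z/4.0) * TimeUnit. */
typedef struct {
    unsigned y;      /* exponent, 0..31 */
    unsigned z;      /* fraction selector, 0..3 */
    double window;   /* resulting window, in the same unit as time_unit */
} rapl_window_t;

/* Hypothetical helper (not part of MERIC): exhaustive search over all
 * 32 x 4 encodings for the window closest to the requested one. */
static rapl_window_t rapl_nearest_pkg_window(double requested, double time_unit)
{
    rapl_window_t best = { 0u, 0u, time_unit };   /* Y=0, Z=0 */
    double best_err = requested > best.window ? requested - best.window
                                              : best.window - requested;

    for (unsigned y = 0; y <= 31; y++) {
        for (unsigned z = 0; z <= 3; z++) {
            double w = (double)(1ULL << y) * (1.0 + z / 4.0) * time_unit;
            double err = requested > w ? requested - w : w - requested;
            if (err < best_err) {
                best.y = y;
                best.z = z;
                best.window = w;
                best_err = err;
            }
        }
    }
    return best;
}
```

For example, with a time unit of 1 and a requested window of 11.5, the closest representable value is 12 (Y=3, Z=2), so the applied window differs from the requested one.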


### Number of active OMP threads ###
The number of active OpenMP threads is set using the OpenMP API.


### ARM Jetson memory frequency ###
To simplify MERIC's interface, the memory frequency is set the same way as the Intel CPU uncore frequency, using the `MERIC_UNCORE_FREQUENCY` environment variable.


--------------------------------------------------------------------------------
#     2] Supported energy measurement systems                                  #
--------------------------------------------------------------------------------
The energy measurement system is selected using the `MERIC_MODE` environment variable.

### Intel RAPL ###
Starting with the Sandy Bridge microarchitecture, Intel processors provide a software-based energy meter for the PKG (CPUs) and DRAM with a sampling frequency of 1 kHz. The system does not cover the energy consumption of the remaining node components.

Because RAPL exposes counters, an overflow may happen during the application run. MERIC implements overflow detection and recovery; however, it cannot fix the values if two or more overflows happen during a region that does not have any nested region. Please keep this in mind when choosing the region size.
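
The single-overflow case can be recovered with modular arithmetic, as the following C sketch shows. It is an illustration only and not MERIC's implementation; the helper name `counter_delta` is hypothetical, and the counter width is passed in as an assumption rather than taken from this document.

```c
#include <stdint.h>

/* Hypothetical helper (not part of MERIC): difference between two readings
 * of a counter that is `bits` wide and wraps to zero on overflow.
 * The result is correct as long as the true increase is below 2^bits. */
static uint64_t counter_delta(uint64_t start, uint64_t end, unsigned bits)
{
    uint64_t mask = (bits >= 64) ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
    return (end - start) & mask;   /* unsigned subtraction wraps, mask trims to width */
}
```

Two readings alone cannot reveal how many times the counter wrapped in between, which is consistent with the limitation described above for long regions without nested regions.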

Based on these counters, Intel processors can limit their energy consumption using the RAPL power capping system. The user may specify the average PKG or DRAM power consumption for a specific time window, and the system itself will reduce its resources to fulfill this constraint.

MERIC requires [libmsr](https://github.com/LLNL/libmsr)+[msr-safe](https://github.com/LLNL/msr-safe) or [x86_adapt](https://github.com/tud-zih-energy/x86_adapt) to access the RAPL counters. In the detailed mode MERIC stores values for each CPU and for each DRAM; otherwise it aggregates the values.


### HDEEM ###
High Definition Energy Efficiency Monitoring (HDEEM) is Bull's stand-alone energy measurement system located on the node itself, with a power sampling frequency of 1 kHz. HDEEM stores power samples in its internal memory. MERIC can read the energy consumption either when a region starts and ends, which incurs a significant overhead of the HDEEM system, or when the application run is over, without any impact on the application run.


### DiG ###
Dwarf in Giant (DiG) is a stand-alone hardware power sampling system. It stores the samples, which can be accessed through the [EXAMON](http://projects.eees.dei.unibo.it/monitoring/wordpress/) framework.

To enable the energy measurement provided by the DiG system in MERIC, the [REST-client library](https://github.com/mrtazz/restclient-cpp) must be available on the target system. MERIC and the tuned applications must be compiled with the library.


### BSC ARM Jetson ###
The MERIC repository contains a Python power measurement script that runs in the background on the node and affects the CPU. The script stores the power samples into a file, and MERIC evaluates the energy consumption of the instrumented regions when the application run is over (`MERIC_CONTINUAL=1` must be exported). A rate of 10 samples per second was selected in the measurement script as a compromise; this sampling rate takes ~2% of the CPU load. The rate can be changed in `tools/getJTX1measurements.py`.


### BSC ARM ThunderX ###
BSC's ThunderX energy measurement system does not affect the CPUs; however, it measures the energy consumed by all available nodes (one must allocate all four nodes), and its sampling frequency is approximately 4 samples per second. To run MERIC on the ThunderX, export `MERIC_CONTINUAL=1`.

--------------------------------------------------------------------------------
#     3] MERIC and TIMEPROF interface and Shared Score-P/MERIC API             #
--------------------------------------------------------------------------------
## MERIC interface ##
If you want to use MERIC with a parallel application, keep in mind that all processes and all running threads must call every inserted MERIC function; otherwise the MERIC behavior is undefined. It is not possible to set the runtime environment for each process separately, because MERIC makes environment changes that affect the whole node (or socket). To guarantee that the selected settings are applied for the selected region, each measurement start and stop begins with an MPI and OpenMP barrier. The MERIC interface is defined in `include/meric.h`.

 *	`void MERIC_Init()`

	Initialization of the library. Insert directly after MPI_Init().
	MERIC automatically starts a region named after the analyzed application binary with the suffix "_static".

 *	`void MERIC_Close()`

	Finalization of the library run to store the measurement results. Insert directly before MPI_Finalize().
	Ends the region that started at the MERIC_Init() time.

 *	`void MERIC_MeasureStart(const char * regionName)`

	Starts the measurement of a region.
	Please do not use **#** and **@** in region names.

 *	`double MERIC_MeasureStop()`

	Ends the measurement of the last started region.
	Returns the runtime of the stopped region in seconds. Only a single MPI process per node returns the runtime; the others return 0.0.

 *	`double MERIC_MeasureStopStart(const char * regionName)`

	Ends the measurement of the last started region and starts a new region.
	Skips switching the environment to the configuration of the enclosing region between the two regions.
	Returns the runtime of the stopped region in seconds. Only a single MPI process per node returns the runtime; the others return 0.0.

 *	`void MERIC_CaptureScope(const char * regionName)`

	C++ and C function (no support for Fortran) that starts a measurement which is stopped automatically at the end of the scope. Useful for capturing a function that has several return statements.
	This functionality is based on the [RAII](https://en.cppreference.com/w/cpp/language/raii) technique. If it is used to instrument a C application, the application may require compilation with the `-fno-exceptions` or `-lstdc++` flag.

 *	`void MERIC_IgnoreStart()`

	From this point on, MERIC does not store the resource consumption of the following regions, but the requested settings of the nested regions are still applied.
	Ignore sections cannot be nested.

 *	`void MERIC_IgnoreStop()`

	Cancels the ignore section of the code.
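
The calls listed above can be combined as in the following C sketch. It is a minimal illustration under stated assumptions: the region names and the function `run_instrumented` are made up, the macro `HAVE_MERIC_H` is hypothetical, and the stand-in stubs exist only so that the sketch compiles without the library. MPI calls are omitted; in an MPI code `MERIC_Init()` and `MERIC_Close()` belong directly after `MPI_Init()` and directly before `MPI_Finalize()`.

```c
#ifdef HAVE_MERIC_H   /* hypothetical macro: define it when building against MERIC */
#include "meric.h"
#else
/* Stand-in stubs with the signatures listed above, so the sketch compiles alone. */
static void   MERIC_Init(void) {}
static void   MERIC_Close(void) {}
static void   MERIC_MeasureStart(const char *regionName) { (void)regionName; }
static double MERIC_MeasureStop(void) { return 0.0; }
#endif

/* Runs `steps` iterations; every iteration contains two sibling regions
 * nested in the region "loop". Returns the number of regions started. */
static int run_instrumented(int steps)
{
    int started = 0;

    MERIC_Init();
    MERIC_MeasureStart("loop");
    started++;
    for (int i = 0; i < steps; i++) {
        MERIC_MeasureStart("assemble");   /* always at the same nesting level */
        started++;
        /* ... application work ... */
        MERIC_MeasureStop();

        MERIC_MeasureStart("solve");
        started++;
        /* ... application work ... */
        MERIC_MeasureStop();
    }
    MERIC_MeasureStop();                  /* closes "loop" */
    MERIC_Close();
    return started;
}
```

Every start has a matching stop, and each region name is always used at the same nesting level, which keeps the region call tree simple.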

If you want to insert MERIC regions into code you do not know, first insert MPI and OpenMP barriers instead of the MERIC probes and make sure that the code still works. Then replace the barriers with `MERIC_MeasureStart`/`MERIC_MeasureStop`.

There is one more restriction on placing probes into the code. The current version of MERIC does not support recursively nested regions, nor regions with at least three starts where the first and the third calls are at the same nesting level but the second call is at a higher nesting level. This case is shown in the following example:

```
	MERIC_MeasureStart("X");	// region X wraps the others; it is not necessary in this example
		MERIC_MeasureStart("A");	// first call of region A
		MERIC_MeasureStop();

		MERIC_MeasureStart("B");
			MERIC_MeasureStart("A");	// region A called at a higher nesting level
			MERIC_MeasureStop();
		MERIC_MeasureStop();

		MERIC_MeasureStart("A");	// next call of region A at the same nesting level as the first one
			MERIC_MeasureStart("C");	// the problem is here, when region A has another nested region
			MERIC_MeasureStop();	// region C will cause a defect in the call tree
		MERIC_MeasureStop();
	MERIC_MeasureStop();
```

## TIMEPROF interface ##
The time measurement provided by TIMEPROF is performed by the master thread of the process with rank 0 in `MPI_COMM_WORLD`. The interface is defined in `include/timeprof.h`.

 *	`void TIMEPROF_regionStart(const char * regionName)`

	Starts the time measurement of the region called `regionName`.

 *	`double TIMEPROF_regionStop(const char * regionName)`
 
	Ends the time measurement and returns the region duration in seconds.

 *	`void TIMEPROF_evaluate()`

	Call this function at the end of the application run to evaluate the measurements.
	It produces a list of the regions with a duration longer than the time threshold (environment variable `TIMEPROF_TIME`, in ms) and stores it to the output file (environment variable `TIMEPROF_OUTPUT`).
	If no threshold is provided, it prints the complete list of measured functions with their minimum runtime.
	If no output file is specified, the list of functions is printed to stdout.

 *	`void TIMEPROF_captureScope(const char * regionName)`

	[RAII](https://en.cppreference.com/w/cpp/language/raii) time measurement of the scope in which it is called.
	If you want to use this function in a C application, the application may require compilation with the `-fno-exceptions` or `-lstdc++` flag.

 *	`double TIMEPROF_getLastRegionDuration()`

	Since the scope time measurement does not return the measured time, the time can be obtained using this function.


## Shared MERIC/Score-P interface ##
The C header file `readex.h` and the Fortran include file `readex.inc` are located in the include folder. Instead of the MERIC API presented above, you may use the following functions, which provide shared instrumentation for MERIC and Score-P. To select the library used for measurement, compile your code with `-DUSE_MERIC`, with `-DUSE_SCOREP` for compiler-annotated code (phase region only), or with `-DUSE_SCOREP_MANUAL` for manually annotated code. It is not possible to use MERIC and Score-P simultaneously. The example codes `test.cpp` and `fort_test.f90` use this interface.

TIMEPROF is currently not supported by the `readex.h` header file.

The parameters have the same data types as in the Score-P functions: `struct SCOREP_User_Region* handle`, `const char* name`, `uint32_t type`.

| Shared API                                | MERIC function           | Score-P function                               |
| ----------------------------------------- | ------------------------ | ---------------------------------------------- |
| READEX_INIT()                             | MERIC_Init()             |                                                |
| READEX_CLOSE()                            | MERIC_Close()            |                                                |
| READEX_PHASE_REGION_DEFINE(handle)        |                          | SCOREP_USER_REGION_DEFINE(handle)              |
| READEX_SIGNIFICANT_REGION_DEFINE(handle)* |                          | SCOREP_USER_REGION_DEFINE(handle)              |
| READEX_REGION_START(handle, name, type)   | MERIC_MeasureStart(name) | SCOREP_USER_REGION_BEGIN(handle, name, type)   |
| READEX_REGION_STOP(handle)                | MERIC_MeasureStop()      | SCOREP_USER_REGION_END(handle)                 |
| READEX_REGION_STOP_START(stop_handle, start_handle, start_name, start_type) | MERIC_MeasureStopStart (start_name) | SCOREP_USER_REGION_END(stop_handle) SCOREP_USER_REGION_BEGIN (start_handle, start_name, start_type) |
| READEX_PHASE_START(handle, name, type)    | MERIC_MeasureStart(name) | SCOREP_USER_OA_PHASE_BEGIN(handle, name, type) |
| READEX_PHASE_STOP(handle)                 | MERIC_MeasureStop()      | SCOREP_USER_OA_PHASE_END(handle)               |
| READEX_IGNORE_START()                     | MERIC_IgnoreStart()      | SCOREP_RECORDING_OFF()                         |
| READEX_IGNORE_STOP()                      | MERIC_IgnoreStop()       | SCOREP_RECORDING_ON()                          |

\* Defines a region handle for every region except the phase region.

The READEX interface does not contain all Score-P API functions, because MERIC does not support them. For the remaining functionality you may use the usual Score-P API; these functions are ignored if the code is compiled without Score-P.

--------------------------------------------------------------------------------
#     4] MERIC binary instrumentation                                          #
--------------------------------------------------------------------------------
Manual application instrumentation is straightforward; however, it requires at least basic knowledge of the target application, access to the source files and time to instrument the code. All of this can be avoided by using static binary instrumentation (SBI). The MERIC repository contains several tools for binary analysis in the `tools/DBI` directory; all of them are compiled separately from MERIC, using the Makefile located in the same directory. For SBI there are two tools based on the [Dyninst library](https://dyninst.org/): `dinst_profile.cpp` (`make dinst`) and `dinst_instrument.cpp` (`make sbi`). The `dinst_profile` tool lists all functions that are defined in the binary (shared libraries are not taken into account) and reports whether each function can be instrumented.

The second tool, `dinst_instrument`, instruments the binary with MERIC or TIMEPROF based on a list of functions that should be instrumented. In the case of TIMEPROF the list of functions does not have to be provided; all the instrumentable functions are then selected. Besides the instrumentation, the tool also adds all the necessary dependencies, so it is not necessary to recompile the application and link it with them.
In the case of MPI applications, not only the application is instrumented but also the MPI library itself. For this purpose the shared MPI library used by the application must be provided. Please specify the full path to the MPI library to avoid any mistake. Use the `ldd` command to find out which MPI library is used by the analyzed application. When running an instrumented MPI application, `LD_PRELOAD` must be set to replace the default MPI library with the instrumented one.

See the tools' help messages for more information.

## Dyninst installation ##
The tools have been developed using Dyninst-10.0.0. Any newer version of the library should work too; however, the following information might be out of date with newer Dyninst versions. The Dyninst installation is described in its [repository README](https://github.com/dyninst/dyninst) and is quite simple, but `make install` may fail due to missing sudo rights. For that reason the tools' compilation paths point into the Dyninst build directory; please export the following environment variables before you compile and use the Dyninst tools.
```
export DYNINST_HOME=/PATH/TO/DYNINST/DIRECTORY
export DYNINSTAPI_RT_LIB=$DYNINST_HOME/build/dyninstAPI_RT/libdyninstAPI_RT.so
export LD_LIBRARY_PATH+=:$DYNINST_HOME/build/dyninstAPI_RT/
```

### Known issues ###
 * An instrumented function that consists of a return statement only may lose its return value (might be related to the previous issue). You may increase the `instlimit` TIMEPROF parameter to skip such functions.
 * If the instrumented application is compiled with the same list of modules as the Dyninst tools, the new binary is sometimes corrupted. This can be checked using the `ldd` command. If it happens, we suggest using a different set of modules for the application or for the Dyninst tools.

--------------------------------------------------------------------------------
#    5] Compilation                                                            #
--------------------------------------------------------------------------------
MERIC is compiled using the [waf build system](https://waf.io/). Since waf is not well known, a Makefile is provided in the repository root folder. Please modify the library and include paths according to your system paths:

### MERIC used libraries ###
* mandatory  - rt (high precision time measurement)
* optionally - OpenMP (used by default)
* optionally - MPI
* optionally - [PAPI](http://icl.cs.utk.edu/papi/)
* optionally - [perf_event](http://man7.org/linux/man-pages/man2/perf_event_open.2.html)
* optionally - [libmsr](https://github.com/LLNL/libmsr)+[msr-safe](https://github.com/LLNL/msr-safe)
* optionally - cpufreq/[cpupower](https://github.com/torvalds/linux/tree/master/tools/power/cpupower)
* optionally - [x86_adapt](https://github.com/tud-zih-energy/x86_adapt)
* optionally - [numa](https://github.com/numactl/numactl) (mandatory if both msr-safe and x86_adapt are not available)
* optionally - [REST-client](https://github.com/mrtazz/restclient-cpp) (mandatory for DiG energy measurement system)
* integrated - [sheredom json parser](https://github.com/sheredom/json.h)

### TIMEPROF used libraries ###
* mandatory  - rt
* optionally - OpenMP
* optionally - MPI

Besides these libraries, waf requires Python.

The default compilation expects the Intel compiler; if you want to compile using GCC, use `make gcc` instead of `make`. TIMEPROF is compiled together with MERIC. If an MPI compiler is available, the compilation produces both MPI and non-MPI versions of the libraries, both using OpenMP. To analyze an MPI application without OpenMP, MERIC must be compiled with `--noopenmp`. Please link your application with `-lmeric`/`-ltimeprof` for a non-MPI application, `-lmericmpi`/`-ltimeprofmpi` for an OpenMP+MPI application, or `-lmericmpionly`/`-ltimeprofmpionly` for a pure MPI application.

--------------------------------------------------------------------------------
#     6] MERIC input parameters                                                #
--------------------------------------------------------------------------------

## SET MERIC STATIC PARAMETERS - mandatory parameters ##
Set the following parameters to zero if you do not want to tune the corresponding system parameter.

	export MERIC_FREQUENCY=2400MHz
	export MERIC_UNCORE_FREQUENCY=2GHz
		- Both frequencies can be specified as an integer in Hz (default), KHz, MHz or GHz.
	export MERIC_NUM_THREADS=24
		- To run a code in the default settings, without MERIC influence, set these three environment variables to zero.
	export MERIC_PWRCAP_POWER=100
		- Average processor PKG power [W] in the selected time window.
	export MERIC_PWRCAP_TIME=10
		- Time window size for power capping specified in microseconds.
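
The unit handling described for `MERIC_FREQUENCY` and `MERIC_UNCORE_FREQUENCY` can be pictured with the following C sketch. It is not MERIC's parser: `parse_frequency_hz` is a hypothetical helper, and accepting the unit suffixes case-insensitively is an assumption.

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not part of MERIC): converts "2400MHz", "2GHz" or a
 * plain integer in Hz to Hz. Returns 0 for malformed input; note that 0 is
 * also the value that means "do not tune this parameter". */
static uint64_t parse_frequency_hz(const char *text)
{
    char *unit = NULL;
    unsigned long long value = strtoull(text, &unit, 10);
    uint64_t scale = 1;

    if (unit == text)
        return 0;                               /* no leading digits */

    switch (tolower((unsigned char)unit[0])) {
    case '\0': return value;                    /* bare number: Hz (default) */
    case 'h':  break;                           /* explicit "Hz" */
    case 'k':  scale = 1000ULL;       unit++; break;
    case 'm':  scale = 1000000ULL;    unit++; break;
    case 'g':  scale = 1000000000ULL; unit++; break;
    default:   return 0;                        /* unknown unit */
    }

    /* What remains must be exactly "Hz". */
    if (tolower((unsigned char)unit[0]) != 'h' ||
        tolower((unsigned char)unit[1]) != 'z' || unit[2] != '\0')
        return 0;

    return value * scale;
}
```

With this sketch the two example values above, `2400MHz` and `2GHz`, map to 2400000000 Hz and 2000000000 Hz.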

## SET MERIC WORKING MODE ##
	export MERIC_MODE=0
		0 = hdeem - uses hdeem to measure energy consumption
		1 = rapl - uses rapl counters to measure energy consumption
		2 = hdeem & rapl - uses hdeem and rapl at the same time
		3 = none - doesn't measure energy consumption, but provides you the option to set configuration for inserted regions
		4 = jetson - energy measurement on BSC Jetson TX1 system
		5 = thunder - energy measurement on BSC ThunderX system
		6 = davide - energy measurement on CINECA D.A.V.I.D.E. system
		7 = time - storing runtime of the regions only

	export MERIC_ITERATION=0
		- When running an application several times with the same configuration, MERIC_ITERATION=$iteration must be exported.
		- Always starts with 0.

	export MERIC_BARRIERS=all
		all  = all barriers are applied (default)
		mpi  = use MPI barriers only
		omp  = use OpenMP barriers only
		none = do not use barriers

## SET ONE OF MERIC OUTPUT FORMAT ##
	export MERIC_CONTINUAL=1
		- Single samples are stored in the HDEEM internal memory and read at the end of the runtime
		  (with a frequency of 1000 samples per second for the blade and 100 samples per second in detailed mode for VRs).
		- Minimal overhead - only the times of the beginning and the end of the measurement
		  are stored (samples are processed after the measurement).
		- In noncontinual mode (MERIC_CONTINUAL=0) the energy consumption is measured directly
		  (with the HDEEM internal delay) at each region start and end.

	export MERIC_DETAILED=1
		- HDEEM provides not only data from the blade; data from the Voltage Regulators (VR-CPU1,
		  VR-CPU0, VR-DIMMGH, VR-DIMMEF, VR-DIMMCD, VR-DIMMAB) are stored too.
		- In detailed mode RAPL also returns values for each CPU, not only the energy consumption of a node.

	export MERIC_DEBUG=1
		- Data are taken both from samples and the Stats structure, so we can compare them.
		- Data are taken from the blade and Voltage Regulators too.
		- Only for measurement check - there can be larger overhead because of two
		  types of data processing performed simultaneously.

	export MERIC_SAMPLES=1
		- When using HDEEM samples to read the energy consumption, MERIC prints
		  each sample to the output file if MERIC_SAMPLES is set.
		- Files can become very big when measured regions run for a long time.

341
	export MERIC_AGGREGATE=0
Ondrej Vysocky's avatar
Ondrej Vysocky committed
342 343
		- When running an MPI application, MERIC aggregate the data from all the processes and stores the aggregated results. There are both average values and summary included in the output files.
		- Exporting MERIC_AGGREGATE=0 turns off this behavior and MERIC will store the results for each node separately.

	export MERIC_COUNTERS=papi   # or perfevent
		- If set, HW counters are read using PAPI or perfevent.
		- When using counters, the output contains not only the counter values but also information
		  about the average CPU core frequency during the region runtime, and the computational
		  and arithmetic intensity (if possible to measure).
		- To add a counter you want to measure, follow
		  the instructions in wrapper/counters.h.
	
## SET OUTPUT FILES/FOLDERS NAME ##
	export MERIC_OUTPUT_DIR="hdeemMeasurement"
		- Default name is mericMeasurement.
	export MERIC_OUTPUT_FILENAME="log"
		- The name of the output file is set automatically according to the specified values of the tuned system parameters; this variable adds a suffix to the filename.
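
As a minimal sketch, the optional switches and the output naming variables described above could be combined for a single measurement run as follows. The chosen values are illustrative assumptions only, and the mandatory tuning parameters still have to be set separately.

```shell
# Optional measurement switches (values are examples, not recommendations)
export MERIC_CONTINUAL=0      # read energy directly at each region start and end
export MERIC_DETAILED=1       # also store per-Voltage-Regulator / per-CPU values
export MERIC_AGGREGATE=0      # keep the results of each node separate
export MERIC_COUNTERS=papi    # or perfevent

# Output naming
export MERIC_OUTPUT_DIR="hdeemMeasurement"
export MERIC_OUTPUT_FILENAME="log"

# Print the MERIC variables that are currently set, as a quick check
env | grep '^MERIC_' | sort
```

Listing the variables before the run is a cheap way to catch a misspelled name, since an unknown variable is simply not picked up.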

## ADVANCED SETTINGS ##
Settings made through the exported environment variables should fulfill your needs when manually searching for the optimal settings. More complex settings must be defined in a configuration file.
In the configuration file one can specify settings for each region separately, and different settings can be applied for each node and also for each socket. It is also possible to provide a list of regions to ignore (the settings for these regions are applied but no consumption is measured), and the size of a change in settings that should be ignored because it is too small to apply.
	
To use the extended options, write the configuration file in JSON format as described below and `export MERIC_REGION_OPTIONS=/path/to/regionoptions.json`.

The basic use of the configuration file is to specify HW settings for regions. In this case each region has an object with parameters. Parameter names are the same as the exported environment variables, but without the "MERIC_" prefix.
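
The per-region form described above can be sketched as follows. This is an illustration under assumptions, not a file taken from the repository: the file name `regionoptions.json`, the thread counts, and the enclosing top-level braces are hypothetical, while the region names A and B and the `NUM_THREADS` parameter (from `MERIC_NUM_THREADS`) come from this README. The files in test/config remain the reference for the exact syntax.

```shell
# Write a hypothetical per-region configuration: one object per region,
# parameter names are the environment variable names without "MERIC_".
cat > regionoptions.json <<'EOF'
{
    "A" : {
        "NUM_THREADS" : 12
    },
    "B" : {
        "NUM_THREADS" : 24
    }
}
EOF

# Optional syntax check, only if python3 happens to be installed.
if command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool regionoptions.json >/dev/null \
        || echo "regionoptions.json is not valid JSON" >&2
fi

# Point MERIC to the configuration file.
export MERIC_REGION_OPTIONS="$PWD/regionoptions.json"
```

Using an absolute path in `MERIC_REGION_OPTIONS` keeps the setting valid even if the application is started from a different working directory.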
	
For per-node or per-socket settings, the region settings objects are wrapped in another object. Per-node settings start with the keyword "@NODE" (per-socket settings use "@SOCKET"), whose value is an object with the ids of the nodes (or sockets) as keys; the value under each id specifies the settings for each region.

It is also possible to specify the settings for a socket on a specific node: in this case insert a "@SOCKET" object containing the required region settings into the "@NODE" object. If the "@NODE" and "@SOCKET" settings are set in separate objects, the settings for a node have higher priority than the settings for a socket. If any region of your code or any parameter is missing, the default setting is used.

	"@SOCKET" : {
		"0" : {
			"A" : {
				"FREQUENCY" : 1300MHz,
				"UNCORE_FREQUENCY" : 1400MHz
			}
		}
	}
	
To define ignore settings, write an object with the keyword "@IGNORE" whose value is another object. It may contain "@REGIONS" with an array of names of the regions to ignore, and "@CHANGE" with an object whose keys are settings and whose values specify how large a change may be and still be ignored.

	"@IGNORE" : {
		"@REGIONS" : ["A", "B", "C"],
		"@CHANGE" : {
			"PWRCAP_POWER" : 10,
			"NUM_THREADS" : 2
		}
	}

Examples of region.options files are in the test/config directory. The region.options.extra file contains all supported settings.

--------------------------------------------------------------------------------
#     7] Content of test folder and example application run                    #
--------------------------------------------------------------------------------
* source codes
	* test.cpp
		* One region with two other regions nested inside.
		* This test uses shared Score-P/MERIC API. See section 3 of this README.
	* test_mpi.cpp
		* Same test as test.cpp, only extended with MPI.
	* fort_test.f90
		* Fortran version of test.cpp to show the MERIC and READEX Fortran interface.
	* samples_test.cpp
		* Test with sleep (minimum energy consumption) and compute (much higher energy consumption) regions in a loop. This allows the user to see in the list of samples how the energy rises when a compute region starts, and to check whether MERIC detects this sample as the first one of the region.
		* Follow the instructions inside the file to set up the test.
	* sleep_test.cpp
		* A test with RUN (maximum load) and SLEEP (minimum load) regions, both with the same runtime.
		* Originally made to check the real CPU frequency of the machine.
	* overhead_test.cpp
		* Test to measure the overhead of MERIC and of the libraries used to change the environment parameters.
	* blas_test.cpp
		* Test that compares DGEMM and DGEMV. There are two possible sizes of matrices (large, which do not fit in the L3 cache, and small, which do). In both cases, the matrix sizes were set so that the DGEMM and DGEMV regions take approximately the same time when using all available resources.
		* This test requires the MKL library; therefore it is not compiled with the other tests but separately, using `make blasTest`.
* Makefile
	* Command `make` compiles all test codes except blas_test.cpp.
	* To compile blas_test.cpp use `make blasTest` command.
* environment_default.source
	* Basic script that sets the chosen MERIC environment variables and informs you which variables are set.
	* When run with the argument `-t`, the script just prints the list of set variables.
	* Make a copy of this script and edit it to suit your needs.
* config directory
	* region.options
		* File that sets exact settings for the regions inside your code.
		* By default it is set for blas_test.
	* region.options.extra
		* Configuration file that shows all available ways to specify MERIC settings.
* run.sh and run-mpi.sh
	* Scripts that run the MERIC analysis of test.cpp or test_mpi.cpp on the Taurus machine.
* run-jetson.sh
	* Template script to submit a job on BSC ARM Jetson platform.
	

Specify the mandatory MERIC parameters and run your instrumented application or one from the test directory. A good starting point is running the `test` application from the test directory. To understand MERIC's output well, explore the source file **test.cpp** to see that this test contains regions A, B and C (inside A there are two regions B and one region C). For each region you can set the CPU core and uncore frequencies and the number of threads.
```
export MERIC_FREQUENCY=0         # no CPU core frequency tuning
export MERIC_UNCORE_FREQUENCY=0  # no CPU uncore frequency tuning
export MERIC_NUM_THREADS=0       # non-OpenMP application
export MERIC_PWRCAP_POWER=0      # no RAPL power capping
export MERIC_PWRCAP_TIME=0       # no RAPL power capping
export MERIC_MODE=7              # time measurement only

./test                           # run the application as usual
```

--------------------------------------------------------------------------------
#      8] Code dynamism investigation                                          #
--------------------------------------------------------------------------------
MERIC's output is stored in a directory named `mericMeasurement` by default; it contains the result files in folders named after the regions. To change the output folder name, export MERIC_OUTPUT_DIR="NEW_NAME".
Each CSV file carries 3 types of data:
 *  CALLTREE - the first line of every CSV file; it is a call stack that shows where the measured region is nested
 *  Section label (e.g. '# Job info') - determines the "category" of the data that follow in the file
 *  Data - tuples (mostly pairs) structured like a hash map: key, value
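
The three kinds of lines listed above can be told apart with a few lines of `awk`. The sketch below is only an illustration: the sample file content, its keys, the comma separator and the helper name `parse_meric_csv` are assumptions made for the example, not the exact MERIC output.

```shell
# Hypothetical sample in the layout described above (not real MERIC output).
cat > sample_region.csv <<'EOF'
CALLTREE;main;A
# Job info
Nodes,1
# Summary
Runtime [s],1.25
EOF

# Print the call tree, then every key/value pair prefixed with its section.
parse_meric_csv() {
    awk -F',' '
        NR == 1 { print "calltree: " $0; next }      # first line: call stack
        /^#/    { section = substr($0, 3); next }    # section label line
        NF >= 2 { print section " | " $1 " = " $2 }  # data line: key,value
    ' "$1"
}

parse_meric_csv sample_region.csv
```

For the sample file this prints the call tree line followed by `Job info | Nodes = 1` and `Summary | Runtime [s] = 1.25`. Keeping the section label next to each key avoids mixing up identically named keys from different sections.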

To find the best settings for each region, run your code with several possible settings. The content of MERIC's output directories can be analysed using our RADAR tool, which generates a MERIC configuration file for production runs of the application and a LaTeX report describing the application behavior.

	# Run the application in the configurations you want to test - example scripts `run.sh` and `run-mpi.sh` are provided in the test directory
		```
		export MERIC_MODE=1
		for thread in {24..1..1}
		do
			for cpu_freq in {25..12..1}
			do
				for uncore_freq in {30..12..1} # or {30..12..2}
				do
					for iter in {0..3}
					do
						export MERIC_NUM_THREADS=$thread
						export MERIC_FREQUENCY=${cpu_freq}00MHz
						export MERIC_UNCORE_FREQUENCY=${uncore_freq}00MHz
						export MERIC_ITERATION=$iter
						./test
					done
				done
			done
		done
		```
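
A full sweep like the one above grows quickly, so it can be worth computing the number of runs before submitting the job. The following sketch only counts the combinations of the loop ranges shown above; the variable names are illustrative.

```shell
# Number of values in each loop range of the sweep above
threads=$((24 - 1 + 1))          # {24..1..1}  -> 24 values
cpu_freqs=$((25 - 12 + 1))       # {25..12..1} -> 14 values
uncore_freqs=$((30 - 12 + 1))    # {30..12..1} -> 19 values
iterations=$((3 - 0 + 1))        # {0..3}      -> 4 values

total_runs=$((threads * cpu_freqs * uncore_freqs * iterations))
echo "total runs: $total_runs"   # 24 * 14 * 19 * 4 = 25536
```

With the coarser uncore step `{30..12..2}` there are 10 uncore values instead of 19, which reduces the sweep to 13440 runs.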
	
	# Edit the description file `measurementInfo.json` in your output data folder. This step is not compulsory, but the file helps you keep track of what you have measured.
		```
		{
			"Timestamp" : "Thu Dec  6 15:37:23 2018",
			"System"    : "IT4I Salomon",
			"DataFormat": "node_CF_UnCF_thrds",
			"Note"      : ""
		}
		```

	# Process the results using RADAR tool
		Repository URL: https://code.it4i.cz/bes0030/readex-radar.git
		1) Set variables in config.py file (description is included in the file)
		2) Launch python3 ./printFullReport.py -configFile path/to/config.py


--------------------------------------------------------------------------------
#      9] MERIC with a Fortran code                                            #
--------------------------------------------------------------------------------
A Fortran module and interface are located in the include directory. The module is compiled separately from MERIC using the `make fortran` command. To instrument a Fortran application, the Dyninst tool for static binary instrumentation can be used; manual instrumentation is also available.

To allow manual MERIC instrumentation in your Fortran application, add the `use meric` statement to your program. Call the MERIC functions with the keyword `call`, as usual in Fortran. Because Fortran strings are not null-terminated, as the C `const char *` parameters require, all region names must end with `//char(0)` (e.g. `call MERIC_MeasureStart("RegionName"//char(0))`). The MERIC repository contains a Fortran code example, `test/fort_test.f90`, that shows how the API can be used.
	
--------------------------------------------------------------------------------
#     10] Tool for static tuning                                               #
--------------------------------------------------------------------------------
The MERIC repository also contains a tool, based on the MERIC source code, for static energy measurement and system parameters tuning. It is located in tools/staticMERICtool/.

The binaries `energyMeasureStart` and `energyMeasureStop` provide RAPL energy measurement for a single node (similar to the HDEEM command-line tools startHdeem and stopHdeem). For measurement on several nodes, the multiNodeStaticMeasureStart/Stop.sh scripts are located in the same directory. Since there is no multi-node HDEEM measurement tool, these scripts support both energy measurement interfaces. To select which one should be used, the scripts take one argument, `--rapl` or `--hdeem`.

For a static analysis of a selected application, the directory with the tool also contains the `staticAnalysis.sh` bash script. It not only runs the application in a variety of available HW settings, but also stores the results in a format similar to MERIC's, so the results can be analysed using the RADAR library.

--------------------------------------------------------------------------------
#     11] Acknowledgement                                                      #
--------------------------------------------------------------------------------
MERIC is being developed at [IT4Innovations National Supercomputing Center](https://www.it4i.cz/) under [BSD-3 license](https://code.it4i.cz/vys0053/meric/blob/master/LICENSE).
Please open an issue if you encounter any problem.

	
To reference MERIC, please cite: **[MERIC and RADAR Generator: Tools for Energy Evaluation and Runtime Tuning of HPC Applications](https://link.springer.com/chapter/10.1007/978-3-319-97136-0_11)**.
