# MERIC #

Lightweight C/C++ library with a Fortran interface that detects the dynamic behavior of HPC applications with the goal of reducing their energy consumption, applying the [READEX](https://www.readex.eu/) approach.

The library was originally developed for x86 systems (tested on HSW, BDW and KNL), but it additionally supports the OpenPOWER8+ CINECA DAVIDE system and selected BSC ARM systems.

--------------------------------------------------------------------------------
#      README Content                                                          #
--------------------------------------------------------------------------------
<!--> master branch links <!-->
 1. [Content of src folder](https://code.it4i.cz/vys0053/meric#1-content-of-src-folder)
 2. [MERIC and TIMEPROF interface and Shared Score-P/MERIC API](https://code.it4i.cz/vys0053/meric#2-meric-and-timeprof-interface-and-shared-score-pmeric-api)
 3. [MERIC binary instrumentation](https://code.it4i.cz/vys0053/meric#3-meric-binary-instrumentation)
 4. [Compilation](https://code.it4i.cz/vys0053/meric#4-compilation)
 5. [MERIC input parameters](https://code.it4i.cz/vys0053/meric#5-meric-input-parameters)
 6. [Content of test folder and example application run](https://code.it4i.cz/vys0053/meric#6-content-of-test-folder-and-example-application-run)
 7. [Code dynamism investigation](https://code.it4i.cz/vys0053/meric#7-code-dynamism-investigation)
 8. [MERIC with a Fortran code](https://code.it4i.cz/vys0053/meric#8-meric-with-a-fortran-code)
 9. [Using MERIC on BSC ARM systems](https://code.it4i.cz/vys0053/meric#9-using-meric-on-bsc-arm-systems)
10. [Using MERIC on D.A.V.I.D.E. system](https://code.it4i.cz/vys0053/meric#10-using-meric-on-cineca-davide-system)
11. [Tool for static tuning](https://code.it4i.cz/vys0053/meric#11-tool-for-static-tuning)
12. [Acknowledgement](https://code.it4i.cz/vys0053/meric#12-acknowledgement)

<!--> dev branch links
 1. [Content of src folder](https://code.it4i.cz/vys0053/meric/tree/dev#1-content-of-src-folder)
 2. [MERIC and TIMEPROF interface and Shared Score-P/MERIC API](https://code.it4i.cz/vys0053/meric/tree/dev#2-meric-and-timeprof-interface-and-shared-score-pmeric-api)
 3. [MERIC binary instrumentation](https://code.it4i.cz/vys0053/meric/tree/dev#3-meric-binary-instrumentation)
 4. [Compilation](https://code.it4i.cz/vys0053/meric/tree/dev#4-compilation)
 5. [MERIC input parameters](https://code.it4i.cz/vys0053/meric/tree/dev#5-meric-input-parameters)
 6. [Content of test folder and example application run](https://code.it4i.cz/vys0053/meric/tree/dev#6-content-of-test-folder-and-example-application-run)
 7. [Code dynamism investigation](https://code.it4i.cz/vys0053/meric/tree/dev#7-code-dynamism-investigation)
 8. [MERIC with a Fortran code](https://code.it4i.cz/vys0053/meric/tree/dev#8-meric-with-a-fortran-code)
 9. [Using MERIC on BSC ARM systems](https://code.it4i.cz/vys0053/meric/tree/dev#9-using-meric-on-bsc-arm-systems)
10. [Using MERIC on D.A.V.I.D.E. system](https://code.it4i.cz/vys0053/meric/tree/dev#10-using-meric-on-cineca-davide-system)
11. [Tool for static tuning](https://code.it4i.cz/vys0053/meric/tree/dev#11-tool-for-static-tuning)
12. [Acknowledgement](https://code.it4i.cz/vys0053/meric/tree/dev#12-acknowledgement)
<!-->

--------------------------------------------------------------------------------
#     1] Content of src folder                                                 #
--------------------------------------------------------------------------------
* basis    - Input parser.
* meric    - Base classes of the library.
* store    - Different types of the store class. To store a new type of data, 
             a new class inherited from class StoreBase (store.h) is required.
* timeprof - Lightweight time-profiling library designed to identify functions 
             to instrument with MERIC.
* wrapper
	- environmentwrapper
		= Thread switching, CPU core and uncore frequencies settings using [x86_adapt](https://github.com/tud-zih-energy/x86_adapt) or [cpufreq](http://www.thinkwiki.org/wiki/How_to_use_cpufrequtils) or [libmsr](https://github.com/LLNL/libmsr) + [msr-safe](https://github.com/LLNL/msr-safe).

	- hdeemwrapper
		= Energy measurement using [HDEEM](https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/EnergyMeasurement).

	- perfeventwrapper
		= Hardware counters provided by [perf_event](http://man7.org/linux/man-pages/man2/perf_event_open.2.html).

	- papiwrapper
		= Hardware counters provided by [PAPI](http://icl.cs.utk.edu/papi/).

	- raplwrapper
		= Intel RAPL counters read by [x86_adapt](https://github.com/tud-zih-energy/x86_adapt).

	- davidewrapper
		= Support for OpenPOWER system [DAVIDE](http://www.hpc.cineca.it/content/davide).

	- jetsonwrapper and thunderwrapper
		= Support for [ARM machines](http://montblanc-project.eu/prototypes).

	- counters
		= List of supported HW counters (PAPI, perfevent, RAPL).

--------------------------------------------------------------------------------
#     2] MERIC and TIMEPROF interface and Shared Score-P/MERIC API             #
--------------------------------------------------------------------------------
## MERIC interface ##
If you want to use MERIC with a parallel application, keep in mind that all processes and all running threads must call every inserted MERIC function, otherwise the MERIC behavior is undefined. It is not possible to set the runtime environment for each process separately, because MERIC makes environment changes that affect the whole node (or socket). To guarantee that the selected settings are applied for the selected region, each measurement start and stop begins with an MPI and OpenMP barrier. The MERIC interface is defined in `include/meric.h`.

	void MERIC_Init()
		Initialization of the library. Insert directly after MPI_Init().
	void MERIC_Close()
		Finalization of the library run to store the measurement results. Insert directly before MPI_Finalize().
	void MERIC_MeasureStart(const char * regionName)
		Starts the measurement of a region.
		Please do not use **#** and **@** in names of regions.
	double MERIC_MeasureStop()
		Ends the measurement of the last started region.
		Returns the runtime of the stopped region in seconds. Only a single MPI process per node returns the runtime, the others return 0.0.
	double MERIC_MeasureStopStart(const char * regionName)
		Ends the measurement of the last started region and starts a new region.
		Skips the environment switch to the configuration of the region the application is nested in.
		Returns the runtime of the stopped region in seconds. Only a single MPI process per node returns the runtime, the others return 0.0.
	void MERIC_CaptureScope(const char * regionName)
		C++ and C (no support for Fortran) function to start a measurement that is stopped automatically at the end of the scope. Useful to capture a function that has several return statements.
		This functionality is based on the [RAII](https://en.cppreference.com/w/cpp/language/raii) technique; if it is used for instrumentation of a C application, the application may require compilation with the `-fno-exceptions` or `-lstdc++` flag.
	void MERIC_IgnoreStart()
		From this point MERIC does not store the resource consumption of the following regions, but the requested settings of the nested regions are still applied.
		It is not possible to nest ignore sections of the code.
	void MERIC_IgnoreStop()
		Cancels the ignore section of the code.

If you want to insert MERIC regions into code you do not know, first insert MPI and OpenMP barriers instead of the MERIC probes and make sure that the code still works. After that, replace the barriers with MERIC_MeasureStart/Stop.
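
The placement rules above can be sketched as follows. This is a minimal illustration, not code from the MERIC repository: the `MERIC_*` functions defined here are local stand-ins (an assumption) that only record the call order so that the sketch builds without MERIC or MPI, and `run_application` and the region names are hypothetical. In a real application, drop the stand-ins, include `meric.h` and link the library.

```cpp
#include <string>
#include <vector>

// Stand-ins for the functions declared in include/meric.h (assumption:
// they only record the call order so this sketch builds without MERIC).
static std::vector<std::string> g_calls;
void MERIC_Init() { g_calls.push_back("Init"); }
void MERIC_Close() { g_calls.push_back("Close"); }
void MERIC_MeasureStart(const char* regionName) {
    g_calls.push_back(std::string("Start:") + regionName);
}
double MERIC_MeasureStop() { g_calls.push_back("Stop"); return 0.0; }

// Hypothetical application body; every process and every thread must
// reach the same probes in the same order.
void run_application() {
    // MPI_Init(&argc, &argv) would be called here in a real MPI code.
    MERIC_Init();                       // directly after MPI_Init()

    MERIC_MeasureStart("main_loop");    // outer region
    for (int step = 0; step < 2; ++step) {
        MERIC_MeasureStart("compute");  // nested region, no '#' or '@' in the name
        // ... computation ...
        MERIC_MeasureStop();            // stops "compute"
    }
    MERIC_MeasureStop();                // stops "main_loop"

    MERIC_Close();                      // directly before MPI_Finalize()
    // MPI_Finalize() would follow here.
}
```

Every start is matched by exactly one stop, and the nested region is always entered at the same nesting level.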

There is one more restriction on placing probes into the code. The current version of MERIC does not support recursively nested regions, nor regions with at least three starts where the first and the third calls are at the same nesting level but the second call is at a higher (deeper) nesting level. This case is shown in the following example:

```
	MERIC_MeasureStart("X"); // region X wraps the others; it is not necessary in this example
		MERIC_MeasureStart("A");	// first call of region A
		MERIC_MeasureStop();

		MERIC_MeasureStart("B");
			MERIC_MeasureStart("A");	// region A called at a higher nested level
			MERIC_MeasureStop();
		MERIC_MeasureStop();

		MERIC_MeasureStart("A");	// next call of region A at the same nested level as the first one
			MERIC_MeasureStart("C");	// the problem is here, when region A has another nested region
			MERIC_MeasureStop();	// region C will cause a defect in its call tree
		MERIC_MeasureStop();
	MERIC_MeasureStop();
```
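
For the recursion restriction, one possible workaround is to open the region only in the outermost call of a recursive function. The sketch below is an assumption about how an application could do this, not a MERIC feature: the `MERIC_*` functions are stand-ins that only count calls, `fib` is a hypothetical example function, and the depth counter is not thread safe.

```cpp
// Stand-ins for the MERIC probes (assumption: they only count calls so
// the sketch builds without MERIC).
static int g_starts = 0;
static int g_stops = 0;
void MERIC_MeasureStart(const char*) { ++g_starts; }
double MERIC_MeasureStop() { ++g_stops; return 0.0; }

// Hypothetical recursive function: the region is opened only by the
// outermost call, so region "fib" is never nested inside itself.
static int g_depth = 0;
long fib(int n) {
    const bool outermost = (g_depth == 0);
    if (outermost) MERIC_MeasureStart("fib");
    ++g_depth;
    const long result = (n < 2) ? n : fib(n - 1) + fib(n - 2);
    --g_depth;
    if (outermost) MERIC_MeasureStop();
    return result;
}
```

With this guard the whole recursion is measured as a single region call, at the cost of losing per-level detail.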

## TIMEPROF interface ##
The time measurement provided by TIMEPROF is done by the master thread of the process with rank 0 in MPI_COMM_WORLD. The interface is defined in `include/timeprof.h`. 

	void MERIC_Init()
		Initialization of the library. Insert directly after MPI_Init().
	void MERIC_Close()
		Finalization of the library run to store the measurement results. Insert directly before MPI_Finalize().
	void TIMEPROF_regionStart(const char * regionName);
		Start time measurement of region called regionName
	double TIMEPROF_regionStop(const char * regionName);
		End of the time measurement, returns region duration in seconds.
	void TIMEPROF_evaluate (unsigned int timeThreshold = 0, const char * fileName = "");
		This function might be called at the end of the application run 
		to evaluate the measurements. It produces a list of the regions 
		with a duration longer than timeThreshold [ms] and stores it in fileName.
		If no threshold is provided, it prints the complete list of measured functions with their minimum runtime.
		If no output file is specified, the list of functions is printed to stdout.
	void TIMEPROF_captureScope(const char * regionName);
		[RAII](https://en.cppreference.com/w/cpp/language/raii) time measurement of the scope in which it is called.
		In case of a C application, compilation with the `-fno-exceptions` or `-lstdc++` flag may be required if you want to use this function.
	double TIMEPROF_getLastRegionDuration();
		Since scope time measurement does not return the time measured, it can be obtained using this function.
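
The RAII idea behind the scope-capturing functions can be illustrated with a small self-contained timer. This is a sketch of the technique only, not the MERIC or TIMEPROF implementation; `ScopeTimer`, `classify` and `last_duration_seconds` are hypothetical names.

```cpp
#include <chrono>

// Hypothetical RAII timer illustrating the technique; this is not the
// MERIC/TIMEPROF implementation.
static double g_last_duration_s = -1.0;

class ScopeTimer {
public:
    ScopeTimer() : start_(std::chrono::steady_clock::now()) {}
    // The destructor runs on every exit path of the enclosing scope.
    ~ScopeTimer() {
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - start_;
        g_last_duration_s = elapsed.count();
    }
    ScopeTimer(const ScopeTimer&) = delete;
    ScopeTimer& operator=(const ScopeTimer&) = delete;

private:
    std::chrono::steady_clock::time_point start_;
};

// Function with several return statements; one timer covers all of them.
int classify(int value) {
    ScopeTimer timer;
    if (value < 0) return -1;
    if (value == 0) return 0;
    return 1;
}

// Since the scope itself returns nothing, the measured time is queried
// afterwards, in the spirit of TIMEPROF_getLastRegionDuration().
double last_duration_seconds() { return g_last_duration_s; }
```

Because the stop is tied to the destructor, no probe has to be placed in front of each individual `return`.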


## Shared MERIC/Score-P interface ##
The include folder contains the C header file `readex.h` and the Fortran `readex.inc`. Instead of the previously presented MERIC API you may use the following functions, which provide shared instrumentation for MERIC and Score-P. To select the library used for measurement, compile your code with `-DUSE_MERIC`, or with `-DUSE_SCOREP` for compiler-annotated code (phase region only), or with `-DUSE_SCOREP_MANUAL` for manually annotated code. It is not possible to use MERIC and Score-P simultaneously. The example codes test.cpp and fort_test.f90 use this interface.

TIMEPROF is currently not supported in the readex header file.

The parameters have the same data types as in the Score-P functions: struct SCOREP_User_Region* handle, const char* name, uint32_t type.

| Shared API                                | MERIC function           | Score-P function                               |
| ----------------------------------------- | ------------------------ | ---------------------------------------------- |
| READEX_INIT()                             | MERIC_Init()             |                                                |
| READEX_CLOSE()                            | MERIC_Close()            |                                                |
| READEX_PHASE_REGION_DEFINE(handle)        |                          | SCOREP_USER_REGION_DEFINE(handle)              |
| READEX_SIGNIFICANT_REGION_DEFINE(handle)* |                          | SCOREP_USER_REGION_DEFINE(handle)              |
| READEX_REGION_START(handle, name, type)   | MERIC_MeasureStart(name) | SCOREP_USER_REGION_BEGIN(handle, name, type)   |
| READEX_REGION_STOP(handle)                | MERIC_MeasureStop()      | SCOREP_USER_REGION_END(handle)                 |
| READEX_REGION_STOP_START(stop_handle, start_handle, start_name, start_type) | MERIC_MeasureStopStart (start_name) | SCOREP_USER_REGION_END(stop_handle) SCOREP_USER_REGION_BEGIN (start_handle, start_name, start_type) |
| READEX_PHASE_START(handle, name, type)    | MERIC_MeasureStart(name) | SCOREP_USER_OA_PHASE_BEGIN(handle, name, type) |
| READEX_PHASE_STOP(handle)                 | MERIC_MeasureStop()      | SCOREP_USER_OA_PHASE_END(handle)               |
| READEX_IGNORE_START()                     | MERIC_IgnoreStart()      | SCOREP_RECORDING_OFF()                         |
| READEX_IGNORE_STOP()                      | MERIC_IgnoreStop()       | SCOREP_RECORDING_ON()                          |
    *  Defines a region handle for any region except the phase region

The READEX interface does not contain all Score-P API functions, because MERIC has no support for them. For the remaining functionality the user may use the usual Score-P API; these functions are ignored if the code is compiled without Score-P.
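
The MERIC column of the table can be pictured as a thin macro layer. The sketch below is a guess at the general shape of such a layer, not the content of `readex.h`: the macros carry a `_SKETCH` suffix to make clear they are hypothetical, the `MERIC_*` functions are stand-ins that only record the last action, and `solver_step` is an invented example.

```cpp
#include <string>

// Stand-ins for the MERIC functions (assumption: they only record the
// last action so the sketch builds without MERIC or Score-P).
static std::string g_last;
void MERIC_MeasureStart(const char* name) { g_last = std::string("start:") + name; }
double MERIC_MeasureStop() { g_last = "stop"; return 0.0; }

// Normally passed on the command line as -DUSE_MERIC.
#ifndef USE_MERIC
#define USE_MERIC
#endif

#ifdef USE_MERIC
// MERIC needs only the region name; the handle and type are dropped.
#define READEX_REGION_DEFINE_SKETCH(handle)
#define READEX_REGION_START_SKETCH(handle, name, type) MERIC_MeasureStart(name)
#define READEX_REGION_STOP_SKETCH(handle) MERIC_MeasureStop()
#endif

// Hypothetical instrumented function written once against the shared API.
void solver_step() {
    READEX_REGION_DEFINE_SKETCH(solver_handle);
    READEX_REGION_START_SKETCH(solver_handle, "solver", 0);
    // ... work ...
    READEX_REGION_STOP_SKETCH(solver_handle);
}
```

The same source could then be routed to Score-P by a second set of definitions selected with a different `-D` flag.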

--------------------------------------------------------------------------------
#     3] MERIC binary instrumentation                                          #
--------------------------------------------------------------------------------
Manual application instrumentation is straightforward, however it requires at least some basic knowledge about the target application, access to the source files and also time to instrument the code. All of these requirements can be avoided by using static binary instrumentation (SBI). The MERIC repository contains several tools for binary analysis placed in the `tools/DBI` directory. All of them are compiled separately from MERIC, using the Makefile located in the same directory. For SBI there are two tools based on the [Dyninst library](https://dyninst.org/): `dinst_profile.cpp` (make dinst) and `dins_instrument.cpp` (make sbi). The `dinst_profile` tool provides a list of all functions that are defined in the binary (shared libraries are not taken into account) and reports whether each function can or cannot be instrumented.

The second tool, `dinst_instrument`, instruments the binary with MERIC or TIMEPROF based on a list of functions that should be instrumented. In case of TIMEPROF the list of functions does not have to be provided; in that case all the instrumentable functions are selected. Besides the instrumentation, the tool also adds all the necessary dependencies, so it is not necessary to recompile the application and link it with them. 
In case of MPI applications it is not only the application that is instrumented, but also the MPI library itself. For this purpose the shared MPI library that is used by the application must be provided. Please specify the full path to the MPI library to avoid any possible mistake. Use the `ldd` command to detect which MPI library is used by the analyzed application. When running the instrumented MPI application, `LD_PRELOAD` must be specified to replace the default MPI library with the instrumented one. 

See the tools' help messages for more information.

## Dyninst installation ##
The tools have been developed using Dyninst-10.0.0. Any newer version of the library should work too, however the following information might be out of date for newer Dyninst versions. Dyninst installation is described in its [repository README](https://github.com/dyninst/dyninst) and is quite simple, however `make install` may fail due to missing sudo rights. So that all the tools' compilation paths are set correctly, please export the following environment variables before you compile and use the Dyninst tools.
```
export DYNINST_HOME=/PATH/TO/DYNINST/DIRECTORY
export DYNINSTAPI_RT_LIB=$DYNINST_HOME/build/dyninstAPI_RT/libdyninstAPI_RT.so
export LD_LIBRARY_PATH+=:$DYNINST_HOME/build/dyninstAPI_RT/
```

### known issues ###
 * An instrumented function that consists of the return command only may lose its return value (might be related to the previous issue); you may increase the `instlimit` TIMEPROF parameter to skip such functions.
 * If the instrumented application is compiled with the same list of modules as the Dyninst tools, the new binary is sometimes corrupted. This can be tested using the `ldd` command. If it happens, we suggest using a different set of modules for the application or for the Dyninst tools.

--------------------------------------------------------------------------------
#    4] Compilation                                                            #
--------------------------------------------------------------------------------
MERIC is compiled using the [waf build system](https://waf.io/). Since this build system is not well known, a Makefile is provided in the repository root folder. Please modify the library and include paths according to your system paths.

### MERIC used libraries ###
* mandatory  - rt (high precision time measurement)
* optionally - OpenMP (used by default)
* optionally - MPI
* optionally - [PAPI](http://icl.cs.utk.edu/papi/)
* optionally - [perf_event](http://man7.org/linux/man-pages/man2/perf_event_open.2.html)
* optionally - [libmsr](https://github.com/LLNL/libmsr)+[msr-safe](https://github.com/LLNL/msr-safe)
* optionally - cpufreq/[cpupower](https://github.com/torvalds/linux/tree/master/tools/power/cpupower)
* optionally - [x86_adapt](https://github.com/tud-zih-energy/x86_adapt)
* optionally - [numa](https://github.com/numactl/numactl) (mandatory if x86_adapt is missing)
* optionally - [REST-client](https://github.com/mrtazz/restclient-cpp) (mandatory for DiG energy measurement system)
* integrated - [sheredom json parser](https://github.com/sheredom/json.h)

### TIMEPROF used libraries ###
* mandatory  - rt
* optionally - OpenMP
* optionally - MPI

Besides these libraries, waf requires Python.

The default compilation expects the Intel compiler; if you want to compile using GCC, use `make gcc` instead of `make`. TIMEPROF is compiled together with MERIC. If an MPI compiler is available, then the compilation produces both MPI and non-MPI versions of the libraries, both using OpenMP. If an MPI application without OpenMP should be analyzed, compilation with `--noopenmp` must be used to build such a version of MERIC. Please link your application with `-lmeric`/`-ltimeprof`, or with `-lmericmpi`/`-ltimeprofmpi` for an OpenMP+MPI application, or with `-lmericmpionly`/`-ltimeprofmpionly` for a pure MPI application.

--------------------------------------------------------------------------------
#     5] MERIC input parameters                                                #
--------------------------------------------------------------------------------

## SET MERIC STATIC PARAMETERS - mandatory parameters ##
	export MERIC_FREQUENCY=2400MHz
	export MERIC_UNCORE_FREQUENCY=2GHz
		- both frequencies can be specified as integer in Hz (default), KHz, MHz or GHz
	export MERIC_NUM_THREADS=24
		- To run a code in the default settings, without MERIC influence, set these three environment variables to zero.
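
How a frequency string with an optional unit maps to Hz can be sketched with a small helper. This is not MERIC's own parser: `parse_frequency_hz` is a hypothetical function, and accepting the unit in any letter case is an assumption of the sketch.

```cpp
#include <cctype>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper, not MERIC's own parser: converts strings such as
// "2400MHz", "2GHz" or "2400000000" to an integer number of Hz.
std::uint64_t parse_frequency_hz(const std::string& text) {
    std::size_t pos = 0;
    while (pos < text.size() && std::isdigit(static_cast<unsigned char>(text[pos]))) {
        ++pos;
    }
    if (pos == 0) throw std::invalid_argument("frequency must start with digits");
    const std::uint64_t value = std::stoull(text.substr(0, pos));

    // Lower-case the unit so that "KHz" and "kHz" are treated alike (assumption).
    std::string unit = text.substr(pos);
    for (char& c : unit) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }

    if (unit.empty() || unit == "hz") return value;  // Hz is the default unit
    if (unit == "khz") return value * 1000ULL;
    if (unit == "mhz") return value * 1000000ULL;
    if (unit == "ghz") return value * 1000000000ULL;
    throw std::invalid_argument("unknown frequency unit: " + unit);
}
```

With this helper, `2400MHz` and `2400000000` describe the same core frequency, and `0` stays `0`, which matches the convention of using zero for the default settings.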

## SET MERIC WORKING MODE ##
	export MERIC_MODE=0
		0 = hdeem - uses hdeem to measure energy consumption
		1 = rapl - uses rapl counters to measure energy consumption
		2 = hdeem & rapl - uses hdeem and rapl at the same time
		3 = none - doesn't measure energy consumption, but provides you the option to set configuration for inserted regions
		4 = jetson - energy measurement on BSC Jetson TX1 system
		5 = thunder - energy measurement on BSC ThunderX system
		6 = davide - energy measurement on CINECA D.A.V.I.D.E. system
		7 = time - storing runtime of the regions only

	export MERIC_ITERATION=0
		- if running an application several times with the same configuration, MERIC_ITERATION=$iteration must be exported
		- always start with 0

	export MERIC_BARRIERS=all
		all  = all barriers are applied (default)
		mpi  = use MPI barriers only
		omp  = use OpenMP barriers only
		none = do not use barriers

## SET ONE OF THE MERIC OUTPUT FORMATS ##
	export MERIC_CONTINUAL=1
		- Single samples are stored in HDEEM internal memory and read at the end of the runtime
		  (with frequency 1000 samples per 1 second for blade and 100 samples in detailed mode for VRs).
		- Minimal overhead - only times of the beginning and the end of measurement
		  are stored (samples are processed after the measurement).
		- In noncontinual mode (MERIC_CONTINUAL=0) energy consumption is measured directly 
		  (with HDEEM internal delay) at each region start and end.

	export MERIC_DETAILED=1
		- HDEEM provides not only data from the blade; data from the Voltage Regulators (VR-CPU1,
		  VR-CPU0, VR-DIMMGH, VR-DIMMEF, VR-DIMMCD, VR-DIMMAB) are stored too.
		- In detailed mode RAPL also returns values for each CPU, not only the energy consumption of a node.

	export MERIC_DEBUG=1
		- Data are taken both from samples and Stats structure, so we can compare them.
		- Data are taken from blade and Voltage Regulators too.
		- Only for measurement check - there can be larger overhead because of two
		  types of data processing performed simultaneously.

	export MERIC_SAMPLES=1
		- When using HDEEM samples to read the energy consumption, MERIC prints 
		  each sample to the output file, if MERIC_SAMPLES is set.
		- Files can become very big when measured regions run for a long time.

	export MERIC_AGGREGATE=0
		- When running an MPI application, MERIC aggregates the data from all the processes and stores the aggregated results. Both average values and a summary are included in the output files.
		- Exporting MERIC_AGGREGATE=0 turns off this behavior and MERIC will store the results for each node separately.

	export MERIC_COUNTERS=papi or perfevent
		- If set, you can read HW counters using PAPI or perfevent.
		- When using counters, the output contains not only the counter values but also information
		  about the average CPU core frequency during the region runtime and the computational
		  and arithmetic intensity (if possible to measure).
		- To add a counter you want to measure, it is necessary to follow 
		  the instructions in wrapper/counters.h.
	
## SET OUTPUT FILES/FOLDERS NAME ##
	export MERIC_OUTPUT_DIR="hdeemMeasurement"
		- default name is mericMeasurement
	export MERIC_OUTPUT_FILENAME="log"
		- the name of the output file is set automatically according to the specified core and uncore CPU frequencies and the number of active OpenMP threads; this variable adds a suffix to the file name

## ADVANCED SETTINGS ##
Settings made through exported environment variables should fulfill your needs when manually searching for the optimal settings. For more complex settings a configuration file must be defined.
In the configuration file one can specify settings for each region separately, and different settings can be applied for each node and also for each socket. It is also possible to provide a list of regions to ignore (the settings for these regions are applied, but no consumption is measured) and the size of a change in settings that should be ignored because it is too small to be worth applying.
	
To use the extended options, write a configuration file in JSON format as described below and `export MERIC_REGION_OPTIONS=/path/to/regionoptions.json`.

The basic use of the configuration file is to specify HW settings for regions. In this case each region has an object with parameters. Parameter names are the same as the exported environment variables, but without the "MERIC_" prefix.
	
For per-node or per-socket settings, the objects with region settings are wrapped in another object. Per-node settings start with the keyword "@NODE" (per-socket settings with "@SOCKET"), whose value is an object that has the ids of the nodes (or sockets) as keys; the value stored under each id specifies the settings for each region.

It is also possible to specify the settings for a socket on a specific node; in this case insert a "@SOCKET" object, which contains the required region settings, into the "@NODE" object. If the "@NODE" and "@SOCKET" settings are set in separate objects, the settings for a node have higher priority than the settings for a socket. If any region of your code or any parameter is missing, the default setting is used.

	"@SOCKET" : {
		"0" : {
			"A" : {
				"FREQUENCY" : 1300MHz,
				"UNCORE_FREQUENCY" : 1400MHz
			}
		}
	}
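
The priority rule for separate "@NODE" and "@SOCKET" objects can be modelled as a simple lookup chain. The sketch below works on a hypothetical in-memory model of an already parsed configuration; the types, the `resolve` function and the ids used are illustrative and not part of MERIC.

```cpp
#include <map>
#include <string>

// Hypothetical in-memory model of a parsed configuration; the type and
// function names are illustrative and not part of MERIC.
using Params  = std::map<std::string, std::string>;  // e.g. "FREQUENCY" -> "1300MHz"
using Regions = std::map<std::string, Params>;       // region name -> parameters
using PerId   = std::map<std::string, Regions>;      // node or socket id -> regions

struct Config {
    PerId node;       // content of a top-level "@NODE" object
    PerId socket;     // content of a top-level "@SOCKET" object
    Params defaults;  // values used when nothing else is specified
};

// Looks up one parameter in a per-node or per-socket table.
static const std::string* find_param(const PerId& table, const std::string& id,
                                     const std::string& region, const std::string& key) {
    const auto by_id = table.find(id);
    if (by_id == table.end()) return nullptr;
    const auto by_region = by_id->second.find(region);
    if (by_region == by_id->second.end()) return nullptr;
    const auto by_key = by_region->second.find(key);
    return by_key == by_region->second.end() ? nullptr : &by_key->second;
}

// Node settings win over socket settings; anything missing falls back
// to the default value.
std::string resolve(const Config& cfg, const std::string& node_id,
                    const std::string& socket_id, const std::string& region,
                    const std::string& key) {
    if (const std::string* v = find_param(cfg.node, node_id, region, key)) return *v;
    if (const std::string* v = find_param(cfg.socket, socket_id, region, key)) return *v;
    const auto d = cfg.defaults.find(key);
    return d == cfg.defaults.end() ? std::string() : d->second;
}
```

For example, with a default `FREQUENCY` of `2400MHz`, a socket `0` entry of `1300MHz` for region `A` and a node entry of `1800MHz` for the same region, the node value is returned on that node and the socket value on all other nodes.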
	
To define ignore settings, write an object with the keyword "@IGNORE" whose value is another object. This object might contain "@REGIONS" with an array of names of the regions to ignore, and "@CHANGE" with an object that contains, for each setting, a value specifying how large a change can be and still be ignored.

	"@IGNORE" : {
		"@REGIONS" : ["A", "B", "C"],
		"@CHANGE" : {
			"FREQUENCY" : 2500MHz,
			"UNCORE_FREQUENCY" : 2GHz,
			"NUM_THREADS" : 2
		}
	}
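
One plausible reading of the "@CHANGE" thresholds is sketched below in Python: a requested frequency change is skipped when it is no larger than the configured value. The helper names `to_hz` and `is_ignored`, the inclusive comparison, and the set of accepted unit suffixes are assumptions for illustration, not MERIC's code.

```python
# Illustrative only: one possible interpretation of "@CHANGE", not MERIC's code.
UNITS_HZ = {"GHz": 10**9, "MHz": 10**6, "kHz": 10**3}


def to_hz(value):
    """Convert strings such as '2500MHz' or '2GHz' to an integer number of Hz."""
    for suffix, factor in UNITS_HZ.items():
        if value.endswith(suffix):
            return round(float(value[: -len(suffix)]) * factor)
    raise ValueError(f"unsupported frequency unit in {value!r}")


def is_ignored(current, requested, threshold):
    """True when the change is no larger than the threshold (assumed inclusive)."""
    return abs(to_hz(requested) - to_hz(current)) <= to_hz(threshold)


# With a hypothetical 200MHz threshold, a 100 MHz step would be skipped
# while a 700 MHz step would still be applied.
print(is_ignored("1300MHz", "1400MHz", "200MHz"))
print(is_ignored("1300MHz", "2GHz", "200MHz"))
```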

Examples of region.options files are in the test/config directory. The region.options.extra file contains all supported settings.

--------------------------------------------------------------------------------
#     6] Content of test folder and example application run                    #
--------------------------------------------------------------------------------
* source codes
	* test.cpp
		* One region with two other regions nested inside.
		* This test uses the shared Score-P/MERIC API. See section 2 of this README.
	* test_mpi.cpp
		* Same test as test.cpp, only extended with MPI.
	* fort_test.f90
		* Fortran version of test.cpp to show the MERIC and READEX Fortran interface.
	* samples_test.cpp
		* Test with sleep (minimum energy consumption) and compute (much higher energy consumption) regions in a loop. This lets the user see in the list of samples how the energy rises when the compute region starts, and check whether MERIC detects this sample as the first one of the region.
		* Follow the instructions inside the file to set up the test.
	* sleep_test.cpp
		* A test with RUN (maximum load) and SLEEP (minimum load) regions, both with the same runtime.
		* Originally made to check the real CPU frequency of the machine.
	* overhead_test.cpp
		* Test to measure the overhead of MERIC and of the libraries used to change environment parameters.
	* blas_test.cpp
		* Test that compares DGEMM and DGEMV. There are two possible matrix sizes: large (does not fit in the L3 cache) and small (fits in the L3 cache). In both cases, the matrix sizes were set so that the DGEMM and DGEMV regions take approximately the same time when using all available resources.
		* This test requires the MKL library, so it is compiled separately from the other tests using `make blasTest`.
* Makefile
	* Command `make` compiles all test codes except blas_test.cpp.
	* To compile blas_test.cpp use the `make blasTest` command.
* environment_default.source
	* Basic script that sets the chosen MERIC environment variables and reports which variables are set.
	* When run with the argument `-t`, the script only prints the list of set variables.
	* Make a copy of this script and edit it to suit your needs.
* config directory
	* region.options
		* File that sets exact settings for regions inside your code.
		* By default it is set up for blas_test.
	* region.options.extra
		* Configuration file that shows all available ways to specify MERIC settings.
* run.sh and run-mpi.sh
	* Scripts that run the MERIC analysis of test.cpp or test_mpi.cpp on the Taurus machine.
* run-jetson.sh
	* Template script to submit a job on the BSC ARM Jetson platform.
	

Specify the mandatory MERIC parameters and run your instrumented application or one from the test directory. A good starting point is running the `test` application from the test directory. To understand MERIC's output well, explore the source file **test.cpp** to see that this test contains regions A, B and C (inside A are two regions B and one region C). For each region you can set the CPU core and uncore frequencies and the number of threads.
```
export MERIC_FREQUENCY=0         # no CPU core frequency tuning
export MERIC_UNCORE_FREQUENCY=0  # no CPU uncore frequency tuning
export MERIC_NUM_THREADS=0       # non-OpenMP application
export MERIC_MODE=6              # time measurement only

./test                           # run the application as usual
```

--------------------------------------------------------------------------------
#      7] Code dynamism investigation                                          #
--------------------------------------------------------------------------------
MERIC's output is stored in a directory named `mericMeasurement` by default, which contains result files in folders named after the regions. To change the output folder name, export MERIC_OUTPUT_DIR="NEW_NAME".
Each CSV file carries 3 types of data:
 *  CALLTREE - the first line of every CSV file; it is a call stack that shows where the measured region is nested
 *  Section label (e.g. '# Job info') - determines the "category" of the data that follow in the file
 *  Data - tuples (mostly pairs) structured like a hash map: key, value
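
As a rough illustration of that layout, the Python sketch below splits such a file into its call tree line and its labelled sections. The sample text, the comma delimiter, the key names, and the function name `parse_meric_csv` are assumptions for illustration; check a real output file before relying on them.

```python
import csv
import io


def parse_meric_csv(text):
    """Split MERIC-style CSV text into (calltree_line, {section_label: rows})."""
    lines = text.splitlines()
    calltree = lines[0] if lines else ""
    sections = {}
    current = None
    for row in csv.reader(io.StringIO("\n".join(lines[1:]))):
        if not row or not any(cell.strip() for cell in row):
            continue  # skip blank lines
        first = row[0].strip()
        if first.startswith("#"):
            # A section label such as "# Job info" opens a new category.
            current = first.lstrip("#").strip()
            sections[current] = []
        elif current is not None:
            # Data rows are key/value tuples belonging to the current section.
            sections[current].append([cell.strip() for cell in row])
    return calltree, sections


# Hypothetical sample; the real call tree line and keys will look different.
sample = (
    "CALLTREE,main,A\n"
    "# Job info\n"
    "key_one,1\n"
    "# Data\n"
    "key_two,2.5\n"
)

calltree, sections = parse_meric_csv(sample)
print(calltree)
print(sections)
```

For real analysis the RADAR tool mentioned in this section remains the intended route; a parser like this is only useful for a quick look at a single file.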

To find the best settings for each region, run your code with several possible settings. The content of MERIC's output directories can be analysed using our RADAR tool, which generates a MERIC configuration file for production runs of the application and a LaTeX report describing the application behavior.

	# Run the application in each configuration you want to test - example scripts `run.sh` and `run-mpi.sh` are provided in the test directory
		```
		export MERIC_MODE=1
		for thread in {24..1..1}
		do
			for cpu_freq in {25..12..1}
			do
				for uncore_freq in {30..12..1} # or {30..12..2}
				do
					for iter in {0..3}
					do
						export MERIC_NUM_THREADS=$thread
						export MERIC_FREQUENCY=${cpu_freq}00MHz
						export MERIC_UNCORE_FREQUENCY=${uncore_freq}00MHz
						export MERIC_ITERATION=$iter
						./test
					done
				done
			done
		done
		```
	
	# Edit the description file `measurementInfo.json` in your output data folder. This step is optional, but the file helps you keep track of what you have measured.
		```
		{
			"Timestamp" : "Thu Dec  6 15:37:23 2018",
			"System"    : "IT4I Salomon",
			"DataFormat": "node_CF_UnCF_thrds",
			"Note"      : ""
		}
		```

	# Process the results using RADAR tool
		Repository URL: https://code.it4i.cz/bes0030/readex-radar.git
		1) Set variables in config.py file (description is included in the file)
		2) Launch python3 ./printFullReport.py -configFile path/to/config.py
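
The nested loop in the first step visits every combination of settings, so the number of runs multiplies quickly. The Python sketch below enumerates the same ranges to show how many runs the sweep implies and how much a coarser uncore step saves. The helper name `sweep` is hypothetical; the environment variable names are the ones used in the loop.

```python
from itertools import product


def sweep(threads, core_100mhz, uncore_100mhz, iterations):
    """Yield one environment dictionary per run of the sweep."""
    for t, cf, ucf, it in product(threads, core_100mhz, uncore_100mhz, iterations):
        yield {
            "MERIC_NUM_THREADS": str(t),
            "MERIC_FREQUENCY": f"{cf}00MHz",
            "MERIC_UNCORE_FREQUENCY": f"{ucf}00MHz",
            "MERIC_ITERATION": str(it),
        }


# Same ranges as the shell loop: threads 24..1, core 25..12, uncore 30..12, iterations 0..3.
full = list(sweep(range(24, 0, -1), range(25, 11, -1), range(30, 11, -1), range(4)))
# Stepping the uncore frequency by 2 (30, 28, ..., 12) roughly halves the work.
coarse = list(sweep(range(24, 0, -1), range(25, 11, -1), range(30, 11, -2), range(4)))

print(len(full), len(coarse))  # 25536 13440
```

With runs that take more than a few seconds each, it is usually worth narrowing the ranges to the settings the target machine actually supports before launching the full sweep.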


--------------------------------------------------------------------------------
#      8] MERIC with a Fortran code                                            #
--------------------------------------------------------------------------------
The include directory contains a Fortran module and interface. The module is compiled separately from MERIC, using the `make fortran` command.

For manual instrumentation one can use the shared READEX interface (`#include "readex.inc"`) or just MERIC's API (`#include "meric.inc"`). If the MERIC interface is used, the keyword `call` should be used, as is usual in Fortran; however, the READEX interface (as well as the Score-P Fortran interface) should be used without it. The MERIC repository contains a Fortran code example, `test/fort_test.f90`, that shows how the interfaces can be used.
	
--------------------------------------------------------------------------------
#      9] Using MERIC on BSC ARM systems                                       #
--------------------------------------------------------------------------------
To compile MERIC on Jetson, use the gcc and PAPI modules and the Makefile option arm. The only difference when running a test on Jetson is the energy measurement. A Python energy measurement script runs in the background on the node and affects the CPU. A rate of 10 samples per second was selected in the measurement script as a compromise: this sampling rate takes ~2% of the CPU load. The rate can be changed in tools/getJTX1measurements.py.

Export MERIC_MODE=4 to activate MERIC on Jetson; otherwise it is not possible to change frequencies and measure energy consumption. Since Jetson has no non-continual energy measurement interface, exporting MERIC_CONTINUAL=0 turns off the energy consumption measurement. To start the energy measurement, export MERIC_CONTINUAL=1.

ARM core and uncore frequencies are much lower than Haswell's. To make these frequencies easy to set, input values are given in kHz. The default frequencies are 518400 kHz (core) and 408000 kHz (uncore). It is recommended to choose frequencies from the lists provided by the administrators, see:

	core: /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies [kHz]
	102000 204000 307200 403200 518400 614400 710400 825600 921600 1036800 1132800 1224000 1326000

	uncore:	/sys/kernel/debug/clock/emc/possible_rates [kHz]
	40800 68000 102000 204000 408000 665600 800000 1065600 1331200 1600000
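
Because only the listed values are recommended, it can help to snap a desired frequency to the nearest listed one before exporting it. The Python sketch below does that for the two lists above. The helper names `read_rates` and `nearest_rate` are hypothetical, and whether MERIC performs any rounding of its own is not assumed here.

```python
def read_rates(path):
    """Read a whitespace-separated list of frequencies in kHz from a system file."""
    with open(path) as handle:
        return [int(token) for token in handle.read().split()]


def nearest_rate(requested_khz, available_khz):
    """Return the listed frequency closest to the request (ties go to the lower one)."""
    return min(available_khz, key=lambda rate: (abs(rate - requested_khz), rate))


# Values copied from the lists above, in kHz.
CORE_KHZ = [102000, 204000, 307200, 403200, 518400, 614400, 710400,
            825600, 921600, 1036800, 1132800, 1224000, 1326000]
UNCORE_KHZ = [40800, 68000, 102000, 204000, 408000, 665600, 800000,
              1065600, 1331200, 1600000]

print(nearest_rate(500000, CORE_KHZ))     # 518400
print(nearest_rate(1000000, UNCORE_KHZ))  # 1065600
```

On the machine itself, `read_rates("/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies")` would supply the core list instead of the hard-coded values.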

Another supported ARM system is ThunderX. This system is much more powerful than Jetson/TX1 and it has an energy measurement system that does not affect the CPUs. Its measurement system measures the energy consumed by all available nodes (one must allocate all four nodes), and its sampling frequency is approximately 4 samples per second. Unfortunately, frequency scaling is not supported. To run MERIC on ThunderX, export MERIC_CONTINUAL=1 and MERIC_MODE=5.

On BSC ARM systems modules can be loaded on the login node only, so they must be loaded before running a job. See the run-jetson.sh script in the test directory, which shows how to run a test on Jetson.

--------------------------------------------------------------------------------
#     10] Using MERIC on CINECA D.A.V.I.D.E. system                            #
--------------------------------------------------------------------------------
To enable the energy measurement provided by the DiG system in MERIC, the [REST-client library](https://github.com/mrtazz/restclient-cpp) must be available on the target system. MERIC and the tuned applications must be compiled with the library.

The CPU core frequencies available on IBM Power8+ are: 4.02, 3.99, 3.96, 3.92, 3.89, 3.86, 3.82, 3.79, 3.76, 3.72, 3.69, 3.66, 3.62, 3.59, 3.56, 3.52, 3.49, 3.46, 3.42, 3.39, 3.36, 3.33, 3.29, 3.26, 3.23, 3.19, 3.16, 3.13, 3.09, 3.06, 3.03, 2.99, 2.96, 2.93, 2.89, 2.86, 2.83, 2.79, 2.76, 2.73, 2.69, 2.66, 2.63, 2.59, 2.56, 2.53, 2.49, 2.46, 2.43, 2.39, 2.36, 2.33, 2.29, 2.26, 2.23, 2.19, 2.16, 2.13, 2.09, 2.06 GHz. No extra library is necessary for frequency tuning.

--------------------------------------------------------------------------------
#     11] Tool for static tuning                                               #
--------------------------------------------------------------------------------
The MERIC repository also contains a tool, based on the MERIC source code, for static energy measurement and CPU frequency setting. It is located in tools/staticMERICtool/ and is compiled separately from the MERIC library.

The binaries energyMeasureStart and energyMeasureStop provide RAPL energy measurement for a single node (similar to the HDEEM command-line tools startHdeem and stopHdeem). For measurement on several nodes, the multiNodeStaticMeasureStart/Stop.sh scripts are located in the same directory. Since there is no multi-node HDEEM measurement tool, these scripts support both energy measurement interfaces. To select which one should be used, the scripts take one argument, `--rapl` or `--hdeem`.

For a static analysis of a selected application, the tool directory also contains the `staticAnalysis.sh` bash script. It not only runs the application in a variety of available HW settings, but also stores the results in a format similar to MERIC's, so the results can be analysed using the RADAR library.

--------------------------------------------------------------------------------
#     12] Acknowledgement                                                      #
--------------------------------------------------------------------------------
MERIC is being developed at [IT4Innovations National Supercomputing Center](https://www.it4i.cz/) under [BSD-3 license](https://code.it4i.cz/vys0053/meric/blob/master/LICENSE).
Please open an issue if you encounter any problem.

	
To reference MERIC, please cite: **[MERIC and RADAR Generator: Tools for Energy Evaluation and Runtime Tuning of HPC Applications](https://link.springer.com/chapter/10.1007/978-3-319-97136-0_11)**.
