Commit d0e6f7a6 authored by Lukáš Krupčík

repair external and internal links

parent e52e6213
@@ -230,18 +230,19 @@ During the compilation Intel compiler shows which loops have been vectorized in

Some interesting compiler flags useful not only for code debugging are:
!!! Note "Note"
    Debugging
    openmp_report[0|1|2] - controls the OpenMP parallelizer diagnostic level
    vec-report[0|1|2] - controls the compiler-based vectorization diagnostic level

    Performance optimization
    xhost - FOR HOST ONLY - to generate AVX (Advanced Vector Extensions) instructions.
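For illustration, a hedged example of a host-only build that enables these diagnostics (the source and output file names are placeholders, and the flag spellings assume the Intel C++ Compiler 13 generation):

```bash
# host build with OpenMP enabled, AVX code generation, and both diagnostic reports
$ icc -openmp -xhost -vec-report2 -openmp-report2 source.c -o binary.x
```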
Automatic Offload using Intel MKL Library
-----------------------------------------

Intel MKL includes an Automatic Offload (AO) feature that enables computationally intensive MKL functions called in user code to benefit from attached Intel Xeon Phi coprocessors automatically and transparently.
The behavior of the automatic offload mode is controlled by functions called within the program or by environment variables. A complete list of controls is available [here](http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/GUID-3DC4FC7D-A1E4-423D-9C0C-06AB265FFA86.htm)![external](../../img/external.png).
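As a hedged sketch of the environment-variable route (the variable names follow the MKL 11.x Automatic Offload controls; the work-division value is only an example):

```bash
# enable Automatic Offload without modifying the source code
$ export MKL_MIC_ENABLE=1
# optionally print a per-call report of what was offloaded
$ export OFFLOAD_REPORT=2
# optionally fix the host/coprocessor work split instead of letting MKL decide
$ export MKL_MIC_WORKDIVISION=0.5
```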
The Automatic Offload may be enabled by either an MKL function call within the code:

@@ -325,7 +326,8 @@ Following example show how to automatically offload an SGEMM (single precision -
}
```
!!! Note "Note"
    Please note: This example is a simplified version of an example from MKL. The expanded version can be found here: **$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c**

To compile the code using the Intel compiler, use:
@@ -367,7 +369,8 @@ To compile a code user has to be connected to a compute with MIC and load Intel
$ module load intel/13.5.192
```
!!! Note "Note"
    Please note that a particular version of the Intel module is specified. This information is used later to specify the correct library paths.

To produce a binary compatible with the Intel Xeon Phi architecture, the user has to specify the "-mmic" compiler flag. Two compilation examples are shown below. The first example shows how to compile the OpenMP parallel code "vect-add.c" for the host only:
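A condensed, hedged sketch of the two builds described above (the OpenMP flag spelling and the output binary names are assumptions):

```bash
# host-only build: host instruction set, no MIC code generated
$ icc -openmp -xhost vect-add.c -o vect-add-host
# native Xeon Phi build: cross-compile with the -mmic flag
$ icc -openmp -mmic vect-add.c -o vect-add-mic
```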
@@ -409,7 +412,8 @@ If the code is parallelized using OpenMP a set of additional libraries is requir
mic0 $ export LD_LIBRARY_PATH=/apps/intel/composer_xe_2013.5.192/compiler/lib/mic:$LD_LIBRARY_PATH
```
!!! Note "Note"
    Please note that the path exported in the previous example contains a path to a specific compiler (here the version is 5.192). This version number has to match the version number of the Intel compiler module that was used to compile the code on the host computer.

For your information, the list of libraries and their locations required for execution of an OpenMP parallel code on Intel Xeon Phi is:
@@ -494,7 +498,8 @@ After executing the complied binary file, following output should be displayed.
...
```
!!! Note "Note"
    More information about this example can be found on the Intel website: <http://software.intel.com/en-us/vcsource/samples/caps-basic/>![external](../../img/external.png)

The second example, which can be found in the "/apps/intel/opencl-examples" directory, is General Matrix Multiply. You can follow the same procedure to download the example to your directory and compile it.
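A hypothetical sketch of that procedure (the sub-directory and source file names are assumptions, not taken from the examples package):

```bash
# copy the example out of the system directory and build it with the GNU compiler
$ cp -r /apps/intel/opencl-examples/GEMM ~/GEMM
$ cd ~/GEMM
$ g++ -o gemm gemm.cpp -lOpenCL   # hypothetical file names; link against the OpenCL runtime
```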
@@ -533,7 +538,8 @@ To see the performance of Intel Xeon Phi performing the DGEMM run the example as
...
```
!!! Note "Note"
    Please note: The GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load the Intel compiler module.
MPI
-----------------

@@ -595,11 +601,12 @@ An example of basic MPI version of "hello-world" example in C language, that can

Intel MPI for the Xeon Phi coprocessors offers different MPI programming models:
!!! Note "Note"
    **Host-only model** - all MPI ranks reside on the host. The coprocessors can be used via offload pragmas. (Using MPI calls inside offloaded code is not supported.)

    **Coprocessor-only model** - all MPI ranks reside only on the coprocessors.

    **Symmetric model** - the MPI ranks reside on both the host and the coprocessor. This is the most general MPI case; a symmetric launch is sketched just below.
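For the symmetric model, a hedged sketch of a single launch that combines a host binary and a MIC binary (the host binary name "mpi-test" is an assumption; the MIC-side library path matches the examples later on this page):

```bash
# MPMD launch: 4 ranks on the host node and 4 ranks on its coprocessor
$ mpirun -host cn204 -n 4 ~/mpi-test \
       : -host cn204-mic0 -n 4 -env LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ ~/mpi-test-mic
```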
###Host-only model

@@ -641,9 +648,11 @@ Similarly to execution of OpenMP programs in native mode, since the environmenta
export PATH=/apps/intel/impi/4.1.1.036/mic/bin/:$PATH
```
!!! Note "Note"
    Please note:

    - this file sets up environment variables for both the MPI and the OpenMP libraries.
    - this file sets up the paths to a particular version of the Intel MPI library and a particular version of the Intel compiler. These versions have to match the loaded modules.

To access a MIC accelerator located on the node that the user is currently connected to, use:
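Based on the ssh commands shown later on this page, this is typically:

```bash
# connect to the first MIC accelerator of the current node
$ ssh mic0
```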
@@ -694,9 +703,10 @@ or using mpirun
$ mpirun -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ -host mic0 -n 4 ~/mpi-test-mic
```
!!! Note "Note"
    Please note:

    - the full path to the binary has to be specified (here: "**~/mpi-test-mic**")
    - the LD_LIBRARY_PATH has to match the Intel MPI module used to compile the MPI code

The output should again be similar to:
@@ -707,7 +717,8 @@ The output should be again similar to:
Hello world from process 0 of 4 on host cn207-mic0
```
!!! Note "Note"
    Please note that **"mpiexec.hydra"** requires a file on the MIC filesystem. If the file is missing please contact the system administrators. A simple test to see if the file is present is to execute:
```bash
$ ssh mic0 ls /bin/pmi_proxy

@@ -739,10 +750,12 @@ For example:
This output means that the PBS allocated nodes cn204 and cn205, which means that the user has direct access to the "**cn204-mic0**" and "**cn205-mic0**" accelerators.

!!! Note "Note"
    Please note: At this point the user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:

    - to connect to the second node: **$ ssh cn205**
    - to connect to the accelerator on the first node from the first node: **$ ssh cn204-mic0** or **$ ssh mic0**
    - to connect to the accelerator on the second node from the first node: **$ ssh cn205-mic0**

At this point we expect that the correct modules are loaded and the binary is compiled. For parallel execution the mpiexec.hydra is used. Again, the first step is to tell mpiexec that the MPI can be executed on MIC accelerators by setting the environment variable "I_MPI_MIC":
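A minimal sketch of that step, assuming the standard Intel MPI control variable:

```bash
# allow mpiexec.hydra to start MPI ranks on the MIC coprocessors
$ export I_MPI_MIC=1
```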
@@ -869,16 +882,19 @@ A possible output of the MPI "hello-world" example executed on two hosts and two
Hello world from process 7 of 8 on host cn205-mic0
```
!!! Note "Note"
    Please note: At this point the MPI communication between MIC accelerators on different nodes uses 1Gb Ethernet only.

**Using the PBS automatically generated node-files**
PBS also generates a set of node-files that can be used instead of manually creating a new one every time. Three node-files are generated:
!!! Note "Note"
    **Host only node-file:**

    - /lscratch/${PBS_JOBID}/nodefile-cn

    **MIC only node-file:**

    - /lscratch/${PBS_JOBID}/nodefile-mic

    **Host and MIC node-file:**

    - /lscratch/${PBS_JOBID}/nodefile-mix
Please note that each host or accelerator is listed only once per file. The user has to specify how many jobs should be executed per node using the "-n" parameter of the mpirun command.
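As a hedged example of using one of these files (the MIC-only node-file and the total rank count are illustrative choices):

```bash
# run 8 ranks of the MIC binary across the accelerators listed in the node-file
$ mpirun -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ \
         -machinefile /lscratch/${PBS_JOBID}/nodefile-mic \
         -n 8 ~/mpi-test-mic
```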
...

@@ -95,7 +95,8 @@ In this example, we demonstrate recommended way to run an MPI application, using

### OpenMP thread affinity

!!! Note "Note"
    Important! Bind every OpenMP thread to a core!
In the previous two examples with one or two MPI processes per node, the operating system might still migrate OpenMP threads between cores. You might want to avoid this by setting these environment variables for GCC OpenMP:
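For instance (the core range is illustrative and should match the cores assigned to the job):

```bash
# GCC OpenMP: pin threads to an explicit set of cores
$ export GOMP_CPU_AFFINITY="0-15"
```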
...

@@ -129,7 +129,8 @@ Consider these ways to run an MPI program:
**Two MPI** processes per node, using 12 threads each, bound to a processor socket is most useful for memory bandwidth bound applications such as BLAS1 or FFT with scalable memory demand. However, note that the two processes will share access to the network interface. The 12 threads and socket binding should ensure maximum memory access bandwidth and minimize communication, migration and NUMA effect overheads.
!!! Note "Note"
    Important! Bind every OpenMP thread to a core!

In the previous two cases with one or two MPI processes per node, the operating system might still migrate OpenMP threads between cores. You want to avoid this by setting the KMP_AFFINITY or GOMP_CPU_AFFINITY environment variables.
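A hedged example for the Intel OpenMP runtime (the right affinity policy depends on the application):

```bash
# Intel OpenMP runtime: bind each thread to a single hardware thread, packing them close together
$ export KMP_AFFINITY=granularity=fine,compact
```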
...

@@ -97,10 +97,10 @@ Download the package [parallell](package-parallel-vignette.pdf) vignette.
Forking is the simplest to use. The forking family of functions provides a parallelized, drop-in replacement for the serial apply() family of functions.

!!! Note "Note"
    Forking via package parallel provides functionality similar to the OpenMP construct omp parallel for.

    Only cores of a single node can be utilized this way!
Forking example:

...