Commit e4dc6645 authored by David Hrbáč

Please-notes go to /dev/null

parent 201844d6
Pipeline #1880 passed with stages in 1 minute and 3 seconds
......@@ -216,17 +216,18 @@ $ qsub -N JOBNAME jobscript
In this example, we submit a job of 101 tasks. 16 input files will be processed in parallel. The 101 tasks on 16 cores are assumed to complete in less than 2 hours.
Please note the #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue.
!!! Hint
Use #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue.
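For illustration, the directives at the top of the jobscript might look like the following sketch. The resource string and the queue name are assumptions (adjust them to your project and target cluster); only the general shape is meant:

```bash
#!/bin/bash
#PBS -A PROJECT_ID                            # replace with your valid project ID
#PBS -q qprod                                 # desired queue (assumed here)
#PBS -l select=1:ncpus=16,walltime=02:00:00   # one full 16-core node for up to 2 hours
```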
## Job Arrays and GNU Parallel
!!! Note
Combine the Job arrays and GNU parallel for best throughput of single core jobs
While job arrays are able to utilize all available computational nodes, GNU parallel can be used to efficiently run multiple single-core jobs on a single node. The two approaches may be combined to utilize all available (current and future) resources to execute single-core jobs.
!!! Note
Every subjob in an array runs GNU parallel to utilize all cores on the node
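As a minimal illustration of the single-node pattern (a sketch only, assuming a 16-core node and the `tasklist`/`myprog.x` naming used in the shared-jobscript example below):

```bash
# run one single-core task per core; GNU parallel keeps all 16 cores busy
$ cat tasklist | parallel -j 16 ./myprog.x {}
```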
### GNU Parallel, Shared jobscript
......@@ -281,7 +282,7 @@ cp output $PBS_O_WORKDIR/$TASK.out
In this example, the jobscript executes in multiple instances in parallel, on all cores of a computing node. The variable $TASK expands to one of the input filenames from tasklist. We copy the input file to local scratch, execute myprog.x, and copy the output file back to the submit directory, under the $TASK.out name. The numtasks file controls how many tasks will be run per subjob. Once a task is finished, a new task starts, until the number of tasks in the numtasks file is reached.
!!! Note
Select subjob walltime and number of tasks per subjob carefully
When deciding these values, consider the following guiding rules:
......@@ -300,7 +301,8 @@ $ qsub -N JOBNAME -J 1-992:32 jobscript
In this example, we submit a job array of 31 subjobs. Note the -J 1-992:**32**; this must be the same as the number written to the numtasks file. Each subjob will run on a full node and process 16 input files in parallel, 32 in total per subjob. Every subjob is assumed to complete in less than 2 hours.
Please note the #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue.
!!! Hint
Use #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue.
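In other words, the array step and the numtasks file have to agree (a sketch, reusing the job array example above):

```bash
# tasks per subjob must equal the array step used in -J (here 32)
$ echo 32 > numtasks
$ qsub -N JOBNAME -J 1-992:32 jobscript
```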
## Examples
......
......@@ -233,9 +233,12 @@ The resources that are currently subject to accounting are the core hours. The c
PRACE users should check their project accounting using the [PRACE Accounting Tool (DART)](http://www.prace-ri.eu/accounting-report-tool/).
Users who have undergone the full local registration procedure (including signing the IT4Innovations Acceptable Use Policy) and who have received a local password may check at any time how many core-hours have been consumed by themselves and their projects using the command "it4ifree". Please note that you need to know your user password to use the command and that the displayed core hours are "system core hours", which differ from PRACE "standardized core hours".
Users who have undergone the full local registration procedure (including signing the IT4Innovations Acceptable Use Policy) and who have received a local password may check at any time how many core-hours have been consumed by themselves and their projects using the command "it4ifree".
!!! Note
You need to know your user password to use the command. The displayed core hours are "system core hours", which differ from PRACE "standardized core hours".
!!! Hint
The **it4ifree** command is a part of it4i.portal.clients package, located here: <https://pypi.python.org/pypi/it4i.portal.clients>
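If the command is not already available in your environment, the package can typically be installed from PyPI (a sketch only; the exact installation procedure on the clusters may differ):

```bash
# install the it4i.portal.clients package into the user's home directory (sketch)
$ pip install --user it4i.portal.clients
```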
```bash
......
......@@ -192,7 +192,7 @@ $ module load virtualgl/2.4
$ vglrun glxgears
```
Please note that if you want to run an OpenGL application which is available through modules, you need to first load the respective module. E.g., to run the **Mentat** OpenGL application from the **MARC** software package use:
If you want to run an OpenGL application which is available through modules, you need to first load the respective module. E.g., to run the **Mentat** OpenGL application from the **MARC** software package use:
```bash
$ module load marc/2013.1
......
......@@ -102,7 +102,10 @@ To use the Berkley UPC compiler and runtime environment to run the binaries use
By default, the "smp" UPC network is used. This is a very quick and easy way for testing/debugging, but it is limited to one node only.
For production runs, it is recommended to use the native InfiniBand implementation of the UPC network, "ibv". For testing/debugging using multiple nodes, the "mpi" UPC network is recommended. Please note that **the selection of the network is done at compile time** and not at runtime (as one might expect)!
For production runs, it is recommended to use the native InfiniBand implementation of the UPC network, "ibv". For testing/debugging using multiple nodes, the "mpi" UPC network is recommended.
!!! Warning
Selection of the network is done at compile time, not at runtime (as one might expect)!
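For illustration, a hedged sketch of how the network would be selected when compiling with the Berkeley UPC compiler (`hello.upc` is a placeholder source file; check the loaded module's documentation for the exact option syntax):

```bash
# the UPC network is fixed when the binary is built -- one binary per network
$ upcc -network=smp hello.upc -o hello-smp    # single-node testing/debugging
$ upcc -network=ibv hello.upc -o hello-ibv    # production runs over InfiniBand
$ upcc -network=mpi hello.upc -o hello-mpi    # multi-node testing/debugging
```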
Example UPC code:
......
......@@ -91,8 +91,8 @@ To debug a serial code use:
To debug a parallel code compiled with **OpenMPI** you need to set up your TotalView environment:
!!! Note
**Please note:** To be able to run the parallel debugging procedure from the command line without stopping the debugger in the mpiexec source code, you have to add the following function to your **~/.tvdrc** file:
!!! Hint
To be able to run the parallel debugging procedure from the command line without stopping the debugger in the mpiexec source code, you have to add the following function to your `~/.tvdrc` file:
```bash
proc mpi_auto_run_starter {loaded_id} {
......
......@@ -103,7 +103,10 @@ For debugging purposes it is also recommended to set environment variable "OFFLO
export OFFLOAD_REPORT=3
```
A very basic example of code that employs the offload programming technique is shown in the next listing. Please note that this code is sequential and utilizes only a single core of the accelerator.
A very basic example of code that employs the offload programming technique is shown in the next listing.
!!! Note
This code is sequential and utilizes only a single core of the accelerator.
```bash
$ vim source-offload.cpp
......@@ -327,7 +330,7 @@ Following example show how to automatically offload an SGEMM (single precision -
```
!!! Note
Please note: This example is a simplified version of an example from MKL. The expanded version can be found here: **$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c**
This example is a simplified version of an example from MKL. The expanded version can be found here: `$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c`.
To compile a code using Intel compiler use:
......@@ -370,7 +373,7 @@ To compile a code user has to be connected to a compute with MIC and load Intel
```
!!! Note
Please note that a particular version of the Intel module is specified. This information is used later to specify the correct library paths.
A particular version of the Intel module is specified. This information is used later to specify the correct library paths.
To produce a binary compatible with the Intel Xeon Phi architecture, the user has to specify the "-mmic" compiler flag. Two compilation examples are shown below. The first example shows how to compile OpenMP parallel code "vect-add.c" for the host only:
......@@ -413,7 +416,7 @@ If the code is parallelized using OpenMP a set of additional libraries is requir
```
!!! Note
Please note that the path exported in the previous example contains a path to a specific compiler (here the version is 5.192). This version number has to match the version number of the Intel compiler module that was used to compile the code on the host computer.
The path exported in the previous example contains a path to a specific compiler (here the version is 5.192). This version number has to match the version number of the Intel compiler module that was used to compile the code on the host computer.
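A quick way to sanity-check the match (a sketch; module naming on the system may differ):

```bash
# compare the loaded Intel module against the compiler version in the exported path
$ module list 2>&1 | grep -i intel
$ echo $LD_LIBRARY_PATH | tr ':' '\n' | grep 5.192
```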
For your information, the list of libraries and their locations required for execution of an OpenMP parallel code on Intel Xeon Phi is:
......@@ -538,8 +541,8 @@ To see the performance of Intel Xeon Phi performing the DGEMM run the example as
...
```
!!! Note
Please note: The GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load the Intel compiler module.
!!! Warning
The GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load the Intel compiler module.
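For illustration only, such a build line might look like the following sketch (the source file name is a placeholder and the OpenCL library is assumed to be on the default search path):

```bash
# build an OpenCL host program with g++ alone, linking against libOpenCL
$ g++ my-opencl-example.cpp -lOpenCL -o my-opencl-example
```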
## MPI
......@@ -648,9 +651,8 @@ Similarly to execution of OpenMP programs in native mode, since the environmenta
```
!!! Note
Please note:
- this file sets up the environment variables for both the MPI and OpenMP libraries.
- this file sets up the paths to a particular version of the Intel MPI library and a particular version of the Intel compiler. These versions have to match the loaded modules.
- this file sets up the environment variables for both the MPI and OpenMP libraries.
- this file sets up the paths to a particular version of the Intel MPI library and a particular version of the Intel compiler. These versions have to match the loaded modules.
To access a MIC accelerator located on a node that the user is currently connected to, use:
......@@ -702,9 +704,8 @@ or using mpirun
```
!!! Note
Please note:
- the full path to the binary has to be specified (here: **~/mpi-test-mic**)
- the LD_LIBRARY_PATH has to match the Intel MPI module used to compile the MPI code
- the full path to the binary has to be specified (here: `~/mpi-test-mic`)
- the `LD_LIBRARY_PATH` has to match the Intel MPI module used to compile the MPI code
The output should be again similar to:
......@@ -716,7 +717,9 @@ The output should be again similar to:
```
!!! Note
Please note that **"mpiexec.hydra"** requires a file on the MIC filesystem. If the file is missing, please contact the system administrators. A simple test to see if the file is present is to execute:
`mpiexec.hydra` requires a file on the MIC filesystem. If the file is missing, please contact the system administrators.
A simple test to see if the file is present is to execute:
```bash
$ ssh mic0 ls /bin/pmi_proxy
......@@ -749,11 +752,10 @@ For example:
This output means that PBS allocated nodes cn204 and cn205, which means that the user has direct access to the "**cn204-mic0**" and "**cn205-mic0**" accelerators.
!!! Note
Please note: At this point the user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:
- to connect to the second node: **$ ssh cn205**
- to connect to the accelerator on the first node from the first node: **$ ssh cn204-mic0** or **$ ssh mic0**
- to connect to the accelerator on the second node from the first node: **$ ssh cn205-mic0**
At this point the user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:
- to connect to the second node: `$ ssh cn205`
- to connect to the accelerator on the first node from the first node: `$ ssh cn204-mic0` or `$ ssh mic0`
- to connect to the accelerator on the second node from the first node: `$ ssh cn205-mic0`
At this point we expect that the correct modules are loaded and the binary is compiled. For parallel execution mpiexec.hydra is used. Again, the first step is to tell mpiexec that MPI can be executed on MIC accelerators by setting the environment variable "I_MPI_MIC":
......@@ -882,7 +884,7 @@ A possible output of the MPI "hello-world" example executed on two hosts and two
```
!!! Note
Please note: At this point the MPI communication between MIC accelerators on different nodes uses 1Gb Ethernet only.
At this point the MPI communication between MIC accelerators on different nodes uses 1Gb Ethernet only.
**Using the PBS automatically generated node-files**
......@@ -895,7 +897,7 @@ PBS also generates a set of node-files that can be used instead of manually crea
- /lscratch/${PBS_JOBID}/nodefile-mic Host and MIC node-file:
- /lscratch/${PBS_JOBID}/nodefile-mix
Please note that each host or accelerator is listed only once per file. The user has to specify how many jobs should be executed per node using the "-n" parameter of the mpirun command.
Each host or accelerator is listed only once per file. The user has to specify how many jobs should be executed per node using the `-n` parameter of the mpirun command.
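For illustration, a hedged sketch of using one of the generated node-files (the process count is a placeholder; `~/mpi-test-mic` is the binary used in the examples above):

```bash
# start one MPI process per accelerator listed in the PBS-generated MIC node-file
$ mpirun -machinefile /lscratch/${PBS_JOBID}/nodefile-mic -n 2 ~/mpi-test-mic
```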
## Optimization
......
......@@ -134,7 +134,7 @@ The last part of the configuration is done directly in the user Matlab script be
This script creates scheduler object "cluster" of type "local" that starts workers locally.
!!! Note
Please note: Every Matlab script that needs to initialize/use matlabpool has to contain these three lines prior to calling parpool(sched, ...) function.
Every Matlab script that needs to initialize/use matlabpool has to contain these three lines prior to calling parpool(sched, ...) function.
The last step is to start matlabpool with "cluster" object and correct number of workers. We have 24 cores per node, so we start 24 workers.
......@@ -217,7 +217,8 @@ You can start this script using batch mode the same way as in Local mode example
This method is a "hack" invented by us to emulate the mpiexec functionality found in previous MATLAB versions. We leverage the MATLAB Generic Scheduler interface, but instead of submitting the workers to PBS, we launch the workers directly within the running job, thus we avoid the issues with master script and workers running in separate jobs (issues with license not available, waiting for the worker's job to spawn etc.)
Please note that this method is experimental.
!!! Warning
This method is experimental.
For this method, you need to use the SalomonDirect profile; import it [the same way as SalomonPBSPro](matlab/#running-parallel-matlab-using-distributed-computing-toolbox---engine).
......
......@@ -66,13 +66,11 @@ To test if the MAGMA server runs properly we can run one of examples that are pa
10304 10304 --- ( --- ) 500.70 ( 1.46) ---
```
!!! Note
Please note: MAGMA contains several benchmarks and examples that can be found in:
**$MAGMAROOT/testing/**
!!! Hint
MAGMA contains several benchmarks and examples in `$MAGMAROOT/testing/`
!!! Note
MAGMA relies on the performance of all CPU cores as well as on the performance of the accelerator. Therefore, on Anselm the number of CPU OpenMP threads has to be set to 16:
**export OMP_NUM_THREADS=16**
MAGMA relies on the performance of all CPU cores as well as on the performance of the accelerator. Therefore, on Anselm the number of CPU OpenMP threads has to be set to 16 with `export OMP_NUM_THREADS=16`.
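For example, one of the bundled testers could then be launched like this (a sketch; the available tester names depend on the installed MAGMA version):

```bash
# set the thread count MAGMA expects on Anselm, then run a bundled tester
$ export OMP_NUM_THREADS=16
$ $MAGMAROOT/testing/testing_dgemm
```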
See more details at [MAGMA home page](http://icl.cs.utk.edu/magma/).
......
......@@ -281,9 +281,8 @@ SAXPY function multiplies the vector x by the scalar alpha and adds it to the ve
```
!!! Note
Please note: cuBLAS has its own functions for data transfers between CPU and GPU memory:
- [cublasSetVector](http://docs.nvidia.com/cuda/cublas/index.html#cublassetvector) - transfers data from CPU to GPU memory
cuBLAS has its own functions for data transfers between CPU and GPU memory:
- [cublasSetVector](http://docs.nvidia.com/cuda/cublas/index.html#cublassetvector) - transfers data from CPU to GPU memory
- [cublasGetVector](http://docs.nvidia.com/cuda/cublas/index.html#cublasgetvector) - transfers data from GPU to CPU memory
To compile the code using the NVCC compiler, the "-lcublas" compiler flag has to be specified:
......
......@@ -218,7 +218,8 @@ $ qsub -N JOBNAME jobscript
In this example, we submit a job of 101 tasks. 24 input files will be processed in parallel. The 101 tasks on 24 cores are assumed to complete in less than 2 hours.
Please note the #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue.
!!! Note
Use #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue.
## Job Arrays and GNU Parallel
......@@ -302,7 +303,8 @@ $ qsub -N JOBNAME -J 1-992:32 jobscript
In this example, we submit a job array of 31 subjobs. Note the -J 1-992:**48**; this must be the same as the number written to the numtasks file. Each subjob will run on a full node and process 24 input files in parallel, 48 in total per subjob. Every subjob is assumed to complete in less than 2 hours.
Please note the #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue.
!!! Note
Use #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue.
## Examples
......
......@@ -202,7 +202,8 @@ Generally both shared file systems are available through GridFTP:
More information about the shared file systems is available [here](storage/).
Please note that for PRACE users a "prace" directory is also used on the SCRATCH file system.
!!! Hint
A `prace` directory is also used for PRACE users on the SCRATCH file system.
| Data type | Default path |
| ---------------------------- | ------------------------------- |
......@@ -245,7 +246,7 @@ The resources that are currently subject to accounting are the core hours. The c
PRACE users should check their project accounting using the [PRACE Accounting Tool (DART)](http://www.prace-ri.eu/accounting-report-tool/).
Users who have undergone the full local registration procedure (including signing the IT4Innovations Acceptable Use Policy) and who have received a local password may check at any time how many core-hours have been consumed by themselves and their projects using the command "it4ifree". Please note that you need to know your user password to use the command and that the displayed core hours are "system core hours", which differ from PRACE "standardized core hours".
Users who have undergone the full local registration procedure (including signing the IT4Innovations Acceptable Use Policy) and who have received a local password may check at any time how many core-hours have been consumed by themselves and their projects using the command "it4ifree". You need to know your user password to use the command; the displayed core hours are "system core hours", which differ from PRACE "standardized core hours".
!!! Note
The **it4ifree** command is a part of it4i.portal.clients package, located here: <https://pypi.python.org/pypi/it4i.portal.clients>
......
......@@ -138,7 +138,10 @@ To use the Berkley UPC compiler and runtime environment to run the binaries use
By default, the "smp" UPC network is used. This is a very quick and easy way for testing/debugging, but it is limited to one node only.
For production runs, it is recommended to use the native InfiniBand implementation of the UPC network, "ibv". For testing/debugging using multiple nodes, the "mpi" UPC network is recommended. Please note that the selection of the network is done at compile time and not at runtime (as one might expect)!
For production runs, it is recommended to use the native InfiniBand implementation of UPC network "ibv". For testing/debugging using multiple nodes, the "mpi" UPC network is recommended.
!!! Warning
Selection of the network is done at compile time, not at runtime (as one might expect)!
Example UPC code:
......
......@@ -80,8 +80,8 @@ To debug a serial code use:
To debug a parallel code compiled with **OpenMPI** you need to set up your TotalView environment:
!!! Note
**Please note:** To be able to run the parallel debugging procedure from the command line without stopping the debugger in the mpiexec source code, you have to add the following function to your **~/.tvdrc** file:
!!! Hint
To be able to run the parallel debugging procedure from the command line without stopping the debugger in the mpiexec source code, you have to add the following function to your **~/.tvdrc** file:
```bash
proc mpi_auto_run_starter {loaded_id} {
......
......@@ -103,7 +103,10 @@ For debugging purposes it is also recommended to set environment variable "OFFLO
export OFFLOAD_REPORT=3
```
A very basic example of code that employs the offload programming technique is shown in the next listing. Please note that this code is sequential and utilizes only a single core of the accelerator.
A very basic example of code that employs the offload programming technique is shown in the next listing.
!!! Note
This code is sequential and utilizes only a single core of the accelerator.
```bash
$ vim source-offload.cpp
......@@ -326,7 +329,7 @@ Following example show how to automatically offload an SGEMM (single precision -
```
!!! Note
Please note: This example is a simplified version of an example from MKL. The expanded version can be found here: **$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c**
This example is a simplified version of an example from MKL. The expanded version can be found here: **$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c**
To compile a code using Intel compiler use:
......@@ -369,7 +372,7 @@ To compile a code user has to be connected to a compute with MIC and load Intel
```
!!! Note
Please note that a particular version of the Intel module is specified. This information is used later to specify the correct library paths.
A particular version of the Intel module is specified. This information is used later to specify the correct library paths.
To produce a binary compatible with the Intel Xeon Phi architecture, the user has to specify the "-mmic" compiler flag. Two compilation examples are shown below. The first example shows how to compile OpenMP parallel code "vect-add.c" for the host only:
......@@ -412,7 +415,7 @@ If the code is parallelized using OpenMP a set of additional libraries is requir
```
!!! Note
Please note that the path exported in the previous example contains a path to a specific compiler (here the version is 5.192). This version number has to match the version number of the Intel compiler module that was used to compile the code on the host computer.
The path exported contains a path to a specific compiler (here the version is 5.192). This version number has to match the version number of the Intel compiler module that was used to compile the code on the host computer.
For your information, the list of libraries and their locations required for execution of an OpenMP parallel code on Intel Xeon Phi is:
......@@ -537,8 +540,8 @@ To see the performance of Intel Xeon Phi performing the DGEMM run the example as
...
```
!!! Note
Please note: The GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load the Intel compiler module.
!!! Hint
The GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load the Intel compiler module.
## MPI
......@@ -647,8 +650,6 @@ Similarly to execution of OpenMP programs in native mode, since the environmenta
```
!!! Note
Please note:
- this file sets up the environment variables for both the MPI and OpenMP libraries.
- this file sets up the paths to a particular version of the Intel MPI library and a particular version of the Intel compiler. These versions have to match the loaded modules.
......@@ -702,9 +703,8 @@ or using mpirun
```
!!! Note
Please note:
- the full path to the binary has to be specified (here: **~/mpi-test-mic**)
- the LD_LIBRARY_PATH has to match the Intel MPI module used to compile the MPI code
- the full path to the binary has to be specified (here: "**~/mpi-test-mic**")
- the LD_LIBRARY_PATH has to match the Intel MPI module used to compile the MPI code
The output should be again similar to:
......@@ -715,8 +715,10 @@ The output should be again similar to:
Hello world from process 0 of 4 on host cn207-mic0
```
!!! Note
Please note that **"mpiexec.hydra"** requires a file on the MIC filesystem. If the file is missing, please contact the system administrators. A simple test to see if the file is present is to execute:
!!! Hint
**"mpiexec.hydra"** requires a file the MIC filesystem. If the file is missing please contact the system administrators.
A simple test to see if the file is present is to execute:
```bash
$ ssh mic0 ls /bin/pmi_proxy
......@@ -749,11 +751,10 @@ For example:
This output means that PBS allocated nodes cn204 and cn205, which means that the user has direct access to the "**cn204-mic0**" and "**cn205-mic0**" accelerators.
!!! Note
Please note: At this point the user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:
- to connect to the second node: **$ ssh cn205**
- to connect to the accelerator on the first node from the first node: **$ ssh cn204-mic0** or **$ ssh mic0**
- to connect to the accelerator on the second node from the first node: **$ ssh cn205-mic0**
At this point the user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:
- to connect to the second node: `$ ssh cn205`
- to connect to the accelerator on the first node from the first node: `$ ssh cn204-mic0` or `$ ssh mic0`
- to connect to the accelerator on the second node from the first node: `$ ssh cn205-mic0`
At this point we expect that the correct modules are loaded and the binary is compiled. For parallel execution mpiexec.hydra is used. Again, the first step is to tell mpiexec that MPI can be executed on MIC accelerators by setting the environment variable "I_MPI_MIC":
......@@ -882,7 +883,7 @@ A possible output of the MPI "hello-world" example executed on two hosts and two
```
!!! Note
Please note: At this point the MPI communication between MIC accelerators on different nodes uses 1Gb Ethernet only.
At this point the MPI communication between MIC accelerators on different nodes uses 1Gb Ethernet only.
**Using the PBS automatically generated node-files**
......@@ -895,7 +896,7 @@ PBS also generates a set of node-files that can be used instead of manually crea
- /lscratch/${PBS_JOBID}/nodefile-mic Host and MIC node-file:
- /lscratch/${PBS_JOBID}/nodefile-mix
Please note that each host or accelerator is listed only once per file. The user has to specify how many jobs should be executed per node using the "-n" parameter of the mpirun command.
Each host or accelerator is listed only once per file. The user has to specify how many jobs should be executed per node using the "-n" parameter of the mpirun command.
## Optimization
......
......@@ -129,7 +129,8 @@ The last part of the configuration is done directly in the user Matlab script be
This script creates scheduler object "cluster" of type "local" that starts workers locally.
Please note: Every Matlab script that needs to initialize/use matlabpool has to contain these three lines prior to calling parpool(sched, ...) function.
!!! Hint
Every Matlab script that needs to initialize/use matlabpool has to contain these three lines prior to calling parpool(sched, ...) function.
The last step is to start matlabpool with "cluster" object and correct number of workers. We have 24 cores per node, so we start 24 workers.
......@@ -212,7 +213,8 @@ You can start this script using batch mode the same way as in Local mode example
This method is a "hack" invented by us to emulate the mpiexec functionality found in previous MATLAB versions. We leverage the MATLAB Generic Scheduler interface, but instead of submitting the workers to PBS, we launch the workers directly within the running job, thus we avoid the issues with master script and workers running in separate jobs (issues with license not available, waiting for the worker's job to spawn etc.)
Please note that this method is experimental.
!!! Warning
This method is experimental.
For this method, you need to use the SalomonDirect profile; import it [the same way as SalomonPBSPro](matlab.md#running-parallel-matlab-using-distributed-computing-toolbox---engine).
......