Commit e4dc6645 authored by David Hrbáč

Please-notes go to /dev/null

parent 201844d6
Showing 86 additions and 69 deletions
@@ -216,17 +216,18 @@ $ qsub -N JOBNAME jobscript

In this example, we submit a job of 101 tasks. 16 input files will be processed in parallel. The 101 tasks on 16 cores are assumed to complete in less than 2 hours.
!!! Hint
    Use the #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and the desired queue.
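A minimal sketch of such a jobscript header, assuming a 2-hour walltime on one 16-core node; the project ID and queue name below are placeholders, not values taken from this page:

```bash
#PBS -A PROJECT_ID                            # placeholder: your accounting project
#PBS -q qprod                                 # placeholder: the queue you intend to use
#PBS -l select=1:ncpus=16,walltime=02:00:00   # one full 16-core node, 2 hours
```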
## Job Arrays and GNU Parallel

!!! Note
    Combine the job arrays and GNU parallel for the best throughput of single-core jobs.

While job arrays are able to utilize all available computational nodes, GNU parallel can be used to efficiently run multiple single-core jobs on a single node. The two approaches may be combined to utilize all available (current and future) resources to execute single-core jobs.

!!! Note
    Every subjob in an array runs GNU parallel to utilize all cores on the node.

### GNU Parallel, Shared jobscript
@@ -281,7 +282,7 @@ cp output $PBS_O_WORKDIR/$TASK.out

In this example, the jobscript executes in multiple instances in parallel, on all cores of a computing node. The variable $TASK expands to one of the input filenames from tasklist. We copy the input file to local scratch, execute myprog.x, and copy the output file back to the submit directory, under the $TASK.out name. The numtasks file controls how many tasks will be run per subjob. Once a task is finished, a new task starts, until the number of tasks in the numtasks file is reached.
!!! Note
    Select the subjob walltime and the number of tasks per subjob carefully.

When deciding these values, think about the following guiding rules:
@@ -300,7 +301,8 @@ $ qsub -N JOBNAME -J 1-992:32 jobscript

In this example, we submit a job array of 31 subjobs. Note the -J 1-992:**32**; this must be the same as the number sent to the numtasks file. Each subjob will run on a full node and process 16 input files in parallel, 32 in total per subjob. Every subjob is assumed to complete in less than 2 hours.
!!! Hint
    Use the #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and the desired queue.
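A minimal sketch of keeping the numtasks value and the array step in sync, using the file and option names from the example above:

```bash
echo 32 > numtasks                      # tasks consumed by each subjob
qsub -N JOBNAME -J 1-992:32 jobscript   # array step must equal the numtasks value
```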
## Examples
......
@@ -233,9 +233,12 @@ The resources that are currently subject to accounting are the core hours. The c

PRACE users should check their project accounting using the [PRACE Accounting Tool (DART)](http://www.prace-ri.eu/accounting-report-tool/).

Users who have undergone the full local registration procedure (including signing the IT4Innovations Acceptable Use Policy) and who have received a local password may check at any time how many core-hours have been consumed by themselves and their projects using the "it4ifree" command.
!!! Note
    You need to know your user password to use the command. The displayed core hours are "system core hours", which differ from PRACE "standardized core hours".

!!! Hint
    The **it4ifree** command is a part of the it4i.portal.clients package, located here: <https://pypi.python.org/pypi/it4i.portal.clients>

```bash
......
@@ -192,7 +192,7 @@ $ module load virtualgl/2.4

$ vglrun glxgears
```

If you want to run an OpenGL application which is available through modules, you need to first load the respective module. E.g. to run the **Mentat** OpenGL application from the **MARC** software package, use:
```bash
$ module load marc/2013.1
......
@@ -102,7 +102,10 @@ To use the Berkley UPC compiler and runtime environment to run the binaries use

The "smp" UPC network is used by default. This is a very quick and easy way for testing/debugging, but it is limited to one node only.

For production runs, it is recommended to use the native InfiniBand implementation of the UPC network, "ibv". For testing/debugging using multiple nodes, the "mpi" UPC network is recommended.
!!! Warning
    Selection of the network is done at compile time, not at runtime (as expected)!
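A minimal compile-time sketch with the Berkeley UPC driver, assuming `upcc` accepts the usual `-network` option; source and binary names are illustrative:

```bash
upcc -network=smp -o hello.smp hello.upc   # single-node default, quick testing
upcc -network=mpi -o hello.mpi hello.upc   # multi-node testing/debugging
upcc -network=ibv -o hello.ibv hello.upc   # native InfiniBand, production runs
```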
Example UPC code:
......
@@ -91,8 +91,8 @@ To debug a serial code use:

To debug a parallel code compiled with **OpenMPI**, you need to set up your TotalView environment:

!!! Hint
    To be able to run the parallel debugging procedure from the command line without stopping the debugger in the mpiexec source code, you have to add the following function to your `~/.tvdrc` file:
```bash
proc mpi_auto_run_starter {loaded_id} {
......
@@ -103,7 +103,10 @@ For debugging purposes it is also recommended to set environment variable "OFFLO

export OFFLOAD_REPORT=3
```

A very basic example of code that employs the offload programming technique is shown in the next listing.
!!! Note
    This code is sequential and utilizes only a single core of the accelerator.
```bash
$ vim source-offload.cpp
@@ -327,7 +330,7 @@ Following example show how to automatically offload an SGEMM (single precision -

```

!!! Note
    This example is a simplified version of an example from MKL. The expanded version can be found here: `$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c`.
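As a hedged aside, MKL automatic offload can also be steered at runtime through standard MKL environment variables; a minimal sketch (the binary name and work-division value are illustrative only):

```bash
export MKL_MIC_ENABLE=1            # enable MKL automatic offload to the coprocessor
export OFFLOAD_REPORT=2            # report what gets offloaded
export MKL_HOST_WORKDIVISION=0.2   # illustrative: keep roughly 20% of the work on the host
./sgemm-ao                         # illustrative binary name
```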
To compile the code using the Intel compiler, use:
@@ -370,7 +373,7 @@ To compile a code user has to be connected to a compute with MIC and load Intel

```

!!! Note
    A particular version of the Intel module is specified. This information is used later to specify the correct library paths.
To produce a binary compatible with the Intel Xeon Phi architecture, the user has to specify the "-mmic" compiler flag. Two compilation examples are shown below. The first example shows how to compile the OpenMP parallel code "vect-add.c" for the host only:
@@ -413,7 +416,7 @@ If the code is parallelized using OpenMP a set of additional libraries is requir

```

!!! Note
    The path exported in the previous example contains a path to a specific compiler (here the version is 5.192). This version number has to match the version number of the Intel compiler module that was used to compile the code on the host computer.

For your information, the list of libraries and their locations required for execution of an OpenMP parallel code on Intel Xeon Phi is:
@@ -538,8 +541,8 @@ To see the performance of Intel Xeon Phi performing the DGEMM run the example as

...
```

!!! Warning
    The GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load the Intel compiler module.
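A minimal compilation sketch, assuming the OpenCL headers and the libOpenCL loader from the installed OpenCL runtime are visible to the compiler; the source name and install paths are placeholders:

```bash
g++ example.cpp -o example \
    -I/path/to/opencl/include \
    -L/path/to/opencl/lib64 -lOpenCL   # placeholder paths to the OpenCL runtime
```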
## MPI
@@ -648,9 +651,8 @@ Similarly to execution of OpenMP programs in native mode, since the environmenta

```

!!! Note
    - This file sets up the environmental variables for both the MPI and OpenMP libraries.
    - This file sets up the paths to a particular version of the Intel MPI library and a particular version of the Intel compiler. These versions have to match the loaded modules.
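A sketch of what such an environment file might contain; the install paths below are placeholders and must match the modules you actually loaded:

```bash
# hypothetical ~/.profile sourced on the MIC card
# OpenMP runtime of the Intel compiler used on the host (placeholder path)
export LD_LIBRARY_PATH=/path/to/intel/composer_xe/compiler/lib/mic:$LD_LIBRARY_PATH
# Intel MPI runtime and tools built for MIC (placeholder path)
export LD_LIBRARY_PATH=/path/to/intel/impi/mic/lib:$LD_LIBRARY_PATH
export PATH=/path/to/intel/impi/mic/bin:$PATH
```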
To access a MIC accelerator located on the node that the user is currently connected to, use:
@@ -702,9 +704,8 @@ or using mpirun

```

!!! Note
    - The full path to the binary has to be specified (here: `~/mpi-test-mic`).
    - The `LD_LIBRARY_PATH` has to match the Intel MPI module used to compile the MPI code.
The output should be again similar to:

@@ -716,7 +717,9 @@ The output should be again similar to:

```

!!! Note
    `mpiexec.hydra` requires a file on the MIC filesystem. If the file is missing, please contact the system administrators.

A simple test to see if the file is present is to execute:
```bash
$ ssh mic0 ls /bin/pmi_proxy
@@ -749,11 +752,10 @@ For example:

This output means that PBS allocated nodes cn204 and cn205, which means that the user has direct access to the "**cn204-mic0**" and "**cn205-mic0**" accelerators.
!!! Note
    At this point, the user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:

    - to connect to the second node: `$ ssh cn205`
    - to connect to the accelerator on the first node from the first node: `$ ssh cn204-mic0` or `$ ssh mic0`
    - to connect to the accelerator on the second node from the first node: `$ ssh cn205-mic0`
At this point we expect that the correct modules are loaded and the binary is compiled. For parallel execution, mpiexec.hydra is used. Again, the first step is to tell mpiexec that MPI can be executed on the MIC accelerators by setting up the environmental variable "I_MPI_MIC".
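A minimal sketch; `I_MPI_MIC` is a standard Intel MPI variable, the rest is only an illustration of the following step:

```bash
export I_MPI_MIC=1                              # let mpiexec.hydra start ranks on the MIC
# illustrative launch, reusing the binary name from above:
# mpiexec.hydra -host mic0 -n 4 ~/mpi-test-mic
```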
@@ -882,7 +884,7 @@ A possible output of the MPI "hello-world" example executed on two hosts and two

```

!!! Note
    At this point, the MPI communication between MIC accelerators on different nodes uses 1 Gb Ethernet only.
**Using the PBS automatically generated node-files**

@@ -895,7 +897,7 @@ PBS also generates a set of node-files that can be used instead of manually crea

- /lscratch/${PBS_JOBID}/nodefile-mic

Host and MIC node-file:

- /lscratch/${PBS_JOBID}/nodefile-mix
Each host or accelerator is listed only once per file. The user has to specify how many jobs should be executed per node using the `-n` parameter of the mpirun command.
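A hedged sketch of launching against one of the generated node-files; the process count and the binary name are illustrative only:

```bash
mpirun -n 4 -machinefile /lscratch/${PBS_JOBID}/nodefile-mix ~/mpi-test-mic   # illustrative count and binary
```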
## Optimization
......
@@ -134,7 +134,7 @@ The last part of the configuration is done directly in the user Matlab script be

This script creates the scheduler object "cluster" of type "local" that starts workers locally.

!!! Note
    Every Matlab script that needs to initialize/use matlabpool has to contain these three lines prior to calling the parpool(sched, ...) function.

The last step is to start matlabpool with the "cluster" object and the correct number of workers. We have 24 cores per node, so we start 24 workers.
@@ -217,7 +217,8 @@ You can start this script using batch mode the same way as in Local mode example

This method is a "hack" invented by us to emulate the mpiexec functionality found in previous MATLAB versions. We leverage the MATLAB Generic Scheduler interface, but instead of submitting the workers to PBS, we launch the workers directly within the running job, thus avoiding the issues with the master script and workers running in separate jobs (issues with licenses not being available, waiting for the worker's job to spawn, etc.).

!!! Warning
    This method is experimental.

For this method, you need to use the SalomonDirect profile; import it [the same way as SalomonPBSPro](matlab/#running-parallel-matlab-using-distributed-computing-toolbox---engine).
......
@@ -66,13 +66,11 @@ To test if the MAGMA server runs properly we can run one of examples that are pa

10304 10304 --- ( --- ) 500.70 ( 1.46) ---
```

!!! Hint
    MAGMA contains several benchmarks and examples in `$MAGMAROOT/testing/`.
!!! Note
    MAGMA relies on the performance of all CPU cores as well as on the performance of the accelerator. Therefore, on Anselm, the number of CPU OpenMP threads has to be set to 16 with `export OMP_NUM_THREADS=16`.
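A minimal sketch of re-running one of the bundled tests with the thread count set accordingly; the test binary name is hypothetical, list the directory for the tests actually installed:

```bash
export OMP_NUM_THREADS=16        # all 16 Anselm CPU cores feed MAGMA
cd $MAGMAROOT/testing
./testing_dgetrf                 # hypothetical test name; pick any test from this directory
```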
See more details at the [MAGMA home page](http://icl.cs.utk.edu/magma/).
......
@@ -281,9 +281,8 @@ SAXPY function multiplies the vector x by the scalar alpha and adds it to the ve

```

!!! Note
    cuBLAS has its own functions for data transfers between CPU and GPU memory:

    - [cublasSetVector](http://docs.nvidia.com/cuda/cublas/index.html#cublassetvector) - transfers data from CPU to GPU memory
    - [cublasGetVector](http://docs.nvidia.com/cuda/cublas/index.html#cublasgetvector) - transfers data from GPU to CPU memory
To compile the code using the NVCC compiler, a "-lcublas" compiler flag has to be specified:
......
@@ -218,7 +218,8 @@ $ qsub -N JOBNAME jobscript

In this example, we submit a job of 101 tasks. 24 input files will be processed in parallel. The 101 tasks on 24 cores are assumed to complete in less than 2 hours.

!!! Note
    Use the #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and the desired queue.
## Job Arrays and GNU Parallel
@@ -302,7 +303,8 @@ $ qsub -N JOBNAME -J 1-992:32 jobscript

In this example, we submit a job array of 31 subjobs. Note the -J 1-992:**48**; this must be the same as the number sent to the numtasks file. Each subjob will run on a full node and process 24 input files in parallel, 48 in total per subjob. Every subjob is assumed to complete in less than 2 hours.

!!! Note
    Use the #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and the desired queue.
## Examples
......
@@ -202,7 +202,8 @@ Generally both shared file systems are available through GridFTP:

More information about the shared file systems is available [here](storage/).

!!! Hint
    A `prace` directory is also used for PRACE users on the SCRATCH file system.
| Data type | Default path |
| ---------------------------- | ------------------------------- |
@@ -245,7 +246,7 @@ The resources that are currently subject to accounting are the core hours. The c

PRACE users should check their project accounting using the [PRACE Accounting Tool (DART)](http://www.prace-ri.eu/accounting-report-tool/).

Users who have undergone the full local registration procedure (including signing the IT4Innovations Acceptable Use Policy) and who have received a local password may check at any time how many core-hours have been consumed by themselves and their projects using the "it4ifree" command. You need to know your user password to use the command; the displayed core hours are "system core hours", which differ from PRACE "standardized core hours".
!!! Note
    The **it4ifree** command is a part of the it4i.portal.clients package, located here: <https://pypi.python.org/pypi/it4i.portal.clients>
......
@@ -138,7 +138,10 @@ To use the Berkley UPC compiler and runtime environment to run the binaries use

The "smp" UPC network is used by default. This is a very quick and easy way for testing/debugging, but it is limited to one node only.

For production runs, it is recommended to use the native InfiniBand implementation of the UPC network, "ibv". For testing/debugging using multiple nodes, the "mpi" UPC network is recommended.

!!! Warning
    Selection of the network is done at compile time, not at runtime (as expected)!
Example UPC code:
......
@@ -80,8 +80,8 @@ To debug a serial code use:

To debug a parallel code compiled with **OpenMPI**, you need to set up your TotalView environment:

!!! Hint
    To be able to run the parallel debugging procedure from the command line without stopping the debugger in the mpiexec source code, you have to add the following function to your **~/.tvdrc** file:
```bash
proc mpi_auto_run_starter {loaded_id} {
......
@@ -103,7 +103,10 @@ For debugging purposes it is also recommended to set environment variable "OFFLO

export OFFLOAD_REPORT=3
```

A very basic example of code that employs the offload programming technique is shown in the next listing.

!!! Note
    This code is sequential and utilizes only a single core of the accelerator.
```bash
$ vim source-offload.cpp
@@ -326,7 +329,7 @@ Following example show how to automatically offload an SGEMM (single precision -

```

!!! Note
    This example is a simplified version of an example from MKL. The expanded version can be found here: **$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c**

To compile the code using the Intel compiler, use:
@@ -369,7 +372,7 @@ To compile a code user has to be connected to a compute with MIC and load Intel

```

!!! Note
    A particular version of the Intel module is specified. This information is used later to specify the correct library paths.

To produce a binary compatible with the Intel Xeon Phi architecture, the user has to specify the "-mmic" compiler flag. Two compilation examples are shown below. The first example shows how to compile the OpenMP parallel code "vect-add.c" for the host only:
@@ -412,7 +415,7 @@ If the code is parallelized using OpenMP a set of additional libraries is requir

```

!!! Note
    The path exported contains a path to a specific compiler (here the version is 5.192). This version number has to match the version number of the Intel compiler module that was used to compile the code on the host computer.

For your information, the list of libraries and their locations required for execution of an OpenMP parallel code on Intel Xeon Phi is:
@@ -537,8 +540,8 @@ To see the performance of Intel Xeon Phi performing the DGEMM run the example as

...
```

!!! Hint
    The GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load the Intel compiler module.
## MPI
@@ -647,8 +650,6 @@ Similarly to execution of OpenMP programs in native mode, since the environmenta

```

!!! Note
    - This file sets up the environmental variables for both the MPI and OpenMP libraries.
    - This file sets up the paths to a particular version of the Intel MPI library and a particular version of the Intel compiler. These versions have to match the loaded modules.
@@ -702,9 +703,8 @@ or using mpirun

```

!!! Note
    - The full path to the binary has to be specified (here: **~/mpi-test-mic**).
    - The LD_LIBRARY_PATH has to match the Intel MPI module used to compile the MPI code.

The output should be again similar to:
@@ -715,8 +715,10 @@ The output should be again similar to:

Hello world from process 0 of 4 on host cn207-mic0
```

!!! Hint
    **mpiexec.hydra** requires a file on the MIC filesystem. If the file is missing, please contact the system administrators.

A simple test to see if the file is present is to execute:
```bash
$ ssh mic0 ls /bin/pmi_proxy
@@ -749,11 +751,10 @@ For example:

This output means that PBS allocated nodes cn204 and cn205, which means that the user has direct access to the "**cn204-mic0**" and "**cn205-mic0**" accelerators.
!!! Note
    At this point, the user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:

    - to connect to the second node: `$ ssh cn205`
    - to connect to the accelerator on the first node from the first node: `$ ssh cn204-mic0` or `$ ssh mic0`
    - to connect to the accelerator on the second node from the first node: `$ ssh cn205-mic0`
At this point we expect that the correct modules are loaded and the binary is compiled. For parallel execution, mpiexec.hydra is used. Again, the first step is to tell mpiexec that MPI can be executed on the MIC accelerators by setting up the environmental variable "I_MPI_MIC".
@@ -882,7 +883,7 @@ A possible output of the MPI "hello-world" example executed on two hosts and two

```

!!! Note
    At this point, the MPI communication between MIC accelerators on different nodes uses 1 Gb Ethernet only.

**Using the PBS automatically generated node-files**
@@ -895,7 +896,7 @@ PBS also generates a set of node-files that can be used instead of manually crea

- /lscratch/${PBS_JOBID}/nodefile-mic

Host and MIC node-file:

- /lscratch/${PBS_JOBID}/nodefile-mix
Each host or accelerator is listed only once per file. The user has to specify how many jobs should be executed per node using the "-n" parameter of the mpirun command.
## Optimization
......
@@ -129,7 +129,8 @@ The last part of the configuration is done directly in the user Matlab script be

This script creates the scheduler object "cluster" of type "local" that starts workers locally.

!!! Hint
    Every Matlab script that needs to initialize/use matlabpool has to contain these three lines prior to calling the parpool(sched, ...) function.

The last step is to start matlabpool with the "cluster" object and the correct number of workers. We have 24 cores per node, so we start 24 workers.
@@ -212,7 +213,8 @@ You can start this script using batch mode the same way as in Local mode example

This method is a "hack" invented by us to emulate the mpiexec functionality found in previous MATLAB versions. We leverage the MATLAB Generic Scheduler interface, but instead of submitting the workers to PBS, we launch the workers directly within the running job, thus avoiding the issues with the master script and workers running in separate jobs (issues with licenses not being available, waiting for the worker's job to spawn, etc.).

!!! Warning
    This method is experimental.

For this method, you need to use the SalomonDirect profile; import it [the same way as SalomonPBSPro](matlab.md#running-parallel-matlab-using-distributed-computing-toolbox---engine).
......