diff --git a/docs.it4i/anselm-cluster-documentation/capacity-computing.md b/docs.it4i/anselm-cluster-documentation/capacity-computing.md
index e76bdfc3c5b6bebae842de68298edf66f8eb53f3..306474076df83ea3255390348fe63be8640ce4bf 100644
--- a/docs.it4i/anselm-cluster-documentation/capacity-computing.md
+++ b/docs.it4i/anselm-cluster-documentation/capacity-computing.md
@@ -216,17 +216,18 @@ $ qsub -N JOBNAME jobscript
 
 In this example, we submit a job of 101 tasks. 16 input files will be processed in parallel. The 101 tasks on 16 cores are assumed to complete in less than 2 hours.
 
-Please note the #PBS directives in the beginning of the jobscript file, dont' forget to set your valid PROJECT_ID and desired queue.
+!!! Hint
+    Use #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue.
 
 ## Job Arrays and GNU Parallel
 
 !!! Note
-	Combine the Job arrays and GNU parallel for best throughput of single core jobs
+    Combine the Job arrays and GNU parallel for best throughput of single core jobs
 
 While job arrays are able to utilize all available computational nodes, the GNU parallel can be used to efficiently run multiple single-core jobs on single node. The two approaches may be combined to utilize all available (current and future) resources to execute single core jobs.
 
 !!! Note
-	Every subjob in an array runs GNU parallel to utilize all cores on the node
+    Every subjob in an array runs GNU parallel to utilize all cores on the node
 
 ### GNU Parallel, Shared jobscript
 
@@ -281,7 +282,7 @@ cp output $PBS_O_WORKDIR/$TASK.out
 In this example, the jobscript executes in multiple instances in parallel, on all cores of a computing node. Variable $TASK expands to one of the input filenames from tasklist. We copy the input file to local scratch, execute the myprog.x and copy the output file back to the submit directory, under the $TASK.out name. The numtasks file controls how many tasks will be run per subjob. Once an task is finished, new task starts, until the number of tasks in numtasks file is reached.
 
 !!! Note
-	Select subjob walltime and number of tasks per subjob carefully
+    Select subjob walltime and number of tasks per subjob carefully
 
 When deciding this values, think about following guiding rules:
 
@@ -300,7 +301,8 @@ $ qsub -N JOBNAME -J 1-992:32 jobscript
 
 In this example, we submit a job array of 31 subjobs. Note the -J 1-992:**32**, this must be the same as the number sent to numtasks file. Each subjob will run on full node and process 16 input files in parallel, 32 in total per subjob. Every subjob is assumed to complete in less than 2 hours.
 
-Please note the #PBS directives in the beginning of the jobscript file, dont' forget to set your valid PROJECT_ID and desired queue.
+!!! Hint
+    Use #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue.
 
 ## Examples
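To illustrate the hint changed above, a minimal jobscript header might look as follows. This is a sketch only; the project ID `OPEN-0-0`, the queue `qprod` and the resource request line are placeholders, not taken from this patch.

```bash
#!/bin/bash
#PBS -A OPEN-0-0                               # your valid PROJECT_ID (placeholder)
#PBS -q qprod                                  # desired queue (placeholder)
#PBS -N JOBNAME
#PBS -l select=1:ncpus=16,walltime=02:00:00    # one full node for up to 2 hours
```
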
diff --git a/docs.it4i/anselm-cluster-documentation/prace.md b/docs.it4i/anselm-cluster-documentation/prace.md
index 579339bb3d46e4da37825abd1423e4e470efb950..9904b34c928ea574f3f473913033cfa08699c7be 100644
--- a/docs.it4i/anselm-cluster-documentation/prace.md
+++ b/docs.it4i/anselm-cluster-documentation/prace.md
@@ -233,9 +233,12 @@ The resources that are currently subject to accounting are the core hours. The c
 
 PRACE users should check their project accounting using the [PRACE Accounting Tool (DART)](http://www.prace-ri.eu/accounting-report-tool/).
 
-Users who have undergone the full local registration procedure (including signing the IT4Innovations Acceptable Use Policy) and who have received local password may check at any time, how many core-hours have been consumed by themselves and their projects using the command "it4ifree". Please note that you need to know your user password to use the command and that the displayed core hours are "system core hours" which differ from PRACE "standardized core hours".
+Users who have undergone the full local registration procedure (including signing the IT4Innovations Acceptable Use Policy) and who have received a local password may check at any time how many core-hours have been consumed by themselves and their projects using the command "it4ifree".
 
 !!! Note
+    You need to know your user password to use the command. The displayed core hours are "system core hours", which differ from PRACE "standardized core hours".
+
+!!! Hint
 	The **it4ifree** command is a part of it4i.portal.clients package, located here: <https://pypi.python.org/pypi/it4i.portal.clients>
 
 ```bash
diff --git a/docs.it4i/anselm-cluster-documentation/remote-visualization.md b/docs.it4i/anselm-cluster-documentation/remote-visualization.md
index 929f6930a42b253a763e43af1cfe2da9add18614..b448ef6823d976142535a2a3ae121bfeb704deca 100644
--- a/docs.it4i/anselm-cluster-documentation/remote-visualization.md
+++ b/docs.it4i/anselm-cluster-documentation/remote-visualization.md
@@ -192,7 +192,7 @@ $ module load virtualgl/2.4
 $ vglrun glxgears
 ```
 
-Please note, that if you want to run an OpenGL application which is vailable through modules, you need at first load the respective module. . g. to run the **Mentat** OpenGL application from **MARC** software ackage use:
+If you want to run an OpenGL application which is available through modules, you first need to load the respective module. E.g. to run the **Mentat** OpenGL application from the **MARC** software package use:
 
 ```bash
 $ module load marc/2013.1
diff --git a/docs.it4i/anselm-cluster-documentation/software/compilers.md b/docs.it4i/anselm-cluster-documentation/software/compilers.md
index 67c0bd30edcc631860ec8d853e0905729f8e5108..86f354ba1fee2daa9116035e7b70673adab12aa2 100644
--- a/docs.it4i/anselm-cluster-documentation/software/compilers.md
+++ b/docs.it4i/anselm-cluster-documentation/software/compilers.md
@@ -102,7 +102,10 @@ To use the Berkley UPC compiler and runtime environment to run the binaries use
 
 As default UPC network the "smp" is used. This is very quick and easy way for testing/debugging, but limited to one node only.
 
-For production runs, it is recommended to use the native Infiband implementation of UPC network "ibv". For testing/debugging using multiple nodes, the "mpi" UPC network is recommended. Please note, that **the selection of the network is done at the compile time** and not at runtime (as expected)!
+For production runs, it is recommended to use the native InfiniBand implementation of the UPC network, "ibv". For testing/debugging using multiple nodes, the "mpi" UPC network is recommended.
+
+!!! Warning
+    The selection of the network is done at compile time, not at runtime (as one might expect)!
 
 Example UPC code:
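To make the compile-time network choice above concrete, a possible build-and-run sequence is sketched below, assuming the Berkeley UPC `upcc`/`upcrun` drivers with their `-network` option and a source file named `hello.upc` (both names are assumptions):

```bash
# the network is fixed at compile time -- "ibv" for production runs over InfiniBand
$ upcc -network=ibv -o hello.ibv hello.upc

# run the binary; the network can no longer be changed at this point
$ upcrun -n 32 ./hello.ibv
```
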
diff --git a/docs.it4i/anselm-cluster-documentation/software/debuggers/total-view.md b/docs.it4i/anselm-cluster-documentation/software/debuggers/total-view.md
index ca08c5ea8f6b45f048e42a95f9f119f05dc35ef2..1389d347704845c9fbcb5d5f9a479b29790275df 100644
--- a/docs.it4i/anselm-cluster-documentation/software/debuggers/total-view.md
+++ b/docs.it4i/anselm-cluster-documentation/software/debuggers/total-view.md
@@ -91,8 +91,8 @@ To debug a serial code use:
 
 To debug a parallel code compiled with **OpenMPI** you need to setup your TotalView environment:
 
-!!! Note
-	**Please note:** To be able to run parallel debugging procedure from the command line without stopping the debugger in the mpiexec source code you have to add the following function to your **~/.tvdrc** file:
+!!! Hint
+    To be able to run the parallel debugging procedure from the command line without stopping the debugger in the mpiexec source code, you have to add the following function to your `~/.tvdrc` file:
 
 ```bash
 proc mpi_auto_run_starter {loaded_id} {
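Once the function above is in `~/.tvdrc`, a command-line debugging session can typically be started as sketched below; the module name and the binary `./test_debug` are placeholders, and Open MPI's `-tv` switch is assumed:

```bash
# load TotalView first (module name/version is a placeholder)
$ module load totalview

# -tv asks Open MPI's mpirun to start the job under TotalView
$ mpirun -tv -n 5 ./test_debug
```
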
diff --git a/docs.it4i/anselm-cluster-documentation/software/intel-xeon-phi.md b/docs.it4i/anselm-cluster-documentation/software/intel-xeon-phi.md
index 479af5a22c08d333e02dc9ba5afa946766dd75cb..7f478b558b62699e8eece30865070558b9d0af7c 100644
--- a/docs.it4i/anselm-cluster-documentation/software/intel-xeon-phi.md
+++ b/docs.it4i/anselm-cluster-documentation/software/intel-xeon-phi.md
@@ -103,7 +103,10 @@ For debugging purposes it is also recommended to set environment variable "OFFLO
 export OFFLOAD_REPORT=3
 ```
 
-A very basic example of code that employs offload programming technique is shown in the next listing. Please note that this code is sequential and utilizes only single core of the accelerator.
+A very basic example of code that employs the offload programming technique is shown in the next listing.
+
+!!! Note
+    This code is sequential and utilizes only a single core of the accelerator.
 
 ```bash
 $ vim source-offload.cpp
@@ -327,7 +330,7 @@ Following example show how to automatically offload an SGEMM (single precision -
 ```
 
 !!! Note
-	Please note: This example is simplified version of an example from MKL. The expanded version can be found here: **$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c**
+    This example is a simplified version of an example from MKL. The expanded version can be found here: `$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c`.
 
 To compile a code using Intel compiler use:
 
@@ -370,7 +373,7 @@ To compile a code user has to be connected to a compute with MIC and load Intel
 ```
 
 !!! Note
-	Please note that particular version of the Intel module is specified. This information is used later to specify the correct library paths.
+    A particular version of the Intel module is specified. This information is used later to specify the correct library paths.
 
 To produce a binary compatible with Intel Xeon Phi architecture user has to specify "-mmic" compiler flag. Two compilation examples are shown below. The first example shows how to compile OpenMP parallel code "vect-add.c" for host only:
 
@@ -413,7 +416,7 @@ If the code is parallelized using OpenMP a set of additional libraries is requir
 ```
 
 !!! Note
-	Please note that the path exported in the previous example contains path to a specific compiler (here the version is 5.192). This version number has to match with the version number of the Intel compiler module that was used to compile the code on the host computer.
+    The path exported in the previous example contains the path to a specific compiler (here the version is 5.192). This version number has to match the version number of the Intel compiler module that was used to compile the code on the host computer.
 
 For your information the list of libraries and their location required for execution of an OpenMP parallel code on Intel Xeon Phi is:
 
@@ -538,8 +541,8 @@ To see the performance of Intel Xeon Phi performing the DGEMM run the example as
 ...
 ```
 
-!!! Note
-	Please note: GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load Intel compiler module.
+!!! Warning
+    The GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load the Intel compiler module.
 
 ## MPI
 
@@ -648,9 +651,8 @@ Similarly to execution of OpenMP programs in native mode, since the environmenta
 ```
 
 !!! Note
-	Please note:
-	\- this file sets up both environmental variable for both MPI and OpenMP libraries.
-	\- this file sets up the paths to a particular version of Intel MPI library and particular version of an Intel compiler. These versions have to match with loaded modules.
+    - this file sets up environment variables for both the MPI and OpenMP libraries.
+    - this file sets up the paths to a particular version of the Intel MPI library and a particular version of the Intel compiler. These versions have to match the loaded modules.
 
 To access a MIC accelerator located on a node that user is currently connected to, use:
 
@@ -702,9 +704,8 @@ or using mpirun
 ```
 
 !!! Note
-	Please note:
-	\- the full path to the binary has to specified (here: "**>~/mpi-test-mic**")
-	\- the LD_LIBRARY_PATH has to match with Intel MPI module used to compile the MPI code
+    - the full path to the binary has to be specified (here: `~/mpi-test-mic`)
+    - the `LD_LIBRARY_PATH` has to match the Intel MPI module used to compile the MPI code
 
 The output should be again similar to:
 
@@ -716,7 +717,9 @@ The output should be again similar to:
 ```
 
 !!! Note
-	Please note that the **"mpiexec.hydra"** requires a file the MIC filesystem. If the file is missing please contact the system administrators. A simple test to see if the file is present is to execute:
+    `mpiexec.hydra` requires a file on the MIC filesystem. If the file is missing please contact the system administrators.
+
+A simple test to see if the file is present is to execute:
 
 ```bash
 $ ssh mic0 ls /bin/pmi_proxy
@@ -749,11 +752,10 @@ For example:
 
 This output means that the PBS allocated nodes cn204 and cn205, which means that user has direct access to "**cn204-mic0**" and "**cn-205-mic0**" accelerators.
 
 !!! Note
-	Please note: At this point user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:
-
-	- to connect to the second node : ** $ ssh cn205**
-	- to connect to the accelerator on the first node from the first node: **$ ssh cn204-mic0** or **$ ssh mic0**
-	- to connect to the accelerator on the second node from the first node: **$ ssh cn205-mic0**
+    At this point user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:
+    - to connect to the second node: `$ ssh cn205`
+    - to connect to the accelerator on the first node from the first node: `$ ssh cn204-mic0` or `$ ssh mic0`
+    - to connect to the accelerator on the second node from the first node: `$ ssh cn205-mic0`
 
 At this point we expect that correct modules are loaded and binary is compiled. For parallel execution the mpiexec.hydra is used. Again the first step is to tell mpiexec that the MPI can be executed on MIC accelerators by setting up the environmental variable "I_MPI_MIC"
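Pulling the notes above together, a host-side launch onto the accelerator might look as sketched below; the library path is a placeholder and must point to the MIC build of the same Intel MPI version that was used to compile the binary, and `~/mpi-test-mic` is the full binary path required above:

```bash
# enable MIC support in Intel MPI, then start 4 ranks on the first accelerator
$ export I_MPI_MIC=1
$ mpirun -genv LD_LIBRARY_PATH <path-to-MIC-build-of-Intel-MPI-libs> -host mic0 -n 4 ~/mpi-test-mic
```
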
@@ -882,7 +884,7 @@ A possible output of the MPI "hello-world" example executed on two hosts and two
 ```
 
 !!! Note
-	Please note: At this point the MPI communication between MIC accelerators on different nodes uses 1Gb Ethernet only.
+    At this point the MPI communication between MIC accelerators on different nodes uses 1Gb Ethernet only.
 
 **Using the PBS automatically generated node-files**
 
@@ -895,7 +897,7 @@ PBS also generates a set of node-files that can be used instead of manually crea
 - /lscratch/${PBS_JOBID}/nodefile-mic
 Host and MIC node-file:
 - /lscratch/${PBS_JOBID}/nodefile-mix
 
-Please note each host or accelerator is listed only per files. User has to specify how many jobs should be executed per node using "-n" parameter of the mpirun command.
+Each host or accelerator is listed only once per file. User has to specify how many jobs should be executed per node using the `-n` parameter of the mpirun command.
 
 ## Optimization
 
diff --git a/docs.it4i/anselm-cluster-documentation/software/numerical-languages/matlab.md b/docs.it4i/anselm-cluster-documentation/software/numerical-languages/matlab.md
index af36d6e9a3eebd3597307457aeb884716b970a17..1f933d82f633a98cf16394bdf10e608d2f2c1b8f 100644
--- a/docs.it4i/anselm-cluster-documentation/software/numerical-languages/matlab.md
+++ b/docs.it4i/anselm-cluster-documentation/software/numerical-languages/matlab.md
@@ -134,7 +134,7 @@ The last part of the configuration is done directly in the user Matlab script be
 
 This script creates scheduler object "cluster" of type "local" that starts workers locally.
 
 !!! Note
-	Please note: Every Matlab script that needs to initialize/use matlabpool has to contain these three lines prior to calling parpool(sched, ...) function.
+    Every Matlab script that needs to initialize/use matlabpool has to contain these three lines prior to calling the parpool(sched, ...) function.
 
 The last step is to start matlabpool with "cluster" object and correct number of workers. We have 24 cores per node, so we start 24 workers.
 
@@ -217,7 +217,8 @@ You can start this script using batch mode the same way as in Local mode example
 
 This method is a "hack" invented by us to emulate the mpiexec functionality found in previous MATLAB versions. We leverage the MATLAB Generic Scheduler interface, but instead of submitting the workers to PBS, we launch the workers directly within the running job, thus we avoid the issues with master script and workers running in separate jobs (issues with license not available, waiting for the worker's job to spawn etc.)
 
-Please note that this method is experimental.
+!!! Warning
+    This method is experimental.
 
 For this method, you need to use SalomonDirect profile, import it using [the same way as SalomonPBSPro](matlab/#running-parallel-matlab-using-distributed-computing-toolbox---engine)
diff --git a/docs.it4i/anselm-cluster-documentation/software/numerical-libraries/magma-for-intel-xeon-phi.md b/docs.it4i/anselm-cluster-documentation/software/numerical-libraries/magma-for-intel-xeon-phi.md
index 600bd8ae94427a8b0708d3fd1a597249d017ae6b..ef2ee580914212b99eda25128ba0dc4063994b5b 100644
--- a/docs.it4i/anselm-cluster-documentation/software/numerical-libraries/magma-for-intel-xeon-phi.md
+++ b/docs.it4i/anselm-cluster-documentation/software/numerical-libraries/magma-for-intel-xeon-phi.md
@@ -66,13 +66,11 @@ To test if the MAGMA server runs properly we can run one of examples that are pa
 10304 10304  --- (  --- )    500.70 (  1.46)     ---
 ```
 
-!!! Note
-	Please note: MAGMA contains several benchmarks and examples that can be found in:
-	**$MAGMAROOT/testing/**
+!!! Hint
+    MAGMA contains several benchmarks and examples in `$MAGMAROOT/testing/`.
 
 !!! Note
-	MAGMA relies on the performance of all CPU cores as well as on the performance of the accelerator. Therefore on Anselm number of CPU OpenMP threads has to be set to 16:
-	**export OMP_NUM_THREADS=16**
+    MAGMA relies on the performance of all CPU cores as well as on the performance of the accelerator. Therefore on Anselm the number of CPU OpenMP threads has to be set to 16 with `export OMP_NUM_THREADS=16`.
 
 See more details at [MAGMA home page](http://icl.cs.utk.edu/magma/).
 
diff --git a/docs.it4i/anselm-cluster-documentation/software/nvidia-cuda.md b/docs.it4i/anselm-cluster-documentation/software/nvidia-cuda.md
index a57f8c7a47dc5507c7640c6a823e16b9626a4976..062f0a69b253a7f6645b4e822e6dd017527a6d90 100644
--- a/docs.it4i/anselm-cluster-documentation/software/nvidia-cuda.md
+++ b/docs.it4i/anselm-cluster-documentation/software/nvidia-cuda.md
@@ -281,9 +281,8 @@ SAXPY function multiplies the vector x by the scalar alpha and adds it to the ve
 ```
 
 !!! Note
-	Please note: cuBLAS has its own function for data transfers between CPU and GPU memory:
-
-	- [cublasSetVector](http://docs.nvidia.com/cuda/cublas/index.html#cublassetvector) - transfers data from CPU to GPU memory
+    cuBLAS has its own functions for data transfers between CPU and GPU memory:
+    - [cublasSetVector](http://docs.nvidia.com/cuda/cublas/index.html#cublassetvector) - transfers data from CPU to GPU memory
 	- [cublasGetVector](http://docs.nvidia.com/cuda/cublas/index.html#cublasgetvector) - transfers data from GPU to CPU memory
 
 To compile the code using NVCC compiler a "-lcublas" compiler flag has to be specified:
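For completeness, compiling and linking the cuBLAS example referred to above might look like the sketch below; the module name `cuda` and the source file name `test_cublas.cu` are assumptions, not taken from this patch:

```bash
# load the CUDA toolkit and link against cuBLAS explicitly
$ module load cuda
$ nvcc test_cublas.cu -o test_cublas -lcublas
```
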
diff --git a/docs.it4i/salomon/capacity-computing.md b/docs.it4i/salomon/capacity-computing.md
index d79e48342ec6c16eeddd727db368931113865fdd..b8c0cd66e7739a73a5fd5332f54eab694b4bb6ea 100644
--- a/docs.it4i/salomon/capacity-computing.md
+++ b/docs.it4i/salomon/capacity-computing.md
@@ -218,7 +218,8 @@ $ qsub -N JOBNAME jobscript
 
 In this example, we submit a job of 101 tasks. 24 input files will be processed in parallel. The 101 tasks on 24 cores are assumed to complete in less than 2 hours.
 
-Please note the #PBS directives in the beginning of the jobscript file, dont' forget to set your valid PROJECT_ID and desired queue.
+!!! Note
+    Use #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue.
 
 ## Job Arrays and GNU Parallel
 
@@ -302,7 +303,8 @@ $ qsub -N JOBNAME -J 1-992:32 jobscript
 
 In this example, we submit a job array of 31 subjobs. Note the -J 1-992:**48**, this must be the same as the number sent to numtasks file. Each subjob will run on full node and process 24 input files in parallel, 48 in total per subjob. Every subjob is assumed to complete in less than 2 hours.
 
-Please note the #PBS directives in the beginning of the jobscript file, dont' forget to set your valid PROJECT_ID and desired queue.
+!!! Note
+    Use #PBS directives at the beginning of the jobscript file; don't forget to set your valid PROJECT_ID and desired queue.
 
 ## Examples
 
diff --git a/docs.it4i/salomon/prace.md b/docs.it4i/salomon/prace.md
index 1684281f5b02a8f32212211aa09cdf44be1a947f..4c0f22f746830d77a1fc1a05ac599ac868f20c52 100644
--- a/docs.it4i/salomon/prace.md
+++ b/docs.it4i/salomon/prace.md
@@ -202,7 +202,8 @@ Generally both shared file systems are available through GridFTP:
 
 More information about the shared file systems is available [here](storage/).
 
-Please note, that for PRACE users a "prace" directory is used also on the SCRATCH file system.
+!!! Hint
+    For PRACE users, a `prace` directory is used also on the SCRATCH file system.
 
 | Data type | Default path |
 | ---------------------------- | ------------------------------- |
@@ -245,7 +246,7 @@ The resources that are currently subject to accounting are the core hours. The c
 
 PRACE users should check their project accounting using the [PRACE Accounting Tool (DART)](http://www.prace-ri.eu/accounting-report-tool/).
 
-Users who have undergone the full local registration procedure (including signing the IT4Innovations Acceptable Use Policy) and who have received local password may check at any time, how many core-hours have been consumed by themselves and their projects using the command "it4ifree". Please note that you need to know your user password to use the command and that the displayed core hours are "system core hours" which differ from PRACE "standardized core hours".
+Users who have undergone the full local registration procedure (including signing the IT4Innovations Acceptable Use Policy) and who have received a local password may check at any time how many core-hours have been consumed by themselves and their projects using the command "it4ifree". You need to know your user password to use the command; the displayed core hours are "system core hours", which differ from PRACE "standardized core hours".
 
 !!! Note
 	The **it4ifree** command is a part of it4i.portal.clients package, located here: <https://pypi.python.org/pypi/it4i.portal.clients>
diff --git a/docs.it4i/salomon/software/compilers.md b/docs.it4i/salomon/software/compilers.md
index 5f9a9ccbb74efe1c624ea26a5e003717c102a26b..d493d62f006a7bf81af8aff93f3637471dff8be9 100644
--- a/docs.it4i/salomon/software/compilers.md
+++ b/docs.it4i/salomon/software/compilers.md
@@ -138,7 +138,10 @@ To use the Berkley UPC compiler and runtime environment to run the binaries use
 
 As default UPC network the "smp" is used. This is very quick and easy way for testing/debugging, but limited to one node only.
 
-For production runs, it is recommended to use the native InfiniBand implementation of UPC network "ibv". For testing/debugging using multiple nodes, the "mpi" UPC network is recommended. Please note, that the selection of the network is done at the compile time and not at runtime (as expected)!
+For production runs, it is recommended to use the native InfiniBand implementation of the UPC network, "ibv". For testing/debugging using multiple nodes, the "mpi" UPC network is recommended.
+
+!!! Warning
+    The selection of the network is done at compile time, not at runtime (as one might expect)!
 
 Example UPC code:
 
diff --git a/docs.it4i/salomon/software/debuggers/total-view.md b/docs.it4i/salomon/software/debuggers/total-view.md
index 29d2113091b6731e66667054e98056492662860f..508350571a558fc7b564a6800574d85b9447a917 100644
--- a/docs.it4i/salomon/software/debuggers/total-view.md
+++ b/docs.it4i/salomon/software/debuggers/total-view.md
@@ -80,8 +80,8 @@ To debug a serial code use:
 
 To debug a parallel code compiled with **OpenMPI** you need to setup your TotalView environment:
 
-!!! Note
-	**Please note:** To be able to run parallel debugging procedure from the command line without stopping the debugger in the mpiexec source code you have to add the following function to your **~/.tvdrc** file:
+!!! Hint
+    To be able to run the parallel debugging procedure from the command line without stopping the debugger in the mpiexec source code, you have to add the following function to your **~/.tvdrc** file:
 
 ```bash
 proc mpi_auto_run_starter {loaded_id} {
diff --git a/docs.it4i/salomon/software/intel-xeon-phi.md b/docs.it4i/salomon/software/intel-xeon-phi.md
index 4f648d1514a9b8a697741a2a8c01463b41c6766c..9fbdb31eb1736420a6ea254928c8db9676941fea 100644
--- a/docs.it4i/salomon/software/intel-xeon-phi.md
+++ b/docs.it4i/salomon/software/intel-xeon-phi.md
@@ -103,7 +103,10 @@ For debugging purposes it is also recommended to set environment variable "OFFLO
 export OFFLOAD_REPORT=3
 ```
 
-A very basic example of code that employs offload programming technique is shown in the next listing. Please note that this code is sequential and utilizes only single core of the accelerator.
+A very basic example of code that employs the offload programming technique is shown in the next listing.
+
+!!! Note
+    This code is sequential and utilizes only a single core of the accelerator.
 
 ```bash
 $ vim source-offload.cpp
@@ -326,7 +329,7 @@ Following example show how to automatically offload an SGEMM (single precision -
 ```
 
 !!! Note
-	Please note: This example is simplified version of an example from MKL. The expanded version can be found here: **$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c**
+    This example is a simplified version of an example from MKL. The expanded version can be found here: **$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c**
 
 To compile a code using Intel compiler use:
 
@@ -369,7 +372,7 @@ To compile a code user has to be connected to a compute with MIC and load Intel
 ```
 
 !!! Note
-	Please note that particular version of the Intel module is specified. This information is used later to specify the correct library paths.
+    A particular version of the Intel module is specified. This information is used later to specify the correct library paths.
 
 To produce a binary compatible with Intel Xeon Phi architecture user has to specify "-mmic" compiler flag. Two compilation examples are shown below. The first example shows how to compile OpenMP parallel code "vect-add.c" for host only:
 
@@ -412,7 +415,7 @@ If the code is parallelized using OpenMP a set of additional libraries is requir
 ```
 
 !!! Note
-	Please note that the path exported in the previous example contains path to a specific compiler (here the version is 5.192). This version number has to match with the version number of the Intel compiler module that was used to compile the code on the host computer.
+    The path exported contains the path to a specific compiler (here the version is 5.192). This version number has to match the version number of the Intel compiler module that was used to compile the code on the host computer.
 
 For your information the list of libraries and their location required for execution of an OpenMP parallel code on Intel Xeon Phi is:
@@ -537,8 +540,8 @@ To see the performance of Intel Xeon Phi performing the DGEMM run the example as
 ...
 ```
 
-!!! Note
-	Please note: GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load Intel compiler module.
+!!! Hint
+    The GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load the Intel compiler module.
 
 ## MPI
 
@@ -647,8 +650,6 @@ Similarly to execution of OpenMP programs in native mode, since the environmenta
 ```
 
 !!! Note
-	Please note:
-
 	- this file sets up both environmental variable for both MPI and OpenMP libraries.
 	- this file sets up the paths to a particular version of Intel MPI library and particular version of an Intel compiler. These versions have to match with loaded modules.
 
@@ -702,9 +703,8 @@ or using mpirun
 ```
 
 !!! Note
-	Please note:
-	\- the full path to the binary has to specified (here: "**>~/mpi-test-mic**")
-	\- the LD_LIBRARY_PATH has to match with Intel MPI module used to compile the MPI code
+    - the full path to the binary has to be specified (here: "**~/mpi-test-mic**")
+    - the LD_LIBRARY_PATH has to match the Intel MPI module used to compile the MPI code
 
 The output should be again similar to:
 
@@ -715,8 +715,10 @@ The output should be again similar to:
 Hello world from process 0 of 4 on host cn207-mic0
 ```
 
-!!! Note
-	Please note that the **"mpiexec.hydra"** requires a file the MIC filesystem. If the file is missing please contact the system administrators. A simple test to see if the file is present is to execute:
+!!! Hint
+    **"mpiexec.hydra"** requires a file on the MIC filesystem. If the file is missing please contact the system administrators.
+
+A simple test to see if the file is present is to execute:
 
 ```bash
 $ ssh mic0 ls /bin/pmi_proxy
@@ -749,11 +751,10 @@ For example:
 
 This output means that the PBS allocated nodes cn204 and cn205, which means that user has direct access to "**cn204-mic0**" and "**cn-205-mic0**" accelerators.
 
 !!! Note
-	Please note: At this point user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:
-
-	- to connect to the second node : ** $ ssh cn205**
-	- to connect to the accelerator on the first node from the first node: **$ ssh cn204-mic0** or **$ ssh mic0**
-	- to connect to the accelerator on the second node from the first node: **$ ssh cn205-mic0**
+    At this point user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:
+    - to connect to the second node: `$ ssh cn205`
+    - to connect to the accelerator on the first node from the first node: `$ ssh cn204-mic0` or `$ ssh mic0`
+    - to connect to the accelerator on the second node from the first node: `$ ssh cn205-mic0`
 
 At this point we expect that correct modules are loaded and binary is compiled. For parallel execution the mpiexec.hydra is used. Again the first step is to tell mpiexec that the MPI can be executed on MIC accelerators by setting up the environmental variable "I_MPI_MIC"
@@ -882,7 +883,7 @@ A possible output of the MPI "hello-world" example executed on two hosts and two
 ```
 
 !!! Note
-	Please note: At this point the MPI communication between MIC accelerators on different nodes uses 1Gb Ethernet only.
+    At this point the MPI communication between MIC accelerators on different nodes uses 1Gb Ethernet only.
 
 **Using the PBS automatically generated node-files**
 
@@ -895,7 +896,7 @@ PBS also generates a set of node-files that can be used instead of manually crea
 - /lscratch/${PBS_JOBID}/nodefile-mic
 Host and MIC node-file:
 - /lscratch/${PBS_JOBID}/nodefile-mix
 
-Please note each host or accelerator is listed only per files. User has to specify how many jobs should be executed per node using "-n" parameter of the mpirun command.
+Each host or accelerator is listed only once per file. User has to specify how many jobs should be executed per node using the "-n" parameter of the mpirun command.
 
 ## Optimization
 
diff --git a/docs.it4i/salomon/software/numerical-languages/matlab.md b/docs.it4i/salomon/software/numerical-languages/matlab.md
index 95f0e3dde69ad160c495b8b4e5c9cc6dbe0effb0..b9f7bc5a3c5c829e86aea964395c9df3443e15f6 100644
--- a/docs.it4i/salomon/software/numerical-languages/matlab.md
+++ b/docs.it4i/salomon/software/numerical-languages/matlab.md
@@ -129,7 +129,8 @@ The last part of the configuration is done directly in the user Matlab script be
 
 This script creates scheduler object "cluster" of type "local" that starts workers locally.
 
-Please note: Every Matlab script that needs to initialize/use matlabpool has to contain these three lines prior to calling parpool(sched, ...) function.
+!!! Hint
+    Every Matlab script that needs to initialize/use matlabpool has to contain these three lines prior to calling the parpool(sched, ...) function.
 
 The last step is to start matlabpool with "cluster" object and correct number of workers. We have 24 cores per node, so we start 24 workers.
 
@@ -212,7 +213,8 @@ You can start this script using batch mode the same way as in Local mode example
 
 This method is a "hack" invented by us to emulate the mpiexec functionality found in previous MATLAB versions. We leverage the MATLAB Generic Scheduler interface, but instead of submitting the workers to PBS, we launch the workers directly within the running job, thus we avoid the issues with master script and workers running in separate jobs (issues with license not available, waiting for the worker's job to spawn etc.)
 
-Please note that this method is experimental.
+!!! Warning
+    This method is experimental.
 
 For this method, you need to use SalomonDirect profile, import it using [the same way as SalomonPBSPro](matlab.md#running-parallel-matlab-using-distributed-computing-toolbox---engine)