Commit 201844d6 authored by David Hrbáč

Note clean-up

parent 09989368
Pipeline #1832 passed with stages in 1 minute and 4 seconds
......@@ -6,7 +6,7 @@ In many cases, it is useful to submit huge (>100+) number of computational jobs
However, executing a huge number of jobs via the PBS queue may strain the system. This strain may result in slow response to commands, inefficient scheduling, and overall degradation of performance and user experience for all users. For this reason, the number of jobs is **limited to 100 per user, 1000 per job array**.
!!! Note "Note"
!!! Note
Please follow one of the procedures below if you wish to schedule more than 100 jobs at a time.
- Use [Job arrays](capacity-computing/#job-arrays) when running a huge number of [multithreaded](capacity-computing/#shared-jobscript-on-one-node) (bound to one node only) or multinode (multithreaded across several nodes) jobs
......@@ -20,7 +20,7 @@ However, executing huge number of jobs via the PBS queue may strain the system.
## Job Arrays
!!! Note "Note"
!!! Note
A huge number of jobs may be easily submitted and managed as a job array.
A job array is a compact representation of many jobs, called subjobs. The subjobs share the same job script, and have the same values for all attributes and resources, with the following exceptions:
......@@ -149,7 +149,7 @@ Read more on job arrays in the [PBSPro Users guide](../../pbspro-documentation/)
## GNU Parallel
!!! Note "Note"
!!! Note
Use GNU parallel to run many single core tasks on one node.
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. GNU parallel is most useful for running single-core jobs via the queue system on Anselm.
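As a minimal sketch (the module name, the task list file and the program name are illustrative), one task per line of the input may be run on all 16 cores of a node like this:

```bash
$ module load parallel
# keep 16 tasks running at a time, one per core; {} expands to one line of tasklist
$ cat tasklist | parallel --jobs 16 ./myprog.x {}
```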
......@@ -220,12 +220,12 @@ Please note the #PBS directives in the beginning of the jobscript file, dont' fo
## Job Arrays and GNU Parallel
!!! Note "Note"
!!! Note
Combine job arrays and GNU parallel for the best throughput of single-core jobs
While job arrays are able to utilize all available computational nodes, GNU parallel can be used to efficiently run multiple single-core jobs on a single node. The two approaches may be combined to utilize all available (current and future) resources to execute single-core jobs.
!!! Note "Note"
!!! Note
Every subjob in an array runs GNU parallel to utilize all cores on the node
### GNU Parallel, Shared jobscript
......@@ -280,7 +280,7 @@ cp output $PBS_O_WORKDIR/$TASK.out
In this example, the jobscript executes in multiple instances in parallel, on all cores of a computing node. The variable $TASK expands to one of the input filenames from tasklist. We copy the input file to the local scratch, execute myprog.x and copy the output file back to the submit directory, under the $TASK.out name. The numtasks file controls how many tasks will be run per subjob. Once a task is finished, a new task starts, until the number of tasks in the numtasks file is reached.
!!! Note "Note"
!!! Note
Select subjob walltime and number of tasks per subjob carefully
When deciding these values, keep the following guiding rules in mind:
......
......@@ -23,14 +23,14 @@ then
fi
```
!!! Note "Note"
!!! Note
Do not run commands that write to standard output (echo, module list, etc.) in .bashrc for non-interactive SSH sessions. It breaks fundamental functionality (scp, PBS) of your account! Consider testing for SSH session interactivity before running such commands, as stated in the previous example.
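A minimal sketch of such an interactivity test in .bashrc (the welcome message is illustrative):

```bash
# Only run commands producing output when the shell is interactive;
# non-interactive sessions (scp, PBS) skip this block
if [ -n "$PS1" ]; then
  echo "Welcome on Anselm"
  module list
fi
```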
### Application Modules
In order to configure your shell for running a particular application on Anselm, we use the Module package interface.
!!! Note "Note"
!!! Note
The modules set up the application paths, library paths and environment variables for running a particular application.
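Typical module operations look like this (the module name is illustrative):

```bash
$ module avail                    # list available modules
$ module load intel/13.5.192      # set up the environment for the chosen application
$ module list                     # show currently loaded modules
```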
We also have a second modules repository. This repository is created using a tool called EasyBuild. On the Salomon cluster, all modules will be built by this tool. If you want to use software from this repository, please follow the instructions in the section [Application Modules Path Expansion](environment-and-modules/#EasyBuild).
......
......@@ -35,7 +35,7 @@ usage<sub>Total</sub> is total usage by all users, by all projects.
Usage counts allocated core-hours (`ncpus x walltime`). Usage is decayed, or cut in half, periodically at an interval of 168 hours (one week).
Jobs queued in the qexp queue are not counted towards the project's usage.
!!! Note "Note"
!!! Note
Calculated usage and fair-share priority can be seen at <https://extranet.it4i.cz/anselm/projects>.
Calculated fair-share priority can also be seen as the Resource_List.fairshare attribute of a job.
......@@ -64,7 +64,7 @@ The scheduler makes a list of jobs to run in order of execution priority. Schedu
This means that jobs with lower execution priority can be run before jobs with higher execution priority.
!!! Note "Note"
!!! Note
It is **very beneficial to specify the walltime** when submitting jobs.
Specifying a more accurate walltime enables better scheduling, better execution times and better resource usage. Jobs with a suitable (small) walltime may be backfilled and overtake job(s) with higher priority.
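For example, a job known to finish within two hours might be submitted as follows (the project ID, queue and resource selection are illustrative):

```bash
$ qsub -A PROJECT-ID -q qprod -l select=4:ncpus=16 -l walltime=02:00:00 ./myjob
```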
......@@ -11,7 +11,7 @@ When allocating computational resources for the job, please specify
5. Project ID
6. Jobscript or interactive switch
!!! Note "Note"
!!! Note
Use the **qsub** command to submit your job to a queue for allocation of the computational resources.
Submit the job using the qsub command:
......@@ -132,7 +132,7 @@ Although this example is somewhat artificial, it demonstrates the flexibility of
## Job Management
!!! Note "Note"
!!! Note
Check status of your jobs using the **qstat** and **check-pbs-jobs** commands
```bash
......@@ -213,7 +213,7 @@ Run loop 3
In this example, we see actual output (some iteration loops) of the job 35141.dm2.
!!! Note "Note"
!!! Note
Manage your queued or running jobs using the **qhold**, **qrls**, **qdel**, **qsig** or **qalter** commands
You may release your allocation at any time using the qdel command
......@@ -238,12 +238,12 @@ $ man pbs_professional
### Jobscript
!!! Note "Note"
!!! Note
Prepare the jobscript to run batch jobs in the PBS queue system
The jobscript is a user-made script controlling the sequence of commands for executing the calculation. It is often written in bash; other scripting languages may be used as well. The jobscript is supplied to the PBS **qsub** command as an argument and executed by the PBS Professional workload manager.
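A minimal jobscript might look like this sketch (the module name and the program name are illustrative):

```bash
#!/bin/bash
# change to the directory where qsub was invoked
cd $PBS_O_WORKDIR
# set up the environment and run the calculation
module load intel/13.5.192
./myprog.x
```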
!!! Note "Note"
!!! Note
The jobscript or interactive shell is executed on the first of the allocated nodes.
```bash
......@@ -273,7 +273,7 @@ $ pwd
In this example, 4 nodes were allocated interactively for 1 hour via the qexp queue. The interactive shell is executed in the home directory.
!!! Note "Note"
!!! Note
All nodes within the allocation may be accessed via ssh. Unallocated nodes are not accessible to the user.
The allocated nodes are accessible via ssh from login nodes. The nodes may access each other via ssh as well.
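For instance, the allocated nodes may be listed and one of them accessed like this (the node name cn17 is illustrative):

```bash
# list the nodes allocated to this job, then log into one of them
$ cat $PBS_NODEFILE
$ ssh cn17
```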
......@@ -305,7 +305,7 @@ In this example, the hostname program is executed via pdsh from the interactive
### Example Jobscript for MPI Calculation
!!! Note "Note"
!!! Note
Production jobs must use the /scratch directory for I/O
The recommended way to run production jobs is to change to the /scratch directory early in the jobscript, copy all inputs to /scratch, execute the calculations and copy the outputs to the home directory.
......@@ -337,12 +337,12 @@ exit
In this example, a directory on /home holds the input file input and the executable mympiprog.x. We create the directory myjob on the /scratch filesystem, copy the input and executable files from the /home directory where qsub was invoked ($PBS_O_WORKDIR) to /scratch, execute the MPI program mympiprog.x and copy the output file back to the /home directory. The mympiprog.x is executed as one process per node, on all allocated nodes.
!!! Note "Note"
!!! Note
Consider preloading inputs and executables onto [shared scratch](storage/) before the calculation starts.
In some cases, it may be impractical to copy the inputs to scratch and the outputs to home. This is especially true when very large input and output files are expected, or when the files should be reused by a subsequent calculation. In such a case, it is the user's responsibility to preload the input files on the shared /scratch before the job submission and retrieve the outputs manually after all calculations are finished.
!!! Note "Note"
!!! Note
Store the qsub options within the jobscript. Use **mpiprocs** and **ompthreads** qsub options to control the MPI job execution.
Example jobscript for an MPI job with preloaded inputs and executables, with the qsub options stored within the script:
......@@ -375,7 +375,7 @@ sections.
### Example Jobscript for Single Node Calculation
!!! Note "Note"
!!! Note
The local scratch directory is often useful for single-node jobs. Local scratch will be deleted immediately after the job ends.
Example jobscript for a single-node calculation, using [local scratch](storage/) on the node:
......
......@@ -8,7 +8,7 @@ All compute and login nodes of Anselm are interconnected by a high-bandwidth, lo
The compute nodes may be accessed via the InfiniBand network using the ib0 network interface, in the address range 10.2.1.1-209. MPI may be used to establish a native InfiniBand connection among the nodes.
!!! Note "Note"
!!! Note
The network provides **2170 MB/s** transfer rates via the TCP connection (single stream) and up to **3600 MB/s** via native InfiniBand protocol.
The Fat tree topology ensures that peak transfer rates are achieved between any two nodes, independent of network traffic exchanged among other nodes concurrently.
......
......@@ -235,7 +235,7 @@ PRACE users should check their project accounting using the [PRACE Accounting To
Users who have undergone the full local registration procedure (including signing the IT4Innovations Acceptable Use Policy) and who have received a local password may check at any time how many core-hours have been consumed by themselves and their projects using the command "it4ifree". Please note that you need to know your user password to use the command and that the displayed core hours are "system core hours" which differ from PRACE "standardized core hours".
!!! Note "Note"
!!! Note
The **it4ifree** command is a part of the it4i.portal.clients package, located here: <https://pypi.python.org/pypi/it4i.portal.clients>
```bash
......
......@@ -12,14 +12,14 @@ The resources are allocated to the job in a fair-share fashion, subject to const
- **qnvidia**, **qmic**, **qfat**, the Dedicated queues
- **qfree**, the Free resource utilization queue
!!! Note "Note"
!!! Note
Check the queue status at <https://extranet.it4i.cz/anselm/>
Read more on the [Resource Allocation Policy](resources-allocation-policy/) page.
## Job Submission and Execution
!!! Note "Note"
!!! Note
Use the **qsub** command to submit your jobs.
qsub submits the job into the queue. The qsub command creates a request to the PBS Job manager for allocation of the specified resources. The **smallest allocation unit is an entire node, 16 cores**, with the exception of the qexp queue. The resources will be allocated when available, subject to allocation policies and constraints. **After the resources are allocated, the jobscript or interactive shell is executed on the first of the allocated nodes.**
......@@ -28,7 +28,7 @@ Read more on the [Job submission and execution](job-submission-and-execution/) p
## Capacity Computing
!!! Note "Note"
!!! Note
Use job arrays when running a huge number of jobs.
Use GNU Parallel and/or job arrays when running (many) single-core jobs.
......
......@@ -4,7 +4,7 @@
The resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue and the resources available to the Project. The fair-share at Anselm ensures that individual users may consume approximately equal amounts of resources per week. Detailed information can be found in the [Job scheduling](job-priority/) section. The resources are accessible via several queues for queueing the jobs. The queues provide prioritized and exclusive access to the computational resources. The following table provides the queue partitioning overview:
!!! Note "Note"
!!! Note
Check the queue status at <https://extranet.it4i.cz/anselm/>
| queue | active project | project resources | nodes | min ncpus | priority | authorization | walltime |
......@@ -15,7 +15,7 @@ The resources are allocated to the job in a fair-share fashion, subject to const
| qnvidia, qmic, qfat | yes | 0 | 23 total qnvidia, 4 total qmic, 2 total qfat | 16 | 200 | yes | 24/48 h |
| qfree | yes | none required | 178 w/o accelerator | 16 | -1024 | no | 12 h |
!!! Note "Note"
!!! Note
**The qfree queue is not free of charge**. [Normal accounting](#resources-accounting-policy) applies. However, it allows for utilization of free resources, once a Project has exhausted all its allocated computational resources. This does not apply to Director's Discretion projects (DD projects) by default. Usage of qfree after exhaustion of a DD project's computational resources is allowed upon request for this queue.
**The qexp queue is equipped with nodes that do not all have the same CPU clock speed.** Should you need the very same CPU speed, you have to select the proper nodes during the PBS job submission.
......@@ -113,7 +113,7 @@ The resources that are currently subject to accounting are the core-hours. The c
### Check Consumed Resources
!!! Note "Note"
!!! Note
The **it4ifree** command is a part of the it4i.portal.clients package, located here: <https://pypi.python.org/pypi/it4i.portal.clients>
Users may check at any time how many core-hours have been consumed by themselves and their projects. The command is available on the clusters' login nodes.
......
......@@ -53,7 +53,7 @@ Last login: Tue Jul 9 15:57:38 2013 from your-host.example.com
Example of the cluster login:
!!! Note "Note"
!!! Note
The environment is **not** shared between login nodes, except for [shared filesystems](storage/#shared-filesystems).
## Data Transfer
......@@ -69,14 +69,14 @@ Data in and out of the system may be transferred by the [scp](http://en.wikipedi
The authentication is by the [private key](../get-started-with-it4innovations/accessing-the-clusters/shell-access-and-data-transfer/ssh-keys/)
!!! Note "Note"
!!! Note
Data transfer rates up to **160MB/s** can be achieved with scp or sftp.
1TB may be transferred in 1:50h.
To achieve 160MB/s transfer rates, the end user must be connected by a 10G line all the way to IT4Innovations and use a computer with a fast processor for the transfer. Using a Gigabit Ethernet connection, up to 110MB/s may be expected. A fast cipher (aes128-ctr) should be used, for example as shown below.
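A sketch of such a transfer (the file name and the target directory are illustrative):

```bash
# request the fast aes128-ctr cipher explicitly
$ scp -c aes128-ctr mydata.tar username@anselm.it4i.cz:/scratch/username/
```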
!!! Note "Note"
!!! Note
If you experience degraded data transfer performance, consult your local network provider.
On Linux or Mac, use an scp or sftp client to transfer the data to Anselm:
......@@ -126,7 +126,7 @@ Outgoing connections, from Anselm Cluster login nodes to the outside world, are
| 443 | https |
| 9418 | git |
!!! Note "Note"
!!! Note
Please use **ssh port forwarding** and proxy servers to connect from Anselm to all other remote ports.
Outgoing connections from Anselm Cluster compute nodes are restricted to the internal network. Direct connections from compute nodes to the outside world are cut.
......@@ -135,7 +135,7 @@ Outgoing connections, from Anselm Cluster compute nodes are restricted to the in
### Port Forwarding From Login Nodes
!!! Note "Note"
!!! Note
Port forwarding allows an application running on Anselm to connect to an arbitrary remote host and port.
It works by tunneling the connection from Anselm back to the user's workstation and forwarding from the workstation to the remote host.
......@@ -177,7 +177,7 @@ In this example, we assume that port forwarding from login1:6000 to remote.host.
Port forwarding is static; each single port is mapped to a particular port on the remote host. A connection to another remote host requires a new forward.
!!! Note "Note"
!!! Note
Applications with built-in proxy support experience unlimited access to remote hosts via a single proxy server.
To establish a local proxy server on your workstation, install and run SOCKS proxy server software. On Linux, the sshd daemon provides the functionality. To establish a SOCKS proxy server listening on port 1080, run:
......
......@@ -32,7 +32,7 @@ Compilation parameters are default:
Molpro is compiled for parallel execution using MPI and OpenMP. By default, Molpro reads the number of allocated nodes from PBS and launches a data server on one node. On the remaining allocated nodes, compute processes are launched, one process per node, each with 16 threads. You can modify this behavior by using -n, -t and helper-server options. Please refer to the [Molpro documentation](http://www.molpro.net/info/2010.1/doc/manual/node9.html) for more details.
!!! Note "Note"
!!! Note
The OpenMP parallelization in Molpro is limited and has been observed to produce limited scaling. We therefore recommend using MPI parallelization only. This can be achieved by passing the option mpiprocs=16:ompthreads=1 to PBS.
You are advised to use the -d option to point to a directory in the [SCRATCH file system](../../storage/storage/). Molpro can produce a large amount of temporary data during its run, and it is important that these files are placed in the fast scratch file system.
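A sketch of such a submission (the project ID, jobscript name, input file and scratch path are illustrative):

```bash
# request 16 MPI processes with 1 thread each, as recommended above
$ qsub -A PROJECT-ID -q qprod -l select=1:ncpus=16:mpiprocs=16:ompthreads=1 ./run_molpro.sh
```

Inside the jobscript, Molpro would then be started with -d pointing to a scratch directory, e.g. `molpro -d /scratch/$USER/$PBS_JOBID input.com`.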
......
......@@ -20,7 +20,7 @@ The module sets up environment variables, required for using the Allinea Perform
## Usage
!!! Note "Note"
!!! Note
Use the perf-report wrapper on your (MPI) program.
Instead of [running your MPI program the usual way](../mpi/), use the perf-report wrapper:
......
......@@ -190,7 +190,7 @@ Now the compiler won't remove the multiplication loop. (However it is still not
### Intel Xeon Phi
!!! Note "Note"
!!! Note
PAPI currently supports only a subset of counters on the Intel Xeon Phi processor compared to Intel Xeon, for example the floating point operations counter is missing.
To use PAPI in [Intel Xeon Phi](../intel-xeon-phi/) native applications, you need to load a module with the "-mic" suffix, for example "papi/5.3.2-mic":
......
......@@ -4,7 +4,7 @@
Intel Integrated Performance Primitives, version 7.1.1, compiled for AVX vector instructions, is available via the module ipp. IPP is a very rich library of highly optimized algorithmic building blocks for media and data applications. This includes signal, image and frame processing algorithms, such as FFT, FIR, Convolution, Optical Flow, Hough transform, Sum, MinMax, as well as cryptographic functions, linear algebra functions and many more.
!!! Note "Note"
!!! Note
Check out IPP before implementing your own math functions for data processing; what you need is likely already there.
```bash
......
......@@ -23,7 +23,7 @@ Intel MKL version 13.5.192 is available on Anselm
The module sets up the environment variables required for linking and running MKL-enabled applications. The most important variables are $MKLROOT, $MKL_INC_DIR, $MKL_LIB_DIR and $MKL_EXAMPLES.
!!! Note "Note"
!!! Note
The MKL library may be linked using any compiler. With the Intel compiler, use the -mkl option to link the default threaded MKL.
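For example, compiling and linking a program against the threaded MKL with the Intel compiler might look like this (the source file name is illustrative):

```bash
$ module load mkl
# -mkl links the default threaded MKL
$ icc -mkl myprog.c -o myprog.x
```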
### Interfaces
......@@ -47,7 +47,7 @@ You will need the mkl module loaded to run the mkl enabled executable. This may
### Threading
!!! Note "Note"
!!! Note
An advantage of using the MKL library is that it brings threaded parallelization to applications that are otherwise not parallel.
For this to work, the application must link the threaded MKL library (default). The number and behaviour of MKL threads may be controlled via the OpenMP environment variables, such as OMP_NUM_THREADS and KMP_AFFINITY. MKL_NUM_THREADS takes precedence over OMP_NUM_THREADS.
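A sketch of controlling the MKL threads via the environment (the thread count is illustrative):

```bash
$ export OMP_NUM_THREADS=16                       # number of MKL/OpenMP threads
$ export KMP_AFFINITY=granularity=fine,compact    # pin threads to cores
$ ./myprog.x
```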
......
......@@ -13,7 +13,7 @@ Intel TBB version 4.1 is available on Anselm
The module sets up the environment variables required for linking and running TBB-enabled applications.
!!! Note "Note"
!!! Note
Link the tbb library using -ltbb.
## Examples
......
......@@ -229,7 +229,7 @@ During the compilation Intel compiler shows which loops have been vectorized in
Some interesting compiler flags useful not only for code debugging are:
!!! Note "Note"
!!! Note
Debugging
openmp_report[0|1|2] - controls the compiler-based OpenMP parallelization diagnostic level
......@@ -326,7 +326,7 @@ Following example show how to automatically offload an SGEMM (single precision -
}
```
!!! Note "Note"
!!! Note
Please note: This example is a simplified version of an example from MKL. The expanded version can be found here: **$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c**
To compile the code using the Intel compiler, use:
......@@ -369,7 +369,7 @@ To compile a code user has to be connected to a compute with MIC and load Intel
$ module load intel/13.5.192
```
!!! Note "Note"
!!! Note
Please note that a particular version of the Intel module is specified. This information is used later to specify the correct library paths.
To produce a binary compatible with the Intel Xeon Phi architecture, the user has to specify the "-mmic" compiler flag. Two compilation examples are shown below. The first example shows how to compile the OpenMP parallel code "vect-add.c" for the host only:
......@@ -412,12 +412,12 @@ If the code is parallelized using OpenMP a set of additional libraries is requir
mic0 $ export LD_LIBRARY_PATH=/apps/intel/composer_xe_2013.5.192/compiler/lib/mic:$LD_LIBRARY_PATH
```
!!! Note "Note"
!!! Note
Please note that the path exported in the previous example contains a path to a specific compiler (here the version is 5.192). This version number has to match the version number of the Intel compiler module that was used to compile the code on the host computer.
For your information, the list of libraries and their location required for execution of an OpenMP parallel code on Intel Xeon Phi is:
!!! Note "Note"
!!! Note
/apps/intel/composer_xe_2013.5.192/compiler/lib/mic
- libiomp5.so
......@@ -498,7 +498,7 @@ After executing the complied binary file, following output should be displayed.
...
```
!!! Note "Note"
!!! Note
More information about this example can be found on the Intel website: <http://software.intel.com/en-us/vcsource/samples/caps-basic/>
The second example, which can be found in the "/apps/intel/opencl-examples" directory, is General Matrix Multiply. You can follow the same procedure to download the example to your directory and compile it.
......@@ -538,7 +538,7 @@ To see the performance of Intel Xeon Phi performing the DGEMM run the example as
...
```
!!! Note "Note"
!!! Note
Please note: The GNU compiler is used to compile the OpenCL codes for Intel MIC. You do not need to load the Intel compiler module.
## MPI
......@@ -600,7 +600,7 @@ An example of basic MPI version of "hello-world" example in C language, that can
Intel MPI for the Xeon Phi coprocessors offers different MPI programming models:
!!! Note "Note"
!!! Note
**Host-only model** - all MPI ranks reside on the host. The coprocessors can be used by using offload pragmas. (Using MPI calls inside offloaded code is not supported.)
**Coprocessor-only model** - all MPI ranks reside only on the coprocessors.
......@@ -647,7 +647,7 @@ Similarly to execution of OpenMP programs in native mode, since the environmenta
export PATH=/apps/intel/impi/4.1.1.036/mic/bin/:$PATH
```
!!! Note "Note"
!!! Note
Please note:
\- this file sets up environmental variables for both the MPI and OpenMP libraries.
\- this file sets up the paths to a particular version of the Intel MPI library and a particular version of the Intel compiler. These versions have to match the loaded modules.
......@@ -701,7 +701,7 @@ or using mpirun
$ mpirun -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ -host mic0 -n 4 ~/mpi-test-mic
```
!!! Note "Note"
!!! Note
Please note:
\- the full path to the binary has to be specified (here: "**~/mpi-test-mic**")
\- the LD_LIBRARY_PATH has to match the Intel MPI module used to compile the MPI code
......@@ -715,7 +715,7 @@ The output should be again similar to:
Hello world from process 0 of 4 on host cn207-mic0
```
!!! Note "Note"
!!! Note
Please note that **"mpiexec.hydra"** requires a file on the MIC filesystem. If the file is missing, please contact the system administrators. A simple test to see if the file is present is to execute:
```bash
......@@ -748,7 +748,7 @@ For example:
This output means that PBS allocated the nodes cn204 and cn205, which means that the user has direct access to the "**cn204-mic0**" and "**cn205-mic0**" accelerators.
!!! Note "Note"
!!! Note
Please note: At this point, the user can connect to any of the allocated nodes or any of the allocated MIC accelerators using ssh:
- to connect to the second node: **$ ssh cn205**
......@@ -881,14 +881,14 @@ A possible output of the MPI "hello-world" example executed on two hosts and two
Hello world from process 7 of 8 on host cn205-mic0
```
!!! Note "Note"
!!! Note
Please note: At this point the MPI communication between MIC accelerators on different nodes uses 1Gb Ethernet only.
**Using the PBS automatically generated node-files**
PBS also generates a set of node-files that can be used instead of manually creating a new one every time. Three node-files are generated:
!!! Note "Note"
!!! Note
**Host only node-file:**
- /lscratch/${PBS_JOBID}/nodefile-cn

**MIC only node-file:**
......
......@@ -10,7 +10,7 @@ If an ISV application was purchased for educational (research) purposes and also
## Overview of the Licenses Usage
!!! Note "Note"
!!! Note
The overview is generated every minute and is accessible from the web or command line interface.
### Web Interface
......
......@@ -38,7 +38,7 @@ For running Windows application (when source code and Linux native application a
IT4Innovations does not provide any licenses for operating systems and software of virtual machines. Users are (in accordance with the [Acceptable use policy document](http://www.it4i.cz/acceptable-use-policy.pdf)) fully responsible for licensing all software running in virtual machines on Anselm. Be aware of the complex conditions of licensing software in virtual environments.
!!! Note "Note"
!!! Note
Users are responsible for licensing the OS (e.g. MS Windows) and all software running in their virtual machines.
## Howto
......@@ -248,7 +248,7 @@ Run virtual machine using optimized devices, user network back-end with sharing
Thanks to port forwarding, you can access the virtual machine via SSH (Linux) or RDP (Windows) by connecting to the IP address of the compute node (and port 2222 for SSH). You must use the VPN network.
!!! Note "Note"
!!! Note
Keep in mind that if you use virtio devices, you must have the virtio drivers installed on your virtual machine.
### Networking and Data Sharing
......
......@@ -60,7 +60,7 @@ In this example, the openmpi 1.6.5 using intel compilers is activated
## Compiling MPI Programs
!!! Note "Note"
!!! Note
After setting up your MPI environment, compile your program using one of the mpi wrappers
```bash
......@@ -107,7 +107,7 @@ Compile the above example with
## Running MPI Programs
!!! Note "Note"
!!! Note
The MPI program executable must be compatible with the loaded MPI module.
Always compile and execute using the very same MPI module.
......@@ -119,7 +119,7 @@ The MPI program executable must be available within the same path on all nodes.
The optimal way to run an MPI program depends on its memory requirements, memory access pattern and communication pattern.
!!! Note "Note"
!!! Note
Consider these ways to run an MPI program:
1. One MPI process per node, 16 threads per process
......@@ -130,7 +130,7 @@ Optimal way to run an MPI program depends on its memory requirements, memory acc
**Two MPI** processes per node, using 8 threads each, bound to a processor socket, are most useful for memory bandwidth bound applications such as BLAS1 or FFT with scalable memory demand. However, note that the two processes will share access to the network interface. The 8 threads and socket binding should ensure maximum memory access bandwidth and minimize communication, migration and NUMA effect overheads.
!!! Note "Note"
!!! Note
Important! Bind every OpenMP thread to a core!
In the previous two cases with one or two MPI processes per node, the operating system might still migrate OpenMP threads between cores. You want to avoid this by setting the KMP_AFFINITY or GOMP_CPU_AFFINITY environment variables, for example as sketched below.
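A sketch of such binding for the Intel and GNU OpenMP runtimes, respectively:

```bash
# Intel OpenMP runtime: pin threads to cores
$ export KMP_AFFINITY=granularity=fine,compact
# GNU OpenMP runtime: pin threads to cores 0-15
$ export GOMP_CPU_AFFINITY="0-15"
```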
......
......@@ -6,7 +6,7 @@ The MPICH2 programs use mpd daemon or ssh connection to spawn processes, no PBS
### Basic Usage
!!! Note "Note"
!!! Note
Use mpirun to execute the MPICH2 code.
Example:
......@@ -43,7 +43,7 @@ You need to preload the executable, if running on the local scratch /lscratch fi
In this example, we assume the executable helloworld_mpi.x is present in the shared home directory. We run the cp command via mpirun, copying the executable from the shared home to the local scratch. The second mpirun will execute the binary in the /lscratch/15210.srv11 directory on nodes cn17, cn108, cn109 and cn110, one process per node.
!!! Note "Note"
!!! Note
MPI process mapping may be controlled by PBS parameters.
The mpiprocs and ompthreads parameters allow for the selection of the number of running MPI processes per node as well as the number of OpenMP threads per MPI process, as illustrated below.
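For example, to run 4 MPI processes per node with 4 OpenMP threads each (the queue and node count are illustrative):

```bash
$ qsub -q qprod -l select=2:ncpus=16:mpiprocs=4:ompthreads=4 ./myjob
```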
......@@ -92,7 +92,7 @@ In this example, we demonstrate recommended way to run an MPI application, using
### OpenMP Thread Affinity
!!! Note "Note"
!!! Note
Important! Bind every OpenMP thread to a core!
In the previous two examples with one or two MPI processes per node, the operating system might still migrate OpenMP threads between cores. You might want to avoid this by setting these environment variables for GCC OpenMP:
......
......@@ -41,7 +41,7 @@ plots, images, etc... will be still available.
## Running Parallel Matlab Using Distributed Computing Toolbox / Engine
!!! Note "Note"
!!! Note
The Distributed Computing Toolbox is available only for the EDU variant
The MPIEXEC mode available in previous versions is no longer available in MATLAB 2015. Also, the programming interface has changed. Refer to [Release Notes](http://www.mathworks.com/help/distcomp/release-notes.html#buanp9e-1).
......@@ -64,7 +64,7 @@ Or in the GUI, go to tab HOME -> Parallel -> Manage Cluster Profiles..., click I
With the new mode, MATLAB itself launches the workers via PBS, so you can either use interactive mode or a batch mode on one node, but the actual parallel processing will be done in a separate job started by MATLAB itself. Alternatively, you can use "local" mode to run parallel code on just a single node.
!!! Note "Note"
!!! Note
The profile is confusingly named Salomon, but you can also use it on Anselm.
### Parallel Matlab Interactive Session
......@@ -133,7 +133,7 @@ The last part of the configuration is done directly in the user Matlab script be
This script creates scheduler object "cluster" of type "local" that starts workers locally.
!!! Note "Note"
!!! Note
Please note: Every Matlab script that needs to initialize/use matlabpool has to contain these three lines prior to calling the parpool(sched, ...) function.
The last step is to start matlabpool with the "cluster" object and the correct number of workers. We have 24 cores per node, so we start 24 workers.
......
......@@ -2,7 +2,7 @@
## Introduction
!!! Note "Note"
!!! Note
This document relates to the old versions R2013 and R2014. For MATLAB 2015, please use [this documentation instead](matlab/).
Matlab is available in the latest stable version. There are always two variants of the release:
......@@ -71,7 +71,7 @@ extras = {};
The system MPI library allows Matlab to communicate through the 40 Gbit/s InfiniBand QDR interconnect instead of the slower 1 Gbit Ethernet network.
!!! Note "Note"
!!! Note
The path to the MPI library in "mpiLibConf.m" has to match the version of the loaded Intel MPI module. In this example, version 4.1.1.036 of Intel MPI is used by Matlab and therefore the module impi/4.1.1.036 has to be loaded prior to starting Matlab.
### Parallel Matlab Interactive Session
......@@ -144,7 +144,7 @@ set(sched, 'EnvironmentSetMethod', 'setenv');
This script creates scheduler object "sched" of type "mpiexec" that starts workers using mpirun tool. To use correct version of mpirun, the second line specifies the path to correct version of system Intel MPI library.
!!! Note "Note"
!!! Note
Every Matlab script that needs to initialize/use matlabpool has to contain these three lines prior to calling the matlabpool(sched, ...) function.
The last step is to start matlabpool with the "sched" object and the correct number of workers. In this case, qsub asked for a total of 32 cores, therefore the number of workers is also set to 32.
......
......@@ -96,7 +96,7 @@ A version of [native](../intel-xeon-phi/#section-4) Octave is compiled for Xeon
Octave is linked with the parallel Intel MKL, so it is best suited for batch processing of tasks that utilize BLAS, LAPACK and FFT operations. By default, the number of threads is set to 120; you can control this with the OMP_NUM_THREADS environment variable.
!!! Note "Note"
!!! Note
Calculations that do not employ parallelism (either by using parallel MKL, e.g. via matrix operations, the fork() function, the [parallel package](http://octave.sourceforge.net/parallel/) or another mechanism) will actually run slower than on the host CPU.
To use Octave on a node with Xeon Phi: