Intel Inspector
===============
Intel Inspector is a dynamic memory and threading error checking tool
for C/C++/Fortran applications. It can detect issues such as memory
leaks, invalid memory references, uninitialized variables, race
conditions, deadlocks, etc.
Installed versions
------------------
The following versions are currently available on Salomon as modules:
  Version         Module
  --------------- -------------------------
  2016 Update 1   Inspector/2016_update1
Usage
-----
Your program should be compiled with the -g switch to include symbol names.
Optimizations can be turned on.
Debugging is possible either directly from the GUI, or from the command
line.
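For example, a minimal compile sketch (the source file name myprog.c is
illustrative only):
$ module load intel
$ icc -g -O2 myprog.c -o myprog.x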
### GUI mode
To debug from GUI, launch Inspector:
$ inspxe-gui &
Then select File -> New -> Project from the menu. Choose a directory to
save the project data to. After clicking OK, the Project Properties window
will appear, where you can configure the path to your binary, launch
arguments, working directory, etc. After clicking OK, the project is ready.
In the main pane, you can start a predefined analysis type or define
your own. Click Start to start the analysis. Alternatively, you can
click Command Line to see the command line required to run the
analysis directly from the command line.
### Batch mode
Analysis can also be run from the command line in batch mode. Batch mode
analysis is run with the command inspxe-cl.
To obtain the required parameters, either consult the documentation or
configure the analysis in the GUI and then click the "Command Line"
button in the lower right corner to see the respective command line.
Results obtained in batch mode can then be viewed in the GUI by
selecting File -> Open -> Result...
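A minimal batch-mode sketch (mi2 is one of Inspector's predefined memory
error analysis types; the result directory name is illustrative - verify
the exact options with inspxe-cl -help):
$ inspxe-cl -collect mi2 -result-dir ./inspector_result -- ./myprog.x
$ inspxe-cl -report problems -result-dir ./inspector_result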
References
----------
1. [Product page](https://software.intel.com/en-us/intel-inspector-xe)
2. [Documentation and Release Notes](https://software.intel.com/en-us/intel-inspector-xe-support/documentation)
3. [Tutorials](https://software.intel.com/en-us/articles/inspectorxe-tutorials)
Intel IPP
=========
Intel Integrated Performance Primitives
---------------------------------------
Intel Integrated Performance Primitives, version 9.0.1, compiled for
AVX2 vector instructions, is available via the ipp module. IPP is a very
rich library of highly optimized algorithmic building blocks for media
and data applications. This includes signal, image and frame processing
algorithms, such as FFT, FIR, Convolution, Optical Flow, Hough
transform, Sum, MinMax, as well as cryptographic functions, linear
algebra functions and many more.
Check out IPP before implementing your own math functions for data
processing; the functionality you need is likely already there.
$ module load ipp
The module sets up the environment variables required for linking and
running IPP-enabled applications.
IPP example
-----------
#include "ipp.h"
#include <stdio.h>
int main(int argc, char* argv[])
{
const IppLibraryVersion *lib;
Ipp64u fm;
IppStatus status;
status= ippInit(); //IPP initialization with the best optimization layer
if( status != ippStsNoErr ) {
printf("IppInit() Error:n");
printf("%sn", ippGetStatusString(status) );
return -1;
}
//Get version info
lib = ippiGetLibVersion();
printf("%s %sn", lib->Name, lib->Version);
//Get CPU features enabled with selected library level
fm=ippGetEnabledCpuFeatures();
printf("SSE :%cn",(fm>>1)&1?'Y':'N');
printf("SSE2 :%cn",(fm>>2)&1?'Y':'N');
printf("SSE3 :%cn",(fm>>3)&1?'Y':'N');
printf("SSSE3 :%cn",(fm>>4)&1?'Y':'N');
printf("SSE41 :%cn",(fm>>6)&1?'Y':'N');
printf("SSE42 :%cn",(fm>>7)&1?'Y':'N');
printf("AVX :%cn",(fm>>8)&1 ?'Y':'N');
printf("AVX2 :%cn", (fm>>15)&1 ?'Y':'N' );
printf("----------n");
printf("OS Enabled AVX :%cn", (fm>>9)&1 ?'Y':'N');
printf("AES :%cn", (fm>>10)&1?'Y':'N');
printf("CLMUL :%cn", (fm>>11)&1?'Y':'N');
printf("RDRAND :%cn", (fm>>13)&1?'Y':'N');
printf("F16C :%cn", (fm>>14)&1?'Y':'N');
return 0;
}
Compile the above example using any compiler and the ipp module.
$ module load intel
$ module load ipp
$ icc testipp.c -o testipp.x -lippi -lipps -lippcore
You will need the ipp module loaded to run the IPP-enabled executable.
This may be avoided by compiling the library search paths into the
executable:
$ module load intel
$ module load ipp
$ icc testipp.c -o testipp.x -Wl,-rpath=$LIBRARY_PATH -lippi -lipps -lippcore
Code samples and documentation
------------------------------
Intel provides a number of [Code Samples for
IPP](https://software.intel.com/en-us/articles/code-samples-for-intel-integrated-performance-primitives-library),
illustrating the use of IPP.
Read the full documentation on IPP [on the Intel
website](http://software.intel.com/sites/products/search/search.php?q=&x=15&y=6&product=ipp&version=7.1&docos=lin),
in particular the [IPP Reference
Manual](http://software.intel.com/sites/products/documentation/doclib/ipp_sa/71/ipp_manual/index.htm).
Intel MKL
=========
Intel Math Kernel Library
-------------------------
Intel Math Kernel Library (Intel MKL) is a library of math kernel
subroutines, extensively threaded and optimized for maximum performance.
Intel MKL provides these basic math kernels:
- BLAS (level 1, 2, and 3) and LAPACK linear algebra routines,
  offering vector, vector-matrix, and matrix-matrix operations.
- The PARDISO direct sparse solver, an iterative sparse solver,
  and supporting sparse BLAS (level 1, 2, and 3) routines for solving
  sparse systems of equations.
- ScaLAPACK distributed processing linear algebra routines for
  Linux* and Windows* operating systems, as well as the Basic Linear
  Algebra Communications Subprograms (BLACS) and the Parallel Basic
  Linear Algebra Subprograms (PBLAS).
- Fast Fourier transform (FFT) functions in one, two, or three
  dimensions with support for mixed radices (not limited to sizes that
  are powers of 2), as well as distributed versions of these functions.
- Vector Math Library (VML) routines for optimized mathematical
  operations on vectors.
- Vector Statistical Library (VSL) routines, which offer
  high-performance vectorized random number generators (RNG) for
  several probability distributions, convolution and correlation
  routines, and summary statistics functions.
- Data Fitting Library, which provides capabilities for
  spline-based approximation of functions, derivatives and integrals
  of functions, and search.
- Extended Eigensolver, a shared memory version of an eigensolver
  based on the Feast Eigenvalue Solver.
For details see the [Intel MKL Reference
Manual](http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mklman/index.htm).
Intel MKL version 11.2.3.187 is available on the cluster
$ module load imkl
The module sets up the environment variables required for linking and
running MKL-enabled applications. The most important variables are
$MKLROOT, $CPATH, $LD_LIBRARY_PATH and $MKL_EXAMPLES.
The Intel MKL library may be linked using any compiler.
With the Intel compiler, use the -mkl option to link the default threaded MKL.
### Interfaces
The Intel MKL library provides a number of interfaces. The fundamental
ones are LP64 and ILP64. The Intel MKL ILP64 libraries use the 64-bit
integer type (necessary for indexing large arrays, with more than
2^31-1 elements), whereas the LP64 libraries index arrays with the
32-bit integer type.

  Interface   Integer type
  ----------- -----------------------------------------------
  LP64        32-bit, int, integer(kind=4), MPI_INT
  ILP64       64-bit, long int, integer(kind=8), MPI_INT64
### Linking
Linking Intel MKL libraries may be complex. The Intel [MKL link line
advisor](http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor)
helps. See also the [examples](intel-mkl.html#examples) below.
You will need the imkl module loaded to run the MKL-enabled executable.
This may be avoided by compiling the library search paths into the
executable. Include rpath on the compile line:
$ icc .... -Wl,-rpath=$LIBRARY_PATH ...
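Putting the interface choice and the link line together, an explicit ILP64
link might look like the following (a sketch only; generate the exact line
with the MKL link line advisor for your compiler and threading choice):
$ icc -DMKL_ILP64 myprog.c -o myprog.x -Wl,-rpath=$LIBRARY_PATH -lmkl_intel_ilp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm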
### Threading
The advantage of using the Intel MKL library is that it brings threaded
parallelization to applications that are otherwise not parallel.
For this to work, the application must link the threaded MKL library
(the default). The number and behaviour of MKL threads may be controlled via
the OpenMP environment variables, such as OMP_NUM_THREADS and
KMP_AFFINITY. MKL_NUM_THREADS takes precedence over OMP_NUM_THREADS.
$ export OMP_NUM_THREADS=24
$ export KMP_AFFINITY=granularity=fine,compact,1,0
The application will run with 24 threads with affinity optimized for
fine grain parallelization.
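To control the MKL thread count independently of other OpenMP code, the
MKL-specific variable mentioned above may be set explicitly, for example:
$ export MKL_NUM_THREADS=24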
Examples
------------
A number of examples demonstrating the use of the Intel MKL library and
its linking are available on the clusters, in the $MKL_EXAMPLES directory.
In the examples below, we demonstrate linking Intel MKL to an Intel and a
GNU compiled program for multi-threaded matrix multiplication.
### Working with examples
$ module load intel
$ module load imkl
$ cp -a $MKL_EXAMPLES/cblas /tmp/
$ cd /tmp/cblas
$ make sointel64 function=cblas_dgemm
In this example, we compile, link and run the cblas_dgemm example,
demonstrating the use of the MKL example suite installed on the clusters.
### Example: MKL and Intel compiler
$ module load intel
$ module load imkl
$ cp -a $MKL_EXAMPLES/cblas /tmp/
$ cd /tmp/cblas
$
$ icc -w source/cblas_dgemmx.c source/common_func.c -mkl -o cblas_dgemmx.x
$ ./cblas_dgemmx.x data/cblas_dgemmx.d
In this example, we compile, link and run the cblas_dgemm example,
demonstrating use of MKL with icc -mkl option. Using the -mkl option is
equivalent to:
$ icc -w source/cblas_dgemmx.c source/common_func.c -o cblas_dgemmx.x
-I$MKL_INC_DIR -L$MKL_LIB_DIR -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5
In this example, we compile and link the cblas_dgemm example, using
LP64 interface to threaded MKL and Intel OMP threads implementation.
### Example: Intel MKL and GNU compiler
$ module load GCC
$ module load imkl
$ cp -a $MKL_EXAMPLES/cblas /tmp/
$ cd /tmp/cblas
$ gcc -w source/cblas_dgemmx.c source/common_func.c -o cblas_dgemmx.x
-lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lm
$ ./cblas_dgemmx.x data/cblas_dgemmx.d
In this example, we compile, link and run the cblas_dgemm example, using
the LP64 interface to threaded MKL and the GNU OpenMP threads implementation.
MKL and MIC accelerators
------------------------
The Intel MKL is capable of automatically offloading the computations to
the MIC accelerator. See the section [Intel Xeon
Phi](../intel-xeon-phi.html) for details.
LAPACKE C Interface
-------------------
MKL includes the LAPACKE C interface to LAPACK. For some reason, although
Intel is the author of LAPACKE, the LAPACKE header files are not present
in MKL. For this reason, we have prepared a
LAPACKE module, which includes Intel's LAPACKE
headers from the official LAPACK, which you can use to compile code using
the LAPACKE interface against MKL.
Further reading
---------------
Read more on [Intel
website](http://software.intel.com/en-us/intel-mkl), in
particular the [MKL users
guide](https://software.intel.com/en-us/intel-mkl/documentation/linux).
Intel Parallel Studio
=====================
The Salomon cluster provides the following elements of the Intel Parallel
Studio XE:
- Intel Compilers
- Intel Debugger
- Intel MKL Library
- Intel Integrated Performance Primitives Library
- Intel Threading Building Blocks Library
- Intel Trace Analyzer and Collector
- Intel Advisor
- Intel Inspector
Intel compilers
---------------
The Intel compilers version 13.1.3 are available via the module
iccifort/2013.5.192-GCC-4.8.3. The compilers include the icc C and C++
compiler and the ifort Fortran 77/90/95 compiler.
$ module load intel
$ icc -v
$ ifort -v
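For example, an illustrative optimized build (the source file names are
placeholders, not part of the cluster software):
$ icc -O3 -xhost myprog.c -o myprog.x
$ ifort -O3 -xhost myprog.f90 -o myprog.x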
Read more at the [Intel Compilers](intel-compilers.html)
page.
Intel debugger
--------------
IDB is no longer available since Parallel Studio 2015.
The Intel debugger version 13.0 is available via the intel module. The
debugger works for applications compiled with the C and C++ compiler and
the ifort Fortran 77/90/95 compiler. The debugger provides a Java GUI
environment. Use an [X
display](../../../get-started-with-it4innovations/accessing-the-clusters/graphical-user-interface/x-window-system/x-window-and-vnc.html)
for running the GUI.
$ module load intel
$ idb
Read more at the [Intel Debugger](intel-debugger.html)
page.
Intel Math Kernel Library
-------------------------
Intel Math Kernel Library (Intel MKL) is a library of math kernel
subroutines, extensively threaded and optimized for maximum performance.
Intel MKL unites and provides these basic components: BLAS, LAPACK,
ScaLapack, PARDISO, FFT, VML, VSL, Data fitting, Feast Eigensolver and
many more.
$ module load imkl
Read more at the [Intel MKL](intel-mkl.html) page.
Intel Integrated Performance Primitives
---------------------------------------
Intel Integrated Performance Primitives, version 7.1.1, compiled for AVX
is available, via module ipp. The IPP is a library of highly optimized
algorithmic building blocks for media and data applications. This
includes signal, image and frame processing algorithms, such as FFT,
FIR, Convolution, Optical Flow, Hough transform, Sum, MinMax and many
more.
$ module load ipp
Read more at the [Intel
IPP](intel-integrated-performance-primitives.html) page.
Intel Threading Building Blocks
-------------------------------
Intel Threading Building Blocks (Intel TBB) is a library that supports
scalable parallel programming using standard ISO C++ code. It does not
require special languages or compilers. It is designed to promote
scalable data parallel programming. Additionally, it fully supports
nested parallelism, so you can build larger parallel components from
smaller parallel components. To use the library, you specify tasks, not
threads, and let the library map tasks onto threads in an efficient
manner.
$ module load tbb
Read more at the [Intel TBB](intel-tbb.html) page.
Intel TBB
=========
Intel Threading Building Blocks
-------------------------------
Intel Threading Building Blocks (Intel TBB) is a library that supports
scalable parallel programming using standard ISO C++ code. It does not
require special languages or compilers. To use the library, you specify
tasks, not threads, and let the library map tasks onto threads in an
efficient manner. The tasks are executed by a runtime scheduler and may
be offloaded to [MIC
accelerator](../intel-xeon-phi.html).
Intel TBB version 4.3.5.187 is available on the cluster.
$ module load tbb
The module sets up the environment variables required for linking and
running TBB-enabled applications.
Link the TBB library using -ltbb.
Examples
--------
A number of examples demonstrating the use of TBB and its built-in
scheduler are available on Anselm, in the $TBB_EXAMPLES directory.
$ module load intel
$ module load tbb
$ cp -a $TBB_EXAMPLES/common $TBB_EXAMPLES/parallel_reduce /tmp/
$ cd /tmp/parallel_reduce/primes
$ icc -O2 -DNDEBUG -o primes.x main.cpp primes.cpp -ltbb
$ ./primes.x
In this example, we compile, link and run the primes example,
demonstrating the use of a parallel task-based reduce in the computation
of prime numbers.
You will need the tbb module loaded to run the TBB-enabled executable.
This may be avoided by compiling the library search paths into the
executable.
$ icc -O2 -o primes.x main.cpp primes.cpp -Wl,-rpath=$LIBRARY_PATH -ltbb
Further reading
---------------
Read more on Intel website,
<http://software.intel.com/sites/products/documentation/doclib/tbb_sa/help/index.htm>
Intel Trace Analyzer and Collector
==================================
Intel Trace Analyzer and Collector (ITAC) is a tool to collect and
graphically analyze the behaviour of MPI applications. It helps you to
analyze communication patterns of your application, identify hotspots,
perform correctness checking (identify deadlocks, data corruption, etc.),
and simulate how your application would run on a different interconnect.
ITAC is an offline analysis tool - first you run your application to
collect a trace file, then you can open the trace in a GUI analyzer to
view it.
Installed version
-----------------
Version 9.1.2.024 is currently available on Salomon as the module
itac/9.1.2.024.
Collecting traces
-----------------
ITAC can collect traces from applications that are using Intel MPI. To
generate a trace, simply add the -trace option to your mpirun command:
$ module load itac/9.1.2.024
$ mpirun -trace myapp
The trace will be saved in file myapp.stf in the current directory.
Viewing traces
--------------
To view and analyze the trace, open the ITAC GUI in a [graphical
environment](../../../get-started-with-it4innovations/accessing-the-clusters/graphical-user-interface/x-window-system/x-window-and-vnc.html):
$ module load itac/9.1.2.024
$ traceanalyzer
The GUI will launch and you can open the produced *.stf file.
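The trace file can also be passed directly on the command line (an
assumption based on common ITAC usage; the myapp.stf name comes from the
example above):
$ traceanalyzer myapp.stf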
![](Snmekobrazovky20151204v15.35.12.png)
Please refer to the Intel documentation about the usage of the GUI tool.
References
----------
1. [Getting Started with Intel® Trace Analyzer and
   Collector](https://software.intel.com/en-us/get-started-with-itac-for-linux)
2. Intel® Trace Analyzer and Collector - Documentation
Intel Xeon Phi
==============
A guide to Intel Xeon Phi usage
The Intel Xeon Phi accelerator can be programmed in several modes. The
default mode on the cluster is the offload mode, but all modes described
in this document are supported.
Intel Utilities for Xeon Phi
----------------------------
To get access to a compute node with an Intel Xeon Phi accelerator, use a
PBS interactive session:
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
To set up the environment, the module "intel" has to be loaded. Without
specifying the version, the default version is loaded (at the time of
writing, it is 2015b).
$ module load intel
Information about the hardware can be obtained by running
the micinfo program on the host.
$ /usr/bin/micinfo
The output of the "micinfo" utility executed on one of the cluster node
is as follows. (note: to get PCIe related details the command has to be
run with root privileges)
MicInfo Utility Log
Created Mon Aug 17 13:55:59 2015
System Info
HOST OS : Linux
OS Version : 2.6.32-504.16.2.el6.x86_64
Driver Version : 3.4.1-1
MPSS Version : 3.4.1
Host Physical Memory : 131930 MB
Device No: 0, Device Name: mic0
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.4.1
Device Serial Number : ADKC44601414
Board
Vendor ID : 0x8086
Device ID : 0x225c
Subsystem ID : 0x7d95
Coprocessor Stepping ID : 2
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : C0
Board SKU : C0PRQ-7120 P/A/X/D
ECC Mode : Enabled
SMC HW Revision : Product 300W Passive CS
Cores
Total No of Active Cores : 61
Voltage : 1007000 uV
Frequency : 1238095 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 60 C
GDDR
GDDR Vendor : Samsung
GDDR Version : 0x6
GDDR Density : 4096 Mb
GDDR Size : 15872 MB
GDDR Technology : GDDR5
GDDR Speed : 5.500000 GT/s
GDDR Frequency : 2750000 kHz
GDDR Voltage : 1501000 uV
Device No: 1, Device Name: mic1
Version
Flash Version : 2.1.02.0390
SMC Firmware Version : 1.16.5078
SMC Boot Loader Version : 1.8.4326
uOS Version : 2.6.38.8+mpss3.4.1
Device Serial Number : ADKC44500454
Board
Vendor ID : 0x8086
Device ID : 0x225c
Subsystem ID : 0x7d95
Coprocessor Stepping ID : 2
PCIe Width : x16
PCIe Speed : 5 GT/s
PCIe Max payload size : 256 bytes
PCIe Max read req size : 512 bytes
Coprocessor Model : 0x01
Coprocessor Model Ext : 0x00
Coprocessor Type : 0x00
Coprocessor Family : 0x0b
Coprocessor Family Ext : 0x00
Coprocessor Stepping : C0
Board SKU : C0PRQ-7120 P/A/X/D
ECC Mode : Enabled
SMC HW Revision : Product 300W Passive CS
Cores
Total No of Active Cores : 61
Voltage : 998000 uV
Frequency : 1238095 kHz
Thermal
Fan Speed Control : N/A
Fan RPM : N/A
Fan PWM : N/A
Die Temp : 59 C
GDDR
GDDR Vendor : Samsung
GDDR Version : 0x6
GDDR Density : 4096 Mb
GDDR Size : 15872 MB
GDDR Technology : GDDR5
GDDR Speed : 5.500000 GT/s
GDDR Frequency : 2750000 kHz
GDDR Voltage : 1501000 uV
Offload Mode
------------
To compile code for the Intel Xeon Phi, the MPSS stack has to be installed
on the machine where the compilation is executed. Currently the MPSS stack
is only installed on compute nodes equipped with accelerators.
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
$ module load intel
For debugging purposes it is also recommended to set the environment
variable "OFFLOAD_REPORT". The value can be set from 0 to 3, where a
higher number means more debugging information.
export OFFLOAD_REPORT=3
A very basic example of code that employs the offload programming
technique is shown in the next listing. Please note that this code is
sequential and utilizes only a single core of the accelerator.
$ vim source-offload.cpp
#include <iostream>

int main(int argc, char* argv[])
{
    const int niter = 100000;
    double result = 0;

    #pragma offload target(mic)
    for (int i = 0; i < niter; ++i) {
        const double t = (i + 0.5) / niter;
        result += 4.0 / (t * t + 1.0);
    }
    result /= niter;
    std::cout << "Pi ~ " << result << '\n';
}
To compile the code using the Intel compiler, run:
$ icc source-offload.cpp -o bin-offload
To execute the code, run the following command on the host:
./bin-offload
### Parallelization in Offload Mode Using OpenMP
One way of parallelizing code for the Xeon Phi is using OpenMP
directives. The following example shows code for parallel vector
addition.
$ vim ./vect-add.c
#include <stdio.h>
#include <stdlib.h>

typedef int T;

#define SIZE 1000

#pragma offload_attribute(push, target(mic))
T in1[SIZE];
T in2[SIZE];
T res[SIZE];
#pragma offload_attribute(pop)

// MIC function to add two vectors
__attribute__((target(mic))) void add_mic(T *a, T *b, T *c, int size) {
    int i = 0;
    #pragma omp parallel for
    for (i = 0; i < size; i++)
        c[i] = a[i] + b[i];
}

// CPU function to add two vectors
void add_cpu(T *a, T *b, T *c, int size) {
    int i;
    for (i = 0; i < size; i++)
        c[i] = a[i] + b[i];
}

// CPU function to generate a vector of random numbers
void random_T(T *a, int size) {
    int i;
    for (i = 0; i < size; i++)
        a[i] = rand() % 10000; // random number between 0 and 9999
}

// CPU function to compare two vectors
int compare(T *a, T *b, T size) {
    int pass = 0;
    int i;
    for (i = 0; i < size; i++) {
        if (a[i] != b[i]) {
            printf("Value mismatch at location %d, values %d and %d\n", i, a[i], b[i]);
            pass = 1;
        }
    }
    if (pass == 0) printf("Test passed\n"); else printf("Test Failed\n");
    return pass;
}

int main()
{
    int i;
    random_T(in1, SIZE);
    random_T(in2, SIZE);

    #pragma offload target(mic) in(in1,in2) inout(res)
    {
        // Parallel loop from main function
        #pragma omp parallel for
        for (i = 0; i < SIZE; i++)
            res[i] = in1[i] + in2[i];

        // or the parallel loop is called inside the function
        add_mic(in1, in2, res, SIZE);
    }

    // Check the results with the CPU implementation
    T res_cpu[SIZE];
    add_cpu(in1, in2, res_cpu, SIZE);
    compare(res, res_cpu, SIZE);
}
During the compilation, the Intel compiler shows which loops have been
vectorized in both the host and the accelerator code. This can be enabled
with the compiler option "-vec-report2". To compile and execute the code run:
$ icc vect-add.c -openmp_report2 -vec-report2 -o vect-add
$ ./vect-add
Some interesting compiler flags useful not only for code debugging are:
Debugging
openmp_report[0|1|2] - controls the OpenMP parallelizer diagnostic
level
vec-report[0|1|2] - controls the compiler-based vectorization diagnostic
level
Performance optimization
xhost - FOR HOST ONLY - to generate AVX (Advanced Vector Extensions)
instructions.
Automatic Offload using Intel MKL Library
-----------------------------------------
Intel MKL includes an Automatic Offload (AO) feature that enables
computationally intensive MKL functions called in user code to benefit
from attached Intel Xeon Phi coprocessors automatically and
transparently.
The behaviour of the automatic offload mode is controlled by functions
called within the program or by environment variables. A complete list of
controls is listed
[here](http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/GUID-3DC4FC7D-A1E4-423D-9C0C-06AB265FFA86.htm).
The Automatic Offload may be enabled by either an MKL function call
within the code:
mkl_mic_enable();
or by setting an environment variable:
$ export MKL_MIC_ENABLE=1
To get more information about automatic offload, please refer to the
"[Using Intel® MKL Automatic Offload on Intel® Xeon Phi™
Coprocessors](http://software.intel.com/sites/default/files/11MIC42_How_to_Use_MKL_Automatic_Offload_0.pdf)"
white paper or the [Intel MKL
documentation](https://software.intel.com/en-us/articles/intel-math-kernel-library-documentation).
### Automatic offload example #1
The following example shows how to automatically offload an SGEMM (single
precision general matrix multiply) function to the MIC coprocessor.
At first, get an interactive PBS session on a node with a MIC accelerator
and load the "intel" module, which automatically loads the "mkl" module
as well.
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
$ module load intel
The code can be copied to a file and compiled without any necessary
modification.
$ vim sgemm-ao-short.c
```
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <stdint.h>

#include "mkl.h"

int main(int argc, char **argv)
{
    float *A, *B, *C;       /* Matrices */
    MKL_INT N = 2560;       /* Matrix dimensions */
    MKL_INT LD = N;         /* Leading dimension */
    int matrix_bytes;       /* Matrix size in bytes */
    int matrix_elements;    /* Matrix size in elements */

    float alpha = 1.0, beta = 1.0;     /* Scaling factors */
    char transa = 'N', transb = 'N';   /* Transposition options */

    int i, j;               /* Counters */

    matrix_elements = N * N;
    matrix_bytes = sizeof(float) * matrix_elements;

    /* Allocate the matrices */
    A = malloc(matrix_bytes); B = malloc(matrix_bytes); C = malloc(matrix_bytes);

    /* Initialize the matrices */
    for (i = 0; i < matrix_elements; i++) {
        A[i] = 1.0; B[i] = 2.0; C[i] = 0.0;
    }

    printf("Computing SGEMM on the host\n");
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);

    printf("Enabling Automatic Offload\n");
    /* Alternatively, set environment variable MKL_MIC_ENABLE=1 */
    mkl_mic_enable();

    int ndevices = mkl_mic_get_device_count(); /* Number of MIC devices */
    printf("Automatic Offload enabled: %d MIC devices present\n", ndevices);

    printf("Computing SGEMM with automatic workdivision\n");
    sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);

    /* Free the matrix memory */
    free(A); free(B); free(C);

    printf("Done\n");

    return 0;
}
```
Please note: This example is a simplified version of an example from MKL.
The expanded version can be found here:
**$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c**
To compile the code using the Intel compiler use:
$ icc -mkl sgemm-ao-short.c -o sgemm
For debugging purposes enable the offload report to see more information
about automatic offloading.
$ export OFFLOAD_REPORT=2
The output of the code should look similar to the following listing, where
lines starting with [MKL] are generated by offload reporting:
[user@r31u03n799 ~]$ ./sgemm
Computing SGEMM on the host
Enabling Automatic Offload
Automatic Offload enabled: 2 MIC devices present
Computing SGEMM with automatic workdivision
[MKL] [MIC --] [AO Function] SGEMM
[MKL] [MIC --] [AO SGEMM Workdivision] 0.44 0.28 0.28
[MKL] [MIC 00] [AO SGEMM CPU Time] 0.252427 seconds
[MKL] [MIC 00] [AO SGEMM MIC Time] 0.091001 seconds
[MKL] [MIC 00] [AO SGEMM CPU->MIC Data] 34078720 bytes
[MKL] [MIC 00] [AO SGEMM MIC->CPU Data] 7864320 bytes
[MKL] [MIC 01] [AO SGEMM CPU Time] 0.252427 seconds
[MKL] [MIC 01] [AO SGEMM MIC Time] 0.094758 seconds
[MKL] [MIC 01] [AO SGEMM CPU->MIC Data] 34078720 bytes
[MKL] [MIC 01] [AO SGEMM MIC->CPU Data] 7864320 bytes
Done
### Automatic offload example #2
In this example, we will demonstrate automatic offload control via the
environment variable MKL_MIC_ENABLE. The function DGEMM will be
offloaded.
At first, get an interactive PBS session on a node with a MIC accelerator.
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
Once in, we enable the offload and run the Octave software. In Octave,
we generate two large random matrices and multiply them together.
$ export MKL_MIC_ENABLE=1
$ export OFFLOAD_REPORT=2
$ module load Octave/3.8.2-intel-2015b
$ octave -q
octave:1> A=rand(10000);
octave:2> B=rand(10000);
octave:3> C=A*B;
[MKL] [MIC --] [AO Function] DGEMM
[MKL] [MIC --] [AO DGEMM Workdivision] 0.14 0.43 0.43
[MKL] [MIC 00] [AO DGEMM CPU Time] 3.814714 seconds
[MKL] [MIC 00] [AO DGEMM MIC Time] 2.781595 seconds
[MKL] [MIC 00] [AO DGEMM CPU->MIC Data] 1145600000 bytes
[MKL] [MIC 00] [AO DGEMM MIC->CPU Data] 1382400000 bytes
[MKL] [MIC 01] [AO DGEMM CPU Time] 3.814714 seconds
[MKL] [MIC 01] [AO DGEMM MIC Time] 2.843016 seconds
[MKL] [MIC 01] [AO DGEMM CPU->MIC Data] 1145600000 bytes
[MKL] [MIC 01] [AO DGEMM MIC->CPU Data] 1382400000 bytes
octave:4> exit
In the example above we observe that the DGEMM function workload was
split over the CPU, MIC 0 and MIC 1, in the ratio 0.14 : 0.43 : 0.43. The
matrix multiplication was done on the CPU, accelerated by the two Xeon Phi
accelerators.
Native Mode
-----------
In the native mode a program is executed directly on the Intel Xeon Phi
without involvement of the host machine. Similarly to the offload mode,
the code is compiled on the host computer with the Intel compilers.
To compile the code, the user has to be connected to a compute node with
a MIC and load the Intel compilers module. To get an interactive session
on a compute node with an Intel Xeon Phi and load the module, use the
following commands:
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
$ module load intel
Please note that particular version of the Intel module is specified.
This information is used later to specify the correct library paths.
To produce a binary compatible with the Intel Xeon Phi architecture, the
user has to specify the "-mmic" compiler flag. Two compilation examples
are shown below. The first example shows how to compile the OpenMP
parallel code "vect-add.c" for the host only:
$ icc -xhost -no-offload -fopenmp vect-add.c -o vect-add-host
To run this code on host, use:
$ ./vect-add-host
The second example shows how to compile the same code for Intel Xeon
Phi:
$ icc -mmic -fopenmp vect-add.c -o vect-add-mic
### Execution of the Program in Native Mode on Intel Xeon Phi
User access to the Intel Xeon Phi is through SSH. Since user
home directories are mounted on the accelerator using NFS, users do not
have to copy binary files or libraries between the host and the accelerator.
Get the path of the MIC-enabled libraries for the currently used Intel
compiler (here icc/2015.3.187-GNU-5.1.0-2.25 was used):
$ echo $MIC_LD_LIBRARY_PATH
/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic
To connect to the accelerator run:
$ ssh mic0
If the code is sequential, it can be executed directly:
mic0 $ ~/path_to_binary/vect-add-seq-mic
If the code is parallelized using OpenMP, a set of additional libraries
is required for execution. To locate these libraries, a new path has to
be added to the LD_LIBRARY_PATH environment variable prior to the
execution:
mic0 $ export LD_LIBRARY_PATH=/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic:$LD_LIBRARY_PATH
Please note that the path exported in the previous example contains the
path to a specific compiler (here the version is 2015.3.187-GNU-5.1.0-2.25).
This version number has to match the version number of the Intel
compiler module that was used to compile the code on the host computer.
For your information, the list of libraries and their location required
for execution of an OpenMP parallel code on the Intel Xeon Phi is:
/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic
libiomp5.so
libimf.so
libsvml.so
libirng.so
libintlc.so.5
Finally, to run the compiled code use:
$ ~/path_to_binary/vect-add-mic
OpenCL
------
OpenCL (Open Computing Language) is an open standard for
general-purpose parallel programming for a diverse mix of multi-core CPUs,
GPU coprocessors, and other parallel processors. OpenCL provides a
flexible execution model and uniform programming environment for
software developers to write portable code for systems running on both
the CPU and graphics processors or accelerators like the Intel® Xeon
Phi.
On Anselm, OpenCL is installed only on compute nodes with a MIC
accelerator, therefore OpenCL code can be compiled only on these nodes.
module load opencl-sdk opencl-rt
>Always load "opencl-sdk" (providing devel files like headers) and
"opencl-rt" (providing dynamic library libOpenCL.so) modules to compile
and link OpenCL code. Load "opencl-rt" for running your compiled code.
>There are two basic examples of OpenCL code in the following
directory:
/apps/intel/opencl-examples/
>First example "CapsBasic" detects OpenCL compatible hardware, here
CPU and MIC, and prints basic information about the capabilities of
it.
/apps/intel/opencl-examples/CapsBasic/capsbasic
To compile and run the example, copy it to your home directory, get
a PBS interactive session on one of the nodes with a MIC and run make for
compilation. The Makefiles are very basic and show how the OpenCL code
can be compiled on Anselm.
$ cp /apps/intel/opencl-examples/CapsBasic/* .
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
$ make
The compilation command for this example is:
$ g++ capsbasic.cpp -lOpenCL -o capsbasic -I/apps/intel/opencl/include/
After executing the compiled binary file, the following output should
be displayed.
./capsbasic
Number of available platforms: 1
Platform names:
[0] Intel(R) OpenCL [Selected]
Number of devices available for each type:
CL_DEVICE_TYPE_CPU: 1
CL_DEVICE_TYPE_GPU: 0
CL_DEVICE_TYPE_ACCELERATOR: 1
*** Detailed information for each device ***
CL_DEVICE_TYPE_CPU[0]
CL_DEVICE_NAME: Intel(R) Xeon(R) CPU E5-2470 0 @ 2.30GHz
CL_DEVICE_AVAILABLE: 1
...
CL_DEVICE_TYPE_ACCELERATOR[0]
CL_DEVICE_NAME: Intel(R) Many Integrated Core Acceleration Card
CL_DEVICE_AVAILABLE: 1
...
More information about this example can be found on the Intel website:
<http://software.intel.com/en-us/vcsource/samples/caps-basic/>
The second example that can be found in the
"/apps/intel/opencl-examples" directory is General Matrix
Multiply. You can follow the same procedure to download the example
to your directory and compile it.
$ cp -r /apps/intel/opencl-examples/* .
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
$ cd GEMM
$ make
The compilation command for this example is:
$ g++ cmdoptions.cpp gemm.cpp ../common/basic.cpp ../common/cmdparser.cpp ../common/oclobject.cpp -I../common -lOpenCL -o gemm -I/apps/intel/opencl/include/
To see the performance of the Intel Xeon Phi performing the DGEMM, run
the example as follows:
./gemm -d 1
Platforms (1):
[0] Intel(R) OpenCL [Selected]
Devices (2):
[0] Intel(R) Xeon(R) CPU E5-2470 0 @ 2.30GHz
[1] Intel(R) Many Integrated Core Acceleration Card [Selected]
Build program options: "-DT=float -DTILE_SIZE_M=1 -DTILE_GROUP_M=16 -DTILE_SIZE_N=128 -DTILE_GROUP_N=1 -DTILE_SIZE_K=8"
Running gemm_nn kernel with matrix size: 3968x3968
Memory row stride to ensure necessary alignment: 15872 bytes
Size of memory region for one matrix: 62980096 bytes
Using alpha = 0.57599 and beta = 0.872412
...
Host time: 0.292953 sec.
Host perf: 426.635 GFLOPS
Host time: 0.293334 sec.
Host perf: 426.081 GFLOPS
...
Please note: the GNU compiler is used to compile the OpenCL codes for the
Intel MIC. You do not need to load the Intel compiler module.
MPI
---
### Environment setup and compilation
To achieve the best MPI performance, always use the following setup for
Intel MPI on Xeon Phi accelerated nodes:
$ export I_MPI_FABRICS=shm:dapl
$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0,ofa-v2-mcm-1
This ensures that MPI inside a node will use SHMEM communication, between
the HOST and Phi the IB SCIF will be used, and between different nodes or
Phis on different nodes a CCL-Direct proxy will be used.
Please note: Other FABRICS like tcp or ofa may be used (even combined with
shm), but there is a severe loss of performance (by an order of magnitude).
Usage of a single DAPL PROVIDER (e.g.
I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u) will cause failure of
Host<->Phi and/or Phi<->Phi communication.
Usage of the I_MPI_DAPL_PROVIDER_LIST on a non-accelerated node will
cause failure of any MPI communication, since those nodes do not have a
SCIF device and there is no CCL-Direct proxy running.
Again, MPI code for the Intel Xeon Phi has to be compiled on a compute
node with an accelerator and the MPSS software stack installed. To get to
a compute node with an accelerator, use:
$ qsub -I -q qprod -l select=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
The only supported implementation of the MPI standard for the Intel Xeon
Phi is Intel MPI. To set up a fully functional development environment, a
combination of the Intel compiler and Intel MPI has to be used. On the
host, load the following modules before compilation:
$ module load intel impi
To compile an MPI code for host use:
$ mpiicc -xhost -o mpi-test mpi-test.c
To compile the same code for Intel Xeon Phi architecture use:
$ mpiicc -mmic -o mpi-test-mic mpi-test.c
Or, if you are using Fortran:
$ mpiifort -mmic -o mpi-test-mic mpi-test.f90
A basic MPI version of the "hello-world" example in the C language, that
can be executed on both the host and the Xeon Phi, is shown below (it can
be directly copied and pasted to a .c file):
```
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    int len;
    char node[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                 /* starts MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* get current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* get number of processes */

    MPI_Get_processor_name(node, &len);

    printf("Hello world from process %d of %d on host %s\n", rank, size, node);

    MPI_Finalize();
    return 0;
}
```
### MPI programming models
Intel MPI for the Xeon Phi coprocessors offers different MPI
programming models:
**Host-only model** - all MPI ranks reside on the host. The coprocessors
can be used by using offload pragmas. (Using MPI calls inside offloaded
code is not supported.)
**Coprocessor-only model** - all MPI ranks reside only on the
coprocessors.
**Symmetric model** - the MPI ranks reside on both the host and the
coprocessor. This is the most general MPI case.
### Host-only model
In this case all environment variables are set by the modules,
so to execute the compiled MPI program on a single node, use:
$ mpirun -np 4 ./mpi-test
The output should be similar to:
Hello world from process 1 of 4 on host r38u31n1000
Hello world from process 3 of 4 on host r38u31n1000
Hello world from process 2 of 4 on host r38u31n1000
Hello world from process 0 of 4 on host r38u31n1000
### Coprocessor-only model
There are two ways to execute an MPI code on a single
coprocessor: 1) launch the program using "**mpirun**" from the
coprocessor; or 2) launch the task using "**mpiexec.hydra**" from a
host.
**Execution on coprocessor**
Similarly to the execution of OpenMP programs in native mode, since the
environment modules are not supported on the MIC, the user has to set up
the paths to the Intel MPI libraries and binaries manually. A one-time
setup can be done by creating a "**.profile**" file in the user's home
directory. This file sets up the environment on the MIC automatically
once the user accesses the accelerator through SSH.
At first, get the LD_LIBRARY_PATH for the currently used Intel compiler
and Intel MPI:
$ echo $MIC_LD_LIBRARY_PATH
/apps/all/imkl/11.2.3.187-iimpi-7.3.5-GNU-5.1.0-2.25/mkl/lib/mic:/apps/all/imkl/11.2.3.187-iimpi-7.3.5-GNU-5.1.0-2.25/lib/mic:/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic/
Use it in your ~/.profile:
$ vim ~/.profile
PS1='[\u@\h \W]\$ '
export PATH=/usr/bin:/usr/sbin:/bin:/sbin
#IMPI
export PATH=/apps/all/impi/5.0.3.048-iccifort-2015.3.187-GNU-5.1.0-2.25/mic/bin/:$PATH
#OpenMP (ICC, IFORT), IMKL and IMPI
export LD_LIBRARY_PATH=/apps/all/imkl/11.2.3.187-iimpi-7.3.5-GNU-5.1.0-2.25/mkl/lib/mic:/apps/all/imkl/11.2.3.187-iimpi-7.3.5-GNU-5.1.0-2.25/lib/mic:/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic:$LD_LIBRARY_PATH
Please note:
- this file sets up the environment variables for both the MPI and OpenMP
libraries.
- this file sets up the paths to a particular version of the Intel MPI
library and a particular version of the Intel compiler. These versions
have to match the loaded modules.
To access a MIC accelerator located on the node that the user is currently
connected to, use:
$ ssh mic0
or in case you need to specify a MIC accelerator on a particular node, use:
$ ssh r38u31n1000-mic0
To run the MPI code in parallel on multiple cores of the accelerator,
use:
$ mpirun -np 4 ./mpi-test-mic
The output should be similar to:
Hello world from process 1 of 4 on host r38u31n1000-mic0
Hello world from process 2 of 4 on host r38u31n1000-mic0
Hello world from process 3 of 4 on host r38u31n1000-mic0
Hello world from process 0 of 4 on host r38u31n1000-mic0
**Execution on host**
If the MPI program is launched from the host instead of the coprocessor,
the environment variables are not set using the ".profile" file. Therefore
the user has to specify the library paths from the command line when
calling "mpiexec".
The first step is to tell mpiexec that the MPI should be executed on a
local accelerator by setting up the environment variable "I_MPI_MIC":
$ export I_MPI_MIC=1
Now the MPI program can be executed as:
$ mpiexec.hydra -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH -host mic0 -n 4 ~/mpi-test-mic
or using mpirun:
$ mpirun -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH -host mic0 -n 4 ~/mpi-test-mic
Please note:
- the full path to the binary has to be specified (here:
"**~/mpi-test-mic**")
- the LD_LIBRARY_PATH has to match the Intel MPI module used to
compile the MPI code
The output should be again similar to:
Hello world from process 1 of 4 on host r38u31n1000-mic0
Hello world from process 2 of 4 on host r38u31n1000-mic0
Hello world from process 3 of 4 on host r38u31n1000-mic0
Hello world from process 0 of 4 on host r38u31n1000-mic0
Please note that the "mpiexec.hydra" requires a file
"**>pmi_proxy**" from Intel MPI library to be copied to the
MIC filesystem. If the file is missing please contact the system
administrators. A simple test to see if the file is present is to
execute:
$ ssh mic0 ls /bin/pmi_proxy
/bin/pmi_proxy
**Execution on host - MPI processes distributed over multiple
accelerators on multiple nodes**
To get access to multiple nodes with MIC accelerators, the user has to
use PBS to allocate the resources. To start an interactive session that
allocates 2 compute nodes = 2 MIC accelerators, run the qsub command with
the following parameters:
$ qsub -I -q qprod -l select=2:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120 -A NONE-0-0
$ module load intel impi
This command connects the user through ssh to one of the nodes
immediately. To see the other nodes that have been allocated, use:
$ cat $PBS_NODEFILE
For example:
r38u31n1000.bullx
r38u32n1001.bullx
This output means that PBS allocated nodes r38u31n1000 and
r38u32n1001, which means that the user has direct access to the
"**r38u31n1000-mic0**" and "**r38u32n1001-mic0**"
accelerators.
Please note: At this point the user can connect to any of the
allocated nodes or any of the allocated MIC accelerators using ssh:
- to connect to the second node: **$ ssh r38u32n1001**
- to connect to the accelerator on the first node from the first
node: **$ ssh r38u31n1000-mic0** or **$ ssh mic0**
- to connect to the accelerator on the second node from the first
node: **$ ssh r38u32n1001-mic0**
At this point we expect that the correct modules are loaded and the
binary is compiled. For parallel execution, mpiexec.hydra is used.
Again, the first step is to tell mpiexec that the MPI can be executed on
the MIC accelerators by setting up the environment variable "I_MPI_MIC";
don't forget to have the correct FABRICS and PROVIDER defined.
$ export I_MPI_MIC=1
$ export I_MPI_FABRICS=shm:dapl
$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0,ofa-v2-mcm-1
To launch the MPI program, use:
$ mpiexec.hydra -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH
 -host r38u31n1000-mic0 -n 4 ~/mpi-test-mic
: -host r38u32n1001-mic0 -n 6 ~/mpi-test-mic
or using mpirun:
$ mpirun -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH
 -host r38u31n1000-mic0 -n 4 ~/mpi-test-mic
: -host r38u32n1001-mic0 -n 6 ~/mpi-test-mic
In this case four MPI processes are executed on the accelerator
r38u31n1000-mic0 and six processes are executed on the accelerator
r38u32n1001-mic0. The sample output (sorted after execution) is:
Hello world from process 0 of 10 on host r38u31n1000-mic0
Hello world from process 1 of 10 on host r38u31n1000-mic0
Hello world from process 2 of 10 on host r38u31n1000-mic0
Hello world from process 3 of 10 on host r38u31n1000-mic0
Hello world from process 4 of 10 on host r38u32n1001-mic0
Hello world from process 5 of 10 on host r38u32n1001-mic0
Hello world from process 6 of 10 on host r38u32n1001-mic0
Hello world from process 7 of 10 on host r38u32n1001-mic0
Hello world from process 8 of 10 on host r38u32n1001-mic0
Hello world from process 9 of 10 on host r38u32n1001-mic0
In the same way, the MPI program can be executed on multiple hosts:
$ mpirun -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH
-host r38u31n1000 -n 4 ~/mpi-test
: -host r38u32n1001 -n 6 ~/mpi-test
### Symmetric model
In the symmetric mode, MPI programs are executed on both the host
computer(s) and the MIC accelerator(s). Since the MIC has a different
architecture and requires a different binary file produced by the Intel
compiler, two different files have to be compiled before the MPI program
is executed.
In the previous section we have compiled two binary files, one for the
hosts, "**mpi-test**", and one for the MIC accelerators, "**mpi-test-mic**".
These two binaries can be executed at once using mpiexec.hydra:
$ mpirun
-genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH
-host r38u32n1001 -n 2 ~/mpi-test
: -host r38u32n1001-mic0 -n 2 ~/mpi-test-mic
In this example, the -genv parameter (line 2) sets up the required
environment variable for execution. The third line specifies the binary
that is executed on the host (here r38u32n1001) and the last line
specifies the binary that is executed on the accelerator (here
r38u32n1001-mic0).
The output of the program is:
Hello world from process 0 of 4 on host r38u32n1001
Hello world from process 1 of 4 on host r38u32n1001
Hello world from process 2 of 4 on host r38u32n1001-mic0
Hello world from process 3 of 4 on host r38u32n1001-mic0
The execution procedure can be simplified by using the mpirun
command with a machine file as a parameter. The machine file contains a
list of all nodes and accelerators that should be used to execute MPI
processes.
An example of a machine file that uses 2 hosts (r38u32n1001
and r38u33n1002) and 2 accelerators (**r38u32n1001-mic0** and
**r38u33n1002-mic0**) to run 2 MPI processes
on each of them:
$ cat hosts_file_mix
r38u32n1001:2
r38u32n1001-mic0:2
r38u33n1002:2
r38u33n1002-mic0:2
In addition, if a naming convention is set in a way that the name
of the binary for the host is **"bin_name"** and the name of the binary
for the accelerator is **"bin_name-mic"**, then by setting the
environment variable **I_MPI_MIC_POSTFIX** to **"-mic"** the user does
not have to specify the names of both binaries. In this case mpirun needs
just the name of the host binary file (i.e. "mpi-test") and uses the
suffix to get the name of the binary for the accelerator (i.e.
"mpi-test-mic").
$ export I_MPI_MIC_POSTFIX=-mic
To run the MPI code using mpirun and the machine file
"hosts_file_mix", use:
$ mpirun
-genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH
-machinefile hosts_file_mix
~/mpi-test
A possible output of the MPI "hello-world" example executed on two
hosts and two accelerators is:
Hello world from process 0 of 8 on host r38u31n1000
Hello world from process 1 of 8 on host r38u31n1000
Hello world from process 2 of 8 on host r38u31n1000-mic0
Hello world from process 3 of 8 on host r38u31n1000-mic0
Hello world from process 4 of 8 on host r38u32n1001
Hello world from process 5 of 8 on host r38u32n1001
Hello world from process 6 of 8 on host r38u32n1001-mic0
Hello world from process 7 of 8 on host r38u32n1001-mic0
Using the PBS automatically generated node-files
PBS also generates a set of node-files that can be used instead of
manually creating a new one every time. Three node-files are generated:
**Host only node-file:**
- /lscratch/${PBS_JOBID}/nodefile-cn
**MIC only node-file:**
- /lscratch/${PBS_JOBID}/nodefile-mic
**Host and MIC node-file:**
- /lscratch/${PBS_JOBID}/nodefile-mix
Please note that each host or accelerator is listed only once per file.
The user has to specify how many processes should be executed per node
using the "-n" parameter of the mpirun command, as shown in the sketch
below.
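A sketch of such a run using the mixed node-file (the process count and
the reuse of the I_MPI_MIC_POSTFIX convention from above are assumptions
for illustration):
$ export I_MPI_MIC_POSTFIX=-mic
$ mpirun -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH -machinefile /lscratch/${PBS_JOBID}/nodefile-mix -n 4 ~/mpi-test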
Optimization
------------
For more details about optimization techniques please read Intel
document [Optimization and Performance Tuning for Intel® Xeon Phi™
Coprocessors](http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization "http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization")
Java
====
Java on the cluster
Java is available on the cluster. Activate Java by loading the Java
module:
$ module load Java
Note that the Java module must be loaded on the compute nodes as well,
in order to run Java on the compute nodes.
Check the Java version and path:
$ java -version
$ which java
With the module loaded, not only the runtime environment (JRE), but also
the development environment (JDK) with the compiler is available.
$ javac -version
$ which javac
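A minimal compile-and-run check (the HelloWorld class below is an
illustrative example, not part of the cluster software):
$ cat > HelloWorld.java <<'EOF'
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello from Java on the cluster");
    }
}
EOF
$ javac HelloWorld.java
$ java HelloWorld
Hello from Java on the cluster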
Java applications may use MPI for interprocess communication, in
conjunction with OpenMPI. Read more
on <http://www.open-mpi.org/faq/?category=java>.
This functionality is currently not supported on the Anselm cluster. In
case you require the Java interface to MPI, please contact [cluster
support](https://support.it4i.cz/rt/).
Running OpenMPI
===============
OpenMPI program execution
-------------------------
OpenMPI programs may be executed only via the PBS Workload manager,
by entering an appropriate queue. On the cluster, **OpenMPI 1.8.6**
is the available OpenMPI-based MPI implementation.
### Basic usage
Use the mpiexec to run the OpenMPI code.
Example:
$ qsub -q qexp -l select=4:ncpus=24 -I
qsub: waiting for job 15210.isrv5 to start
qsub: job 15210.isrv5 ready
$ pwd
/home/username
$ module load OpenMPI
$ mpiexec -pernode ./helloworld_mpi.x
Hello world! from rank 0 of 4 on host r1i0n17
Hello world! from rank 1 of 4 on host r1i0n5
Hello world! from rank 2 of 4 on host r1i0n6
Hello world! from rank 3 of 4 on host r1i0n7
Please be aware that in this example, the option **-pernode** is
used to run only **one task per node**, which is normally unwanted
behaviour (unless you want to run hybrid code with just one MPI and 24
OpenMP tasks per node). In normal MPI programs, **omit the -pernode
option** to run up to 24 MPI tasks per node.
In this example, we allocate 4 nodes via the express queue
interactively. We set up the OpenMPI environment and interactively run
the helloworld_mpi.x program.
Note that the executable
helloworld_mpi.x must be available within the
same path on all nodes. This is automatically fulfilled on the /home and
/scratch filesystems.
You need to preload the executable if running on the local ramdisk /tmp
filesystem:
$ pwd
/tmp/pbs.15210.isrv5
$ mpiexec -pernode --preload-binary ./helloworld_mpi.x
Hello world! from rank 0 of 4 on host r1i0n17
Hello world! from rank 1 of 4 on host r1i0n5
Hello world! from rank 2 of 4 on host r1i0n6
Hello world! from rank 3 of 4 on host r1i0n7
In this example, we assume the executable
helloworld_mpi.x is present on compute node
r1i0n17 on the ramdisk. We call mpiexec with the **--preload-binary**
argument (valid for OpenMPI). The mpiexec will copy the executable from
r1i0n17 to the /tmp/pbs.15210.isrv5
directory on r1i0n5, r1i0n6 and r1i0n7 and execute the program.
MPI process mapping may be controlled by PBS parameters.
The mpiprocs and ompthreads parameters allow for selection of the number
of running MPI processes per node as well as the number of OpenMP threads
per MPI process.
### One MPI process per node
Follow this example to run one MPI process per node, 24 threads per
process.
$ qsub -q qexp -l select=4:ncpus=24:mpiprocs=1:ompthreads=24 -I
$ module load OpenMPI
$ mpiexec --bind-to-none ./helloworld_mpi.x
In this example, we demonstrate the recommended way to run an MPI
application, using 1 MPI process per node and 24 threads per process,
on 4 nodes.
### Two MPI processes per node
Follow this example to run two MPI processes per node, 12 threads per
process. Note the options to mpiexec.
$ qsub -q qexp -l select=4:ncpus=24:mpiprocs=2:ompthreads=12 -I
$ module load OpenMPI
$ mpiexec -bysocket -bind-to-socket ./helloworld_mpi.x
In this example, we demonstrate the recommended way to run an MPI
application, using 2 MPI processes per node and 12 threads per socket,
each process and its threads bound to a separate processor socket of the
node, on 4 nodes.
### 24 MPI processes per node
Follow this example to run 24 MPI processes per node, 1 thread per
process. Note the options to mpiexec.
$ qsub -q qexp -l select=4:ncpus=24:mpiprocs=24:ompthreads=1 -I
$ module load OpenMPI
$ mpiexec -bycore -bind-to-core ./helloworld_mpi.x
In this example, we demonstrate the recommended way to run an MPI
application, using 24 MPI processes per node, single threaded. Each
process is bound to a separate processor core, on 4 nodes.
### OpenMP thread affinity
Important! Bind every OpenMP thread to a core!
In the previous two examples with one or two MPI processes per node, the
operating system might still migrate OpenMP threads between cores. You
might want to avoid this by setting this environment variable for GCC
OpenMP:
$ export GOMP_CPU_AFFINITY="0-23"
or this one for Intel OpenMP:
$ export KMP_AFFINITY=granularity=fine,compact,1,0
As of OpenMP 4.0 (supported by GCC 4.9 and later and Intel 14.0 and
later) the following variables may be used for Intel or GCC:
$ export OMP_PROC_BIND=true
$ export OMP_PLACES=cores
OpenMPI Process Mapping and Binding
-----------------------------------
The mpiexec allows for precise selection of how the MPI processes will
be mapped to the computational nodes and how these processes will bind
to particular processor sockets and cores.
MPI process mapping may be specified by a hostfile or rankfile input to
the mpiexec program. Although all implementations of MPI provide means
for process mapping and binding, the following examples are valid for
OpenMPI only.
### Hostfile
Example hostfile
r1i0n17.smc.salomon.it4i.cz
r1i0n5.smc.salomon.it4i.cz
r1i0n6.smc.salomon.it4i.cz
r1i0n7.smc.salomon.it4i.cz
Use the hostfile to control process placement
$ mpiexec -hostfile hostfile ./helloworld_mpi.x
Hello world! from rank 0 of 4 on host r1i0n17
Hello world! from rank 1 of 4 on host r1i0n5
Hello world! from rank 2 of 4 on host r1i0n6
Hello world! from rank 3 of 4 on host r1i0n7
In this example, we see that the ranks have been mapped on nodes
according to the order in which the nodes appear in the hostfile.
### Rankfile
Exact control of MPI process placement and resource binding is provided
by specifying a rankfile.
Appropriate binding may boost performance of your application.
Example rankfile
rank 0=r1i0n7.smc.salomon.it4i.cz slot=1:0,1
rank 1=r1i0n6.smc.salomon.it4i.cz slot=0:*
rank 2=r1i0n5.smc.salomon.it4i.cz slot=1:1-2
rank 3=r1i0n17.smc.salomon slot=0:1,1:0-2
rank 4=r1i0n6.smc.salomon.it4i.cz slot=0:*,1:*
This rankfile assumes 5 ranks will be running on 4 nodes and provides
exact mapping and binding of the processes to the processor sockets and
cores
Explanation:
rank 0 will be bound to r1i0n7, socket1 core0 and core1
rank 1 will be bound to r1i0n6, socket0, all cores
rank 2 will be bound to r1i0n5, socket1, core1 and core2
rank 3 will be bound to r1i0n17, socket0 core1, socket1 core0, core1, core2
rank 4 will be bound to r1i0n6, all cores on both sockets
$ mpiexec -n 5 -rf rankfile --report-bindings ./helloworld_mpi.x
[r1i0n17:11180] MCW rank 3 bound to socket 0[core 1] socket 1[core 0-2]: [. B . . . . . . . . . .][B B B . . . . . . . . .] (slot list 0:1,1:0-2)
[r1i0n7:09928] MCW rank 0 bound to socket 1[core 0-1]: [. . . . . . . . . . . .][B B . . . . . . . . . .] (slot list 1:0,1)
[r1i0n6:10395] MCW rank 1 bound to socket 0[core 0-7]: [B B B B B B B B B B B B][. . . . . . . . . . . .] (slot list 0:*)
[r1i0n5:10406] MCW rank 2 bound to socket 1[core 1-2]: [. . . . . . . . . . . .][. B B . . . . . . . . .] (slot list 1:1-2)
[r1i0n6:10406] MCW rank 4 bound to socket 0[core 0-7] socket 1[core 0-7]: [B B B B B B B B B B B B][B B B B B B B B B B B B] (slot list 0:*,1:*)
Hello world! from rank 3 of 5 on host r1i0n17
Hello world! from rank 1 of 5 on host r1i0n6
Hello world! from rank 0 of 5 on host r1i0n7
Hello world! from rank 4 of 5 on host r1i0n6
Hello world! from rank 2 of 5 on host r1i0n5
In this example we run 5 MPI processes (5 ranks) on four nodes. The
rankfile defines how the processes will be mapped on the nodes, sockets
and cores. The **--report-bindings** option was used to print out the
actual process location and bindings. Note that ranks 1 and 4 run on the
same node and their core binding overlaps.
It is the user's responsibility to provide the correct number of ranks, sockets
and cores.
### Bindings verification
In all cases, binding and threading may be verified by executing for
example:
$ mpiexec -bysocket -bind-to-socket --report-bindings echo
$ mpiexec -bysocket -bind-to-socket numactl --show
$ mpiexec -bysocket -bind-to-socket echo $OMP_NUM_THREADS
Changes in OpenMPI 1.8
----------------------
Some options have changed in OpenMPI version 1.8.
version 1.6.5        version 1.8.1
-------------------- ---------------------
--bind-to-none       --bind-to none
--bind-to-core       --bind-to core
--bind-to-socket     --bind-to socket
-bysocket            --map-by socket
-bycore              --map-by core
-pernode             --map-by ppr:1:node
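For example, the two-processes-per-node run shown earlier would, with OpenMPI
1.8, look like this (a sketch based directly on the table above):

```
$ qsub -q qexp -l select=4:ncpus=24:mpiprocs=2:ompthreads=12 -I
$ module load OpenMPI
$ mpiexec --map-by socket --bind-to socket ./helloworld_mpi.x
```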
MPI
===
Setting up MPI Environment
--------------------------
The Salomon cluster provides several implementations of the MPI library:
-------------------------------------------------------------------------
MPI Library          Thread support
-------------------- ----------------------------------------------------
**Intel MPI 4.1**    Full thread support up to MPI_THREAD_MULTIPLE

**Intel MPI 5.0**    Full thread support up to MPI_THREAD_MULTIPLE

**OpenMPI 1.8.6**    Full thread support up to MPI_THREAD_MULTIPLE, MPI-3.0 support

SGI MPT 2.12
-------------------------------------------------------------------------
MPI libraries are activated via the environment modules.
Look up section modulefiles/mpi in module avail
$ module avail
------------------------------ /apps/modules/mpi -------------------------------
impi/4.1.1.036-iccifort-2013.5.192
impi/4.1.1.036-iccifort-2013.5.192-GCC-4.8.3
impi/5.0.3.048-iccifort-2015.3.187
impi/5.0.3.048-iccifort-2015.3.187-GNU-5.1.0-2.25
MPT/2.12
OpenMPI/1.8.6-GNU-5.1.0-2.25
There are default compilers associated with any particular MPI
implementation. The defaults may be changed; the MPI libraries may be
used in conjunction with any compiler.
The defaults are selected via the modules in the following way:

--------------------------------------------------------------------------
Module                               MPI               Compiler suite
------------------------------------ ----------------- -------------------
impi-5.0.3.048-iccifort-2015.3.187   Intel MPI 5.0.3

OpenMPI-1.8.6-GNU-5.1.0-2.25         OpenMPI 1.8.6
--------------------------------------------------------------------------
Examples:
$ module load gompi/2015b
In this example, we activate the latest OpenMPI with the latest GNU
compilers (OpenMPI 1.8.6 and GCC 5.1). Please see more information about
toolchains in the [Environment and
Modules](../../environment-and-modules.html) section.
To use OpenMPI with the intel compiler suite, use
$ module load iompi/2015.03
In this example, OpenMPI 1.8.6 with the Intel compilers is activated,
via the "iompi" toolchain.
Compiling MPI Programs
----------------------
After setting up your MPI environment, compile your program using one of
the mpi wrappers
$ mpicc -v
$ mpif77 -v
$ mpif90 -v
When using Intel MPI, use the following MPI wrappers:
$ mpicc
$ mpiifort
The mpif90 and mpif77 wrappers provided by Intel MPI are designed for
gcc and gfortran. You might be able to compile MPI code with them even
with the Intel compilers, but you might run into problems (for example,
native MIC compilation with -mmic does not work with mpif90).
Example program:
// helloworld_mpi.c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {

  int len;
  int rank, size;
  char node[MPI_MAX_PROCESSOR_NAME];

  // Initiate MPI
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Get hostname and print
  MPI_Get_processor_name(node, &len);
  printf("Hello world! from rank %d of %d on host %s\n", rank, size, node);

  // Finalize and exit
  MPI_Finalize();

  return 0;
}
Compile the above example with
$ mpicc helloworld_mpi.c -o helloworld_mpi.x
Running MPI Programs
--------------------
The MPI program executable must be compatible with the loaded MPI
module.
Always compile and execute using the very same MPI module.
It is strongly discouraged to mix MPI implementations. Linking an
application with one MPI implementation and running mpirun/mpiexec from
another implementation may result in unexpected errors.
The MPI program executable must be available within the same path on all
nodes. This is automatically fulfilled on the /home and /scratch
filesystems. You need to preload the executable if running on the local
scratch /lscratch filesystem.
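If the executable resides only on the node-local scratch or ramdisk, the
OpenMPI **--preload-binary** option (shown earlier) is one way to distribute
it; a sketch:

```
$ mpiexec --preload-binary ./helloworld_mpi.x
```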
### Ways to run MPI programs
The optimal way to run an MPI program depends on its memory requirements,
memory access pattern and communication pattern.
Consider these ways to run an MPI program:
1. One MPI process per node, 24 threads per process
2. Two MPI processes per node, 12 threads per process
3. 24 MPI processes per node, 1 thread per process.
**One MPI** process per node, using 24 threads, is most useful for
memory-demanding applications that make good use of processor cache
memory and are not memory bound. This is also a preferred way for
communication-intensive applications, as one process per node enjoys full
bandwidth access to the network interface.
**Two MPI** processes per node, using 12 threads each, bound to a
processor socket, is most useful for memory bandwidth bound applications
such as BLAS1 or FFT with scalable memory demand. However, note that
the two processes will share access to the network interface. The 12
threads and socket binding should ensure maximum memory access bandwidth
and minimize communication, migration and NUMA effect overheads.
Important! Bind every OpenMP thread to a core!
In the previous two cases with one or two MPI processes per node, the
operating system might still migrate OpenMP threads between cores. You
want to avoid this by setting the KMP_AFFINITY or GOMP_CPU_AFFINITY
environment variables.
**24 MPI** processes per node, using 1 thread each, bound to a processor
core, is most suitable for highly scalable applications with low
communication demand.
### Running OpenMPI
The [**OpenMPI 1.8.6**](http://www.open-mpi.org/) implementation is
available on the cluster. Read more on [how to run
OpenMPI](Running_OpenMPI.html) based MPI.
The Intel MPI may run on the [Intel Xeon
Phi](../intel-xeon-phi.html) accelerators as well. Read
more on [how to run Intel MPI on
accelerators](../intel-xeon-phi.html).
MPI4Py (MPI for Python)
=======================
OpenMPI interface to Python
Introduction
------------
MPI for Python provides bindings of the Message Passing Interface (MPI)
standard for the Python programming language, allowing any Python
program to exploit multiple processors.
This package is constructed on top of the MPI-1/2 specifications and
provides an object oriented interface which closely follows MPI-2 C++
bindings. It supports point-to-point (sends, receives) and collective
(broadcasts, scatters, gathers) communications of any picklable Python
object, as well as optimized communications of Python object exposing
the single-segment buffer interface (NumPy arrays, builtin
bytes/string/array objects).
On the cluster, MPI4Py is available in standard Python modules.
Modules
-------
MPI4Py is built for OpenMPI. Before you start with MPI4Py you need to
load the Python and OpenMPI modules. You can use a toolchain that loads
Python and OpenMPI at once.
$ module load Python/2.7.9-foss-2015g
Execution
---------
You need to import MPI into your Python program. Include the following
line in the Python script:
from mpi4py import MPI
The MPI4Py enabled Python programs [execute as any other
OpenMPI](Running_OpenMPI.html) code. The simplest way is
to run
$ mpiexec python <script>.py
For example
$ mpiexec python hello_world.py
Examples
--------
### Hello world!
from mpi4py import MPI
comm = MPI.COMM_WORLD
print "Hello! I'm rank %d from %d running in total..." % (comm.rank, comm.size)
comm.Barrier() # wait for everybody to synchronize
### Collective Communication with NumPy arrays
from __future__ import division
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
print("-"*78)
print(" Running on %d cores" % comm.size)
print("-"*78)
comm.Barrier()
# Prepare a vector of N=5 elements to be broadcasted...
N = 5
if comm.rank == 0:
A = np.arange(N, dtype=np.float64) # rank 0 has proper data
else:
A = np.empty(N, dtype=np.float64) # all other just an empty array
# Broadcast A from rank 0 to everybody
comm.Bcast( [A, MPI.DOUBLE] )
# Everybody should now have the same...
print "[%02d] %s" % (comm.rank, A)
Execute the above code as:
$ qsub -q qexp -l select=4:ncpus=24:mpiprocs=24:ompthreads=1 -I
$ module load Python/2.7.9-foss-2015g
$ mpiexec --map-by core --bind-to core python hello_world.py
In this example, we run MPI4Py enabled code on 4 nodes, 24 cores per
node (total of 96 processes), each python process is bound to a
different core.
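The same run can also be wrapped in a jobscript for non-interactive execution.
A minimal sketch (assuming hello_world.py resides in the submission
directory):

```
#!/bin/bash
#PBS -q qexp
#PBS -l select=4:ncpus=24:mpiprocs=24:ompthreads=1

# load Python built with OpenMPI (foss toolchain)
module load Python/2.7.9-foss-2015g

# run from the submission directory
cd $PBS_O_WORKDIR

# 96 python processes in total, one per core
mpiexec --map-by core --bind-to core python hello_world.py > output.out
```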
More examples and documentation can be found on [MPI for Python
webpage](https://pythonhosted.org/mpi4py/usrman/index.html).
Numerical languages
===================
Interpreted languages for numerical computations and analysis
Introduction
------------
This section contains a collection of high-level interpreted languages,
primarily intended for numerical computations.
Matlab
------
MATLAB® is a high-level language and interactive environment for
numerical computation, visualization, and programming.
$ module load MATLAB
$ matlab
Read more at the [Matlab
page](matlab.html).
Octave
------
GNU Octave is a high-level interpreted language, primarily intended for
numerical computations. The Octave language is quite similar to Matlab
so that most programs are easily portable.
$ module load Octave
$ octave
Read more at the [Octave page](octave.html).
R
-
The R is an interpreted language and environment for statistical
computing and graphics.
$ module load R
$ R
Read more at the [R page](r.html).
Matlab
======
Introduction
------------
Matlab is available in versions R2015a and R2015b. There are always two
variants of the release:
- Non commercial or so called EDU variant, which can be used for
  common research and educational purposes.
- Commercial or so called COM variant, which can be used also for
  commercial activities. The licenses for the commercial variant are much
  more expensive, so usually the commercial variant has only a subset of
  features compared to the EDU variant.
To load the latest version of Matlab load the module
$ module load MATLAB
The EDU variant is marked as the default. If you need another
version or variant, load the particular version. To obtain the list of
available versions use
$ module avail MATLAB
If you need to use the Matlab GUI to prepare your Matlab programs, you
can use Matlab directly on the login nodes. But for all computations use
Matlab on the compute nodes via PBS Pro scheduler.
If you require the Matlab GUI, please follow the general information
about [running graphical
applications](../../../get-started-with-it4innovations/accessing-the-clusters/graphical-user-interface/x-window-system/x-window-and-vnc.html).
The Matlab GUI is quite slow when using the X forwarding built into PBS (qsub
-X), so using X11 display redirection either via SSH or directly by
xauth (please see the "GUI Applications on Compute Nodes over VNC" part
[here](../../../get-started-with-it4innovations/accessing-the-clusters/graphical-user-interface/x-window-system/x-window-and-vnc.html))
is recommended.
To run Matlab with GUI, use
$ matlab
To run Matlab in text mode, without the Matlab Desktop GUI environment,
use
$ matlab -nodesktop -nosplash
Plots, images, etc. will still be available.
Running parallel Matlab using Distributed Computing Toolbox / Engine
------------------------------------------------------------------------
Distributed toolbox is available only for the EDU variant
The MPIEXEC mode available in previous versions is no longer available
in MATLAB 2015. Also, the programming interface has changed. Refer
to [Release
Notes](http://www.mathworks.com/help/distcomp/release-notes.html#buanp9e-1).
Delete the previously used file mpiLibConf.m; we have observed crashes when
using Intel MPI.
To use Distributed Computing, you first need to set up a parallel
profile. We have provided the profile for you; you can either import it
in the MATLAB command line:
>> parallel.importProfile('/apps/all/MATLAB/2015b-EDU/SalomonPBSPro.settings')
ans =
SalomonPBSPro
Or in the GUI, go to tab HOME -> Parallel -> Manage Cluster
Profiles..., click Import and navigate to:

/apps/all/MATLAB/2015b-EDU/SalomonPBSPro.settings
With the new mode, MATLAB itself launches the workers via PBS, so you
can either use interactive mode or a batch mode on one node, but the
actual parallel processing will be done in a separate job started by
MATLAB itself. Alternatively, you can use "local" mode to run parallel
code on just a single node.
### Parallel Matlab interactive session
The following example shows how to start an interactive session with support
for the Matlab GUI. For more information about GUI based applications on
the cluster see [this
page](../../../get-started-with-it4innovations/accessing-the-clusters/graphical-user-interface/x-window-system/x-window-and-vnc.html).
$ xhost +
$ qsub -I -v DISPLAY=$(uname -n):$(echo $DISPLAY | cut -d ':' -f 2) -A NONE-0-0 -q qexp -l select=1 -l walltime=00:30:00 -l feature__matlab__MATLAB=1
This qsub command example shows how to run Matlab on a single node.
The second part of the command shows how to request all necessary
licenses. In this case 1 Matlab-EDU license and 48 Distributed Computing
Engines licenses.
Once the access to compute nodes is granted by PBS, the user can load the
following modules and start Matlab:
r1i0n17$ module load MATLAB/2015a-EDU
r1i0n17$ matlab &
### Parallel Matlab batch job in Local mode
To run Matlab in batch mode, write a Matlab script, then write a bash
jobscript and execute it via the qsub command. By default, Matlab will
execute one Matlab worker instance per allocated core.
#!/bin/bash
#PBS -A PROJECT ID
#PBS -q qprod
#PBS -l select=1:ncpus=24:mpiprocs=24:ompthreads=1
# change to shared scratch directory
SCR=/scratch/work/user/$USER/$PBS_JOBID
mkdir -p $SCR ; cd $SCR || exit
# copy input file to scratch
cp $PBS_O_WORKDIR/matlabcode.m .
# load modules
module load MATLAB/2015a-EDU
# execute the calculation
matlab -nodisplay -r matlabcode > output.out
# copy output file to home
cp output.out $PBS_O_WORKDIR/.
This script may be submitted directly to the PBS workload manager via
the qsub command. The inputs and the Matlab script are in the matlabcode.m
file, outputs in the output.out file. Note the missing .m extension in the
matlab -r matlabcode call; **the .m must not be included**. Note
that the **shared /scratch must be used**. Further, it is **important to
include the quit** statement at the end of the matlabcode.m script.
Submit the jobscript using qsub
$ qsub ./jobscript
### Parallel Matlab Local mode program example
The last part of the configuration is done directly in the user Matlab
script before Distributed Computing Toolbox is started.
cluster = parcluster('local')
This script creates the scheduler object "cluster" of type "local" that
starts workers locally.
Please note: every Matlab script that needs to initialize/use the parallel
pool has to create the cluster object prior to calling the
parpool(cluster, ...) function.
The last step is to start the parallel pool with the "cluster" object and the
correct number of workers. We have 24 cores per node, so we start 24 workers.

pool = parpool(cluster,24);

... parallel code ...

delete(pool)
The complete example showing how to use Distributed Computing Toolbox in
local mode is shown here.
cluster = parcluster('local');
cluster

pool = parpool(cluster,24);

n=2000;

W = rand(n,n);
W = distributed(W);
x = (1:n)';
x = distributed(x);
spmd
    [~, name] = system('hostname')

    T = W*x;  % Calculation performed on labs, in parallel.
              % T and W are both codistributed arrays here.
end
T;
whos          % T and W are both distributed arrays here.

% shut down the parallel pool
delete(pool)
quit
You can copy and paste the example in a .m file and execute. Note that
the parpool size should correspond to **total number of cores**
available on allocated nodes.
### Parallel Matlab Batch job using PBS mode (workers spawned in a separate job)
This mode uses the PBS scheduler to launch the parallel pool. It uses the
SalomonPBSPro profile that needs to be imported to Cluster Manager, as
mentioned before. This method uses MATLAB's PBS Scheduler interface -
it spawns the workers in a separate job submitted by MATLAB using qsub.
This is an example of m-script using PBS mode:
cluster = parcluster('SalomonPBSPro');
set(cluster, 'SubmitArguments', '-A OPEN-0-0');
set(cluster, 'ResourceTemplate', '-q qprod -l select=10:ncpus=24');
set(cluster, 'NumWorkers', 240);
pool = parpool(cluster,240);
n=2000;
W = rand(n,n);
W = distributed(W);
x = (1:n)';
x = distributed(x);
spmd
[~, name] = system('hostname')
T = W*x; % Calculation performed on labs, in parallel.
% T and W are both codistributed arrays here.
end
whos % T and W are both distributed arrays here.
% shut down parallel pool
delete(pool)
Note that we first construct a cluster object using the imported
profile, then set some important options, namely : SubmitArguments,
where you need to specify accounting id, and ResourceTemplate, where you
need to specify number of nodes to run the job.
You can start this script using batch mode the same way as in Local mode
example.
### Parallel Matlab Batch with direct launch (workers spawned within the existing job)
This method is a "hack" invented by us to emulate the mpiexec
functionality found in previous MATLAB versions. We leverage the MATLAB
Generic Scheduler interface, but instead of submitting the workers to
PBS, we launch the workers directly within the running job, thus we
avoid the issues with master script and workers running in separate jobs
(issues with license not available, waiting for the worker's job to
spawn etc.)
Please note that this method is experimental.
For this method, you need to use the SalomonDirect profile; import it
[the same way as
SalomonPBSPro](matlab.html#running-parallel-matlab-using-distributed-computing-toolbox---engine).
This is an example of m-script using direct mode:
parallel.importProfile('/apps/all/MATLAB/2015b-EDU/SalomonDirect.settings')
cluster = parcluster('SalomonDirect');
set(cluster, 'NumWorkers', 48);
pool = parpool(cluster, 48);
n=2000;
W = rand(n,n);
W = distributed(W);
x = (1:n)';
x = distributed(x);
spmd
[~, name] = system('hostname')
T = W*x; % Calculation performed on labs, in parallel.
% T and W are both codistributed arrays here.
end
whos % T and W are both distributed arrays here.
% shut down parallel pool
delete(pool)
### Non-interactive Session and Licenses
If you want to run batch jobs with Matlab, be sure to request the
appropriate license features with the PBS Pro scheduler, at least
"-l __feature__matlab__MATLAB=1" for the EDU variant of Matlab. For more
information about how to check the license feature states and how to
request them with PBS Pro, please [look
here](../../../anselm-cluster-documentation/software/isv_licenses.html).
The licensing feature of PBS is currently disabled.
In case of a non-interactive session please read the [following
information](../../../anselm-cluster-documentation/software/isv_licenses.html)
on how to modify the qsub command to test for available licenses prior to
getting the resource allocation.
### Matlab Distributed Computing Engines start up time
Starting Matlab workers is an expensive process that requires certain
amount of time. For your information please see the following table:
compute nodes   number of workers   start-up time [s]
--------------- ------------------- -------------------
16              384                 831
8               192                 807
4               96                  483
2               48                  16
MATLAB on UV2000
-----------------
The UV2000 machine, available in the queue "qfat", can be used for MATLAB
computations. This is an SMP NUMA machine with a large amount of RAM, which
can be beneficial for certain types of MATLAB jobs. CPU cores are
allocated in chunks of 8 for this machine.
You can use MATLAB on UV2000 in two parallel modes:
### Threaded mode
Since this is an SMP machine, you can completely avoid using the Parallel
Toolbox and use only MATLAB's threading. MATLAB will automatically
detect the number of cores you have allocated and will set
maxNumCompThreads accordingly, and certain
operations, such as fft, eig, svd,
etc., will be automatically run in threads. The advantage of this mode is
that you don't need to modify your existing sequential codes.
### Local cluster mode
You can also use the Parallel Toolbox on UV2000. Use the [local cluster
mode](matlab.html#parallel-matlab-batch-job-in-local-mode);
the "SalomonPBSPro" profile will not work.
Octave
======
GNU Octave is a high-level interpreted language, primarily intended for
numerical computations. It provides capabilities for the numerical
solution of linear and nonlinear problems, and for performing other
numerical experiments. It also provides extensive graphics capabilities
for data visualization and manipulation. Octave is normally used through
its interactive command line interface, but it can also be used to write
non-interactive programs. The Octave language is quite similar to Matlab
so that most programs are easily portable. Read more on
<http://www.gnu.org/software/octave/>
Two versions of Octave are available on the cluster, via module

Status       Version        Module
------------ -------------- --------
**Stable**   Octave 3.8.2   Octave
$ module load Octave
Octave on the cluster is linked to the highly optimized MKL mathematical
library. This provides threaded parallelization to many Octave kernels,
notably the linear algebra subroutines. Octave runs these heavy
calculation kernels without any penalty. By default, Octave would
parallelize to 24 threads. You may control the threads by setting the
OMP_NUM_THREADS environment variable.
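For example, to restrict the MKL-backed kernels to 12 threads for the current
shell session (a sketch):

```
$ export OMP_NUM_THREADS=12
```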
To run octave interactively, log in with ssh -X parameter for X11
forwarding. Run octave:
$ octave
To run Octave in batch mode, write an Octave script, then write a bash
jobscript and execute it via the qsub command. By default, Octave will use
24 threads when running MKL kernels.
#!/bin/bash
# change to local scratch directory
mkdir -p /scratch/work/user/$USER/$PBS_JOBID
cd /scratch/work/user/$USER/$PBS_JOBID || exit
# copy input file to scratch
cp $PBS_O_WORKDIR/octcode.m .
# load octave module
module load Octave
# execute the calculation
octave -q --eval octcode > output.out
# copy output file to home
cp output.out $PBS_O_WORKDIR/.
#exit
exit
This script may be submitted directly to the PBS workload manager via
the qsub command. The inputs are in octcode.m file, outputs in
output.out file. See the single node jobscript example in the [Job
execution
section](../../resource-allocation-and-job-execution.html).
The Octave C compiler mkoctfile calls the GNU gcc 4.8.1 for compiling
native C code. This is very useful for running native C subroutines in
the Octave environment.
$ mkoctfile -v
Octave may use MPI for interprocess communication.
This functionality is currently not supported on the cluster. In
case you require the Octave interface to MPI, please contact our
[cluster support](https://support.it4i.cz/rt/).
R
=
Introduction
------------
The R is a language and environment for statistical computing and
graphics. R provides a wide variety of statistical (linear and
nonlinear modelling, classical statistical tests, time-series analysis,
classification, clustering, ...) and graphical techniques, and is highly
extensible.
One of R's strengths is the ease with which well-designed
publication-quality plots can be produced, including mathematical
symbols and formulae where needed. Great care has been taken over the
defaults for the minor design choices in graphics, but the user retains
full control.
Another convenience is the ease with which the C code or third party
libraries may be integrated within R.
Extensive support for parallel computing is available within R.
Read more on <http://www.r-project.org/>,
<http://cran.r-project.org/doc/manuals/r-release/R-lang.html>
Modules
-------
**The R version 3.1.1 is available on the cluster, along with the GUI
interface Rstudio**

Application     Version         Module
--------------- --------------- ---------------------
**R**           R 3.1.1         R/3.1.1-intel-2015b
**Rstudio**     Rstudio 0.97    Rstudio
$ module load R
Execution
---------
The R on the cluster is linked to the highly optimized MKL mathematical
library. This provides threaded parallelization to many R kernels,
notably the linear algebra subroutines. The R runs these heavy
calculation kernels without any penalty. By default, the R would
parallelize to 24 threads. You may control the threads by setting the
OMP_NUM_THREADS environment variable.
### Interactive execution
To run R interactively, using Rstudio GUI, log in with ssh -X parameter
for X11 forwarding. Run rstudio:
$ module load Rstudio
$ rstudio
### Batch execution
To run R in batch mode, write an R script, then write a bash jobscript
and execute via the qsub command. By default, R will use 24 threads when
running MKL kernels.
Example jobscript:
#!/bin/bash
# change to local scratch directory
cd /lscratch/$PBS_JOBID || exit
# copy input file to scratch
cp $PBS_O_WORKDIR/rscript.R .
# load R module
module load R
# execute the calculation
R CMD BATCH rscript.R routput.out
# copy output file to home
cp routput.out $PBS_O_WORKDIR/.
#exit
exit
This script may be submitted directly to the PBS workload manager via
the qsub command. The inputs are in rscript.R file, outputs in
routput.out file. See the single node jobscript example in the [Job
execution
section](../../resource-allocation-and-job-execution/job-submission-and-execution.html).
Parallel R
----------
Parallel execution of R may be achieved in many ways. One approach is
the implied parallelization due to linked libraries or specially enabled
functions, as [described
above](r.html#interactive-execution). In the following
sections, we focus on explicit parallelization, where parallel
constructs are directly stated within the R script.
Package parallel
--------------------
The package parallel provides support for parallel computation,
including by forking (taken from package multicore), by sockets (taken
from package snow) and random-number generation.
The package is activated this way:
$ R
> library(parallel)
More information and examples may be obtained directly by reading the
documentation available in R
> ?parallel
> library(help = "parallel")
> vignette("parallel")
Download the package
[parallel](package-parallel-vignette) vignette.
The forking is the most simple to use. The forking family of functions
provides a parallelized, drop-in replacement for the serial apply() family
of functions.
Forking via package parallel provides functionality similar to OpenMP
construct
#pragma omp parallel for
Only cores of single node can be utilized this way!
Forking example:
library(parallel)
#integrand function
f <- function(i,h) {
x <- h*(i-0.5)
return (4/(1 + x*x))
}
#initialize
size <- detectCores()
while (TRUE)
{
#read number of intervals
cat("Enter the number of intervals: (0 quits) ")
fp<-file("stdin"); n<-scan(fp,nmax=1); close(fp)
if(n<=0) break
#run the calculation
n <- max(n,size)
h <- 1.0/n
i <- seq(1,n);
pi3 <- h*sum(simplify2array(mclapply(i,f,h,mc.cores=size)));
#print results
cat(sprintf("Value of PI %16.14f, diff= %16.14f\n",pi3,pi3-pi))
}
The above example is the classic parallel example for calculating the
number π. Note the **detectCores()** and **mclapply()** functions.
Execute the example as:
$ R --slave --no-save --no-restore -f pi3p.R
Every evaluation of the integrand function runs in parallel in a different
process.
Package Rmpi
------------
The package Rmpi provides an interface (wrapper) to MPI APIs.
It also provides an interactive R slave environment. On the cluster, Rmpi
provides an interface to
[OpenMPI](../mpi-1/Running_OpenMPI.html).
Read more on Rmpi at <http://cran.r-project.org/web/packages/Rmpi/>,
reference manual is available at
<http://cran.r-project.org/web/packages/Rmpi/Rmpi.pdf>
When using package Rmpi, both the OpenMPI and R modules must be loaded
$ module load OpenMPI
$ module load R
Rmpi may be used in three basic ways. The static approach is identical
to executing any other MPI program. In addition, there is the Rslaves
dynamic MPI approach and the mpi.apply approach. In the following
sections, we will use the number π integration example to illustrate all
these concepts.
### static Rmpi
Static Rmpi programs are executed via mpiexec, as any other MPI
programs. Number of processes is static - given at the launch time.
Static Rmpi example:
library(Rmpi)
#integrand function
f <- function(i,h) {
x <- h*(i-0.5)
return (4/(1 + x*x))
}
#initialize
invisible(mpi.comm.dup(0,1))
rank <- mpi.comm.rank()
size <- mpi.comm.size()
n<-0
while (TRUE)
{
#read number of intervals
if (rank==0) {
cat("Enter the number of intervals: (0 quits) ")
fp<-file("stdin"); n<-scan(fp,nmax=1); close(fp)
}
#broadcast the intervals
n <- mpi.bcast(as.integer(n),type=1)
if(n<=0) break
#run the calculation
n <- max(n,size)
h <- 1.0/n
i <- seq(rank+1,n,size);
mypi <- h*sum(sapply(i,f,h));
pi3 <- mpi.reduce(mypi)
#print results
if (rank==0) cat(sprintf("Value of PI %16.14f, diff= %16.14f\n",pi3,pi3-pi))
}
mpi.quit()
The above is the static MPI example for calculating the number π. Note
the **library(Rmpi)** and **mpi.comm.dup()** function calls.
Execute the example as:
$ mpirun R --slave --no-save --no-restore -f pi3.R
### dynamic Rmpi
Dynamic Rmpi programs are executed by calling the R directly. OpenMPI
module must be still loaded. The R slave processes will be spawned by a
function call within the Rmpi program.
Dynamic Rmpi example:
#integrand function
f <- function(i,h) {
x <- h*(i-0.5)
return (4/(1 + x*x))
}
#the worker function
workerpi <- function()
{
#initialize
rank <- mpi.comm.rank()
size <- mpi.comm.size()
n<-0
while (TRUE)
{
#read number of intervals
if (rank==0) {
cat("Enter the number of intervals: (0 quits) ")
fp<-file("stdin"); n<-scan(fp,nmax=1); close(fp)
}
#broadcast the intervals
n <- mpi.bcast(as.integer(n),type=1)
if(n<=0) break
#run the calculation
n <- max(n,size)
h <- 1.0/n
i <- seq(rank+1,n,size);
mypi <- h*sum(sapply(i,f,h));
pi3 <- mpi.reduce(mypi)
#print results
if (rank==0) cat(sprintf("Value of PI %16.14f, diff= %16.14f\n",pi3,pi3-pi))
}
}
#main
library(Rmpi)
cat("Enter the number of slaves: ")
fp<-file("stdin"); ns<-scan(fp,nmax=1); close(fp)
mpi.spawn.Rslaves(nslaves=ns)
mpi.bcast.Robj2slave(f)
mpi.bcast.Robj2slave(workerpi)
mpi.bcast.cmd(workerpi())
workerpi()
mpi.quit()
The above example is the dynamic MPI example for calculating the number
π. Both master and slave processes carry out the calculation. Note the
**mpi.spawn.Rslaves()**, **mpi.bcast.Robj2slave()** and the
**mpi.bcast.cmd()** function calls.
Execute the example as:
$ mpirun -np 1 R --slave --no-save --no-restore -f pi3Rslaves.R
Note that this method uses MPI_Comm_spawn (Dynamic process feature of
MPI-2) to start the slave processes - the master process needs to be
launched with MPI. In general, Dynamic processes are not well supported
among MPI implementations, some issues might arise. Also, environment
variables are not propagated to spawned processes, so they will not see
paths from modules.
### mpi.apply Rmpi
mpi.apply is a specific way of executing Dynamic Rmpi programs.
mpi.apply() family of functions provide MPI parallelized, drop in
replacement for the serial apply() family of functions.
Execution is identical to other dynamic Rmpi programs.
mpi.apply Rmpi example:
#integrand function
f <- function(i,h) {
x <- h*(i-0.5)
return (4/(1 + x*x))
}
#the worker function
workerpi <- function(rank,size,n)
{
#run the calculation
n <- max(n,size)
h <- 1.0/n
i <- seq(rank,n,size);
mypi <- h*sum(sapply(i,f,h));
return(mypi)
}
#main
library(Rmpi)
cat("Enter the number of slaves: ")
fp<-file("stdin"); ns<-scan(fp,nmax=1); close(fp)
mpi.spawn.Rslaves(nslaves=ns)
mpi.bcast.Robj2slave(f)
mpi.bcast.Robj2slave(workerpi)
while (TRUE)
{
#read number of intervals
cat("Enter the number of intervals: (0 quits) ")
fp<-file("stdin"); n<-scan(fp,nmax=1); close(fp)
if(n<=0) break
#run workerpi
i=seq(1,2*ns)
pi3=sum(mpi.parSapply(i,workerpi,2*ns,n))
#print results
cat(sprintf("Value of PI %16.14f, diff= %16.14f\n",pi3,pi3-pi))
}
mpi.quit()
The above is the mpi.apply MPI example for calculating the number π.
Only the slave processes carry out the calculation. Note the
**mpi.parSapply()** function call. The package parallel
[example above](r.html#package-parallel)
may be trivially adapted (for much better performance) to this structure
using the mclapply() in place of mpi.parSapply().
Execute the example as:
$ mpirun -np 1 R --slave --no-save --no-restore -f pi3parSapply.R
Combining parallel and Rmpi
---------------------------
Currently, the two packages can not be combined for hybrid calculations.
Parallel execution
------------------
The R parallel jobs are executed via the PBS queue system exactly as any
other parallel jobs. The user must create an appropriate jobscript and
submit it via the **qsub** command.
Example jobscript for [static Rmpi](r.html#static-rmpi)
parallel R execution, running 1 process per core:
#!/bin/bash
#PBS -q qprod
#PBS -N Rjob
#PBS -l select=100:ncpus=24:mpiprocs=24:ompthreads=1
# change to scratch directory
SCRDIR=/scratch/work/user/$USER/myjob
cd $SCRDIR || exit
# copy input file to scratch
cp $PBS_O_WORKDIR/rscript.R .
# load R and openmpi module
module load R
module load OpenMPI
# execute the calculation
mpiexec -bycore -bind-to-core R --slave --no-save --no-restore -f rscript.R
# copy output file to home
cp routput.out $PBS_O_WORKDIR/.
#exit
exit
For more information about jobscripts and MPI execution refer to the
[Job
submission](../../resource-allocation-and-job-execution/job-submission-and-execution.html)
and general [MPI](../mpi-1.html) sections.
Xeon Phi Offload
----------------
By leveraging MKL, R can accelerate certain computations, most notably
linear algebra operations, on the Xeon Phi accelerator by using Automated
Offload. To use MKL Automated Offload, you need to first set this
environment variable before R execution:
$ export MKL_MIC_ENABLE=1
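For batch jobs, the variable can simply be exported in the jobscript before
the R call; a sketch based on the batch example above:

```
# enable MKL Automatic Offload to the Xeon Phi, then run the R script
export MKL_MIC_ENABLE=1
R CMD BATCH rscript.R routput.out
```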
[Read more about automatic
offload](../intel-xeon-phi.html)
Operating System
================
The operating system deployed on the Salomon cluster
The operating system on Salomon is Linux - CentOS 6.6.
The CentOS Linux distribution is a stable, predictable, manageable
and reproducible platform derived from the sources of Red Hat Enterprise
Linux (RHEL).
CESNET Data Storage
===================
Introduction
------------
Do not use shared filesystems at IT4Innovations as a backup for large
amounts of data or for long-term archiving purposes.
IT4Innovations does not provide storage capacity for data archiving.
Academic staff and students of research institutions in the Czech
Republic can use [CESNET Storage
service](https://du.cesnet.cz/).
The CESNET Storage service can be used for research purposes, mainly by
academic staff and students of research institutions in the Czech
Republic.
Users of the CESNET data storage (DU) can be organizations or
individuals who are either in a current employment
relationship (employees) or a current study relationship (students) with
a legal entity (organization) that meets the “Principles for access to
CESNET Large infrastructure (Access Policy)”.
Users may only use the CESNET data storage for data transfer and storage
associated with activities in science, research, development,
the spread of education, culture and prosperity. For details see
“Acceptable Use Policy CESNET Large Infrastructure (Acceptable Use
Policy, AUP)”.
The service is documented at
<https://du.cesnet.cz/wiki/doku.php/en/start>. For special requirements
please contact the CESNET Storage Department directly via e-mail
[du-support(at)cesnet.cz](mailto:du-support@cesnet.cz).
The procedure to obtain the CESNET access is quick and trouble-free.
(source
[https://du.cesnet.cz/](https://du.cesnet.cz/wiki/doku.php/en/start "CESNET Data Storage"))
CESNET storage access
---------------------
### Understanding Cesnet storage
It is very important to understand the Cesnet storage before uploading
data. Please read
<https://du.cesnet.cz/en/navody/home-migrace-plzen/start> first.
Once registered for CESNET Storage, you may [access the
storage](https://du.cesnet.cz/en/navody/faq/start) in a
number of ways. We recommend the SSHFS and RSYNC methods.
### SSHFS Access
SSHFS: The storage will be mounted like a local hard drive
The SSHFS provides a very convenient way to access the CESNET Storage.
The storage will be mounted onto a local directory, exposing the vast
CESNET Storage as if it was a local removable hard drive. Files can then
be copied in and out in the usual fashion.
First, create the mountpoint
$ mkdir cesnet
Mount the storage. Note that you can choose among the ssh.du1.cesnet.cz
(Plzen), ssh.du2.cesnet.cz (Jihlava), ssh.du3.cesnet.cz (Brno)
Mount tier1_home **(only 5120M !)**:
$ sshfs username@ssh.du1.cesnet.cz:. cesnet/
For easy future access from the cluster, install your public key
$ cp .ssh/id_rsa.pub cesnet/.ssh/authorized_keys
Mount tier1_cache_tape for the Storage VO:
$ sshfs username@ssh.du1.cesnet.cz:/cache_tape/VO_storage/home/username cesnet/
View the archive, copy the files and directories in and out
$ ls cesnet/
$ cp -a mydir cesnet/.
$ cp cesnet/myfile .
Once done, please remember to unmount the storage
$ fusermount -u cesnet
### Rsync access
Rsync provides delta transfer for best performance and can resume
interrupted transfers.
Rsync is a fast and extraordinarily versatile file copying tool. It is
famous for its delta-transfer algorithm, which reduces the amount of
data sent over the network by sending only the differences between the
source files and the existing files in the destination. Rsync is widely
used for backups and mirroring and as an improved copy command for
everyday use.
Rsync finds files that need to be transferred using a "quick check"
algorithm (by default) that looks for files that have changed in size or
in last-modified time. Any changes in the other preserved attributes
(as requested by options) are made on the destination file directly when
the quick check indicates that the file's data does not need to be
updated.
More about Rsync at
<https://du.cesnet.cz/en/navody/rsync/start#pro_bezne_uzivatele>
Transfer large files to/from Cesnet storage, assuming membership in the
Storage VO
$ rsync --progress datafile username@ssh.du1.cesnet.cz:VO_storage-cache_tape/.
$ rsync --progress username@ssh.du1.cesnet.cz:VO_storage-cache_tape/datafile .
Transfer large directories to/from Cesnet storage, assuming membership
in the Storage VO
$ rsync --progress -av datafolder username@ssh.du1.cesnet.cz:VO_storage-cache_tape/.
$ rsync --progress -av username@ssh.du1.cesnet.cz:VO_storage-cache_tape/datafolder .
Transfer rates of about 28MB/s can be expected.
Storage
=======
Introduction
------------
There are two main shared file systems on the Salomon cluster, the
[HOME](storage.html#home)
and the
[SCRATCH](storage.html#shared-filesystems) filesystems.
All login and compute nodes may access the same data on the shared filesystems.
Compute nodes are also equipped with local (non-shared) scratch, ramdisk
and tmp filesystems.
Policy (in a nutshell)
----------------------
Use [HOME](storage.html#home) for your most valuable data
and programs.
Use [WORK](storage.html#work) for your large project
files
Use [TEMP](storage.html#temp) for large scratch data.
Do not use for [archiving](storage.html#archiving)!
Archiving
-------------
Please don't use shared filesystems as a backup for large amounts of data
or as a means of long-term archiving. The academic staff and students of
research institutions in the Czech Republic can use the [CESNET storage
service](../../anselm-cluster-documentation/storage-1/cesnet-data-storage.html),
which is available via SSHFS.
Shared Filesystems
----------------------
The Salomon computer provides two main shared filesystems, the
[HOME filesystem](storage.html#home-filesystem) and the
[SCRATCH filesystem](storage.html#scratch-filesystem). The
SCRATCH filesystem is partitioned into the [WORK and TEMP
workspaces](storage.html#shared-workspaces). The HOME
filesystem is realized as a tiered NFS disk storage. The SCRATCH
filesystem is realized as a parallel Lustre filesystem. Both shared file
systems are accessible via the Infiniband network. Extended ACLs are
provided on both HOME/SCRATCH filesystems for the purpose of sharing
data with other users using fine-grained control.
### HOME filesystem
The HOME filesystem is realized as a Tiered filesystem, exported via
NFS. The first tier has capacity 100TB, second tier has capacity 400TB.
The filesystem is available on all login and computational nodes. The
Home filesystem hosts the [HOME
workspace](storage.html#home).
### SCRATCH filesystem
The architecture of Lustre on Salomon is composed of two metadata
servers (MDS) and six data/object storage servers (OSS). Accessible
capacity is 1.69 PB, shared among all users. The SCRATCH filesystem
hosts the [WORK and TEMP
workspaces](storage.html#shared-workspaces).
Configuration of the SCRATCH Lustre storage:

- SCRATCH Lustre object storage
    - Disk array SFA12KX
    - 540 4TB SAS 7.2krpm disks
    - 54 OSTs of 10 disks in RAID6 (8+2)
    - 15 hot-spare disks
    - 4x 400GB SSD cache
- SCRATCH Lustre metadata storage
    - Disk array EF3015
    - 12 600GB SAS 15krpm disks
### Understanding the Lustre Filesystems
(source <http://www.nas.nasa.gov>)
A user file on the Lustre filesystem can be divided into multiple chunks
(stripes) and stored across a subset of the object storage targets
(OSTs) (disks). The stripes are distributed among the OSTs in a
round-robin fashion to ensure load balancing.
When a client (a compute node from your job) needs to create or access a
file, the client queries the metadata server (MDS) and the metadata
target (MDT) for the layout and location of the [file's
stripes](http://www.nas.nasa.gov/hecc/support/kb/Lustre_Basics_224.html#striping).
Once the file is opened and the client obtains the striping information,
the MDS is no longer involved in the file I/O process. The client
interacts directly with the object storage servers (OSSes) and OSTs to
perform I/O operations such as locking, disk allocation, storage, and
retrieval.
If multiple clients try to read and write the same part of a file at the
same time, the Lustre distributed lock manager enforces coherency so
that all clients see consistent results.
There is default stripe configuration for Salomon Lustre filesystems.
However, users can set the following stripe parameters for their own
directories or files to get optimum I/O performance:
1. stripe_size: the size of the chunk in bytes; specify with k, m, or
   g to use units of KB, MB, or GB, respectively; the size must be an
   even multiple of 65,536 bytes; default is 1MB for all Salomon Lustre
   filesystems
2. stripe_count: the number of OSTs to stripe across; default is 1 for
   Salomon Lustre filesystems; one can specify -1 to use all OSTs in
   the filesystem.
3. stripe_offset: the index of the OST where the first stripe is to be
   placed; default is -1 which results in random selection; using a
   non-default value is NOT recommended.
Setting stripe size and stripe count correctly for your needs may
significantly impact the I/O performance you experience.
Use the lfs getstripe command to get the stripe parameters. Use the lfs
setstripe command to set the stripe parameters and obtain optimal I/O
performance. The correct stripe setting depends on your needs and file
access patterns.
```
$ lfs getstripe dir|filename
$ lfs setstripe -s stripe_size -c stripe_count -o stripe_offset dir|filename
```
Example:
```
$ lfs getstripe /scratch/work/user/username
/scratch/work/user/username
stripe_count: 1 stripe_size: 1048576 stripe_offset: -1
$ lfs setstripe -c -1 /scratch/work/user/username/
$ lfs getstripe /scratch/work/user/username/
/scratch/work/user/username/
stripe_count:-1 stripe_size: 1048576 stripe_offset: -1
```
In this example, we view the current stripe setting of the
/scratch/work/user/username directory. The stripe count is changed to all
OSTs and verified. All files written to this directory will be striped over
all (54) OSTs.
Use lfs check osts to see the number and status of active OSTs for each
filesystem on Salomon. Learn more by reading the man page:
```
$ lfs check osts
$ man lfs
```
### Hints on Lustre Stripping
Increase the stripe_count for parallel I/O to the same file.
When multiple processes are writing blocks of data to the same file in
parallel, the I/O performance for large files will improve when the
stripe_count is set to a larger value. The stripe count sets the number
of OSTs the file will be written to. By default, the stripe count is set
to 1. While this default setting provides for efficient access of
metadata (for example to support the ls -l command), large files should
use stripe counts of greater than 1. This will increase the aggregate
I/O bandwidth by using multiple OSTs in parallel instead of just one. A
rule of thumb is to use a stripe count approximately equal to the number
of gigabytes in the file.
Another good practice is to make the stripe count be an integral factor
of the number of processes performing the write in parallel, so that you
achieve load balance among the OSTs. For example, set the stripe count
to 16 instead of 15 when you have 64 processes performing the writes.
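As an illustration, striping a (hypothetical) output directory over 16 OSTs
before 64 processes write to it might look like this:

```
# hypothetical directory used for a 64-process parallel write
$ lfs setstripe -c 16 /scratch/work/user/username/parallel_output
$ lfs getstripe /scratch/work/user/username/parallel_output
```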
Using a large stripe size can improve performance when accessing very
large files
Large stripe size allows each client to have exclusive access to its own
part of a file. However, it can be counterproductive in some cases if it
does not match your I/O pattern. The choice of stripe size has no effect
on a single-stripe file.
Read more on
<http://wiki.lustre.org/manual/LustreManual20_HTML/ManagingStripingFreeSpace.html>
Disk usage and quota commands
------------------------------------------
User quotas on the Lustre file systems (SCRATCH) can be checked
and reviewed using the following command:
```
$ lfs quota dir
```
Example for Lustre SCRATCH directory:
```
$ lfs quota /scratch
Disk quotas for user user001 (uid 1234):
Filesystem kbytes quota limit grace files quota limit grace
/scratch 8 0 100000000000 - 3 0 0 -
Disk quotas for group user001 (gid 1234):
Filesystem kbytes quota limit grace files quota limit grace
/scratch 8 0 0 - 3 0 0 -
```
In this example, we view current quota size limit of 100TB and 8KB
currently used by user001.
HOME directory is mounted via NFS, so a different command must be used
to obtain quota information:
$ quota
Example output:
$ quota
Disk quotas for user vop999 (uid 1025):
Filesystem blocks quota limit grace files quota limit grace
home-nfs-ib.salomon.it4i.cz:/home
28 0 250000000 10 0 500000
To have a better understanding of where the space is exactly used, you
can use the following command:
```
$ du -hs dir
```
Example for your HOME directory:
```
$ cd /home
$ du -hs * .[a-zA-z0-9]* | grep -E "[0-9]*G|[0-9]*M" | sort -hr
258M cuda-samples
15M .cache
13M .mozilla
5,5M .eclipse
2,7M .idb_13.0_linux_intel64_app
```
This will list all directories consuming megabytes or gigabytes of space
in your current (in this example HOME) directory. The list is sorted in
descending order from largest to smallest files/directories.
To have a better understanding of the previous commands, you can read the
manpages:
```
$ man lfs
```
```
$ man du
```
Extended Access Control List (ACL)
----------------------------------
Extended ACLs provide another security mechanism besides the standard
POSIX ACLs, which are defined by three entries (for
owner/group/others). Extended ACLs have more than the three basic
entries. In addition, they also contain a mask entry and may contain any
number of named user and named group entries.
ACLs on a Lustre file system work exactly like ACLs on any Linux file
system. They are manipulated with the standard tools in the standard
manner. Below, we create a directory and allow a specific user access.
```
[vop999@login1.salomon ~]$ umask 027
[vop999@login1.salomon ~]$ mkdir test
[vop999@login1.salomon ~]$ ls -ld test
drwxr-x--- 2 vop999 vop999 4096 Nov 5 14:17 test
[vop999@login1.salomon ~]$ getfacl test
# file: test
# owner: vop999
# group: vop999
user::rwx
group::r-x
other::---
[vop999@login1.salomon ~]$ setfacl -m user:johnsm:rwx test
[vop999@login1.salomon ~]$ ls -ld test
drwxrwx---+ 2 vop999 vop999 4096 Nov 5 14:17 test
[vop999@login1.salomon ~]$ getfacl test
# file: test
# owner: vop999
# group: vop999
user::rwx
user:johnsm:rwx
group::r-x
mask::rwx
other::---
```
Default ACL mechanism can be used to replace setuid/setgid permissions
on directories. Setting a default ACL on a directory (-d flag to
setfacl) will cause the ACL permissions to be inherited by any newly
created file or subdirectory within the directory. Refer to this page
for more information on Linux ACL:
[http://www.vanemery.com/Linux/ACL/POSIX_ACL_on_Linux.html ](http://www.vanemery.com/Linux/ACL/POSIX_ACL_on_Linux.html)
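For illustration, a default ACL can be added to the test directory from the
example above so that the (hypothetical) user johnsm also gains access to any
newly created content:

```
$ setfacl -d -m user:johnsm:rwx test   # default entry, inherited by new files/subdirs
$ getfacl test                         # the output now also lists default:... entries
```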
Shared Workspaces
---------------------
### HOME
Users' home directories /home/username reside on the HOME filesystem.
Accessible capacity is 0.5PB, shared among all users. Individual users
are restricted by filesystem usage quotas, set to 250GB per user.
If 250GB should prove insufficient for a particular user, please
contact [support](https://support.it4i.cz/rt);
the quota may be lifted upon request.
The HOME filesystem is intended for preparation, evaluation, processing
and storage of data generated by active Projects.
The HOME should not be used to archive data of past Projects or other
unrelated data.
The files on HOME will not be deleted until end of the [users
lifecycle](../../get-started-with-it4innovations/obtaining-login-credentials/obtaining-login-credentials.html).
The workspace is backed up, such that it can be restored in case of
catastrophic failure resulting in significant data loss. This backup
however is not intended to restore old versions of user data or to
restore (accidentally) deleted files.
HOME workspace
------------- ----------------
Accesspoint   /home/username
Capacity      0.5PB
Throughput    6GB/s
User quota    250GB
Protocol      NFS, 2-Tier
------------- ----------------
### WORK
The WORK workspace resides on the SCRATCH filesystem. Users may create
subdirectories and files in the directories **/scratch/work/user/username**
and **/scratch/work/project/projectid**. The /scratch/work/user/username
directory is private to the user, much like the home directory. The
/scratch/work/project/projectid directory is accessible to all users involved
in the project projectid.
The WORK workspace is intended to store users project data as well as
for high performance access to input and output files. All project data
should be removed once the project is finished. The data on the WORK
workspace are not backed up.
Files on the WORK filesystem are **persistent** (not automatically
deleted) throughout duration of the project.
The WORK workspace is hosted on SCRATCH filesystem. The SCRATCH is
realized as Lustre parallel filesystem and is available from all login
and computational nodes. Default stripe size is 1MB, stripe count is 1.
There are 54 OSTs dedicated for the SCRATCH filesystem.
Setting stripe size and stripe count correctly for your needs may
significantly impact the I/O performance you experience.
WORK workspace
---------------------- ---------------------------------
Accesspoints           /scratch/work/user/username,
                       /scratch/work/project/projectid
Capacity               1.6PB
Throughput             30GB/s
User quota             100TB
Default stripe size    1MB
Default stripe count   1
Number of OSTs         54
Protocol               Lustre
---------------------- ---------------------------------
### TEMP
The TEMP workspace resides on the SCRATCH filesystem. The TEMP workspace
accesspoint is /scratch/temp. Users may freely create subdirectories
and files on the workspace. Accessible capacity is 1.6PB, shared among
all users on TEMP and WORK. Individual users are restricted by
filesystem usage quotas, set to 100TB per user. The purpose of this
quota is to prevent runaway programs from filling the entire filesystem
and denying service to other users. If 100TB should prove
insufficient for a particular user, please contact
[support](https://support.it4i.cz/rt); the quota may be
lifted upon request.
The TEMP workspace is intended for temporary scratch data generated
during the calculation as well as for high performance access to input
and output files. All I/O intensive jobs must use the TEMP workspace as
their working directory.
Users are advised to save the necessary data from the TEMP workspace to
HOME or WORK after the calculations and clean up the scratch files.
Files on the TEMP filesystem that are **not accessed for more than 90
days** will be automatically **deleted**.
The TEMP workspace is hosted on SCRATCH filesystem. The SCRATCH is
realized as Lustre parallel filesystem and is available from all login
and computational nodes. Default stripe size is 1MB, stripe count is 1.
There are 54 OSTs dedicated for the SCRATCH filesystem.
Setting stripe size and stripe count correctly for your needs may
significantly impact the I/O performance you experience.
TEMP workspace
---------------------- ---------------
Accesspoint            /scratch/temp
Capacity               1.6PB
Throughput             30GB/s
User quota             100TB
Default stripe size    1MB
Default stripe count   1
Number of OSTs         54
Protocol               Lustre
---------------------- ---------------
RAM disk
--------
Every computational node is equipped with a filesystem realized in memory,
the so-called RAM disk.
Use RAM disk in case you need really fast access to your data of limited
size during your calculation.
Be very careful, use of RAM disk filesystem is at the expense of
operational memory.
The local RAM disk is mounted as /ramdisk and is accessible to user
at /ramdisk/$PBS_JOBID directory.
The local RAM disk filesystem is intended for temporary scratch data
generated during the calculation as well as for high performance access
to input and output files. Size of RAM disk filesystem is limited. Be
very careful, use of RAM disk filesystem is at the expense of
operational memory. It is not recommended to allocate large amount of
memory and use large amount of data in RAM disk filesystem at the same
time.
The local RAM disk directory /ramdisk/$PBS_JOBID will be deleted
immediately after the calculation ends. Users should take care to save
the output data from within the jobscript.
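A minimal jobscript sketch using the RAM disk (input.dat and output.out are
hypothetical file names):

```
#!/bin/bash
# change to the job's private RAM disk directory
cd /ramdisk/$PBS_JOBID || exit

# copy (hypothetical) input data from the submission directory
cp $PBS_O_WORKDIR/input.dat .

# ... run the calculation here, producing output.out ...

# copy the results back to home before the RAM disk is purged
cp output.out $PBS_O_WORKDIR/.
exit
```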
RAM disk
------------ -------------------------------------------------------
Mountpoint   /ramdisk
Accesspoint  /ramdisk/$PBS_JOBID
Capacity     120 GB
Throughput   over 1.5 GB/s write, over 5 GB/s read, single thread;
             over 10 GB/s write, over 50 GB/s read, 16 threads
User quota   none
------------ -------------------------------------------------------
Summary
----------
| Mountpoint    | Usage                          | Protocol    | Net Capacity | Throughput | Limitations  | Access                  | Services                          |
| ------------- | ------------------------------ | ----------- | ------------ | ---------- | ------------ | ----------------------- | --------------------------------- |
| /home         | home directory                 | NFS, 2-Tier | 0.5 PB       | 6 GB/s     | Quota 250 GB | Compute and login nodes | backed up                         |
| /scratch/work | large project files            | Lustre      | 1.69 PB      | 30 GB/s    | Quota 100 TB | Compute and login nodes | none                              |
| /scratch/temp | job temporary data             | Lustre      | 1.69 PB      | 30 GB/s    | Quota 100 TB | Compute and login nodes | files older than 90 days removed  |
| /ramdisk      | job temporary data, node local | local       | 120 GB       | 90 GB/s    | none         | Compute nodes           | purged after job ends             |
#!/bin/bash
### DOWNLOAD AND CONVERT DOCUMENTATION
# author: kru0052
# version: 0.4
# changes: bug fixes and optimizations
# known bugs: badly formatted tables, broken links to other files, a few leftover HTML elements, formatting glitches...
###
if [ "$1" = "-t" ]; then
# testing new function
fi
if [ "$1" = "-w" ]; then
# download html pages
wget -X pbspro-documentation,changelog,whats-new,portal_css,portal_javascripts,++resource++jquery-ui-themes,anselm-cluster-documentation/icon.jpg -R favicon.ico,pdf.png,logo.png,background.png,application.png,search_icon.png,png.png,sh.png,touch_icon.png,anselm-cluster-documentation/icon.jpg,*js,robots.txt,*xml,RSS,download_icon.png,pdf,*zip,*rar,@@*,anselm-cluster-documentation/icon.jpg.1 --mirror --convert-links --adjust-extension --page-requisites --no-parent https://docs.it4i.cz;
wget --directory-prefix=./docs.it4i.cz/ http://verif.cs.vsb.cz/aislinn/doc/report.png
wget --directory-prefix=./docs.it4i.cz/ https://docs.it4i.cz/anselm-cluster-documentation/software/virtualization/virtualization-job-workflow
wget --directory-prefix=./docs.it4i.cz/ https://docs.it4i.cz/anselm-cluster-documentation/software/omics-master-1/images/fig1.png
wget --directory-prefix=./docs.it4i.cz/ https://docs.it4i.cz/anselm-cluster-documentation/software/omics-master-1/images/fig2.png
wget --directory-prefix=./docs.it4i.cz/ https://docs.it4i.cz/anselm-cluster-documentation/software/omics-master-1/images/fig3.png
wget --directory-prefix=./docs.it4i.cz/ https://docs.it4i.cz/anselm-cluster-documentation/software/omics-master-1/images/fig4.png
wget --directory-prefix=./docs.it4i.cz/ https://docs.it4i.cz/anselm-cluster-documentation/software/omics-master-1/images/fig5.png
wget --directory-prefix=./docs.it4i.cz/ https://docs.it4i.cz/anselm-cluster-documentation/software/omics-master-1/images/fig6.png
fi
if [ "$1" = "-c" ]; then
### convert html to md
# erasing the previous transfer
rm -rf converted;
rm -rf info;
# erasing duplicate files and unwanted files
(while read i;
do
if [ -f "$i" ];
then
echo "$(tput setaf 9)$i deleted";
rm "$i";
fi
done) < ./source/list_rm
counter=1
count=$(find . -name "*.html" -type f | wc -l)
find . -name "*.ht*" |
while read i;
do
# first filtering html
echo "$(tput setaf 12)($counter/$count)$(tput setaf 11)$i";
counter=$((counter+1))
printf "$(tput setaf 15)\t\tfirst filtering html files...\n";
HEAD=$(grep -n -m1 '<h1' "$i" |cut -f1 -d: | tr --delete '\n')
END=$(grep -n -m1 '<!-- <div tal:content=' "$i" |cut -f1 -d: | tr --delete '\n')
LAST=$(wc -l "$i" | cut -f1 -d' ')
DOWN=$((LAST-END+2))
sed '1,'"$((HEAD-1))"'d' "$i" | sed -n -e :a -e '1,'"$DOWN"'!{P;N;D;};N;ba' > "${i%.*}TMP.html"
# converted .html to .md
printf "\t\t.html -> .md\n"
pandoc -f html -t markdown+pipe_tables-grid_tables "${i%.*}TMP.html" -o "${i%.*}.md";
rm "${i%.*}TMP.html";
# second filtering html and css elements...
printf "\t\tsecond filtering html and css elements...\n"
sed -e 's/``` /```/' "${i%.*}.md" | sed -e 's/ //' | sed -e 's/<\/div>//g' | sed '/^<div/d' | sed -e 's/<\/span>//' | sed -e 's/^\*\*//' | sed -e 's/\\//g' | sed -e 's/^: //g' | sed -e 's/^Obsah//g' > "${i%.*}TMP.md";
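# apply global search & replace pairs read from ./source/replace
# (one pattern&replacement pair per line, '&'-separated, escaped for sed)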
while read x ; do
arg1=`echo "$x" | cut -d"&" -f1 | sed 's:[]\[\^\$\.\*\/\"]:\\\\&:g'`;
arg2=`echo "$x" | cut -d"&" -f2 | sed 's:[]\[\^\$\.\*\/\"]:\\\\&:g'`;
sed -e 's/'"$arg1"'/'"$arg2"'/' "${i%.*}TMP.md" > "${i%.*}TMP.TEST.md";
cat "${i%.*}TMP.TEST.md" > "${i%.*}TMP.md";
done < ./source/replace
# repair image...
printf "\t\trepair images...\n"
while read x ; do
arg1=`echo "$x" | cut -d"&" -f1 | sed 's:[]\[\^\$\.\*\/\"]:\\\\&:g'`;
arg2=`echo "$x" | cut -d"&" -f2 | sed 's:[]\[\^\$\.\*\/\"]:\\\\&:g'`;
sed -e 's/'"$arg1"'/'"$arg2"'/' "${i%.*}TMP.md" > "${i%.*}.md";
cat "${i%.*}.md" > "${i%.*}TMP.md";
done < ./source/repairIMG
cat "${i%.*}TMP.md" > "${i%.*}.md";
# delete temporary files
rm "${i%.*}TMP.md";
rm "${i%.*}TMP.TEST.md";
done
# delete empty files
find -type f -size -10c |
while read i;
do
rm "$i";
echo "$(tput setaf 9)$i deleted";
done
### create new folder and move converted files
# create folder info and view all files and folder
mkdir info;
find ./docs.it4i.cz -name "*.png" -type f > ./info/list_image.txt;
find ./docs.it4i.cz -name "*.jpg" -type f >> ./info/list_image.txt;
find ./docs.it4i.cz -name "*.jpeg" -type f >> ./info/list_image.txt;
find ./docs.it4i.cz -name "*.md" -type f> ./info/list_md.txt;
find ./docs.it4i.cz -type d | sort > ./info/list_folder.txt
rm -rf ./converted
mkdir converted;
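# recreate the target directory tree listed in ./source/list_folder under ./converted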
(while read i;
do
mkdir "./converted/$i";
done) < ./source/list_folder
# move md files to new folders
while read a b ; do
cp "$a" "./converted/$b";
done < <(paste ./info/list_md.txt ./source/list_md_mv)
# copy jpg and jpeg to new folders
while read a b ; do
cp "$a" "./converted/$b";
done < <(paste ./info/list_image.txt ./source/list_image_mv.txt)
cp ./docs.it4i.cz/salomon/salomon ./converted/docs.it4i.cz/salomon/salomon
cp ./docs.it4i.cz/salomon/salomon-2 ./converted/docs.it4i.cz/salomon/salomon-2
cp ./converted/docs.it4i.cz/salomon/resource-allocation-and-job-execution/fairshare_formula.png ./converted/docs.it4i.cz/anselm-cluster-documentation/resource-allocation-and-job-execution/fairshare_formula.png
cp ./converted/docs.it4i.cz/salomon/resource-allocation-and-job-execution/job_sort_formula.png ./converted/docs.it4i.cz/anselm-cluster-documentation/resource-allocation-and-job-execution/job_sort_formula.png
cp ./converted/docs.it4i.cz/salomon/software/debuggers/vtune-amplifier.png ./converted/docs.it4i.cz/anselm-cluster-documentation/software/debuggers/vtune-amplifier.png
cp ./converted/docs.it4i.cz/salomon/software/debuggers/Snmekobrazovky20160708v12.33.35.png ./converted/docs.it4i.cz/anselm-cluster-documentation/software/debuggers/Snmekobrazovky20160708v12.33.35.png
cp ./docs.it4i.cz/virtualization-job-workflow ./converted/docs.it4i.cz/anselm-cluster-documentation/software/
fi
./docs.it4i.cz
./docs.it4i.cz/anselm-cluster-documentation
./docs.it4i.cz/anselm-cluster-documentation/accessing-the-cluster
./docs.it4i.cz/anselm-cluster-documentation/accessing-the-cluster/shell-and-data-access
./docs.it4i.cz/anselm-cluster-documentation/resource-allocation-and-job-execution
./docs.it4i.cz/anselm-cluster-documentation/software
./docs.it4i.cz/anselm-cluster-documentation/software/ansys
./docs.it4i.cz/anselm-cluster-documentation/software/chemistry
./docs.it4i.cz/anselm-cluster-documentation/software/comsol
./docs.it4i.cz/anselm-cluster-documentation/software/debuggers
./docs.it4i.cz/anselm-cluster-documentation/software/intel-suite
./docs.it4i.cz/anselm-cluster-documentation/software/mpi-1
./docs.it4i.cz/anselm-cluster-documentation/software/numerical-languages
./docs.it4i.cz/anselm-cluster-documentation/software/numerical-libraries
./docs.it4i.cz/anselm-cluster-documentation/software/omics-master-1
./docs.it4i.cz/anselm-cluster-documentation/storage-1
./docs.it4i.cz/get-started-with-it4innovations
./docs.it4i.cz/get-started-with-it4innovations/accessing-the-clusters
./docs.it4i.cz/get-started-with-it4innovations/accessing-the-clusters/graphical-user-interface
./docs.it4i.cz/get-started-with-it4innovations/accessing-the-clusters/shell-access-and-data-transfer
./docs.it4i.cz/get-started-with-it4innovations/obtaining-login-credentials
./docs.it4i.cz/salomon
./docs.it4i.cz/salomon/accessing-the-cluster
./docs.it4i.cz/salomon/hardware-overview-1
./docs.it4i.cz/salomon/network-1
./docs.it4i.cz/salomon/resource-allocation-and-job-execution
./docs.it4i.cz/salomon/software
./docs.it4i.cz/salomon/software/ansys
./docs.it4i.cz/salomon/software/chemistry
./docs.it4i.cz/salomon/software/comsol
./docs.it4i.cz/salomon/software/debuggers
./docs.it4i.cz/salomon/software/intel-suite
./docs.it4i.cz/salomon/software/mpi-1
./docs.it4i.cz/salomon/software/numerical-languages
./docs.it4i.cz/salomon/storage
./docs.it4i.cz/salomon/uv-2000