Commit d6dc47d8 authored by Pavel Gajdušek's avatar Pavel Gajdušek
Browse files

intel-suite changed, bad links, to fix

parent e30f7196
# Intel Advisor
is tool aiming to assist you in vectorization and threading of your code. You can use it to profile your application and identify loops, that could benefit from vectorization and/or threading parallelism.
## Installed Versions
The following versions are currently available on Salomon as modules:
2016 Update 2 - Advisor/2016_update2
## Usage
Your program should be compiled with -g switch to include symbol names. You should compile with -O2 or higher to see code that is already vectorized by the compiler.
Profiling is possible either directly from the GUI, or from command line.
To profile from GUI, launch Advisor:
```console
$ advixe-gui
```
Then select menu File -> New -> Project. Choose a directory to save project data to. After clicking OK, Project properties window will appear, where you can configure path to your binary, launch arguments, working directory etc. After clicking OK, the project is ready.
In the left pane, you can switch between Vectorization and Threading workflows. Each has several possible steps which you can execute by clicking Collect button. Alternatively, you can click on Command Line, to see the command line required to run the analysis directly from command line.
## References
1. [Intel® Advisor 2015 Tutorial: Find Where to Add Parallelism - C++ Sample](https://software.intel.com/en-us/intel-advisor-tutorial-vectorization-windows-cplusplus)
1. [Product page](https://software.intel.com/en-us/intel-advisor-xe)
1. [Documentation](https://software.intel.com/en-us/intel-advisor-2016-user-guide-linux)
# Intel Compilers
The Intel compilers in multiple versions are available, via module intel. The compilers include the icc C and C++ compiler and the ifort fortran 77/90/95 compiler.
```console
$ ml intel
$ icc -v
$ ifort -v
```
The intel compilers provide for vectorization of the code, via the AVX2 instructions and support threading parallelization via OpenMP
For maximum performance on the Salomon cluster compute nodes, compile your programs using the AVX2 instructions, with reporting where the vectorization was used. We recommend following compilation options for high performance
```console
$ icc -ipo -O3 -xCORE-AVX2 -qopt-report1 -qopt-report-phase=vec myprog.c mysubroutines.c -o myprog.x
$ ifort -ipo -O3 -xCORE-AVX2 -qopt-report1 -qopt-report-phase=vec myprog.f mysubroutines.f -o myprog.x
```
In this example, we compile the program enabling interprocedural optimizations between source files (-ipo), aggresive loop optimizations (-O3) and vectorization (-xCORE-AVX2)
The compiler recognizes the omp, simd, vector and ivdep pragmas for OpenMP parallelization and AVX2 vectorization. Enable the OpenMP parallelization by the **-openmp** compiler switch.
```console
$ icc -ipo -O3 -xCORE-AVX2 -qopt-report1 -qopt-report-phase=vec -openmp myprog.c mysubroutines.c -o myprog.x
$ ifort -ipo -O3 -xCORE-AVX2 -qopt-report1 -qopt-report-phase=vec -openmp myprog.f mysubroutines.f -o myprog.x
```
Read more at <https://software.intel.com/en-us/intel-cplusplus-compiler-16.0-user-and-reference-guide>
## Sandy Bridge/Ivy Bridge/Haswell Binary Compatibility
Anselm nodes are currently equipped with Sandy Bridge CPUs, while Salomon compute nodes are equipped with Haswell based architecture. The UV1 SMP compute server has Ivy Bridge CPUs, which are equivalent to Sandy Bridge (only smaller manufacturing technology). The new processors are backward compatible with the Sandy Bridge nodes, so all programs that ran on the Sandy Bridge processors, should also run on the new Haswell nodes. To get optimal performance out of the Haswell processors a program should make use of the special AVX2 instructions for this processor. One can do this by recompiling codes with the compiler flags designated to invoke these instructions. For the Intel compiler suite, there are two ways of doing this:
* Using compiler flag (both for Fortran and C): -xCORE-AVX2. This will create a binary with AVX2 instructions, specifically for the Haswell processors. Note that the executable will not run on Sandy Bridge/Ivy Bridge nodes.
* Using compiler flags (both for Fortran and C): -xAVX -axCORE-AVX2. This will generate multiple, feature specific auto-dispatch code paths for Intel® processors, if there is a performance benefit. So this binary will run both on Sandy Bridge/Ivy Bridge and Haswell processors. During runtime it will be decided which path to follow, dependent on which processor you are running on. In general this will result in larger binaries.
# Intel Debugger
IDB is no longer available since Intel Parallel Studio 2015
## Debugging Serial Applications
The intel debugger version 13.0 is available, via module intel. The debugger works for applications compiled with C and C++ compiler and the ifort fortran 77/90/95 compiler. The debugger provides java GUI environment. Use [X display](../../../general/accessing-the-clusters/graphical-user-interface/x-window-system/) for running the GUI.
```console
$ ml intel
$ ml Java
$ idb
```
The debugger may run in text mode. To debug in text mode, use
```console
$ idbc
```
To debug on the compute nodes, module intel must be loaded. The GUI on compute nodes may be accessed using the same way as in [the GUI section](../../../general/accessing-the-clusters/graphical-user-interface/x-window-system/)
Example:
```console
$ qsub -q qexp -l select=1:ncpus=24 -X -I # use 16 threads for Anselm
qsub: waiting for job 19654.srv11 to start
qsub: job 19654.srv11 ready
$ ml intel
$ ml Java
$ icc -O0 -g myprog.c -o myprog.x
$ idb ./myprog.x
```
In this example, we allocate 1 full compute node, compile program myprog.c with debugging options -O0 -g and run the idb debugger interactively on the myprog.x executable. The GUI access is via X11 port forwarding provided by the PBS workload manager.
## Debugging Parallel Applications
Intel debugger is capable of debugging multithreaded and MPI parallel programs as well.
### Small Number of MPI Ranks
For debugging small number of MPI ranks, you may execute and debug each rank in separate xterm terminal (do not forget the [X display](../../../general/accessing-the-clusters/graphical-user-interface/x-window-system/)). Using Intel MPI, this may be done in following way:
```console
$ qsub -q qexp -l select=2:ncpus=24 -X -I
qsub: waiting for job 19654.srv11 to start
qsub: job 19655.srv11 ready
$ ml intel
$ mpirun -ppn 1 -hostfile $PBS_NODEFILE --enable-x xterm -e idbc ./mympiprog.x
```
In this example, we allocate 2 full compute node, run xterm on each node and start idb debugger in command line mode, debugging two ranks of mympiprog.x application. The xterm will pop up for each rank, with idb prompt ready. The example is not limited to use of Intel MPI
### Large Number of MPI Ranks
Run the idb debugger from within the MPI debug option. This will cause the debugger to bind to all ranks and provide aggregated outputs across the ranks, pausing execution automatically just after startup. You may then set break points and step the execution manually. Using Intel MPI:
```console
$ qsub -q qexp -l select=2:ncpus=24 -X -I
qsub: waiting for job 19654.srv11 to start
qsub: job 19655.srv11 ready
$ ml intel
$ mpirun -n 48 -idb ./mympiprog.x
```
### Debugging Multithreaded Application
Run the idb debugger in GUI mode. The menu Parallel contains number of tools for debugging multiple threads. One of the most useful tools is the **Serialize Execution** tool, which serializes execution of concurrent threads for easy orientation and identification of concurrency related bugs.
## Further Information
Exhaustive manual on idb features and usage is published at Intel website, <https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/>
# Intel Inspector
Intel Inspector is a dynamic memory and threading error checking tool for C/C++/Fortran applications. It can detect issues such as memory leaks, invalid memory references, uninitalized variables, race conditions, deadlocks etc.
## Installed Versions
The following versions are currently available on Salomon as modules:
2016 Update 1 - Inspector/2016_update1
## Usage
Your program should be compiled with -g switch to include symbol names. Optimizations can be turned on.
Debugging is possible either directly from the GUI, or from command line.
### GUI Mode
To debug from GUI, launch Inspector:
```console
$ inspxe-gui &
```
Then select menu File -> New -> Project. Choose a directory to save project data to. After clicking OK, Project properties window will appear, where you can configure path to your binary, launch arguments, working directory etc. After clicking OK, the project is ready.
In the main pane, you can start a predefined analysis type or define your own. Click Start to start the analysis. Alternatively, you can click on Command Line, to see the command line required to run the analysis directly from command line.
### Batch Mode
Analysis can be also run from command line in batch mode. Batch mode analysis is run with command inspxe-cl. To obtain the required parameters, either consult the documentation or you can configure the analysis in the GUI and then click "Command Line" button in the lower right corner to the respective command line.
Results obtained from batch mode can be then viewed in the GUI by selecting File -> Open -> Result...
## References
1. [Product page](https://software.intel.com/en-us/intel-inspector-xe)
1. [Documentation and Release Notes](https://software.intel.com/en-us/intel-inspector-xe-support/documentation)
1. [Tutorials](https://software.intel.com/en-us/articles/inspectorxe-tutorials)
# Intel IPP
## Intel Integrated Performance Primitives
Intel Integrated Performance Primitives, version 9.0.1, compiled for AVX2 vector instructions is available, via module ipp. The IPP is a very rich library of highly optimized algorithmic building blocks for media and data applications. This includes signal, image and frame processing algorithms, such as FFT, FIR, Convolution, Optical Flow, Hough transform, Sum, MinMax, as well as cryptographic functions, linear algebra functions and many more.
Check out IPP before implementing own math functions for data processing, it is likely already there.
```console
$ ml ipp
```
The module sets up environment variables, required for linking and running ipp enabled applications.
## IPP Example
```cpp
#include "ipp.h"
#include <stdio.h>
int main(int argc, char* argv[])
{
const IppLibraryVersion *lib;
Ipp64u fm;
IppStatus status;
status= ippInit(); //IPP initialization with the best optimization layer
if( status != ippStsNoErr ) {
printf("IppInit() Error:n");
printf("%sn", ippGetStatusString(status) );
return -1;
}
//Get version info
lib = ippiGetLibVersion();
printf("%s %sn", lib->Name, lib->Version);
//Get CPU features enabled with selected library level
fm=ippGetEnabledCpuFeatures();
printf("SSE :%cn",(fm>1)&1?'Y':'N');
printf("SSE2 :%cn",(fm>2)&1?'Y':'N');
printf("SSE3 :%cn",(fm>3)&1?'Y':'N');
printf("SSSE3 :%cn",(fm>4)&1?'Y':'N');
printf("SSE41 :%cn",(fm>6)&1?'Y':'N');
printf("SSE42 :%cn",(fm>7)&1?'Y':'N');
printf("AVX :%cn",(fm>8)&1 ?'Y':'N');
printf("AVX2 :%cn", (fm>15)&1 ?'Y':'N' );
printf("----------n");
printf("OS Enabled AVX :%cn", (fm>9)&1 ?'Y':'N');
printf("AES :%cn", (fm>10)&1?'Y':'N');
printf("CLMUL :%cn", (fm>11)&1?'Y':'N');
printf("RDRAND :%cn", (fm>13)&1?'Y':'N');
printf("F16C :%cn", (fm>14)&1?'Y':'N');
return 0;
}
```
Compile above example, using any compiler and the ipp module.
```console
$ ml intel
$ ml ipp
$ icc testipp.c -o testipp.x -lippi -lipps -lippcore
```
You will need the ipp module loaded to run the ipp enabled executable. This may be avoided, by compiling library search paths into the executable
```console
$ ml intel
$ ml ipp
$ icc testipp.c -o testipp.x -Wl,-rpath=$LIBRARY_PATH -lippi -lipps -lippcore
```
## Code Samples and Documentation
Intel provides number of [Code Samples for IPP](https://software.intel.com/en-us/articles/code-samples-for-intel-integrated-performance-primitives-library), illustrating use of IPP.
Read full documentation on IPP [on Intel website,](http://software.intel.com/sites/products/search/search.php?q=&x=15&y=6&product=ipp&version=7.1&docos=lin) in particular the [IPP Reference manual.](http://software.intel.com/sites/products/documentation/doclib/ipp_sa/71/ipp_manual/index.htm)
# Intel MKL
## Intel Math Kernel Library
Intel Math Kernel Library (Intel MKL) is a library of math kernel subroutines, extensively threaded and optimized for maximum performance. Intel MKL provides these basic math kernels:
* BLAS (level 1, 2, and 3) and LAPACK linear algebra routines, offering vector, vector-matrix, and matrix-matrix operations.
* The PARDISO direct sparse solver, an iterative sparse solver, and supporting sparse BLAS (level 1, 2, and 3) routines for solving sparse systems of equations.
* ScaLAPACK distributed processing linear algebra routines for Linux and Windows operating systems, as well as the Basic Linear Algebra Communications Subprograms (BLACS) and the Parallel Basic Linear Algebra Subprograms (PBLAS).
* Fast Fourier transform (FFT) functions in one, two, or three dimensions with support for mixed radices (not limited to sizes that are powers of 2), as well as distributed versions of these functions.
* Vector Math Library (VML) routines for optimized mathematical operations on vectors.
* Vector Statistical Library (VSL) routines, which offer high-performance vectorized random number generators (RNG) for several probability distributions, convolution and correlation routines, and summary statistics functions.
* Data Fitting Library, which provides capabilities for spline-based approximation of functions, derivatives and integrals of functions, and search.
* Extended Eigensolver, a shared memory version of an eigensolver based on the Feast Eigenvalue Solver.
For details see the [Intel MKL Reference Manual](http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mklman/index.htm).
Intel MKL is available on the cluster
```console
$ ml av imkl
$ ml imkl
```
The module sets up environment variables, required for linking and running mkl enabled applications. The most important variables are the $MKLROOT, $CPATH, $LD_LIBRARY_PATH and $MKL_EXAMPLES
Intel MKL library may be linked using any compiler. With intel compiler use -mkl option to link default threaded MKL.
### Interfaces
Intel MKL library provides number of interfaces. The fundamental once are the LP64 and ILP64. The Intel MKL ILP64 libraries use the 64-bit integer type (necessary for indexing large arrays, with more than 231^-1 elements), whereas the LP64 libraries index arrays with the 32-bit integer type.
| Interface | Integer type |
| --------- | -------------------------------------------- |
| LP64 | 32-bit, int, integer(kind=4), MPI_INT |
| ILP64 | 64-bit, long int, integer(kind=8), MPI_INT64 |
### Linking
Linking Intel MKL libraries may be complex. Intel [mkl link line advisor](http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor) helps. See also [examples](intel-mkl/#examples) below.
You will need the mkl module loaded to run the mkl enabled executable. This may be avoided, by compiling library search paths into the executable. Include rpath on the compile line:
```console
$ icc .... -Wl,-rpath=$LIBRARY_PATH ...
```
### Threading
Advantage in using Intel MKL library is that it brings threaded parallelization to applications that are otherwise not parallel.
For this to work, the application must link the threaded MKL library (default). Number and behaviour of MKL threads may be controlled via the OpenMP environment variables, such as OMP_NUM_THREADS and KMP_AFFINITY. MKL_NUM_THREADS takes precedence over OMP_NUM_THREADS
```console
$ export OMP_NUM_THREADS=24 # 16 for Anselm
$ export KMP_AFFINITY=granularity=fine,compact,1,0
```
The application will run with 24 threads with affinity optimized for fine grain parallelization.
## Examples
Number of examples, demonstrating use of the Intel MKL library and its linking is available on clusters, in the $MKL_EXAMPLES directory. In the examples below, we demonstrate linking Intel MKL to Intel and GNU compiled program for multi-threaded matrix multiplication.
### Working With Examples
```console
$ ml intel
$ ml imkl
$ cp -a $MKL_EXAMPLES/cblas /tmp/
$ cd /tmp/cblas
$ make sointel64 function=cblas_dgemm
```
In this example, we compile, link and run the cblas_dgemm example, demonstrating use of MKL example suite installed on clusters.
### Example: MKL and Intel Compiler
```console
$ ml intel
$ ml imkl
$ cp -a $MKL_EXAMPLES/cblas /tmp/
$ cd /tmp/cblas
$
$ icc -w source/cblas_dgemmx.c source/common_func.c -mkl -o cblas_dgemmx.x
$ ./cblas_dgemmx.x data/cblas_dgemmx.d
```
In this example, we compile, link and run the cblas_dgemm example, demonstrating use of MKL with icc -mkl option. Using the -mkl option is equivalent to:
```console
$ icc -w source/cblas_dgemmx.c source/common_func.c -o cblas_dgemmx.x -I$MKL_INC_DIR -L$MKL_LIB_DIR -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5
```
In this example, we compile and link the cblas_dgemm example, using LP64 interface to threaded MKL and Intel OMP threads implementation.
### Example: Intel MKL and GNU Compiler
```console
$ ml GCC
$ ml imkl
$ cp -a $MKL_EXAMPLES/cblas /tmp/
$ cd /tmp/cblas
$ gcc -w source/cblas_dgemmx.c source/common_func.c -o cblas_dgemmx.x -lmkl_intel_lp64 -lmkl_gnu_thread -lmkl_core -lgomp -lm
$ ./cblas_dgemmx.x data/cblas_dgemmx.d
```
In this example, we compile, link and run the cblas_dgemm example, using LP64 interface to threaded MKL and gnu OMP threads implementation.
## MKL and MIC Accelerators
The Intel MKL is capable to automatically offload the computations o the MIC accelerator. See section [Intel Xeon Phi](../intel-xeon-phi/) for details.
## LAPACKE C Interface
MKL includes LAPACKE C Interface to LAPACK. For some reason, although Intel is the author of LAPACKE, the LAPACKE header files are not present in MKL. For this reason, we have prepared LAPACKE module, which includes Intel's LAPACKE headers from official LAPACK, which you can use to compile code using LAPACKE interface against MKL.
## Further Reading
Read more on [Intel website](http://software.intel.com/en-us/intel-mkl), in particular the [MKL users guide](https://software.intel.com/en-us/intel-mkl/documentation/linux).
# Intel Parallel Studio
The Salomon cluster provides following elements of the Intel Parallel Studio XE
Intel Parallel Studio XE
* Intel Compilers
* Intel Debugger
* Intel MKL Library
* Intel Integrated Performance Primitives Library
* Intel Threading Building Blocks Library
* Intel Trace Analyzer and Collector
* Intel Advisor
* Intel Inspector
## Intel Compilers
The Intel compilers are available, via module intel. The compilers include the icc C and C++ compiler and the ifort fortran 77/90/95 compiler.
```console
$ ml intel
$ icc -v
$ ifort -v
```
Read more at the [Intel Compilers](intel-compilers/) page.
## Intel Debugger
IDB is no longer available since Parallel Studio 2015.
The intel debugger version 13.0 is available, via module intel. The debugger works for applications compiled with C and C++ compiler and the ifort fortran 77/90/95 compiler. The debugger provides java GUI environment.
```console
$ ml intel
$ idb
```
Read more at the [Intel Debugger](intel-debugger/) page.
## Intel Math Kernel Library
Intel Math Kernel Library (Intel MKL) is a library of math kernel subroutines, extensively threaded and optimized for maximum performance. Intel MKL unites and provides these basic components: BLAS, LAPACK, ScaLapack, PARDISO, FFT, VML, VSL, Data fitting, Feast Eigensolver and many more.
```console
$ ml imkl
```
Read more at the [Intel MKL](intel-mkl/) page.
## Intel Integrated Performance Primitives
Intel Integrated Performance Primitives, version 7.1.1, compiled for AVX is available, via module ipp. The IPP is a library of highly optimized algorithmic building blocks for media and data applications. This includes signal, image and frame processing algorithms, such as FFT, FIR, Convolution, Optical Flow, Hough transform, Sum, MinMax and many more.
```console
$ ml ipp
```
Read more at the [Intel IPP](intel-integrated-performance-primitives/) page.
## Intel Threading Building Blocks
Intel Threading Building Blocks (Intel TBB) is a library that supports scalable parallel programming using standard ISO C++ code. It does not require special languages or compilers. It is designed to promote scalable data parallel programming. Additionally, it fully supports nested parallelism, so you can build larger parallel components from smaller parallel components. To use the library, you specify tasks, not threads, and let the library map tasks onto threads in an efficient manner.
```console
$ ml tbb
```
Read more at the [Intel TBB](intel-tbb/) page.
# Intel TBB
## Intel Threading Building Blocks
Intel Threading Building Blocks (Intel TBB) is a library that supports scalable parallel programming using standard ISO C++ code. It does not require special languages or compilers. To use the library, you specify tasks, not threads, and let the library map tasks onto threads in an efficient manner. The tasks are executed by a runtime scheduler and may be offloaded to [MIC accelerator](../intel-xeon-phi/).
Intel is available on the cluster.
```console
$ ml av tbb
```
The module sets up environment variables, required for linking and running tbb enabled applications.
Link the tbb library, using -ltbb
## Examples
Number of examples, demonstrating use of TBB and its built-in scheduler is available on Anselm, in the $TBB_EXAMPLES directory.
```console
$ ml intel
$ ml tbb
$ cp -a $TBB_EXAMPLES/common $TBB_EXAMPLES/parallel_reduce /tmp/
$ cd /tmp/parallel_reduce/primes
$ icc -O2 -DNDEBUG -o primes.x main.cpp primes.cpp -ltbb
$ ./primes.x
```
In this example, we compile, link and run the primes example, demonstrating use of parallel task-based reduce in computation of prime numbers.
You will need the tbb module loaded to run the tbb enabled executable. This may be avoided, by compiling library search paths into the executable.
```console
$ icc -O2 -o primes.x main.cpp primes.cpp -Wl,-rpath=$LIBRARY_PATH -ltbb
```
## Further Reading
Read more on Intel website, <http://software.intel.com/sites/products/documentation/doclib/tbb_sa/help/index.htm>
# Intel Trace Analyzer and Collector
Intel Trace Analyzer and Collector (ITAC) is a tool to collect and graphicaly analyze behaviour of MPI applications. It helps you to analyze communication patterns of your application, identify hotspots, perform correctnes checking (identify deadlocks, data corruption etc), simulate how your application would run on a different interconnect.
ITAC is a offline analysis tool - first you run your application to collect a trace file, then you can open the trace in a GUI analyzer to view it.
## Installed Version
Currently on Salomon is version 9.1.2.024 available as module itac/9.1.2.024
## Collecting Traces
ITAC can collect traces from applications that are using Intel MPI. To generate a trace, simply add -trace option to your mpirun command :
```console
$ ml itac/9.1.2.024
$ mpirun -trace myapp
```
The trace will be saved in file myapp.stf in the current directory.
## Viewing Traces
To view and analyze the trace, open ITAC GUI in a [graphical environment](../../../general/accessing-the-clusters/graphical-user-interface/x-window-system/):
```console
$ ml itac/9.1.2.024
$ traceanalyzer
```
The GUI will launch and you can open the produced `*`.stf file.
![](../../../img/Snmekobrazovky20151204v15.35.12.png)
Please refer to Intel documenation about usage of the GUI tool.
## References
1. [Getting Started with Intel® Trace Analyzer and Collector](https://software.intel.com/en-us/get-started-with-itac-for-linux)
1. [Intel® Trace Analyzer and Collector - Documentation](https://software.intel.com/en-us/intel-trace-analyzer)
# FFTW
The discrete Fourier transform in one or more dimensions, MPI parallel
FFTW is a C subroutine library for computing the discrete Fourier transform in one or more dimensions, of arbitrary input size, and of both real and complex data (as well as of even/odd data, e.g. the discrete cosine/sine transforms or DCT/DST). The FFTW library allows for MPI parallel, in-place discrete Fourier transform, with data distributed over number of nodes.
Two versions, **3.3.3** and **2.1.5** of FFTW are available on Anselm, each compiled for **Intel MPI** and **OpenMPI** using **intel** and **gnu** compilers. These are available via modules:
| Version | Parallelization | module | linker options |
| -------------- | --------------- | ------------------- | ----------------------------------- |
| FFTW3 gcc3.3.3 | pthread, OpenMP | fftw3/3.3.3-gcc | -lfftw3, -lfftw3_threads-lfftw3_omp |
| FFTW3 icc3.3.3 | pthread, OpenMP | fftw3 | -lfftw3, -lfftw3_threads-lfftw3_omp |
| FFTW2 gcc2.1.5 | pthread | fftw2/2.1.5-gcc | -lfftw, -lfftw_threads |
| FFTW2 icc2.1.5 | pthread | fftw2 | -lfftw, -lfftw_threads |
| FFTW3 gcc3.3.3 | OpenMPI | fftw-mpi3/3.3.3-gcc | -lfftw3_mpi |
| FFTW3 icc3.3.3 | Intel MPI | fftw3-mpi | -lfftw3_mpi |
| FFTW2 gcc2.1.5 | OpenMPI | fftw2-mpi/2.1.5-gcc | -lfftw_mpi |
| FFTW2 gcc2.1.5 | IntelMPI | fftw2-mpi/2.1.5-gcc | -lfftw_mpi |
```console
$ ml fftw3 **or** ml FFTW
```
The module sets up environment variables, required for linking and running FFTW enabled applications. Make sure that the choice of FFTW module is consistent with your choice of MPI library. Mixing MPI of different implementations may have unpredictable results.
## Example
```cpp
#include <fftw3-mpi.h>
int main(int argc, char **argv)
{
const ptrdiff_t N0 = 100, N1 = 1000;
fftw_plan plan;
fftw_complex *data;
ptrdiff_t alloc_local, local_n0, local_0_start, i, j;
MPI_Init(&argc, &argv);
fftw_mpi_init();
/* get local data size and allocate */
alloc_local = fftw_mpi_local_size_2d(N0, N1, MPI_COMM_WORLD,
&local_n0, &local_0_start);
data = fftw_alloc_complex(alloc_local);
/* create plan for in-place forward DFT */
plan = fftw_mpi_plan_dft_2d(N0, N1, data, data, MPI_COMM_WORLD,
FFTW_FORWARD, FFTW_ESTIMATE);
/* initialize data */
for (i = 0; i < local_n0; ++i) for (j = 0; j < N1; ++j)
{ data[i*N1 + j][0] = i;
data[i*N1 + j][1] = j; }
/* compute transforms, in-place, as many times as desired */
fftw_execute(plan);
fftw_destroy_plan(plan);
MPI_Finalize();
}
```