# Running OpenMPI

## OpenMPI Program Execution

OpenMPI programs may be executed only via the PBS Workload Manager, by entering an appropriate queue. On the cluster, the OpenMPI-based MPI implementation is **OpenMPI 1.8.6**.

### Basic Usage

Use mpiexec to run OpenMPI code.

Example:

```console
$ qsub -q qexp -l select=4:ncpus=24 -I
    qsub: waiting for job 15210.isrv5 to start
    qsub: job 15210.isrv5 ready
$ pwd
    /home/username
$ ml OpenMPI
$ mpiexec -pernode ./helloworld_mpi.x
    Hello world! from rank 0 of 4 on host r1i0n17
    Hello world! from rank 1 of 4 on host r1i0n5
    Hello world! from rank 2 of 4 on host r1i0n6
    Hello world! from rank 3 of 4 on host r1i0n7
```

Please be aware that in this example, the **-pernode** directive is used to run only **one task per node**, which is normally unwanted behaviour (unless you want to run hybrid code with just one MPI process and 24 OpenMP threads per node). In normal MPI programs, **omit the -pernode directive** to run up to 24 MPI tasks per node.

In this example, we allocate 4 nodes via the express queue interactively. We set up the OpenMPI environment and interactively run the helloworld_mpi.x program.

Note that the executable helloworld_mpi.x must be available within the same path on all nodes. This is automatically fulfilled on the /home and /scratch filesystems.
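
The examples in this document assume an already compiled MPI binary. A minimal sketch of how such a binary might be built with the OpenMPI compiler wrapper (the source file name helloworld_mpi.c is an assumption, not a file provided on the cluster):

```console
$ ml OpenMPI
$ mpicc -o helloworld_mpi.x helloworld_mpi.c
```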
    
You need to preload the executable if running on the local ramdisk /tmp filesystem:

```console
$ pwd
    /tmp/pbs.15210.isrv5
$ mpiexec -pernode --preload-binary ./helloworld_mpi.x
    Hello world! from rank 0 of 4 on host r1i0n17
    Hello world! from rank 1 of 4 on host r1i0n5
    Hello world! from rank 2 of 4 on host r1i0n6
    Hello world! from rank 3 of 4 on host r1i0n7
```

In this example, we assume the executable helloworld_mpi.x is present on compute node r1i0n17 on the ramdisk. We call mpiexec with the **--preload-binary** argument (valid for OpenMPI). The mpiexec will copy the executable from r1i0n17 to the /tmp/pbs.15210.isrv5 directory on r1i0n5, r1i0n6 and r1i0n7 and execute the program.

MPI process mapping may be controlled by PBS parameters.

The mpiprocs and ompthreads parameters allow for selection of the number of MPI processes per node as well as the number of OpenMP threads per MPI process.
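
The following examples use interactive sessions, but the same select/mpiprocs/ompthreads syntax applies to batch submission. A sketch of a non-interactive run (the script name job.sh and the qprod queue are assumptions; adjust them to your project setup):

```console
$ cat job.sh
    #!/bin/bash
    cd $PBS_O_WORKDIR
    ml OpenMPI
    mpiexec ./helloworld_mpi.x
$ qsub -q qprod -l select=4:ncpus=24:mpiprocs=24:ompthreads=1 job.sh
```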
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    
    
    David Hrbáč's avatar
    David Hrbáč committed
    ### One MPI Process Per Node
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    
    
    Follow this example to run one MPI process per node, 24 threads per process.
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    ```console
    $ qsub -q qexp -l select=4:ncpus=24:mpiprocs=1:ompthreads=24 -I
    $ ml OpenMPI
    $ mpiexec --bind-to-none ./helloworld_mpi.x
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    ```
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    In this example, we demonstrate recommended way to run an MPI application, using 1 MPI processes per node and 24 threads per socket, on 4 nodes.
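
The ompthreads parameter typically sets OMP_NUM_THREADS in the job environment; if you need to set or override it explicitly for the remote ranks, OpenMPI's -x option forwards an environment variable to all processes. A sketch, assuming the application reads OMP_NUM_THREADS:

```console
$ export OMP_NUM_THREADS=24
$ mpiexec --bind-to-none -x OMP_NUM_THREADS ./helloworld_mpi.x
```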
    
### Two MPI Processes Per Node

Follow this example to run two MPI processes per node, 12 threads per process. Note the options to mpiexec.

```console
$ qsub -q qexp -l select=4:ncpus=24:mpiprocs=2:ompthreads=12 -I
$ ml OpenMPI
$ mpiexec -bysocket -bind-to-socket ./helloworld_mpi.x
```

In this example, we demonstrate the recommended way to run an MPI application, using 2 MPI processes per node and 12 threads per socket, each process and its threads bound to a separate processor socket of the node, on 4 nodes.
    
### 24 MPI Processes Per Node

Follow this example to run 24 MPI processes per node, 1 thread per process. Note the options to mpiexec.

```console
$ qsub -q qexp -l select=4:ncpus=24:mpiprocs=24:ompthreads=1 -I
$ ml OpenMPI
$ mpiexec -bycore -bind-to-core ./helloworld_mpi.x
```

In this example, we demonstrate the recommended way to run an MPI application, using 24 MPI processes per node, single threaded. Each process is bound to a separate processor core, on 4 nodes.
    
### OpenMP Thread Affinity

!!! note
    Important! Bind every OpenMP thread to a core!

In the previous two examples with one or two MPI processes per node, the operating system might still migrate OpenMP threads between cores. You might want to avoid this by setting this environment variable for GCC OpenMP:

```console
$ export GOMP_CPU_AFFINITY="0-23"
```

or this one for Intel OpenMP:

```console
$ export KMP_AFFINITY=granularity=fine,compact,1,0
```

As of OpenMP 4.0 (supported by GCC 4.9 and later and Intel 14.0 and later), the following variables may be used for Intel or GCC:

```console
$ export OMP_PROC_BIND=true
$ export OMP_PLACES=cores
```
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    
    
    David Hrbáč's avatar
    David Hrbáč committed
    ## OpenMPI Process Mapping and Binding
    
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    The mpiexec allows for precise selection of how the MPI processes will be mapped to the computational nodes and how these processes will bind to particular processor sockets and cores.
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    MPI process mapping may be specified by a hostfile or rankfile input to the mpiexec program. Altough all implementations of MPI provide means for process mapping and binding, following examples are valid for the openmpi only.
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    
    ### Hostfile
    
    Example hostfile
    
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    ```console
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
        r1i0n17.smc.salomon.it4i.cz
        r1i0n5.smc.salomon.it4i.cz
        r1i0n6.smc.salomon.it4i.cz
        r1i0n7.smc.salomon.it4i.cz
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    ```
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    
    Use the hostfile to control process placement
    
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    ```console
    $ mpiexec -hostfile hostfile ./helloworld_mpi.x
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
        Hello world! from rank 0 of 4 on host r1i0n17
        Hello world! from rank 1 of 4 on host r1i0n5
        Hello world! from rank 2 of 4 on host r1i0n6
        Hello world! from rank 3 of 4 on host r1i0n7
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    ```
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    
    
    Lukáš Krupčík's avatar
    Lukáš Krupčík committed
    In this example, we see that ranks have been mapped on nodes according to the order in which nodes show in the hostfile
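
Inside a PBS job, a hostfile like the one above can typically be derived from the list of allocated nodes. A sketch using the standard $PBS_NODEFILE variable (the awk one-liner removes duplicate entries while preserving the allocation order; the exact host name format may differ from the example above):

```console
$ awk '!seen[$0]++' $PBS_NODEFILE > hostfile
$ mpiexec -hostfile hostfile ./helloworld_mpi.x
```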
    
### Rankfile

Exact control of MPI process placement and resource binding is provided by specifying a rankfile.

Appropriate binding may boost performance of your application.

Example rankfile:

```console
    rank 0=r1i0n7.smc.salomon.it4i.cz slot=1:0,1
    rank 1=r1i0n6.smc.salomon.it4i.cz slot=0:*
    rank 2=r1i0n5.smc.salomon.it4i.cz slot=1:1-2
    rank 3=r1i0n17.smc.salomon slot=0:1,1:0-2
    rank 4=r1i0n6.smc.salomon.it4i.cz slot=0:*,1:*
```

This rankfile assumes 5 ranks will be running on 4 nodes and provides exact mapping and binding of the processes to the processor sockets and cores.

Explanation:

* rank 0 will be bound to r1i0n7, socket1 core0 and core1
* rank 1 will be bound to r1i0n6, socket0, all cores
* rank 2 will be bound to r1i0n5, socket1, core1 and core2
* rank 3 will be bound to r1i0n17, socket0 core1, socket1 core0, core1 and core2
* rank 4 will be bound to r1i0n6, all cores on both sockets

```console
$ mpiexec -n 5 -rf rankfile --report-bindings ./helloworld_mpi.x
    [r1i0n17:11180]  MCW rank 3 bound to socket 0[core 1] socket 1[core 0-2]: [. B . . . . . . . . . .][B B B . . . . . . . . .] (slot list 0:1,1:0-2)
    [r1i0n7:09928] MCW rank 0 bound to socket 1[core 0-1]: [. . . . . . . . . . . .][B B . . . . . . . . . .] (slot list 1:0,1)
    [r1i0n6:10395] MCW rank 1 bound to socket 0[core 0-7]: [B B B B B B B B B B B B][. . . . . . . . . . . .] (slot list 0:*)
    [r1i0n5:10406]  MCW rank 2 bound to socket 1[core 1-2]: [. . . . . . . . . . . .][. B B . . . . . . . . .] (slot list 1:1-2)
    [r1i0n6:10406]  MCW rank 4 bound to socket 0[core 0-7] socket 1[core 0-7]: [B B B B B B B B B B B B][B B B B B B B B B B B B] (slot list 0:*,1:*)
    Hello world! from rank 3 of 5 on host r1i0n17
    Hello world! from rank 1 of 5 on host r1i0n6
    Hello world! from rank 0 of 5 on host r1i0n7
    Hello world! from rank 4 of 5 on host r1i0n6
    Hello world! from rank 2 of 5 on host r1i0n5
```

In this example, we run 5 MPI processes (5 ranks) on four nodes. The rankfile defines how the processes will be mapped onto the nodes, sockets and cores. The **--report-bindings** option was used to print out the actual process location and bindings. Note that ranks 1 and 4 run on the same node and their core binding overlaps.

It is the user's responsibility to provide the correct number of ranks, sockets and cores.
    
### Bindings Verification

In all cases, binding and threading may be verified by executing, for example:

```console
$ mpiexec -bysocket -bind-to-socket --report-bindings echo
$ mpiexec -bysocket -bind-to-socket numactl --show
$ mpiexec -bysocket -bind-to-socket echo $OMP_NUM_THREADS
```
    
## Changes in OpenMPI 1.8

Some options have changed in OpenMPI version 1.8.

| version 1.6.5    | version 1.8.1       |
| ---------------- | ------------------- |
| --bind-to-none   | --bind-to none      |
| --bind-to-core   | --bind-to core      |
| --bind-to-socket | --bind-to socket    |
| -bysocket        | --map-by socket     |
| -bycore          | --map-by core       |
| -pernode         | --map-by ppr:1:node |
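
For illustration, the two-MPI-processes-per-node example above translates to the 1.8 syntax as follows (a sketch derived from the table):

```console
$ mpiexec --map-by socket --bind-to socket ./helloworld_mpi.x
```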