    # OpenACC MPI Tutorial
    
    This tutorial is an excerpt from Nvidia's [5× in 5 Hours: Porting a 3D Elastic Wave Simulator to GPUs Using OpenACC][1] tutorial.
    All source code for this tutorial can be downloaded as part of this [tarball][2].
`SEISMIC_CPML`, developed by Dimitri Komatitsch and Roland Martin at the University of Pau, France,
is a set of ten open-source Fortran 90 programs.
    
    !!!note
        Before building and running each step,
        make sure that the compiler (`pgfortran`) and MPI wrappers (`mpif90`) are in your path.
    
    ## Step 0: Evaluation
    
Before you start, you should evaluate the code to determine
whether it is worth accelerating.
Compute intensity is, roughly, the number of arithmetic operations performed per memory access.
Using the compiler flag `-Minfo=intensity`, you can see
that the average compute intensity of the various loops is between 2.5 and 2.64.
As a rule, anything below 1.0 is generally not worth accelerating
unless it is part of a larger program.
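
As a rough illustration of the metric (this loop is not from `SEISMIC_CPML`;
the array names are made up), a simple vector add performs one floating-point
operation per three memory accesses, i.e. a compute intensity of about 0.3:

```code
! Hedged sketch: two loads and one store per addition gives a compute
! intensity of roughly 0.3, well below the 1.0 rule of thumb above.
  do i = 1, n
    c(i) = a(i) + b(i)
  enddo
```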
    
    To build and run the original MPI/OpenMP code on your system, do the following:
    
    ```console
    cd step0
    make build
    make run
    make verify
    ```
    
    ## Step 1: Adding Setup Code
    
    Because this is an MPI code where each process will use its own GPU,
    you need to add some utility code to ensure that happens.
The `setDevice` routine first determines which node the process is on
(via a call to `gethostid`) and then gathers the host IDs from all the other processes.
It then determines how many GPUs are available on the node
and assigns a device to each process.
    
Note that in order to maintain portability with the CPU version,
this section of code is guarded by the preprocessor macro `_OPENACC`,
which is defined when the OpenACC directives are enabled in the HPC Fortran compiler
through the use of the `-acc` command-line compiler option.
    
    ```code
    #ifdef _OPENACC
    #
    function setDevice(nprocs,myrank)
    
      use iso_c_binding
      use openacc
      implicit none
      include 'mpif.h'
    
      interface
        function gethostid() BIND(C)
          use iso_c_binding
          integer (C_INT) :: gethostid
        end function gethostid
      end interface
    
      integer :: nprocs, myrank
      integer, dimension(nprocs) :: hostids, localprocs
      integer :: hostid, ierr, numdev, mydev, i, numlocal
      integer :: setDevice
    
    ! get the hostids so we can determine what other processes are on this node
      hostid = gethostid()
      CALL mpi_allgather(hostid,1,MPI_INTEGER,hostids,1,MPI_INTEGER, &
                         MPI_COMM_WORLD,ierr)
    
    ! determine which processors are on this node
      numlocal=0
      localprocs=0
      do i=1,nprocs
        if (hostid .eq. hostids(i)) then
          localprocs(i)=numlocal
          numlocal = numlocal+1
        endif
      enddo
    
    ! get the number of devices on this node
      numdev = acc_get_num_devices(ACC_DEVICE_NVIDIA)
    
      if (numdev .lt. 1) then
    print *, 'ERROR: There are no devices available on this host.  &
             &ABORTING.', myrank
        stop
      endif
    
! print a warning if the number of devices is less than the number
! of processes on this node.  Having multiple processes share devices is not
! recommended.
  if (numdev .lt. numlocal) then
    if (localprocs(myrank+1).eq.1) then
      ! print the message only once per node
      print *, 'WARNING: The number of processes is greater than the number  &
               &of GPUs.', myrank
    endif
    mydev = mod(localprocs(myrank+1),numdev)
  else
    mydev = localprocs(myrank+1)
  endif
    
     call acc_set_device_num(mydev,ACC_DEVICE_NVIDIA)
     call acc_init(ACC_DEVICE_NVIDIA)
     setDevice = mydev
    
    end function setDevice
    #endif
    ```
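
For context, here is a minimal sketch of how `setDevice` might be called from the
main program, immediately after MPI is initialized and before any OpenACC data or
compute regions are entered. The variable names here are illustrative and not taken
from `SEISMIC_CPML`:

```code
#ifdef _OPENACC
! Hypothetical call site (sketch only): bind each MPI rank to a GPU right
! after MPI_INIT; setDevice returns the device number it selected.
! Assumes the caller declares ierr, nprocs, myrank, mydevice and uses mpif.h.
  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
  mydevice = setDevice(nprocs, myrank)
#endif
```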
    
To build and run the step1 code on your system, do the following:
    
    ```console
    cd step1
    make build
    make run
    make verify
    ```
    
    ## Step 2: Adding Compute Regions
    
    Next, you add six compute regions around the eight parallel loops.
    For example, here's the final reduction loop.
    
    ```code
    !$acc kernels
      do k = kmin,kmax
        do j = NPOINTS_PML+1, NY-NPOINTS_PML
          do i = NPOINTS_PML+1, NX-NPOINTS_PML
    
    ! compute kinetic energy first, defined as 1/2 rho ||v||^2
    ! in principle we should use rho_half_x_half_y instead of rho for vy
    ! in order to interpolate density at the right location in the staggered grid
    ! cell but in a homogeneous medium we can safely ignore it
    
          total_energy_kinetic = total_energy_kinetic + 0.5d0 * rho*( &
                  vx(i,j,k)**2 + vy(i,j,k)**2 + vz(i,j,k)**2)
    
    ! add potential energy, defined as 1/2 epsilon_ij sigma_ij
    ! in principle we should interpolate the medium parameters at the right location
    ! in the staggered grid cell but in a homogeneous medium we can safely ignore it
    
    ! compute total field from split components
          epsilon_xx = ((lambda + 2.d0*mu) * sigmaxx(i,j,k) - lambda *  &
          sigmayy(i,j,k) - lambda*sigmazz(i,j,k)) / (4.d0 * mu * (lambda + mu))
          epsilon_yy = ((lambda + 2.d0*mu) * sigmayy(i,j,k) - lambda *  &
              sigmaxx(i,j,k) - lambda*sigmazz(i,j,k)) / (4.d0 * mu * (lambda + mu))
          epsilon_zz = ((lambda + 2.d0*mu) * sigmazz(i,j,k) - lambda *  &
              sigmaxx(i,j,k) - lambda*sigmayy(i,j,k)) / (4.d0 * mu * (lambda + mu))
          epsilon_xy = sigmaxy(i,j,k) / (2.d0 * mu)
          epsilon_xz = sigmaxz(i,j,k) / (2.d0 * mu)
          epsilon_yz = sigmayz(i,j,k) / (2.d0 * mu)
    
      total_energy_potential = total_energy_potential + &
        0.5d0 * (epsilon_xx * sigmaxx(i,j,k) + epsilon_yy * sigmayy(i,j,k) + &
        epsilon_zz * sigmazz(i,j,k) + 2.d0 * epsilon_xy * sigmaxy(i,j,k) + &
        2.d0*epsilon_xz * sigmaxz(i,j,k) + 2.d0*epsilon_yz * sigmayz(i,j,k))
    
          enddo
        enddo
      enddo
    !$acc end kernels
    ```
    
The `-acc` command-line option to the HPC Accelerator Fortran compiler enables OpenACC directives.
Note that OpenACC is meant to model a generic class of devices.

Another compiler option you'll want to use during development is `-Minfo`,
which provides feedback on optimizations and transformations performed on your code.
For accelerator-specific information, use the `-Minfo=accel` sub-option.
    
    Examples of feedback messages produced when compiling `SEISMIC_CPML` include:
    
    ```console
       1113, Generating copyin(vz(11:91,11:631,kmin:kmax))
             Generating copyin(vy(11:91,11:631,kmin:kmax))
             Generating copyin(vx(11:91,11:631,kmin:kmax))
             Generating copyin(sigmaxx(11:91,11:631,kmin:kmax))
             Generating copyin(sigmayy(11:91,11:631,kmin:kmax))
             Generating copyin(sigmazz(11:91,11:631,kmin:kmax))
             Generating copyin(sigmaxy(11:91,11:631,kmin:kmax))
             Generating copyin(sigmaxz(11:91,11:631,kmin:kmax))
             Generating copyin(sigmayz(11:91,11:631,kmin:kmax))
    ```
    
    To compute on a GPU, the first step is to move data from host memory to GPU memory.
    In the example above, the compiler tells you that it is copying over nine arrays.
    
Note the `copyin` clauses.
These mean that the compiler copies the data to the GPU
but does not copy it back to the host.
    This is because line 1113 corresponds to the start of the reduction loop compute region,
    where these arrays are used but never modified.
    
Data movement clauses (a short illustrative sketch follows the list):

* `copyin` - the data is copied to the device only;
* `copy` - the data is copied to the device at the beginning of the region and copied back to the host at the end;
* `copyout` - the data is only copied back to the host.
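
As a simple illustration (not taken from `SEISMIC_CPML`; the array names are made up),
the three clauses could be combined on a single compute region like this:

```code
! Hedged sketch: a is read only (copyin), b is read and updated (copy),
! and c is written from scratch on the device (copyout).
!$acc kernels copyin(a) copy(b) copyout(c)
  do i = 1, n
    c(i) = a(i) + b(i)
    b(i) = 2.d0 * b(i)
  enddo
!$acc end kernels
```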
    
    The compiler is conservative and only copies the data
    that's actually required to perform the necessary computations.
    Unfortunately, because the interior sub-arrays are not contiguous in host memory,
    the compiler needs to generate multiple data transfers for each array.
    
    ```console
       1114, Loop is parallelizable
       1115, Loop is parallelizable
       1116, Loop is parallelizable
             Accelerator kernel generated
    ```
    
Here the compiler has performed dependence analysis
on the loops at lines 1114, 1115, and 1116 (the reduction loop shown earlier).
It finds that all three loops are parallelizable, so it generates an accelerator kernel.

The compiler may attempt to work around dependences that prevent parallelization by interchanging loops (i.e., changing their order) where it is safe to do so. At least one outer or interchanged loop must be parallel for an accelerator kernel to be generated.
    
    How the threads are organized is called the loop schedule.
    Below you can see the loop schedule for our reduction loop.
    The do loops have been replaced with a three-dimensional gang,
    which in turn is composed of a two-dimensional vector section.
    
    ```console
           1114, !$acc loop gang ! blockidx%y
           1115, !$acc loop gang, vector(4) ! blockidx%z threadidx%y
           1116, !$acc loop gang, vector(32) ! blockidx%x threadidx%x
    ```
    
    In CUDA terminology, the gang clause corresponds to a grid dimension
    and the vector clause corresponds to a thread block dimension.
    
    So here we have a 3-D array that's being grouped into blocks of 32×4 elements
    where a single thread is working on a specific element.
    Because the number of gangs is not specified in the loop schedule,
    it will be determined dynamically when the kernel is launched.
If the gang clause had a fixed width, such as `gang(16)`,
then each kernel would be written to loop over multiple elements.
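
For reference, here is a hedged sketch of how the schedule reported above could be
requested explicitly with loop directives (the loop bounds are those of the reduction
loop shown earlier; the loop body is elided):

```code
! Hedged sketch: explicitly asking for the schedule the compiler chose.
! vector(32) maps i to threadidx%x and vector(4) maps j to threadidx%y;
! the number of gangs is still determined at launch time.
!$acc kernels
!$acc loop gang
  do k = kmin,kmax
!$acc loop gang vector(4)
    do j = NPOINTS_PML+1, NY-NPOINTS_PML
!$acc loop gang vector(32)
      do i = NPOINTS_PML+1, NX-NPOINTS_PML
        ! ... loop body as in the reduction loop above ...
      enddo
    enddo
  enddo
!$acc end kernels
```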
    
With CUDA, programming reductions and managing shared memory can be a fairly difficult task.
In the example below, the compiler has automatically generated efficient code using these features.
    
    ```console
           1122, Sum reduction generated for total_energy_kinetic
           1140, Sum reduction generated for total_energy_potential
    ```
    
To build and run the step2 code on your system, do the following:
    
    ```console
    cd step2
    make build
    make run
    make verify
    ```
    
    ## Step 3: Adding Data Regions
    
    !!! tip
        Set the environment variable `PGI_ACC_TIME=1` and run your executable.
        This option prints basic profile information such as the kernel execution time,
        data transfer time, initialization time, the actual launch configuration,
        and total time spent in a compute region.
        Note that the total time is measured from the host and includes time spent executing host code within a region.
    
To improve performance, you should minimize the time spent transferring data;
this is what the data directive is for.
You can use a data region to specify exact points in your program
where data should be copied from host memory to GPU memory, and back again.
Any compute region enclosed within a data region will use the previously copied data,
without the need to copy at the boundaries of the compute region.
A data region can span across host code and multiple compute regions,
and even across subroutine boundaries.
    
Looking at the arrays in `SEISMIC_CPML`, there are 18 arrays with constant values.
Another 21 are used only within compute regions, so they are never needed on the host.
Let's start by adding a data region around the outer time step loop.
The final three arrays do need to be copied back to the host to pass their halos;
for those cases, we use the `update` directive.
    
    ```code
    !---
    !---  beginning of time loop
    !---
    !$acc data &
    !$acc copyin(a_x_half,b_x_half,k_x_half,                       &
    !$acc        a_y_half,b_y_half,k_y_half,                       &
    !$acc        a_z_half,b_z_half,k_z_half,                       &
    !$acc        a_x,a_y,a_z,b_x,b_y,b_z,k_x,k_y,k_z,              &
    !$acc        sigmaxx,sigmaxz,sigmaxy,sigmayy,sigmayz,sigmazz,  &
    !$acc        memory_dvx_dx,memory_dvy_dx,memory_dvz_dx,        &
    !$acc        memory_dvx_dy,memory_dvy_dy,memory_dvz_dy,        &
    !$acc        memory_dvx_dz,memory_dvy_dz,memory_dvz_dz,        &
    !$acc        memory_dsigmaxx_dx, memory_dsigmaxy_dy,           &
    !$acc        memory_dsigmaxz_dz, memory_dsigmaxy_dx,           &
    !$acc        memory_dsigmaxz_dx, memory_dsigmayz_dy,           &
    !$acc        memory_dsigmayy_dy, memory_dsigmayz_dz,           &
    !$acc        memory_dsigmazz_dz)
    
      do it = 1,NSTEP
    
    ...
    
    !$acc update host(sigmazz,sigmayz,sigmaxz)
    ! sigmazz(k+1), left shift
      call MPI_SENDRECV(sigmazz(:,:,1),number_of_values,MPI_DOUBLE_PRECISION, &
             receiver_left_shift,message_tag,sigmazz(:,:,NZ_LOCAL+1), &
             number_of_values,
    
    ...
    
    !$acc update device(sigmazz,sigmayz,sigmaxz)
    
    ...
    
      ! --- end of time loop
      enddo
    !$acc end data
    ```
    
Data regions can be nested, and in fact we used this feature
in the time loop body for the arrays `vx`, `vy`, and `vz`, as shown below.
These arrays are copied back and forth at the inner data region boundary,
and so are moved more often than the arrays in the outer data region,
but they are used across multiple compute regions
instead of being copied at each compute region boundary.

Note that we do not specify any array dimensions in the copy clause.
This instructs the compiler to copy each array in its entirety as a contiguous block,
and it eliminates the inefficiency we noted earlier
when interior sub-arrays were being copied in multiple blocks.
    
    ```code
    !$acc data copy(vx,vy,vz)
    
    ... data region spans over 5 compute regions and host code
    
    !$acc kernels
    
    ...
    
    !$acc end kernels
    
    !$acc end data
    ```
    
To build and run the step3 code on your system, do the following:
    
    ```console
    cd step3
    make build
    make run
    make verify
    ```
    
    ## Step 4: Optimizing Data Transfers
    
The next step further optimizes the data transfers
by migrating as much of the computation as possible to the GPU
and moving only the absolute minimum amount of data required.
The first change is to move the start of the outer data region up
so that it occurs earlier in the code, and to put the data initialization loops into compute kernels.
This includes the `vx`, `vy`, and `vz` arrays.
This approach lets you remove the inner data region used in the previous optimization step.
    
    In the following example code, notice the use of the `create` clause.
    This instructs the compiler to allocate space for variables in GPU memory for local use
    but to perform no data movement on those variables.
    Essentially they are used as scratch variables in GPU memory.
    
```code
    !$acc data                                                     &
    !$acc copyin(a_x_half,b_x_half,k_x_half,                       &
    !$acc        a_y_half,b_y_half,k_y_half,                       &
    !$acc        a_z_half,b_z_half,k_z_half,                       &
    !$acc        ix_rec,iy_rec,                                    &
    !$acc        a_x,a_y,a_z,b_x,b_y,b_z,k_x,k_y,k_z),             &
    !$acc copyout(sisvx,sisvy),                                    &
    !$acc create(memory_dvx_dx,memory_dvy_dx,memory_dvz_dx,        &
    !$acc        memory_dvx_dy,memory_dvy_dy,memory_dvz_dy,        &
    !$acc        memory_dvx_dz,memory_dvy_dz,memory_dvz_dz,        &
    !$acc        memory_dsigmaxx_dx, memory_dsigmaxy_dy,           &
    !$acc        memory_dsigmaxz_dz, memory_dsigmaxy_dx,           &
    !$acc        memory_dsigmaxz_dx, memory_dsigmayz_dy,           &
    !$acc        memory_dsigmayy_dy, memory_dsigmayz_dz,           &
    !$acc        memory_dsigmazz_dz,                               &
    !$acc        vx,vy,vz,vx1,vy1,vz1,vx2,vy2,vz2,                 &
    !$acc        sigmazz1,sigmaxz1,sigmayz1,                       &
    !$acc        sigmazz2,sigmaxz2,sigmayz2)                       &
    !$acc copyin(sigmaxx,sigmaxz,sigmaxy,sigmayy,sigmayz,sigmazz)
    
    ...
    
    ! Initialize vx, vy and vz arrays on the device
    !$acc kernels
      vx(:,:,:) = ZERO
      vy(:,:,:) = ZERO
      vz(:,:,:) = ZERO
    !$acc end kernels
    
    ...
    ```
    
One caveat to using data regions is that you must be aware of which copy
(host or device) of the data you are actually using in a given loop or computation.
For example, any update to the copy of a variable in device memory
won't be reflected in the host copy until you explicitly request it,
using either an `update` directive or a `copy` clause at a data or compute region boundary.
    
    !!! important
        Unintentional loss of coherence between the host and device copy of a variable is one of the most common causes of validation errors in OpenACC programs.
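
The sketch below (not from `SEISMIC_CPML`; `x` is a made-up array) illustrates the caveat:
inside a data region, a kernels region modifies only the device copy,
and the host copy stays stale until an explicit update is performed:

```code
! Hedged sketch: keeping the host and device copies of x coherent.
!$acc data copy(x)
!$acc kernels
  x(:) = x(:) + 1.0d0       ! modifies the device copy only
!$acc end kernels
!$acc update host(x)        ! refresh the host copy before reading it
  print *, x(1)             ! now safe to read on the host
!$acc end data
```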
    
After making the above change to `SEISMIC_CPML`, the code generated incorrect results.
Debugging revealed that the section of the time step loop
that initializes boundary conditions had been omitted from an OpenACC compute region.
As a result, we were initializing the host copy of the data
rather than the device copy as intended, which left variables in device memory uninitialized.
    
The next challenge in optimizing the data transfers was the handling of the halo regions.
`SEISMIC_CPML` passes halos from six 3-D arrays between MPI processes during the course of the computations.
    
    After some experimentation, we settled on an approach whereby we added six new temporary 2-D arrays to hold the halo data.
    Within a compute region we gathered the 2-D halos from the main 3-D arrays
    into the new temp arrays, copied the temporaries back to the host in one contiguous block,
    passed the halos between MPI processes, and finally copied the exchanged values
    back to device memory and scattered the halos back into the 3-D arrays.
    While this approach does add to the kernel execution time, it saves a considerable amount of data transfer time.
    
    In the example code below, note that the source code added to support the halo
    gathers and transfers is guarded by the preprocessor `_OPENACC` macro
    and will only be executed if the code is compiled by an OpenACC-enabled compiler.
    
    ```code
    #ifdef _OPENACC
    #
! Gather the velocity 3D arrays into 2D slices to allow for a faster
! copy from the device to the host
!$acc kernels
  do i=1,NX
    do j=1,NY
      vx1(i,j)=vx(i,j,1)
      vy1(i,j)=vy(i,j,1)
      vz1(i,j)=vz(i,j,NZ_LOCAL)
    enddo
  enddo
!$acc end kernels
!$acc update host(vx1,vy1,vz1)
    
    ! vx(k+1), left shift
      call MPI_SENDRECV(vx1(:,:), number_of_values, MPI_DOUBLE_PRECISION, &
           receiver_left_shift, message_tag, vx2(:,:), number_of_values, &
           MPI_DOUBLE_PRECISION, sender_left_shift, message_tag, MPI_COMM_WORLD,&
           message_status, code)
    
    ! vy(k+1), left shift
      call MPI_SENDRECV(vy1(:,:), number_of_values, MPI_DOUBLE_PRECISION, &
           receiver_left_shift,message_tag, vy2(:,:),number_of_values,   &
           MPI_DOUBLE_PRECISION, sender_left_shift, message_tag, MPI_COMM_WORLD,&
           message_status, code)
    
    ! vz(k-1), right shift
      call MPI_SENDRECV(vz1(:,:), number_of_values, MPI_DOUBLE_PRECISION, &
           receiver_right_shift, message_tag, vz2(:,:), number_of_values, &
           MPI_DOUBLE_PRECISION, sender_right_shift, message_tag, MPI_COMM_WORLD, &
           message_status, code)
    
    !$acc update device(vx2,vy2,vz2)
    !$acc kernels
      do i=1,NX
        do j=1,NY
          vx(i,j,NZ_LOCAL+1)=vx2(i,j)
          vy(i,j,NZ_LOCAL+1)=vy2(i,j)
          vz(i,j,0)=vz2(i,j)
        enddo
      enddo
    !$acc end kernels
    
    #else
    ```
    
To build and run the step4 code on your system, do the following:
    
    ```console
    cd step4
    make build
    make run
    make verify
    ```
    
    ## Step 5: Loop Schedule Tuning
    
    The final step is to tune the OpenACC compute region loop schedules
    using the gang, worker, and vector clauses.
    The default kernel schedules chosen by the NVIDIA OpenACC compiler are usually quite good.
    Manual tuning efforts often don't improve timings significantly,
    but it's always worthwhile to spend a little time examining
    whether you can do better by overriding compiler-generated loop schedules
    using explicit loop scheduling clauses.
    You can usually tell fairly quickly if the clauses are having an effect.
    
Note that there is no well-defined method for finding an optimal kernel schedule.
The best advice is to start with the compiler's default schedule and try small adjustments
to see if and how they affect execution time.
The kernel schedule you choose will affect whether and how shared memory is used,
how global arrays are accessed, and which other optimizations are applied.
Typically, it is better to perform gang scheduling on loops with large iteration counts.
    
    ```code
    !$acc loop gang
      do k = k2begin,NZ_LOCAL
        kglobal = k + offset_k
    !$acc loop worker vector collapse(2)
        do j = 2,NY
          do i = 2,NX
    ```
    
To build and run the step5 code on your system, do the following:
    
    ```console
    cd step5
    make build
    make run
    make verify
    ```
    
    [1]: https://docs.nvidia.com/hpc-sdk/compilers/openacc-mpi-tutorial/index.html
    [2]: https://docs.nvidia.com/hpc-sdk/compilers/openacc-mpi-tutorial/openacc-mpi-tutorial.tar.gz