The `-acc` command line option to the HPC Accelerator Fortran compiler enables OpenACC directives. Note that OpenACC is meant to model a generic class of devices.
Another compiler option you'll want to use during development is `-Minfo`,
which provides feedback on optimizations and transformations performed on your code.
For accelerator-specific information, use the `-Minfo=accel` sub-option.
Examples of feedback messages produced when compiling `SEISMIC_CPML` are discussed below.
To compute on a GPU, the first step is to move data from host memory to GPU memory.
In its feedback, the compiler reports that it is copying over nine arrays.
Note the `copyin` operations: they mean that the compiler copies the data to the GPU
but does not copy it back to the host.
This is because line 1113 corresponds to the start of the reduction loop compute region,
where these arrays are used but never modified.
The data movement clauses are:

* `copyin` - the data is copied to the device at the beginning of the region but not copied back;
* `copy` - the data is copied to the device at the beginning of the region and copied back to the host at the end of the region;
* `copyout` - the data is only copied back to the host at the end of the region.
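The three clauses can be illustrated with a short sketch. It is written in C for brevity (SEISMIC_CPML itself is Fortran), `scale_add` is a hypothetical helper rather than code from the application, and without an OpenACC-enabled compiler the pragmas are simply ignored and the loop runs on the host:

```c
#include <stddef.h>

/* `a` is both read and written, so it uses `copy`; `b` is only read
 * (`copyin`); `out` is only written (`copyout`). Hypothetical helper,
 * not part of SEISMIC_CPML. */
void scale_add(const double *b, double *out, double *a, size_t n)
{
    #pragma acc data copyin(b[0:n]) copy(a[0:n]) copyout(out[0:n])
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; i++) {
            a[i] = 2.0 * a[i];     /* old value needed on device, new value
                                      needed on host afterwards: copy    */
            out[i] = a[i] + b[i];  /* only ever written on device: copyout */
        }
    }
}
```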
The compiler is conservative and only copies the data
that's actually required to perform the necessary computations.
Unfortunately, because the interior sub-arrays are not contiguous in host memory,
the compiler needs to generate multiple data transfers for each array.
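The reason is easy to see on the host side as well: in a row-major (or, for Fortran, column-major) array, the interior of each row is a separate contiguous run. The sketch below (in C, with a hypothetical `gather_interior` helper not taken from SEISMIC_CPML) packs the interior of a 2-D array into a contiguous buffer, needing one `memcpy` per interior row, which mirrors the one-transfer-per-run behavior of the compiler:

```c
#include <string.h>

/* Gather the interior of an n x m row-major array (excluding a
 * one-cell border) into a contiguous buffer: one memcpy per interior
 * row, because the interior rows are not adjacent in memory. */
void gather_interior(const double *a, int n, int m, double *buf)
{
    for (int i = 1; i < n - 1; i++)
        memcpy(buf + (i - 1) * (m - 2),      /* packed destination      */
               a + i * m + 1,                /* strided source row      */
               (size_t)(m - 2) * sizeof(double));
}
```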
```console
1114, Loop is parallelizable
1115, Loop is parallelizable
1116, Loop is parallelizable
Accelerator kernel generated
```
Here the compiler has performed dependence analysis
on the loops at lines 1114, 1115, and 1116 (the reduction loop shown earlier).
It finds that all three loops are parallelizable, so it generates an accelerator kernel.
The compiler may attempt to work around dependences that prevent parallelization by interchanging loops (i.e., changing their order) where it is safe to do so. At least one outer or interchanged loop must be parallel for an accelerator kernel to be generated.
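A minimal sketch of the kind of loop nest where interchange helps (in C; `sweep` is a hypothetical function, not code from SEISMIC_CPML, and the pragma is ignored by non-OpenACC compilers):

```c
/* The inner i loop carries a recurrence (a[i][j] depends on
 * a[i-1][j]) and cannot run in parallel, but the j loop is fully
 * independent. Writing (or letting the compiler rewrite) the nest
 * with the parallel j loop outermost allows a kernel to be
 * generated. */
void sweep(double a[][8], int n)
{
    #pragma acc parallel loop
    for (int j = 0; j < 8; j++)        /* parallel after interchange */
        for (int i = 1; i < n; i++)    /* sequential recurrence      */
            a[i][j] += a[i - 1][j];
}
```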
How the threads are organized is called the loop schedule.
In the schedule the compiler chose for our reduction loop,
the do loops have been replaced with a three-dimensional gang,
which in turn is composed of a two-dimensional vector section.
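A schedule can also be written out explicitly with `gang`, `worker`, and `vector` clauses on the loop directives. The sketch below (in C, with illustrative sizes; this is an example of the notation, not the schedule the compiler actually chose for SEISMIC_CPML) distributes a triple nest across the three levels of parallelism:

```c
/* Explicit schedule clauses on a triple-nested loop: gangs across k,
 * workers across j, and a vector of length 32 across the innermost i
 * loop. Without an OpenACC compiler the pragmas are ignored and the
 * loops run sequentially with the same result. */
void init3d(double *v, int nz, int ny, int nx)
{
    #pragma acc parallel loop gang
    for (int k = 0; k < nz; k++)
        #pragma acc loop worker
        for (int j = 0; j < ny; j++)
            #pragma acc loop vector(32)
            for (int i = 0; i < nx; i++)
                v[(k * ny + j) * nx + i] = (double)(k + j + i);
}
```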
One caveat to using data regions is that you must be aware of which copy
(host or device) of the data you are actually using in a given loop or computation.
For example, any update to the copy of a variable in device memory
is not reflected in the host copy until you explicitly transfer it,
using either an `update` directive or a `copy` clause at a data or compute region boundary.
!!! important
Unintentional loss of coherence between the host and device copy of a variable is one of the most common causes of validation errors in OpenACC programs.
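The coherence issue can be sketched as follows (in C; `checksum_after_bump` is a hypothetical function, not code from SEISMIC_CPML, and without an OpenACC compiler the pragmas are ignored so host and device copies coincide):

```c
#include <stddef.h>

/* Inside a data region the device copy of `a` is modified; the
 * `update host` directive makes that change visible to host code
 * that reads `a` before the region ends. Omitting it would leave the
 * host reading stale data on an accelerator. */
double checksum_after_bump(double *a, size_t n)
{
    double sum = 0.0;
    #pragma acc data copy(a[0:n])
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; i++)
            a[i] += 1.0;                  /* updates the device copy   */

        #pragma acc update host(a[0:n])   /* refresh the host copy     */

        for (size_t i = 0; i < n; i++)    /* host code reads the data  */
            sum += a[i];
    }
    return sum;
}
```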
After making the above change to `SEISMIC_CPML`, the code generated incorrect results. Debugging revealed that the section of the time-step loop
that initializes the boundary conditions had been left outside an OpenACC compute region.
As a result, we were initializing the host copy of the data
rather than the device copy as intended, leaving uninitialized variables in device memory.
The next challenge in optimizing the data transfers was the handling of the halo regions.
`SEISMIC_CPML` passes halos from six 3-D arrays between MPI processes during the course of the computations.
After some experimentation, we settled on an approach whereby we added six new temporary 2-D arrays to hold the halo data.
Within a compute region we gathered the 2-D halos from the main 3-D arrays
into the new temp arrays, copied the temporaries back to the host in one contiguous block,
passed the halos between MPI processes, and finally copied the exchanged values
back to device memory and scattered the halos back into the 3-D arrays.
While this approach does add to the kernel execution time, it saves a considerable amount of data transfer time.
In the example code below, note that the source code added to support the halo
gathers and transfers is guarded by the preprocessor `_OPENACC` macro
and will only be executed if the code is compiled by an OpenACC-enabled compiler.
```fortran
#ifdef _OPENACC
!
! Gather the sigma 3D arrays to a 2D slice to allow for faster