you need to prepare a job script for that partition or use the interactive job:

```console
salloc -N 1 -c 64 -A PROJECT-ID -p p03-amd --gres=gpu:4 --time=08:00:00
```

where:

- `-N 1` means allocating one server,
- `-c 64` means allocating 64 cores,
- `-A` is your project,
- `-p p03-amd` is the AMD partition,
- `--gres=gpu:4` means allocating all 4 GPUs of the node,
- `--time=08:00:00` means allocation for 8 hours.

You also have the option to allocate only a subset of the resources, by reducing `-c` and `--gres=gpu` to smaller values.

```console
salloc -N 1 -c 48 -A PROJECT-ID -p p03-amd --gres=gpu:3 --time=08:00:00
salloc -N 1 -c 16 -A PROJECT-ID -p p03-amd --gres=gpu:1 --time=08:00:00
```

!!! Note
    The p03-amd01 server has hyperthreading enabled, therefore htop shows 128 cores.
    The p03-amd02 server has hyperthreading disabled, therefore htop shows 64 cores.

## Using AMD MI100 GPUs

The AMD GPUs can be programmed using the [ROCm open-source platform](https://docs.amd.com/).

ROCm and related libraries are installed directly in the system. You can find it here:

```console
/opt/rocm/
```

The actual version can be found here:

```console
[user@p03-amd02.cs]$ cat /opt/rocm/.info/version
```

## Basic HIP Code

The first way to program AMD GPUs is to use HIP.

The basic vector addition code in HIP looks like this. The full code can be copied and pasted into a file; for this example, we use `vector_add.hip.cpp`.

```console
#include <cstdio>

// ...

    for(long long i = 0; i < count; i++)
        printf(" %7.2f", h_y[i]);
    printf("\n");

    // allocation of memory on the GPU device
    float * d_x;
    float * d_y;

// ...

}
```
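
For reference, a complete, self-contained HIP vector addition in the same spirit might look as follows. This is only a sketch, not necessarily identical to the original `vector_add.hip.cpp`; the kernel name `add_vectors`, the scaling factor 100, and the launch configuration are illustrative assumptions chosen to match the printed output shown further below.

```console
#include <cstdio>
#include <hip/hip_runtime.h>

// kernel computing y = alpha * x + y element by element
__global__ void add_vectors(float * x, float * y, float alpha, long long count)
{
    long long idx = blockIdx.x * blockDim.x + threadIdx.x;
    if(idx < count)
        y[idx] += alpha * x[idx];
}

int main()
{
    // number of elements in the vectors
    long long count = 10;

    // allocation and initialization of data in host (CPU) memory
    float * h_x = new float[count];
    float * h_y = new float[count];
    for(long long i = 0; i < count; i++)
    {
        h_x[i] = i;
        h_y[i] = 10 * i;
    }

    // print the input vectors
    printf("X:");
    for(long long i = 0; i < count; i++)
        printf(" %7.2f", h_x[i]);
    printf("\n");
    printf("Y:");
    for(long long i = 0; i < count; i++)
        printf(" %7.2f", h_y[i]);
    printf("\n");

    // allocation of memory on the GPU device
    float * d_x;
    float * d_y;
    hipMalloc(&d_x, count * sizeof(float));
    hipMalloc(&d_y, count * sizeof(float));

    // copy the input data from the host to the device
    hipMemcpy(d_x, h_x, count * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_y, h_y, count * sizeof(float), hipMemcpyHostToDevice);

    // launch the kernel: 256 threads per block, enough blocks to cover all elements
    int tpb = 256;
    int blocks = (count + tpb - 1) / tpb;
    add_vectors<<< blocks, tpb >>>(d_x, d_y, 100.0f, count);
    hipDeviceSynchronize();

    // copy the result back to the host and print it
    hipMemcpy(h_y, d_y, count * sizeof(float), hipMemcpyDeviceToHost);
    printf("Y:");
    for(long long i = 0; i < count; i++)
        printf(" %7.2f", h_y[i]);
    printf("\n");

    // free the device and host memory
    hipFree(d_x);
    hipFree(d_y);
    delete[] h_x;
    delete[] h_y;
    return 0;
}
```
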
To compile the code, we use the `hipcc` compiler. The compiler information can be displayed like this:

```console
[user@p03-amd02.cs ~]$ hipcc --version
HIP version: 5.5.30202-eaf00c0b
AMD clang version 16.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-5.5.1 23194 69ef12a7c3cc5b0ccf820bc007bd87e8b3ac3037)
Thread model: posix
InstalledDir: /opt/rocm-5.5.1/llvm/bin
```

The code is compiled as follows:

```console
hipcc vector_add.hip.cpp -o vector_add.x
```

The correct output of the code is:

```console
[user@p03-amd02.cs ~]$ ./vector_add.x
X: 0.00 1.00 2.00 3.00 4.00 5.00 6.00 7.00 8.00 9.00
Y: 0.00 10.00 20.00 30.00 40.00 50.00 60.00 70.00 80.00 90.00
Y: 0.00 110.00 220.00 330.00 440.00 550.00 660.00 770.00 880.00 990.00
```

More details on HIP programming are in the [HIP Programming Guide](https://docs.a

## HIP and ROCm Libraries

The list of official AMD libraries can be found [here](https://docs.amd.com/category/libraries).

The libraries are installed in the same directory as ROCm:

```console
/opt/rocm/
```

The following libraries are installed:

```console
drwxr-xr-x 4 root root 44 Jun 7 14:09 hipblas
drwxr-xr-x 4 root root 44 Jun 7 14:09 hipsolver
drwxr-xr-x 4 root root 44 Jun 7 14:09 hipsparse
```

and

```console
drwxr-xr-x 4 root root 32 Jun 7 14:09 rocalution
drwxr-xr-x 3 root root 29 Jun 7 14:09 rocthrust
```

## Using HipBlas Library

The basic code in HIP that uses hipBlas looks like this. The full code can be copied and pasted into a file; for this example, we use `hipblas.hip.cpp`.

```console
#include <cstdio>

// ...

int main()
{
    srand(9600);

    int width = 10;

// ...

    for(int i = 0; i < width; i++)
        printf("%6.3f ", h_x[i]);
    printf("\n");

    float * h_y;
    hipHostMalloc(&h_y, height * sizeof(*h_y));
    for(int i = 0; i < height; i++)

// ...

        printf("%6.3f ", h_x[i]);
    printf("\n");

    // initialization of data in GPU memory
    float * d_A;

// ...

    float * d_x;
    hipMalloc(&d_x, width * sizeof(*d_x));
    hipMemcpy(d_x, h_x, width * sizeof(*d_x), hipMemcpyHostToDevice);

    float * d_y;
    hipMalloc(&d_y, height * sizeof(*d_y));
    hipMemcpy(d_y, h_y, height * sizeof(*d_y), hipMemcpyHostToDevice);

// ...

    for(int i = 0; i < height; i++)
        printf("%6.3f ", h_y[i]);
    printf("\n");

    // calculation of the result on the GPU using the hipBLAS library
    hipblasHandle_t blas_handle;

// ...

    hipDeviceSynchronize();
    hipblasDestroy(blas_handle);

    // copy the GPU result to CPU memory and print it
    hipMemcpy(h_y, d_y, height * sizeof(*d_y), hipMemcpyDeviceToHost);

// ...

}
```
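
For reference, a minimal self-contained hipBLAS program in the same spirit might look as follows. This is a sketch, not necessarily identical to the original `hipblas.hip.cpp`: it assumes the intended operation is the matrix-vector product `y = A * x` via `hipblasSgemv`, and the matrix dimensions and plain column-major layout (without pitched allocation) are illustrative simplifications.

```console
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <hip/hip_runtime.h>
#include <hipblas.h>

int main()
{
    srand(9600);

    // matrix dimensions: height rows, width columns, stored column-major as hipBLAS expects
    int width = 10;
    int height = 7;

    // initialization of random data in host memory
    std::vector<float> h_A(height * width);
    std::vector<float> h_x(width);
    std::vector<float> h_y(height, 0.0f);
    for(int i = 0; i < height * width; i++)
        h_A[i] = (10.0 * rand()) / RAND_MAX;
    for(int i = 0; i < width; i++)
        h_x[i] = (10.0 * rand()) / RAND_MAX;

    // allocation of device memory and host-to-device copies
    float * d_A;
    float * d_x;
    float * d_y;
    hipMalloc(&d_A, h_A.size() * sizeof(float));
    hipMalloc(&d_x, h_x.size() * sizeof(float));
    hipMalloc(&d_y, h_y.size() * sizeof(float));
    hipMemcpy(d_A, h_A.data(), h_A.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_x, h_x.data(), h_x.size() * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_y, h_y.data(), h_y.size() * sizeof(float), hipMemcpyHostToDevice);

    // y = 1.0 * A * x + 0.0 * y computed on the GPU using the hipBLAS SGEMV routine
    hipblasHandle_t blas_handle;
    hipblasCreate(&blas_handle);
    float alpha = 1.0f;
    float beta = 0.0f;
    hipblasSgemv(blas_handle, HIPBLAS_OP_N, height, width, &alpha, d_A, height, d_x, 1, &beta, d_y, 1);
    hipDeviceSynchronize();
    hipblasDestroy(blas_handle);

    // copy the GPU result to CPU memory and print it
    hipMemcpy(h_y.data(), d_y, h_y.size() * sizeof(float), hipMemcpyDeviceToHost);
    printf("y: ");
    for(int i = 0; i < height; i++)
        printf("%6.3f ", h_y[i]);
    printf("\n");

    // free the device memory
    hipFree(d_A);
    hipFree(d_x);
    hipFree(d_y);
    return 0;
}
```
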
The code compilation can be done as follows:

```console
hipcc hipblas.hip.cpp -o hipblas.x -lhipblas
```

## Using HipSolver Library

The basic code in HIP that uses hipSolver looks like this. The full code can be copied and pasted into a file; for this example, we use `hipsolver.hip.cpp`.

```console
#include <cstdio>

// ...

        for(int c = 0; c < size; c++)
            printf("%6.3f ", h_A[r * h_A_ld + c]);
        printf("\n");
    }

    std::vector<float> h_b(size);
    for(int i = 0; i < size; i++)
        h_b[i] = (10.0 * rand()) / RAND_MAX;

// ...

    float * d_b;
    hipMalloc(&d_b, size * sizeof(float));
    float * d_x;
    hipMalloc(&d_x, size * sizeof(float));

// ...

    hipMemcpy2D(d_A, d_A_pitch, h_A.data(), h_A_pitch, size * sizeof(float), size, hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b.data(), size * sizeof(float), hipMemcpyHostToDevice);

    // solving the system using hipSOLVER

// ...

    float * workspace;
    int wss = std::max(wss_trf, wss_trs);
    hipMalloc(&workspace, wss * sizeof(float));

    hipsolverSgetrf(solverHandle, size, size, d_A, d_A_ld, workspace, wss, d_piv, info);
    hipsolverSgetrs(solverHandle, HIPSOLVER_OP_N, size, 1, d_A, d_A_ld, d_piv, d_b, size, workspace, wss, info);

// ...

}
```
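
For reference, a minimal self-contained hipSOLVER program solving a dense system `A * x = b` by LU factorization might look as follows. This is a sketch, not necessarily identical to the original `hipsolver.hip.cpp`; the system size and the plain column-major storage (without pitched allocation) are illustrative simplifications.

```console
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <algorithm>
#include <hip/hip_runtime.h>
#include <hipsolver.h>

int main()
{
    srand(9600);

    // size of the dense system A * x = b, with A stored column-major
    int size = 10;

    // initialization of a random matrix A and right-hand side b in host memory
    std::vector<float> h_A(size * size);
    std::vector<float> h_b(size);
    for(int i = 0; i < size * size; i++)
        h_A[i] = (10.0 * rand()) / RAND_MAX;
    for(int i = 0; i < size; i++)
        h_b[i] = (10.0 * rand()) / RAND_MAX;

    // device memory for the matrix, right-hand side, pivot indices, and info flag
    float * d_A;
    float * d_b;
    int * d_piv;
    int * info;
    hipMalloc(&d_A, size * size * sizeof(float));
    hipMalloc(&d_b, size * sizeof(float));
    hipMalloc(&d_piv, size * sizeof(int));
    hipMalloc(&info, sizeof(int));
    hipMemcpy(d_A, h_A.data(), size * size * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(d_b, h_b.data(), size * sizeof(float), hipMemcpyHostToDevice);

    // solving the system using hipSOLVER: LU factorization (getrf) followed by the solve (getrs)
    hipsolverHandle_t solverHandle;
    hipsolverCreate(&solverHandle);

    // query the workspace sizes and allocate a single buffer large enough for both steps
    int wss_trf, wss_trs;
    hipsolverSgetrf_bufferSize(solverHandle, size, size, d_A, size, &wss_trf);
    hipsolverSgetrs_bufferSize(solverHandle, HIPSOLVER_OP_N, size, 1, d_A, size, d_piv, d_b, size, &wss_trs);
    float * workspace;
    int wss = std::max(wss_trf, wss_trs);
    hipMalloc(&workspace, wss * sizeof(float));

    hipsolverSgetrf(solverHandle, size, size, d_A, size, workspace, wss, d_piv, info);
    hipsolverSgetrs(solverHandle, HIPSOLVER_OP_N, size, 1, d_A, size, d_piv, d_b, size, workspace, wss, info);
    hipDeviceSynchronize();
    hipsolverDestroy(solverHandle);

    // the solution x overwrites b; copy it back to the host and print it
    std::vector<float> h_x(size);
    hipMemcpy(h_x.data(), d_b, size * sizeof(float), hipMemcpyDeviceToHost);
    printf("x: ");
    for(int i = 0; i < size; i++)
        printf("%6.3f ", h_x[i]);
    printf("\n");

    // free the device memory
    hipFree(workspace);
    hipFree(d_A);
    hipFree(d_b);
    hipFree(d_piv);
    hipFree(info);
    return 0;
}
```
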
The code compilation can be done as follows:

```console
hipcc hipsolver.hip.cpp -o hipsolver.x -lhipblas -lhipsolver
```

## Using OpenMP Offload to Program AMD GPUs

The ROCm™ installation includes an LLVM-based implementation that fully supports the OpenMP 4.5 standard and a subset of the OpenMP 5.0 standard. Fortran, C/C++ compilers, and corresponding runtime libraries are included.

The OpenMP toolchain is automatically installed as part of the standard ROCm installation; among its sub-directories are:

- `lib` : Libraries including those required for target offload.
- `lib-debug` : Debug versions of the above libraries.

More information can be found in the [AMD OpenMP Support Guide](https://docs.amd.com/bundle/OpenMP-Support-Guide-v5.5/page/Introduction_to_OpenMP_Support_Guide.html).

## Compilation of OpenMP Code

A basic example that uses OpenMP offload is shown below. Again, the full code can be copied and pasted into a file. Here we use `vadd.cpp`.

```console
#include <cstdio>

// ...

    for(long long i = 0; i < print_count; i++)
        printf("%3lld ", a[i]);
    printf("\n");

    printf("B: ");
    for(long long i = 0; i < print_count; i++)
        printf("%3lld ", b[i]);

// ...

}
```
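
For reference, a minimal self-contained OpenMP offload vector addition in the same spirit might look as follows. This is a sketch, not necessarily identical to the original `vadd.cpp`; the vector length and the explicit `map` clauses are illustrative choices.

```console
#include <cstdio>
#include <vector>

int main(int argc, char ** argv)
{
    long long count = 1000;
    long long print_count = 10;

    // initialization of the input vectors in host memory
    std::vector<long long> a(count), b(count), c(count);
    for(long long i = 0; i < count; i++)
    {
        a[i] = i;
        b[i] = 10 * i;
    }

    long long * pa = a.data();
    long long * pb = b.data();
    long long * pc = c.data();

    // vector addition c = a + b offloaded to the GPU
    #pragma omp target teams distribute parallel for map(to: pa[0:count], pb[0:count]) map(from: pc[0:count])
    for(long long i = 0; i < count; i++)
        pc[i] = pa[i] + pb[i];

    // print the first few elements of each vector
    printf("A: ");
    for(long long i = 0; i < print_count; i++)
        printf("%3lld ", a[i]);
    printf("\n");

    printf("B: ");
    for(long long i = 0; i < print_count; i++)
        printf("%3lld ", b[i]);
    printf("\n");

    printf("C: ");
    for(long long i = 0; i < print_count; i++)
        printf("%3lld ", c[i]);
    printf("\n");

    return 0;
}
```
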
This code can be compiled like this:

```console
/opt/rocm/llvm/bin/clang++ -O3 -target x86_64-pc-linux-gnu -fopenmp -fopenmp-targets=amdgcn-amd-amdhsa -Xopenmp-target=amdgcn-amd-amdhsa -march=gfx908 vadd.cpp -o vadd.x
```

These options are required for target offload from an OpenMP program:

- `-target x86_64-pc-linux-gnu`
- `-fopenmp`
- `-fopenmp-targets=amdgcn-amd-amdhsa`
- `-Xopenmp-target=amdgcn-amd-amdhsa`

The following flag specifies the GPU architecture of the targeted GPU. You need to change this when moving, for instance, to LUMI with its MI250X GPUs. The MI100 GPUs present in CS have the code `gfx908` (a way to check this is shown below):

- `-march=gfx908`
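
If you are not sure which architecture code the GPUs of a given node use, one way to check (assuming the standard ROCm tools, which are installed on this system) is to search the `rocminfo` output for the `gfx` identifier:

```console
[user@p03-amd02.cs ~]$ rocminfo | grep gfx
```
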
Note: You also have to include one of the optimization flags `-O0`, `-O2`, or `-O3`. Without such a flag, the execution of the compiled code fails.