diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000000000000000000000000000000000000..d835df7a68539f6eab4dab4137223aceea4a1617
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,2 @@
+*.o
+*.x
diff --git a/01_vector_add/Makefile b/01_vector_add/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..212d4a7a951266d76e52242e337c2116e51967d2
--- /dev/null
+++ b/01_vector_add/Makefile
@@ -0,0 +1,16 @@
+
+.PHONY: compile clean run
+
+
+
+compile:
+	hipcc -g -O2 vector_add.hip.cpp -o vector_add.x
+
+clean:
+	rm -f *.x
+
+run:
+	./vector_add.x
+
+
+
diff --git a/01_vector_add/README.md b/01_vector_add/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..997211aa0fe7c1f8932c9729483d0b86d15a4ea0
--- /dev/null
+++ b/01_vector_add/README.md
@@ -0,0 +1,17 @@
+
+HIP example: vector add (saxpy)
+===============================
+
+This program generates two arrays of numbers and performs the classic saxpy operation on them, Y = Y + a * X, using HIP.
+
+At the top of the file we `#include` the `hip/hip_runtime.h` header, which contains declarations of all the HIP-related functions and macros. In the `main` function, we first allocate the data arrays on the host (CPU memory), initialize them, and print them. Then we allocate memory on the GPU device using the `hipMalloc` function and copy the data from the host to the device using the `hipMemcpy` function. Next, we launch the computational kernel performing the saxpy operation using the triple-chevron syntax (the `hipLaunchKernelGGL` function can be used equivalently), after which the results are copied back to host memory using `hipMemcpy`. Finally, we print the results and free the allocated memory, using `hipFree` for the device-side arrays and classic `delete[]` for the host-side buffers.
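+
+For reference, these are the two equivalent kernel launch syntaxes; the source keeps the second one as a comment:
+```
+add_vectors<<< 20, 128 >>>(d_x, d_y, 100, count);
+hipLaunchKernelGGL(add_vectors, 20, 128, 0, 0, d_x, d_y, 100, count);
+```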
+
+
+
+### Compilation
+
+```
+hipcc vector_add.hip.cpp -o vector_add.x
+```
+or consult the attached Makefile.
+
diff --git a/01_vector_add/vector_add.hip.cpp b/01_vector_add/vector_add.hip.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..2727d734d47363c88409b8a3593c9a27713e822d
--- /dev/null
+++ b/01_vector_add/vector_add.hip.cpp
@@ -0,0 +1,72 @@
+#include <cstdio>
+#include <hip/hip_runtime.h>
+
+
+
+__global__ void add_vectors(float * x, float * y, float alpha, int count)
+{
+    // loop through all the elements in the vector
+    for(long long idx = blockIdx.x * blockDim.x + threadIdx.x; idx < count; idx += blockDim.x * gridDim.x)
+    {
+        y[idx] += alpha * x[idx];
+    }
+}
+
+
+
+int main()
+{
+    // number of elements in the vectors
+    long long count = 10;
+
+    // allocation and initialization of data on the host (CPU memory)
+    float * h_x = new float[count];
+    float * h_y = new float[count];
+    for(long long i = 0; i < count; i++)
+    {
+        h_x[i] = i;
+        h_y[i] = 10 * i;
+    }    
+
+    // print the input data
+    printf("X:");
+    for(long long i = 0; i < count; i++)
+        printf(" %7.2f", h_x[i]);
+    printf("\n");
+    printf("Y:");
+    for(long long i = 0; i < count; i++)
+        printf(" %7.2f", h_y[i]);
+    printf("\n");
+    
+    // allocation of memory on the GPU device
+    float * d_x;
+    float * d_y;
+    hipMalloc(&d_x, count * sizeof(float));
+    hipMalloc(&d_y, count * sizeof(float));
+
+    // copy the data from host memory to the device
+    hipMemcpy(d_x, h_x, count * sizeof(float), hipMemcpyHostToDevice);
+    hipMemcpy(d_y, h_y, count * sizeof(float), hipMemcpyHostToDevice);
+
+    // launch the kernel on the GPU
+    add_vectors<<< 20,128 >>>(d_x, d_y, 100, count);
+    // hipLaunchKernelGGL(add_vectors, 20, 128, 0, 0, d_x, d_y, 100, count);
+
+    // copy the result back to CPU memory
+    hipMemcpy(h_y, d_y, count * sizeof(float), hipMemcpyDeviceToHost);
+
+    // print the results
+    printf("Y:");
+    for(long long i = 0; i < count; i++)
+        printf(" %7.2f", h_y[i]);
+    printf("\n");
+
+    // free the allocated memory
+    hipFree(d_x);
+    hipFree(d_y);
+    delete[] h_x;
+    delete[] h_y;
+
+    return 0;
+}
+
diff --git a/02_complicated_reduction/Makefile b/02_complicated_reduction/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..84ad9f842909aca58bdd3e28f2a193b13f709f36
--- /dev/null
+++ b/02_complicated_reduction/Makefile
@@ -0,0 +1,16 @@
+
+.PHONY: compile clean run
+
+
+
+compile:
+	hipcc -g -O2 reduction.hip.cpp -o reduction.x
+
+clean:
+	rm -f *.x
+
+run:
+	./reduction.x
+
+
+
diff --git a/02_complicated_reduction/README.md b/02_complicated_reduction/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..1d642bc028eae8f5b0a8d3a4ac49baf49ff28171
--- /dev/null
+++ b/02_complicated_reduction/README.md
@@ -0,0 +1,20 @@
+
+HIP example: complicated reduction
+==================================
+
+This example demonstrates as many of the fundamental and advanced features of HIP as possible.
+
+It allocates an array of numbers and performs the following operation on it: for every element it calculates the difference `data[i+1]-data[i]` (wrapping around at the end of the array) and takes the sine of that difference multiplied by a constant. All these values are then summed together, producing the final result.
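+
+Written out, the computed quantity is the following (the scaling constants `alpha` and `beta` live in constant memory and are set to `2*pi` and `10` in this example):
+```
+result = sum over i of  beta * sin(alpha * (data[(i+1) % count] - data[i]))
+```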
+
+Let us examine the computation kernel `calculate` in more detail. We create a **dynamic shared memory** array to later store the data. The first thread of each block prints a message that the threadblock has started execution, using the **printf function in the kernel**. Then the threadblock loops through all the data. The required data are copied from global memory into the shared memory array, followed by **synchronization of all threads within the threadblock**. Then the difference is calculated, along with its sine using `sinf`, one of the **math functions**. The input and output are scaled using values from a **constant memory** variable. The values are then summed within a warp using **warp shuffle functions**, while keeping the code **wave-aware** (functional across multiple possible warp sizes). Finally, the first lane of each warp contributes the warp's partial sum to the global result using **atomic add**. At the end of the kernel, a message about the finished execution of the threadblock is again displayed using `printf`.
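+
+The wave-aware intra-warp reduction from the kernel looks as follows; because the loop bounds are derived from `warpSize`, it works for both 32-wide and 64-wide warps:
+```
+float warp_sum = my_val;
+for(int i = warpSize >> 1; i > 0; i >>= 1)
+    warp_sum += __shfl_down(warp_sum, i);
+if(lane == 0)
+    atomicAdd(result, warp_sum);
+```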
+
+In the `main` function we allocate a **managed memory** array using the `hipMallocManaged` function and randomly initialize it on the host (using the CPU). We create a **stream**, which we will use to execute operations asynchronously. The scalar holding the summation result has to be allocated in device memory; we asynchronously copy the value zero into it. Then we **initialize the constant memory** variable by asynchronously copying to it using the `hipMemcpyToSymbolAsync` function (note the `HIP_SYMBOL` macro around the constant variable name). Then the **computational kernel is launched asynchronously** in the created stream, passing the desired dynamic shared memory size as a kernel launch parameter. Then we submit an **asynchronous copy** of the resulting sum back to CPU memory. We **synchronize with the stream** (wait for all the operations in the stream to finish) using the `hipStreamSynchronize` function, after which we destroy the stream, free the allocated memory, and print the final result (which should be around 9733.31).
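+
+The asynchronous part of the host code boils down to this sequence, with all operations ordered by the same stream:
+```
+hipMemcpyToSymbolAsync(HIP_SYMBOL(c_params), &params, sizeof(Parameters), 0, hipMemcpyHostToDevice, stream);
+calculate<<<20, tpb, shmem_size, stream>>>(data, count, d_result);
+hipMemcpyAsync(&result, d_result, sizeof(float), hipMemcpyDeviceToHost, stream);
+hipStreamSynchronize(stream);
+```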
+
+
+
+### Compilation
+
+```
+hipcc reduction.hip.cpp -o reduction.x
+```
+or consult the attached Makefile.
diff --git a/02_complicated_reduction/reduction.hip.cpp b/02_complicated_reduction/reduction.hip.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..cc53f44ff4e6e8d1c67307da83954ad58882e241
--- /dev/null
+++ b/02_complicated_reduction/reduction.hip.cpp
@@ -0,0 +1,121 @@
+#include <cstdio>
+#include <cstdlib>
+#include <hip/hip_runtime.h>
+
+//   shmem (normal and dynamic)
+//   constant
+//   unified
+//   math functions
+//   __syncthread()
+//   streams
+//   warp shuffle
+//   printf
+//   templated kernel
+
+
+
+struct Parameters
+{
+    float alpha;
+    float beta;
+};
+
+
+
+// variable in constant memory
+__device__ __constant__ Parameters c_params;
+
+
+
+__global__ void calculate(float * data, int count, float * result)
+{
+    // dynamic shared memory
+    extern __shared__ float shmem_data[];
+
+    if(threadIdx.x == 0)
+        printf("Block %3d started\n", (int)blockIdx.x);
+
+    int tid = blockIdx.x * blockDim.x + threadIdx.x;
+    int tpg = blockDim.x * gridDim.x;
+    int lane = threadIdx.x & (warpSize - 1);
+
+    for(int idx = tid; idx < count; idx += tpg)
+    {
+        shmem_data[threadIdx.x] = data[idx];
+        if(threadIdx.x == 0)
+            shmem_data[blockDim.x] = data[(idx + blockDim.x) % count];
+        // thread synchronization
+        __syncthreads();
+
+        float my_diff = shmem_data[threadIdx.x+1] - shmem_data[threadIdx.x];
+        // math functions
+        // use of constant memory
+        float my_val = c_params.beta * sinf(c_params.alpha * my_diff);
+
+        // intra-warp reduction using warp shuffle functions
+        // wave-aware code
+        float warp_sum = my_val;
+        for(int i = warpSize >> 1; i > 0; i >>= 1)
+            warp_sum += __shfl_down(warp_sum, i);
+
+        // atomic operations (although slow on float and double)
+        if(lane == 0)
+            atomicAdd(result, warp_sum);
+    }
+    
+    if(threadIdx.x == 0)
+        printf("Block %3d finished\n", (int)blockIdx.x);
+}
+
+
+
+
+
+int main()
+{
+    int count = 1 << 20;
+
+    srand(2021);
+
+    float * data;
+    // allocation of managed memory
+    hipMallocManaged(&data, count * sizeof(float));
+
+    // initialization on host
+    for(int i = 0; i < count; i++)
+        data[i] = (float)rand() / (float)RAND_MAX;
+
+    // streams
+    hipStream_t stream;
+    hipStreamCreate(&stream);
+    
+    float * d_result;
+    float result = 0.0f;
+    hipMalloc(&d_result, sizeof(float));
+    hipMemcpyAsync(d_result, &result, sizeof(float), hipMemcpyHostToDevice, stream);
+
+    Parameters params;
+    params.alpha = 2.0 * M_PI;
+    params.beta = 10;
+    // constant memory initialization
+    hipMemcpyToSymbolAsync(HIP_SYMBOL(c_params), &params, sizeof(Parameters), 0, hipMemcpyHostToDevice, stream);
+
+    int tpb = 512;
+    int shmem_size = (tpb + 1) * sizeof(float);
+    calculate<<<20, tpb, shmem_size, stream>>>(data, count, d_result);
+
+    hipMemcpyAsync(&result, d_result, sizeof(float), hipMemcpyDeviceToHost, stream);
+
+    // wait for everything to finish
+    hipStreamSynchronize(stream);
+
+    hipStreamDestroy(stream);
+    
+    hipFree(data);
+    hipFree(d_result);
+
+    printf("Result is %f\n", result);
+
+    return 0;
+}
+
diff --git a/03_hipblas/Makefile b/03_hipblas/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..79fca5879314de055f2e4fc5f0db7bdcdae81354
--- /dev/null
+++ b/03_hipblas/Makefile
@@ -0,0 +1,18 @@
+
+HIPBLASPATH=${HOME}/apps/hipBLAS/installation
+
+.PHONY: compile clean run
+
+
+
+compile:
+	hipcc -g -O2 -I${HIPBLASPATH}/include hipblas.hip.cpp -o hipblas.x -L${HIPBLASPATH}/lib -lhipblas
+
+clean:
+	rm -f *.x
+
+run:
+	./hipblas.x
+
+
+
diff --git a/03_hipblas/README.md b/03_hipblas/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e749098375b8cc0b0ab114c09a4dee99f97666dd
--- /dev/null
+++ b/03_hipblas/README.md
@@ -0,0 +1,24 @@
+
+HIP example: hipBLAS library
+============================
+
+This example demonstrates basic functionality of the hipBLAS library. It performs a basic gemv operation, multiplying a matrix by a vector and adding the scaled result to another scaled vector, `y=a*A*x+b*y`.
+
+At the top of the source file, along with `hip_runtime.h`, we also include the `hipblas.h` header file. The memory on the host is allocated using the `hipHostMalloc` function, which allocates device-accessible, page-locked host memory (for better copy performance). The entries of the matrix `A` and the vectors `x` and `y` are initialized randomly. Then the device memory is allocated (for the matrix we use a pitched memory allocation) and the data are copied from the host to the device (the matrix `A` is copied using `hipMemcpy2D`). For correctness-checking purposes we calculate the result on the CPU. Then we initialize the hipBLAS handle using the `hipblasCreate` function, perform the matrix-vector multiplication using the `hipblasSgemv` function, synchronize with the device (wait for the operation to complete), and destroy the handle using the `hipblasDestroy` function. The result is then copied into host memory and printed, followed by releasing all the allocated memory using the `hipFree` and `hipHostFree` functions.
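+
+The core of the hipBLAS usage reduces to a few calls (taken from the source below):
+```
+hipblasHandle_t blas_handle;
+hipblasCreate(&blas_handle);
+hipblasSgemv(blas_handle, HIPBLAS_OP_N, height, width, &alpha, d_A, lda, d_x, 1, &beta, d_y, 1);
+hipDeviceSynchronize();
+hipblasDestroy(blas_handle);
+```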
+
+
+
+### Compilation
+
+As usual, but add the `-I` flag so the compiler can find the header files, and the linker `-L` and `-l` flags to link with the hipBLAS library. Depending on your system configuration some of those might not be necessary.
+```
+hipcc -I/path/to/hipblas/include hipblas.hip.cpp -o hipblas.x -L/path/to/hipblas/lib -lhipblas
+```
+Using a default installation of HIP and hipBLAS this would be
+```
+hipcc -I/opt/rocm/hipblas/include hipblas.hip.cpp -o hipblas.x -L/opt/rocm/hipblas/lib -lhipblas
+```
+or just
+```
+hipcc -I/opt/rocm/include hipblas.hip.cpp -o hipblas.x -L/opt/rocm/lib -lhipblas
+```
diff --git a/03_hipblas/hipblas.hip.cpp b/03_hipblas/hipblas.hip.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..a60f1c7c9e6416657de14031845000ce45713a85
--- /dev/null
+++ b/03_hipblas/hipblas.hip.cpp
@@ -0,0 +1,120 @@
+#include <cstdio>
+#include <vector>
+#include <cstdlib>
+#include <hip/hip_runtime.h>
+#include <hipblas.h>
+
+
+
+int main()
+{    
+    srand(9600);
+
+    int width = 10;
+    int height = 7;
+    int elem_count = width * height;
+
+    // initialization of data in CPU memory
+
+    float * h_A;
+    hipHostMalloc(&h_A, elem_count * sizeof(*h_A));
+    for(int i = 0; i < elem_count; i++)
+        h_A[i] = (100.0f * rand()) / (float)RAND_MAX;
+    printf("Matrix A:\n");
+    for(int r = 0; r < height; r++)
+    {
+        for(int c = 0; c < width; c++)
+            printf("%6.3f  ", h_A[r + height * c]);
+        printf("\n");
+    }
+
+    float * h_x;
+    hipHostMalloc(&h_x, width * sizeof(*h_x));
+    for(int i = 0; i < width; i++)
+        h_x[i] = (100.0f * rand()) / (float)RAND_MAX;
+    printf("vector x:\n");
+    for(int i = 0; i < width; i++)
+        printf("%6.3f  ", h_x[i]);
+    printf("\n");
+    
+    float * h_y;
+    hipHostMalloc(&h_y, height * sizeof(*h_y));
+    for(int i = 0; i < height; i++)
+        h_y[i] = 100.0f + i;
+    printf("vector y:\n");
+    for(int i = 0; i < height; i++)
+        printf("%6.3f  ", h_y[i]);
+    printf("\n");
+
+
+
+
+    
+    // initialization of data in GPU memory
+
+    float * d_A;
+    size_t pitch_A;
+    hipMallocPitch((void**)&d_A, &pitch_A, height * sizeof(*d_A), width);
+    hipMemcpy2D(d_A, pitch_A, h_A, height * sizeof(*d_A), height * sizeof(*d_A), width, hipMemcpyHostToDevice);
+    int lda = pitch_A / sizeof(float);
+
+    float * d_x;
+    hipMalloc(&d_x, width * sizeof(*d_x));
+    hipMemcpy(d_x, h_x, width * sizeof(*d_x), hipMemcpyHostToDevice);
+    
+    float * d_y;
+    hipMalloc(&d_y, height * sizeof(*d_y));
+    hipMemcpy(d_y, h_y, height * sizeof(*d_y), hipMemcpyHostToDevice);
+
+
+
+
+
+    // basic calculation of the result on the CPU
+
+    float alpha=2.0f, beta=10.0f;
+
+    for(int i = 0; i < height; i++)
+        h_y[i] *= beta;
+    for(int r = 0; r < height; r++)
+        for(int c = 0; c < width; c++)
+            h_y[r] += alpha * h_x[c] * h_A[r + height * c];
+    printf("result y CPU:\n");
+    for(int i = 0; i < height; i++)
+        printf("%6.3f  ", h_y[i]);
+    printf("\n");
+    
+
+
+
+    
+    // calculation of the result on the GPU using the hipBLAS library
+
+    hipblasHandle_t blas_handle;
+    hipblasCreate(&blas_handle);
+
+    hipblasSgemv(blas_handle, HIPBLAS_OP_N, height, width, &alpha, d_A, lda, d_x, 1, &beta, d_y, 1);
+    hipDeviceSynchronize();
+
+    hipblasDestroy(blas_handle);
+    
+
+    // copy the GPU result to CPU memory and print it
+    hipMemcpy(h_y, d_y, height * sizeof(*d_y), hipMemcpyDeviceToHost);
+    printf("result y BLAS:\n");
+    for(int i = 0; i < height; i++)
+        printf("%6.3f  ", h_y[i]);
+    printf("\n");
+
+
+    // free all the allocated memory
+    hipFree(d_A);
+    hipFree(d_x);
+    hipFree(d_y);
+    hipHostFree(h_A);
+    hipHostFree(h_x);
+    hipHostFree(h_y);
+
+    return 0;
+}
+
diff --git a/04_hipsolver/Makefile b/04_hipsolver/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..68182f18143afcbe5784df300d83dffa6002875c
--- /dev/null
+++ b/04_hipsolver/Makefile
@@ -0,0 +1,19 @@
+
+HIPSOLVERPATH=${HOME}/apps/hipSOLVER/installation
+HIPBLASPATH=${HOME}/apps/hipBLAS/installation
+
+.PHONY: compile clean run
+
+
+
+compile:
+	hipcc -g -O2 -I${HIPBLASPATH}/include -I${HIPSOLVERPATH}/include hipsolver.hip.cpp -o hipsolver.x -L${HIPBLASPATH}/lib -lhipblas -L${HIPSOLVERPATH}/lib -lhipsolver
+
+clean:
+	rm -f *.x
+
+run:
+	./hipsolver.x
+
+
+
diff --git a/04_hipsolver/README.md b/04_hipsolver/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..6d4981dbb3802b28c31bb394a70ae8a3a49979bc
--- /dev/null
+++ b/04_hipsolver/README.md
@@ -0,0 +1,24 @@
+
+HIP example: hipSOLVER library
+==============================
+
+In this example we demonstrate how to solve a simple dense system of linear equations `Ax=b` using the hipSOLVER library, specifically the `trf-trs` approach.
+
+In the source code, we first create and randomly initialize the matrix `A`, the right-hand-side vector `b`, and the solution vector `x`. Then we allocate memory on the device for the matrix `A`, the vectors `b` and `x`, the vector of pivot indices, and the info variable; and we copy the matrix `A` and the vector `b` to the device.
+
+Then we start preparing for solving the system and create the hipSOLVER handle. The functions performing the factorization and solving require additional auxiliary workspace memory buffers; we therefore query how much memory they need (using the `hipsolverSgetrf_bufferSize` and `hipsolverSgetrs_bufferSize` functions) and allocate it. Then we perform the triangular factorization of the matrix `A` with partial pivoting using `hipsolverSgetrf`. The factorized matrix is then used in the `hipsolverSgetrs` function to solve the system. The solution vector is copied to the host and printed, after which we free the workspace memory and destroy the hipSOLVER handle.
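+
+The factorize-and-solve sequence from the source, including the workspace size queries, looks like this:
+```
+int wss_trf, wss_trs; // wss = WorkSpace Size
+hipsolverSgetrf_bufferSize(solverHandle, size, size, d_A, d_A_ld, &wss_trf);
+hipsolverSgetrs_bufferSize(solverHandle, HIPSOLVER_OP_N, size, 1, d_A, d_A_ld, d_piv, d_b, size, &wss_trs);
+float * workspace;
+int wss = std::max(wss_trf, wss_trs);
+hipMalloc(&workspace, wss * sizeof(float));
+hipsolverSgetrf(solverHandle, size, size, d_A, d_A_ld, workspace, wss, d_piv, info);
+hipsolverSgetrs(solverHandle, HIPSOLVER_OP_N, size, 1, d_A, d_A_ld, d_piv, d_b, size, workspace, wss, info);
+```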
+
+To check that the calculations were correct, we multiply the matrix `A` by the solution vector `x`, which should yield the vector `b`. For this we use the hipBLAS library, starting by creating the hipBLAS handle. Because the triangular factorization function modified the matrix `A` in device memory, we need to copy it from host to device again. Then we perform the matrix-vector multiplication using the `hipblasSgemv` function, wait for it to finish, destroy the hipBLAS handle, copy the result vector to the host, and print it. In the end we free all the allocated memory.
+
+
+
+### Compilation
+
+Compilation is very similar to the hipBLAS example; the `-I`, `-L` and `-l` flags for hipSOLVER just need to be provided as well,
+```
+hipcc -g -O2 -I/path/to/hipblas/include -I/path/to/hipsolver/include hipsolver.hip.cpp -o hipsolver.x -L/path/to/hipblas/lib -lhipblas -L/path/to/hipsolver/lib -lhipsolver
+```
+or just
+```
+hipcc -I/opt/rocm/include hipsolver.hip.cpp -o hipsolver.x -L/opt/rocm/lib -lhipblas -lhipsolver
+```
diff --git a/04_hipsolver/hipsolver.hip.cpp b/04_hipsolver/hipsolver.hip.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..f5d67dd2029ae4ec548a150366a0dd1a855e33d2
--- /dev/null
+++ b/04_hipsolver/hipsolver.hip.cpp
@@ -0,0 +1,138 @@
+#include <cstdio>
+#include <vector>
+#include <cstdlib>
+#include <algorithm>
+#include <hipsolver.h>
+#include <hipblas.h>
+
+
+int main()
+{
+    srand(63456);
+
+    int size = 10;
+
+
+
+    // allocation and initialization of data on host. this time we use std::vector
+
+    int h_A_ld = size;
+    int h_A_pitch = h_A_ld * sizeof(float);
+    std::vector<float> h_A(size * h_A_ld);
+    for(int r = 0; r < size; r++)
+        for(int c = 0; c < size; c++)
+            h_A[r * h_A_ld + c] = (10.0 * rand()) / RAND_MAX;
+    printf("System matrix A:\n");
+    for(int r = 0; r < size; r++)
+    {
+        for(int c = 0; c < size; c++)
+            printf("%6.3f  ", h_A[r * h_A_ld + c]);
+        printf("\n");
+    }    
+    
+    std::vector<float> h_b(size);
+    for(int i = 0; i < size; i++)
+        h_b[i] = (10.0 * rand()) / RAND_MAX;
+    printf("RHS vector b:\n");
+    for(int i = 0; i < size; i++)
+        printf("%6.3f  ", h_b[i]);
+    printf("\n");
+
+    std::vector<float> h_x(size);
+
+
+
+
+
+    // memory allocation on the device and initialization
+
+    float * d_A;
+    size_t d_A_pitch;
+    hipMallocPitch((void**)&d_A, &d_A_pitch, size * sizeof(float), size);
+    int d_A_ld = d_A_pitch / sizeof(float);
+
+    float * d_b;
+    hipMalloc(&d_b, size * sizeof(float));
+    
+    float * d_x;
+    hipMalloc(&d_x, size * sizeof(float));
+
+    int * d_piv;
+    hipMalloc(&d_piv, size * sizeof(int));
+
+    int * info;
+    hipMallocManaged(&info, sizeof(int));
+
+    hipMemcpy2D(d_A, d_A_pitch, h_A.data(), h_A_pitch, size * sizeof(float), size, hipMemcpyHostToDevice);
+    hipMemcpy(d_b, h_b.data(), size * sizeof(float), hipMemcpyHostToDevice);
+    
+
+
+
+
+    // solving the system using hipSOLVER
+
+    hipsolverHandle_t solverHandle;
+    hipsolverCreate(&solverHandle);
+
+    int wss_trf, wss_trs; // wss = WorkSpace Size
+    hipsolverSgetrf_bufferSize(solverHandle, size, size, d_A, d_A_ld, &wss_trf);
+    hipsolverSgetrs_bufferSize(solverHandle, HIPSOLVER_OP_N, size, 1, d_A, d_A_ld, d_piv, d_b, size, &wss_trs);
+    float * workspace;
+    int wss = std::max(wss_trf, wss_trs);
+    hipMalloc(&workspace, wss * sizeof(float));
+    
+    hipsolverSgetrf(solverHandle, size, size, d_A, d_A_ld, workspace, wss, d_piv, info);
+    hipsolverSgetrs(solverHandle, HIPSOLVER_OP_N, size, 1, d_A, d_A_ld, d_piv, d_b, size, workspace, wss, info);
+
+    hipMemcpy(d_x, d_b, size * sizeof(float), hipMemcpyDeviceToDevice);
+    hipMemcpy(h_x.data(), d_x, size * sizeof(float), hipMemcpyDeviceToHost);
+    printf("Solution vector x:\n");
+    for(int i = 0; i < size; i++)
+        printf("%6.3f  ", h_x[i]);
+    printf("\n");
+
+    hipFree(workspace);
+
+    hipsolverDestroy(solverHandle);
+
+
+
+
+
+    // perform matrix-vector multiplication A*x using hipBLAS to check if the solution is correct
+
+    hipblasHandle_t blasHandle;
+    hipblasCreate(&blasHandle);
+
+    float alpha = 1;
+    float beta = 0;
+    hipMemcpy2D(d_A, d_A_pitch, h_A.data(), h_A_pitch, size * sizeof(float), size, hipMemcpyHostToDevice);
+    hipblasSgemv(blasHandle, HIPBLAS_OP_N, size, size, &alpha, d_A, d_A_ld, d_x, 1, &beta, d_b, 1);
+    hipDeviceSynchronize();
+
+    hipblasDestroy(blasHandle);
+
+    for(int i = 0; i < size; i++)
+        h_b[i] = 0;
+    hipMemcpy(h_b.data(), d_b, size * sizeof(float), hipMemcpyDeviceToHost);
+    printf("Check multiplication vector Ax:\n");
+    for(int i = 0; i < size; i++)
+        printf("%6.3f  ", h_b[i]);
+    printf("\n");
+
+
+
+
+
+    // free all the allocated memory
+
+    hipFree(info);
+    hipFree(d_piv);
+    hipFree(d_x);
+    hipFree(d_b);
+    hipFree(d_A);
+
+    return 0;
+}
+
diff --git a/05_hipify_vector_add/Makefile b/05_hipify_vector_add/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..6a31dfeefdfdb677c850504104434daab006c21b
--- /dev/null
+++ b/05_hipify_vector_add/Makefile
@@ -0,0 +1,30 @@
+
+.PHONY: default cuda hip hipifyclang hipifyperl runcuda runhip clean
+
+
+
+default:
+	@echo "Use the following targets: cuda hip hipifyclang hipifyperl runcuda runhip clean"
+
+clean:
+	rm -f *.x *.hip.cpp
+
+
+
+cuda:
+	nvcc -g -O2 vector_add.cu -o vector_add.cuda.x
+
+hip:
+	hipcc -g -O2 vector_add.hip.cpp -o vector_add.hip.x
+
+hipifyclang:
+	hipify-clang vector_add.cu -o vector_add.hip.cpp
+
+hipifyperl:
+	hipify-perl vector_add.cu -o vector_add.hip.cpp
+
+runcuda:
+	./vector_add.cuda.x
+
+runhip:
+	./vector_add.hip.x
diff --git a/05_hipify_vector_add/README.md b/05_hipify_vector_add/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..05fb398a66901754843ecc85376c2d830317496b
--- /dev/null
+++ b/05_hipify_vector_add/README.md
@@ -0,0 +1,26 @@
+
+HIPIFY example: vector add
+==========================
+
+This example demonstrates basic usage of the HIPIFY tool.
+
+The source `vector_add.cu` contains CUDA code performing the classic saxpy operation. It is possible to convert this CUDA source code to HIP source code (to *hipify* it) using one of the following commands
+```
+hipify-perl vector_add.cu -o vector_add.hip.cpp
+```
+```
+hipify-clang vector_add.cu -o vector_add.hip.cpp
+```
+The output file contains HIP source code with the same functionality as the original CUDA code.
+
+The hipified code can then be compiled
+```
+hipcc vector_add.hip.cpp -o vector_add.hip.x
+```
+and run
+```
+./vector_add.hip.x
+```
+
+
+
diff --git a/05_hipify_vector_add/vector_add.cu b/05_hipify_vector_add/vector_add.cu
new file mode 100644
index 0000000000000000000000000000000000000000..24b27f73a14c3af0ba1426321b419607ab7b888b
--- /dev/null
+++ b/05_hipify_vector_add/vector_add.cu
@@ -0,0 +1,61 @@
+#include <cstdio>
+#include <vector>
+#include <cstdlib>
+
+
+__global__ void saxpy(float * x, float * y, float a, int count)
+{
+    for(long long idx = blockIdx.x * blockDim.x + threadIdx.x; idx < count; idx += blockDim.x * gridDim.x)
+    {
+        y[idx] += a * x[idx];
+    }
+}
+
+
+int main()
+{
+    long long count = 10;
+    long long size = count * sizeof(float);
+
+    std::vector<float> h_x(count);
+    std::vector<float> h_y(count);
+    float a = 100;
+
+    for(long long i = 0; i < count; i++)
+        h_x[i] = i;
+    for(long long i = 0; i < count; i++)
+        h_y[i] = 10 * i;
+    
+    printf("X:");
+    for(long long i = 0; i < count; i++)
+        printf("%7.2f ", h_x[i]);
+    printf("\n");
+    printf("Y:");
+    for(long long i = 0; i < count; i++)
+        printf("%7.2f ", h_y[i]);
+    printf("\n");
+    
+    
+    float * d_x;
+    float * d_y;
+    cudaMalloc(&d_x, size);
+    cudaMalloc(&d_y, size);
+
+    cudaMemcpy(d_x, h_x.data(), size, cudaMemcpyHostToDevice);
+    cudaMemcpy(d_y, h_y.data(), size, cudaMemcpyHostToDevice);
+
+    saxpy<<< 64, 1024 >>>(d_x, d_y, a, count);
+
+    cudaMemcpy(h_y.data(), d_y, size, cudaMemcpyDeviceToHost);
+
+    printf("Y:");
+    for(long long i = 0; i < count; i++)
+        printf("%7.2f ", h_y[i]);
+    printf("\n");
+
+    cudaFree(d_x);
+    cudaFree(d_y);
+
+    return 0;
+}
+
diff --git a/06_hipify_warpshuffle/Makefile b/06_hipify_warpshuffle/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..82858f8d028074dda75ffb17ec1e302e7ae5b5e9
--- /dev/null
+++ b/06_hipify_warpshuffle/Makefile
@@ -0,0 +1,30 @@
+
+.PHONY: default cuda hip hipifyclang hipifyperl runcuda runhip clean
+
+
+
+default:
+	@echo "Use the following targets: cuda hip hipifyclang hipifyperl runcuda runhip clean"
+
+clean:
+	rm -f *.x *.hip.cpp
+
+
+
+cuda:
+	nvcc -g -O2 warpshuffle.cu -o warpshuffle.cuda.x
+
+hip:
+	hipcc -g -O2 warpshuffle.hip.cpp -o warpshuffle.hip.x
+
+hipifyclang:
+	hipify-clang warpshuffle.cu -o warpshuffle.hip.cpp
+
+hipifyperl:
+	hipify-perl warpshuffle.cu -o warpshuffle.hip.cpp
+
+runcuda:
+	./warpshuffle.cuda.x
+
+runhip:
+	./warpshuffle.hip.x
diff --git a/06_hipify_warpshuffle/README.md b/06_hipify_warpshuffle/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..eabb2e449f228725c84838f1d60c9efa4460a35a
--- /dev/null
+++ b/06_hipify_warpshuffle/README.md
@@ -0,0 +1,24 @@
+
+HIPIFY example: warp shuffle reduction
+======================================
+
+This example demonstrates a hipification use case in which some manual changes need to be made to complete the hipification.
+
+The CUDA code creates an array of random integers and calculates their sum using warp shuffles and atomics.
+
+Hipifying the CUDA source using either `hipify-perl` or `hipify-clang` yields a warning that the `__shfl_down_sync` function is unsupported in HIP. We can fix that by replacing it with `__shfl_down` and omitting the first parameter (the `0xffffffff` mask). Such a function does exist in HIP; the hipification just was not performed automatically (for safety and correctness reasons, probably because of the extra argument). **We are modifying the hipified code, not the original.**
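+
+The change is mechanical, e.g.:
+```
+my_sum += __shfl_down_sync(0xffffffff, my_sum, 16);   // CUDA
+my_sum += __shfl_down(my_sum, 16);                    // HIP
+```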
+
+While modifying the warp shuffle call syntax, we remember that the warp size on NVIDIA hardware is 32, but on AMD it is 64. The current code would therefore yield incorrect results on AMD GPUs. We cannot just add an additional `__shfl_down(my_sum, 32)` call, because then it might not work on NVIDIA hardware. (We will ignore the fact that the optional third argument of the `__shfl_down` function is a width.) We need to write wave-aware code, which behaves correctly no matter the warp size. We replace all the `__shfl_down` function calls with the following loop
+```
+for(int i = warpSize >> 1; i > 0; i >>= 1)
+    my_sum += __shfl_down(my_sum, i);
+```
+We could template the kernel on the warp size and unroll this loop, but that is not necessary here.
+
+Compiling the code using
+```
+hipcc warpshuffle.hip.cpp -o warpshuffle.hip.x
+```
+produces an error, `cannot convert int** to void**`, in the `hipHostAlloc` function. The solution could be to just cast the double pointer to `void**`, but this is not very convenient. In any case, the `hipHostAlloc` and `hipMallocHost` functions are marked as deprecated. The currently correct way of allocating host-side page-locked memory in HIP is the `hipHostMalloc` function, so we just change the name of the function from `hipHostAlloc` to `hipHostMalloc`. Compilation is now successful, and running the HIP program produces the same results as the CUDA program.
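+
+The fix is then a one-line rename; as a sketch (the exact flag name in the hipified output may differ):
+```
+// hipified code, fails to compile:
+hipHostAlloc(&h_data, count * sizeof(int), hipHostAllocDefault);
+// fixed:
+hipHostMalloc(&h_data, count * sizeof(int), hipHostMallocDefault);
+```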
+
+An interesting thing to note is that if we did not fix the problem with the warp shuffle functions and kept `__shfl_down_sync` in the code, it would still compile and run fine on the NVIDIA platform. This is because on the NVIDIA platform `hipcc` internally calls `nvcc`, which understands the `__shfl_down_sync` function without any problem. Indeed, on the NVIDIA platform `__shfl_down` is just a wrapper around `__shfl_down_sync`. **Be careful: if it runs on the NVIDIA platform, it does not mean it will also run on AMD. Test on both platforms.**
diff --git a/06_hipify_warpshuffle/warpshuffle.cu b/06_hipify_warpshuffle/warpshuffle.cu
new file mode 100644
index 0000000000000000000000000000000000000000..f1569e4d43e949fc5bb4b82740275ea7b32fe387
--- /dev/null
+++ b/06_hipify_warpshuffle/warpshuffle.cu
@@ -0,0 +1,56 @@
+#include <cstdio>
+#include <cstdlib>
+
+
+__global__ void reduce_sum_warpshuffles(int * data, int count, int * result)
+{
+    int my_sum = 0;
+    for(int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < count; idx += blockDim.x * gridDim.x)
+        my_sum += data[idx];
+
+    int lane = threadIdx.x % 32;
+
+    my_sum += __shfl_down_sync(0xffffffff, my_sum, 16);
+    my_sum += __shfl_down_sync(0xffffffff, my_sum,  8);
+    my_sum += __shfl_down_sync(0xffffffff, my_sum,  4);
+    my_sum += __shfl_down_sync(0xffffffff, my_sum,  2);
+    my_sum += __shfl_down_sync(0xffffffff, my_sum,  1);
+
+    if(lane == 0)
+        atomicAdd(result, my_sum);
+}
+
+
+
+int main()
+{
+    srand(12345);
+
+    int count = 1 << 20;
+
+    int * h_data;
+    int * d_data;
+    cudaMalloc(&d_data, count * sizeof(int));
+    cudaHostAlloc(&h_data, count * sizeof(int), cudaHostAllocDefault);
+    int h_result;
+    int * d_result;
+    cudaMalloc(&d_result, sizeof(int));
+
+    for(int i = 0; i < count; i++)
+        h_data[i] = rand() % 100;
+    h_result = 0;
+    cudaMemcpy(d_data, h_data, count * sizeof(int), cudaMemcpyHostToDevice);
+    cudaMemcpy(d_result, &h_result, sizeof(int), cudaMemcpyHostToDevice);
+
+    reduce_sum_warpshuffles<<<40,512>>>(d_data, count, d_result);
+
+    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
+
+    printf("Result is %d\n", h_result);
+    
+    cudaFreeHost(h_data);
+    cudaFree(d_data);
+    cudaFree(d_result);
+
+    return 0;
+}
diff --git a/07_hipify_multisource/Makefile b/07_hipify_multisource/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..7d087518ea0352c253ff5474e1b70427d5b61371
--- /dev/null
+++ b/07_hipify_multisource/Makefile
@@ -0,0 +1,31 @@
+
+.PHONY: default cuda hip hipifyclang hipifyperl runcuda runhip clean
+
+
+
+default:
+	@echo "Use the following targets: cuda hip hipifyclang hipifyperl runcuda runhip clean"
+
+clean:
+	cd project_cuda   &&   make clean
+	rm -r project_hip
+
+
+
+cuda:
+	cd project_cuda   &&   make
+
+hip:
+	cd project_hip   &&   make
+
+hipifyclang:
+	cp -r project_cuda project_hip   &&   hipconvertinplace.sh project_hip
+
+hipifyperl:
+	cp -r project_cuda project_hip   &&   hipconvertinplace-perl.sh project_hip
+
+runcuda:
+	cd project_cuda   &&   make run
+
+runhip:
+	cd project_hip   &&   make run
diff --git a/07_hipify_multisource/README.md b/07_hipify_multisource/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..6ad445ee9737d42c1ad3351323b5b1007d944a7e
--- /dev/null
+++ b/07_hipify_multisource/README.md
@@ -0,0 +1,48 @@
+
+HIPIFY example: whole directory hipification
+============================================
+
+In the `project_cuda` directory there is a somewhat larger project consisting of multiple CUDA and C++ source files, along with a Makefile used to compile the project. We would like to convert this whole project from CUDA to HIP, for which we use HIPIFY. We do not need to hipify each source file individually; instead we can use the `hipconvertinplace.sh` or `hipconvertinplace-perl.sh` tools, which can recursively hipify a whole directory in-place.
+
+These tools do not have an "input" and "output" directory; they can (currently) only hipify in-place. We therefore make a copy of the whole project, e.g.
+```
+cp -r project_cuda project_hip
+```
+After that the copied sources can be hipified
+```
+hipconvertinplace.sh project_hip
+```
+The script prints some info and exits without any errors.
+
+The directory `project_hip` now contains HIP sources. The Makefile, however, was not hipified, so we have to modify it ourselves; in this example it is sufficient to just replace `nvcc` with `hipcc`, as shown after the commands below. We can change to the `project_hip` directory and try to compile the project,
+```
+cd project_hip
+make
+```
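+
+The Makefile edit itself is just swapping the compiler in each rule, e.g.:
+```
+main.o: main.cu reduction.cuh hello.h
+	hipcc -g -O2 -c $< -o $@
+```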
+We get an error about incompatible `int**` and `void**` in the `hipHostAlloc` function, so we just replace it with `hipHostMalloc` as in the previous example. We try to compile again, which is successful, and the results are also correct.
+
+Let us, however, examine the hipified source code in more detail. Looking into the `reduction.cu` file containing a sum-reduction kernel, we can again notice an assumption about the warp size hard-coded into the source. But the result was correct, even on an AMD GPU! In this specific example, the assumption of `warpSize=32` did not cause any problems, since we were on a device with a warp size of 32 or 64. If there were a device with warp size 16, however, the result probably would not be correct. The final 6 additions are actually just an optimization -- if only one warp is executing, we do not need to perform synchronization. The warp size 32 is hard-coded there, and having a warp size of 64 merely lowers the performance a bit by performing one unnecessary `__syncthreads()`. Either way, for correctness we should replace the `32`s with `warpSize` and somehow handle the remainder loop, resulting in e.g.
+```
+int active_threads = blockDim.x >> 1;
+while(active_threads > warpSize)
+{
+    if(threadIdx.x < active_threads)
+        shmem_sums[threadIdx.x] += shmem_sums[threadIdx.x + active_threads];
+    active_threads >>= 1;
+    __syncthreads();
+}
+
+while(active_threads > 0)
+{
+    if(threadIdx.x < active_threads)
+        shmem_sums[threadIdx.x] += shmem_sums[threadIdx.x + active_threads];
+    active_threads >>= 1;
+}
+```
+
+One thing that bothers neither the compiler nor the linker, and does not harm correctness, but does harm our eyes and brains, is the file extensions, which were preserved from the original project. HIP sources having the `.cu` extension is really not nice. We could (and should) rename all such files, but we would also have to reflect these changes in the Makefile, the `#include` statements, etc., which is no fun to do.
+
+Another thing that might bother us is that the hipification script added the `#include <hip/hip_runtime.h>` statement to *all* the source files, even to the ones which are pure C++ and have nothing to do with HIP (see `hello.h` and `hello.cpp`). I currently know of nothing that could solve this, other than manually deleting these includes.
+
+
+
diff --git a/07_hipify_multisource/project_cuda/Makefile b/07_hipify_multisource/project_cuda/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..0a6885c0e91c5002758d884f07dd804b4a70ddc8
--- /dev/null
+++ b/07_hipify_multisource/project_cuda/Makefile
@@ -0,0 +1,27 @@
+
+.PHONY: all clean run
+
+
+
+all: program.x
+
+clean:
+	rm -f *.x *.o
+
+run: program.x
+	./program.x
+
+
+
+program.x: main.o reduction.o hello.o
+	nvcc -g -O2 $^ -o $@
+
+main.o: main.cu reduction.cuh hello.h
+	nvcc -g -O2 -c $< -o $@
+
+reduction.o: reduction.cu reduction.cuh
+	nvcc -g -O2 -c $< -o $@
+
+hello.o: hello.cpp hello.h
+	nvcc -g -O2 -c $< -o $@
+
diff --git a/07_hipify_multisource/project_cuda/hello.cpp b/07_hipify_multisource/project_cuda/hello.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..33df65d84cdb5c9bea99825f32edf797716681a2
--- /dev/null
+++ b/07_hipify_multisource/project_cuda/hello.cpp
@@ -0,0 +1,8 @@
+#include "hello.h"
+
+#include <cstdio>
+
+void say_hello_to(const char * name)
+{
+    printf("Hello, %s\n", name);
+}
diff --git a/07_hipify_multisource/project_cuda/hello.h b/07_hipify_multisource/project_cuda/hello.h
new file mode 100644
index 0000000000000000000000000000000000000000..1917b30e3c38c6ea857083b2fa814f842d4ae1f7
--- /dev/null
+++ b/07_hipify_multisource/project_cuda/hello.h
@@ -0,0 +1,3 @@
+#pragma once
+
+void say_hello_to(const char * name);
diff --git a/07_hipify_multisource/project_cuda/main.cu b/07_hipify_multisource/project_cuda/main.cu
new file mode 100644
index 0000000000000000000000000000000000000000..0b2b7b8b44fbfed0af6aa871f8105d3dd73651c2
--- /dev/null
+++ b/07_hipify_multisource/project_cuda/main.cu
@@ -0,0 +1,48 @@
+#include <cstdio>
+#include <cstdlib>
+
+#include "reduction.cuh"
+#include "hello.h"
+
+
+
+
+
+int main()
+{
+    say_hello_to("IT4I");
+
+    srand(12345);
+
+    int count = 1 << 20;
+
+    int * h_data;
+    int * d_data;
+    cudaMalloc(&d_data, count * sizeof(int));
+    cudaHostAlloc(&h_data, count * sizeof(int), cudaHostAllocDefault);
+    int h_result;
+    int * d_result;
+    cudaMalloc(&d_result, sizeof(int));
+
+    for(int i = 0; i < count; i++)
+        h_data[i] = rand() % 100;
+    h_result = 0;
+    cudaMemcpy(d_data, h_data, count * sizeof(int), cudaMemcpyHostToDevice);
+    cudaMemcpy(d_result, &h_result, sizeof(int), cudaMemcpyHostToDevice);
+
+    int bpg = 40;
+    int tpb = 512;
+    int shmem_size = tpb * sizeof(int);
+    reduce_sum<<< bpg, tpb, shmem_size >>>(d_data, count, d_result);
+
+    cudaDeviceSynchronize();
+    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
+
+    printf("Result is %d\n", h_result);
+    
+    cudaFreeHost(h_data);
+    cudaFree(d_data);
+    cudaFree(d_result);
+
+    return 0;
+}
diff --git a/07_hipify_multisource/project_cuda/reduction.cu b/07_hipify_multisource/project_cuda/reduction.cu
new file mode 100644
index 0000000000000000000000000000000000000000..47ae883c5218e35f91c747b582eb91457b41524b
--- /dev/null
+++ b/07_hipify_multisource/project_cuda/reduction.cu
@@ -0,0 +1,34 @@
+#include "reduction.cuh"
+
+
+
+__global__ void reduce_sum(int * data, int count, int * d_result)
+{
+    extern __shared__ volatile int shmem_sums[];
+
+    int my_sum = 0;
+    for(int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < count; idx += blockDim.x * gridDim.x)
+        my_sum += data[idx];
+    shmem_sums[threadIdx.x] = my_sum;
+    __syncthreads();
+
+    int active_threads = blockDim.x >> 1;
+    while(active_threads > 32)
+    {
+        if(threadIdx.x < active_threads)
+            shmem_sums[threadIdx.x] += shmem_sums[threadIdx.x + active_threads];
+        active_threads >>= 1;
+        __syncthreads();
+    }
+
+    if(threadIdx.x < 32) shmem_sums[threadIdx.x] += shmem_sums[threadIdx.x + 32];
+    if(threadIdx.x < 16) shmem_sums[threadIdx.x] += shmem_sums[threadIdx.x + 16];
+    if(threadIdx.x <  8) shmem_sums[threadIdx.x] += shmem_sums[threadIdx.x +  8];
+    if(threadIdx.x <  4) shmem_sums[threadIdx.x] += shmem_sums[threadIdx.x +  4];
+    if(threadIdx.x <  2) shmem_sums[threadIdx.x] += shmem_sums[threadIdx.x +  2];
+    if(threadIdx.x <  1) shmem_sums[threadIdx.x] += shmem_sums[threadIdx.x +  1];
+
+    if(threadIdx.x == 0)
+        atomicAdd(d_result, shmem_sums[0]);
+}
+
diff --git a/07_hipify_multisource/project_cuda/reduction.cuh b/07_hipify_multisource/project_cuda/reduction.cuh
new file mode 100644
index 0000000000000000000000000000000000000000..4282f45f1c90c36fe6af22c70bd41930bfbebf44
--- /dev/null
+++ b/07_hipify_multisource/project_cuda/reduction.cuh
@@ -0,0 +1,5 @@
+#pragma once
+
+
+
+__global__ void reduce_sum(int * data, int count, int * d_result);
diff --git a/08_hipify_blas/Makefile b/08_hipify_blas/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..1e9cf258d405c0af08f201d99e6cb7a517fa2d19
--- /dev/null
+++ b/08_hipify_blas/Makefile
@@ -0,0 +1,32 @@
+
+HIPBLASPATH=${HOME}/apps/hipBLAS/installation
+
+.PHONY: default cuda hip hipifyclang hipifyperl runcuda runhip clean
+
+
+
+default:
+	@echo "Use the following targets: cuda hip hipifyclang hipifyperl runcuda runhip clean"
+
+clean:
+	rm -f *.x *.hip.cpp
+
+
+
+cuda:
+	nvcc -g -O2 blas.cu -o blas.cuda.x -lcublas
+
+hip:
+	hipcc -g -O2 -I${HIPBLASPATH}/include blas.hip.cpp -o blas.hip.x -L${HIPBLASPATH}/lib -lhipblas
+
+hipifyclang:
+	hipify-clang blas.cu -o blas.hip.cpp
+
+hipifyperl:
+	hipify-perl blas.cu -o blas.hip.cpp
+
+runcuda:
+	./blas.cuda.x
+
+runhip:
+	./blas.hip.x
diff --git a/08_hipify_blas/README.md b/08_hipify_blas/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..c217c756f8659a334cab2517a330f1e45927d012
--- /dev/null
+++ b/08_hipify_blas/README.md
@@ -0,0 +1,7 @@
+
+HIPIFY example: cuBLAS to hipBLAS hipification
+==============================================
+
+The CUDA source code `blas.cu` demonstrates basic cuBLAS functionality -- it performs a gemv operation, `y=a*A*x+b*y`. Our goal is to hipify this.
+
+Simply hipifying the code using the usual commands produces no errors or warnings. Skimming through the hipified code, we see that even the cuBLAS function calls were converted into their hipBLAS equivalents. Besides the classic fix of the host memory allocation function calls, we also need to cast the first parameter of the `hipMallocPitch` function to `void**`. After that, we can compile the program using `-lhipblas` instead of `-lcublas` on the command line, adding the other necessary `-I` and `-L` flags. The compiled program runs flawlessly and produces results equivalent to the original CUDA program.
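+
+The `hipMallocPitch` fix is just a cast on the first argument, as also used in example `03_hipblas`:
+```
+hipMallocPitch((void**)&d_A, &pitch_A, height * sizeof(*d_A), width);
+```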
diff --git a/08_hipify_blas/blas.cu b/08_hipify_blas/blas.cu
new file mode 100644
index 0000000000000000000000000000000000000000..7912240339c4144509bacacdc8b1c4f4bed19ad4
--- /dev/null
+++ b/08_hipify_blas/blas.cu
@@ -0,0 +1,110 @@
+#include <cstdio>
+#include <vector>
+#include <cstdlib>
+#include <cublas_v2.h>
+
+
+
+int main()
+{    
+    srand(9600);
+
+    int width = 10;
+    int height = 7;
+    int elem_count = width * height;
+
+    float * h_A;
+    cudaHostAlloc(&h_A, elem_count * sizeof(*h_A), cudaHostAllocDefault);
+    for(int i = 0; i < elem_count; i++)
+        h_A[i] = (100.0f * rand()) / (float)RAND_MAX;
+    printf("Matrix A:\n");
+    for(int r = 0; r < height; r++)
+    {
+        for(int c = 0; c < width; c++)
+            printf("%6.3f  ", h_A[r + height * c]);
+        printf("\n");
+    }
+
+    float * h_x;
+    cudaHostAlloc(&h_x, width * sizeof(*h_x), cudaHostAllocDefault);
+    for(int i = 0; i < width; i++)
+        h_x[i] = (100.0f * rand()) / (float)RAND_MAX;
+    printf("vector x:\n");
+    for(int i = 0; i < width; i++)
+        printf("%6.3f  ", h_x[i]);
+    printf("\n");
+    
+    float * h_y;
+    cudaHostAlloc(&h_y, height * sizeof(*h_y), cudaHostAllocDefault);
+    for(int i = 0; i < height; i++)
+        h_y[i] = 100.0f + i;
+    printf("vector y:\n");
+    for(int i = 0; i < height; i++)
+        printf("%6.3f  ", h_y[i]);
+    printf("\n");
+
+
+    
+
+    float * d_A;
+    size_t pitch_A;
+    cudaMallocPitch(&d_A, &pitch_A, height * sizeof(*d_A), width);
+    cudaMemcpy2D(d_A, pitch_A, h_A, height * sizeof(*d_A), height * sizeof(*d_A), width, cudaMemcpyHostToDevice);
+    int lda = pitch_A / sizeof(float);
+
+    float * d_x;
+    cudaMalloc(&d_x, width * sizeof(*d_x));
+    cudaMemcpy(d_x, h_x, width * sizeof(*d_x), cudaMemcpyHostToDevice);
+    
+    float * d_y;
+    cudaMalloc(&d_y, height * sizeof(*d_y));
+    cudaMemcpy(d_y, h_y, height * sizeof(*d_y), cudaMemcpyHostToDevice);
+
+
+
+
+    float alpha=2.0f, beta=10.0f;
+
+    for(int i = 0; i < height; i++)
+        h_y[i] *= beta;
+    for(int r = 0; r < height; r++)
+        for(int c = 0; c < width; c++)
+            h_y[r] += alpha * h_x[c] * h_A[r + height * c];
+    printf("result y CPU:\n");
+    for(int i = 0; i < height; i++)
+        printf("%6.3f  ", h_y[i]);
+    printf("\n");
+
+    for(int i = 0; i < height; i++)
+        h_y[i] = 0.0f;
+
+
+
+    cublasHandle_t blas_handle;
+    cublasCreate(&blas_handle);
+    
+    cublasSgemv(blas_handle, CUBLAS_OP_N, height, width, &alpha, d_A, lda, d_x, 1, &beta, d_y, 1);
+    cudaDeviceSynchronize();
+    
+    cublasDestroy(blas_handle);
+    
+    
+
+    cudaMemcpy(h_y, d_y, height * sizeof(*d_y), cudaMemcpyDeviceToHost);
+    printf("result y BLAS:\n");
+    for(int i = 0; i < height; i++)
+        printf("%6.3f  ", h_y[i]);
+    printf("\n");
+
+
+
+    cudaFree(d_A);
+    cudaFree(d_x);
+    cudaFree(d_y);
+    cudaFreeHost(h_A);
+    cudaFreeHost(h_x);
+    cudaFreeHost(h_y);
+
+    return 0;
+}
+
diff --git a/09_hipify_solver/Makefile b/09_hipify_solver/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..cd69b16ecc6dfae3e76eea150ffffdcf93fa6d3b
--- /dev/null
+++ b/09_hipify_solver/Makefile
@@ -0,0 +1,38 @@
+
+HIPBLASPATH=${HOME}/apps/hipBLAS/installation
+HIPSOLVERPATH=${HOME}/apps/hipSOLVER/installation
+
+
+
+
+
+.PHONY: default cuda hip hipifyclang hipifyperl runcuda runhip clean
+
+
+
+default:
+	@echo "Use the following targets: cuda hip hipifyclang hipifyperl runcuda runhip clean"
+
+clean:
+	rm -f *.x *.hip.cpp
+
+
+
+cuda:
+	nvcc -g -O2 solver.cu -o solver.cuda.x -lcusolver -lcublas
+
+hip:
+	hipcc -g -O2 -I${HIPSOLVERPATH}/include -I${HIPBLASPATH}/include solver.hip.cpp -o solver.hip.x -L${HIPSOLVERPATH}/lib -L${HIPBLASPATH}/lib -lhipsolver -lhipblas
+
+hipifyclang:
+	hipify-clang solver.cu -o solver.hip.cpp
+
+hipifyperl:
+	hipify-perl solver.cu -o solver.hip.cpp
+
+runcuda:
+	./solver.cuda.x
+
+runhip:
+	./solver.hip.x
+
diff --git a/09_hipify_solver/README.md b/09_hipify_solver/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..39dfc49d4c7e896f967f53604f02bdf8e4fb3a9a
--- /dev/null
+++ b/09_hipify_solver/README.md
@@ -0,0 +1,9 @@
+
+HIPIFY example: cuSOLVER to hipSOLVER hipification
+==================================================
+
+The CUDA source code `solver.cu` solves a simple dense system of linear equations using the cuSOLVER library. We would like to hipify this code.
+
+Running either of the hipify scripts on the source code produces no warnings or errors. However, trying to compile the hipified code fails with errors. Looking at the hipified code, we notice that the cuSOLVER functions were left untouched -- yet the hipify script produced no warnings!
+
+Unfortunately, cuSOLVER is not hipifiable (at the time of writing). Hipify treats all unknown functions equally (be it `my_simulation` or `cusolverDnSgetrf`), therefore no warnings were produced. The reason why this is not hipifiable is probably the slightly different calling conventions of some functions, which require additional workspace memory buffers. Also, cuSOLVER contains dense and sparse solvers for single- and multi-GPU systems, while hipSOLVER currently contains only some dense solvers for single-GPU systems.
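+
+To illustrate the differing calling conventions, compare the `getrf` calls from this example and from example `04_hipsolver` -- the hipSOLVER variant additionally takes the workspace size `wss`:
+```
+cusolverDnSgetrf(cusolverHandle, size, size, d_A, d_A_ld, workspace, d_piv, info);
+hipsolverSgetrf(solverHandle, size, size, d_A, d_A_ld, workspace, wss, d_piv, info);
+```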
diff --git a/09_hipify_solver/solver.cu b/09_hipify_solver/solver.cu
new file mode 100644
index 0000000000000000000000000000000000000000..bea114727ca9c7b70366a0cf910da3e95f718fe1
--- /dev/null
+++ b/09_hipify_solver/solver.cu
@@ -0,0 +1,115 @@
+#include <cstdio>
+#include <vector>
+#include <cstdlib>
+#include <cusolverDn.h>
+#include <cublas_v2.h>
+
+
+int main()
+{
+    srand(63456);
+
+    int size = 10;
+
+    int h_A_ld = size;
+    int h_A_pitch = h_A_ld * sizeof(float);
+    std::vector<float> h_A(size * h_A_ld);
+    for(int r = 0; r < size; r++)
+        for(int c = 0; c < size; c++)
+            h_A[r * h_A_ld + c] = (10.0 * rand()) / RAND_MAX;
+    printf("System matrix A:\n");
+    for(int r = 0; r < size; r++)
+    {
+        for(int c = 0; c < size; c++)
+            printf("%6.3f  ", h_A[r * h_A_ld + c]);
+        printf("\n");
+    }    
+    
+    std::vector<float> h_b(size);
+    for(int i = 0; i < size; i++)
+        h_b[i] = (10.0 * rand()) / RAND_MAX;
+    printf("RHS vector b:\n");
+    for(int i = 0; i < size; i++)
+        printf("%6.3f  ", h_b[i]);
+    printf("\n");
+
+    float * d_A;
+    size_t d_A_pitch;
+    int d_A_ld;
+    cudaMallocPitch(&d_A, &d_A_pitch, size * sizeof(float), size);
+    d_A_ld = d_A_pitch / sizeof(float);
+
+    float * d_b;
+    cudaMalloc(&d_b, size * sizeof(float));
+    
+    float * d_x;
+    cudaMalloc(&d_x, size * sizeof(float));
+
+    int * d_piv;
+    cudaMalloc(&d_piv, size * sizeof(int));
+
+    int * info;
+    cudaMallocManaged(&info, sizeof(int));
+
+    cudaMemcpy2D(d_A, d_A_pitch, h_A.data(), h_A_pitch, size * sizeof(float), size, cudaMemcpyHostToDevice);
+    cudaMemcpy(d_b, h_b.data(), size * sizeof(float), cudaMemcpyHostToDevice);
+    
+
+
+    cusolverDnHandle_t cusolverHandle;
+    cusolverDnCreate(&cusolverHandle);
+
+    float * workspace;
+    int workspace_size;
+    cusolverDnSgetrf_bufferSize(cusolverHandle, size, size, d_A, d_A_ld, &workspace_size);
+    cudaMalloc(&workspace, workspace_size * sizeof(float));
+    cusolverDnSgetrf(cusolverHandle, size, size, d_A, d_A_ld, workspace, d_piv, info);
+
+    cusolverDnSgetrs(cusolverHandle, CUBLAS_OP_N, size, 1, d_A, d_A_ld, d_piv, d_b, size, info);
+
+    for(int i = 0; i < size; i++)
+        h_b[i] = 0;
+
+    cudaMemcpy(d_x, d_b, size * sizeof(float), cudaMemcpyDeviceToDevice);
+    cudaMemcpy(h_b.data(), d_b, size * sizeof(float), cudaMemcpyDeviceToHost);
+    printf("Solution vector x:\n");
+    for(int i = 0; i < size; i++)
+        printf("%6.3f  ", h_b[i]);
+    printf("\n");
+
+
+    cudaFree(workspace);
+
+    cusolverDnDestroy(cusolverHandle);
+
+
+
+    cublasHandle_t cublasHandle;
+    cublasCreate(&cublasHandle);
+
+    float alpha = 1;
+    float beta = 0;
+    cudaMemcpy2D(d_A, d_A_pitch, h_A.data(), h_A_pitch, size * sizeof(float), size, cudaMemcpyHostToDevice);
+    cublasSgemv(cublasHandle, CUBLAS_OP_N, size, size, &alpha, d_A, d_A_ld, d_x, 1, &beta, d_b, 1);
+
+    cublasDestroy(cublasHandle);
+
+    
+
+    cudaMemcpy(h_b.data(), d_b, size * sizeof(float), cudaMemcpyDeviceToHost);
+    printf("Check multiplication vector Ax:\n");
+    for(int i = 0; i < size; i++)
+        printf("%6.3f  ", h_b[i]);
+    printf("\n");
+
+
+
+    cudaFree(info);
+    cudaFree(d_piv);
+    cudaFree(d_x);
+    cudaFree(d_b);
+    cudaFree(d_A);
+
+    return 0;
+}
+
diff --git a/10_hipcpu_vector_add/Makefile b/10_hipcpu_vector_add/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..659bd6f336af5f0e90b6cbbde2f2490c2901dc88
--- /dev/null
+++ b/10_hipcpu_vector_add/Makefile
@@ -0,0 +1,18 @@
+
+HIPCPUPATH=${HOME}/apps/HIP-CPU
+
+.PHONY: compile clean run
+
+
+
+compile:
+	g++ -g -O2 -std=c++17 -I${HIPCPUPATH}/include vector_add.hip.cpp -o vector_add.x -ltbb -pthread
+
+clean:
+	rm -f *.x
+
+run:
+	./vector_add.x
+
+
+
diff --git a/10_hipcpu_vector_add/README.md b/10_hipcpu_vector_add/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..874e32b7483fc6fb11b47e9148d8f4a853c97318
--- /dev/null
+++ b/10_hipcpu_vector_add/README.md
@@ -0,0 +1,17 @@
+
+HIP example: using the HIP-CPU library
+======================================
+
+This example demonstrates how the HIP-CPU library is to be used.
+
+The `vector_add.hip.cpp` file contains HIP source code performing the addition of two vectors. The code is the same as in the very first example, `01_vector_add`. This time we will not use the `hipcc` compiler to compile for the GPU; instead we will compile with an ordinary C++ compiler and use the HIP-CPU library, which should allow the same unchanged HIP code to run on CPUs.
+
+To compile the program to run on CPUs, we need to point the compiler at the include directory of the HIP-CPU library, request the C++17 standard, and link with TBB and pthreads,
+```
+g++ -g -O2 -std=c++17 -I/path/to/hip-cpu/include vector_add.hip.cpp -o vector_add.x -ltbb -pthread
+```
+However, compilation of the code fails with a type conversion error. To resolve it, we just need to cast the pointers passed to `hipMalloc` to `void**`.
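+
+The needed change is kept as a comment in the source:
+```
+hipMalloc((void**)&d_x, count * sizeof(float));
+hipMalloc((void**)&d_y, count * sizeof(float));
+```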
+
+HIP-CPU is not yet fully complete; some functions are not yet present, do not work correctly, or don't have the usual overloads, which is the case for this error. I do not recommend using HIP-CPU for any serious work yet, not even for debugging, since you would mostly be debugging the HIP-CPU library rather than your own code.
+
+
diff --git a/10_hipcpu_vector_add/vector_add.hip.cpp b/10_hipcpu_vector_add/vector_add.hip.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..6126d1ee10a6a81a4e2bfa5df9a52962b4b6ab36
--- /dev/null
+++ b/10_hipcpu_vector_add/vector_add.hip.cpp
@@ -0,0 +1,74 @@
+#include <cstdio>
+#include <hip/hip_runtime.h>
+
+
+
+__global__ void add_vectors(float * x, float * y, float alpha, int count)
+{
+    // loop through all the elements in the vector
+    for(long long idx = blockIdx.x * blockDim.x + threadIdx.x; idx < count; idx += blockDim.x * gridDim.x)
+    {
+        y[idx] += alpha * x[idx];
+    }
+}
+
+
+
+int main()
+{
+    // number of elements in the vectors
+    long long count = 10;
+
+    // allocation and initialization of data on the host (CPU memory)
+    float * h_x = new float[count];
+    float * h_y = new float[count];
+    for(long long i = 0; i < count; i++)
+    {
+        h_x[i] = i;
+        h_y[i] = 10 * i;
+    }    
+
+    // print the input data
+    printf("X:");
+    for(long long i = 0; i < count; i++)
+        printf(" %7.2f", h_x[i]);
+    printf("\n");
+    printf("Y:");
+    for(long long i = 0; i < count; i++)
+        printf(" %7.2f", h_y[i]);
+    printf("\n");
+    
+    // allocation of memory on the GPU device
+    float * d_x;
+    float * d_y;
+    hipMalloc(&d_x, count * sizeof(float));
+    hipMalloc(&d_y, count * sizeof(float));
+    // hipMalloc((void**)&d_x, count * sizeof(float));
+    // hipMalloc((void**)&d_y, count * sizeof(float));
+
+    // copy the data from host memory to the device
+    hipMemcpy(d_x, h_x, count * sizeof(float), hipMemcpyHostToDevice);
+    hipMemcpy(d_y, h_y, count * sizeof(float), hipMemcpyHostToDevice);
+
+    // launch the kernel on the GPU
+    // hipLaunchKernelGGL( kernel_name, blocks_per_grid, threads_per_block, dyn_shmem_size, stream, kernel_parameters ... )
+    hipLaunchKernelGGL(add_vectors, 20, 128, 0, 0, d_x, d_y, 100, count);
+
+    // copy the result back to CPU memory
+    hipMemcpy(h_y, d_y, count * sizeof(float), hipMemcpyDeviceToHost);
+
+    // print the results
+    printf("Y:");
+    for(long long i = 0; i < count; i++)
+        printf(" %7.2f", h_y[i]);
+    printf("\n");
+
+    // free the allocated memory
+    hipFree(d_x);
+    hipFree(d_y);
+    delete[] h_x;
+    delete[] h_y;
+
+    return 0;
+}
+
diff --git a/11_rocprof_PSO/Makefile b/11_rocprof_PSO/Makefile
new file mode 100644
index 0000000000000000000000000000000000000000..50122f47759e0a90141eeb6494af972208092fbb
--- /dev/null
+++ b/11_rocprof_PSO/Makefile
@@ -0,0 +1,27 @@
+
+HIPRANDPATH=/opt/rocm/hiprand
+
+FLAGS=-g -O2 -I${HIPRANDPATH}/include
+
+ifeq (${HIP_PLATFORM},amd)
+	ROCRANDPATH=/opt/rocm/rocrand
+	FLAGS+=-I${ROCRANDPATH}/include
+endif
+
+
+
+.PHONY: compile run clean
+
+compile: PSO.x
+
+run:
+	./PSO.x
+
+clean:
+	rm -f *.x
+
+
+
+PSO.x: PSO.hip.cpp
+	hipcc ${FLAGS} ${HIPINCLUDE} $^ -o $@
+
diff --git a/11_rocprof_PSO/PSO.hip.cpp b/11_rocprof_PSO/PSO.hip.cpp
new file mode 100644
index 0000000000000000000000000000000000000000..c16031df39990325636411e0b7a759eb035f9086
--- /dev/null
+++ b/11_rocprof_PSO/PSO.hip.cpp
@@ -0,0 +1,333 @@
+#include <hip/hip_runtime.h>
+#include <cstdio>
+#include <cmath>
+#include <cstdlib>
+#include <cstring>
+#include <algorithm>
+#include <random>
+#include <hiprand.h>
+#include <hiprand_kernel.h>
+
+
+
+__global__ void InitValue(double *x, int count, double value)
+{
+    int tid = blockIdx.x * blockDim.x + threadIdx.x;
+    int tpg = blockDim.x * gridDim.x;
+
+    for(int i = tid; i < count; i += tpg)
+        x[i] = value;
+}
+
+
+__global__ void InitRandom(double *x, int count, int dim, int countPitch, double min, double max, unsigned long long seed)
+{
+    int tid = blockIdx.x * blockDim.x + threadIdx.x;
+    int tpg = blockDim.x * gridDim.x;
+
+    if(tid >= count)
+        return;
+
+    hiprandState state;
+    hiprand_init(seed, tid, 0, &state);
+    double range = max - min;
+    
+    for(int i = tid; i < count; i += tpg)
+    {
+        for(int d = 0; d < dim; d++)
+        {
+            x[d * countPitch + i] = hiprand_uniform_double(&state) * range + min;
+        }
+    }
+
+}
+
+
+template<int cf>
+__global__ void RenewPBest(double *pBestCosts, double *p, double *pBest, int count, int dim, int countPitch)
+{
+    int tid = blockIdx.x * blockDim.x + threadIdx.x;
+    int tpg = blockDim.x * gridDim.x;
+
+    for(int i = tid; i < count; i += tpg)
+    {
+        double cost = 0;
+
+        switch(cf) // cf is a template parameter, so this switch is resolved at compile time and only the selected branch is kept
+        {
+            case 1:
+                for(int d = 0; d < dim; d++)
+                    cost += p[d * countPitch + i] * p[d * countPitch + i];
+                break;
+            case 2:
+                for(int d = 1; d < dim; d++)
+                {
+                    double a = p[(d-1) * countPitch + i]  - p[d * countPitch + i];
+                    double b = 1 - p[(d-1) * countPitch + i];
+                    cost += 100 * a * a + b * b;
+                }
+                break;
+            case 3:
+                for(int d = 0; d < dim; d++)
+                    cost += p[d * countPitch + i] * p[d * countPitch + i] - 10 * cos(2 * M_PI * p[d * countPitch + i]);
+                break;
+            case 4:
+                for(int d = 0; d < dim; d++)
+                    cost -= p[d * countPitch + i] * sin(sqrt(abs(p[d * countPitch + i])));
+                break;
+        }
+
+        if(cost < pBestCosts[i])
+        {
+            pBestCosts[i] = cost;
+            for(int d = 0; d < dim; d++)
+                pBest[d * countPitch + i] = p[d * countPitch + i];
+        }
+    }
+}
+
+void RenewPBest(int costFuncNumber, double *results, double *p, double *pBest, int count, int dim, int countPitch, dim3 blocks, dim3 threads)
+{
+    switch(costFuncNumber)
+    {
+        case 1:
+            hipLaunchKernelGGL(HIP_KERNEL_NAME(RenewPBest<1>), blocks, threads, 0, 0, results, p, pBest, count, dim, countPitch);
+            break;
+        case 2:
+            hipLaunchKernelGGL(HIP_KERNEL_NAME(RenewPBest<2>), blocks, threads, 0, 0, results, p, pBest, count, dim, countPitch);
+            break;
+        case 3:
+            hipLaunchKernelGGL(HIP_KERNEL_NAME(RenewPBest<3>), blocks, threads, 0, 0, results, p, pBest, count, dim, countPitch);
+            break;
+        case 4:
+            hipLaunchKernelGGL(HIP_KERNEL_NAME(RenewPBest<4>), blocks, threads, 0, 0, results, p, pBest, count, dim, countPitch);
+            break;
+    }
+}
+
+__global__ void MinReductionIdx(int *resIdxs, double *values, int count)
+{
+    // launch with as many threads per block as possible, and as few blocks as possible (using the number of SMs works best)
+    // needs TPB*(sizeof(double)+sizeof(int)) bytes of dynamic shared memory
+    // data needs to be padded up to a multiple of TPB
+
+    extern __shared__ char shmem[];
+
+    int tid = blockIdx.x * blockDim.x + threadIdx.x;
+    int tpg = blockDim.x * gridDim.x;
+    int tpb = blockDim.x;
+    int idx = threadIdx.x;
+    
+    double *shmemVals = (double*)shmem;
+    int *shmemIdxs = (int*)(shmem + tpb * sizeof(double));
+
+    shmemVals[idx] = INFINITY;
+    shmemIdxs[idx] = -1;
+
+    for(int offset = tid; offset < count; offset += tpg)
+    {
+        if(values[offset] < shmemVals[idx])
+        {
+            shmemVals[idx] = values[offset];
+            shmemIdxs[idx] = offset;
+        }
+    }
+
+    __syncthreads();
+
+    int currThreadCount = tpb;
+    while(true)
+    {
+        currThreadCount >>= 1;
+
+        if(currThreadCount <= warpSize)
+            break;
+        if(idx >= currThreadCount) // the in-block thread index, not the global tid
+            return;
+        
+        if(shmemVals[idx + currThreadCount] < shmemVals[idx])
+        {
+            shmemVals[idx] = shmemVals[idx + currThreadCount];
+            shmemIdxs[idx] = shmemIdxs[idx + currThreadCount];
+        }
+
+        __syncthreads();
+    }
+    
+    // from here on, only threads of the first warp/wavefront do useful work; they run in lockstep,
+    // so no __syncthreads() is needed between the remaining steps
+    while(currThreadCount > 0)
+    {
+        if(idx < currThreadCount && shmemVals[idx + currThreadCount] < shmemVals[idx])
+        {
+            shmemVals[idx] = shmemVals[idx + currThreadCount];
+            shmemIdxs[idx] = shmemIdxs[idx + currThreadCount];
+        }
+
+        currThreadCount >>= 1;
+    }
+
+    if(idx == 0)
+        resIdxs[blockIdx.x] = shmemIdxs[0];
+}
+
+__global__ void Move(double *p, double *v, double *pBest, double *gBest, int count, int dim, int countPitch, double c1, double c2, double maxPos, double maxVel, unsigned long long seed)
+{
+    int tid = blockIdx.x * blockDim.x + threadIdx.x;
+    int tpg = blockDim.x * gridDim.x;
+    
+    hiprandState state;
+    hiprand_init(seed, tid, 0, &state);
+
+    for(int i = tid; i < count; i += tpg)
+    {
+        for(int d = 0; d < dim; d++)
+        {
+            int index = d * countPitch + i;
+
+            v[index] += c1 * hiprand_uniform_double(&state) * (pBest[index] - p[index]) + c2 * hiprand_uniform_double(&state) * (gBest[d] - p[index]);
+            p[index] += v[index];
+
+            if(v[index] > maxVel)
+                v[index] = maxVel;
+            if(v[index] < -maxVel)
+                v[index] = -maxVel;
+            if(p[index] > maxPos)
+                p[index] = maxPos;
+            if(p[index] < -maxPos)
+                p[index] = -maxPos;
+        }
+    }
+}
+
+
+int main(int argc, char** argv)
+{
+    if(argc > 1 && strcmp(argv[1], "-h") == 0)
+    {
+        printf("Usage: ./PSO.x costFuncNumber1-4 dimension particleCount iterationCount\n");
+        return 0;
+    }
+
+    int costFuncNumber = 2;
+    int dim = 3;
+    int count = 1000000;
+    int iterationCount = 20;
+
+    if(argc > 1 && argc <= 4)
+    {
+        fprintf(stderr, "Wrong number of arguments. Use -h for help\n");
+        return 1;
+    }
+
+    if(argc > 4)
+    {
+        costFuncNumber = atoi(argv[1]);
+        dim = atoi(argv[2]);
+        count = atoi(argv[3]);
+        iterationCount = atoi(argv[4]);
+    }
+
+    if(costFuncNumber < 1 || costFuncNumber > 4 || dim <= 0 || count <= 0 || iterationCount <= 0)
+    {
+        fprintf(stderr, "Incorrect parameter values\n");
+        return 1;
+    }
+
+
+
+    const int align = 512;
+    int countPitch = ((count + align - 1) / align) * align;
+    // data layout XXX0YYY0ZZZ0: coordinates are stored dimension-major, each dimension padded to countPitch elements
+    double *p;
+    double *v;
+    double *pBest;
+    hipMallocManaged(&p, countPitch * dim * sizeof(*p));
+    hipMallocManaged(&v, countPitch * dim * sizeof(*v));
+    hipMallocManaged(&pBest, countPitch * dim * sizeof(*pBest));
+    double *pBestCosts;
+    hipMallocManaged(&pBestCosts, countPitch * sizeof(*pBestCosts));
+    int bestPbestIdx;
+    double gBestCost = INFINITY;
+    double *gBest;
+    hipMallocManaged(&gBest, dim * sizeof(*gBest));
+    double c1 = 2, c2 = 2;
+    double bound;
+    if(costFuncNumber == 1) bound = 1;
+    else if(costFuncNumber == 2) bound = 2;
+    else if(costFuncNumber == 3) bound = 5;
+    else if(costFuncNumber == 4) bound = 500;
+    double maxVel = bound / 20;
+    std::mt19937 *random = new std::mt19937(12345);
+    std::uniform_int_distribution<unsigned long long> dist(0, ~0ULL);
+    int deviceId;
+    hipGetDevice(&deviceId);
+    hipDeviceProp_t props;
+    hipGetDeviceProperties(&props, deviceId);
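+    // as many threads per block as possible, one block per SM - what the MinReductionIdx kernel expects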
+    int tpb = 1024;
+    int bpg = props.multiProcessorCount;
+    int *reductionPerblockResults;
+    hipMallocManaged(&reductionPerblockResults, bpg * sizeof(*reductionPerblockResults));
+
+    hipMemPrefetchAsync(p, countPitch * dim * sizeof(*p), deviceId);
+    hipMemPrefetchAsync(v, countPitch * dim * sizeof(*v), deviceId);
+    hipMemPrefetchAsync(pBest, countPitch * dim * sizeof(*pBest), deviceId);
+    hipMemPrefetchAsync(pBestCosts, countPitch * sizeof(*pBestCosts), deviceId);
+    hipLaunchKernelGGL(InitRandom, bpg, tpb, 0, 0, p, count, dim, countPitch, -bound, bound, dist(*random));
+    hipLaunchKernelGGL(InitRandom, bpg, tpb, 0, 0, v, count, dim, countPitch, -maxVel / 2, maxVel / 2, dist(*random));
+    hipMemcpy(pBest, p, countPitch * dim * sizeof(double), hipMemcpyDeviceToDevice);
+    hipLaunchKernelGGL(InitValue, bpg, tpb, 0, 0, pBestCosts, countPitch, INFINITY);
+
+    
+    RenewPBest(costFuncNumber, pBestCosts, p, pBest, count, dim, countPitch, bpg, tpb);
+    MinReductionIdx<<<bpg, tpb, tpb * (sizeof(double) + sizeof(int))>>>(reductionPerblockResults, pBestCosts, count);
+    hipDeviceSynchronize();
+    // combine the per-block minima on the host
+    bestPbestIdx = reductionPerblockResults[0];
+    for(int i = 1; i < bpg; i++)
+    {
+        int candidate = reductionPerblockResults[i];
+        if(candidate >= 0 && pBestCosts[candidate] < pBestCosts[bestPbestIdx])
+            bestPbestIdx = candidate;
+    }
+    hipMemPrefetchAsync(gBest, dim * sizeof(*gBest), deviceId);
+    gBestCost = pBestCosts[bestPbestIdx];
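+    // gather the best particle's coordinates (strided by countPitch) into the contiguous gBest vector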
+    hipMemcpy2D(gBest, 1 * sizeof(double), pBest + bestPbestIdx, countPitch * sizeof(double), 1 * sizeof(double), dim, hipMemcpyDeviceToDevice);
+
+    for(int i = 0; i < iterationCount; i++)
+    {
+        hipLaunchKernelGGL(Move, bpg, tpb, 0, 0, p, v, pBest, gBest, count, dim, countPitch, c1, c2, bound, maxVel, dist(*random));
+        RenewPBest(costFuncNumber, pBestCosts, p, pBest, count, dim, countPitch, bpg, tpb);
+        MinReductionIdx<<<bpg, tpb, tpb * (sizeof(double) + sizeof(int))>>>(reductionPerblockResults, pBestCosts, count);
+        hipDeviceSynchronize();
+        bestPbestIdx = reductionPerblockResults[0];
+        for(int j = 1; j < bpg; j++)
+        {
+            int candidate = reductionPerblockResults[j];
+            if(candidate >= 0 && pBestCosts[candidate] < pBestCosts[bestPbestIdx])
+                bestPbestIdx = candidate;
+        }
+
+        if(pBestCosts[bestPbestIdx] < gBestCost)
+        {
+            gBestCost = pBestCosts[bestPbestIdx];
+            hipMemcpy2D(gBest, 1 * sizeof(double), pBest + bestPbestIdx, countPitch * sizeof(double), 1 * sizeof(double), dim, hipMemcpyDeviceToDevice);
+        }
+    }
+    hipDeviceSynchronize();
+
+    printf("Best found location:\n");
+    printf("  Location:");
+    for(int d = 0; d < dim; d++)
+        printf("    %18.12f", gBest[d]);
+    printf("\n");
+    printf("  Cost: %.12f\n", gBestCost);
+
+    hipFree(p);
+    hipFree(v);
+    hipFree(pBest);
+    hipFree(pBestCosts);
+    hipFree(gBest);
+    hipFree(reductionPerblockResults);
+    delete random;
+
+    return 0;
+}
diff --git a/11_rocprof_PSO/README.md b/11_rocprof_PSO/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..91cd0c7889b89038cff6fee30a1daa2960072d58
--- /dev/null
+++ b/11_rocprof_PSO/README.md
@@ -0,0 +1,11 @@
+
+rocprof example
+===============
+
+This directory contains a simple Particle Swarm Optimization (PSO) program, which we will profile using the `rocprof` profiler. First, compile the program using `make` and try running it with `./PSO.x`.
+
+To keep the profiling outputs separate from the program, I recommend changing into the `profiling` directory (`cd profiling`) and running the rocprof commands from there. To see which counters and metrics `rocprof` and the GPUs in the system support, run `rocprof --list-basic` for hardware counters, or `rocprof --list-derived` for derived metrics.
+
+To run basic profiling, use `rocprof --stats ../PSO.x`. The most useful information is then located in the `results.stats.csv` file. `results.csv` then contains more extensive information about each kernel launch. I recommend opening these files in Excel or using the "Edit csv" VSCode extension.
+
+To configure which counters and metrics will be measured, edit the `profile.txt` file and use the `-i profile.txt` option (the name of the file can be arbitrary). Each line starting with `pmc:` denotes one run of the application, and the rest of the line specifies the counters and/or metrics measured in that run (there are not enough hardware counters to measure everything at once). The `gpu: 0` line specifies that the counters will be measured on the GPU with index 0. The `kernel:` line specifies which kernels should be included in the profiling. To run the profiler with this input file, use the command `rocprof --stats -i profile.txt ../PSO.x`. `profile.stats.csv` will again contain basic summary information, and in `profile.csv` you can see all the measured counters and metrics in additional columns.
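+
+A typical profiling session might therefore look like this:
+```
+cd profiling
+rocprof --stats ../PSO.x                   # timing stats -> results.stats.csv, results.csv
+rocprof --stats -i profile.txt ../PSO.x    # counters/metrics -> profile.stats.csv, profile.csv
+```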
diff --git a/11_rocprof_PSO/profiling/profile.txt b/11_rocprof_PSO/profiling/profile.txt
new file mode 100644
index 0000000000000000000000000000000000000000..8a254d198950bc135066b46eece088f4f58e2740
--- /dev/null
+++ b/11_rocprof_PSO/profiling/profile.txt
@@ -0,0 +1,11 @@
+# perf counters group 1
+pmc: TCC_HIT_sum, TCC_MISS_sum
+
+# perf counters group 2
+pmc: GPUBusy, VALUBusy, LDSBankConflict
+
+# GPU indexes
+gpu: 0
+
+# kernel names (prefixes)
+kernel: MinReductionIdx Move RenewPBest
diff --git a/README.md b/README.md
index 33410a3fa829a6919b59e676087e8f2d84330432..dbe8c4d4dded0b6ae476431e4da5caf966bbae28 100644
--- a/README.md
+++ b/README.md
@@ -1,2 +1,11 @@
-# lecture_hip
-materials for a lecture on programming in HIP
+
+Example codes
+=============
+
+Hello HIP world!
+
+This folder contains the examples accompanying the HIP presentation. Each folder contains one example and its own README.md describing it.
+
+The main purpose of the attached Makefiles is to document the compilation/hipification commands and to serve as a shortcut for typing them out manually. Real Makefiles would look different. **Do not use hipify tools in Makefiles or any other automatic build scripts.**
+
+The examples are designed for HIP-4.3.1, and whenever we refer to "now", we mean November 2021.