diff --git a/01_vector_add/README.md b/01_vector_add/README.md
index 997211aa0fe7c1f8932c9729483d0b86d15a4ea0..d171630cf0c13010c12223e1461d14bdacc6a95f 100644
--- a/01_vector_add/README.md
+++ b/01_vector_add/README.md
@@ -2,9 +2,9 @@
 HIP example: vector add (saxpy)
 ===============================
 
-This program generates two arrays of numbers and performs the classic saxpy operation on them, Y = Y + a * X, using HIP.
+This program generates two arrays of numbers and performs the classic saxpy operation on them, $y = y + ax$, using HIP.
 
-At the top of the file we `#include` the `hip_runtime.h` header file containing declarations of all the HIP-related functions and macros. In the `main` function, we first allocate the data arrays on the host (CPU memory), initialize and print them. Then we allocate the memory on the GPU device using the `hipMalloc` function and copy the data from host to device using the `hipMemcpy` function. Next, we call the computational kernel performing the saxpy operation using the `hipLaunchKernelGGL` function, after which the results are copied back to the host memory using `hipMemcpy`. Finally we print the results and free the allocated memory using `hipFree` for device-side arrays, and classic `delete[]` for host-side buffers.
+At the top of the file we `#include` the `hip/hip_runtime.h` header file, which contains declarations of all the HIP-related functions and macros. In the `main` function, we first allocate the data arrays on the host (CPU memory), initialize them, and print them. Then we allocate memory on the GPU device using the `hipMalloc` function and copy the data from host to device using the `hipMemcpy` function. Next, we launch the computational kernel performing the saxpy operation using the `<<<>>>` syntax (the `hipLaunchKernelGGL` function is available as an alternative). After that, the results are copied back to host memory using `hipMemcpy`. Finally, we print the results and free the allocated memory, using `hipFree` for the device-side arrays and classic `delete[]` for the host-side buffers.
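+
+For reference, the two launch forms are interchangeable; a minimal sketch mirroring `vector_add.hip.cpp` (with an illustrative launch configuration of `bpg` blocks of `tpb` threads):
+```
+add_vectors<<< bpg, tpb >>>(d_x, d_y, 100, count);
+// the same launch expressed through the API function:
+hipLaunchKernelGGL(add_vectors, bpg, tpb, 0, 0, d_x, d_y, 100, count);
+```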
 
 
 
diff --git a/01_vector_add/vector_add.hip.cpp b/01_vector_add/vector_add.hip.cpp
index 2727d734d47363c88409b8a3593c9a27713e822d..5f639eee993a9de0075b5107ec6265d3b80c8f1b 100644
--- a/01_vector_add/vector_add.hip.cpp
+++ b/01_vector_add/vector_add.hip.cpp
@@ -5,11 +5,10 @@
 
 __global__ void add_vectors(float * x, float * y, float alpha, int count)
 {
-    // loop through all the elements in the vector
-    for(long long idx = blockIdx.x * blockDim.x + threadIdx.x; idx < count; idx += blockDim.x * gridDim.x)
-    {
+    long long idx = blockIdx.x * blockDim.x + threadIdx.x;
+
+    if(idx < count)
         y[idx] += alpha * x[idx];
-    }
 }
 
 
@@ -26,7 +25,7 @@ int main()
     {
         h_x[i] = i;
         h_y[i] = 10 * i;
-    }    
+    }
 
     // print the input data
     printf("X:");
@@ -48,9 +47,11 @@ int main()
     hipMemcpy(d_x, h_x, count * sizeof(float), hipMemcpyHostToDevice);
     hipMemcpy(d_y, h_y, count * sizeof(float), hipMemcpyHostToDevice);
 
+    int tpb = 256;                     // threads per block
+    int bpg = (count - 1) / tpb + 1;   // blocks per grid, rounded up so every element is covered
     // launch the kernel on the GPU
-    add_vectors<<< 20,128 >>>(d_x, d_y, 100, count);
-    // hipLaunchKernel(add_vectors, 20, 128, 0, 0, d_x, d_y, 100, count);
+    add_vectors<<< bpg, tpb >>>(d_x, d_y, 100, count);
+    // hipLaunchKernelGGL(add_vectors, bpg, tpb, 0, 0, d_x, d_y, 100, count);
 
     // copy the result back to CPU memory
     hipMemcpy(h_y, d_y, count * sizeof(float), hipMemcpyDeviceToHost);
@@ -69,4 +70,3 @@ int main()
 
     return 0;
 }
-
diff --git a/02_complicated_reduction/Makefile b/02_complicated_reduction/Makefile
index 84ad9f842909aca58bdd3e28f2a193b13f709f36..a4ac16a4d1283d48a53a63d2b2d73497e2a04cf9 100644
--- a/02_complicated_reduction/Makefile
+++ b/02_complicated_reduction/Makefile
@@ -11,6 +11,3 @@ clean:
 
 run:
 	./reduction.x
-
-
-
diff --git a/02_complicated_reduction/README.md b/02_complicated_reduction/README.md
index 1d642bc028eae8f5b0a8d3a4ac49baf49ff28171..24bf736538d5436f9defe9ef4b7d49964210fdd4 100644
--- a/02_complicated_reduction/README.md
+++ b/02_complicated_reduction/README.md
@@ -6,9 +6,9 @@ This example demonstrates as many of the fundamental and advanced features of HI
 
 It allocates an array of numbers, on which it performs the following operation. It calculates all the differences `data[i+1]-data[i]`, wrapping around if necessary, and calculates sine of that difference (multiplied by some constant). The results are then all added together, producing the final result.
 
-Let us examine the computation kernel `calculate` in more detail. We create a **dynamic shared memory** array to later store the data. First thread of each block prints, that the threadblock has started execution using the **printf function in the kernel**. Then the threadblock loops through all the data. The required data are copied from the global memory into the shared memory array, followed by **synchronization of all threads within the threadblock**. Then the difference is calculated, along with its sine using `sinf`, one of the **math functions**. The input and output are scaled using values from a **constant memory** variable. The values are then summed within a warp using **warp shuffle functions**, while keeping the code **wave-aware** (functional across multiple possible warp sizes). Finally, the first lane of each warp contributes the warp's partial sum into the global result using **atomic add**. At the end of the kernel, a message about finished execution of the threadblock is again displayed using `printf`.
+Let us examine the computation kernel `calculate` in more detail. We create a **dynamic shared memory** array to later store the data. The first thread of each block prints that the threadblock has started execution, using the **printf function in the kernel**. Then the threadblocks loop through all the data. The required data are copied from global memory into the shared memory array, followed by **synchronization of all threads within the threadblock**. Then the difference is calculated, along with its sine using `sinf`, one of the **math functions**. The input and output are scaled using values from a **constant memory** variable. The values are then summed within a warp using **warp shuffle functions**, while keeping the code **wave-aware** (functional across multiple possible warp sizes). Finally, the first lane of each warp contributes the warp's partial sum to the global result using **atomic add**. At the end of the kernel, a message about the finished execution of the threadblock is again displayed using `printf`.
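+
+As a sketch, the dynamic shared memory pattern looks as follows (the kernel body is abbreviated; `bpg`, `tpb`, `shmem_bytes` and `stream` are illustrative):
+```
+__global__ void calculate(float * data, int count, float * result)
+{
+    // the size of this array is only known at launch time
+    extern __shared__ float shmem_data[];
+    // ... kernel body ...
+}
+
+// the third launch parameter is the dynamic shared memory size in bytes
+calculate<<< bpg, tpb, shmem_bytes, stream >>>(data, count, result);
+```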
 
-In the `main` function we allocate **managed memory** array using `hipMallocManaged` function and randomly initialize it on the host (using the CPU). We create a **stream**, which we will use to execute operations asynchronously. The summation result scalar has to be allocated in the device memory, to which we asynchronously copy value zero. Then we **initialize the constant memory** variable by asynchronously copying to it using the `hipMemcpyToSymbolAsync` function (note the `HIP_SYMBOL` macro around the constant variable name). Then the **computational kernel is launched asynchronously** in the created stream, passing the desired dynamic shared memory size as a parameter to the `hipLaunchKernelGGL` function. Then we submit an **asynchronous copy** of the resulting sum back to the CPU memory. We **synchronize with the stream** (wait for all the operations in the stream to finish) using the `hipStreamSynchronize` function, after which we destroy the stream, free the allocated memory and print the final result (which should be around 9733.31).
+In the `main` function we allocate a **managed memory** array using the `hipMallocManaged` function and randomly initialize it on the host (using the CPU). We create a **stream**, which we will use to execute operations asynchronously. The summation result scalar has to be allocated in device memory, and we asynchronously copy a zero value into it. Then we **initialize the constant memory** variable by asynchronously copying to it using the `hipMemcpyToSymbolAsync` function (note the `HIP_SYMBOL` macro around the constant variable name). Then the **computational kernel is launched asynchronously** in the created stream using the `<<<>>>` syntax, providing the desired dynamic shared memory size. Then we submit an **asynchronous copy** of the resulting sum back to CPU memory. We **synchronize with the stream** (wait for all the operations in the stream to finish) using the `hipStreamSynchronize` function, after which we destroy the stream, free the allocated memory, and print the final result (which should be around 9733.323, but can vary due to floating-point inaccuracies).
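+
+A condensed sketch of that host-side sequence (variable names are illustrative, error checking is omitted, and `params` stands for the constant memory variable):
+```
+hipStream_t stream;
+hipStreamCreate(&stream);
+
+// asynchronously initialize the constant memory variable
+hipMemcpyToSymbolAsync(HIP_SYMBOL(params), &h_params, sizeof(Parameters), 0, hipMemcpyHostToDevice, stream);
+
+// asynchronous kernel launch in the stream, with dynamic shared memory
+calculate<<< bpg, tpb, shmem_bytes, stream >>>(data, count, d_result);
+
+// asynchronously copy the result back, then wait for the stream to finish
+hipMemcpyAsync(&h_result, d_result, sizeof(float), hipMemcpyDeviceToHost, stream);
+hipStreamSynchronize(stream);
+hipStreamDestroy(stream);
+```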
 
 
 
diff --git a/02_complicated_reduction/reduction.hip.cpp b/02_complicated_reduction/reduction.hip.cpp
index cc53f44ff4e6e8d1c67307da83954ad58882e241..2359bd34a7e3ca8a69c64913b9d8fc9b10955641 100644
--- a/02_complicated_reduction/reduction.hip.cpp
+++ b/02_complicated_reduction/reduction.hip.cpp
@@ -2,16 +2,6 @@
 #include <cstdlib>
 #include <hip/hip_runtime.h>
 
-//   shmem (normal and dynamic)
-//   constant
-//   unified
-//   math functions
-//   __syncthread()
-//   streams
-//   warp shuffle
-//   printf
-//   templated kernel
-
 
 
 struct Parameters
@@ -32,6 +22,7 @@ __global__ void calculate(float * data, int count, float * result)
     // dynamic shared memory
     extern __shared__ float shmem_data[];
 
+    // printf in kernel
     if(threadIdx.x == 0)
         printf("Block %3d started\n", (int)blockIdx.x);
 
@@ -58,9 +49,11 @@ __global__ void calculate(float * data, int count, float * result)
         for(int i = warpSize >> 1; i > 0; i >>= 1)
             warp_sum += __shfl_down(warp_sum, i);
 
-        // atomic operations (although slow on float and double)
+        // atomic operations
         if(lane == 0)
             atomicAdd(result, warp_sum);
+
+        // wait until all warps are done with shmem_data before the next iteration overwrites it
+        __syncthreads();
     }
     
     if(threadIdx.x == 0)
diff --git a/03_hipblas/Makefile b/03_hipblas/Makefile
index 79fca5879314de055f2e4fc5f0db7bdcdae81354..5e7d31a7f02f4d59cf1fe5a96dcb3b235a601685 100644
--- a/03_hipblas/Makefile
+++ b/03_hipblas/Makefile
@@ -1,5 +1,6 @@
 
-HIPBLASPATH=${HOME}/apps/hipBLAS/installation
+HIPBLASPATH=/opt/rocm/hipblas
+#HIPBLASPATH=${HOME}/apps/hipBLAS/installation
 
 .PHONY: compile clean run
 
@@ -13,6 +14,3 @@ clean:
 
 run:
 	./hipblas.x
-
-
-
diff --git a/03_hipblas/README.md b/03_hipblas/README.md
index e749098375b8cc0b0ab114c09a4dee99f97666dd..2105a5e987a9b66e794c693429c61cd7284721dd 100644
--- a/03_hipblas/README.md
+++ b/03_hipblas/README.md
@@ -18,7 +18,7 @@ Using default instalation of HIP and hipBLAS this would be
 ```
 hipcc -I/opt/rocm/hipblas/include hipblas.hip.cpp -o hipblas.x -L/opt/rocm/hipblas/lib -lhipblas
 ```
-or just
+or just (due to symlinks)
 ```
 hipcc -I/opt/rocm/include hipblas.hip.cpp -o hipblas.x -L/opt/rocm/lib -lhipblas
 ```
diff --git a/04_hipsolver/Makefile b/04_hipsolver/Makefile
index 68182f18143afcbe5784df300d83dffa6002875c..40e3f98309a4427996cde295d63e604e4d2ca311 100644
--- a/04_hipsolver/Makefile
+++ b/04_hipsolver/Makefile
@@ -1,6 +1,8 @@
 
-HIPSOLVERPATH=${HOME}/apps/hipSOLVER/installation
-HIPBLASPATH=${HOME}/apps/hipBLAS/installation
+HIPSOLVERPATH=/opt/rocm/hipsolver
+HIPBLASPATH=/opt/rocm/hipblas
+#HIPSOLVERPATH=${HOME}/apps/hipSOLVER/installation
+#HIPBLASPATH=${HOME}/apps/hipBLAS/installation
 
 .PHONY: compile clean run
 
diff --git a/04_hipsolver/README.md b/04_hipsolver/README.md
index 6d4981dbb3802b28c31bb394a70ae8a3a49979bc..e6d8579f100ff419e8b6a6ce0ca18b3ebc79e6c7 100644
--- a/04_hipsolver/README.md
+++ b/04_hipsolver/README.md
@@ -2,13 +2,13 @@
 HIP example: hipSOLVER library
 ==============================
 
-In this example we demonstrate how to solve a simple dense system of linear equations `Ax=b` using the hipSOLVER library, specifically the `trf-trs` approach.
+In this example we demonstrate how to solve a simple dense system of linear equations $Ax=b$ using the hipSOLVER library, specifically the `trf-trs` approach.
 
-In the source code, we first create and randomly initialize the matrix `A`, right-hand-side vector `b`, and solution vector `x`. Then we allocate the memory space on the device for the matrix `A`, vectors `b` and `x`, vector containing the pivoting and variable info; and copy the matrix `A` and vector `b` to the device.
+In the source code, we first create and randomly initialize the matrix $A$, the right-hand-side vector $b$, and the solution vector $x$. Then we allocate the memory space on the device for the matrix $A$, the vectors $b$ and $x$, the vector containing the pivot indices, and the info variable; and copy the matrix $A$ and vector $b$ to the device.
 
-Then we start preparing for solving the system and create the hipSOLVER handle. The functions performing the factorization and solving require additional auxiliary workspace memory buffers, we therefore ask how much memory they need (using the `hipsolverSgetrf_bufferSize` and `hipsolverSgetrs_bufferSize` functions) and allocate it. Then we perform the triangular factorization with partial pivoting of the matrix `A` using `hipsolverSgetrf`. The factorized matrix is then used in the function `hipsolverSgetrs` to solve the system. The solution vector is copied to the host and printed, after which we free the workspace memory and destroy the hipSOLVER handle.
+Then we start preparing for solving the system and create the hipSOLVER handle. The functions performing the factorization and the solve require additional auxiliary workspace memory buffers; we therefore ask how much memory they need (using the `hipsolverSgetrf_bufferSize` and `hipsolverSgetrs_bufferSize` functions) and allocate it. Then we perform the triangular factorization with partial pivoting of the matrix $A$ using `hipsolverSgetrf`. The factorized matrix is then used in the function `hipsolverSgetrs` to solve the system. The solution vector is copied to the host and printed, after which we free the workspace memory and destroy the hipSOLVER handle.
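+
+A condensed sketch of the `trf-trs` sequence (assuming the hipSOLVER compatibility API, which passes the workspace and its size explicitly; names are illustrative and error checking is omitted):
+```
+hipsolverHandle_t handle;
+hipsolverCreate(&handle);
+
+// query the required workspace sizes and allocate one buffer large enough for both
+int lwork_trf, lwork_trs;
+hipsolverSgetrf_bufferSize(handle, n, n, d_A, n, &lwork_trf);
+hipsolverSgetrs_bufferSize(handle, HIPSOLVER_OP_N, n, 1, d_A, n, d_ipiv, d_b, n, &lwork_trs);
+int lwork = (lwork_trf > lwork_trs) ? lwork_trf : lwork_trs;
+float * d_work;
+hipMalloc(&d_work, lwork * sizeof(float));
+
+// LU factorization with partial pivoting, then the solve; the solution overwrites d_b
+hipsolverSgetrf(handle, n, n, d_A, n, d_work, lwork, d_ipiv, d_info);
+hipsolverSgetrs(handle, HIPSOLVER_OP_N, n, 1, d_A, n, d_ipiv, d_b, n, d_work, lwork, d_info);
+
+hipsolverDestroy(handle);
+```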
 
-To check if the calculations were correct, we multiply the matrix `A` with the solution vector `x`, which should yield vector `b`. For this we use the hipBLAS library, starting with creating the hipBLAS handle. Because the triangular factorization function modified the matrix `A` in the memory, we need to copy it from host to device again. Then we perform the matrix-vector multiplication using the `hipblasSgemv` function, wait for it to finish, destroy the hipBLAS handle, copy the result vector to host and print it. In the end we free all the allocated memory.
+To check that the calculations were correct, we multiply the matrix $A$ by the solution vector $x$, which should yield the vector $b$. For this we use the hipBLAS library, starting by creating the hipBLAS handle. Because the triangular factorization function modified the matrix $A$ in the device memory, we need to copy it from host to device again. Then we perform the matrix-vector multiplication using the `hipblasSgemv` function, wait for it to finish, destroy the hipBLAS handle, copy the result vector to the host, and print it. In the end we free all the allocated memory.
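+
+The verification gemv could be sketched like this (assuming the default host pointer mode of hipBLAS, so `one` and `zero` live on the host; names are illustrative):
+```
+hipblasHandle_t blas_handle;
+hipblasCreate(&blas_handle);
+
+float one = 1.0f, zero = 0.0f;
+// d_result = 1.0 * A * x + 0.0 * d_result
+hipblasSgemv(blas_handle, HIPBLAS_OP_N, n, n, &one, d_A, n, d_x, 1, &zero, d_result, 1);
+
+hipblasDestroy(blas_handle);
+```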
 
 
 
@@ -18,7 +18,7 @@ Compilation is very similar to the hipBLAS example, only the `-I`, `-L` and `-l`
 ```
 hipcc -g -O2 -I/path/to/hipblas/include -I/path/to/hipsolver/include hipsolver.hip.cpp -o hipsolver.x -L/path/to/hipblas/lib -lhipblas -L/path/to/hipsolver/lib -lhipsolver
 ```
-or just
+or, with a default ROCm installation, simply
 ```
 hipcc -I/opt/rocm/include hipsolver.hip.cpp -o hipsolver.x -L/opt/rocm/lib -lhipblas -lhipsolver
 ```
diff --git a/05_hipify_vector_add/Makefile b/05_hipify_vector_add/Makefile
index 6a31dfeefdfdb677c850504104434daab006c21b..194cef22b664023e2e6efbc0a9b0a08404d669be 100644
--- a/05_hipify_vector_add/Makefile
+++ b/05_hipify_vector_add/Makefile
@@ -1,10 +1,10 @@
 
-.PHONY: default cuda hip hipifyclang hipifyperl runcuda runhip clean
+.PHONY: default cuda hip hipify hipifyclang hipifyperl run runcuda runhip clean
 
 
 
 default:
-	@echo "Use the following targets: cuda hip hipifyclang hipifyperl runcuda runhip clean"
+	@echo "Use the following targets: cuda hip hipify hipifyclang hipifyperl run runcuda runhip clean"
 
 clean:
 	rm -f *.x *.hip.cpp
@@ -17,12 +17,16 @@ cuda:
 hip:
 	hipcc -g -O2 vector_add.hip.cpp -o vector_add.hip.x
 
+hipify: hipifyperl
+
 hipifyclang:
 	hipify-clang vector_add.cu -o vector_add.hip.cpp
 
 hipifyperl:
 	hipify-perl vector_add.cu -o vector_add.hip.cpp
 
+run: runhip
+
 runcuda:
 	./vector_add.cuda.x
 
diff --git a/05_hipify_vector_add/README.md b/05_hipify_vector_add/README.md
index 05fb398a66901754843ecc85376c2d830317496b..9f309acfd2f43a46ef1514dc45894b569ac8acc0 100644
--- a/05_hipify_vector_add/README.md
+++ b/05_hipify_vector_add/README.md
@@ -4,7 +4,7 @@ HIPIFY example: vector add
 
 This example demonstrates basic usage of the HIPIFY tool.
 
-The source `vector_add.cu` contains CUDA code performing the classic saxpy operation. It is possible to convert this CUDA source code to HIP source code (to hipify) using one of the following commands
+The source `vector_add.cu` contains CUDA code performing the classic saxpy operation. It is possible to convert (hipify) this CUDA source code to HIP source code using one of the following commands
 ```
 hipify-perl vector_add.cu -o vector_add.hip.cpp
 ```
@@ -21,6 +21,3 @@ and run
 ```
 ./vector_add.hip.x
 ```
-
-
-
diff --git a/05_hipify_vector_add/vector_add.cu b/05_hipify_vector_add/vector_add.cu
index 24b27f73a14c3af0ba1426321b419607ab7b888b..54faa005383aeee53861368ab7fbaf34faa3eda2 100644
--- a/05_hipify_vector_add/vector_add.cu
+++ b/05_hipify_vector_add/vector_add.cu
@@ -5,10 +5,10 @@
 
 __global__ void saxpy(float * x, float * y, float a, int count)
 {
-    for(long long idx = blockIdx.x * blockDim.x + threadIdx.x; idx < count; idx += blockDim.x * gridDim.x)
-    {
+    long long idx = blockIdx.x * blockDim.x + threadIdx.x;
+
+    if(idx < count)
         y[idx] += a * x[idx];
-    }
 }
 
 
@@ -44,7 +44,9 @@ int main()
     cudaMemcpy(d_x, h_x.data(), size, cudaMemcpyHostToDevice);
     cudaMemcpy(d_y, h_y.data(), size, cudaMemcpyHostToDevice);
 
-    saxpy<<< 64, 1024 >>>(d_x, d_y, a, count);
+    int tpb = 256;
+    int bpg = (count - 1) / tpb + 1;
+    saxpy<<< bpg, tpb >>>(d_x, d_y, a, count);
 
     cudaMemcpy(h_y.data(), d_y, size, cudaMemcpyDeviceToHost);
 
@@ -58,4 +60,3 @@ int main()
 
     return 0;
 }
-
diff --git a/06_hipify_warpshuffle/Makefile b/06_hipify_warpshuffle/Makefile
index 82858f8d028074dda75ffb17ec1e302e7ae5b5e9..c52da51da7b3d3917fc1c3166326f1da37ca80f6 100644
--- a/06_hipify_warpshuffle/Makefile
+++ b/06_hipify_warpshuffle/Makefile
@@ -1,10 +1,10 @@
 
-.PHONY: default cuda hip hipifyclang hipifyperl runcuda runhip clean
+.PHONY: default cuda hip hipify hipifyclang hipifyperl run runcuda runhip clean
 
 
 
 default:
-	@echo "Use the following targets: cuda hip hipifyclang hipifyperl runcuda runhip clean"
+	@echo "Use the following targets: cuda hip hipify hipifyclang hipifyperl run runcuda runhip clean"
 
 clean:
 	rm -f *.x *.hip.cpp
@@ -17,12 +17,16 @@ cuda:
 hip:
 	hipcc -g -O2 warpshuffle.hip.cpp -o warpshuffle.hip.x
 
+hipify: hipifyperl
+
 hipifyclang:
 	hipify-clang warpshuffle.cu -o warpshuffle.hip.cpp
 
 hipifyperl:
 	hipify-perl warpshuffle.cu -o warpshuffle.hip.cpp
 
+run: runhip
+
 runcuda:
 	./warpshuffle.cuda.x
 
diff --git a/06_hipify_warpshuffle/README.md b/06_hipify_warpshuffle/README.md
index eabb2e449f228725c84838f1d60c9efa4460a35a..924a536df82e22688f35b273130665cc20827c43 100644
--- a/06_hipify_warpshuffle/README.md
+++ b/06_hipify_warpshuffle/README.md
@@ -8,9 +8,9 @@ The CUDA code creates an array of random integers and calculates their sum using
 
 Hipifying the CUDA source using either `hipify-perl` or `hipify-clang` yields a warning that the `__shfl_down_sync` function is unsupported in HIP. We can fix that by replacing it with `__shfl_down` and omitting the first parameter (`0xffffffff`). Such a function does exist in HIP; the hipification was just not performed automatically (probably for safety and correctness reasons, because of the extra argument). **We are modifying the hipified code, not the original.**
 
-Modifying the warp shuffle function call syntax, we remember, that the warp size on NVIDIA hardware is 32, but on AMD it is 64. The current code would therefore yield incorrect results on AMD GPUs. But we cannot add an additional `__shfl_down(my_sum, 32)` call, because then it might not work on NVIDIA hardware. (We will ignore the fact that the third optional argument of the `__shfl_down` function is a width.) We need to write wave-aware code, which behaves correctly no matter the warp size. We replace all the `__shfl_down` function calls with the following loop
+Modifying the warp shuffle function call syntax, we remember that the warp size on NVIDIA hardware is 32, while on AMD it is 64. The current code would therefore yield incorrect results on AMD GPUs. But we cannot just add an additional `__shfl_down(my_sum, 32)` call, because then it might not work on NVIDIA hardware. (We will ignore the fact that the third optional argument of the `__shfl_down` function is a width, for which we could use 32.) We need to write wave-aware code, which behaves correctly no matter the warp size. We replace all the `__shfl_down` function calls with the following loop
 ```
-for(int i = warpSize >> 1; i > 0; i >>= 1)
+for(int i = warpSize / 2; i > 0; i /= 2)
     my_sum += __shfl_down(my_sum, i);
 ```
 We could template the kernel with warp size and unroll this loop, but this is not necessary.
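+
+Such a templated variant could be sketched as follows (illustrative, not part of the example; it assumes an integer sum as in the source):
+```
+template<int warp_size>
+__device__ int warp_reduce_sum(int my_sum)
+{
+    #pragma unroll
+    for(int i = warp_size / 2; i > 0; i /= 2)
+        my_sum += __shfl_down(my_sum, i);
+    return my_sum;
+}
+```
+Dispatching to the right instantiation (e.g. `warp_reduce_sum<64>` on current AMD hardware) then requires compile-time knowledge of the warp size, for example via a platform `#ifdef`.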
@@ -19,6 +19,6 @@ Compiling the code using
 ```
 hipcc warpshuffle.hip.cpp -o warpshuffle.hip.x
 ```
-produces an error, `cannot convert int** to void**` in the `hipHostAlloc` function. The solution could be to just cast the double pointer to `void**`, but this is not very convenient. Anyway, the `hipHostAlloc` and `hipMallocHost` functions are marked as deprecated. The currently correct way of allocating host-side page-locked memory in HIP is to use the `hipHostMalloc` function. Therefore we just change the name of the function from `hipHostAlloc` to `hipHostMalloc`. Compilation is now successfull and running the HIP program produces the same results as the CUDA program.
+produces an error about the `hipHostAlloc` function. The solution could be to just cast the double pointer to `void**`, but this is not very convenient. The `hipHostAlloc` and `hipMallocHost` functions are marked as deprecated anyway. The currently correct way of allocating host-side pinned memory in HIP is to use the `hipHostMalloc` function. Therefore we just change the name of the function from `hipHostAlloc` to `hipHostMalloc`. Compilation is now successful and running the HIP program produces the same results as the CUDA program.
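+
+The fix is then a pure rename (a sketch; `count` is illustrative):
+```
+int * h_data;
+// hipHostAlloc((void**)&h_data, count * sizeof(int), 0);   // deprecated, needs the cast
+hipHostMalloc(&h_data, count * sizeof(int));                // current API, no cast needed
+hipHostFree(h_data);
+```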
 
 An interesting thing to note is that if we did not fix the problem with the warp shuffle functions and kept `__shfl_down_sync` in the code, it would still compile and run fine on the NVIDIA platform. This is because on the NVIDIA platform `hipcc` internally calls `nvcc`, which understands the `__shfl_down_sync` function without any problem. Anyway, `__shfl_down` is just a wrapper around `__shfl_down_sync` on the NVIDIA platform. **Be careful. If it runs on the NVIDIA platform, it does not mean it will also run on AMD. Test on both platforms.**
diff --git a/07_hipify_multisource/Makefile b/07_hipify_multisource/Makefile
index 7d087518ea0352c253ff5474e1b70427d5b61371..9499a0290aaa34ad98cdf98e8f26039b4e954cec 100644
--- a/07_hipify_multisource/Makefile
+++ b/07_hipify_multisource/Makefile
@@ -1,14 +1,14 @@
 
-.PHONY: default cuda hip hipifyclang hipifyperl runcuda runhip clean
+.PHONY: default cuda hip hipify hipifyclang hipifyperl run runcuda runhip clean
 
 
 
 default:
-	@echo "Use the following targets: cuda hip hipifyclang hipifyperl runcuda runhip clean"
+	@echo "Use the following targets: cuda hip hipify hipifyclang hipifyperl run runcuda runhip clean"
 
 clean:
 	cd project_cuda   &&   make clean
-	rm -r project_hip
+	rm -rf project_hip
 
 
 
@@ -18,12 +18,16 @@ cuda:
 hip:
 	cd project_hip   &&   make
 
+hipify: hipifyperl
+
 hipifyclang:
 	cp -r project_cuda project_hip   &&   hipconvertinplace.sh project_hip
 
 hipifyperl:
 	cp -r project_cuda project_hip   &&   hipconvertinplace-perl.sh project_hip
 
+run: runhip
+
 runcuda:
 	cd project_cuda   &&   make run
 
diff --git a/07_hipify_multisource/README.md b/07_hipify_multisource/README.md
index 6ad445ee9737d42c1ad3351323b5b1007d944a7e..477c3853388a4f72723b530cac7b3f99493e767a 100644
--- a/07_hipify_multisource/README.md
+++ b/07_hipify_multisource/README.md
@@ -2,26 +2,26 @@
 HIPIFY example: whole directory hipification
 ============================================
 
-In the `src_cuda` directory there is a somewhat larger project consisting of multiple CUDA and c++ source files, along with a Makefile used for compilation of this project. We would like to convert this whole project from CUDA to HIP, for which we use HIPIFY. We do not need to hipify each source file individually, instead we can use `hipconvertinplace.sh` or `hipconvertinplace-perl.sh` tools, which can recursively hipify whole directory in-place.
+In the `project_cuda` directory there is a slightly larger project consisting of multiple CUDA and c++ source files, along with a Makefile used for compilation of this project. We would like to convert this whole project from CUDA to HIP, for which we use HIPIFY. We do not need to hipify each source file individually; instead we can use the `hipconvertinplace.sh` or `hipconvertinplace-perl.sh` tools, which can recursively hipify a whole directory in-place.
 
-These tools to not have an "input" and "output" directory, they can hipify in-place only (currently). We therefore make a copy of the whole project, e.g.
+These tools do not have an "input" and an "output" directory; they can (currently) only hipify in-place, so there is just one directory, serving as both input and output. We therefore make a copy of the whole project, e.g.
 ```
 cp -r project_cuda project_hip
 ```
-After that the copied sources can be hipified
+After that, the copied sources can be hipified
 ```
 hipconvertinplace.sh project_hip
 ```
 The script prints some info and exits without any errors.
 
-The directory `project_hip` now contains HIP sources. The Makefile, however, was not hipified, so we have to modify it ourselves, in this example it is sufficent to just replace `nvcc` with `hipcc`. We can change to the `project_hip` directory and try to compile the project,
+The directory `project_hip` now contains HIP sources. The Makefile, however, was not hipified, so we have to modify it ourselves; in this example it is sufficient to just replace `nvcc` with `hipcc`. We can `cd` to the `project_hip` directory and try to compile the project,
 ```
 cd project_hip
 make
 ```
-We get an error about incompatible `int**` and `void**` in the `hipHostAlloc` function, so we just replace it with `hipHostMalloc` as in the previous example. We try to compile it again, which is successfull, and the results are also correct.
+We get an error about the `hipHostAlloc` function, so we just replace it with `hipHostMalloc` as in the previous examples. We compile again, this time successfully, and the results are also correct.
 
-Let us, however, examine the hipified source code in more detail. Looking into the `reduction.cu` file containing a sum-reduction kernel, we can again notice an assumption about the warp size that is hard-coded into the source. But the result was correct, even on AMD GPU! In this specific example, the assumption of `warpSize=32` did not cause any problem, since we were on a device with warp size of 32 or 64. But if there was a device with warp size 16, the result probably would not be correct. The final 6 additions are actually just an optimization -- if only one warp is executing, we do not need to perform synchronization. There is hard-coded warpSize 32, but having a warpSize 64 actually just lowers the performance a bit, performing one unnecessary `__syncthreads()`. Either way, for correctnes we should replace the `32`s with `warpSize` and somehow handle the remainder loop, resulting in e.g.
+Let us, however, examine the hipified source code in more detail. Looking into the `reduction.cu` file containing a sum-reduction kernel, we can again notice an assumption about the warp size that is hard-coded into the source. But the result was correct, even on an AMD GPU! In this specific example, the assumption of `warpSize=32` did not cause any problem, since we were on a device with a warp size of 32 or 64. But if there were a device with a warp size of 16, the result probably would not be correct. The final 6 additions are actually just an optimization -- if only one warp is executing, we do not need to perform synchronization. The warp size 32 is hard-coded there, and an actual warp size of 64 merely lowers the performance a bit (relative to a hard-coded 64), performing one unnecessary `__syncthreads()`. Either way, for correctness we should replace the `32`s with `warpSize` and somehow handle the remainder loop, resulting in e.g.
 ```
 int active_threads = blockDim.x >> 1;
 while(active_threads > warpSize)
@@ -40,9 +40,4 @@ while(active_threads > 0)
 }
 ```
 
-One thing, that does not bother the compiler, linker, not harms correctness, but harms our eyes and brain, is the file extensions, which got preserved from the original project. HIP sources have `.cu` extension, which is really not nice. We could (and should) rename all such files, but we should also reflect these changes in the Makefile and `#include` statements etc., which is no fun to do.
-
-Another thing that might bother us is, that the hipicifation script added the `#include <hip/hip_runtime.h>` statement to *all* the source files, event to the ones which are pure c++ and have nothing to do with HIP (see `hello.h` and `hello.cpp`). There is currently nothing I know about that could solve this, other than manually deleting these includes.
-
-
-
+One thing that does not bother the compiler or linker, nor harms correctness, but does harm our eyes and brains, is the file extensions, which were preserved from the original project. The HIP sources now have the `.cu` extension, which is really not nice. We could (and should) rename all such files, but we would also have to reflect these changes in the Makefile, the `#include` statements, etc., which is no fun to do.
diff --git a/07_hipify_multisource/project_cuda/Makefile b/07_hipify_multisource/project_cuda/Makefile
index 0a6885c0e91c5002758d884f07dd804b4a70ddc8..6882f4e562003bd13368ca43986990cc18f5e790 100644
--- a/07_hipify_multisource/project_cuda/Makefile
+++ b/07_hipify_multisource/project_cuda/Makefile
@@ -23,5 +23,4 @@ reduction.o: reduction.cu reduction.cuh
 	nvcc -g -O2 -c $< -o $@
 
 hello.o: hello.cpp hello.h
-	nvcc -g -O2 -c $< -o $@
-
+	g++ -g -O2 -c $< -o $@
diff --git a/07_hipify_multisource/project_cuda/reduction.cu b/07_hipify_multisource/project_cuda/reduction.cu
index 47ae883c5218e35f91c747b582eb91457b41524b..0e59db62c12a817f8cd825d459346ac3358a9a87 100644
--- a/07_hipify_multisource/project_cuda/reduction.cu
+++ b/07_hipify_multisource/project_cuda/reduction.cu
@@ -31,4 +31,3 @@ __global__ void reduce_sum(int * data, int count, int * d_result)
     if(threadIdx.x == 0)
         atomicAdd(d_result, shmem_sums[0]);
 }
-
diff --git a/08_hipify_blas/Makefile b/08_hipify_blas/Makefile
index 1e9cf258d405c0af08f201d99e6cb7a517fa2d19..cbbdbaebc4992937729a01fb66442069d37601ce 100644
--- a/08_hipify_blas/Makefile
+++ b/08_hipify_blas/Makefile
@@ -1,12 +1,15 @@
 
-HIPBLASPATH=${HOME}/apps/hipBLAS/installation
+HIPBLASPATH=/opt/rocm/hipblas
+#HIPBLASPATH=${HOME}/apps/hipBLAS/installation
 
-.PHONY: default cuda hip hipifyclang hipifyperl runcuda runhip clean
+
+
+.PHONY: default cuda hip hipify hipifyclang hipifyperl run runcuda runhip clean
 
 
 
 default:
-	@echo "Use the following targets: cuda hip hipifyclang hipifyperl runcuda runhip clean"
+	@echo "Use the following targets: cuda hip hipify hipifyclang hipifyperl run runcuda runhip clean"
 
 clean:
 	rm -f *.x *.hip.cpp
@@ -19,12 +22,16 @@ cuda:
 hip:
 	hipcc -g -O2 -I${HIPBLASPATH}/include blas.hip.cpp -o blas.hip.x -L${HIPBLASPATH}/lib -lhipblas
 
+hipify: hipifyperl
+
 hipifyclang:
 	hipify-clang blas.cu -o blas.hip.cpp
 
 hipifyperl:
 	hipify-perl blas.cu -o blas.hip.cpp
 
+run: runhip
+
 runcuda:
 	./blas.cuda.x
 
diff --git a/08_hipify_blas/README.md b/08_hipify_blas/README.md
index c217c756f8659a334cab2517a330f1e45927d012..d7af4067e01a0a5c2085cb8a01fa2e270b0611ad 100644
--- a/08_hipify_blas/README.md
+++ b/08_hipify_blas/README.md
@@ -4,4 +4,4 @@ HIPIFY example: cuBLAS to hipBLAS hipification
 
 The CUDA source code `blas.cu` demonstrates basic cuBLAS functionality -- it performs a gemv operation, `y=a*A*x+b*y`. Our goal is to hipify this.
 
-Simply hipifying the code using the usuall commands produces no errors nor warnings. Skimming through the hipified code, we see that even the cuBLAS function calls were converted into their hipBLAS equivalents. Besides the classic fix of the host memory allocation function calls we also need to cast the first parameter of the `hipMallocPitch` function to `void**`. After that, we can compile the program using `-lhipblas` instead of `-lcublas` in the command line and adding the other necessary `-I` and `-L` flags. After compilation the program runs flawlessly and produces equivalent results to the original CUDA program.
+Simply hipifying the code using the usual commands produces no errors or warnings. Skimming through the hipified code, we see that even the cuBLAS function calls were converted into their hipBLAS equivalents. Besides the classic fix of renaming the host allocation calls to `hipHostMalloc`, we also need to cast the first parameter of the `hipMallocPitch` function to `void**`. After that, we can compile the program by using `-lhipblas` instead of `-lcublas` on the command line and adding the other necessary `-I` and `-L` flags. After compilation the program runs flawlessly and produces results equivalent to the original CUDA program.
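+
+The `hipMallocPitch` fix is just the explicit cast (a sketch; names are illustrative):
+```
+float * d_A;
+size_t pitch;
+hipMallocPitch((void**)&d_A, &pitch, cols * sizeof(float), rows);
+```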
diff --git a/09_hipify_solver/Makefile b/09_hipify_solver/Makefile
index cd69b16ecc6dfae3e76eea150ffffdcf93fa6d3b..55dc1c7dbd5d22284519965666d6b16916f9abb1 100644
--- a/09_hipify_solver/Makefile
+++ b/09_hipify_solver/Makefile
@@ -1,17 +1,19 @@
 
-HIPBLASPATH=${HOME}/apps/hipBLAS/installation
-HIPSOLVERPATH=${HOME}/apps/hipSOLVER/installation
+HIPBLASPATH=/opt/rocm/hipblas
+HIPSOLVERPATH=/opt/rocm/hipsolver
+#HIPBLASPATH=${HOME}/apps/hipBLAS/installation
+#HIPSOLVERPATH=${HOME}/apps/hipSOLVER/installation
 
 
 
 
 
-.PHONY: default cuda hip hipifyclang hihipifyperlpify runcuda runhip clean
+.PHONY: default cuda hip hipify hipifyclang hipifyperl run runcuda runhip clean
 
 
 
 default:
-	@echo "Use the following targets: cuda hip hipifyclang hihipifyperlpify runcuda runhip clean"
+	@echo "Use the following targets: cuda hip hipify hipifyclang hipifyperl run runcuda runhip clean"
 
 clean:
 	rm -f *.x *.hip.cpp
@@ -24,15 +26,18 @@ cuda:
 hip:
 	hipcc -g -O2 -I${HIPSOLVERPATH}/include -I${HIPBLASPATH}/include solver.hip.cpp -o solver.hip.x -L${HIPSOLVERPATH}/lib -L${HIPBLASPATH}/lib -lhipsolver -lhipblas
 
+hipify: hipifyperl
+
 hipifyclang:
 	hipify-clang solver.cu -o solver.hip.cpp
 
 hipifyperl:
 	hipify-perl solver.cu -o solver.hip.cpp
 
+run: runhip
+
 runcuda:
 	./solver.cuda.x
 
 runhip:
 	./solver.hip.x
-
diff --git a/09_hipify_solver/solver.cu b/09_hipify_solver/solver.cu
index bea114727ca9c7b70366a0cf910da3e95f718fe1..1c22f8b6f20467fb184a00f4aeeae2645d35eb41 100644
--- a/09_hipify_solver/solver.cu
+++ b/09_hipify_solver/solver.cu
@@ -112,4 +112,3 @@ int main()
 
     return 0;
 }
-
diff --git a/10_hipcpu_vector_add/Makefile b/10_hipcpu_vector_add/Makefile
index 659bd6f336af5f0e90b6cbbde2f2490c2901dc88..728fee3a9f8c7421b2aa65646eed1af14b864310 100644
--- a/10_hipcpu_vector_add/Makefile
+++ b/10_hipcpu_vector_add/Makefile
@@ -1,5 +1,6 @@
 
-HIPCPUPATH=${HOME}/apps/HIP-CPU
+HIPCPUPATH=/opt/rocm/HIP-CPU
+#HIPCPUPATH=${HOME}/apps/HIP-CPU
 
 .PHONY: compile clean run
 
@@ -13,6 +14,3 @@ clean:
 
 run:
 	./vector_add.x
-
-
-
diff --git a/10_hipcpu_vector_add/README.md b/10_hipcpu_vector_add/README.md
index 874e32b7483fc6fb11b47e9148d8f4a853c97318..49993197b2a38ebee07766805194ab5b5b71c946 100644
--- a/10_hipcpu_vector_add/README.md
+++ b/10_hipcpu_vector_add/README.md
@@ -2,16 +2,14 @@
 HIP example: using the HIP-CPU library
 ======================================
 
-This example demonstrates how the HIP-CPU library is to be used.
+This example shows basic usage of the HIP-CPU library.
 
-In the `vector_add.hip.cpp` file is HIP source code performing addition of two vectors. The code is the same as in the very first example `01_vector_add`. This time, we will not use the `hipcc` compiler to compile for GPU, but instead we will compile with any c++ compiler and use the HIP-CPU library, which should allow the same unchanged HIP code to run on CPUs.
+The file `vector_add.hip.cpp` contains HIP source code performing the addition of two vectors. The code is the same as in the very first example, `01_vector_add`. This time, we will not use the `hipcc` compiler to compile for GPU; instead we can compile with any c++ compiler and use the HIP-CPU library, which should allow the same unmodified HIP code to run on CPUs.
 
-To compile the program to run on CPUs, we need to specify the include directory of the HIP-CPU library to the compiler, need to specify that we need the c++17 standard, and link with TBB and pthread,
+To compile the program to run on CPUs, we need to point the compiler at the include directory of the HIP-CPU library, request the c++17 standard, and link with TBB and pthread, e.g.
 ```
-g++ -g -O2 -std=c++17 -I/path/to/hip-cpu/include vector_add.hip.cpp -o vector_add.x -ltbb -pthread
+g++ -g -O2 -std=c++17 -I/opt/rocm/HIP-CPU/include vector_add.hip.cpp -o vector_add.x -ltbb -pthread
 ```
-However, compilation of the code fails with type conversion error. To resolve the error, we just need to cast the pointers to `void**`.
-
-HIP-CPU is not yet fully complete, some of the functions are not yet present, correctly working, or don't have the usuall overloads, which is the case of this error. I do not recomment using HIP-CPU for any serious usage yet, not even for debugging, since you will mostly be debugging the HIP-CPU library, rather than your code.
-
+However, compilation of the code fails with several errors: HIP-CPU is not fully compatible with HIP. To resolve the issues, we cast the pointers in the `hipMalloc` calls to `void**` and launch the kernel using the `hipLaunchKernelGGL` function instead of the `<<<>>>` syntax.
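+
+Sketched, the two fixes look like this (mirroring the vector add example):
+```
+hipMalloc((void**)&d_x, count * sizeof(float));   // explicit void** cast for HIP-CPU
+hipLaunchKernelGGL(add_vectors, bpg, tpb, 0, 0, d_x, d_y, 100, count);   // instead of <<< >>>
+```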
 
+HIP-CPU is not yet fully complete: some of the functions are not yet present or do not work correctly. I do not recommend using HIP-CPU for any serious usage yet, not even for debugging, since you will mostly be debugging the HIP-CPU library rather than your code.
diff --git a/10_hipcpu_vector_add/vector_add.hip.cpp b/10_hipcpu_vector_add/vector_add.hip.cpp
index 6126d1ee10a6a81a4e2bfa5df9a52962b4b6ab36..5f639eee993a9de0075b5107ec6265d3b80c8f1b 100644
--- a/10_hipcpu_vector_add/vector_add.hip.cpp
+++ b/10_hipcpu_vector_add/vector_add.hip.cpp
@@ -5,11 +5,10 @@
 
 __global__ void add_vectors(float * x, float * y, float alpha, int count)
 {
-    // loop through all the elements in the vector
-    for(long long idx = blockIdx.x * blockDim.x + threadIdx.x; idx < count; idx += blockDim.x * gridDim.x)
-    {
+    long long idx = blockIdx.x * blockDim.x + threadIdx.x;
+
+    if(idx < count)
         y[idx] += alpha * x[idx];
-    }
 }
 
 
@@ -26,7 +25,7 @@ int main()
     {
         h_x[i] = i;
         h_y[i] = 10 * i;
-    }    
+    }
 
     // print the input data
     printf("X:");
@@ -43,16 +42,16 @@ int main()
     float * d_y;
     hipMalloc(&d_x, count * sizeof(float));
     hipMalloc(&d_y, count * sizeof(float));
-    // hipMalloc((void**)&d_x, count * sizeof(float));
-    // hipMalloc((void**)&d_y, count * sizeof(float));
 
     // copy the data from host memory to the device
     hipMemcpy(d_x, h_x, count * sizeof(float), hipMemcpyHostToDevice);
     hipMemcpy(d_y, h_y, count * sizeof(float), hipMemcpyHostToDevice);
 
+    int tpb = 256;
+    int bpg = (count - 1) / tpb + 1;
     // launch the kernel on the GPU
-    // hipLaunchKernelGGL( kernel_name, blocks_per_grid, threads_per_block, dyn_shmem_size, stream, kernel_parameters ... )
-    hipLaunchKernelGGL(add_vectors, 20, 128, 0, 0, d_x, d_y, 100, count);
+    add_vectors<<< bpg, tpb >>>(d_x, d_y, 100, count);
+    // hipLaunchKernelGGL(add_vectors, bpg, tpb, 0, 0, d_x, d_y, 100, count);
 
     // copy the result back to CPU memory
     hipMemcpy(h_y, d_y, count * sizeof(float), hipMemcpyDeviceToHost);
@@ -71,4 +70,3 @@ int main()
 
     return 0;
 }
-
diff --git a/12_omp_offload_vadd/Makefile b/12_omp_offload_vadd/Makefile
index 762c3188fdaf5ec0ba9894d6e647d1a7278edea9..b0b2ff475dbe2d64cc685f70609ed7424796476b 100644
--- a/12_omp_offload_vadd/Makefile
+++ b/12_omp_offload_vadd/Makefile
@@ -14,5 +14,4 @@ run: vadd.x
 
 
 vadd.x: vadd.cpp
-	aompcc -O2 $< -o $@
-
+	aompcc -g -O2 $< -o $@
diff --git a/12_omp_offload_vadd/README.md b/12_omp_offload_vadd/README.md
index d82e941c0a13e0899de0f6c874f7a16c8c92db4d..dcb925bdd4dc395b0f5fff275b3e5fea2919dc75 100644
--- a/12_omp_offload_vadd/README.md
+++ b/12_omp_offload_vadd/README.md
@@ -2,7 +2,7 @@
 OpenMP offloading on AMD GPUs
 =============================
 
-This example demostrates how to use AOMP, which can compile programs that use OpenMP offloading.
+This example demonstrates how to use AOMP, which can compile programs that use OpenMP offloading for AMD GPUs.
 
 The `vadd.cpp` source file contains a simple vector add source code. On line 35 there begins a loop performing the vector addition, which is annotated by several OpenMP constructs. The `target` construct makes the code execute on the GPU, `map` informs OpenMP about what data transfers should be done. The `teams` construct creates a league of teams, and `distribute` splits the for loop iterations between all teams, a lot like dividing work between threadblocks in CUDA/HIP. `parallel for` then creates several threads, which together work on the team's loop iterations, just like threads in a threadblock.
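+
+A sketch of such an annotated loop (names `x`, `y`, `a` and `count` are illustrative; the exact clauses in `vadd.cpp` may differ):
+```
+#pragma omp target teams distribute parallel for map(to: x[0:count]) map(tofrom: y[0:count])
+for(int i = 0; i < count; i++)
+    y[i] += a * x[i];
+```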
 
@@ -11,4 +11,4 @@ The code can be compiled using
 aompcc vadd.cpp -o vadd.x
 ```
 
-On machines with other than non-default GPU (default is Vega, gfx900), one would either `export AOMP_GPU=gfx908` or compile using `aompcc --offload-arch gfx908 vadd.cpp -o vadd.x` (for AMD Instinct MI100).
+On machines with a GPU other than the default (the default is Vega, gfx900), one would either `export AOMP_GPU=gfx908` or compile using `aompcc --offload-arch gfx908 vadd.cpp -o vadd.x` (shown here for AMD Instinct MI100; use `gfx90a` for MI200).