Filip Vaverka · 6ee59835
--- a/docs.it4i/cs/guides/hm_management.md 0 → 100644

+ 251

− 0
+++ b/docs.it4i/cs/guides/hm_management.md 0 → 100644

+ 251

− 0
+# Heterogeneous Memory Management on Intel Platforms
+
+Partition `p10-intel` offser heterogeneous memory directly exposed to the user. This allows to manually pick appropriate kind of memory to be used at process or even single allocation granularity. Both kinds of memory are exposed as memory-only NUMA nodes. This allows both coarse (process level) and fine (allocation level) grained control over memory type used.
+
+## Overview
+
+At the process level the `numactl` facilities can be utilized, while Intel provided `memkind` library allows for finer control. Both `memkind` library and `numactl` can be accessed by loading `memkind` module or `OpenMPI` module (only `numactl`).
+
+```bash
+ml memkind
+```
+
+### Process Level (NUMACTL)
+
+The `numactl` allows to either restrict memory pool of the process to specific set of memory NUMA nodes
+
+```bash
+numactl --membind <node_ids_set>
+```
+or select single preffered node
+```bash
+numactl --preffered <node_id>
+```
+where `<node_ids_set>` is comma separated list (eg. `0,2,5,...`) in combination with ranges (such as `0-5`). The `membind` option kills the process if it requests more memory than can be satisfied from specified nodes. The `preffered` option just reverts to using other nodes according to their NUMA distance in the same situation.
+
+Convenient way to check `numactl` configuration is
+```bash
+numactl -s
+```
+which prints configuration in its execution environment eg.
+```bash
+numactl --membind 8-15 numactl -s
+policy: bind
+preferred node: 0
+physcpubind: 0 1 2 ... 189 190 191 
+cpubind: 0 1 2 3 4 5 6 7 
+nodebind: 0 1 2 3 4 5 6 7 
+membind: 8 9 10 11 12 13 14 15
+```
+The last row shows allocations memory are restricted to NUMA nodes `8-15`.
+
+### Allocation Level (MEMKIND)
+
+The `memkind` library (in its simplest use case) offers new variant of `malloc/free` function pair, which allows to specify kind of memory to be used for given allocation. Moving specific allocation from default to HBM memory pool then can be achieved by replacing:
+```cpp
+void *pData = malloc(<SIZE>);
+/* ... */
+free(pData);
+```
+with
+```cpp
+#include <memkind.h>
+
+void *pData = memkind_malloc(MEMKIND_HBW, <SIZE>);
+/* ... */
+memkind_free(NULL, pData); // "kind" parameter is deduced from the address
+```
+Similarly other memory types can be chosen.
+
+!!! note
+    The allocation will return `NULL` pointer when memory of specified kind is not available.
+
+## High Bandwidth Memory (HBM)
+
+Intel Sapphire Rapids (partition `p10-intel`) consists of two sockets each with `128GB` of DDR and `64GB` on-package HBM memory. The machine is configured in FLAT mode and therefore exposes HBM memory as memory-only NUMA nodes (`16GB` per 12-core tile). The configuration can be verified by running
+```bash
+numactl -H
+```
+which should show 16 NUMA nodes (`0-7` should contain 12 cores and `32GB` of DDR DRAM, while `8-15` should have no cores and `16GB` of HBM each).
+
+![](../../img/cs/guides/p10_numa_sc4_flat.png)
+
+### Process Level
+
+With this we can easily restrict application to DDR DRAM or HBM memory:
+```bash
+# Only DDR DRAM
+numactl --membind 0-7 ./stream
+# ...
+Function    Best Rate MB/s  Avg time     Min time     Max time
+Copy:          369745.8     0.043355     0.043273     0.043588
+Scale:         366989.8     0.043869     0.043598     0.045355
+Add:           378054.0     0.063652     0.063483     0.063899
+Triad:         377852.5     0.063621     0.063517     0.063884
+
+# Only HBM
+numactl --membind 8-15 ./stream
+# ...
+Function    Best Rate MB/s  Avg time     Min time     Max time
+Copy:         1128430.1     0.015214     0.014179     0.015615
+Scale:        1045065.2     0.015814     0.015310     0.016309
+Add:          1096992.2     0.022619     0.021878     0.024182
+Triad:        1065152.4     0.023449     0.022532     0.024559
+```
+The DDR DRAM achieves bandwidth of around 400GB/s, while the HBM clears 1TB/s bar.
+
+Some further improvements can be achieved by entirely isolating a process to a single tile. This can be useful for MPI jobs, where `$OMPI_COMM_WORLD_RANK` can be used to bind each process individually. The simple wrapper script to do this may look like
+```bash
+#!/bin/bash
+numactl --membind $((8 + $OMPI_COMM_WORLD_RANK)) $@
+```
+and can be used as
+```bash
+mpirun -np 8 --map-by slot:pe=12 membind_wrapper.sh ./stream_mpi
+```
+(8 tiles with 12 cores each). However, this approach assumes `16GB` of HBM memory local to the tile is sufficient for each process (memory cannot spill between tiles). This approach may be significantly more useful in combination with `--preferred` instead of `--membind` to force preference of local HBM with spill to DDR DRAM. Otherwise 
+```bash
+mpirun -n 8 --map-by slot:pe=12 numactl --membind 8-15 ./stream_mpi
+```
+is most likely preferable even for MPI workloads. Applying above approach to MPI Stream with 8 ranks and 1-24 threads per rank we can expect these results:
+![](../../img/cs/guides/p10_stream_dram.png)
+![](../../img/cs/guides/p10_stream_hbm.png)
+
+### Allocation Level
+
+Allocation level memory kind selection using `memkind` library can be illustrated using modified stream benchmark. The stream benchmark uses three working arrays (A, B and C), whose allocation can be changed to `memkind_malloc` as follows
+```cpp
+#include <memkind.h>
+// ...
+STREAM_TYPE *a = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
+STREAM_TYPE *b = (STREAM_TYPE *)memkind_malloc(MEMKIND_REGULAR, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
+STREAM_TYPE *c = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
+// ...
+memkind_free(NULL, a);
+memkind_free(NULL, b);
+memkind_free(NULL, c);
+```
+Arrays A and C are allocated from HBM (`MEMKIND_HBW_ALL`), while DDR DRAM (`MEMKIND_REGULAR`) is used for B.
+The code then has to be linked with `memkind` library
+```bash
+gcc -march=native -O3 -fopenmp -lmemkind memkind_stream.c -o memkind_stream
+```
+and can be run as
+```bash
+export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
+OMP_NUM_THREADS=$((N*12)) OMP_PROC_BIND=spread ./memkind_stream
+```
+While the `memkind` library should be able to detect HBM memory on its own (through `HMAT` and `hwloc`) this is not supported on `p10-intel`. This means that NUMA nodes representing HBM have to be specified manually using `MEMKIND_HBW_NODES` environment variable.
+
+![](../../img/cs/guides/p10_stream_memkind.png)
+
+With this setup we can see that simple copy operation (C[i] = A[i]) achieves bandwidth comparable to the application bound entirely to HBM memory. On the other hand the scale operation (B[i] = s*C[i]) is mostly limited by DDR DRAM bandwidth. Its also worth noting that operations combining all three arrays are performing close to HBM-only configuration.
+
+## Simple Application
+
+One of applications that can greatly benefit from availability of large slower and faster smaller memory is computing histogram with many bins over large dataset.
+
+```cpp
+#include <iostream>
+#include <vector>
+#include <chrono>
+#include <cmath>
+#include <cstring>
+#include <omp.h>
+#include <memkind.h>
+
+const size_t N_DATA_SIZE  = 2 * 1024 * 1024 * 1024ull;
+const size_t N_BINS_COUNT = 1 * 1024 * 1024ull;
+const size_t N_ITERS      = 10;
+
+#if defined(HBM)
+    #define DATA_MEMKIND MEMKIND_REGULAR
+    #define BINS_MEMKIND MEMKIND_HBW_ALL
+#else
+    #define DATA_MEMKIND MEMKIND_REGULAR
+    #define BINS_MEMKIND MEMKIND_REGULAR
+#endif
+
+int main(int argc, char *argv[])
+{
+    const double binWidth = 1.0 / double(N_BINS_COUNT + 1);
+    
+    double *pData = (double *)memkind_malloc(DATA_MEMKIND, N_DATA_SIZE * sizeof(double));
+    size_t *pBins = (size_t *)memkind_malloc(BINS_MEMKIND, N_BINS_COUNT * omp_get_max_threads() * sizeof(double));
+    
+    #pragma omp parallel
+    {
+        drand48_data state;
+        srand48_r(omp_get_thread_num(), &state);
+        
+        #pragma omp for
+        for(size_t i = 0; i < N_DATA_SIZE; ++i)
+            drand48_r(&state, &pData[i]);
+    }
+    
+    auto c1 = std::chrono::steady_clock::now();
+    
+    for(size_t it = 0; it < N_ITERS; ++it)
+    {
+        #pragma omp parallel
+        {
+            for(size_t i = 0; i < N_BINS_COUNT; ++i)
+                pBins[omp_get_thread_num()*N_BINS_COUNT + i] = size_t(0);
+            
+            #pragma omp for
+            for(size_t i = 0; i < N_DATA_SIZE; ++i)
+            {
+                const size_t idx = size_t(pData[i] / binWidth) % N_BINS_COUNT;
+                pBins[omp_get_thread_num()*N_BINS_COUNT + idx]++;
+            }
+        }
+    }
+    
+    auto c2 = std::chrono::steady_clock::now();
+    
+    #pragma omp parallel for
+    for(size_t i = 0; i < N_BINS_COUNT; ++i)
+    {
+        for(size_t j = 1; j < omp_get_max_threads(); ++j)
+            pBins[i] += pBins[j*N_BINS_COUNT + i];
+    }
+    
+    std::cout << "Elapsed Time [s]: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;
+    
+    size_t total = 0;
+    #pragma omp parallel for reduction(+:total)
+    for(size_t i = 0; i < N_BINS_COUNT; ++i)
+        total += pBins[i];
+    
+    std::cout << "Total Items: " << total << std::endl;
+    
+    memkind_free(NULL, pData);
+    memkind_free(NULL, pBins);
+    
+    return 0;
+}
+```
+
+### Using HBM Memory (p10-intel)
+
+Following commands can be used to compile and run example application above
+```bash
+ml GCC memkind
+export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
+g++ -O3 -fopenmp -lmemkind histogram.cpp -o histogram_dram
+g++ -O3 -fopenmp -lmemkind -DHBM histogram.cpp -o histogram_hbm
+OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_dram
+OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_hbm
+```
+
+Moving histogram bins data into HBM memory should speedup the algorithm more than twice. It should be noted that moving also `pData` array into HBM memory worsens this result (presumably because the algorithm can saturate both memory interfaces).
+
+## Additional Resources
+
+- [https://linux.die.net/man/8/numactl][1]
+- [http://memkind.github.io/memkind/man_pages/memkind.html][2]
+- [https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory][3]
+
+[1]: https://linux.die.net/man/8/numactl
+[2]: http://memkind.github.io/memkind/man_pages/memkind.html
+[3]: https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory
+\ No newline at end of file