# Heterogeneous Memory Management on Intel Platforms
    
Partition `p10-intel` offers heterogeneous memory directly exposed to the user, allowing the appropriate kind of memory to be picked manually at process or even single-allocation granularity. Both kinds of memory are exposed as memory-only NUMA nodes, which enables both coarse-grained (process level) and fine-grained (allocation level) control over the memory type used.
    
    ## Overview
    
At the process level, the `numactl` facilities can be utilized, while the Intel-provided `memkind` library allows for finer control. Both the `memkind` library and `numactl` can be accessed by loading the `memkind` module, or the `OpenMPI` module (`numactl` only).
    
    ```bash
    ml memkind
    ```
    
    ### Process Level (NUMACTL)
    
The `numactl` utility allows either restricting the memory pool of the process to a specific set of memory NUMA nodes
    
    ```bash
    numactl --membind <node_ids_set>
    ```
    
or selecting a single preferred node
    
    ```bash
    numactl --preferred <node_id>
    ```
    
where `<node_ids_set>` is a comma-separated list (e.g., `0,2,5,...`), possibly in combination with ranges (such as `0-5`). The `--membind` option kills the process if it requests more memory than can be satisfied from the specified nodes. The `--preferred` option instead falls back to other nodes, chosen according to their NUMA distance, in the same situation.
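
For illustration, the two policies might be used as follows (`./app` is a placeholder for an arbitrary application):

```bash
# Restrict allocations to nodes 0-3 and 6; the process is killed if its
# memory cannot be satisfied from these nodes
numactl --membind 0-3,6 ./app

# Prefer node 8, but fall back to other nodes once it fills up
numactl --preferred 8 ./app
```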
    
A convenient way to check the resulting `numactl` configuration is to run `numactl -s`, which prints the policy in effect in its execution environment, e.g.
    
    ```bash
    numactl --membind 8-15 numactl -s
    policy: bind
    preferred node: 0
    
    physcpubind: 0 1 2 ... 189 190 191
    cpubind: 0 1 2 3 4 5 6 7
    nodebind: 0 1 2 3 4 5 6 7
    
    membind: 8 9 10 11 12 13 14 15
    ```
    
The last row shows that memory allocations are restricted to NUMA nodes `8-15`.
    
    ### Allocation Level (MEMKIND)
    
The `memkind` library (in its simplest use case) offers a new variant of the `malloc`/`free` function pair, which allows specifying the kind of memory to be used for a given allocation. Moving a specific allocation from the default to the HBM memory pool can then be achieved by replacing:
    
    ```cpp
    void *pData = malloc(<SIZE>);
    /* ... */
    free(pData);
```

with

    ```cpp
    #include <memkind.h>
    
    void *pData = memkind_malloc(MEMKIND_HBW, <SIZE>);
    /* ... */
    memkind_free(NULL, pData); // "kind" parameter is deduced from the address
    ```
    
Other memory kinds, such as `MEMKIND_REGULAR` (DDR DRAM) or `MEMKIND_HBW_ALL`, can be chosen similarly.
    
!!! note
    The allocation will return a `NULL` pointer when memory of the specified kind is not available.
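
Where falling back to ordinary DDR DRAM is acceptable, the `NULL` return can be handled explicitly. A minimal sketch (the helper name `alloc_prefer_hbw` is illustrative, not part of the `memkind` API):

```cpp
#include <memkind.h>
#include <cstdlib>

// Try high-bandwidth memory first; fall back to the default kind when no
// HBM is available (memkind_malloc returns NULL in that case).
static void *alloc_prefer_hbw(size_t size)
{
    void *p = memkind_malloc(MEMKIND_HBW, size);
    if (p == NULL)
        p = memkind_malloc(MEMKIND_DEFAULT, size);
    return p;
}

int main()
{
    double *pData = (double *)alloc_prefer_hbw(1024 * sizeof(double));
    if (pData == NULL)
        return EXIT_FAILURE;
    /* ... use pData ... */
    memkind_free(NULL, pData); // "kind" is deduced from the address
    return EXIT_SUCCESS;
}
```

The library also provides kinds such as `MEMKIND_HBW_PREFERRED`, which implement a similar fallback internally.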
    
    ## High Bandwidth Memory (HBM)
    
Intel Sapphire Rapids (partition `p10-intel`) consists of two sockets, each with `128GB` of DDR DRAM and `64GB` of on-package HBM memory. The machine is configured in FLAT mode and therefore exposes the HBM as memory-only NUMA nodes (`16GB` per 12-core tile). The configuration can be verified by running `numactl -H` (or a similar NUMA topology tool), which should show 16 NUMA nodes: nodes `0-7` should each contain 12 cores and `32GB` of DDR DRAM, while nodes `8-15` should have no cores and `16GB` of HBM each.
    
    ![](../../img/cs/guides/p10_numa_sc4_flat.png)
    
    ### Process Level
    
With this, we can easily restrict the application to DDR DRAM or HBM memory:
    
    ```bash
    # Only DDR DRAM
    numactl --membind 0-7 ./stream
    # ...
    Function    Best Rate MB/s  Avg time     Min time     Max time
    Copy:          369745.8     0.043355     0.043273     0.043588
    Scale:         366989.8     0.043869     0.043598     0.045355
    Add:           378054.0     0.063652     0.063483     0.063899
    Triad:         377852.5     0.063621     0.063517     0.063884
    
    # Only HBM
    numactl --membind 8-15 ./stream
    # ...
    Function    Best Rate MB/s  Avg time     Min time     Max time
    Copy:         1128430.1     0.015214     0.014179     0.015615
    Scale:        1045065.2     0.015814     0.015310     0.016309
    Add:          1096992.2     0.022619     0.021878     0.024182
    Triad:        1065152.4     0.023449     0.022532     0.024559
    ```
    
The DDR DRAM achieves a bandwidth of around 400GB/s, while the HBM clears the 1TB/s bar.
    
Some further improvement can be achieved by entirely isolating each process to a single tile. This can be useful for MPI jobs, where `$OMPI_COMM_WORLD_RANK` can be used to bind each process individually. A simple wrapper script to do this may look like
    
```bash
#!/bin/bash
# membind_wrapper.sh: bind this MPI rank to its tile-local HBM NUMA node (8 + rank)
numactl --membind $((8 + $OMPI_COMM_WORLD_RANK)) "$@"
```
    
    ```bash
    mpirun -np 8 --map-by slot:pe=12 membind_wrapper.sh ./stream_mpi
    ```
    
    
This launches 8 ranks, one per 12-core tile (8 tiles with 12 cores each). However, this approach assumes that the `16GB` of HBM memory local to each tile is sufficient for every process (memory cannot spill between tiles). The per-tile approach may be significantly more useful in combination with `--preferred` instead of `--membind`, which forces a preference for the tile-local HBM while still allowing spill to DDR DRAM (a sketch of such a wrapper is shown below the plots). Otherwise,
    
    
    ```bash
    mpirun -n 8 --map-by slot:pe=12 numactl --membind 8-15 ./stream_mpi
    ```
    
is most likely preferable even for MPI workloads. Applying the above approach to the MPI stream benchmark with 8 ranks and 1-24 threads per rank, we can expect results like these:
    ![](../../img/cs/guides/p10_stream_dram.png)
    ![](../../img/cs/guides/p10_stream_hbm.png)
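
As mentioned above, the per-tile wrapper may be combined with `--preferred` so that each rank prefers its tile-local HBM node but can still spill into DDR DRAM. A minimal sketch, assuming the same rank-to-tile mapping (the file name `preferred_wrapper.sh` is illustrative):

```bash
#!/bin/bash
# preferred_wrapper.sh: prefer the rank-local HBM NUMA node (8 + rank), but
# fall back to DDR DRAM once the 16GB of tile-local HBM is exhausted,
# instead of failing as --membind would.
exec numactl --preferred $((8 + OMPI_COMM_WORLD_RANK)) "$@"
```

It would be launched in the same way as `membind_wrapper.sh` above.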
    
    ### Allocation Level
    
Allocation-level memory kind selection using the `memkind` library can be illustrated with a modified stream benchmark. The benchmark uses three working arrays (A, B and C), whose allocation can be changed to `memkind_malloc` as follows
    
    ```cpp
    #include <memkind.h>
    // ...
    STREAM_TYPE *a = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
    STREAM_TYPE *b = (STREAM_TYPE *)memkind_malloc(MEMKIND_REGULAR, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
    STREAM_TYPE *c = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
    // ...
    memkind_free(NULL, a);
    memkind_free(NULL, b);
    memkind_free(NULL, c);
    ```
    
Arrays A and C are allocated from HBM (`MEMKIND_HBW_ALL`), while DDR DRAM (`MEMKIND_REGULAR`) is used for B.
The code then has to be linked with the `memkind` library
    
    ```bash
    gcc -march=native -O3 -fopenmp memkind_stream.c -o memkind_stream -lmemkind
    ```
    
    ```bash
    export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
    # N sets the thread count in multiples of 12 (one 12-core tile's worth)
    OMP_NUM_THREADS=$((N*12)) OMP_PROC_BIND=spread ./memkind_stream
    ```
    
While the `memkind` library should be able to detect HBM memory on its own (through `HMAT` and `hwloc`), this is not supported on `p10-intel`. This means that the NUMA nodes representing HBM have to be specified manually using the `MEMKIND_HBW_NODES` environment variable.
    
    ![](../../img/cs/guides/p10_stream_memkind.png)
    
With this setup, we can see that the simple copy operation (`C[i] = A[i]`) achieves bandwidth comparable to the application bound entirely to HBM memory. On the other hand, the scale operation (`B[i] = s*C[i]`) is mostly limited by the DDR DRAM bandwidth. It is also worth noting that the operations combining all three arrays perform close to the HBM-only configuration.
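
For reference, the four stream kernels touch the arrays as follows; with the allocation above, `a` and `c` reside in HBM while `b` resides in DDR DRAM (a simplified sketch, without the OpenMP pragmas and timing of the actual benchmark):

```cpp
#include <cstddef>

typedef double STREAM_TYPE;
const size_t STREAM_ARRAY_SIZE = 80 * 1000 * 1000;

// Simplified stream kernels: a[] and c[] live in HBM (MEMKIND_HBW_ALL),
// b[] lives in DDR DRAM (MEMKIND_REGULAR) in the setup above.
void stream_kernels(STREAM_TYPE *a, STREAM_TYPE *b, STREAM_TYPE *c, STREAM_TYPE scalar)
{
    for (size_t j = 0; j < STREAM_ARRAY_SIZE; j++) c[j] = a[j];                  // Copy:  HBM -> HBM
    for (size_t j = 0; j < STREAM_ARRAY_SIZE; j++) b[j] = scalar * c[j];         // Scale: HBM -> DDR
    for (size_t j = 0; j < STREAM_ARRAY_SIZE; j++) c[j] = a[j] + b[j];           // Add:   HBM + DDR -> HBM
    for (size_t j = 0; j < STREAM_ARRAY_SIZE; j++) a[j] = b[j] + scalar * c[j];  // Triad: DDR + HBM -> HBM
}
```

This makes it clear why copy performs close to the HBM-only runs (it never touches DDR DRAM), while scale is bounded by the DDR DRAM bandwidth.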
    
    ## Simple Application
    
One application that can greatly benefit from the availability of a large, slower memory alongside a smaller, faster one is computing a histogram with many bins over a large dataset.
    
    ```cpp
    #include <iostream>
    #include <vector>
    #include <chrono>
    #include <cmath>
    #include <cstring>
    #include <cstdlib> // drand48_r / srand48_r
    #include <omp.h>
    #include <memkind.h>
    
    const size_t N_DATA_SIZE  = 2 * 1024 * 1024 * 1024ull;
    const size_t N_BINS_COUNT = 1 * 1024 * 1024ull;
    const size_t N_ITERS      = 10;
    
    #if defined(HBM)
        #define DATA_MEMKIND MEMKIND_REGULAR
        #define BINS_MEMKIND MEMKIND_HBW_ALL
    #else
        #define DATA_MEMKIND MEMKIND_REGULAR
        #define BINS_MEMKIND MEMKIND_REGULAR
    #endif
    
    int main(int argc, char *argv[])
    {
        const double binWidth = 1.0 / double(N_BINS_COUNT + 1);
    
        double *pData = (double *)memkind_malloc(DATA_MEMKIND, N_DATA_SIZE * sizeof(double));
        size_t *pBins = (size_t *)memkind_malloc(BINS_MEMKIND, N_BINS_COUNT * omp_get_max_threads() * sizeof(size_t));
    
        #pragma omp parallel
        {
            drand48_data state;
            srand48_r(omp_get_thread_num(), &state);
    
            #pragma omp for
            for(size_t i = 0; i < N_DATA_SIZE; ++i)
                drand48_r(&state, &pData[i]);
        }
    
        auto c1 = std::chrono::steady_clock::now();
    
        for(size_t it = 0; it < N_ITERS; ++it)
        {
            #pragma omp parallel
            {
                for(size_t i = 0; i < N_BINS_COUNT; ++i)
                    pBins[omp_get_thread_num()*N_BINS_COUNT + i] = size_t(0);
    
                #pragma omp for
                for(size_t i = 0; i < N_DATA_SIZE; ++i)
                {
                    const size_t idx = size_t(pData[i] / binWidth) % N_BINS_COUNT;
                    pBins[omp_get_thread_num()*N_BINS_COUNT + idx]++;
                }
            }
        }
    
        auto c2 = std::chrono::steady_clock::now();
    
        #pragma omp parallel for
        for(size_t i = 0; i < N_BINS_COUNT; ++i)
        {
            for(size_t j = 1; j < omp_get_max_threads(); ++j)
                pBins[i] += pBins[j*N_BINS_COUNT + i];
        }
    
        std::cout << "Elapsed Time [s]: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;
    
        size_t total = 0;
        #pragma omp parallel for reduction(+:total)
        for(size_t i = 0; i < N_BINS_COUNT; ++i)
            total += pBins[i];
    
        std::cout << "Total Items: " << total << std::endl;
    
        memkind_free(NULL, pData);
        memkind_free(NULL, pBins);

        return 0;
    }
    ```

    ### Using HBM Memory (P10-Intel)
    
    
The following commands can be used to compile and run the example application above:
    
    ```bash
    ml GCC memkind
    export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
    g++ -O3 -fopenmp histogram.cpp -o histogram_dram -lmemkind
    g++ -O3 -fopenmp -DHBM histogram.cpp -o histogram_hbm -lmemkind
    OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_dram
    OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_hbm
    ```
    
Moving the histogram bins into HBM memory should speed up the algorithm more than twofold. It should be noted that also moving the `pData` array into HBM worsens this result (presumably because keeping the data in DDR DRAM allows the algorithm to saturate both memory interfaces).
    
    ## Additional Resources
    
    - [https://linux.die.net/man/8/numactl][1]
    - [http://memkind.github.io/memkind/man_pages/memkind.html][2]
    - [https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory][3]
    
    [1]: https://linux.die.net/man/8/numactl
    [2]: http://memkind.github.io/memkind/man_pages/memkind.html
    [3]: https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory