# Heterogeneous Memory Management on Intel Platforms
    
Partition `p10-intel` offers heterogeneous memory directly exposed to the user, allowing the appropriate kind of memory to be picked manually at process or even single-allocation granularity. Both kinds of memory are exposed as memory-only NUMA nodes, which enables both coarse-grained (process level) and fine-grained (allocation level) control over the memory type used.
    
    ## Overview
    
At the process level, the `numactl` facilities can be utilized, while the Intel-provided `memkind` library allows for finer control. Both the `memkind` library and `numactl` can be accessed by loading the `memkind` module, or the `OpenMPI` module (`numactl` only).
    
    ```bash
    ml memkind
    ```
    
    ### Process Level (NUMACTL)
    
The `numactl` utility allows either restricting the memory pool of the process to a specific set of memory NUMA nodes
    
    ```bash
    numactl --membind <node_ids_set>
    ```
    
or selecting a single preferred node
    
    ```bash
    numactl --preferred <node_id>
    ```
    
where `<node_ids_set>` is a comma-separated list (e.g., `0,2,5,...`), possibly in combination with ranges (such as `0-5`). The `--membind` option kills the process if it requests more memory than can be satisfied from the specified nodes. The `--preferred` option instead falls back to other nodes, chosen according to their NUMA distance, in the same situation.
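
For illustration, the two policies might be used as follows (`./app` is a placeholder for an arbitrary application):

```bash
# Restrict allocations to nodes 0-3 and 6; the process is killed if its
# memory cannot be satisfied from these nodes
numactl --membind 0-3,6 ./app

# Prefer node 8, but fall back to other nodes once it fills up
numactl --preferred 8 ./app
```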
    
A convenient way to check the resulting `numactl` configuration is to run `numactl -s`, which prints the policy in effect in its execution environment, e.g.
    
    ```bash
    numactl --membind 8-15 numactl -s
    policy: bind
    preferred node: 0
    
    physcpubind: 0 1 2 ... 189 190 191
    cpubind: 0 1 2 3 4 5 6 7
    nodebind: 0 1 2 3 4 5 6 7
    
    membind: 8 9 10 11 12 13 14 15
    ```
    
The last row shows that memory allocations are restricted to NUMA nodes `8-15`.
    
    ### Allocation Level (MEMKIND)
    
The `memkind` library (in its simplest use case) offers a new variant of the `malloc`/`free` function pair, which allows specifying the kind of memory to be used for a given allocation. Moving a specific allocation from the default to the HBM memory pool can then be achieved by replacing:
    
    ```cpp
    void *pData = malloc(<SIZE>);
    /* ... */
    free(pData);
```

with

    ```cpp
    #include <memkind.h>
    
    void *pData = memkind_malloc(MEMKIND_HBW, <SIZE>);
    /* ... */
    memkind_free(NULL, pData); // "kind" parameter is deduced from the address
    ```
    
Other memory kinds, such as `MEMKIND_REGULAR` (DDR DRAM) or `MEMKIND_HBW_ALL`, can be chosen similarly.
    
!!! note
    The allocation will return a `NULL` pointer when memory of the specified kind is not available.
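
Where falling back to ordinary DDR DRAM is acceptable, the `NULL` return can be handled explicitly. A minimal sketch (the helper name `alloc_prefer_hbw` is illustrative, not part of the `memkind` API):

```cpp
#include <memkind.h>
#include <cstdlib>

// Try high-bandwidth memory first; fall back to the default kind when no
// HBM is available (memkind_malloc returns NULL in that case).
static void *alloc_prefer_hbw(size_t size)
{
    void *p = memkind_malloc(MEMKIND_HBW, size);
    if (p == NULL)
        p = memkind_malloc(MEMKIND_DEFAULT, size);
    return p;
}

int main()
{
    double *pData = (double *)alloc_prefer_hbw(1024 * sizeof(double));
    if (pData == NULL)
        return EXIT_FAILURE;
    /* ... use pData ... */
    memkind_free(NULL, pData); // "kind" is deduced from the address
    return EXIT_SUCCESS;
}
```

The library also provides kinds such as `MEMKIND_HBW_PREFERRED`, which implement a similar fallback internally.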
    
    ## High Bandwidth Memory (HBM)
    
Intel Sapphire Rapids (partition `p10-intel`) consists of two sockets, each with `128GB` of DDR DRAM and `64GB` of on-package HBM memory. The machine is configured in FLAT mode and therefore exposes the HBM as memory-only NUMA nodes (`16GB` per 12-core tile). The configuration can be verified by running `numactl -H` (or a similar NUMA topology tool), which should show 16 NUMA nodes: nodes `0-7` should each contain 12 cores and `32GB` of DDR DRAM, while nodes `8-15` should have no cores and `16GB` of HBM each.
    
    ![](../../img/cs/guides/p10_numa_sc4_flat.png)
    
    ### Process Level
    
With this, we can easily restrict the application to DDR DRAM or HBM memory:
    
    ```bash
    # Only DDR DRAM
    numactl --membind 0-7 ./stream
    # ...
    Function    Best Rate MB/s  Avg time     Min time     Max time
    Copy:          369745.8     0.043355     0.043273     0.043588
    Scale:         366989.8     0.043869     0.043598     0.045355
    Add:           378054.0     0.063652     0.063483     0.063899
    Triad:         377852.5     0.063621     0.063517     0.063884
    
    # Only HBM
    numactl --membind 8-15 ./stream
    # ...
    Function    Best Rate MB/s  Avg time     Min time     Max time
    Copy:         1128430.1     0.015214     0.014179     0.015615
    Scale:        1045065.2     0.015814     0.015310     0.016309
    Add:          1096992.2     0.022619     0.021878     0.024182
    Triad:        1065152.4     0.023449     0.022532     0.024559
    ```
    
The DDR DRAM achieves a bandwidth of around 400GB/s, while the HBM clears the 1TB/s bar.
    
Some further improvement can be achieved by entirely isolating each process to a single tile. This can be useful for MPI jobs, where `$OMPI_COMM_WORLD_RANK` can be used to bind each process individually. A simple wrapper script to do this may look like
    
```bash
#!/bin/bash
# membind_wrapper.sh: bind this MPI rank to its tile-local HBM NUMA node (8 + rank)
numactl --membind $((8 + $OMPI_COMM_WORLD_RANK)) "$@"
```
    
    ```bash
    mpirun -np 8 --map-by slot:pe=12 membind_wrapper.sh ./stream_mpi
    ```
    
    
This launches 8 ranks, one per 12-core tile (8 tiles with 12 cores each). However, this approach assumes that the `16GB` of HBM memory local to each tile is sufficient for every process (memory cannot spill between tiles). The per-tile approach may be significantly more useful in combination with `--preferred` instead of `--membind`, which forces a preference for the tile-local HBM while still allowing spill to DDR DRAM (a sketch of such a wrapper is shown below the plots). Otherwise,
    
    
    ```bash
    mpirun -n 8 --map-by slot:pe=12 numactl --membind 8-15 ./stream_mpi
    ```
    
is most likely preferable even for MPI workloads. Applying the above approach to the MPI stream benchmark with 8 ranks and 1-24 threads per rank, we can expect results like these:
    ![](../../img/cs/guides/p10_stream_dram.png)
    ![](../../img/cs/guides/p10_stream_hbm.png)
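
As mentioned above, the per-tile wrapper may be combined with `--preferred` so that each rank prefers its tile-local HBM node but can still spill into DDR DRAM. A minimal sketch, assuming the same rank-to-tile mapping (the file name `preferred_wrapper.sh` is illustrative):

```bash
#!/bin/bash
# preferred_wrapper.sh: prefer the rank-local HBM NUMA node (8 + rank), but
# fall back to DDR DRAM once the 16GB of tile-local HBM is exhausted,
# instead of failing as --membind would.
exec numactl --preferred $((8 + OMPI_COMM_WORLD_RANK)) "$@"
```

It would be launched in the same way as `membind_wrapper.sh` above.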
    
    ### Allocation Level
    
Allocation-level memory kind selection using the `memkind` library can be illustrated with a modified stream benchmark. The benchmark uses three working arrays (A, B and C), whose allocation can be changed to `memkind_malloc` as follows
    
    ```cpp
    #include <memkind.h>
    // ...
    STREAM_TYPE *a = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
    STREAM_TYPE *b = (STREAM_TYPE *)memkind_malloc(MEMKIND_REGULAR, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
    STREAM_TYPE *c = (STREAM_TYPE *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(STREAM_TYPE));
    // ...
    memkind_free(NULL, a);
    memkind_free(NULL, b);
    memkind_free(NULL, c);
    ```
    
Arrays A and C are allocated from HBM (`MEMKIND_HBW_ALL`), while DDR DRAM (`MEMKIND_REGULAR`) is used for B.
The code then has to be linked with the `memkind` library
    
    ```bash
    gcc -march=native -O3 -fopenmp memkind_stream.c -o memkind_stream -lmemkind
    ```
    
    ```bash
    export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
    # N sets the thread count in multiples of 12 (one 12-core tile's worth)
    OMP_NUM_THREADS=$((N*12)) OMP_PROC_BIND=spread ./memkind_stream
    ```
    
While the `memkind` library should be able to detect HBM memory on its own (through `HMAT` and `hwloc`), this is not supported on `p10-intel`. This means that the NUMA nodes representing HBM have to be specified manually using the `MEMKIND_HBW_NODES` environment variable.
    
    ![](../../img/cs/guides/p10_stream_memkind.png)
    
With this setup, we can see that the simple copy operation (`C[i] = A[i]`) achieves bandwidth comparable to the application bound entirely to HBM memory. On the other hand, the scale operation (`B[i] = s*C[i]`) is mostly limited by the DDR DRAM bandwidth. It is also worth noting that the operations combining all three arrays perform close to the HBM-only configuration.
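
For reference, the four stream kernels touch the arrays as follows; with the allocation above, `a` and `c` reside in HBM while `b` resides in DDR DRAM (a simplified sketch, without the OpenMP pragmas and timing of the actual benchmark):

```cpp
#include <cstddef>

typedef double STREAM_TYPE;
const size_t STREAM_ARRAY_SIZE = 80 * 1000 * 1000;

// Simplified stream kernels: a[] and c[] live in HBM (MEMKIND_HBW_ALL),
// b[] lives in DDR DRAM (MEMKIND_REGULAR) in the setup above.
void stream_kernels(STREAM_TYPE *a, STREAM_TYPE *b, STREAM_TYPE *c, STREAM_TYPE scalar)
{
    for (size_t j = 0; j < STREAM_ARRAY_SIZE; j++) c[j] = a[j];                  // Copy:  HBM -> HBM
    for (size_t j = 0; j < STREAM_ARRAY_SIZE; j++) b[j] = scalar * c[j];         // Scale: HBM -> DDR
    for (size_t j = 0; j < STREAM_ARRAY_SIZE; j++) c[j] = a[j] + b[j];           // Add:   HBM + DDR -> HBM
    for (size_t j = 0; j < STREAM_ARRAY_SIZE; j++) a[j] = b[j] + scalar * c[j];  // Triad: DDR + HBM -> HBM
}
```

This makes it clear why copy performs close to the HBM-only runs (it never touches DDR DRAM), while scale is bounded by the DDR DRAM bandwidth.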
    
    ## Simple Application
    
One application that can greatly benefit from the availability of a large, slower memory alongside a smaller, faster one is computing a histogram with many bins over a large dataset.
    
    ```cpp
    #include <iostream>
    #include <vector>
    #include <chrono>
    #include <cmath>
    #include <cstring>
    #include <cstdlib> // drand48_r / srand48_r
    #include <omp.h>
    #include <memkind.h>
    
    const size_t N_DATA_SIZE  = 2 * 1024 * 1024 * 1024ull;
    const size_t N_BINS_COUNT = 1 * 1024 * 1024ull;
    const size_t N_ITERS      = 10;
    
    #if defined(HBM)
        #define DATA_MEMKIND MEMKIND_REGULAR
        #define BINS_MEMKIND MEMKIND_HBW_ALL
    #else
        #define DATA_MEMKIND MEMKIND_REGULAR
        #define BINS_MEMKIND MEMKIND_REGULAR
    #endif
    
    int main(int argc, char *argv[])
    {
        const double binWidth = 1.0 / double(N_BINS_COUNT + 1);
    
        double *pData = (double *)memkind_malloc(DATA_MEMKIND, N_DATA_SIZE * sizeof(double));
        size_t *pBins = (size_t *)memkind_malloc(BINS_MEMKIND, N_BINS_COUNT * omp_get_max_threads() * sizeof(size_t));
    
        #pragma omp parallel
        {
            drand48_data state;
            srand48_r(omp_get_thread_num(), &state);
    
            #pragma omp for
            for(size_t i = 0; i < N_DATA_SIZE; ++i)
                drand48_r(&state, &pData[i]);
        }
    
        auto c1 = std::chrono::steady_clock::now();
    
        for(size_t it = 0; it < N_ITERS; ++it)
        {
            #pragma omp parallel
            {
                for(size_t i = 0; i < N_BINS_COUNT; ++i)
                    pBins[omp_get_thread_num()*N_BINS_COUNT + i] = size_t(0);
    
                #pragma omp for
                for(size_t i = 0; i < N_DATA_SIZE; ++i)
                {
                    const size_t idx = size_t(pData[i] / binWidth) % N_BINS_COUNT;
                    pBins[omp_get_thread_num()*N_BINS_COUNT + idx]++;
                }
            }
        }
    
        auto c2 = std::chrono::steady_clock::now();
    
        #pragma omp parallel for
        for(size_t i = 0; i < N_BINS_COUNT; ++i)
        {
            for(size_t j = 1; j < omp_get_max_threads(); ++j)
                pBins[i] += pBins[j*N_BINS_COUNT + i];
        }
    
        std::cout << "Elapsed Time [s]: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;
    
        size_t total = 0;
        #pragma omp parallel for reduction(+:total)
        for(size_t i = 0; i < N_BINS_COUNT; ++i)
            total += pBins[i];
    
        std::cout << "Total Items: " << total << std::endl;
    
        memkind_free(NULL, pData);
        memkind_free(NULL, pBins);

        return 0;
    }
    ```

    ### Using HBM Memory (P10-Intel)
    
    
The following commands can be used to compile and run the example application above:
    
    ```bash
    ml GCC memkind
    export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
    g++ -O3 -fopenmp histogram.cpp -o histogram_dram -lmemkind
    g++ -O3 -fopenmp -DHBM histogram.cpp -o histogram_hbm -lmemkind
    OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_dram
    OMP_PROC_BIND=spread GOMP_CPU_AFFINITY=0-95 OMP_NUM_THREADS=96 ./histogram_hbm
    ```
    
Moving the histogram bins into HBM memory should speed up the algorithm more than twofold. It should be noted that also moving the `pData` array into HBM worsens this result (presumably because keeping the data in DDR DRAM allows the algorithm to saturate both memory interfaces).
    
    ## Additional Resources
    
    - [https://linux.die.net/man/8/numactl][1]
    - [http://memkind.github.io/memkind/man_pages/memkind.html][2]
    - [https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory][3]
    
    [1]: https://linux.die.net/man/8/numactl
    [2]: http://memkind.github.io/memkind/man_pages/memkind.html
    [3]: https://lenovopress.lenovo.com/lp1738-implementing-intel-high-bandwidth-memory