Commit 33fc350f authored by Jan Siwiec

Update file hm_management.md

The `numactl` tool allows either restricting the memory pool of the process to a specific set of NUMA nodes

```bash
numactl --membind <node_ids_set>
```

or selecting a single preferred node

```bash
numactl --preferred <node_id>
```

where `<node_ids_set>` is a comma-separated list (e.g., `0,2,5,...`), possibly in combination with ranges (such as `0-5`). The `membind` option kills the process if it requests more memory than can be satisfied from the specified nodes. The `preferred` option instead falls back to other nodes, chosen according to their NUMA distance, in the same situation.

A convenient way to check the `numactl` configuration is

```bash
numactl -s
```

which prints the configuration in effect in its execution environment, e.g.

```bash
numactl --membind 8-15 numactl -s
policy: bind
preferred node: 0
physcpubind: 0 1 2 ... 189 190 191
cpubind: 0 1 2 3 4 5 6 7
nodebind: 0 1 2 3 4 5 6 7
membind: 8 9 10 11 12 13 14 15
```

The last row shows that memory allocations are restricted to NUMA nodes `8-15`.

### Allocation Level (MEMKIND)

The `memkind` library (in its simplest use case) offers a new variant of the `malloc`/`free` function pair, which allows specifying the kind of memory to be used for a given allocation. Moving a specific allocation from the default to the HBM memory pool can then be achieved by replacing:

```cpp
void *pData = malloc(<SIZE>);
/* ... */
free(pData);
```

with

```cpp
#include <memkind.h>

void *pData = memkind_malloc(MEMKIND_HBW, <SIZE>);
/* ... */
memkind_free(NULL, pData); // "kind" parameter is deduced from the address
```

Similarly, other kinds of memory can be chosen.
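
For example, a minimal sketch (illustrative only, not taken from the original text) that prefers HBM but transparently falls back to DDR DRAM when the HBM pool cannot satisfy the request could use the `MEMKIND_HBW_PREFERRED` kind:

```cpp
#include <memkind.h>
#include <cstdio>

int main()
{
    // MEMKIND_HBW_PREFERRED: allocate from HBM if possible, otherwise fall back to regular DRAM.
    double *pData = (double *)memkind_malloc(MEMKIND_HBW_PREFERRED, 1000000 * sizeof(double));
    if (pData == nullptr) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }
    /* ... work with pData ... */
    memkind_free(NULL, pData); // the kind is again deduced from the address
    return 0;
}
```
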
!!! note

## High Bandwidth Memory (HBM)

Intel Sapphire Rapids (partition `p10-intel`) consists of two sockets, each with `128GB` of DDR DRAM and `64GB` of on-package HBM memory. The machine is configured in FLAT mode and therefore exposes the HBM as memory-only NUMA nodes (`16GB` per 12-core tile). The configuration can be verified by running

```bash
numactl -H
```

which should show 16 NUMA nodes (nodes `0-7` should each contain 12 cores and `32GB` of DDR DRAM, while nodes `8-15` should have no cores and `16GB` of HBM each).

![](../../img/cs/guides/p10_numa_sc4_flat.png)


### Process Level

With this, we can easily restrict an application to DDR DRAM or HBM memory:

```bash
# Only DDR DRAM
numactl --membind 0-7 ./stream
...
# Only HBM
numactl --membind 8-15 ./stream
...
Scale:   1045065.2  0.015814  0.015310  0.016309
Add:     1096992.2  0.022619  0.021878  0.024182
Triad:   1065152.4  0.023449  0.022532  0.024559
```

The DDR DRAM achieves a bandwidth of around 400GB/s, while the HBM clears the 1TB/s bar.

Some further improvement can be achieved by entirely isolating a process to a single tile. This can be useful for MPI jobs, where `$OMPI_COMM_WORLD_RANK` can be used to bind each process individually. A simple wrapper script to do this may look like

```bash
#!/bin/bash
numactl --membind $((8 + $OMPI_COMM_WORLD_RANK)) "$@"
```

and can be used as

```bash
mpirun -np 8 --map-by slot:pe=12 membind_wrapper.sh ./stream_mpi
```

(8 tiles with 12 cores each). However, this approach assumes that the `16GB` of HBM memory local to the tile is sufficient for each process (memory cannot spill between tiles). The approach may therefore be significantly more useful in combination with `--preferred` instead of `--membind`, which forces a preference for the local HBM while allowing spill to DDR DRAM. Otherwise,

```bash
mpirun -n 8 --map-by slot:pe=12 numactl --membind 8-15 ./stream_mpi
```

is most likely preferable even for MPI workloads. Applying the above approach to the MPI stream benchmark with 8 ranks and 1-24 threads per rank, we can expect the following results:

![](../../img/cs/guides/p10_stream_dram.png)
![](../../img/cs/guides/p10_stream_hbm.png)


### Allocation Level

Allocation-level selection of the memory kind using the `memkind` library can be illustrated with a modified stream benchmark. The stream benchmark uses three working arrays (A, B, and C), whose allocation can be changed to `memkind_malloc` as follows

```cpp
#include <memkind.h>
// ...
memkind_free(NULL, a);
memkind_free(NULL, b);
memkind_free(NULL, c);
```

Arrays A and C are allocated from HBM (`MEMKIND_HBW_ALL`), while DDR DRAM (`MEMKIND_REGULAR`) is used for B.
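
The allocation part hidden behind the `// ...` above might look roughly like the following sketch; the element type `double` and the array length `STREAM_ARRAY_SIZE` are assumptions made here for illustration, not details taken from the modified benchmark itself:

```cpp
#include <cstddef>
#include <memkind.h>

// Assumed array length, for illustration only.
const size_t STREAM_ARRAY_SIZE = 80000000;

// A and C are placed in high-bandwidth memory, B in regular DDR DRAM.
double *a = (double *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(double));
double *b = (double *)memkind_malloc(MEMKIND_REGULAR, STREAM_ARRAY_SIZE * sizeof(double));
double *c = (double *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(double));
```
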
The code then has to be linked against the `memkind` library

```bash
gcc -march=native -O3 -fopenmp memkind_stream.c -lmemkind -o memkind_stream
```

and can be run as

```bash
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
OMP_NUM_THREADS=$((N*12)) OMP_PROC_BIND=spread ./memkind_stream
```

While the `memkind` library should be able to detect HBM memory on its own (through `HMAT` and `hwloc`), this is not supported on `p10-intel`. This means that the NUMA nodes representing HBM have to be specified manually using the `MEMKIND_HBW_NODES` environment variable.
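
Whether a high-bandwidth kind is actually usable in the current environment can be checked at runtime with `memkind_check_available`; the following is a minimal sketch, not part of the original document:

```cpp
#include <memkind.h>
#include <cstdio>

int main()
{
    // memkind_check_available() returns 0 (MEMKIND_SUCCESS) when the kind can be used.
    // On p10-intel this requires MEMKIND_HBW_NODES to be exported, since automatic
    // HBM detection is not supported there.
    if (memkind_check_available(MEMKIND_HBW_ALL) == 0)
        std::printf("High-bandwidth memory kind is available.\n");
    else
        std::printf("High-bandwidth memory kind is NOT available; is MEMKIND_HBW_NODES set?\n");
    return 0;
}
```
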

![](../../img/cs/guides/p10_stream_memkind.png)

The allocation-level approach is also illustrated by the following example application, which fills a large array with random numbers and repeatedly computes its histogram; the memory kinds used for the data array and for the per-thread bins are selected via the `DATA_MEMKIND` and `BINS_MEMKIND` constants (includes and the remaining constant definitions are omitted here).

```cpp
// ...
const size_t N_ITERS = 10;

int main(int argc, char *argv[])
{
    const double binWidth = 1.0 / double(N_BINS_COUNT + 1);

    // The data array and the per-thread bins are allocated from the selected memory kinds.
    double *pData = (double *)memkind_malloc(DATA_MEMKIND, N_DATA_SIZE * sizeof(double));
    size_t *pBins = (size_t *)memkind_malloc(BINS_MEMKIND, N_BINS_COUNT * omp_get_max_threads() * sizeof(size_t));

    #pragma omp parallel
    {
        // Each thread fills its part of the data array with uniform random numbers
        // using a private drand48 state seeded by the thread number.
        drand48_data state;
        srand48_r(omp_get_thread_num(), &state);

        #pragma omp for
        for(size_t i = 0; i < N_DATA_SIZE; ++i)
            drand48_r(&state, &pData[i]);
    }

    auto c1 = std::chrono::steady_clock::now();

    for(size_t it = 0; it < N_ITERS; ++it)
    {
        #pragma omp parallel
        {
            // Zero this thread's private copy of the bins.
            for(size_t i = 0; i < N_BINS_COUNT; ++i)
                pBins[omp_get_thread_num()*N_BINS_COUNT + i] = size_t(0);

            #pragma omp for
            for(size_t i = 0; i < N_DATA_SIZE; ++i)
            {
                // ... (compute the bin index for pData[i] and increment this thread's copy of that bin)
            }
        }
    }

    auto c2 = std::chrono::steady_clock::now();

    // Reduce the per-thread bins into the first thread's copy.
    #pragma omp parallel for
    for(size_t i = 0; i < N_BINS_COUNT; ++i)
    {
        for(size_t j = 1; j < omp_get_max_threads(); ++j)
            pBins[i] += pBins[j*N_BINS_COUNT + i];
    }

    std::cout << "Elapsed Time [s]: " << std::chrono::duration<double>(c2 - c1).count() << std::endl;

    size_t total = 0;

    #pragma omp parallel for reduction(+:total)
    for(size_t i = 0; i < N_BINS_COUNT; ++i)
        total += pBins[i];

    std::cout << "Total Items: " << total << std::endl;

    memkind_free(NULL, pData);
    memkind_free(NULL, pBins);

    return 0;
}
```


### Using HBM Memory (P10-Intel)

The following commands can be used to compile and run the example application above

```bash
ml GCC memkind
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
# ...
```