The `numactl` tool allows to either restrict the memory pool of the process to a specific set of memory NUMA nodes
```bash
numactl --membind <node_ids_set>
```

or to select a single preferred node

```bash
numactl --preferred <node_id>
```

where `<node_ids_set>` is a comma-separated list (e.g. `0,2,5,...`), possibly in combination with ranges (such as `0-5`). The `membind` option kills the process if it requests more memory than can be satisfied from the specified nodes. The `preferred` option merely falls back to other nodes (according to their NUMA distance) in the same situation.
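For illustration, a combined list/range specification might look like this (with a hypothetical application binary `./app`):

```bash
numactl --membind 0,2,5-7 ./app   # allocate only on nodes 0, 2, 5, 6 and 7
numactl --preferred 8 ./app       # prefer node 8, fall back by NUMA distance
```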
A convenient way to check the `numactl` configuration is

```bash
numactl -s
```

which prints the configuration in its execution environment, e.g.
```bash
numactl --membind 8-15 numactl -s
policy: bind
preferred node: 0
physcpubind: 0 1 2 ... 189 190 191
cpubind: 0 1 2 3 4 5 6 7
nodebind: 0 1 2 3 4 5 6 7
membind: 8 9 10 11 12 13 14 15
```
The last row shows that memory allocations are restricted to NUMA nodes `8-15`.
### Allocation Level (MEMKIND)
The `memkind` library (in its simplest use case) offers a new variant of the `malloc/free` function pair, which allows specifying the kind of memory to be used for a given allocation. Moving a specific allocation from the default memory pool to HBM can then be achieved by replacing the standard `malloc`/`free` pair with `memkind_malloc`/`memkind_free`; the release side then becomes:

```cpp
memkind_free(NULL, pData); // "kind" parameter is deduced from the address
```
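The allocation side of this replacement could look like the following minimal sketch (assuming a buffer `pData` of `N` doubles, taken from any HBM NUMA node via the `MEMKIND_HBW_ALL` kind):

```cpp
#include <memkind.h>

// Request N doubles from high-bandwidth memory instead of the default pool;
// a NULL result means the request could not be satisfied from HBM.
double *pData = (double *)memkind_malloc(MEMKIND_HBW_ALL, N * sizeof(double));
```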
Other kinds of memory can be chosen similarly.
!!! note

...
## High Bandwidth Memory (HBM)
Intel Sapphire Rapids (partition `p10-intel`) consists of two sockets, each with `128GB` of DDR DRAM and `64GB` of on-package HBM memory. The machine is configured in FLAT mode and therefore exposes the HBM as memory-only NUMA nodes (`16GB` per 12-core tile). The configuration can be verified by running
```bash
numactl -H
```
which should show 16 NUMA nodes (`0-7` should each contain 12 cores and `32GB` of DDR DRAM, while `8-15` should have no cores and `16GB` of HBM each).


...
### Process Level
With this, we can easily restrict an application to either DDR DRAM or HBM memory.
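On `p10-intel`, where NUMA nodes `0-7` hold the DDR DRAM and `8-15` the HBM, this might look like the following sketch (assuming the benchmark binary is called `./stream`):

```bash
# Allocate only from DDR DRAM (nodes 0-7)
numactl --membind 0-7 ./stream

# Allocate only from HBM (nodes 8-15)
numactl --membind 8-15 ./stream
```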
The DDR DRAM achieves a bandwidth of around 400GB/s, while the HBM clears the 1TB/s bar.
Some further improvement can be achieved by entirely isolating a process to a single tile. This can be useful for MPI jobs, where `$OMPI_COMM_WORLD_RANK` can be used to bind each process individually. A simple wrapper script to do this may look like the sketch below.
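Here the name `wrapper.sh` is hypothetical, and the mapping of tile `i` to HBM node `i + 8` is an assumption that should be verified against the distances reported by `numactl -H`:

```bash
#!/bin/bash
# wrapper.sh -- bind this MPI rank to the cores of "its" tile and to the local HBM node
TILE=${OMPI_COMM_WORLD_RANK}
exec numactl --cpunodebind "${TILE}" --membind "$((TILE + 8))" "$@"
```

Such a wrapper would then be launched as, e.g., `mpirun -np 8 ./wrapper.sh ./app`.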
Each of the 8 MPI ranks is thus pinned to one tile (8 tiles with 12 cores each). However, this approach assumes that the `16GB` of HBM memory local to the tile is sufficient for each process (memory cannot spill between tiles). The approach may be significantly more useful in combination with `--preferred` instead of `--membind`, which forces a preference for the local HBM with spill-over to DDR DRAM. Otherwise, binding to all HBM nodes at once (`numactl --membind 8-15`) is most likely preferable even for MPI workloads. Applying the above approach to MPI Stream with 8 ranks and 1-24 threads per rank, we can expect these results:




...
### Allocation Level
Allocation-level memory kind selection using the `memkind` library can be illustrated with a modified stream benchmark. The stream benchmark uses three working arrays (A, B and C), whose allocation can be changed to `memkind_malloc` as follows
```cpp
#include <memkind.h>

// ...
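
// A sketch of the corresponding allocations (assuming the usual STREAM_ARRAY_SIZE
// macro and double* working arrays a, b and c): A and C come from HBM, B from DDR.
a = (double *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(double));
b = (double *)memkind_malloc(MEMKIND_REGULAR, STREAM_ARRAY_SIZE * sizeof(double));
c = (double *)memkind_malloc(MEMKIND_HBW_ALL, STREAM_ARRAY_SIZE * sizeof(double));

// ...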
memkind_free(NULL, a);
memkind_free(NULL, b);
memkind_free(NULL, c);
```
Arrays A and C are allocated from HBM (`MEMKIND_HBW_ALL`), while DDR DRAM (`MEMKIND_REGULAR`) is used for B.
The code then has to be linked against the `memkind` library.
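For example, with GCC this could be a sketch along the lines of (assuming the modified benchmark source is `stream.c`):

```bash
gcc -O3 -fopenmp stream.c -o stream -lmemkind
```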
While the `memkind` library should be able to detect HBM memory on its own (through `HMAT` and `hwloc`), this is not supported on `p10-intel`. This means that the NUMA nodes representing HBM have to be specified manually using the `MEMKIND_HBW_NODES` environment variable.
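On `p10-intel` this could look like the following sketch (nodes `8-15` are the HBM-only nodes; `./stream` stands for the benchmark binary):

```bash
export MEMKIND_HBW_NODES=8,9,10,11,12,13,14,15
./stream
```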