# Complementary System Job Scheduling
## Introduction
The [Slurm][1] workload manager is used to allocate and access the resources of the Complementary systems.
Display the partition summary:
```console
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
p00-arm up 1-00:00:00 0/1/0/1 p00-arm01
p01-arm* up 1-00:00:00 0/8/0/8 p01-arm[01-08]
p03-amd up 1-00:00:00 0/2/0/2 p03-amd[01-02]
p04-edge up 1-00:00:00 0/1/0/1 p04-edge01
p05-synt up 1-00:00:00 0/1/0/1 p05-synt01
p06-arm up 1-00:00:00 0/2/0/2 p06-arm[01-02]
p07-power up 1-00:00:00 0/1/0/1 p07-power01
p08-amd up 1-00:00:00 0/1/0/1 p08-amd01
p10-intel up 1-00:00:00 0/1/0/1 p10-intel01
```
Show jobs of the current user:
```console
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
104 p01-arm interact user R 1:48 2 p01-arm[01-02]
```
Show details of the executing job from within the job session
```console
$ scontrol -d show job $SLURM_JOBID
```
## Running Interactive Jobs
Run interactive job
```console
$ salloc -A PROJECT-ID -p p01-arm
```
Run interactive job, with X11 forwarding
```console
$ salloc -A PROJECT-ID -p p01-arm --x11
```
!!! warning
    Do not use `srun` to initiate interactive jobs; subsequent `srun` or `mpirun` invocations would block forever.
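For example (a sketch; the node count is illustrative), allocate nodes with `salloc` first, then launch commands on the allocated nodes with `srun` from within the job session:
```console
$ salloc -A PROJECT-ID -p p01-arm -N 2
$ srun hostname
```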
## Running Batch Jobs
Run batch job
```console
$ sbatch script.sh
```
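A minimal sketch of such a job script (the name `script.sh` and all resource values are illustrative):
```console
$ cat script.sh
#!/usr/bin/env bash
#SBATCH --account=PROJECT-ID
#SBATCH --partition=p01-arm
#SBATCH --nodes=2
#SBATCH --time=04:00:00
# launch one task per allocated node
srun hostname
```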
Useful options for `salloc`, `sbatch`, and `srun` (an example follows the list):
* `-n`, `--ntasks`
* `-c`, `--cpus-per-task`
* `-N`, `--nodes`
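For example (a sketch; the values are illustrative), request 2 nodes with 8 tasks, each task using 6 cores:
```console
$ salloc -A PROJECT-ID -p p01-arm -N 2 -n 8 -c 6
```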
Slurm provides useful information to the job via environment variables. The environment variables are available on all nodes allocated to the job when accessed via Slurm-supported means (`srun`, compatible `mpirun`).
| variable name | description | example |
| ------ | ------ | ------ |
| SLURM_JOB_ID | job id of the executing job| 593 |
| SLURM_JOB_NODELIST | nodes allocated to the job | p03-amd[01-02] |
| SLURM_JOB_NUM_NODES | number of nodes allocated to the job | 2 |
| SLURM_STEP_NODELIST | nodes allocated to the job step | p03-amd01 |
| SLURM_STEP_NUM_NODES | number of nodes allocated to the job step | 1 |
| SLURM_JOB_PARTITION | name of the partition | p03-amd |
| SLURM_SUBMIT_DIR | submit directory | /scratch/project/open-xx-yy/work |
See [Slurm srun documentation][2] for details.
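For instance, from within a job session (the values shown correspond to the examples in the table above):
```console
$ echo $SLURM_JOB_ID $SLURM_JOB_PARTITION $SLURM_JOB_NUM_NODES
593 p03-amd 2
```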
Get job nodelist
```
$ echo $SLURM_JOB_NODELIST
p03-amd[01-02]
```
Expand nodelist to list of nodes.
```
$ scontrol show hostnames $SLURM_JOB_NODELIST
p03-amd01
p03-amd02
```
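A sketch of consuming the expanded list, e.g. iterating over the allocated nodes in a shell loop:
```
$ for host in $(scontrol show hostnames $SLURM_JOB_NODELIST); do echo "node: $host"; done
node: p03-amd01
node: p03-amd02
```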
## Modifying Jobs
```
$ scontrol update JobId=JOBID ATTR=VALUE
```
For example:
```
$ scontrol update JobId=JOBID Comment='The best job ever'
```
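Another commonly updated attribute is the time limit (a hedged example; the new value must stay within the partition limit):
```
$ scontrol update JobId=JOBID TimeLimit=4:00:00
```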
## Deleting Jobs
```
$ scancel JOBID
```
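To cancel all of your own jobs at once, `scancel` also accepts a user filter:
```
$ scancel --user=$USER
```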
## Partitions
| PARTITION | nodes | whole node | cores per node | features |
| --------- | ----- | ---------- | -------------- | -------- |
| p00-arm | 1 | yes | 64 | aarch64,cortex-a72 |
| p01-arm | 8 | yes | 48 | aarch64,a64fx,ib |
| p02-intel | 2 | no | 64 | x86_64,intel,icelake,ib,fpga,bitware,nvdimm |
| p03-amd | 2 | no | 64 | x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx |
| p04-edge | 1 | yes | 16 | x86_64,intel,broadwell,ib |
| p05-synt | 1 | yes | 8 | x86_64,amd,milan,ib,ht |
| p06-arm | 2 | yes | 80 | aarch64,ib |
| p07-power | 1 | yes | 192 | ppc64le,ib |
| p08-amd | 1 | yes | 128 | x86_64,amd,milan-x,ib,ht |
| p10-intel | 1 | yes | 96 | x86_64,intel,sapphire_rapids,ht|
Use the `-t`, `--time` option to specify the job run time limit. The default job time limit is 2 hours; the maximum job time limit is 24 hours.
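For example (a sketch), request an 8-hour time limit:
```console
$ salloc -A PROJECT-ID -p p01-arm -t 08:00:00
```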
## Partition 00 - ARM (Cortex-A72)
Whole node allocation.
One node:
```console
salloc -A PROJECT-ID -p p00-arm
```
## Partition 01 - ARM (A64FX)
Whole node allocation.
One node:
```console
salloc -A PROJECT-ID -p p01-arm
```
Multiple nodes:
```console
salloc -A PROJECT-ID -p p01-arm -N 2
```
## Partition 02 - Intel (Ice Lake, NVDIMMs + Bitware FPGAs)
FPGAs are treated as resources. See below for more details about resources.
Partial allocation - per FPGA, resource separation is not enforced.
Two FPGAs on the same node:
```console
salloc -A PROJECT-ID -p p02-intel --gres=fpga:2
```
All FPGAs:
```console
salloc -A PROJECT-ID -p p02-intel -N 2 --gres=fpga:2
```
## Partition 03 - AMD (Milan, MI100 GPUs + Xilinx FPGAs)
GPUs and FPGAs are treated as resources. See below for more details about resources.
Partial allocation - per GPU and per FPGA, resource separation is not enforced.
Use only GPUs and FPGAs allocated to the job!
Two GPUs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:2
```
Four GPUs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:4
```
All GPUs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpu:4
```
One FPGA:
```console
salloc -A PROJECT-ID -p p03-amd --gres=fpga:1
```
Two FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd --gres=fpga:2
```
All FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=fpga:2
```
One GPU and one FPGA on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:1,fpga:1
```
Four GPUs and two FPGAs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpu:4,fpga:2
```
All GPUs and FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpu:4,fpga:2
```
## Partition 04 - Edge Server
Whole node allocation:
```console
salloc -A PROJECT-ID -p p04-edge
```
## Partition 05 - FPGA Synthesis Server
Whole node allocation:
```console
salloc -A PROJECT-ID -p p05-synt
```
## Partition 06 - ARM
Whole node allocation:
```console
salloc -A PROJECT-ID -p p06-arm
```
## Partition 07 - IBM Power
Whole node allocation:
```console
salloc -A PROJECT-ID -p p07-power
```
## Partition 08 - AMD Milan-X
Whole node allocation:
```console
salloc -A PROJECT-ID -p p08-amd
```
## Partition 10 - Intel Sapphire Rapids
Whole node allocation:
```console
salloc -A PROJECT-ID -p p10-intel
```
## Features
Nodes have feature tags assigned to them.
Users can select nodes based on the feature tags using the `--constraint` option.
| Feature | Description |
| ------ | ------ |
| aarch64 | platform |
| x86_64 | platform |
| amd | manufacturer |
| intel | manufacturer |
| icelake | processor family |
| broadwell | processor family |
| sapphire_rapids | processor family |
| fpga | equipped with FPGA |
| nvdimm | equipped with NVDIMMs |
| ht | Hyperthreading enabled |
| noht | Hyperthreading disabled |
```console
$ sinfo -o '%16N %f'
NODELIST AVAIL_FEATURES
p00-arm01 aarch64,cortex-a72
p01-arm[01-08] aarch64,a64fx,ib
p02-intel01 x86_64,intel,icelake,ib,fpga,bitware,nvdimm,ht
p02-intel02 x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
p03-amd02 x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx,noht
p03-amd01 x86_64,amd,milan,ib,gpu,mi100,fpga,xilinx,ht
p04-edge01 x86_64,intel,broadwell,ib,ht
p05-synt01 x86_64,amd,milan,ib,ht
p06-arm[01-02] aarch64,ib
p07-power01 ppc64le,ib
p08-amd01 x86_64,amd,milan-x,ib,ht
p10-intel01 x86_64,intel,sapphire_rapids,ht
```
```console
$ salloc -A PROJECT-ID -p p02-intel --constraint noht
$ scontrol -d show node p02-intel02 | grep ActiveFeatures
ActiveFeatures=x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
```
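Feature constraints can also be combined using Slurm's `&` operator; for example (a sketch), request an AMD node with hyperthreading enabled:
```console
$ salloc -A PROJECT-ID -p p03-amd --constraint 'amd&ht'
```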
## Resources (GRES)
Slurm supports the definition and scheduling of arbitrary resources - Generic RESources (GRES) in Slurm's terminology. GRES is used for scheduling/allocating GPUs and FPGAs.
Use only the GPUs and FPGAs allocated to the job. Resource separation is not enforced; if you use non-allocated resources, you may observe strange behavior and run into trouble.
Check the GRES configured on a node:
```console
$ scontrol -d show node p02-intel01 | grep Gres=
Gres=fpga:bitware_520n_mx:2
$ scontrol -d show node p02-intel02 | grep Gres=
Gres=fpga:bitware_520n_mx:2
```
To allocate the required resources (GPUs or FPGAs), use the `--gres` option of `salloc`/`srun`.
Example: Allocate one FPGA
```
$ salloc -A PROJECT-ID -p p03-amd --gres fpga:1
```
Information about allocated resources is available in Slurm job details, attributes `JOB_GRES` and `GRES`.
```
$ scontrol -d show job $SLURM_JOBID |grep GRES=
JOB_GRES=fpga:xilinx_alveo_u250:1
Nodes=p03-amd01 CPU_IDs=0-1 Mem=0 GRES=fpga:xilinx_alveo_u250:1(IDX:0)
```
The IDX field in the `GRES` attribute specifies the index (or indices) of the FPGAs or GPUs allocated to the job on the given node. In the example above, the allocated resource is `fpga:xilinx_alveo_u250:1(IDX:0)`, i.e. the FPGA with index 0 on node p03-amd01 should be used.
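How the allocated index is consumed depends on the software stack. As one hedged illustration for the MI100 GPUs, a ROCm application could be restricted to the allocated GPU index via the `ROCR_VISIBLE_DEVICES` environment variable (assuming the job was allocated GPU index 0; FPGA tooling differs per vendor):
```console
$ export ROCR_VISIBLE_DEVICES=0
```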
It is also possible to allocate resources of a specific type. This is useful on partition p03-amd, which is equipped with FPGAs of different types.
The GRES entry uses the format `name[[:type]:count]`; in the following example, the name is `fpga`, the type is `xilinx_alveo_u280`, and the count is 2.
```console
$ salloc -A PROJECT-ID -p p03-amd --gres=fpga:xilinx_alveo_u280:2
salloc: Granted job allocation XXX
salloc: Waiting for resource configuration
salloc: Nodes p03-amd02 are ready for job
$ scontrol -d show job $SLURM_JOBID | grep -i gres
Nodes=p03-amd02 CPU_IDs=0 Mem=0 GRES=fpga:xilinx_alveo_u280(IDX:0-1)
```
[1]: https://slurm.schedmd.com/
[2]: https://slurm.schedmd.com/srun.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES