# Complementary System Job Scheduling
## Introduction
The [Slurm][1] workload manager is used to allocate and access Complementary systems resources.
Display partitions/queues
```console
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
p00-arm up 1-00:00:00 0/1/0/1 p00-arm01
p01-arm* up 1-00:00:00 0/8/0/8 p01-arm[01-08]
p02-intel up 1-00:00:00 0/2/0/2 p02-intel[01-02]
p03-amd up 1-00:00:00 0/2/0/2 p03-amd[01-02]
p04-edge up 1-00:00:00 0/1/0/1 p04-edge01
p05-synt up 1-00:00:00 0/1/0/1 p05-synt01
```
Display your jobs in the queue
```console
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
104 p01-arm interact user R 1:48 2 p01-arm[01-02]
```
Show details of the executing job from within the job session
```console
$ scontrol -d show job $SLURM_JOBID
```
## Running Interactive Jobs
Run interactive job
```console
$ salloc -A PROJECT-ID -p p01-arm
```
Run interactive job, with X11 forwarding
```console
$ salloc -A PROJECT-ID -p p01-arm --x11
```
!!! warning
    Do not use `srun` to initiate interactive jobs; subsequent `srun` or `mpirun` invocations would block forever.
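A minimal sketch of the intended workflow (the node count and the `hostname` command are only illustrative): obtain the allocation with `salloc`, then launch parallel tasks from within the job session with `srun` or a compatible `mpirun`.
```console
$ salloc -A PROJECT-ID -p p01-arm -N 2   # request two nodes interactively
$ srun hostname                          # launch one task per allocated node
```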
## Running Batch Jobs
Run batch job
```console
$ sbatch -A PROJECT-ID -p p01-arm ../script.sh
```
Useful command options (`salloc`, `sbatch`, `srun`); see the example batch script after this list:
* -n, --ntasks
* -c, --cpus-per-task
* -N, --nodes
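A minimal batch script sketch using these options (the partition, task counts, and script body are only illustrative):
```bash
#!/usr/bin/env bash
#SBATCH --account=PROJECT-ID
#SBATCH --partition=p01-arm
#SBATCH --nodes=2           # -N: number of nodes
#SBATCH --ntasks=96         # -n: total number of tasks (2 x 48 cores)
#SBATCH --cpus-per-task=1   # -c: cores per task
#SBATCH --time=02:00:00     # -t: run time limit

# Show where the job runs
echo "Nodes allocated to the job: $SLURM_JOB_NODELIST"

# Launch the tasks across the allocated nodes
srun hostname
```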
Slurm provides useful information to the job via environment variables. These variables are available on all nodes allocated to the job when accessed via Slurm-supported means (`srun`, compatible `mpirun`).
See all Slurm variables
```console
$ set | grep ^SLURM
```
| variable name | description | example |
| ------ | ------ | ------ |
| SLURM_JOBID | job id of the executing job| 593 |
| SLURM_JOB_NODELIST | nodes allocated to the job | p03-amd[01-02] |
| SLURM_JOB_NUM_NODES | number of nodes allocated to the job | 2 |
| SLURM_STEP_NODELIST | nodes allocated to the job step | p03-amd01 |
| SLURM_STEP_NUM_NODES | number of nodes allocated to the job step | 1 |
| SLURM_JOB_PARTITION | name of the partition | p03-amd |
| SLURM_SUBMIT_DIR | submit directory | /scratch/project/open-xx-yy/work |
See [Slurm srun documentation][2] for details.
Get job nodelist
```
$ echo $SLURM_JOB_NODELIST
p03-amd[01-02]
```
Expand the nodelist into a list of individual nodes.
```
$ scontrol show hostnames $SLURM_JOB_NODELIST
p03-amd01
p03-amd02
```
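The expanded list is handy in job scripts, for example to loop over the allocated nodes (a sketch; the `echo` is just a placeholder for real per-node work):
```console
$ for host in $(scontrol show hostnames $SLURM_JOB_NODELIST); do echo "node: $host"; done
node: p03-amd01
node: p03-amd02
```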
| PARTITION | nodes | cores per node | features |
| ------ | ------ | ------ | ------ |
| p00-arm | 1 | 64 | aarch64,cortex-a72 |
| p01-arm | 8 | 48 | aarch64,a64fx,ib |
| p02-intel | 2 | 64 | x86_64,intel,icelake,ib,fpga,bitware,nvdimm |
| p03-amd | 2 | 64 | x86_64,amd,milan,ib,gpgpu,mi100,fpga,xilinx |
| p04-edge | 1 | 16 | x86_64,intel,broadwell,ib |
| p05-synt | 1 | 8 | x86_64,amd,milan,ib,ht |
Use the `-t`/`--time` option to specify the job run time limit. The default job time limit is 2 hours; the maximum job time limit is 24 hours.
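For example, to request a 4-hour limit (the time value is only illustrative):
```console
$ salloc -A PROJECT-ID -p p01-arm -t 04:00:00
```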
FIFO scheduling with backfilling is employed.
## Partition 00 - ARM (Cortex-A72)
Whole node allocation.
One node:
```console
salloc -A PROJECT-ID -p p00-arm
```
## Partition 01 - ARM (A64FX)
Whole node allocation.
One node:
```console
salloc -A PROJECT-ID -p p01-arm
```
Multiple nodes:
```console
salloc -A PROJECT-ID -p p01-arm -N 8
```
## Partition 02 - Intel (Ice Lake, NVDIMMs + Bitware FPGAs)
FPGAs are treated as resources. See below for more details about resources.
Partial allocation - per FPGA, resource separation is not enforced.
One FPGA:
```console
salloc -A PROJECT-ID -p p02-intel --gres=fpga
```
Two FPGAs on the same node:
```console
salloc -A PROJECT-ID -p p02-intel --gres=fpga:2
```
All FPGAs:
```console
salloc -A PROJECT-ID -p p02-intel -N 2 --gres=fpga:2
```
## Partition 03 - AMD (Milan, MI100 GPUs + Xilinx FPGAs)
GPGPUs and FPGAs are treated as resources. See below for more details about resources.
Partial allocation - per GPGPU and per FPGA, resource separation is not enforced.
One GPU:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpgpu
```
Two GPUs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpgpu:2
```
Four GPUs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpgpu:4
```
All GPUs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpgpu:4
```
One FPGA:
```console
salloc -A PROJECT-ID -p p03-amd --gres=fpga
```
Two FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd --gres=fpga:2
```
All FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=fpga:2
```
One GPU and one FPGA on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpgpu,fpga
```
Four GPUs and two FPGAs on the same node:
```console
salloc -A PROJECT-ID -p p03-amd --gres=gpgpu:4,fpga:2
```
All GPUs and FPGAs:
```console
salloc -A PROJECT-ID -p p03-amd -N 2 --gres=gpgpu:4,fpga:2
```
## Partition 04 - Edge Server
Whole node allocation:
```console
salloc -A PROJECT-ID -p p04-edge
```
## Partition 05 - FPGA Synthesis Server
Whole node allocation:
```console
salloc -A PROJECT-ID -p p05-synt
```
## Features
Nodes have feature tags assigned to them.
Users can select nodes based on the feature tags using the `--constraint` option.
| Feature | Description |
| ------ | ------ |
| aarch64 | platform |
| x86_64 | platform |
| amd | manufacturer |
| intel | manufacturer |
| icelake | processor family |
| broadwell | processor family |
| milan | processor family |
| ib | Infiniband |
| gpgpu | equipped with GPGPU |
| fpga | equipped with FPGA |
| nvdimm | equipped with NVDIMMs |
| ht | Hyperthreading enabled |
| noht | Hyperthreading disabled |
```
$ sinfo -o '%16N %f'
NODELIST AVAIL_FEATURES
p00-arm01 aarch64,cortex-a72
p01-arm[01-08] aarch64,a64fx,ib
p02-intel01 x86_64,intel,icelake,ib,fpga,bitware,nvdimm,ht
p02-intel02 x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
p03-amd01 x86_64,amd,milan,ib,gpgpu,mi100,fpga,xilinx,ht
p03-amd02 x86_64,amd,milan,ib,gpgpu,mi100,fpga,xilinx,noht
p04-edge01 x86_64,intel,broadwell,ib,ht
p05-synt01 x86_64,amd,milan,ib,ht
```
Allocate a node with hyperthreading disabled:
```console
$ salloc -A PROJECT-ID -p p02-intel --constraint noht
$ scontrol -d show node p02-intel02 | grep ActiveFeatures
ActiveFeatures=x86_64,intel,icelake,ib,fpga,bitware,nvdimm,noht
```
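Constraints can also be combined using standard Slurm constraint expressions; for example, to request an AMD node with hyperthreading disabled (a sketch using the feature tags listed above):
```console
$ salloc -A PROJECT-ID -p p03-amd --constraint "amd&noht"
```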
## Resources
Slurm supports the definition and scheduling of arbitrary resources, known as Generic RESources (GRES) in Slurm's terminology. We use GRES for scheduling/allocating GPGPUs and FPGAs.
!!! warning
    Use only allocated GPGPUs and FPGAs. Resource separation is not enforced. If you use resources that are not allocated to your job, you may observe strange behavior and run into problems.
```
$ scontrol -d show node p03-amd01 | grep Gres=
Gres=gpgpu:amd_mi100:4,fpga:xilinx_alveo_u250:2
$ scontrol -d show node p03-amd02 | grep Gres=
Gres=gpgpu:amd_mi100:4,fpga:xilinx_alveo_u280:2
```
Example: Allocate one FPGA
```
$ salloc -A PROJECT-ID -p p03-amd --gres fpga:1
salloc: Granted job allocation XXX
salloc: Waiting for resource configuration
salloc: Nodes p03-amd01 are ready for job
```
### Find Out Allocated Resources
Information about allocated resources is available in the Slurm job details, attributes JOB_GRES and GRES.
```
$ scontrol -d show job $SLURM_JOBID |grep GRES=
JOB_GRES=fpga:xilinx_alveo_u250:1
Nodes=p03-amd01 CPU_IDs=0-1 Mem=0 GRES=fpga:xilinx_alveo_u250:1(IDX:0)
```
IDX in the GRES attribute specifies the index(es) of the FPGA(s) or GPGPU(s) allocated to the job on that node. In the example above, the allocated resource is fpga:xilinx_alveo_u250:1(IDX:0), so we should use the FPGA with index 0 on node p03-amd01.
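A minimal sketch of reading the allocated index from within a single-node job (the `FPGA_IDX` variable name is only an illustration; it assumes the GRES line format shown above):
```console
$ FPGA_IDX=$(scontrol -d show job $SLURM_JOBID | grep ' GRES=' | sed -n 's/.*IDX:\([0-9,-]*\)).*/\1/p')
$ echo $FPGA_IDX
0
```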
It is possible to allocate specific resource types. This is useful on partition p03-amd, where FPGAs of different types are available. A GRES entry uses the format `name[[:type]:count]`; in the following example, the name is fpga, the type is xilinx_alveo_u280, and the count is 2.
```
$ salloc -A PROJECT-ID -p p03-amd --gres=fpga:xilinx_alveo_u280:2
salloc: Granted job allocation XXX
salloc: Waiting for resource configuration
salloc: Nodes p03-amd02 are ready for job
$ scontrol -d show job $SLURM_JOBID | grep -i gres
Nodes=p03-amd02 CPU_IDs=0 Mem=0 GRES=fpga:xilinx_alveo_u280(IDX:0-1)
```
[1]: https://slurm.schedmd.com/
[2]: https://slurm.schedmd.com/srun.html#SECTION_OUTPUT-ENVIRONMENT-VARIABLES