# Complementary System Job Scheduling
## Introduction
The [Slurm][1] workload manager is used to allocate and access the resources of the Complementary systems.
Display partitions/queues:

```console
$ sinfo -s
PARTITION  AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
p00-arm       up  1-00:00:00         0/1/0/1  p00-arm01
p01-arm*      up  1-00:00:00         0/8/0/8  p01-arm[01-08]
p02-intel     up  1-00:00:00         0/2/0/2  p02-intel[01-02]
p03-amd       up  1-00:00:00         0/2/0/2  p03-amd[01-02]
p04-edge      up  1-00:00:00         0/1/0/1  p04-edge01
p05-synt      up  1-00:00:00         0/1/0/1  p05-synt01
```

Show your jobs in the queue:

```console
$ squeue --me
JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
  104   p01-arm interact     user  R  1:48     2 p01-arm[01-02]
```
Show job details:

```console
$ scontrol show job 104
```
Run interactive job:

```console
$ salloc -A PROJECT-ID -p p01-arm
```
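Within the granted allocation, commands are launched on the allocated nodes with `srun`; a short illustrative sequence (the node count is an example only):

```console
$ salloc -A PROJECT-ID -p p01-arm -N 2
$ srun hostname
$ exit
```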
Run interactive job with X11 forwarding:

```console
$ salloc -A PROJECT-ID -p p01-arm --x11
```

Do not use `srun` to initiate interactive jobs; subsequent `srun` and `mpirun` invocations would block forever.

Run batch job:

```console
$ sbatch -A PROJECT-ID -p p01-arm ./script.sh
```
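A batch script is a regular shell script with optional `#SBATCH` directives; a minimal sketch of such a `script.sh` (the workload `./my_app` is a placeholder):

```bash
#!/usr/bin/env bash
#SBATCH --job-name=example
#SBATCH --account=PROJECT-ID
#SBATCH --partition=p01-arm
#SBATCH --nodes=1
#SBATCH --time=02:00:00

# Report where the job runs.
echo "Job $SLURM_JOB_ID running on $(hostname)"

# Launch the actual workload; replace ./my_app with your application.
srun ./my_app
```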
Useful command options (`salloc`, `sbatch`, `srun`):

* `-n`, `--ntasks`
* `-c`, `--cpus-per-task`
* `-N`, `--nodes`
| Partition | Nodes | Cores per node |
| --------- | ----- | -------------- |
| p00-arm   | 1     | 64             |
| p01-arm   | 8     | 48             |
| p02-intel | 2     | 64             |
| p03-amd   | 2     | 64             |
| p04-edge  | 1     | 16             |
| p05-synt  | 1     | 8              |
Use the `-t`, `--time` option to specify the job time limit. The default time limit is 2 hours and the maximum is 24 hours.
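For example, these options can be combined in a single submission; the values below are illustrative only (8 tasks with 6 CPUs per task fill one 48-core `p01-arm` node, and the 4-hour limit stays within the 24-hour maximum):

```console
$ sbatch -A PROJECT-ID -p p01-arm -N 1 -n 8 -c 6 -t 04:00:00 ./script.sh
```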
FIFO scheduling with backfilling is employed.
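Estimated start times of pending jobs, as computed by the backfill scheduler, can be queried with `squeue`; the estimates are indicative only and shift as other jobs finish:

```console
$ squeue --me --start
```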
## Partition 00 - ARM (Legacy)
Whole node allocation.
One node:
```console
sbatch -A PROJECT-ID -p p00-arm ./script.sh
```
## Partition 01 - ARM (A64FX)
Whole node allocation.
One node:
```console
sbatch -A PROJECT-ID -p p01-arm ./script.sh
```
```console
sbatch -A PROJECT-ID -p p01-arm -N 1 ./script.sh
```
Multiple nodes:
```console
sbatch -A PROJECT-ID -p p01-arm -N 8 ./script.sh
```
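A minimal sketch of a matching multi-node `script.sh`, assuming an MPI application `./my_mpi_app` (hypothetical name) built for the A64FX nodes; each `p01-arm` node provides 48 cores:

```bash
#!/usr/bin/env bash
#SBATCH --account=PROJECT-ID
#SBATCH --partition=p01-arm
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=48
#SBATCH --time=08:00:00

# Launch one MPI rank per allocated core across all 8 nodes.
# Replace ./my_mpi_app with your own application.
srun ./my_mpi_app
```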
## Partition 02 - Intel (Ice Lake, NVDIMMs + Bitware FPGAs)
Partial allocation - per FPGA, resource separation is not enforced.
One FPGA:
```console
sbatch -A PROJECT-ID -p p02-intel --gres=fpga ./script.sh
```
Two FPGAs on the same node:
```console
sbatch -A PROJECT-ID -p p02-intel --gres=fpga:2 ./script.sh
```
All FPGAs:
```console
sbatch -A PROJECT-ID -p p02-intel -N 2 --gres=fpga:2 ./script.sh
```
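Inside a job, the generic resources actually granted can be checked from the job record; a minimal sketch (the exact field names depend on the Slurm version and configuration):

```console
$ scontrol show job $SLURM_JOB_ID | grep -i gres
```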
## Partition 03 - AMD (Milan, MI100 GPUs + Xilinx FPGAs)
Partial allocation - per GPU and per FPGA, resource separation is not enforced.
One GPU:
```console
sbatch -A PROJECT-ID -p p03-amd --gres=gpgpu ./script.sh
```
Two GPUs on the same node:
```console
sbatch -A PROJECT-ID -p p03-amd --gres=gpgpu:2 ./script.sh
```
Four GPUs on the same node:
```console
sbatch -A PROJECT-ID -p p03-amd --gres=gpgpu:4 ./script.sh
```
All GPUs:
```console
sbatch -A PROJECT-ID -p p03-amd -N 2 --gres=gpgpu:4 ./script.sh
```
One FPGA:
```console
sbatch -A PROJECT-ID -p p03-amd --gres=fpga ./script.sh
```
Two FPGAs:
```console
sbatch -A PROJECT-ID -p p03-amd --gres=fpga:2 ./script.sh
```
All FPGAs:
```console
sbatch -A PROJECT-ID -p p03-amd -N 2 --gres=fpga:2 ./script.sh
```
One GPU and one FPGA on the same node:
```console
sbatch -A PROJECT-ID -p p03-amd --gres=gpgpu,fpga ./script.sh
```
Four GPUs and two FPGAs on the same node:
```console
sbatch -A PROJECT-ID -p p03-amd --gres=gpgpu:4,fpga:2 ./script.sh
```
All GPUs and FPGAs:
```console
sbatch -A PROJECT-ID -p p03-amd -N 2 --gres=gpgpu:4,fpga:2 ./script.sh
```
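To verify which MI100 GPUs are visible inside the allocation, the ROCm tooling can be queried, assuming ROCm is installed on the node; note that resource separation is not enforced, so all devices may be listed:

```console
$ rocm-smi
```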
## Partition 04 - Edge Server
Whole node allocation:
```console
sbatch -A PROJECT-ID -p p04-edge ./script.sh
```
## Partition 05 - FPGA Synthesis Server
Whole node allocation:
```console
sbatch -A PROJECT-ID -p p05-synt ./script.sh
```
[1]: https://slurm.schedmd.com/