# Resource Allocation and Job Execution
To run a job, you must allocate computational resources of the DGX-2 machine.
The DGX-2 machine is integrated into, and accessible through, the Barbora cluster; the queue for the DGX-2 machine is called **qdgx**.
When allocating computational resources for the job, specify:
1. your Project ID;
1. a queue for your job - **qdgx**;
1. the maximum wall time allocated to your calculation (the default is **4 hours**, the maximum is **48 hours**);
1. a jobscript if batch processing is intended.
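For batch processing, the same requirements are typically specified in the jobscript header. A minimal sketch of such a jobscript (the filename `job.sh` and the wall time are illustrative assumptions, not a prescribed template):

```shell
#!/bin/bash
#SBATCH --account=PROJECT-ID    # your Project ID
#SBATCH --partition=qdgx        # the DGX-2 queue
#SBATCH --time=08:00:00         # wall time limit (maximum is 48:00:00)

# Commands to run on the allocated DGX-2 node go here, e.g.:
nvidia-smi
```

Such a script would then be submitted with `sbatch job.sh`.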
Submit the job using the `sbatch` (for batch processing) or `salloc` (for an interactive session) command:
```console
[kru0052@login2.barbora ~]$ salloc -A PROJECT-ID -p qdgx --time=02:00:00
salloc: Granted job allocation 36631
salloc: Waiting for resource configuration
salloc: Nodes cn202 are ready for job
kru0052@cn202:~$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  On   | 00000000:34:00.0 Off |                    0 |
| N/A   32C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM3...  On   | 00000000:36:00.0 Off |                    0 |
| N/A   31C    P0    48W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM3...  On   | 00000000:39:00.0 Off |                    0 |
| N/A   35C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM3...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   36C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM3...  On   | 00000000:57:00.0 Off |                    0 |
| N/A   29C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM3...  On   | 00000000:59:00.0 Off |                    0 |
| N/A   35C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM3...  On   | 00000000:5C:00.0 Off |                    0 |
| N/A   30C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM3...  On   | 00000000:5E:00.0 Off |                    0 |
| N/A   35C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   8  Tesla V100-SXM3...  On   | 00000000:B7:00.0 Off |                    0 |
| N/A   30C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   9  Tesla V100-SXM3...  On   | 00000000:B9:00.0 Off |                    0 |
| N/A   30C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  10  Tesla V100-SXM3...  On   | 00000000:BC:00.0 Off |                    0 |
| N/A   35C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  11  Tesla V100-SXM3...  On   | 00000000:BE:00.0 Off |                    0 |
| N/A   35C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  12  Tesla V100-SXM3...  On   | 00000000:E0:00.0 Off |                    0 |
| N/A   31C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  13  Tesla V100-SXM3...  On   | 00000000:E2:00.0 Off |                    0 |
| N/A   29C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  14  Tesla V100-SXM3...  On   | 00000000:E5:00.0 Off |                    0 |
| N/A   34C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|  15  Tesla V100-SXM3...  On   | 00000000:E7:00.0 Off |                    0 |
| N/A   34C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
```
### Job Execution
The DGX-2 machine runs only a bare-bones, minimal operating system. Users are expected to run
**[Apptainer/Singularity][1]** containers to enrich the environment according to their needs.
Containers (Docker images) optimized for the DGX-2 may be downloaded from
[NVIDIA GPU Cloud][2]. Select the image of interest and
copy the `nvcr.io` link from the Pull Command section. This link may be used directly
to download the container via Apptainer/Singularity, see the example below:
```console
[kru0052@login2.barbora ~]$ salloc -A PROJECT-ID -p qdgx --time=02:00:00
salloc: Granted job allocation 36633
salloc: Waiting for resource configuration
salloc: Nodes cn202 are ready for job
kru0052@cn202:~$ singularity shell docker://nvcr.io/nvidia/tensorflow:19.02-py3
Singularity tensorflow_19.02-py3.sif:~>
Singularity tensorflow_19.02-py3.sif:~> mpiexec --bind-to socket -np 16 python /opt/tensorflow/nvidia-examples/cnn/resnet.py --layers=18 --precision=fp16 --batch_size=512
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
TF 1.13.0-rc0
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609]
...
...
...
2019-03-11 08:30:12.263822: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
1 1.0 338.2 6.999 7.291 2.00000
10 10.0 3658.6 5.658 5.950 1.62000
20 20.0 25628.6 2.957 3.258 1.24469
30 30.0 30815.1 0.177 0.494 0.91877
40 40.0 30826.3 0.004 0.330 0.64222
50 50.0 30884.3 0.002 0.327 0.41506
60 60.0 30888.7 0.001 0.325 0.23728
70 70.0 30763.2 0.001 0.324 0.10889
80 80.0 30845.5 0.001 0.324 0.02988
90 90.0 26350.9 0.001 0.324 0.00025
```
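The `singularity shell docker://…` form downloads and converts the image on every fresh run. For repeated use, the image can instead be pulled once and kept as a local SIF file. A minimal sketch, assuming the same NGC image tag as above and the standard `singularity pull` command:

```shell
# Pull the NGC TensorFlow image once and store it as a local SIF file
# (adjust the tag to the release you actually need):
singularity pull tensorflow_19.02-py3.sif docker://nvcr.io/nvidia/tensorflow:19.02-py3

# Subsequent sessions can then start directly from the local file:
singularity shell tensorflow_19.02-py3.sif
```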
**GPU stat**
The GPU load can be monitored with the `gpustat` utility.
```console
Every 2,0s: gpustat --color
dgx Mon Mar 11 09:31:00 2019
[0] Tesla V100-SXM3-32GB | 47'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[1] Tesla V100-SXM3-32GB | 48'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[2] Tesla V100-SXM3-32GB | 56'C, 97 % | 23660 / 32480 MB | kru0052(23645M)
[3] Tesla V100-SXM3-32GB | 57'C, 97 % | 23660 / 32480 MB | kru0052(23645M)
[4] Tesla V100-SXM3-32GB | 46'C, 97 % | 23660 / 32480 MB | kru0052(23645M)
[5] Tesla V100-SXM3-32GB | 55'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[6] Tesla V100-SXM3-32GB | 45'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[7] Tesla V100-SXM3-32GB | 54'C, 97 % | 23660 / 32480 MB | kru0052(23645M)
[8] Tesla V100-SXM3-32GB | 45'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[9] Tesla V100-SXM3-32GB | 46'C, 95 % | 23660 / 32480 MB | kru0052(23645M)
[10] Tesla V100-SXM3-32GB | 55'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[11] Tesla V100-SXM3-32GB | 56'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[12] Tesla V100-SXM3-32GB | 47'C, 95 % | 23660 / 32480 MB | kru0052(23645M)
[13] Tesla V100-SXM3-32GB | 45'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[14] Tesla V100-SXM3-32GB | 55'C, 96 % | 23660 / 32480 MB | kru0052(23645M)
[15] Tesla V100-SXM3-32GB | 58'C, 95 % | 23660 / 32480 MB | kru0052(23645M)
```
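The continuously refreshing view shown above is produced by running `gpustat` under `watch`. A minimal sketch, assuming `gpustat` is available in your environment (it is a Python tool installable with `pip`, which may or may not already be present on the node):

```shell
# Install the gpustat utility into the user environment if it is missing
# (assumes pip is available; skip if gpustat is already installed):
pip install --user gpustat

# Refresh the per-GPU utilization view every 2 seconds:
watch -n 2 gpustat --color
```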
[1]: https://docs.it4i.cz/software/tools/singularity/
[2]: https://ngc.nvidia.com/