Commit 41384d54 authored by Jan Siwiec

Merge branch 'dgx' into 'master'

Dgx

See merge request !322
parents 77fc036a 8c03e5b4
Pipeline #19814 passed
@@ -803,6 +803,8 @@ node-pre-gyp
npm
- node_modules/spawn-sync/README.md
iojs
UCX
Dask-ssh
SCRATCH
HOME
PROJECT
@@ -7,8 +7,29 @@
## How to Access

The DGX-2 machine can be accessed through the scheduler from the Barbora login nodes `barbora.it4i.cz` as the compute node `cn202`.

## Storage

There are three shared file systems on the DGX-2 system: HOME, SCRATCH (LSCRATCH), and PROJECT.

### HOME
The HOME filesystem is an NFS filesystem. It is the shared home directory from the [Barbora cluster][1].
### SCRATCH
The SCRATCH filesystem is realized on NVMe storage and is mounted in the `/scratch` directory.
Users may freely create subdirectories and files on the filesystem (`/scratch/user/$USER`).
The accessible capacity is 22 TB, shared among all users.
!!! warning
    Files on the SCRATCH filesystem that are not accessed for more than 60 days will be automatically deleted.
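
A typical workflow is to stage job data in a private subdirectory under `/scratch/user/$USER`; a minimal sketch (the `my_job` directory name is only an illustration):

```console
$ mkdir -p /scratch/user/$USER/my_job   # create a private working directory on SCRATCH
$ cd /scratch/user/$USER/my_job
$ df -h /scratch                        # show the capacity and current usage of the shared 22 TB space
```
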
### PROJECT
The PROJECT data storage is IT4Innovations' central data storage accessible from all clusters.
For more information on accessing PROJECT, its quotas, etc., see the [PROJECT Data Storage][2] section.
[1]: ../../barbora/storage/#home-file-system
[2]: ../../storage/project-storage
@@ -2,23 +2,14 @@
To run a job, computational resources of the DGX-2 must be allocated.

## Resources Allocation Policy

The resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue. The queue provides prioritized and exclusive access to computational resources.
The queue for the DGX-2 machine is called **qdgx**.

!!! note
    The qdgx queue is configured to run one job and accept one job in the queue per user, with a maximum job walltime of **48** hours.
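
The state and limits of the queue can be checked with standard PBS commands before submitting; a minimal sketch, assuming the qdgx queue is visible from the login node you submit from:

```console
$ qstat -q qdgx    # queue limits and the number of queued/running jobs
$ qstat -u $USER   # your own jobs and their current states
```
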
## Job Submission and Execution

@@ -28,9 +19,9 @@ The `qsub` submits the job into the queue. The command creates a request to the
When allocating computational resources for the job, specify:

1. a queue for your job (the default is **qdgx**);
1. the maximum wall time allocated to your calculation (the default is **4 hours**, the maximum is **48 hours**);
1. a jobscript or interactive switch.

!!! info
    You can access the DGX PBS scheduler by loading the "DGX-2" module.
@@ -40,16 +31,14 @@ Submit the job using the `qsub` command:
**Example**

```console
[kru0052@login2.barbora ~]$ qsub -q qdgx -l walltime=02:00:00 -I
qsub: waiting for job 258.dgx to start
qsub: job 258.dgx ready

kru0052@cn202:~$ nvidia-smi
Wed Jun 16 07:46:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
@@ -102,10 +91,7 @@ Thu Mar 14 07:46:32 2019
|  15  Tesla V100-SXM3...  On   | 00000000:E7:00.0 Off |                    0 |
| N/A   34C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

kru0052@cn202:~$ exit
```
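
Resources can also be requested non-interactively by passing a jobscript to `qsub`; a minimal sketch, where the script name and its contents are only an illustration:

```console
$ cat job.sh
#!/bin/bash
# everything below runs on the DGX-2 node once the qdgx allocation starts
nvidia-smi
$ qsub -q qdgx -l walltime=01:00:00 ./job.sh   # prints the job ID, e.g. <id>.dgx
```
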
!!! tip !!! tip
@@ -124,13 +110,11 @@ to download the container via singularity, see the example below:
#### Example - Singularity Run Tensorflow

```console
[kru0052@login2.barbora ~]$ qsub -q qdgx -l walltime=01:00:00 -I
qsub: waiting for job 96.dgx to start
qsub: job 96.dgx ready

kru0052@cn202:~$ singularity shell docker://nvcr.io/nvidia/tensorflow:19.02-py3
Singularity tensorflow_19.02-py3.sif:~>
Singularity tensorflow_19.02-py3.sif:~> mpiexec --bind-to socket -np 16 python /opt/tensorflow/nvidia-examples/cnn/resnet.py --layers=18 --precision=fp16 --batch_size=512
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
@@ -164,10 +148,7 @@ PY 3.5.2 (default, Nov 12 2018, 13:43:14)
70 70.0 30763.2 0.001 0.324 0.10889
80 80.0 30845.5 0.001 0.324 0.02988
90 90.0 26350.9 0.001 0.324 0.00025
kru0052@cn202:~$ exit
```
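
For non-interactive use, e.g. from a jobscript, the same container can be run with `singularity exec`; a minimal sketch reusing the image from the example above (the Python one-liner is only an illustration):

```console
$ singularity exec docker://nvcr.io/nvidia/tensorflow:19.02-py3 python -c "import tensorflow as tf; print(tf.__version__)"
```
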
**GPU stat**