Commit 41384d54 authored by Jan Siwiec

Merge branch 'dgx' into 'master'

Dgx

See merge request !322
parents 77fc036a 8c03e5b4
@@ -803,6 +803,8 @@ node-pre-gyp
npm
- node_modules/spawn-sync/README.md
iojs
>>>>>>> readme
UCX
Dask-ssh
SCRATCH
HOME
PROJECT
@@ -7,8 +7,29 @@
## How to Access
The DGX-2 machine can be accessed through the scheduler from Salomon login nodes `salomon.it4i.cz`.
The DGX-2 machine can be accessed through the scheduler from Barbora login nodes `barbora.it4i.cz` as a compute node cn202.
The NVIDIA DGX-2 has its own instance of the scheduler; it can be accessed by loading the `DGX-2` module. See [Resource Allocation and Job Execution][1].
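As a rough sketch of the access path via Barbora described above (the username, login node, and walltime below are placeholders, not prescriptive values):

```console
$ ssh user@barbora.it4i.cz                                    # log in to a Barbora login node
[user@login2.barbora ~]$ qsub -q qdgx -l walltime=02:00:00 -I # request the DGX-2 (cn202) interactively
```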
## Storage
[1]: job_execution.md
There are three shared file systems on the DGX-2 system: HOME, SCRATCH (LSCRATCH), and PROJECT.
### HOME
The HOME filesystem is realized as an NFS filesystem. It is the shared home directory from the [Barbora cluster][1].
### SCRATCH
The SCRATCH filesystem is realized on NVMe storage and is mounted in the `/scratch` directory.
Users may freely create subdirectories and files on the filesystem (under `/scratch/user/$USER`).
The accessible capacity is 22 TB, shared among all users.
!!! warning
    Files on the SCRATCH filesystem that are not accessed for more than 60 days will be automatically deleted.
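For instance, a job working directory on SCRATCH could be prepared as follows (the `my_job` directory name is only an example):

```console
$ mkdir -p /scratch/user/$USER/my_job   # user subdirectory on the shared NVMe scratch
$ cd /scratch/user/$USER/my_job
```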
### PROJECT
The PROJECT data storage is IT4Innovations' central data storage accessible from all clusters.
For more information on accessing PROJECT, its quotas, etc., see the [PROJECT Data Storage][2] section.
[1]: ../../barbora/storage/#home-file-system
[2]: ../../storage/project-storage
@@ -2,23 +2,14 @@
To run a job, computational resources of the DGX-2 must be allocated.
DGX-2 uses an independent PBS scheduler. To access the scheduler, load the DGX-2 module:
```console
$ ml DGX-2
```
## Resources Allocation Policy
The resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue. The queue provides prioritized and exclusive access to computational resources.
* **qdgx**, the queue for DGX-2 machine
The queue for the DGX-2 machine is called **qdgx**.
!!! note
    Maximum walltime of a job is **48** hours.

!!! note
    The qdgx queue is configured to run one job and accept one job in a queue per user.
The qdgx queue is configured to run one job and accept one job in the queue per user, with a maximum job walltime of **48** hours.
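Assuming the PBS client serving the qdgx queue is available in your session (on the Barbora login nodes, or after loading the `DGX-2` module where that applies), the queue state and limits can be inspected with standard PBS commands, for example:

```console
$ qstat -Q qdgx    # summary of the qdgx queue (queued/running jobs, enabled/started flags)
```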
## Job Submission and Execution
@@ -28,9 +19,9 @@ The `qsub` submits the job into the queue. The command creates a request to the
When allocating computational resources for the job, specify:
1. a queue for your job (the default is **qdgx**)
1. the maximum wall time allocated to your calculation (default is **4 hour**, maximum is **48 hour**)
1. a Jobscript or interactive switch
1. a queue for your job (the default is **qdgx**);
1. the maximum walltime allocated to your calculation (the default is **4 hours**, the maximum is **48 hours**);
1. a jobscript or interactive switch.
!!! info
    You can access the DGX PBS scheduler by loading the "DGX-2" module.
@@ -40,16 +31,14 @@ Submit the job using the `qsub` command:
**Example**
```console
[kru0052@login4.salomon ~]$ ml DGX-2
PBS 18.1.3 for DGX-2 machine
[kru0052@login4.salomon ~]$ qsub -q qdgx -l walltime=02:00:00 -I
qsub: waiting for job 258.ldgx to start
qsub: job 258.ldgx ready
kru0052@dgx:~$ nvidia-smi
Thu Mar 14 07:46:32 2019
[kru0052@login2.barbora ~]$ qsub -q qdgx -l walltime=02:00:00 -I
qsub: waiting for job 258.dgx to start
qsub: job 258.dgx ready
kru0052@cn202:~$ nvidia-smi
Wed Jun 16 07:46:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
| NVIDIA-SMI 465.19.01 Driver Version: 465.19.01 CUDA Version: 11.3 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
@@ -102,10 +91,7 @@ Thu Mar 14 07:46:32 2019
| 15 Tesla V100-SXM3... On | 00000000:E7:00.0 Off | 0 |
| N/A 34C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
kru0052@dgx:~$ exit
[kru0052@login4.salomon ~]$ ml purge
PBS 13.1.1 for cluster Salomon
[kru0052@login4.salomon ~]$
kru0052@cn202:~$ exit
```
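The example above is interactive; a non-interactive run can be submitted with a jobscript instead of the `-I` switch. A minimal sketch (the `job.sh` script and the returned job ID are hypothetical):

```console
$ cat job.sh
#!/bin/bash
# runs on the allocated DGX-2 node (cn202)
nvidia-smi
$ qsub -q qdgx -l walltime=01:00:00 ./job.sh
259.dgx
```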
!!! tip
@@ -124,13 +110,11 @@ to download the container via singularity, see the example below:
#### Example - Running TensorFlow with Singularity
```console
[kru0052@login4.salomon ~]$ ml DGX-2
PBS 18.1.3 for DGX-2 machine
$ qsub -q qdgx -l walltime=01:00:00 -I
qsub: waiting for job 96.ldgx to start
qsub: job 96.ldgx ready
[kru0052@login2.barbora ~]$ qsub -q qdgx -l walltime=01:00:00 -I
qsub: waiting for job 96.dgx to start
qsub: job 96.dgx ready
kru0052@dgx:~$ singularity shell docker://nvcr.io/nvidia/tensorflow:19.02-py3
kru0052@cn202:~$ singularity shell docker://nvcr.io/nvidia/tensorflow:19.02-py3
Singularity tensorflow_19.02-py3.sif:~>
Singularity tensorflow_19.02-py3.sif:~> mpiexec --bind-to socket -np 16 python /opt/tensorflow/nvidia-examples/cnn/resnet.py --layers=18 --precision=fp16 --batch_size=512
PY 3.5.2 (default, Nov 12 2018, 13:43:14)
@@ -164,10 +148,7 @@ PY 3.5.2 (default, Nov 12 2018, 13:43:14)
70 70.0 30763.2 0.001 0.324 0.10889
80 80.0 30845.5 0.001 0.324 0.02988
90 90.0 26350.9 0.001 0.324 0.00025
kru0052@dgx:~$ exit
[kru0052@login4.salomon ~]$ ml purge
PBS 13.1.1 for cluster Salomon
[kru0052@login4.salomon ~]$
kru0052@cn202:~$ exit
```
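For a non-interactive run, the same container can also be executed directly with `singularity exec` instead of an interactive shell (a sketch only, reusing the image and script from the example above):

```console
kru0052@cn202:~$ singularity exec docker://nvcr.io/nvidia/tensorflow:19.02-py3 \
    python /opt/tensorflow/nvidia-examples/cnn/resnet.py --layers=18 --precision=fp16 --batch_size=512
```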
**GPU stat**