Commit 37559e72 authored by Lukáš Krupčík

Merge branch 'dgx' into 'master'

Dgx

See merge request !249
parents 3fa40c83 6db16c37
...@@ -7,32 +7,4 @@ ...@@ -7,32 +7,4 @@
## Shell Access ## Shell Access
The DGX-2 can be accessed by SSH protocol via login node ldgx at the address `ldgx.it4i.cz`. [VPN][1] connection is required in order to connect to ldgx. The DGX-2 machine can be accessed by SSH protocol via login nodes at the address `loginX.salomon.it4i.cz`.
```console
_ ___ _____ ____ ___ _ ____ ______ __ ____
| \ | \ \ / /_ _| _ \_ _| / \ | _ \ / ___\ \/ / |___ \
| \| |\ \ / / | || | | | | / _ \ | | | | | _ \ /_____ __) |
| |\ | \ V / | || |_| | | / ___ \ | |_| | |_| |/ \_____/ __/
|_| \_| \_/ |___|____/___/_/ \_\ |____/ \____/_/\_\ |_____|
...running on Ubuntu 18.04 (DGX-2)
[kru0052@ldgx ~]$
```
### Authentication
Authentication is available by private key only.
!!! info
Should you need access to the DGX-2 machine, request it at support@it4i.cz.
### Data Transfer
Data in and out of the system may be transferred by the SCP protocol.
!!! warning
/HOME directory on ldgx is not the same as /HOME directory on dgx. /SCRATCH storage is shared between login node and DGX-2 machine.
[1]: ../../general/accessing-the-clusters/vpn-access/
\ No newline at end of file
...@@ -27,12 +27,17 @@ When allocating computational resources for the job, specify: ...@@ -27,12 +27,17 @@ When allocating computational resources for the job, specify:
!!! note !!! note
Right now, the DGX-2 is divided into 16 computational nodes. Every node contains 6 CPUs (3 physical cores + 3 HT cores) and 1 GPU. Right now, the DGX-2 is divided into 16 computational nodes. Every node contains 6 CPUs (3 physical cores + 3 HT cores) and 1 GPU.
!!! info
You can access the DGX PBS scheduler by loading the "DGX-2" module.
Submit the job using the `qsub` command: Submit the job using the `qsub` command:
**Example for 1 GPU** **Example for 1 GPU**
```console ```console
[kru0052@ldgx ~]$ qsub -q qdgx -l select=1 -l walltime=04:00:00 -I [kru0052@login4.salomon ~]$ ml DGX-2
PBS 18.1.3 for DGX-2 machine
[kru0052@login4.salomon ~]$ qsub -q qdgx -l select=1 -l walltime=04:00:00 -I
qsub: waiting for job 257.ldgx to start qsub: waiting for job 257.ldgx to start
qsub: job 257.ldgx ready qsub: job 257.ldgx ready
...@@ -47,12 +52,18 @@ Thu Mar 14 07:46:01 2019 ...@@ -47,12 +52,18 @@ Thu Mar 14 07:46:01 2019
| 0 Tesla V100-SXM3... On | 00000000:57:00.0 Off | 0 | | 0 Tesla V100-SXM3... On | 00000000:57:00.0 Off | 0 |
| N/A 29C P0 50W / 350W | 0MiB / 32480MiB | 0% Default | | N/A 29C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+ +-------------------------------+----------------------+----------------------+
kru0052@dgx:~$ exit
[kru0052@login4.salomon ~]$ ml purge
PBS 13.1.1 for cluster Salomon
[kru0052@login4.salomon ~]$
``` ```
**Example for 4 GPU** **Example for 4 GPU**
```console ```console
[kru0052@ldgx ~]$ qsub -q qdgx -l select=4 -l walltime=04:00:00 -I [kru0052@login4.salomon ~]$ ml DGX-2
PBS 18.1.3 for DGX-2 machine
[kru0052@login4.salomon ~]$ qsub -q qdgx -l select=4 -l walltime=04:00:00 -I
qsub: waiting for job 256.ldgx to start qsub: waiting for job 256.ldgx to start
qsub: job 256.ldgx ready qsub: job 256.ldgx ready
...@@ -76,12 +87,18 @@ Thu Mar 14 07:45:29 2019 ...@@ -76,12 +87,18 @@ Thu Mar 14 07:45:29 2019
| 3 Tesla V100-SXM3... On | 00000000:5E:00.0 Off | 0 | | 3 Tesla V100-SXM3... On | 00000000:5E:00.0 Off | 0 |
| N/A 35C P0 53W / 350W | 0MiB / 32480MiB | 0% Default | | N/A 35C P0 53W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+ +-------------------------------+----------------------+----------------------+
kru0052@dgx:~$ exit
[kru0052@login4.salomon ~]$ ml purge
PBS 13.1.1 for cluster Salomon
[kru0052@login4.salomon ~]$
``` ```
**Example for 16 GPU (all DGX-2)** **Example for 16 GPU (all DGX-2)**
```console ```console
[kru0052@ldgx ~]$ qsub -q qdgx -l select=16 -l walltime=04:00:00 -I [kru0052@login4.salomon ~]$ ml DGX-2
PBS 18.1.3 for DGX-2 machine
[kru0052@login4.salomon ~]$ qsub -q qdgx -l select=16 -l walltime=04:00:00 -I
qsub: waiting for job 258.ldgx to start qsub: waiting for job 258.ldgx to start
qsub: job 258.ldgx ready qsub: job 258.ldgx ready
...@@ -141,6 +158,10 @@ Thu Mar 14 07:46:32 2019 ...@@ -141,6 +158,10 @@ Thu Mar 14 07:46:32 2019
| 15 Tesla V100-SXM3... On | 00000000:E7:00.0 Off | 0 | | 15 Tesla V100-SXM3... On | 00000000:E7:00.0 Off | 0 |
| N/A 34C P0 50W / 350W | 0MiB / 32480MiB | 0% Default | | N/A 34C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+ +-------------------------------+----------------------+----------------------+
kru0052@dgx:~$ exit
[kru0052@login4.salomon ~]$ ml purge
PBS 13.1.1 for cluster Salomon
[kru0052@login4.salomon ~]$
``` ```
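The three interactive examples above differ only in the `select` value. The following is a minimal sketch, assuming a POSIX shell, of a helper that assembles the same `qsub` command line for a given number of GPUs. The function name `dgx_qsub_cmd` is hypothetical and not part of the documented setup, and the 1 to 16 range is taken from the note that the DGX-2 is divided into 16 nodes. It only prints the command, so it can be inspected before being run on a login node with the DGX-2 module loaded.

```shell
# Hypothetical helper: print the interactive qsub command used in the
# examples above for a requested number of DGX-2 nodes (1 GPU per node).
# Usage: dgx_qsub_cmd <gpus> [walltime]
dgx_qsub_cmd() {
    dgx_gpus="$1"
    dgx_walltime="${2:-04:00:00}"   # default matches the examples above

    # Reject empty or non-numeric input.
    case "$dgx_gpus" in
        ''|*[!0-9]*)
            echo "GPU count must be a positive integer" >&2
            return 1
            ;;
    esac

    # The DGX-2 is divided into 16 nodes, so select must be 1..16.
    if [ "$dgx_gpus" -lt 1 ] || [ "$dgx_gpus" -gt 16 ]; then
        echo "GPU count must be between 1 and 16" >&2
        return 1
    fi

    printf 'qsub -q qdgx -l select=%s -l walltime=%s -I\n' \
        "$dgx_gpus" "$dgx_walltime"
}
```

For example, `dgx_qsub_cmd 4` prints the command from the 4 GPU example, and `dgx_qsub_cmd 16 01:00:00` prints the one used in the Singularity example with a one-hour walltime.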
!!! tip !!! tip
...@@ -156,6 +177,8 @@ The jobscript is a user made script controlling a sequence of commands for execu ...@@ -156,6 +177,8 @@ The jobscript is a user made script controlling a sequence of commands for execu
#### Example - Singularity Run Tensorflow #### Example - Singularity Run Tensorflow
```console ```console
[kru0052@login4.salomon ~]$ ml DGX-2
PBS 18.1.3 for DGX-2 machine
$ qsub -q qdgx -l select=16 -l walltime=01:00:00 -I $ qsub -q qdgx -l select=16 -l walltime=01:00:00 -I
qsub: waiting for job 96.ldgx to start qsub: waiting for job 96.ldgx to start
qsub: job 96.ldgx ready qsub: job 96.ldgx ready
...@@ -194,6 +217,10 @@ PY 3.5.2 (default, Nov 12 2018, 13:43:14) ...@@ -194,6 +217,10 @@ PY 3.5.2 (default, Nov 12 2018, 13:43:14)
70 70.0 30763.2 0.001 0.324 0.10889 70 70.0 30763.2 0.001 0.324 0.10889
80 80.0 30845.5 0.001 0.324 0.02988 80 80.0 30845.5 0.001 0.324 0.02988
90 90.0 26350.9 0.001 0.324 0.00025 90 90.0 26350.9 0.001 0.324 0.00025
kru0052@dgx:~$ exit
[kru0052@login4.salomon ~]$ ml purge
PBS 13.1.1 for cluster Salomon
[kru0052@login4.salomon ~]$
``` ```
**GPU stat** **GPU stat**