Commit 37559e72 authored by Lukáš Krupčík

Merge branch 'dgx' into 'master'

Dgx

See merge request !249
parents 3fa40c83 6db16c37
@@ -7,32 +7,4 @@
## Shell Access
The DGX-2 can be accessed by the SSH protocol via the login node ldgx at the address `ldgx.it4i.cz`. A [VPN][1] connection is required in order to connect to ldgx.
```console
_ ___ _____ ____ ___ _ ____ ______ __ ____
| \ | \ \ / /_ _| _ \_ _| / \ | _ \ / ___\ \/ / |___ \
| \| |\ \ / / | || | | | | / _ \ | | | | | _ \ /_____ __) |
| |\ | \ V / | || |_| | | / ___ \ | |_| | |_| |/ \_____/ __/
|_| \_| \_/ |___|____/___/_/ \_\ |____/ \____/_/\_\ |_____|
...running on Ubuntu 18.04 (DGX-2)
[kru0052@ldgx ~]$
```
### Authentication
Authentication is available by private key only.
!!! info
Should you need access to the DGX-2 machine, request it at support@it4i.cz.
### Data Transfer
Data in and out of the system may be transferred using the SCP protocol.
!!! warning
The /HOME directory on ldgx is not the same as the /HOME directory on dgx. The /SCRATCH storage is shared between the login node and the DGX-2 machine.
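For illustration, copying a file to the shared /SCRATCH storage might look like the following minimal sketch (the username and target path are placeholders, not actual locations):
```console
$ scp input.dat kru0052@ldgx.it4i.cz:/scratch/input.dat
```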
[1]: ../../general/accessing-the-clusters/vpn-access/
\ No newline at end of file
The DGX-2 machine can be accessed by the SSH protocol via the Salomon login nodes at the address `loginX.salomon.it4i.cz`.
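For example, a connection might be initiated as follows (`login1` stands in for any of the loginX nodes and the username is illustrative):
```console
$ ssh kru0052@login1.salomon.it4i.cz
```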
@@ -27,12 +27,17 @@ When allocating computational resources for the job, specify:
!!! note
Right now, the DGX-2 is divided into 16 computational nodes. Every node contains 6 logical CPUs (3 physical cores + 3 HT cores) and 1 GPU.
!!! info
You can access the DGX-2 PBS scheduler by loading the "DGX-2" module.
Submit the job using the `qsub` command:
**Example for 1 GPU**
```console
[kru0052@ldgx ~]$ qsub -q qdgx -l select=1 -l walltime=04:00:00 -I
[kru0052@login4.salomon ~]$ ml DGX-2
PBS 18.1.3 for DGX-2 machine
[kru0052@login4.salomon ~]$ qsub -q qdgx -l select=1 -l walltime=04:00:00 -I
qsub: waiting for job 257.ldgx to start
qsub: job 257.ldgx ready
@@ -47,12 +52,18 @@ Thu Mar 14 07:46:01 2019
| 0 Tesla V100-SXM3... On | 00000000:57:00.0 Off | 0 |
| N/A 29C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
kru0052@dgx:~$ exit
[kru0052@login4.salomon ~]$ ml purge
PBS 13.1.1 for cluster Salomon
[kru0052@login4.salomon ~]$
```
**Example for 4 GPUs**
```console
[kru0052@ldgx ~]$ qsub -q qdgx -l select=4 -l walltime=04:00:00 -I
[kru0052@login4.salomon ~]$ ml DGX-2
PBS 18.1.3 for DGX-2 machine
[kru0052@login4.salomon ~]$ qsub -q qdgx -l select=4 -l walltime=04:00:00 -I
qsub: waiting for job 256.ldgx to start
qsub: job 256.ldgx ready
@@ -76,12 +87,18 @@ Thu Mar 14 07:45:29 2019
| 3 Tesla V100-SXM3... On | 00000000:5E:00.0 Off | 0 |
| N/A 35C P0 53W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
kru0052@dgx:~$ exit
[kru0052@login4.salomon ~]$ ml purge
PBS 13.1.1 for cluster Salomon
[kru0052@login4.salomon ~]$
```
**Example for 16 GPUs (the whole DGX-2)**
```console
[kru0052@ldgx ~]$ qsub -q qdgx -l select=16 -l walltime=04:00:00 -I
[kru0052@login4.salomon ~]$ ml DGX-2
PBS 18.1.3 for DGX-2 machine
[kru0052@login4.salomon ~]$ qsub -q qdgx -l select=16 -l walltime=04:00:00 -I
qsub: waiting for job 258.ldgx to start
qsub: job 258.ldgx ready
@@ -141,6 +158,10 @@ Thu Mar 14 07:46:32 2019
| 15 Tesla V100-SXM3... On | 00000000:E7:00.0 Off | 0 |
| N/A 34C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
kru0052@dgx:~$ exit
[kru0052@login4.salomon ~]$ ml purge
PBS 13.1.1 for cluster Salomon
[kru0052@login4.salomon ~]$
```
!!! tip
@@ -156,6 +177,8 @@ The jobscript is a user made script controlling a sequence of commands for execu
#### Example - Singularity Run Tensorflow
```console
[kru0052@login4.salomon ~]$ ml DGX-2
PBS 18.1.3 for DGX-2 machine
$ qsub -q qdgx -l select=16 -l walltime=01:00:00 -I
qsub: waiting for job 96.ldgx to start
qsub: job 96.ldgx ready
@@ -194,6 +217,10 @@ PY 3.5.2 (default, Nov 12 2018, 13:43:14)
70 70.0 30763.2 0.001 0.324 0.10889
80 80.0 30845.5 0.001 0.324 0.02988
90 90.0 26350.9 0.001 0.324 0.00025
kru0052@dgx:~$ exit
[kru0052@login4.salomon ~]$ ml purge
PBS 13.1.1 for cluster Salomon
[kru0052@login4.salomon ~]$
```
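As a batch-mode alternative to the interactive session above, a minimal jobscript sketch might look like the following (the script name, resource selection, and commands are illustrative; the `qdgx` queue and the `DGX-2` module are taken from the examples above):
```console
[kru0052@login4.salomon ~]$ cat dgx_job.sh
#!/bin/bash
# Commands executed on the allocated DGX-2 node; nvidia-smi just lists the assigned GPUs.
nvidia-smi
[kru0052@login4.salomon ~]$ ml DGX-2
[kru0052@login4.salomon ~]$ qsub -q qdgx -l select=1 -l walltime=01:00:00 ./dgx_job.sh
```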
**GPU stat**