diff --git a/docs.it4i/dgx2/accessing.md b/docs.it4i/dgx2/accessing.md
index 7099d25bbc1920e3e560bc2352a33e7f84c28fd1..49419f53e739070a02fdd8493a3a2b675434160f 100644
--- a/docs.it4i/dgx2/accessing.md
+++ b/docs.it4i/dgx2/accessing.md
@@ -7,32 +7,4 @@

 ## Shell Access

-The DGX-2 can be accessed by SSH protocol via login node ldgx at the address `ldgx.it4i.cz`. [VPN][1] connection is required in order to connect to ldgx.
-
-```console
- _ ___ _____ ____ ___ _ ____ ______ __ ____
-| \ | \ \ / /_ _| _ \_ _| / \ | _ \ / ___\ \/ / |___ \
-| \| |\ \ / / | || | | | | / _ \ | | | | | _ \ /_____ __) |
-| |\ | \ V / | || |_| | | / ___ \ | |_| | |_| |/ \_____/ __/
-|_| \_| \_/ |___|____/___/_/ \_\ |____/ \____/_/\_\ |_____|
-
- ...running on Ubuntu 18.04 (DGX-2)
-
-[kru0052@ldgx ~]$
-```
-
-### Authentication
-
-Authentication is available by private key only.
-
-!!! info
-    Should you need access to the DGX-2 machine, request it at support@it4i.cz.
-
-### Data Transfer
-
-Data in and out of the system may be transferred by the SCP protocol.
-
-!!! warning
-    /HOME directory on ldgx is not the same as /HOME directory on dgx. /SCRATCH storage is shared between login node and DGX-2 machine.
-
-[1]: ../../general/accessing-the-clusters/vpn-access/
\ No newline at end of file
+The DGX-2 machine can be accessed via the SSH protocol through the login nodes at the address `loginX.salomon.it4i.cz`.
diff --git a/docs.it4i/dgx2/job_execution.md b/docs.it4i/dgx2/job_execution.md
index 08142249c43a4d2777f9cc7db4db60d92bb004e6..a0cd4ee9c6ec96cfd07ab3f45292537c77db9fad 100644
--- a/docs.it4i/dgx2/job_execution.md
+++ b/docs.it4i/dgx2/job_execution.md
@@ -27,12 +27,17 @@ When allocating computational resources for the job, specify:
 !!! note
     Right now, the DGX-2 is divided into 16 computational nodes. Every node contains 6 CPUs (3 physical cores + 3 HT cores) and 1 GPU.

+!!! info
+    You can access the DGX PBS scheduler by loading the "DGX-2" module.
+
 Submit the job using the `qsub` command:

 **Example for 1 GPU**

 ```console
-[kru0052@ldgx ~]$ qsub -q qdgx -l select=1 -l walltime=04:00:00 -I
+[kru0052@login4.salomon ~]$ ml DGX-2
+PBS 18.1.3 for DGX-2 machine
+[kru0052@login4.salomon ~]$ qsub -q qdgx -l select=1 -l walltime=04:00:00 -I
 qsub: waiting for job 257.ldgx to start
 qsub: job 257.ldgx ready

@@ -47,12 +52,18 @@ Thu Mar 14 07:46:01 2019
 | 0 Tesla V100-SXM3... On | 00000000:57:00.0 Off | 0 |
 | N/A 29C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
 +-------------------------------+----------------------+----------------------+
+kru0052@dgx:~$ exit
+[kru0052@login4.salomon ~]$ ml purge
+PBS 13.1.1 for cluster Salomon
+[kru0052@login4.salomon ~]$
 ```

 **Example for 4 GPU**

 ```console
-[kru0052@ldgx ~]$ qsub -q qdgx -l select=4 -l walltime=04:00:00 -I
+[kru0052@login4.salomon ~]$ ml DGX-2
+PBS 18.1.3 for DGX-2 machine
+[kru0052@login4.salomon ~]$ qsub -q qdgx -l select=4 -l walltime=04:00:00 -I
 qsub: waiting for job 256.ldgx to start
 qsub: job 256.ldgx ready

@@ -76,12 +87,18 @@ Thu Mar 14 07:45:29 2019
 | 3 Tesla V100-SXM3... On | 00000000:5E:00.0 Off | 0 |
 | N/A 35C P0 53W / 350W | 0MiB / 32480MiB | 0% Default |
 +-------------------------------+----------------------+----------------------+
+kru0052@dgx:~$ exit
+[kru0052@login4.salomon ~]$ ml purge
+PBS 13.1.1 for cluster Salomon
+[kru0052@login4.salomon ~]$
 ```

 **Example for 16 GPU (all DGX-2)**

 ```console
-[kru0052@ldgx ~]$ qsub -q qdgx -l select=16 -l walltime=04:00:00 -I
+[kru0052@login4.salomon ~]$ ml DGX-2
+PBS 18.1.3 for DGX-2 machine
+[kru0052@login4.salomon ~]$ qsub -q qdgx -l select=16 -l walltime=04:00:00 -I
 qsub: waiting for job 258.ldgx to start
 qsub: job 258.ldgx ready

@@ -141,6 +158,10 @@ Thu Mar 14 07:46:32 2019
 | 15 Tesla V100-SXM3... On | 00000000:E7:00.0 Off | 0 |
 | N/A 34C P0 50W / 350W | 0MiB / 32480MiB | 0% Default |
 +-------------------------------+----------------------+----------------------+
+kru0052@dgx:~$ exit
+[kru0052@login4.salomon ~]$ ml purge
+PBS 13.1.1 for cluster Salomon
+[kru0052@login4.salomon ~]$
 ```

 !!! tip
@@ -156,6 +177,8 @@ The jobscript is a user made script controlling a sequence of commands for execu
 #### Example - Singularity Run Tensorflow

 ```console
+[kru0052@login4.salomon ~]$ ml DGX-2
+PBS 18.1.3 for DGX-2 machine
 $ qsub -q qdgx -l select=16 -l walltime=01:00:00 -I
 qsub: waiting for job 96.ldgx to start
 qsub: job 96.ldgx ready

@@ -194,6 +217,10 @@ PY 3.5.2 (default, Nov 12 2018, 13:43:14)
 70 70.0 30763.2 0.001 0.324 0.10889
 80 80.0 30845.5 0.001 0.324 0.02988
 90 90.0 26350.9 0.001 0.324 0.00025
+kru0052@dgx:~$ exit
+[kru0052@login4.salomon ~]$ ml purge
+PBS 13.1.1 for cluster Salomon
+[kru0052@login4.salomon ~]$
 ```

 **GPU stat**