diff --git a/.spelling b/.spelling
index da257cf0cc12d3fe1558c35b598a0b09757d3440..99b741e0fec52ff3dca90b00e72ceaf692990e8c 100644
--- a/.spelling
+++ b/.spelling
@@ -1,3 +1,6 @@
+nvidia
+smi
+nvidia-smi
 NICE
 DGX-2
 DGX
diff --git a/docs.it4i/dgx2/accessing.md b/docs.it4i/dgx2/accessing.md
new file mode 100644
index 0000000000000000000000000000000000000000..8ed5ef9196582a67072bb02a04467576637a82e4
--- /dev/null
+++ b/docs.it4i/dgx2/accessing.md
@@ -0,0 +1,29 @@
+# Shell Access
+
+The DGX-2 can be accessed via the SSH protocol through the login node ldgx at the address `ldgx.it4i.cz`.
+
+```console
+ _   ___     _____ ____ ___    _      ____   ______  __    ____  
+| \ | \ \   / /_ _|  _ \_ _|  / \    |  _ \ / ___\ \/ /   |___ \ 
+|  \| |\ \ / / | || | | | |  / _ \   | | | | |  _ \  /_____ __) |
+| |\  | \ V /  | || |_| | | / ___ \  | |_| | |_| |/  \_____/ __/ 
+|_| \_|  \_/  |___|____/___/_/   \_\ |____/ \____/_/\_\   |_____|
+
+                  ...running on Ubuntu 18.04 (DGX-2)
+
+[kru0052@ldgx ~]$
+```
+
+## Authentication
+
+Authentication is available via private key only.
+
+!!! info
+    Should you need access to the DGX-2 machine, request it at support@it4i.cz.
+
+## Data Transfer
+
+Data may be transferred in and out of the system via the SCP protocol.
+
+!!! warning
+    The /HOME directory on ldgx is not the same as the /HOME directory on dgx. The /SCRATCH storage is shared between the login node and the DGX-2 machine.
diff --git a/docs.it4i/dgx2/job_execution.md b/docs.it4i/dgx2/job_execution.md
new file mode 100644
index 0000000000000000000000000000000000000000..08142249c43a4d2777f9cc7db4db60d92bb004e6
--- /dev/null
+++ b/docs.it4i/dgx2/job_execution.md
@@ -0,0 +1,223 @@
+# Resource Allocation and Job Execution
+
+To run a job, computational resources for the job must first be allocated.
+
+## Resources Allocation Policy
+
+The resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue. The queue provides prioritized and exclusive access to computational resources.
+
+* **qdgx**, the queue for the DGX-2 machine
+
+!!! note
+    The maximum job walltime is **4 hours**. There may be at most **5** jobs in the queue and only **one** running job per user.
+
+## Job Submission and Execution
+
+The `qsub` command submits the job into the queue. It creates a request to the PBS job manager for the allocation of the specified resources. The resources will be allocated when available, subject to the allocation policies and constraints. After the resources are allocated, the jobscript or interactive shell is executed on the allocated node.
+
+### Job Submission
+
+When allocating computational resources for the job, specify:
+
+1. a queue for your job (the default is **qdgx**)
+1. the number of computational nodes required (the maximum is **16**; there is only one DGX-2 machine so far)
+1. the maximum wall time allocated to your calculation (the default is **2 hours**, the maximum is **4 hours**)
+1. a jobscript or an interactive switch
+
+!!! note
+    Right now, the DGX-2 is divided into 16 computational nodes. Every node contains 6 CPUs (3 physical cores plus 3 HT cores) and 1 GPU.
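+
+Inside a running job, the `nproc` command reports the CPUs available to the process, which should reflect the 6-CPU share of an allocated node; a short illustrative check (a sketch, not captured output):
+
+```console
+kru0052@dgx:~$ nproc
+6
+```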
+
+Submit the job using the `qsub` command:
+
+**Example for 1 GPU**
+
+```console
+[kru0052@ldgx ~]$ qsub -q qdgx -l select=1 -l walltime=04:00:00 -I
+qsub: waiting for job 257.ldgx to start
+qsub: job 257.ldgx ready
+
+kru0052@dgx:~$ nvidia-smi
+Thu Mar 14 07:46:01 2019
++-----------------------------------------------------------------------------+
+| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
+|-------------------------------+----------------------+----------------------+
+| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
+|===============================+======================+======================|
+|   0  Tesla V100-SXM3...  On   | 00000000:57:00.0 Off |                    0 |
+| N/A   29C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+```
+
+**Example for 4 GPUs**
+
+```console
+[kru0052@ldgx ~]$ qsub -q qdgx -l select=4 -l walltime=04:00:00 -I
+qsub: waiting for job 256.ldgx to start
+qsub: job 256.ldgx ready
+
+kru0052@dgx:~$ nvidia-smi
+Thu Mar 14 07:45:29 2019
++-----------------------------------------------------------------------------+
+| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
+|-------------------------------+----------------------+----------------------+
+| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
+|===============================+======================+======================|
+|   0  Tesla V100-SXM3...  On   | 00000000:57:00.0 Off |                    0 |
+| N/A   29C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|   1  Tesla V100-SXM3...  On   | 00000000:59:00.0 Off |                    0 |
+| N/A   35C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|   2  Tesla V100-SXM3...  On   | 00000000:5C:00.0 Off |                    0 |
+| N/A   30C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|   3  Tesla V100-SXM3...  On   | 00000000:5E:00.0 Off |                    0 |
+| N/A   35C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+```
+
+**Example for 16 GPUs (the whole DGX-2)**
+
+```console
+[kru0052@ldgx ~]$ qsub -q qdgx -l select=16 -l walltime=04:00:00 -I
+qsub: waiting for job 258.ldgx to start
+qsub: job 258.ldgx ready
+
+kru0052@dgx:~$ nvidia-smi
+Thu Mar 14 07:46:32 2019
++-----------------------------------------------------------------------------+
+| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
+|-------------------------------+----------------------+----------------------+
+| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
+|===============================+======================+======================|
+|   0  Tesla V100-SXM3...  On   | 00000000:34:00.0 Off |                    0 |
+| N/A   32C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|   1  Tesla V100-SXM3...  On   | 00000000:36:00.0 Off |                    0 |
+| N/A   31C    P0    48W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|   2  Tesla V100-SXM3...  On   | 00000000:39:00.0 Off |                    0 |
+| N/A   35C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|   3  Tesla V100-SXM3...  On   | 00000000:3B:00.0 Off |                    0 |
+| N/A   36C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|   4  Tesla V100-SXM3...  On   | 00000000:57:00.0 Off |                    0 |
+| N/A   29C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|   5  Tesla V100-SXM3...  On   | 00000000:59:00.0 Off |                    0 |
+| N/A   35C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|   6  Tesla V100-SXM3...  On   | 00000000:5C:00.0 Off |                    0 |
+| N/A   30C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|   7  Tesla V100-SXM3...  On   | 00000000:5E:00.0 Off |                    0 |
+| N/A   35C    P0    53W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|   8  Tesla V100-SXM3...  On   | 00000000:B7:00.0 Off |                    0 |
+| N/A   30C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|   9  Tesla V100-SXM3...  On   | 00000000:B9:00.0 Off |                    0 |
+| N/A   30C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|  10  Tesla V100-SXM3...  On   | 00000000:BC:00.0 Off |                    0 |
+| N/A   35C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|  11  Tesla V100-SXM3...  On   | 00000000:BE:00.0 Off |                    0 |
+| N/A   35C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|  12  Tesla V100-SXM3...  On   | 00000000:E0:00.0 Off |                    0 |
+| N/A   31C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|  13  Tesla V100-SXM3...  On   | 00000000:E2:00.0 Off |                    0 |
+| N/A   29C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|  14  Tesla V100-SXM3...  On   | 00000000:E5:00.0 Off |                    0 |
+| N/A   34C    P0    51W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|  15  Tesla V100-SXM3...  On   | 00000000:E7:00.0 Off |                    0 |
+| N/A   34C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+```
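+
+Batch (non-interactive) jobs are submitted the same way, with a jobscript in place of the interactive switch; a minimal sketch (the jobscript name `job.sh` is illustrative):
+
+```console
+[kru0052@ldgx ~]$ qsub -q qdgx -l select=1 -l walltime=02:00:00 job.sh
+```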
+
+!!! tip
+    Submit the interactive job using the `qsub -I ...` command.
+
+!!! info
+    You can determine the allocated GPUs from the **CUDA_ALLOCATED_DEVICES** environment variable. The **CUDA_VISIBLE_DEVICES** variable is always numbered from **0**!
+
+### Job Execution
+
+The jobscript is a user-made script controlling a sequence of commands for executing the calculation. It is often written in bash, though other scripting languages may be used as well. The jobscript is supplied to the PBS `qsub` command as an argument and is executed by the PBS Professional workload manager.
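+
+A minimal jobscript sketch (the working directory on the shared /SCRATCH storage and the application name are illustrative):
+
+```bash
+#!/bin/bash
+#PBS -q qdgx
+#PBS -l select=1
+#PBS -l walltime=02:00:00
+
+# the jobscript runs on the allocated DGX-2 node;
+# list the allocated GPU(s), then run the application
+nvidia-smi
+cd /scratch/myjob   # illustrative path on the shared /SCRATCH storage
+./mygpuapp          # illustrative application binary
+```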
+
+#### Example - Running TensorFlow in Singularity
+
+```console
+$ qsub -q qdgx -l select=16 -l walltime=01:00:00 -I
+qsub: waiting for job 96.ldgx to start
+qsub: job 96.ldgx ready
+
+kru0052@dgx:~$ singularity shell docker://nvcr.io/nvidia/tensorflow:19.02-py3
+Singularity tensorflow_19.02-py3.sif:~>
+Singularity tensorflow_19.02-py3.sif:~> mpiexec --bind-to socket -np 16 python /opt/tensorflow/nvidia-examples/cnn/resnet.py --layers=18 --precision=fp16 --batch_size=512
+PY 3.5.2 (default, Nov 12 2018, 13:43:14)
+[GCC 5.4.0 20160609]
+TF 1.13.0-rc0
+PY 3.5.2 (default, Nov 12 2018, 13:43:14)
+[GCC 5.4.0 20160609]
+TF 1.13.0-rc0
+PY 3.5.2 (default, Nov 12 2018, 13:43:14)
+[GCC 5.4.0 20160609]
+TF 1.13.0-rc0
+PY 3.5.2 (default, Nov 12 2018, 13:43:14)
+[GCC 5.4.0 20160609]
+TF 1.13.0-rc0
+PY 3.5.2 (default, Nov 12 2018, 13:43:14)
+[GCC 5.4.0 20160609]
+TF 1.13.0-rc0
+PY 3.5.2 (default, Nov 12 2018, 13:43:14)
+[GCC 5.4.0 20160609]
+TF 1.13.0-rc0
+...
+...
+...
+2019-03-11 08:30:12.263822: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
+     1    1.0   338.2  6.999  7.291 2.00000
+    10   10.0  3658.6  5.658  5.950 1.62000
+    20   20.0 25628.6  2.957  3.258 1.24469
+    30   30.0 30815.1  0.177  0.494 0.91877
+    40   40.0 30826.3  0.004  0.330 0.64222
+    50   50.0 30884.3  0.002  0.327 0.41506
+    60   60.0 30888.7  0.001  0.325 0.23728
+    70   70.0 30763.2  0.001  0.324 0.10889
+    80   80.0 30845.5  0.001  0.324 0.02988
+    90   90.0 26350.9  0.001  0.324 0.00025
+```
+
+**GPU stat**
+
+The GPU load can be monitored with the `gpustat` utility.
+
+```console
+Every 2,0s: gpustat --color
+
+dgx  Mon Mar 11 09:31:00 2019
+[0] Tesla V100-SXM3-32GB | 47'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
+[1] Tesla V100-SXM3-32GB | 48'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
+[2] Tesla V100-SXM3-32GB | 56'C,  97 % | 23660 / 32480 MB | kru0052(23645M)
+[3] Tesla V100-SXM3-32GB | 57'C,  97 % | 23660 / 32480 MB | kru0052(23645M)
+[4] Tesla V100-SXM3-32GB | 46'C,  97 % | 23660 / 32480 MB | kru0052(23645M)
+[5] Tesla V100-SXM3-32GB | 55'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
+[6] Tesla V100-SXM3-32GB | 45'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
+[7] Tesla V100-SXM3-32GB | 54'C,  97 % | 23660 / 32480 MB | kru0052(23645M)
+[8] Tesla V100-SXM3-32GB | 45'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
+[9] Tesla V100-SXM3-32GB | 46'C,  95 % | 23660 / 32480 MB | kru0052(23645M)
+[10] Tesla V100-SXM3-32GB | 55'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
+[11] Tesla V100-SXM3-32GB | 56'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
+[12] Tesla V100-SXM3-32GB | 47'C,  95 % | 23660 / 32480 MB | kru0052(23645M)
+[13] Tesla V100-SXM3-32GB | 45'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
+[14] Tesla V100-SXM3-32GB | 55'C,  96 % | 23660 / 32480 MB | kru0052(23645M)
+[15] Tesla V100-SXM3-32GB | 58'C,  95 % | 23660 / 32480 MB | kru0052(23645M)
+```
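+
+The `Every 2,0s: gpustat --color` header in the listing above comes from the `watch` utility; a typical invocation (the 2-second refresh interval is arbitrary) might be:
+
+```console
+kru0052@dgx:~$ watch --color -n 2 gpustat --color
+```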
diff --git a/docs.it4i/img/dgx-htop.png b/docs.it4i/img/dgx-htop.png
new file mode 100644
index 0000000000000000000000000000000000000000..61ea3c22cf6203f921b8e4d3731cf23f94b15a58
Binary files /dev/null and b/docs.it4i/img/dgx-htop.png differ
diff --git a/mkdocs.yml b/mkdocs.yml
index 399cbe036ac15a0eb746f6ea370221b4c1bf4e82..ef59cd30a4cbcc227c269f193c1a99374ebfc3e2 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -80,6 +80,8 @@ nav:
     - Visualization Servers: salomon/visualization.md
     - NVIDIA DGX-2:
       - Introduction: dgx2/introduction.md
+      - Accessing the DGX-2: dgx2/accessing.md
+      - Resource Allocation and Job Execution: dgx2/job_execution.md
   - Software:
     - Environment and Modules: environment-and-modules.md
     - Modules: