diff --git a/.spelling b/.spelling
index 60cb19a94c06fcdfae925c5cdf30f30045e0b284..2497fe226110e137f681159fec85335ddb560ef0 100644
--- a/.spelling
+++ b/.spelling
@@ -803,6 +803,8 @@ node-pre-gyp
 npm
  - node_modules/spawn-sync/README.md
 iojs
->>>>>>> readme
 UCX
 Dask-ssh
+SCRATCH
+HOME
+PROJECT
diff --git a/docs.it4i/dgx2/accessing.md b/docs.it4i/dgx2/accessing.md
index 88c0cce803d00c1371c60589e59e49a9a652a178..ce6ac76da9044fb6ea1d1aec09e52ddb21cde7c3 100644
--- a/docs.it4i/dgx2/accessing.md
+++ b/docs.it4i/dgx2/accessing.md
@@ -7,8 +7,29 @@
 
 ## How to Access
 
-The DGX-2 machine can be accessed through the scheduler from Salomon login nodes `salomon.it4i.cz`.
+The DGX-2 machine can be accessed through the scheduler from Barbora login nodes `barbora.it4i.cz` as the compute node cn202.
 
-The NVIDIA DGX-2 has its own instance of the scheduler, it can be accessed by loading the `DGX-2` module. See [Resource Allocation and Job Execution][1].
+## Storage
 
-[1]: job_execution.md
+There are three shared file systems on the DGX-2 system: HOME, SCRATCH (LSCRATCH), and PROJECT.
+
+### HOME
+
+The HOME filesystem is realized as an NFS filesystem. It is the shared home directory from the [Barbora cluster][1].
+
+### SCRATCH
+
+The SCRATCH filesystem is realized on NVMe storage and is mounted in the `/scratch` directory.
+Users may freely create subdirectories and files on the filesystem (`/scratch/user/$USER`).
+The accessible capacity is 22TB, shared among all users.
+
+!!! warning
+    Files on the SCRATCH filesystem that are not accessed for more than 60 days will be automatically deleted.
+
+### PROJECT
+
+The PROJECT data storage is IT4Innovations' central data storage accessible from all clusters.
+For more information on accessing PROJECT, its quotas, etc., see the [PROJECT Data Storage][2] section.
+
+[1]: ../../barbora/storage/#home-file-system
+[2]: ../../storage/project-storage
diff --git a/docs.it4i/dgx2/job_execution.md b/docs.it4i/dgx2/job_execution.md
index a0fe9fe1a62c7d2658e57f46a4326afaafcfdab8..3594682e1487ebfbe150d4ea2b7d74ffd471d322 100644
--- a/docs.it4i/dgx2/job_execution.md
+++ b/docs.it4i/dgx2/job_execution.md
@@ -2,23 +2,14 @@
 
 To run a job, computational resources of DGX-2 must be allocated.
 
-DGX-2 uses an independent PBS scheduler. To access the scheduler, load the DGX-2 module:
-
-```console
-$ml DGX-2
-```
-
 ## Resources Allocation Policy
 
 The resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue. The queue provides prioritized and exclusive access to computational resources.
 
-* **qdgx**, the queue for DGX-2 machine
+The queue for the DGX-2 machine is called **qdgx**.
 
 !!! note
-    Maximum walltime of a job is **48** hours.
-
-!!! note
-    The qdgx queue is configured to run one job and accept one job in a queue per user.
+    The qdgx queue is configured to run one job and accept one job in a queue per user, with a maximum job walltime of **48** hours.
 
 ## Job Submission and Execution
 
@@ -28,9 +19,9 @@ The `qsub` submits the job into the queue. The command creates a request to the
 
 When allocating computational resources for the job, specify:
 
-1. a queue for your job (the default is **qdgx**)
-1. the maximum wall time allocated to your calculation (default is **4 hour**, maximum is **48 hour**)
-1. a Jobscript or interactive switch
+1. a queue for your job (the default is **qdgx**);
+1. the maximum wall time allocated to your calculation (the default is **4 hours**, the maximum is **48 hours**);
+1. a jobscript or interactive switch.
 
 !!! info
     You can access the DGX PBS scheduler by loading the "DGX-2" module.
@@ -40,16 +31,14 @@ Submit the job using the `qsub` command:
 **Example**
 
 ```console
-[kru0052@login4.salomon ~]$ ml DGX-2
-PBS 18.1.3 for DGX-2 machine
-[kru0052@login4.salomon ~]$ qsub -q qdgx -l walltime=02:00:00 -I
-qsub: waiting for job 258.ldgx to start
-qsub: job 258.ldgx ready
-
-kru0052@dgx:~$ nvidia-smi
-Thu Mar 14 07:46:32 2019
+[kru0052@login2.barbora ~]$ qsub -q qdgx -l walltime=02:00:00 -I
+qsub: waiting for job 258.dgx to start
+qsub: job 258.dgx ready
+
+kru0052@cn202:~$ nvidia-smi
+Wed Jun 16 07:46:32 2021
 +-----------------------------------------------------------------------------+
-| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
+| NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
@@ -102,10 +91,7 @@ Thu Mar 14 07:46:32 2019
 |  15  Tesla V100-SXM3...  On   | 00000000:E7:00.0 Off |                    0 |
 | N/A   34C    P0    50W / 350W |      0MiB / 32480MiB |      0%      Default |
 +-------------------------------+----------------------+----------------------+
-kru0052@dgx:~$ exit
-[kru0052@login4.salomon ~]$ ml purge
-PBS 13.1.1 for cluster Salomon
-[kru0052@login4.salomon ~]$
+kru0052@cn202:~$ exit
 ```
 
 !!! tip
@@ -124,13 +110,11 @@ to download the container via singularity, see the example below:
 #### Example - Singularity Run Tensorflow
 
 ```console
-[kru0052@login4.salomon ~]$ ml DGX-2
-PBS 18.1.3 for DGX-2 machine
-$ qsub -q qdgx -l walltime=01:00:00 -I
-qsub: waiting for job 96.ldgx to start
-qsub: job 96.ldgx ready
+[kru0052@login2.barbora ~]$ qsub -q qdgx -l walltime=01:00:00 -I
+qsub: waiting for job 96.dgx to start
+qsub: job 96.dgx ready
 
-kru0052@dgx:~$ singularity shell docker://nvcr.io/nvidia/tensorflow:19.02-py3
+kru0052@cn202:~$ singularity shell docker://nvcr.io/nvidia/tensorflow:19.02-py3
 Singularity tensorflow_19.02-py3.sif:~>
 Singularity tensorflow_19.02-py3.sif:~> mpiexec --bind-to socket -np 16 python /opt/tensorflow/nvidia-examples/cnn/resnet.py --layers=18 --precision=fp16 --batch_size=512
 PY 3.5.2 (default, Nov 12 2018, 13:43:14)
@@ -164,10 +148,7 @@ PY 3.5.2 (default, Nov 12 2018, 13:43:14)
     70   70.0 30763.2  0.001  0.324 0.10889
     80   80.0 30845.5  0.001  0.324 0.02988
     90   90.0 26350.9  0.001  0.324 0.00025
-kru0052@dgx:~$ exit
-[kru0052@login4.salomon ~]$ ml purge
-PBS 13.1.1 for cluster Salomon
-[kru0052@login4.salomon ~]$
+kru0052@cn202:~$ exit
 ```
 
 **GPU stat**
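For illustration only, not part of the patch above: the updated job_execution.md says a job needs either a jobscript or an interactive switch, but only the interactive case is shown. A minimal batch-mode sketch, reusing the qdgx queue, the walltime syntax, the container image, and the ResNet command from the examples above, could look like the following; the jobscript name `dgx_job.sh` and its exact contents are assumptions, not documented behavior.

```console
$ cat dgx_job.sh
#!/bin/bash
# Illustrative jobscript (assumed name and contents): run the ResNet example
# from the NGC TensorFlow container non-interactively on the allocated DGX-2 node.
singularity exec docker://nvcr.io/nvidia/tensorflow:19.02-py3 \
    mpiexec --bind-to socket -np 16 \
    python /opt/tensorflow/nvidia-examples/cnn/resnet.py --layers=18 --precision=fp16 --batch_size=512

$ qsub -q qdgx -l walltime=01:00:00 dgx_job.sh
```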