From 60e170821e73b6a88f49f40b8cbec512bb6941e0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Roman=20Sl=C3=ADva?= <roman.sliva@vsb.cz>
Date: Wed, 13 Sep 2023 09:24:46 +0200
Subject: [PATCH] docs.it4i/dgx2 - slurm

---
 docs.it4i/dgx2/accessing.md     |  4 ++-
 docs.it4i/dgx2/job_execution.md | 43 ++++++++++++---------------------
 2 files changed, 18 insertions(+), 29 deletions(-)

diff --git a/docs.it4i/dgx2/accessing.md b/docs.it4i/dgx2/accessing.md
index 6885e4cf1..ad4f6969c 100644
--- a/docs.it4i/dgx2/accessing.md
+++ b/docs.it4i/dgx2/accessing.md
@@ -7,7 +7,8 @@

 ## How to Access

-The DGX-2 machine can be accessed through the scheduler from Barbora login nodes `barbora.it4i.cz` as a compute node cn202.
+The DGX-2 machine is integrated into the [Barbora cluster][3].
+The DGX-2 machine can be accessed from the Barbora login nodes `barbora.it4i.cz` through the Barbora scheduler queue qdgx as compute node cn202.

 ## Storage

@@ -32,3 +33,4 @@ For more information on accessing PROJECT, its quotas, etc., see the [PROJECT Da

 [1]: ../../barbora/storage/#home-file-system
 [2]: ../../storage/project-storage
+[3]: ../../barbora/introduction
diff --git a/docs.it4i/dgx2/job_execution.md b/docs.it4i/dgx2/job_execution.md
index 027bca200..849cd7469 100644
--- a/docs.it4i/dgx2/job_execution.md
+++ b/docs.it4i/dgx2/job_execution.md
@@ -2,38 +2,24 @@

 To run a job, computational resources of DGX-2 must be allocated.

-## Resources Allocation Policy
-
-The resources are allocated to the job in a fair-share fashion, subject to constraints set by the queue. The queue provides prioritized and exclusive access to computational resources.
-
-The queue for the DGX-2 machine is called **qdgx**.
-
-!!! note
-    The qdgx queue is configured to run one job and accept one job in a queue per user with the maximum walltime of a job being **48** hours.
-
-## Job Submission and Execution
-
-The `qsub` submits the job into the queue. The command creates a request to the PBS Job manager for allocation of specified resources. The resources will be allocated when available, subject to allocation policies and constraints. After the resources are allocated, the jobscript or interactive shell is executed on the allocated node.
-
-### Job Submission
+The DGX-2 machine is integrated into and accessible through the Barbora cluster; the queue for the DGX-2 machine is called **qdgx**.

 When allocating computational resources for the job, specify:

-1. a queue for your job (the default is **qdgx**);
-1. the maximum wall time allocated to your calculation (default is **4 hour**, maximum is **48 hour**);
-1. a jobscript or interactive switch.
-
-!!! info
-    You can access the DGX PBS scheduler by loading the "DGX-2" module.
+1. your Project ID;
+1. a queue for your job - **qdgx**;
+1. the maximum time allocated to your calculation (default is **4 hours**, maximum is **48 hours**);
+1. a jobscript if batch processing is intended.

-Submit the job using the `qsub` command:
+Submit the job using the `sbatch` (for batch processing) or `salloc` (for interactive session) command:

 **Example**

 ```console
-[kru0052@login2.barbora ~]$ qsub -q qdgx -l walltime=02:00:00 -I
-qsub: waiting for job 258.dgx to start
-qsub: job 258.dgx ready
+[kru0052@login2.barbora ~]$ salloc -A PROJECT-ID -p qdgx --time=02:00:00
+salloc: Granted job allocation 36631
+salloc: Waiting for resource configuration
+salloc: Nodes cn202 are ready for job

 kru0052@cn202:~$ nvidia-smi
 Wed Jun 16 07:46:32 2021
@@ -95,7 +81,7 @@ kru0052@cn202:~$ exit
 ```

 !!! tip
-    Submit the interactive job using the `qsub -I ...` command.
+    Submit the interactive job using the `salloc` command.

 ### Job Execution

@@ -110,9 +96,10 @@ to download the container via Apptainer/Singularity, see the example below:
 #### Example - Apptainer/Singularity Run Tensorflow

 ```console
-[kru0052@login2.barbora ~]$ qsub -q qdgx -l walltime=01:00:00 -I
-qsub: waiting for job 96.dgx to start
-qsub: job 96.dgx ready
+[kru0052@login2.barbora ~]$ salloc -A PROJECT-ID -p qdgx --time=02:00:00
+salloc: Granted job allocation 36633
+salloc: Waiting for resource configuration
+salloc: Nodes cn202 are ready for job

 kru0052@cn202:~$ singularity shell docker://nvcr.io/nvidia/tensorflow:19.02-py3
 Singularity tensorflow_19.02-py3.sif:~>
--
GitLab
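
For the batch-processing path the patch introduces (`sbatch` with a jobscript), a minimal jobscript sketch follows. The job name, the script filename, and `train.py` are hypothetical placeholders; the directives are standard Slurm options and have not been verified against the qdgx queue's actual configuration.

```bash
#!/bin/bash
#SBATCH --job-name=dgx2-tf-example   # hypothetical job name
#SBATCH --account=PROJECT-ID         # replace with your Project ID
#SBATCH --partition=qdgx             # the DGX-2 queue on Barbora
#SBATCH --time=01:00:00              # must stay within the 48-hour maximum

# Run a TensorFlow script inside the NGC container pulled via Apptainer/Singularity;
# train.py stands in for the user's own training script.
singularity exec docker://nvcr.io/nvidia/tensorflow:19.02-py3 python train.py
```

Submitted from a Barbora login node with `sbatch jobscript.sh`, the job's output would land in `slurm-<jobid>.out` in the submission directory by default.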