Commit ca40ffe0 authored by Jan Siwiec

Update 3 files

- /docs.it4i/general/vllm_deepseek.md
- /docs.it4i/software/vllm_deepseek.md
- /mkdocs.yml
parent eff9bb47
Merge request !498: Update 3 files
Pipeline #44707 failed
# Using vLLM with DeepSeek on Karolina
This guide walks through how to set up and serve the DeepSeek model using vLLM on the Karolina HPC cluster.
It covers requesting GPU resources, loading the necessary modules, setting environment variables,
and launching the model server with tensor parallelism across multiple GPUs on single and multiple nodes.
## Multi-GPU, Single-Node Setup
### 1. Request Compute Resources via SLURM
Use `salloc` to allocate an interactive job session on the GPU partition.
```console
salloc -A PROJECT-ID --partition=qgpu --gpus=4 --time=02:00:00
```
Replace `PROJECT-ID` with your actual project ID.
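Once the allocation is granted, a quick sanity check can save time later (a minimal sketch, assuming your shell is on the allocated GPU node; output will vary by job):

```console
squeue --me     # confirm the job is running and note the assigned node
nvidia-smi      # all 4 requested GPUs should be listed
```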
### 2. Load Required Modules on Karolina
Load the necessary software modules, including Python and CUDA.
```console
ml Python/3.12.3-GCCcore-13.3.0
ml CUDA/12.4.0
```
Verify that CUDA is loaded correctly:
```console
nvcc --version
```
### 3. Create and Activate a Virtual Environment
```console
python -m venv vllm
source vllm/bin/activate
pip install "vllm==0.7.3" "ray==2.40.0"
```
Check that the necessary environment variables, such as `LD_LIBRARY_PATH`, are set correctly:
```console
echo $LD_LIBRARY_PATH
```
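As an additional check (not part of the original steps), you can confirm that the freshly installed packages import cleanly:

```console
python -c "import vllm; print(vllm.__version__)"
python -c "import ray; print(ray.__version__)"
```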
### 4. Set Environment Variables for Cache Directories
These directories will be used by HuggingFace and vLLM to store model weights.
```console
export HF_HUB_CACHE=/scratch/project/fta-25-9
export VLLM_CACHE_ROOT=/scratch/project/fta-25-9
```
Adjust the paths to your project's scratch directory.
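Optionally, verify the cache directory exists and is writable before the first download (a small sketch using the example project path from above):

```console
mkdir -p "$HF_HUB_CACHE"
touch "$HF_HUB_CACHE/.write-test" && rm "$HF_HUB_CACHE/.write-test" && echo "cache dir is writable"
```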
### 5. Serve the Model
Launch the DeepSeek model using vLLM.
```console
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --tensor-parallel-size 4 --download-dir /scratch/project/fta-25-9
```
Note that `--trust-remote-code` is required for some models.

`--tensor-parallel-size` should match the number of GPUs allocated (4 in this case).

The `--download-dir` should point to a high-performance, writable scratch directory.
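Once the server logs show it is ready, you can check it from the same node (assuming vLLM's default host and port of `localhost:8000`):

```console
curl http://127.0.0.1:8000/v1/models
```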
## Multi-GPU, Multi-Node Setup
This section describes how to launch a distributed vLLM model server across multiple nodes, using Ray for orchestration.
### 1. Request Compute Resources via SLURM
Request multiple nodes with GPUs using SLURM. Replace the account ID as needed.
```console
salloc -A FTA-25-9 --partition=qgpu --nodes=2 --gpus=9 --time=12:00:00
```
- `--nodes=2`: Requests 2 compute nodes.
- `--gpus=9`: Requests 9 GPUs in total across all the nodes.
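Inside the allocation, standard SLURM tooling lists the assigned nodes (a quick sketch; the hostnames shown depend on your job):

```console
echo "$SLURM_JOB_NODELIST"                     # compact form, e.g. acn[12-13]
scontrol show hostnames "$SLURM_JOB_NODELIST"  # one hostname per line
```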
### 2. Load Modules on Karolina
Load the required modules on each node:
```console
ml Python/3.12.3-GCCcore-13.3.0
ml CUDA/12.4.0
```
Verify the CUDA installation:
```console
nvcc --version
```
### 3. Create and Activate a Virtual Environment
```console
python -m venv vllm
source vllm/bin/activate
pip install "vllm==0.7.3" "ray==2.40.0"
```
Ensure you do this on all nodes.
### 4. Set Up the Ray Cluster
Choose a free port (e.g., 6379) and start the Ray cluster:
```console
ray start --head --port=6379
```
Specify the port on which to start your cluster. The IP address and port used will be needed by the worker nodes.
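One way to find the address the workers should use (a hedged sketch; on multi-homed nodes, pick the interface reachable from the workers):

```console
HEAD_IP=$(hostname -I | awk '{print $1}')   # first reported address; verify it is reachable
echo "${HEAD_IP}:6379"                      # workers join via: ray start --address=<this value>
```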
On the worker nodes:
- Identify the node IDs of the allocated nodes:
- `squeue --me`: get the list of assigned nodes
- SSH into each of the other nodes (excluding the head node)
```console
squeue --me
ssh acn-node_id
```
Connect to the Ray head from the worker node using its IP and port:
```console
ray start --address='head-node-ip:port-number'
```
Repeat this for all worker nodes to join the cluster.
Check the cluster status:
- On any node, confirm that all nodes have joined:
```console
ray status
```
### 5. Serve the Model with vLLM
Once the Ray cluster is ready, launch the vLLM OpenAI-compatible API server on one of the worker nodes.
```console
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 4 --pipeline-parallel-size 2 --device cuda
```
This command assumes model parallelism:

- `--tensor-parallel-size`: how many GPUs to split each layer across
- `--pipeline-parallel-size`: how many layers to split across different nodes/devices

With `--tensor-parallel-size 4` and `--pipeline-parallel-size 2`, the server uses 4 × 2 = 8 GPUs in total, so the allocation must provide at least 8 GPUs.
### 6. Test the Model Server
You can test the endpoint using `curl`. Run this from the same node (adjust the IP if needed).
```console
curl http://127.0.0.1:8000/v1/completions -H "Content-Type: application/json" -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    ...
}'
```
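For reference, a complete request might look like the following (the `prompt` and `max_tokens` values are illustrative placeholders, not taken from the original guide):

```console
curl http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
        "prompt": "Explain pipeline parallelism in one sentence.",
        "max_tokens": 64
    }'
```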
In `mkdocs.yml`, the new page is added to the navigation:

nav:
  - Deep Learning:
    - AlphaFold: software/machine-learning/alphafold.md
    - DeepDock: software/machine-learning/deepdock.md
  - LLM:
    - vLLM with DeepSeek: software/vllm_deepseek.md
  - MPI:
    - Introduction: software/mpi/mpi.md
    - OpenMPI Examples: software/mpi/ompi-examples.md