Commit ca40ffe0 authored by Jan Siwiec

Update 3 files

- /docs.it4i/general/vllm_deepseek.md
- /docs.it4i/software/vllm_deepseek.md
- /mkdocs.yml
parent eff9bb47
Merge request !498: Update 3 files
Pipeline #44707 failed
# Using vLLM with DeepSeek on Karolina
This guide walks through how to set up and serve the DeepSeek model using vLLM on the Karolina HPC cluster.
It covers requesting GPU resources, loading the necessary modules, setting environment variables,
and launching the model server with tensor parallelism across multiple GPUs on single and multiple nodes.
## Multi-GPU, Single-Node Setup
### 1. Request Compute Resources via SLURM
Use `salloc` to allocate an interactive job session on the GPU partition.
```console
salloc -A PROJECT-ID --partition=qgpu --gpus=4 --time=02:00:00
```
Replace `PROJECT-ID` with your actual project ID.
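Once the allocation is granted, a quick sanity check can save time later (a minimal sketch, assuming your shell is on the allocated GPU node; output will vary by job):

```console
squeue --me     # confirm the job is running and note the assigned node
nvidia-smi      # all 4 requested GPUs should be listed
```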
### 2. Load Required Modules on Karolina
Load the necessary software modules, including Python and CUDA.
```console
ml Python/3.12.3-GCCcore-13.3.0
ml CUDA/12.4.0
```
Verify that CUDA is loaded correctly:
```console
nvcc --version
```
### 3. Create and Activate a Virtual Environment
```console
python -m venv vllm
source vllm/bin/activate
pip install "vllm==0.7.3" "ray==2.40.0"
```
Check that the necessary environment variables, such as `LD_LIBRARY_PATH`, are set correctly:
```console
echo $LD_LIBRARY_PATH
```
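As an additional check (not part of the original steps), you can confirm that the freshly installed packages import cleanly:

```console
python -c "import vllm; print(vllm.__version__)"
python -c "import ray; print(ray.__version__)"
```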
### 4. Set Environment Variables for Cache Directories
These directories will be used by HuggingFace and vLLM to store model weights.
```console
export HF_HUB_CACHE=/scratch/project/fta-25-9
export VLLM_CACHE_ROOT=/scratch/project/fta-25-9
```
Adjust the paths to your project's scratch directory.
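Optionally, verify the cache directory exists and is writable before the first download (a small sketch using the example project path from above):

```console
mkdir -p "$HF_HUB_CACHE"
touch "$HF_HUB_CACHE/.write-test" && rm "$HF_HUB_CACHE/.write-test" && echo "cache dir is writable"
```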
### 5. Serve the Model
Launch the DeepSeek model using vLLM.
```console
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --tensor-parallel-size 4 --download-dir /scratch/project/fta-25-9
```
Note that `--trust-remote-code` is required for some models.

`--tensor-parallel-size` should match the number of GPUs allocated (4 in this case).

The `--download-dir` should point to a high-performance, writable scratch directory.
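Once the server logs show it is ready, you can check it from the same node (assuming vLLM's default host and port of `localhost:8000`):

```console
curl http://127.0.0.1:8000/v1/models
```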
## Multi-GPU, Multi-Node Setup
This section describes how to launch a distributed vLLM model server across multiple nodes, using Ray for orchestration.
### 1. Request Compute Resources via SLURM
Request multiple nodes with GPUs using SLURM. Replace the account ID as needed.
```console
salloc -A FTA-25-9 --partition=qgpu --nodes=2 --gpus=9 --time=12:00:00
```
- `--nodes=2`: Requests 2 compute nodes.
- `--gpus=9`: Requests 9 GPUs in total across all the nodes.
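Inside the allocation, standard SLURM tooling lists the assigned nodes (a quick sketch; the hostnames shown depend on your job):

```console
echo "$SLURM_JOB_NODELIST"                     # compact form, e.g. acn[12-13]
scontrol show hostnames "$SLURM_JOB_NODELIST"  # one hostname per line
```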
### 2. Load Modules on Karolina
Load the required modules on each node:
```console
ml Python/3.12.3-GCCcore-13.3.0
ml CUDA/12.4.0
```
Verify the CUDA installation:
```console
nvcc --version
```
### 3. Create and Activate a Virtual Environment
```console
python -m venv vllm
source vllm/bin/activate
pip install "vllm==0.7.3" "ray==2.40.0"
```
Ensure you do this on all nodes.
### 4. Set Up the Ray Cluster
Choose a free port (e.g., 6379) and start the Ray cluster:
```console
ray start --head --port=6379
```
Specify the port on which to start your cluster. The IP address and port used will be needed by the worker nodes.
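One way to find the address the workers should use (a hedged sketch; on multi-homed nodes, pick the interface reachable from the workers):

```console
HEAD_IP=$(hostname -I | awk '{print $1}')   # first reported address; verify it is reachable
echo "${HEAD_IP}:6379"                      # workers join via: ray start --address=<this value>
```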
On the worker nodes:
- Identify the node IDs of the allocated nodes:
- `squeue --me`: get the list of assigned nodes
- SSH into each of the other nodes (excluding the head node)
```console
squeue --me
ssh acn-node_id
```
Connect to the Ray head from the worker node using its IP and port:
```console
ray start --address='head-node-ip:port-number'
```
Repeat this for all worker nodes to join the cluster.
Check the cluster status:
- On any node, confirm that all nodes have joined:
```console
ray status
```
### 5. Serve the Model with vLLM
Once the Ray cluster is ready, launch the vLLM OpenAI-compatible API server on one of the worker nodes.
```console
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Llama-70B \
    --host 0.0.0.0 --port 8000 \
    --tensor-parallel-size 4 --pipeline-parallel-size 2 --device cuda
```
This command assumes model parallelism:

- `--tensor-parallel-size`: how many GPUs to split each layer across
- `--pipeline-parallel-size`: how many layers to split across different nodes/devices

With `--tensor-parallel-size 4` and `--pipeline-parallel-size 2`, the server uses 4 × 2 = 8 GPUs in total, so the allocation must provide at least 8 GPUs.
### 6. Test the Model Server
You can test the endpoint using `curl`. Run this from the same node (adjust the IP if needed).
```console
curl http://127.0.0.1:8000/v1/completions -H "Content-Type: application/json" -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
    ...
}'
```
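For reference, a complete request might look like the following (the `prompt` and `max_tokens` values are illustrative placeholders, not taken from the original guide):

```console
curl http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek-ai/DeepSeek-R1-Distill-Llama-70B",
        "prompt": "Explain pipeline parallelism in one sentence.",
        "max_tokens": 64
    }'
```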
In `mkdocs.yml`, the new page is added to the navigation:

nav:
  - Deep Learning:
    - AlphaFold: software/machine-learning/alphafold.md
    - DeepDock: software/machine-learning/deepdock.md
  - LLM:
    - vLLM with DeepSeek: software/vllm_deepseek.md
  - MPI:
    - Introduction: software/mpi/mpi.md
    - OpenMPI Examples: software/mpi/ompi-examples.md