Installing Distributed vLLM with Docker
1 Introduction
vLLM is a fast and easy-to-use library for LLM inference and serving, suitable for production use. A single-host deployment can run out of GPU memory, so a distributed deployment is needed.
Distributed serving documentation:
https://docs.vllm.ai/en/latest/serving/distributed_serving.html
2 Installation
⚠️ Note: make sure the Docker environment, the NVIDIA container runtime, and the GPU drivers are installed and working beforehand.
CUDA Version: 12.4
vLLM: v0.7.2
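Before pulling the image, it is worth confirming that the GPU driver and the NVIDIA container runtime work together with Docker. A minimal sanity check, assuming a CUDA base image tag that matches your driver (the tag below is only an example):
# GPU driver check on the host
nvidia-smi
# Container runtime check: the same GPU list should appear inside a container
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi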
2.1 Pull the image
# Pull the image (it is fairly large)
docker pull vllm/vllm-openai:v0.7.2
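Optionally, confirm that the image is present locally:
# List the pulled vLLM images and their tags
docker images vllm/vllm-openai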
Download the distributed deployment script:
https://github.com/vllm-project/vllm/blob/main/examples/online_serving/run_cluster.sh
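For example, the script can be fetched from the corresponding raw GitHub URL (the path may change between vLLM versions) and made executable:
# Download run_cluster.sh and make it executable
wget https://raw.githubusercontent.com/vllm-project/vllm/main/examples/online_serving/run_cluster.sh
chmod +x run_cluster.sh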
Contents of run_cluster.sh:
#!/bin/bash
# Check for minimum number of required arguments
if [ $# -lt 4 ]; then
    echo "Usage: $0 docker_image head_node_address --head|--worker path_to_hf_home [additional_args...]"
    exit 1
fi

# Assign the first four arguments and shift them away
DOCKER_IMAGE="$1"
HEAD_NODE_ADDRESS="$2"
NODE_TYPE="$3"  # Should be --head or --worker
PATH_TO_HF_HOME="$4"
shift 4

# Additional arguments are passed directly to the Docker command
ADDITIONAL_ARGS=("$@")

# Validate node type
if [ "${NODE_TYPE}" != "--head" ] && [ "${NODE_TYPE}" != "--worker" ]; then
    echo "Error: Node type must be --head or --worker"
    exit 1
fi

# Define a function to clean up on EXIT signal
cleanup() {
    docker stop node
    docker rm node
}
trap cleanup EXIT

# Command setup for head or worker node
RAY_START_CMD="ray start --block"
if [ "${NODE_TYPE}" == "--head" ]; then
    RAY_START_CMD+=" --head --port=6379"
else
    RAY_START_CMD+=" --address=${HEAD_NODE_ADDRESS}:6379"
fi

# Run the docker command with the user-specified parameters and additional arguments
docker run \
    --entrypoint /bin/bash \
    --network host \
    --name node \
    --shm-size 10.24g \
    --gpus all \
    -v "${PATH_TO_HF_HOME}:/root/.cache/huggingface" \
    "${ADDITIONAL_ARGS[@]}" \
    "${DOCKER_IMAGE}" -c "${RAY_START_CMD}"
2.2 Create the containers
The two hosts have the following IP addresses: head node host 192.168.108.100, worker node host 192.168.108.101.
Run the distributed vLLM script on the head node
Official documentation example:
# ip_of_head_node: IP address of the host running the head-node container
# /path/to/the/huggingface/home/in/this/node: Hugging Face cache path on this host, mounted into the container
# ip_of_this_node: IP address of the host running the current node
# --head: marks this node as the head node
bash run_cluster.sh \
vllm/vllm-openai \
ip_of_head_node \
--head \
/path/to/the/huggingface/home/in/this/node \
-e VLLM_HOST_IP=ip_of_this_node
Command used on this machine:
bash run_cluster.sh \
vllm/vllm-openai:v0.7.2 \
192.168.108.100 \
--head \
/home/vllm \
-e VLLM_HOST_IP=192.168.108.100 \
> nohup.log 2>&1 &
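The script keeps docker run in the foreground, which is why the command above is backgrounded and its output redirected to nohup.log. A quick way to confirm the head container and the Ray head process came up:
# The container named "node" should be running
docker ps --filter name=node
# Ray should report one active node at this point
docker exec node ray status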
Run the distributed vLLM script on the worker node
Official documentation example:
# ip_of_head_node: IP address of the host running the head-node container
# /path/to/the/huggingface/home/in/this/node: Hugging Face cache path on this host, mounted into the container
# ip_of_this_node: IP address of the host running the current node
# --worker: marks this node as a worker node
bash run_cluster.sh \
vllm/vllm-openai \
ip_of_head_node \
--worker \
/path/to/the/huggingface/home/in/this/node \
-e VLLM_HOST_IP=ip_of_this_node
Command used on this machine:
bash run_cluster.sh \
vllm/vllm-openai:v0.7.2 \
192.168.108.100 \
--worker \
/home/vllm \
-e VLLM_HOST_IP=192.168.108.101 \
> nohup.log 2>&1 &
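The worker must be able to reach the Ray port on the head node (6379 in this script). If the worker does not join the cluster, a simple connectivity check from the worker host can help (assuming nc is installed):
# From the worker host: verify the head node's Ray port is reachable
nc -zv 192.168.108.100 6379
# Then confirm the worker container started and joined the cluster
docker exec node ray status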
Check the cluster status
# Enter the container
docker exec -it node /bin/bash
# Show the cluster status
ray status
# The output includes the number of GPUs, the CPU configuration, memory size, etc.
======== Autoscaler status: 2025-02-13 20:18:13.886242 ========
Node status
---------------------------------------------------------------
Active:
1 node_89c804d654976b3c606850c461e8dc5c6366de5e0ccdb360fcaa1b1c
1 node_4b794efd101bc393da41f0a45bd72eeb3fb78e8e507d72b5fdfb4c1b
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/128.0 CPU
0.0/4.0 GPU
0B/20 GiB memory
0B/19.46GiB object_store_memory
Demands:
(no resource demands)
3 Serve the model
⚠️ Note: this deployment has 4 GPUs in total (see the ray status output above).
Official documentation example:
# Start the model service; adjust the parameters to your situation
# /path/to/the/model/in/the/container: path to the model inside the container
# tensor-parallel-size: tensor parallelism degree; each layer is split across GPUs and computed in parallel
# pipeline-parallel-size: pipeline parallelism degree; different layers are placed on different GPUs, useful when a single GPU does not have enough memory
# In total tensor-parallel-size x pipeline-parallel-size GPUs are used (8 x 2 = 16 in this example)
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 8 \
--pipeline-parallel-size 2
Command used on this machine:
Place the downloaded Qwen2.5-7B-Instruct model under the /home/vllm directory, which run_cluster.sh mounts into the container at /root/.cache/huggingface.
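If the model still needs to be downloaded, one option is the Hugging Face CLI on the host (assuming huggingface_hub is installed; a ModelScope mirror would work as well). Since each node mounts its own /home/vllm, the model directory has to exist on every host:
# Download Qwen2.5-7B-Instruct into the directory mounted by run_cluster.sh
huggingface-cli download Qwen/Qwen2.5-7B-Instruct --local-dir /home/vllm/Qwen2.5-7B-Instruct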
# Enter one of the nodes (either the head node or a worker node works)
docker exec -it node /bin/bash
# Start the model server; adjust the parameters as needed
nohup vllm serve /root/.cache/huggingface/Qwen2.5-7B-Instruct \
--served-model-name qwen2.5-7b \
--tensor-parallel-size 2 \
--pipeline-parallel-size 2 \
> nohup.log 2>&1 &
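Loading the weights and initializing the distributed workers takes a while. Once nohup.log shows the server is ready, a quick check (from inside the container, or from the host, since the container uses host networking):
# The served model name "qwen2.5-7b" should appear in the model list
curl http://localhost:8000/v1/models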
Call the API from the host machine
curl http://localhost:8000/v1/chat/completions \
-X POST \
-H "Content-Type: application/json" \
-d '{
"model": "qwen2.5-7b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "介绍一下中国,不少于10000字"}
],
"stream": true
}'