Minimal Notes on Deploying Large Models - EW帮帮网

I. vLLM Deployment

0. Installing WSL

Skip this step on Linux.

On Windows, you can use WSL (Windows Subsystem for Linux):

Windows --> Linux

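wsl --update
wsl --install
# If the update is slow, download from the web instead:
wsl --update --web-download

# Uninstall
wsl --uninstall

# List installed distributions and their WSL versions
wsl -l -v
# Set the default distribution
wsl --set-default Ubuntu-22.04
# Start
wsl
# Exit
exit

From here on, all commands are run in the Linux terminal unless noted otherwise.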

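1. Base environment

sudo apt update               # refresh the package index
sudo apt upgrade -y           # upgrade installed packages

sudo apt install -y python3 python3-pip   # install (or upgrade) Python and pip

# Check the versions
python3 --version
pip3 --version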

2. Virtual environment

Conda: choose one of the two installers, download and run it, then press Enter and answer yes at the prompts.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh   # Miniconda
wget https://repo.anaconda.com/archive/Anaconda3-2025.06-0-Linux-x86_64.sh   # Anaconda

bash Miniconda3-latest-Linux-x86_64.sh    # Miniconda
bash Anaconda3-2025.06-0-Linux-x86_64.sh  # Anaconda

# Reload the shell configuration
source ~/.bashrc

# Check the version
conda --version

Create and activate the virtual environment:

conda create -n vllm1 python=3.12

conda activate vllm1

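3. Downloads

Install the PyTorch build that matches your GPU driver (CUDA) version:
https://pytorch.org/

# Example: CUDA 12.4, PyTorch v2.6.0
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

Install the vLLM engine (you can also run this step directly; pip then installs a compatible PyTorch automatically):

pip install vllm -i https://mirrors.aliyun.com/pypi/simple/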
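
Before moving on, it can help to confirm that the packages import and that PyTorch sees the GPU. The following is a minimal sketch; the helper name check_install is hypothetical, and it only relies on importlib plus the standard torch.cuda.is_available() call.

```python
import importlib.util


def check_install(packages=("torch", "vllm")):
    """Report which packages are importable, plus CUDA visibility if torch is present."""
    report = {name: importlib.util.find_spec(name) is not None for name in packages}
    if report.get("torch"):
        # Imported lazily so the check also works before torch is installed.
        import torch
        report["torch_version"] = torch.__version__
        report["cuda_available"] = torch.cuda.is_available()
    return report
```

Calling print(check_install()) inside the vllm1 environment should show both packages as True; if cuda_available is False, the installed PyTorch build probably does not match the driver.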

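Download the model from Hugging Face:

# Install the Hugging Face Hub client
pip install huggingface_hub
# Use a mirror to speed up downloads (bash syntax; "set" only works in Windows cmd)
export HF_ENDPOINT=https://hf-mirror.com
# Download the model files; --local-dir sets the download directory
huggingface-cli download --resume-download Qwen/Qwen3-0.6B --local-dir /mnt/c/huggingface/Qwen3-0.6B --local-dir-use-symlinks False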
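
The same download can be scripted from Python, which makes the mirror setting explicit instead of depending on the current shell. This is only a sketch: mirror_env and download_command are hypothetical helper names, and the command uses just the download subcommand and the --local-dir option shown above.

```python
import os


def mirror_env(endpoint="https://hf-mirror.com", base_env=None):
    """Return a copy of the environment with HF_ENDPOINT pointing at the mirror."""
    env = dict(os.environ if base_env is None else base_env)
    env["HF_ENDPOINT"] = endpoint
    return env


def download_command(repo_id, local_dir):
    """Build the huggingface-cli argument list for downloading one model repo."""
    return ["huggingface-cli", "download", repo_id, "--local-dir", local_dir]


# To actually run the download (needs huggingface_hub and network access):
#   import subprocess
#   subprocess.run(
#       download_command("Qwen/Qwen3-0.6B", "/mnt/c/huggingface/Qwen3-0.6B"),
#       env=mirror_env(), check=True,
#   )
```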

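4. Starting vLLM

Find the machine's IP address:

hostname -I        # inside WSL; from a Windows terminal, run: wsl hostname -I

Start the server with the locally downloaded model:

python3 -m vllm.entrypoints.openai.api_server \
    --model /mnt/c/huggingface/Qwen3-0.6B \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 6400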
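
Loading the model takes a while, so a client that connects immediately may get a connection error. A small polling helper can wait until the server answers. This is a sketch using only the standard library; the function name wait_until_ready and the default timeouts are assumptions, and it polls the OpenAI-compatible /v1/models route.

```python
import http.client
import time
import urllib.request


def wait_until_ready(base_url, timeout=120.0, interval=2.0):
    """Poll GET {base_url}/v1/models until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            pass  # server not listening yet (or not healthy), keep waiting
        time.sleep(interval)
    return False
```

For example, wait_until_ready("http://localhost:8000") returns True once the server started above is accepting requests, and False if it never comes up within the timeout.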

Client access:

http://<WSL IP>:8000/
http://localhost:8000/
http://127.0.0.1:8000/

In a browser, open the following page to view the vLLM server's API endpoints:

http://<WSL IP>:8000/docs

5. Example code

OpenAI-style client:

from openai import OpenAI

# Use localhost, or http://<WSL IP>:8000/v1 (find the IP with: hostname -I)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="/mnt/c/huggingface/Qwen3-0.6B",  # must match the --model path given to the server
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, please give a brief introduction to artificial intelligence."},
    ],
)
print(response.choices[0].message.content)
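
Because the model field has to match the name the server registered, it can be useful to ask the server what it is serving. The sketch below reads the OpenAI-compatible /v1/models route; extract_model_ids and list_served_models are hypothetical helper names.

```python
import json
import urllib.request


def extract_model_ids(payload):
    """Return the "id" of every entry in an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url):
    """Fetch {base_url}/v1/models and return the model names the server accepts."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return extract_model_ids(json.load(resp))
```

With the server from step 4 running, list_served_models("http://localhost:8000") is expected to return the path passed to --model, unless the server was started with a different name via --served-model-name.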

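6. Parameter configuration example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# Generation parameters
gen_kwargs = {
    "max_tokens": 1024,  # maximum number of tokens to generate
    "temperature": 0.7,  # sampling temperature; higher values give more varied output
    "top_p": 0.8,  # nucleus sampling: keep the smallest token set whose cumulative probability reaches 0.8
    "extra_body": {  # server-specific parameters outside the OpenAI schema
        # do_sample is a Hugging Face Transformers option, not a vLLM sampling
        # parameter; vLLM samples whenever temperature > 0
        "top_k": 10,  # sample only from the 10 most likely tokens
        "repetition_penalty": 1.2,  # values above 1 discourage repetition
    },
}
# Message content
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, please give a brief introduction to artificial intelligence."},
]
response = client.chat.completions.create(
    **gen_kwargs,
    model="/mnt/c/huggingface/Qwen3-0.6B",  # change to your own path
    messages=messages,
)
print(response.choices[0].message.content)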
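
It may help to see what extra_body does on the wire: the OpenAI client merges those keys into the top level of the JSON request body, next to the standard fields. The helper below is a hypothetical illustration of that merge, not part of the openai package.

```python
def build_chat_payload(model, messages, extra_body=None, **sampling):
    """Build the JSON body for POST /v1/chat/completions.

    Standard OpenAI fields go in **sampling; server-specific fields go in
    extra_body and are merged into the top level of the request.
    """
    payload = {"model": model, "messages": messages, **sampling}
    payload.update(extra_body or {})
    return payload
```

For instance, build_chat_payload("m", [], temperature=0.7, extra_body={"top_k": 10}) yields a single flat dictionary containing both temperature and top_k, which is the shape a raw requests.post call would need to send.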

II. Docker Deployment

1. Preparing the environment

# Update the system
sudo apt update && sudo apt upgrade -y

2. Installing Docker

sudo apt install docker.io -y

sudo systemctl start docker

sudo systemctl enable docker

Check the installation:

docker --version

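3. [Optional] GPU support

If the cloud server has a GPU (for example an A100 or RTX 3090):

Pull the official NVIDIA CUDA base image, which is used below to test GPU access:

docker pull nvidia/cuda:12.2.0-base-ubuntu20.04

On a native Linux Docker Engine, the --gpus flag also requires the NVIDIA Container Toolkit (apt-key and the nvidia-docker2 package used in older guides are deprecated):

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Test whether the GPU is available (prints the GPU information):

docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi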
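
To pick sensible values for options such as --gpu-memory-utilization, it helps to know how much memory each GPU has. The sketch below parses the CSV output of nvidia-smi's --query-gpu mode; the helper names parse_gpu_csv and query_gpus are hypothetical, and query_gpus assumes nvidia-smi is on the PATH.

```python
import subprocess


def parse_gpu_csv(text):
    """Parse 'name, memory.total' CSV lines into (name, MiB) tuples."""
    gpus = []
    for line in text.strip().splitlines():
        # Split on the last comma so a comma inside the GPU name is harmless.
        name, mem = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(mem)))
    return gpus


def query_gpus():
    """Run nvidia-smi and return the parsed GPU list."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```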

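4. Pulling the image and starting the container

Start the vLLM service with Docker: load the image, attach the GPU, and expose the API service. The examples below use Windows host paths (C:\..., E:\...), as with Docker Desktop; on Linux, replace them with the corresponding Linux path.

<Method 1>

Pull the official image (use the same tag that you run):

docker pull vllm/vllm-openai:v0.8.4

docker run -it --gpus all -p 8000:8000 -v C:\huggingface:/data/huggingface vllm/vllm-openai:v0.8.4 --model /data/huggingface/Qwen3-0.6B --max-model-len 3200 --gpu-memory-utilization 0.8

<Method 2>

Load a local image:

docker load -i E:\Docker\my-lll-vllm.tar

List the images:

docker images

Run the image (two cases):

(1) The image does not contain the model

docker run -it --gpus all -p 8000:8000 -v E:\huggingface:/huggingface my-lll-vllm:latest --model /huggingface/Qwen3-0.6B --max-model-len 4000 --gpu-memory-utilization 0.8

(2) The image contains the model

docker run -it --gpus all -p 8000:8000 docker-with-model:latest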
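
The relationship between the -v mapping and the --model path is easy to get wrong: --model must be the path as seen inside the container. The following sketch builds the argument list for the mounted-model case; docker_run_command is a hypothetical helper, and its defaults simply mirror the values used above.

```python
import posixpath


def docker_run_command(image, host_model_dir, model_name,
                       container_dir="/huggingface", port=8000,
                       max_model_len=4000, gpu_memory_utilization=0.8):
    """Build a docker run argument list that mounts the model directory."""
    # The model path is resolved inside the container, so it is built from
    # the container-side mount point, not from the host path.
    model_path = posixpath.join(container_dir, model_name)
    return [
        "docker", "run", "-it", "--gpus", "all",
        "-p", f"{port}:8000",
        "-v", f"{host_model_dir}:{container_dir}",
        image,
        "--model", model_path,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
```

Passing the result to subprocess.run would start the container; printing " ".join(...) gives a command line that can be compared against the ones above.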

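5. Further container operations

The commands below assume the container is named qwen3 (start it with --name qwen3, or use the container ID shown by docker ps instead).

Open a shell inside the container:

docker exec -it qwen3 /bin/bash

Stop and start:

docker stop qwen3
docker start qwen3

Remove the container:

docker rm -f qwen3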

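6. Packaging an image

Find the container ID:

# Running containers
docker ps
# All containers
docker ps -a

Commit the container as an image:

docker commit <ID> <name>:<tag>

Save it as a .tar file (-o can be followed by a path):

docker save -o docker_name.tar docker_name:latest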

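7. [Optional] Packaging the model into the image

Copy the model into the container, then package it (replace 018a0895fd12 with your own container ID):

# Create the directory
docker exec -it 018a0895fd12 mkdir -p /models
# Copy the model into the container
docker cp F:/huggingface/Qwen3-0.6B 018a0895fd12:/models/Qwen3-0.6B

Then commit and save as described above.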

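8. Testing API access

View the server's API endpoints in a browser (same as before):

http://localhost:8000/docs

Test with Python code:

import requests

url = "http://<host IP>:8000/v1/chat/completions"
data = {
    "model": "/models/Qwen3-0.6B",  # the model path as seen inside the container
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, please briefly introduce yourself."}
    ]
}

r = requests.post(url, json=data)
print(r.json())

If the model is already packaged into the image, use /models/Qwen3-0.6B directly.
If the model is mounted instead, "model" must be the container-side path passed to --model (for example /huggingface/Qwen3-0.6B), so make sure the -v mapping between the host path and the container path is correct.
An image that already contains the model needs no mount; otherwise the model directory must be mounted with -v.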
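
Printing r.json() shows the whole response, including usage statistics. To get just the assistant's text, and to fail clearly when the server returns something else (for example because the model name does not match), a small helper can be used. This is a sketch; extract_reply is a hypothetical name, and it relies only on the standard choices[0].message.content layout of a chat completion.

```python
def extract_reply(response_json):
    """Return the assistant text from a chat-completions response, or raise."""
    choices = response_json.get("choices")
    if not choices:
        # Error bodies carry no "choices"; surface the raw body for debugging.
        raise RuntimeError(f"unexpected response: {response_json!r}")
    return choices[0]["message"]["content"]
```

With the request above, print(extract_reply(r.json())) prints only the model's answer.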
