Local Deployment of GLM-4.1V-9B-Thinking
For a project we needed a locally deployed vision-language (VL) model for visual recognition. After comparing several open-source candidates (Qwen2.5-VL, InternVL3, and others), GLM-4.1V-9B-Thinking came out looking relatively strong, so I went to its GitHub repository and set about deploying it. After quite a few twists and turns, both the vLLM inference service and the Gradio demo are now up and running. The process is recorded below.
1. Official repository address
The project lives at https://github.com/THUDM/GLM-4.1V-Thinking. A complaint up front: the repository does not document the environment the deployment needs, and this cost me a whole day. Only after reading many issues and piecing together other users' error reports did I work out the required CUDA and cuDNN versions (the two that matter most); the GPU model matters just as much. Along the way I tried CUDA 12.8 with newer cuDNN releases and hit all kinds of problems, and V100 GPUs (even two 32 GB cards) always failed to start the service. To save you the detours, here is the environment that worked for me:
Server OS | Ubuntu Server 22.04 LTS 64-bit | Tencent Cloud "Ubuntu Server 22.04 LTS 64-bit" image
GPU driver | 535.216.0 |
CUDA version | 12.4.1 |
cuDNN version | 9.5.1 |
GPU | A100, 40 GB VRAM | Tencent Cloud GPU compute type GT4 (GT4.4XLARGE96)
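Before going further, it is worth confirming that the machine actually matches the table above. With the NVIDIA driver and CUDA toolkit installed, a quick check looks like this:
nvidia-smi       # driver version (535.216.0 here) and GPU model
nvcc --version   # CUDA toolkit version (12.4.1 here)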
2. Clone the code from the official repository (a tip: when buying a cloud server, a Hong Kong region instance is worth considering, since many Hugging Face resources need access from outside mainland China).
2.1. Create a virtual environment. Create and activate it manually on the server first, and do all the following steps inside it to avoid conflicts with the server's system packages:
python -m venv glm4
source glm4/bin/activate
2.2. Download the code, for example with the clone command below.
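A minimal sketch of the clone step, using the repository URL that the project's Gradio page header links to:
git clone https://github.com/THUDM/GLM-4.1V-Thinking.git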
2.3. Install all the dependencies.
Before installing, edit the package sources in the shipped requirements file. Installing it as downloaded kept failing for me, because both the vLLM and transformers versions are specially pinned custom builds. This step cost me half a day and was finally resolved through this issue: 关于环境安装的若干问题 / Several Issues Regarding Environment Setup · Issue #18 · THUDM/GLM-4.1V-Thinking · GitHub
The modified requirements.txt:
setuptools>=80.9.0
setuptools_scm>=8.3.1
git+https://github.com/huggingface/transformers.git@91221da2f1f68df9eb97c980a7206b14c4d3a9b0
git+https://github.com/vllm-project/vllm.git@220aee902a291209f2975d4cd02dadcc6749ffe6
torchvision>=0.22.0
gradio>=5.35.0
pre-commit>=4.2.0
PyMuPDF>=1.26.1
av>=14.4.0
accelerate>=1.6.0
spaces>=0.37.1
Now install the dependencies:
cd GLM-4.1V-Thinking
pip install -r requirements.txt
Installation takes roughly half an hour; building the pinned vLLM and transformers versions is the slow part. The package versions after a successful install:
(glm4) ubuntu@VM-0-15-ubuntu:~$ pip list
Package Version
--------------------------------- ----------------------------
accelerate 1.8.1
aiofiles 24.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.12.13
aiosignal 1.4.0
airportsdata 20250706
annotated-types 0.7.0
anyio 4.9.0
astor 0.8.1
async-timeout 5.0.1
attrs 25.3.0
av 15.0.0
blake3 1.0.5
cachetools 6.1.0
certifi 2025.6.15
cfgv 3.4.0
charset-normalizer 3.4.2
click 8.2.1
cloudpickle 3.1.1
compressed-tensors 0.10.2
cupy-cuda12x 13.4.1
depyf 0.18.0
dill 0.4.0
diskcache 5.6.3
distlib 0.3.9
distro 1.9.0
dnspython 2.7.0
einops 0.8.1
email_validator 2.2.0
exceptiongroup 1.3.0
fastapi 0.116.0
fastapi-cli 0.0.8
fastapi-cloud-cli 0.1.1
fastrlock 0.8.3
ffmpy 0.6.0
filelock 3.18.0
frozenlist 1.7.0
fsspec 2025.5.1
gguf 0.17.1
gradio 5.35.0
gradio_client 1.10.4
groovy 0.1.2
h11 0.16.0
hf-xet 1.1.5
httpcore 1.0.9
httptools 0.6.4
httpx 0.27.2
huggingface-hub 0.33.2
identify 2.6.12
idna 3.10
interegular 0.3.3
Jinja2 3.1.6
jiter 0.10.0
jsonschema 4.24.0
jsonschema-specifications 2025.4.1
lark 1.2.2
llguidance 0.7.30
llvmlite 0.44.0
lm-format-enforcer 0.10.11
markdown-it-py 3.0.0
MarkupSafe 3.0.2
mdurl 0.1.2
mistral_common 1.6.3
mpmath 1.3.0
msgpack 1.1.1
msgspec 0.19.0
multidict 6.6.3
nest-asyncio 1.6.0
networkx 3.4.2
ninja 1.11.1.4
nodeenv 1.9.1
numba 0.61.2
numpy 2.2.6
nvidia-cublas-cu12 12.6.4.1
nvidia-cuda-cupti-cu12 12.6.80
nvidia-cuda-nvrtc-cu12 12.6.77
nvidia-cuda-runtime-cu12 12.6.77
nvidia-cudnn-cu12 9.5.1.17
nvidia-cufft-cu12 11.3.0.4
nvidia-cufile-cu12 1.11.1.6
nvidia-curand-cu12 10.3.7.77
nvidia-cusolver-cu12 11.7.1.2
nvidia-cusparse-cu12 12.5.4.2
nvidia-cusparselt-cu12 0.6.3
nvidia-nccl-cu12 2.26.2
nvidia-nvjitlink-cu12 12.6.85
nvidia-nvtx-cu12 12.6.77
openai 1.93.1
opencv-python-headless 4.12.0.88
orjson 3.10.18
outlines 0.1.11
outlines_core 0.1.26
packaging 25.0
pandas 2.3.1
partial-json-parser 0.2.1.1.post6
pillow 11.3.0
pip 22.0.2
platformdirs 4.3.8
pre_commit 4.2.0
prometheus_client 0.22.1
prometheus-fastapi-instrumentator 7.1.0
propcache 0.3.2
protobuf 6.31.1
psutil 5.9.8
py-cpuinfo 9.0.0
pybase64 1.4.1
pycountry 24.6.1
pydantic 2.11.7
pydantic_core 2.33.2
pydub 0.25.1
Pygments 2.19.2
PyMuPDF 1.26.3
python-dateutil 2.9.0.post0
python-dotenv 1.1.1
python-json-logger 3.3.0
python-multipart 0.0.20
pytz 2025.2
PyYAML 6.0.2
pyzmq 27.0.0
ray 2.47.1
referencing 0.36.2
regex 2024.11.6
requests 2.32.4
rich 14.0.0
rich-toolkit 0.14.8
rignore 0.5.1
rpds-py 0.26.0
ruff 0.12.2
safehttpx 0.1.6
safetensors 0.5.3
scipy 1.15.3
semantic-version 2.10.0
sentencepiece 0.2.0
sentry-sdk 2.32.0
setuptools 80.9.0
setuptools-scm 8.3.1
shellingham 1.5.4
six 1.17.0
sniffio 1.3.1
spaces 0.37.1
starlette 0.46.2
sympy 1.14.0
tiktoken 0.9.0
tokenizers 0.21.2
tomli 2.2.1
tomlkit 0.13.3
torch 2.7.0
torchaudio 2.7.0
torchvision 0.22.0
tqdm 4.67.1
transformers 4.54.0.dev0
triton 3.3.0
typer 0.16.0
typing_extensions 4.14.1
typing-inspection 0.4.1
tzdata 2025.2
urllib3 2.5.0
uvicorn 0.35.0
uvloop 0.21.0
virtualenv 20.31.2
vllm 0.9.2.dev398+g220aee90.cu124
watchfiles 1.1.0
websockets 15.0.1
xformers 0.0.30
xgrammar 0.1.19
yarl 1.20.1
With the base environment installed, you can try starting the service. This is where I got burned several more times: the model needs a GPU with about 40 GB of memory, and I naively assumed two 32 GB V100s would be enough, but every attempt (including multi-GPU setups) failed to start. It finally came up on a single A100. Launch command:
vllm serve THUDM/GLM-4.1V-9B-Thinking --limit-mm-per-prompt '{"image":32}' --allowed-local-media-path /
Startup loads the model and warms up the CUDA caches; after roughly five minutes the server is ready, and you can then run inference through its API (the flags above allow up to 32 images per prompt and let the server read media from local file paths).
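For a quick smoke test, the server exposes an OpenAI-compatible endpoint. A minimal client sketch, assuming vLLM's defaults (port 8000, served model name equal to the model id) and a placeholder image URL:
from openai import OpenAI

# vLLM does not validate the API key, but the client requires one to be set
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="THUDM/GLM-4.1V-9B-Thinking",
    messages=[
        {
            "role": "user",
            "content": [
                # placeholder image URL; replace with a real image
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
The reply text includes the model's <think> reasoning and <answer> sections (the same tags the Gradio script below parses), so downstream code usually strips or parses them.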
I later also found the Gradio launch script in the repository. It needs a manual change to the model it loads: the machine I ran the service on is in mainland China and cannot pull the weights from Hugging Face, so I pointed it at the locally cached snapshot instead:
MODEL_PATH = "/home/ubuntu/.cache/huggingface/hub/models--THUDM--GLM-4.1V-9B-Thinking/snapshots/def1e4472aaf5617c7c696785ff36d67c5e6d058/"
This is the model snapshot cached on my machine. The modified launch script:
import argparse
import copy
import os
import re
import subprocess
import tempfile
import threading
import time
from pathlib import Path
import fitz
import gradio as gr
import spaces
import torch
from transformers import (
AutoProcessor,
Glm4vForConditionalGeneration,
TextIteratorStreamer,
)
parser = argparse.ArgumentParser()
parser.add_argument(
"--server_name",
type=str,
default="0.0.0.0",
help="IP address, LAN access changed to 0.0.0.0",
)
parser.add_argument("--server_port", type=int, default=7860, help="Use Port")
parser.add_argument("--share", action="store_true", help="Enable gradio sharing")
parser.add_argument("--mcp_server", action="store_true", help="Enable mcp service")
args = parser.parse_args()
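# Local path to the cached Hugging Face snapshot of THUDM/GLM-4.1V-9B-Thinking;
# if the machine can reach Hugging Face, the plain model id works here instead.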
MODEL_PATH = "/home/ubuntu/.cache/huggingface/hub/models--THUDM--GLM-4.1V-9B-Thinking/snapshots/def1e4472aaf5617c7c696785ff36d67c5e6d058/"
stop_generation = False
processor = None
model = None
def load_model():
global processor, model
processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
model = Glm4vForConditionalGeneration.from_pretrained(
MODEL_PATH,
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="sdpa",
)
class GLM4VModel:
def __init__(self):
pass
def _strip_html(self, t):
return re.sub(r"<[^>]+>", "", t).strip()
def _wrap_text(self, t):
return [{"type": "text", "text": t}]
def _pdf_to_imgs(self, pdf_path):
doc = fitz.open(pdf_path)
imgs = []
for i in range(doc.page_count):
pix = doc.load_page(i).get_pixmap(dpi=180)
img_p = os.path.join(
tempfile.gettempdir(), f"{Path(pdf_path).stem}_{i}.png"
)
pix.save(img_p)
imgs.append(img_p)
doc.close()
return imgs
def _ppt_to_imgs(self, ppt_path):
tmp = tempfile.mkdtemp()
subprocess.run(
[
"libreoffice",
"--headless",
"--convert-to",
"pdf",
"--outdir",
tmp,
ppt_path,
],
check=True,
)
pdf_path = os.path.join(tmp, Path(ppt_path).stem + ".pdf")
return self._pdf_to_imgs(pdf_path)
def _files_to_content(self, media):
out = []
for f in media or []:
ext = Path(f.name).suffix.lower()
if ext in [
".mp4",
".avi",
".mkv",
".mov",
".wmv",
".flv",
".webm",
".mpeg",
".m4v",
]:
out.append({"type": "video", "url": f.name})
elif ext in [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp"]:
out.append({"type": "image", "url": f.name})
elif ext in [".ppt", ".pptx"]:
for p in self._ppt_to_imgs(f.name):
out.append({"type": "image", "url": p})
elif ext == ".pdf":
for p in self._pdf_to_imgs(f.name):
out.append({"type": "image", "url": p})
return out
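    # Renders the model's streamed <think>/<answer> tags as HTML for the chat display.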
def _stream_fragment(self, buf: str) -> str:
think_html = ""
if "<think>" in buf:
if "</think>" in buf:
seg = re.search(r"<think>(.*?)</think>", buf, re.DOTALL)
if seg:
think_html = (
"<details open><summary style='cursor:pointer;font-weight:bold;color:#bbbbbb;'>💭 Thinking</summary>"
"<div style='color:#cccccc;line-height:1.4;padding:10px;border-left:3px solid #666;margin:5px 0;background-color:rgba(128,128,128,0.1);'>"
+ seg.group(1).strip().replace("\n", "<br>")
+ "</div></details>"
)
else:
part = buf.split("<think>", 1)[1]
think_html = (
"<details open><summary style='cursor:pointer;font-weight:bold;color:#bbbbbb;'>💭 Thinking</summary>"
"<div style='color:#cccccc;line-height:1.4;padding:10px;border-left:3px solid #666;margin:5px 0;background-color:rgba(128,128,128,0.1);'>"
+ part.replace("\n", "<br>")
+ "</div></details>"
)
answer_html = ""
if "<answer>" in buf:
if "</answer>" in buf:
seg = re.search(r"<answer>(.*?)</answer>", buf, re.DOTALL)
if seg:
answer_html = seg.group(1).strip()
else:
answer_html = buf.split("<answer>", 1)[1]
if not think_html and not answer_html:
return self._strip_html(buf)
return think_html + answer_html
def _build_messages(self, raw_hist, sys_prompt):
msgs = []
if sys_prompt.strip():
msgs.append(
{
"role": "system",
"content": [{"type": "text", "text": sys_prompt.strip()}],
}
)
for h in raw_hist:
if h["role"] == "user":
msgs.append({"role": "user", "content": h["content"]})
else:
raw = h["content"]
raw = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
raw = re.sub(r"<details.*?</details>", "", raw, flags=re.DOTALL)
clean = self._strip_html(raw).strip()
msgs.append({"role": "assistant", "content": self._wrap_text(clean)})
return msgs
@spaces.GPU(duration=240)
def stream_generate(self, raw_hist, sys_prompt):
global stop_generation, processor, model
stop_generation = False
msgs = self._build_messages(raw_hist, sys_prompt)
inputs = processor.apply_chat_template(
msgs,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
padding=True,
).to(model.device)
streamer = TextIteratorStreamer(
processor.tokenizer, skip_prompt=True, skip_special_tokens=False
)
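        # Generation settings from the upstream demo; with top_k=2 and a near-zero top_p,
        # decoding is effectively close to greedy.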
gen_args = dict(
inputs,
max_new_tokens=8192,
repetition_penalty=1.1,
do_sample=True,
top_k=2,
temperature=None,
top_p=1e-5,
streamer=streamer,
)
generation_thread = threading.Thread(target=model.generate, kwargs=gen_args)
generation_thread.start()
buf = ""
for tok in streamer:
if stop_generation:
break
buf += tok
yield self._stream_fragment(buf)
generation_thread.join()
def format_display_content(content):
if isinstance(content, list):
text_parts = []
file_count = 0
for item in content:
if item["type"] == "text":
text_parts.append(item["text"])
else:
file_count += 1
display_text = " ".join(text_parts)
if file_count > 0:
return f"[{file_count} file(s) uploaded]\n{display_text}"
return display_text
return content
def create_display_history(raw_hist):
display_hist = []
for h in raw_hist:
if h["role"] == "user":
display_content = format_display_content(h["content"])
display_hist.append({"role": "user", "content": display_content})
else:
display_hist.append({"role": "assistant", "content": h["content"]})
return display_hist
# Load the model and processor
load_model()
glm4v = GLM4VModel()
def check_files(files):
vids = imgs = ppts = pdfs = 0
for f in files or []:
ext = Path(f.name).suffix.lower()
if ext in [
".mp4",
".avi",
".mkv",
".mov",
".wmv",
".flv",
".webm",
".mpeg",
".m4v",
]:
vids += 1
elif ext in [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp"]:
imgs += 1
elif ext in [".ppt", ".pptx"]:
ppts += 1
elif ext == ".pdf":
pdfs += 1
if vids > 1 or ppts > 1 or pdfs > 1:
return False, "Only one video or one PPT or one PDF allowed"
if imgs > 10:
return False, "Maximum 10 images allowed"
if (ppts or pdfs) and (vids or imgs) or (vids and imgs):
return False, "Cannot mix documents, videos, and images"
return True, ""
def chat(files, msg, raw_hist, sys_prompt):
global stop_generation
stop_generation = False
ok, err = check_files(files)
if not ok:
raw_hist.append({"role": "assistant", "content": err})
display_hist = create_display_history(raw_hist)
yield display_hist, copy.deepcopy(raw_hist), None, ""
return
payload = glm4v._files_to_content(files) if files else None
if msg.strip():
if payload is None:
payload = glm4v._wrap_text(msg.strip())
else:
payload.append({"type": "text", "text": msg.strip()})
user_rec = {"role": "user", "content": payload if payload else msg.strip()}
if raw_hist is None:
raw_hist = []
raw_hist.append(user_rec)
place = {"role": "assistant", "content": ""}
raw_hist.append(place)
display_hist = create_display_history(raw_hist)
yield display_hist, copy.deepcopy(raw_hist), None, ""
for chunk in glm4v.stream_generate(raw_hist[:-1], sys_prompt):
if stop_generation:
break
place["content"] = chunk
display_hist = create_display_history(raw_hist)
yield display_hist, copy.deepcopy(raw_hist), None, ""
display_hist = create_display_history(raw_hist)
yield display_hist, copy.deepcopy(raw_hist), None, ""
def reset():
global stop_generation
stop_generation = True
time.sleep(0.1)
return [], [], None, ""
css = """.chatbot-container .message-wrap .message{font-size:14px!important}
details summary{cursor:pointer;font-weight:bold}
details[open] summary{margin-bottom:10px}"""
demo = gr.Blocks(title="GLM-4.1V Chat", theme=gr.themes.Soft(), css=css)
with demo:
gr.Markdown("""
<div style="text-align: center; font-size: 32px; font-weight: bold; margin-bottom: 20px;">
GLM-4.1V-9B-Thinking Gradio Space🤗
</div>
<div style="text-align: center;">
<a href="https://huggingface.co/THUDM/GLM-4.1V-9B-Thinking">🤗 Model Hub</a> |
<a href="https://github.com/THUDM/GLM-4.1V-Thinking">🌐 Github</a>
</div>
""")
raw_history = gr.State([])
with gr.Row():
with gr.Column(scale=7):
chatbox = gr.Chatbot(
label="Conversation",
type="messages",
height=800,
elem_classes="chatbot-container",
)
textbox = gr.Textbox(label="💭 Message")
with gr.Row():
send = gr.Button("Send", variant="primary")
clear = gr.Button("Clear")
with gr.Column(scale=3):
up = gr.File(
label="📁 Upload",
file_count="multiple",
file_types=["file"],
type="filepath",
)
gr.Markdown("Supports images / videos / PPT / PDF")
gr.Markdown(
"The maximum supported input is 10 images or 1 video/PPT/PDF. During the conversation, video and images cannot be present at the same time."
)
sys = gr.Textbox(label="⚙️ System Prompt", lines=6)
gr.on(
triggers=[send.click, textbox.submit],
fn=chat,
inputs=[up, textbox, raw_history, sys],
outputs=[chatbox, raw_history, up, textbox],
)
clear.click(reset, outputs=[chatbox, raw_history, up, textbox])
if __name__ == "__main__":
demo.launch(
server_name=args.server_name,
server_port=args.server_port,
share=False,
mcp_server=args.mcp_server,
inbrowser=True,
)
Change into the directory containing this script and launch it:
cd /home/ubuntu/GLM-4.1V-Thinking/inference
python trans_infer_gradio.py
Once it is up, the console prints the Gradio address, and the chat interface can be opened in a browser at port 7860 on the server.