GLM-4.1V-9B-Thinking Local Deployment

Published: 2025-07-10

The project required a locally deployed vision-language (VL) model for visual recognition. After an initial survey comparing several open-source models, including Qwen2.5-VL and InternVL3, GLM-4.1V-9B-Thinking stood out as the strongest performer.

So I took the GitHub repo and set about deploying it. After several twists and turns, both vLLM inference and the Gradio demo are now running successfully; the process is recorded below.

1. Official repository

https://github.com/THUDM/GLM-4.1V-Thinking (GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning)

A gripe: the official repo does not document the environment required for deployment. I wasted a day here, reading through many issues and other users' reported error environments, before it became clear that the CUDA and cuDNN versions matter most, with the GPU itself just as important; these three were the pitfalls. The environment that worked for me is listed below so you can skip the detour.

Along the way I tried CUDA 12.8 with a cuDNN build for CUDA 12 and hit all kinds of problems; on the GPU side I tried V100s, and even two 32 GB cards failed to start.

Server OS: Ubuntu Server 22.04 LTS 64-bit (Tencent Cloud)
GPU driver: 535.216.0
CUDA version: 12.4.1
cuDNN version: 9.5.1
GPU: A100, 40 GB VRAM (GPU compute instance GT4 | GT4.4XLARGE96)

Before cloning the code from the repository above, a friendly reminder: when purchasing a cloud server, it is best to choose the Hong Kong region, since pulling many Hugging Face resources requires external network access.

2. Create a virtual environment. Create it manually on the server and activate it; run all subsequent steps inside the venv to avoid conflicts with system packages.

python -m venv glm4
source glm4/bin/activate

3. Download the code

git clone https://github.com/THUDM/GLM-4.1V-Thinking.git

4. Install all dependencies

Before installing, edit the requirements file in the downloaded code; installing it as-is will keep failing, because the pinned vLLM and transformers versions are custom builds. This step cost me half a day, and I finally solved it through this issue: 关于环境安装的若干问题 / Several Issues Regarding Environment Setup (https://github.com/THUDM/GLM-4.1V-Thinking/issues/18).

Modify requirements.txt as follows:

setuptools>=80.9.0
setuptools_scm>=8.3.1
git+https://github.com/huggingface/transformers.git@91221da2f1f68df9eb97c980a7206b14c4d3a9b0
git+https://github.com/vllm-project/vllm.git@220aee902a291209f2975d4cd02dadcc6749ffe6
torchvision>=0.22.0
gradio>=5.35.0
pre-commit>=4.2.0
PyMuPDF>=1.26.1
av>=14.4.0
accelerate>=1.6.0
spaces>=0.37.1

Now install the dependencies:

cd GLM-4.1V-Thinking
pip install -r requirements.txt

The installation takes roughly half an hour, with most of the time spent building the pinned vLLM and transformers versions. Once it finishes, the installed package versions look like this:

(glm4) ubuntu@VM-0-15-ubuntu:~$ pip list
Package                           Version
--------------------------------- ----------------------------
accelerate                        1.8.1
aiofiles                          24.1.0
aiohappyeyeballs                  2.6.1
aiohttp                           3.12.13
aiosignal                         1.4.0
airportsdata                      20250706
annotated-types                   0.7.0
anyio                             4.9.0
astor                             0.8.1
async-timeout                     5.0.1
attrs                             25.3.0
av                                15.0.0
blake3                            1.0.5
cachetools                        6.1.0
certifi                           2025.6.15
cfgv                              3.4.0
charset-normalizer                3.4.2
click                             8.2.1
cloudpickle                       3.1.1
compressed-tensors                0.10.2
cupy-cuda12x                      13.4.1
depyf                             0.18.0
dill                              0.4.0
diskcache                         5.6.3
distlib                           0.3.9
distro                            1.9.0
dnspython                         2.7.0
einops                            0.8.1
email_validator                   2.2.0
exceptiongroup                    1.3.0
fastapi                           0.116.0
fastapi-cli                       0.0.8
fastapi-cloud-cli                 0.1.1
fastrlock                         0.8.3
ffmpy                             0.6.0
filelock                          3.18.0
frozenlist                        1.7.0
fsspec                            2025.5.1
gguf                              0.17.1
gradio                            5.35.0
gradio_client                     1.10.4
groovy                            0.1.2
h11                               0.16.0
hf-xet                            1.1.5
httpcore                          1.0.9
httptools                         0.6.4
httpx                             0.27.2
huggingface-hub                   0.33.2
identify                          2.6.12
idna                              3.10
interegular                       0.3.3
Jinja2                            3.1.6
jiter                             0.10.0
jsonschema                        4.24.0
jsonschema-specifications         2025.4.1
lark                              1.2.2
llguidance                        0.7.30
llvmlite                          0.44.0
lm-format-enforcer                0.10.11
markdown-it-py                    3.0.0
MarkupSafe                        3.0.2
mdurl                             0.1.2
mistral_common                    1.6.3
mpmath                            1.3.0
msgpack                           1.1.1
msgspec                           0.19.0
multidict                         6.6.3
nest-asyncio                      1.6.0
networkx                          3.4.2
ninja                             1.11.1.4
nodeenv                           1.9.1
numba                             0.61.2
numpy                             2.2.6
nvidia-cublas-cu12                12.6.4.1
nvidia-cuda-cupti-cu12            12.6.80
nvidia-cuda-nvrtc-cu12            12.6.77
nvidia-cuda-runtime-cu12          12.6.77
nvidia-cudnn-cu12                 9.5.1.17
nvidia-cufft-cu12                 11.3.0.4
nvidia-cufile-cu12                1.11.1.6
nvidia-curand-cu12                10.3.7.77
nvidia-cusolver-cu12              11.7.1.2
nvidia-cusparse-cu12              12.5.4.2
nvidia-cusparselt-cu12            0.6.3
nvidia-nccl-cu12                  2.26.2
nvidia-nvjitlink-cu12             12.6.85
nvidia-nvtx-cu12                  12.6.77
openai                            1.93.1
opencv-python-headless            4.12.0.88
orjson                            3.10.18
outlines                          0.1.11
outlines_core                     0.1.26
packaging                         25.0
pandas                            2.3.1
partial-json-parser               0.2.1.1.post6
pillow                            11.3.0
pip                               22.0.2
platformdirs                      4.3.8
pre_commit                        4.2.0
prometheus_client                 0.22.1
prometheus-fastapi-instrumentator 7.1.0
propcache                         0.3.2
protobuf                          6.31.1
psutil                            5.9.8
py-cpuinfo                        9.0.0
pybase64                          1.4.1
pycountry                         24.6.1
pydantic                          2.11.7
pydantic_core                     2.33.2
pydub                             0.25.1
Pygments                          2.19.2
PyMuPDF                           1.26.3
python-dateutil                   2.9.0.post0
python-dotenv                     1.1.1
python-json-logger                3.3.0
python-multipart                  0.0.20
pytz                              2025.2
PyYAML                            6.0.2
pyzmq                             27.0.0
ray                               2.47.1
referencing                       0.36.2
regex                             2024.11.6
requests                          2.32.4
rich                              14.0.0
rich-toolkit                      0.14.8
rignore                           0.5.1
rpds-py                           0.26.0
ruff                              0.12.2
safehttpx                         0.1.6
safetensors                       0.5.3
scipy                             1.15.3
semantic-version                  2.10.0
sentencepiece                     0.2.0
sentry-sdk                        2.32.0
setuptools                        80.9.0
setuptools-scm                    8.3.1
shellingham                       1.5.4
six                               1.17.0
sniffio                           1.3.1
spaces                            0.37.1
starlette                         0.46.2
sympy                             1.14.0
tiktoken                          0.9.0
tokenizers                        0.21.2
tomli                             2.2.1
tomlkit                           0.13.3
torch                             2.7.0
torchaudio                        2.7.0
torchvision                       0.22.0
tqdm                              4.67.1
transformers                      4.54.0.dev0
triton                            3.3.0
typer                             0.16.0
typing_extensions                 4.14.1
typing-inspection                 0.4.1
tzdata                            2025.2
urllib3                           2.5.0
uvicorn                           0.35.0
uvloop                            0.21.0
virtualenv                        20.31.2
vllm                              0.9.2.dev398+g220aee90.cu124
watchfiles                        1.1.0
websockets                        15.0.1
xformers                          0.0.30
xgrammar                          0.1.19
yarl                              1.20.1
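
Before launching, a quick sanity check from inside the venv confirms that PyTorch sees the GPU. A minimal sketch using only standard PyTorch calls:

import torch

# PyTorch build and the CUDA version it was compiled against
print(torch.__version__, torch.version.cuda)
# cuDNN version bundled with the wheel
print(torch.backends.cudnn.version())
# Confirm a CUDA device is visible and check its name (expect the A100 here)
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))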

With the base environment in place, you can try starting the service. This tripped me up repeatedly: launching the model needs a GPU with 40 GB of memory. I naively assumed two V100s (32 GB each) would work, but startup failed even in multi-GPU configurations; it finally launched on a single A100. Launch command:

vllm serve THUDM/GLM-4.1V-9B-Thinking --limit-mm-per-prompt '{"image":32}' --allowed-local-media-path /

The --limit-mm-per-prompt '{"image":32}' flag allows up to 32 images per prompt, and --allowed-local-media-path / lets the server read media from local file paths. Startup spends about 5 minutes loading the model, warming CUDA caches, and so on; once it is up you can call the API for inference.
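
Once the server is up, it exposes vLLM's OpenAI-compatible API on port 8000 by default. A minimal sketch of a call; the port and the image URL here are assumptions, adjust to your setup:

from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the api_key value is ignored but required
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="THUDM/GLM-4.1V-9B-Thinking",
    messages=[{
        "role": "user",
        "content": [
            # hypothetical image URL, for illustration only
            {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=512,
)
print(resp.choices[0].message.content)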

Later I moved to the repo's Gradio launch script. The model path in it must be edited by hand: with the server in mainland China, the model files cannot be pulled from Hugging Face at runtime, so a local model path is used instead. The modified launch file follows.

The key change is this line, which points MODEL_PATH at the locally cached model files:

MODEL_PATH = "/home/ubuntu/.cache/huggingface/hub/models--THUDM--GLM-4.1V-9B-Thinking/snapshots/def1e4472aaf5617c7c696785ff36d67c5e6d058/"
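
If you are unsure of the exact snapshot path in your cache, huggingface_hub can resolve it. A small helper sketch; on a server without external access this only succeeds if the model is already fully cached, hence local_files_only=True:

from huggingface_hub import snapshot_download

# Resolves the local snapshot directory without attempting a download
path = snapshot_download("THUDM/GLM-4.1V-9B-Thinking", local_files_only=True)
print(path)  # use this value for MODEL_PATH in the script below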

import argparse
import copy
import os
import re
import subprocess
import tempfile
import threading
import time
from pathlib import Path

import fitz
import gradio as gr
import spaces
import torch
from transformers import (
    AutoProcessor,
    Glm4vForConditionalGeneration,
    TextIteratorStreamer,
)

parser = argparse.ArgumentParser()
parser.add_argument(
    "--server_name",
    type=str,
    default="0.0.0.0",
    help="IP address, LAN access changed to 0.0.0.0",
)
parser.add_argument("--server_port", type=int, default=7860, help="Use Port")
parser.add_argument("--share", action="store_true", help="Enable gradio sharing")
parser.add_argument("--mcp_server", action="store_true", help="Enable mcp service")
args = parser.parse_args()

MODEL_PATH = "/home/ubuntu/.cache/huggingface/hub/models--THUDM--GLM-4.1V-9B-Thinking/snapshots/def1e4472aaf5617c7c696785ff36d67c5e6d058/"
stop_generation = False
processor = None
model = None


def load_model():
    global processor, model
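    # Load the processor and weights once at startup; bfloat16 plus
    # device_map="auto" lets accelerate place the model across available GPUs.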
    processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
    model = Glm4vForConditionalGeneration.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        attn_implementation="sdpa",
    )


class GLM4VModel:
    def __init__(self):
        pass

    def _strip_html(self, t):
        return re.sub(r"<[^>]+>", "", t).strip()

    def _wrap_text(self, t):
        return [{"type": "text", "text": t}]

    def _pdf_to_imgs(self, pdf_path):
        doc = fitz.open(pdf_path)
        imgs = []
        for i in range(doc.page_count):
            pix = doc.load_page(i).get_pixmap(dpi=180)
            img_p = os.path.join(
                tempfile.gettempdir(), f"{Path(pdf_path).stem}_{i}.png"
            )
            pix.save(img_p)
            imgs.append(img_p)
        doc.close()
        return imgs

    def _ppt_to_imgs(self, ppt_path):
        tmp = tempfile.mkdtemp()
        subprocess.run(
            [
                "libreoffice",
                "--headless",
                "--convert-to",
                "pdf",
                "--outdir",
                tmp,
                ppt_path,
            ],
            check=True,
        )
        pdf_path = os.path.join(tmp, Path(ppt_path).stem + ".pdf")
        return self._pdf_to_imgs(pdf_path)

    def _files_to_content(self, media):
        out = []
        for f in media or []:
            ext = Path(f.name).suffix.lower()
            if ext in [
                ".mp4",
                ".avi",
                ".mkv",
                ".mov",
                ".wmv",
                ".flv",
                ".webm",
                ".mpeg",
                ".m4v",
            ]:
                out.append({"type": "video", "url": f.name})
            elif ext in [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp"]:
                out.append({"type": "image", "url": f.name})
            elif ext in [".ppt", ".pptx"]:
                for p in self._ppt_to_imgs(f.name):
                    out.append({"type": "image", "url": p})
            elif ext == ".pdf":
                for p in self._pdf_to_imgs(f.name):
                    out.append({"type": "image", "url": p})
        return out

    def _stream_fragment(self, buf: str) -> str:
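        # The model wraps its reasoning in <think>...</think> and the final
        # reply in <answer>...</answer>; render each part as display HTML,
        # falling back to plain stripped text if neither tag is present.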
        think_html = ""
        if "<think>" in buf:
            if "</think>" in buf:
                seg = re.search(r"<think>(.*?)</think>", buf, re.DOTALL)
                if seg:
                    think_html = (
                        "<details open><summary style='cursor:pointer;font-weight:bold;color:#bbbbbb;'>💭 Thinking</summary>"
                        "<div style='color:#cccccc;line-height:1.4;padding:10px;border-left:3px solid #666;margin:5px 0;background-color:rgba(128,128,128,0.1);'>"
                        + seg.group(1).strip().replace("\n", "<br>")
                        + "</div></details>"
                    )
            else:
                part = buf.split("<think>", 1)[1]
                think_html = (
                    "<details open><summary style='cursor:pointer;font-weight:bold;color:#bbbbbb;'>💭 Thinking</summary>"
                    "<div style='color:#cccccc;line-height:1.4;padding:10px;border-left:3px solid #666;margin:5px 0;background-color:rgba(128,128,128,0.1);'>"
                    + part.replace("\n", "<br>")
                    + "</div></details>"
                )

        answer_html = ""
        if "<answer>" in buf:
            if "</answer>" in buf:
                seg = re.search(r"<answer>(.*?)</answer>", buf, re.DOTALL)
                if seg:
                    answer_html = seg.group(1).strip()
            else:
                answer_html = buf.split("<answer>", 1)[1]

        if not think_html and not answer_html:
            return self._strip_html(buf)
        return think_html + answer_html

    def _build_messages(self, raw_hist, sys_prompt):
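        # Rebuild the conversation for the chat template, stripping <think>
        # blocks and rendered HTML from earlier assistant turns.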
        msgs = []

        if sys_prompt.strip():
            msgs.append(
                {
                    "role": "system",
                    "content": [{"type": "text", "text": sys_prompt.strip()}],
                }
            )

        for h in raw_hist:
            if h["role"] == "user":
                msgs.append({"role": "user", "content": h["content"]})
            else:
                raw = h["content"]
                raw = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
                raw = re.sub(r"<details.*?</details>", "", raw, flags=re.DOTALL)
                clean = self._strip_html(raw).strip()
                msgs.append({"role": "assistant", "content": self._wrap_text(clean)})
        return msgs

    @spaces.GPU(duration=240)
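    # (the spaces GPU decorator only takes effect on Hugging Face Spaces;
    # elsewhere it is a no-op)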
    def stream_generate(self, raw_hist, sys_prompt):
        global stop_generation, processor, model
        stop_generation = False
        msgs = self._build_messages(raw_hist, sys_prompt)
        inputs = processor.apply_chat_template(
            msgs,
            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt",
            padding=True,
        ).to(model.device)

        streamer = TextIteratorStreamer(
            processor.tokenizer, skip_prompt=True, skip_special_tokens=False
        )
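        # Near-greedy decoding: top_k=2 combined with a tiny top_p keeps
        # essentially only the most probable token even with do_sample=True.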
        gen_args = dict(
            inputs,
            max_new_tokens=8192,
            repetition_penalty=1.1,
            do_sample=True,
            top_k=2,
            temperature=None,
            top_p=1e-5,
            streamer=streamer,
        )

        generation_thread = threading.Thread(target=model.generate, kwargs=gen_args)
        generation_thread.start()

        buf = ""
        for tok in streamer:
            if stop_generation:
                break
            buf += tok
            yield self._stream_fragment(buf)

        generation_thread.join()


def format_display_content(content):
    if isinstance(content, list):
        text_parts = []
        file_count = 0
        for item in content:
            if item["type"] == "text":
                text_parts.append(item["text"])
            else:
                file_count += 1

        display_text = " ".join(text_parts)
        if file_count > 0:
            return f"[{file_count} file(s) uploaded]\n{display_text}"
        return display_text
    return content


def create_display_history(raw_hist):
    display_hist = []
    for h in raw_hist:
        if h["role"] == "user":
            display_content = format_display_content(h["content"])
            display_hist.append({"role": "user", "content": display_content})
        else:
            display_hist.append({"role": "assistant", "content": h["content"]})
    return display_hist


# Load the model and processor
load_model()
glm4v = GLM4VModel()


def check_files(files):
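    # Enforce upload limits: at most one video, PPT, or PDF; at most 10 images;
    # documents, videos, and images cannot be mixed in one message.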
    vids = imgs = ppts = pdfs = 0
    for f in files or []:
        ext = Path(f.name).suffix.lower()
        if ext in [
            ".mp4",
            ".avi",
            ".mkv",
            ".mov",
            ".wmv",
            ".flv",
            ".webm",
            ".mpeg",
            ".m4v",
        ]:
            vids += 1
        elif ext in [".jpg", ".jpeg", ".png", ".gif", ".bmp", ".tiff", ".webp"]:
            imgs += 1
        elif ext in [".ppt", ".pptx"]:
            ppts += 1
        elif ext == ".pdf":
            pdfs += 1
    if vids > 1 or ppts > 1 or pdfs > 1:
        return False, "Only one video or one PPT or one PDF allowed"
    if imgs > 10:
        return False, "Maximum 10 images allowed"
    if (ppts or pdfs) and (vids or imgs) or (vids and imgs):
        return False, "Cannot mix documents, videos, and images"
    return True, ""


def chat(files, msg, raw_hist, sys_prompt):
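    # Streaming chat handler: validate uploads, append the user turn plus an
    # empty assistant placeholder, then fill the placeholder as tokens stream in.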
    global stop_generation
    stop_generation = False

    ok, err = check_files(files)
    if not ok:
        raw_hist.append({"role": "assistant", "content": err})
        display_hist = create_display_history(raw_hist)
        yield display_hist, copy.deepcopy(raw_hist), None, ""
        return

    payload = glm4v._files_to_content(files) if files else None
    if msg.strip():
        if payload is None:
            payload = glm4v._wrap_text(msg.strip())
        else:
            payload.append({"type": "text", "text": msg.strip()})

    user_rec = {"role": "user", "content": payload if payload else msg.strip()}
    if raw_hist is None:
        raw_hist = []
    raw_hist.append(user_rec)

    place = {"role": "assistant", "content": ""}
    raw_hist.append(place)

    display_hist = create_display_history(raw_hist)
    yield display_hist, copy.deepcopy(raw_hist), None, ""

    for chunk in glm4v.stream_generate(raw_hist[:-1], sys_prompt):
        if stop_generation:
            break
        place["content"] = chunk
        display_hist = create_display_history(raw_hist)
        yield display_hist, copy.deepcopy(raw_hist), None, ""

    display_hist = create_display_history(raw_hist)
    yield display_hist, copy.deepcopy(raw_hist), None, ""


def reset():
    global stop_generation
    stop_generation = True
    time.sleep(0.1)
    return [], [], None, ""


css = """.chatbot-container .message-wrap .message{font-size:14px!important}
details summary{cursor:pointer;font-weight:bold}
details[open] summary{margin-bottom:10px}"""

demo = gr.Blocks(title="GLM-4.1V Chat", theme=gr.themes.Soft(), css=css)
with demo:
    gr.Markdown("""
               <div style="text-align: center; font-size: 32px; font-weight: bold; margin-bottom: 20px;">
                   GLM-4.1V-9B-Thinking Gradio Space🤗
                </div>
               <div style="text-align: center;">
               <a href="https://huggingface.co/THUDM/GLM-4.1V-9B-Thinking">🤗 Model Hub</a> | 
               <a href="https://github.com/THUDM/GLM-4.1V-Thinking">🌐 Github</a> 
                </div>
                """)

    raw_history = gr.State([])

    with gr.Row():
        with gr.Column(scale=7):
            chatbox = gr.Chatbot(
                label="Conversation",
                type="messages",
                height=800,
                elem_classes="chatbot-container",
            )
            textbox = gr.Textbox(label="💭 Message")
            with gr.Row():
                send = gr.Button("Send", variant="primary")
                clear = gr.Button("Clear")
        with gr.Column(scale=3):
            up = gr.File(
                label="📁 Upload",
                file_count="multiple",
                file_types=["file"],
                type="filepath",
            )
            gr.Markdown("Supports images / videos / PPT / PDF")
            gr.Markdown(
                "The maximum supported input is 10 images or 1 video/PPT/PDF. During the conversation, video and images cannot be present at the same time."
            )
            sys = gr.Textbox(label="⚙️ System Prompt", lines=6)

    gr.on(
        triggers=[send.click, textbox.submit],
        fn=chat,
        inputs=[up, textbox, raw_history, sys],
        outputs=[chatbox, raw_history, up, textbox],
    )
    clear.click(reset, outputs=[chatbox, raw_history, up, textbox])

if __name__ == "__main__":
    demo.launch(
        server_name=args.server_name,
        server_port=args.server_port,
        share=args.share,
        mcp_server=args.mcp_server,
        inbrowser=True,
    )

After swapping in this file, launch with:

cd /home/ubuntu/GLM-4.1V-Thinking/inference
python trans_infer_gradio.py

Once the script starts, the console shows the Gradio startup log; open http://<server_ip>:7860 (the default port) in a browser to reach the chat UI.

