An Evaluation of the Qwen2.5-VL Vision-Language Model for Image Understanding - EW帮帮网

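This article evaluates multimodal large models for image understanding and image-to-text generation, focusing on the Qwen2.5-VL series of Qwen (千问) models.

Given the GPU size and memory limits of our server, we deploy and test the smaller models in the series.

Qwen/Qwen2.5-VL-3B-Instruct: 3 billion parameters, about 7.5 GB of memory;

Qwen/Qwen2.5-VL-7B-Instruct: 7 billion parameters, about 17 GB of memory;

and so on.

We choose Qwen/Qwen2.5-VL-3B-Instruct, which has the fewest parameters and the smallest memory footprint, for deployment and testing.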
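
As a rough sanity check on the memory figures above, weight memory can be approximated as parameter count times bytes per parameter. The sketch below is only a back-of-the-envelope estimate: the function name is hypothetical, and the 3e9 and 7e9 counts are the nominal sizes taken from the model names. Measured usage will be higher, since the nominal counts are rounded and activations, the KV cache, and framework overhead are not included.

```python
def estimate_weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter counts (assumed from the model names, not measured).
for name, params in [("Qwen2.5-VL-3B-Instruct", 3e9), ("Qwen2.5-VL-7B-Instruct", 7e9)]:
    fp16 = estimate_weight_memory_gb(params, bytes_per_param=2)  # float16: 2 bytes each
    fp32 = estimate_weight_memory_gb(params, bytes_per_param=4)  # float32: 4 bytes each
    print(f"{name}: ~{fp16:.0f} GB in float16, ~{fp32:.0f} GB in float32")
```

The gap between the two precisions is the main reason the server in section 2.1 loads the model in float16.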

The models can be browsed and downloaded from the ModelScope community (魔搭社区).

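1. Introduction to Qwen2.5-VL-3B-Instruct

Key enhancements:

Visual understanding: Qwen2.5-VL is not only good at recognizing common objects such as flowers, birds, fish, and insects, it is also very capable of analyzing text, charts, icons, graphics, and layouts within images.
Agentic behavior: Qwen2.5-VL can act directly as a visual agent that reasons and dynamically directs tool use, including operating a computer or a phone.
Understanding long videos and capturing events: Qwen2.5-VL can understand videos longer than one hour, and it adds the ability to capture events by pinpointing the relevant video segments.
Visual localization in multiple formats: Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it provides stable JSON output for coordinates and attributes.
Structured output: for data such as scanned invoices and tables, Qwen2.5-VL supports structured output of their contents, which is useful in finance, commerce, and similar fields.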
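
To make the localization point concrete, here is a minimal sketch of how a client might parse such JSON output. It assumes the model returns a JSON list of objects with `bbox_2d` and `label` keys, possibly wrapped in a Markdown code fence; that format is an assumption and should be checked against real model output. The helper name `parse_grounding_output` and the sample string are hypothetical.

```python
import json
import re

# Markdown code-fence marker (three backticks), built indirectly for readability.
FENCE = "`" * 3


def parse_grounding_output(text: str) -> list:
    """Parse a JSON list of {"bbox_2d": [x1, y1, x2, y2], "label": str} items.

    The key names are an assumed output format, not a guaranteed one.
    """
    # If the payload is wrapped in a Markdown fence, keep only what is inside it.
    pattern = re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE)
    match = re.search(pattern, text, flags=re.DOTALL)
    payload = match.group(1) if match else text

    boxes = []
    for item in json.loads(payload):
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append({"label": item.get("label", ""), "box": (x1, y1, x2, y2)})
    return boxes


# Hand-written sample, not real model output.
sample = FENCE + 'json\n[{"bbox_2d": [10, 20, 110, 220], "label": "puppy"}]\n' + FENCE
print(parse_grounding_output(sample))
```

In practice the parse should be wrapped in error handling, since a generative model can return malformed JSON.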

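2. Quick inference deployment

In this scenario, Qwen2.5-VL-3B-Instruct processes an input image and returns a description of the image along with any text recognized in it. We also measure the memory the model uses and the time it takes to return a result. The implementation is as follows: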

2.1 server.py

from fastapi import FastAPI, File, UploadFile, HTTPException
from PIL import Image
import io
import uvicorn
import time
import torch
from modelscope import Qwen2_5_VLForConditionalGeneration, AutoProcessor, snapshot_download
from qwen_vl_utils import process_vision_info

app = FastAPI(title="Qwen2.5-VL-3B-Instruct API", description="Image description API")

# Load the model once at startup
@app.on_event("startup")
async def load_model():
    global model, processor
    try:
        # Download the model
        model_dir = snapshot_download('Qwen/Qwen2.5-VL-3B-Instruct')
        print(f"Model directory: {model_dir}")

        # Load the model and the processor
        model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            model_dir, torch_dtype=torch.float16, device_map="auto", low_cpu_mem_usage=True  # reduce CPU memory usage
        )
        processor = AutoProcessor.from_pretrained(model_dir)

        print("Model loaded successfully")
    except Exception as e:
        print(f"Failed to load model: {e}")
        # HTTPException only makes sense inside a request handler;
        # re-raise so that the server fails to start instead
        raise

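@app.post("/describe-image/", response_model=dict)
async def describe_image(file: UploadFile = File(...)):
    try:
        # Read the uploaded image
        contents = await file.read()
        image = Image.open(io.BytesIO(contents)).convert("RGB")

        # Validate the image size
        if min(image.size) < 32:
            raise ValueError(f"Image is too small: {image.size}")

        # Resize the image (note: a fixed 224x224 target does not preserve the aspect ratio)
        original_size = image.size
        image = image.resize((224, 224), Image.Resampling.BICUBIC)

        # Log image information
        print(f"File name: {file.filename}")
        print(f"Original image size: {original_size}")
        print(f"Resized image size: {image.size}")

        # Build the chat message
        messages = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": image
                    },
                    # Prompt kept in Chinese: "Describe the image in detail in Chinese.
                    # If the image contains text, recognize it and output it."
                    {"type": "text", "text": "使用中文详细描述图片.如果图片中有文字，请将文字识别并输出."},
                ],
            }
        ]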
        
        # Prepare the inputs
        start_time = time.time()

        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        image_inputs, video_inputs = process_vision_info(messages)
        # Check the image input
        if not image_inputs:
            raise ValueError("Failed to process the image input")

        inputs = processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt",
        )

        # Validate the token count on the processed inputs, so that the
        # expanded image tokens are counted along with the text tokens
        token_count = inputs.input_ids.shape[1]
        print(f"Token count: {token_count}")

        if token_count > 4096:  # adjust to the model's actual limit
            raise ValueError(f"Input token count ({token_count}) exceeds the model limit")

        # Move the inputs to the model's device (GPU)
        inputs = inputs.to(model.device)
        
        # Run inference
        with torch.no_grad():
            generated_ids = model.generate(**inputs, max_new_tokens=1024)

        # Strip the prompt tokens and decode only the newly generated ones
        generated_ids_trimmed = [
            out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )
        # Release cached GPU memory after inference
        torch.cuda.empty_cache()
        
        inference_time = time.time() - start_time

        return {
            "description": output_text[0],
            "inference_time": inference_time,
            "model": "Qwen2.5-VL-3B-Instruct"
        }
    
    except Exception as e:
        import traceback
        print(traceback.format_exc())
        raise HTTPException(status_code=500, detail=f"Error while processing the image: {str(e)}")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
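
The server resizes every image to a fixed 224x224, which distorts non-square images and may hurt text recognition. As a sketch of an alternative, the hypothetical helpers below scale an image to fit within a maximum side while keeping its aspect ratio, then estimate how many visual tokens the result would cost. They assume that Qwen2.5-VL uses 14-pixel patches merged 2x2, so that one visual token covers roughly a 28x28 pixel area; `max_side=448` is an arbitrary illustrative value, not a recommendation from the original article.

```python
def fit_size(width: int, height: int, max_side: int = 448, factor: int = 28) -> tuple:
    """Scale (width, height) to fit within max_side while keeping the aspect ratio.

    Each side is rounded to the nearest multiple of `factor` (at least one factor).
    max_side should itself be a multiple of `factor`.
    """
    scale = min(1.0, max_side / max(width, height))  # never upscale
    new_w = max(factor, round(width * scale / factor) * factor)
    new_h = max(factor, round(height * scale / factor) * factor)
    return new_w, new_h


def estimate_visual_tokens(width: int, height: int, factor: int = 28) -> int:
    """Rough visual token count, assuming one token per factor x factor pixel block."""
    return (width // factor) * (height // factor)


for size in [(224, 224), (640, 480), (1920, 1080)]:
    target = fit_size(*size)
    print(f"{size} -> {target}, ~{estimate_visual_tokens(*target)} visual tokens")
```

In the server this could replace the fixed target, for example `image = image.resize(fit_size(*image.size), Image.Resampling.BICUBIC)`, at the cost of more visual tokens and therefore more memory and latency per request.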

2.2 client.py

import requests
from pathlib import Path

# API endpoint
url = "http://localhost:8000/describe-image/"

# Path to the image file
image_path = "/home/ubuntu/dev/vl/qwen_test/000000004016.png"  # replace with your own image path
#image_path = "puppy.png"


# Send the request
with open(image_path, "rb") as f:
    files = {"file": f}
    response = requests.post(url, files=files)

# Handle the response
if response.status_code == 200:
    result = response.json()
    print("Image description:")
    print(result["description"])
    print("Elapsed:", result['inference_time'])
else:
    print(f"Request failed, status code: {response.status_code}")
    print(f"Error message: {response.text}")


# Process every image in a folder
# Image directory
image_dir = Path("./images/")

for image_path in image_dir.rglob('*'):
    if image_path.is_file() and image_path.suffix.lower() in ['.jpg', '.jpeg', '.png', '.bmp']:
        print(f"Processing image: {image_path.name}")

        # Send the request
        with open(image_path, "rb") as f:
            files = {"file": f}
            response = requests.post(url, files=files)

        # Handle the response
        if response.status_code == 200:
            result = response.json()
            print(f"  Image description: {result['description']}")
            print(f"  Inference time: {result['inference_time']:.2f}s")
        else:
            print(f"  Request failed, status code: {response.status_code}")
            print(f"  Error message: {response.text}")

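2.3 Performance comparison

The example code uses Qwen/Qwen2.5-VL-3B-Instruct. The model can be swapped for Qwen/Qwen2.5-VL-7B-Instruct, and the two sets of results can then be compared to choose between them based on output quality and resource consumption.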

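Model	Deployment	Memory (CPU/GPU)	Description quality	Response time	Production-ready
Qwen/Qwen2.5-VL-3B-Instruct	plain transformers deployment		fairly accurate	6 s	yes
Qwen/Qwen2.5-VL-3B-Instruct	plain transformers deployment with float16, the recommended acceleration and memory-saving settings, and image resizing	auto, 8-10 GB	fairly accurate	3-5 s	yes
Qwen/Qwen2.5-VL-7B-Instruct	plain transformers deployment with float16, the recommended acceleration and memory-saving settings, and image resizing	17 GB	slightly more accurate than 3B	3-5 s	yes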
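
The response times in the table are ranges observed over several requests. One way to make such a comparison repeatable is to collect the `inference_time` values returned by the API for each model and summarize them. The helper below is a hypothetical sketch, and the sample numbers are illustrative placeholders rather than measurements from this article.

```python
import statistics


def summarize_latencies(seconds: list) -> dict:
    """Summarize a list of per-request inference times, in seconds."""
    ordered = sorted(seconds)
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }


# Placeholder values only; in practice, append result["inference_time"] per request.
timings = {
    "Qwen2.5-VL-3B-Instruct": [3.0, 4.0, 5.0],
    "Qwen2.5-VL-7B-Instruct": [3.5, 4.0, 4.5],
}
for model_name, values in timings.items():
    print(model_name, summarize_latencies(values))
```

Running both models over the same image folder keeps the comparison fair, since latency depends on image content and output length.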

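2.4 Example results

Input image

Processing image: puppy.png

a. Model: Qwen2.5-VL-3B-Instruct

Image description: There is a puppy in the picture, standing on snowy ground. Its coat is a mix of black, brown, and white, and it looks very cute. Its ears are upright and its eyes are big and round, giving it a very curious look. The puppy's nose is wet, as if it has just come out of the snow. The background is snow-covered ground, with some trees and a wooden bench visible in the distance. The whole scene gives a peaceful and beautiful winter feeling.

The text in the image reads "a puppy playing in the snow", which translates into Chinese as "一只小狗在雪地里玩耍".

Inference time: 4.50s

b. Model: Qwen2.5-VL-7B-Instruct

Image description: This picture shows a puppy playing in the snow. The puppy's coat is mainly black and white, with some brown patches on its face. There is a thin layer of snow on its head, as if it has just climbed out of a snowdrift. Some trees and a wooden bench can be seen in the background, suggesting an outdoor park or garden. At the bottom of the picture there is a partially obscured line of text, of which "ed caption: a puppy playing in the snow" can be made out, meaning "带字幕：一只小狗在雪地里玩耍" ("captioned: a puppy playing in the snow").

Inference time: 4.18s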

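3. Conclusion

The Qwen2.5-VL vision-language model gives satisfactory results for image understanding and image-to-text generation. On a server with 2 GPUs it responds in 3-5 s, which is acceptable for production use. A model with more parameters handles complex image understanding somewhat better, but the choice should depend on your business scenario and available serving resources. This model performs better than the CLIP model from the previous article and is also more flexible to apply. So, where a generative vision-language model is an option, it should be preferred.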
