🔥 News
2025.06.26
🌟 We are proud to introduce Kwai Keye-VL, a cutting-edge multimodal large language model meticulously crafted by the Kwai Keye team at Kuaishou. As a core AI product within Kuaishou's advanced technology ecosystem, Keye excels at video understanding, visual perception, and reasoning tasks, setting new performance benchmarks. Our team is working tirelessly to push the boundaries of what is possible, so stay tuned for more exciting developments!
Quick Start
Below, we provide simple examples showing how to use Kwai Keye-VL with 🤗 Transformers.
The code of Kwai Keye-VL has been merged into the latest Hugging Face transformers library, and we recommend building from source with the following command:
pip install git+https://github.com/huggingface/transformers accelerate
We offer a toolkit to help you handle various types of visual input more conveniently, as if you were calling an API. It supports base64, URLs, and interleaved images and videos. You can install it with the following command:
# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install "keye-vl-utils[decord]==1.0.0"
If you are not using Linux, you may not be able to install decord from PyPI. In that case, you can use pip install keye-vl-utils, and the toolkit will automatically fall back to torchvision for video processing. You can still install decord from source to enable decord-based video loading.
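To check which backend will actually be used at runtime, a quick sanity check (plain Python, not part of the keye-vl-utils API) is simply to try importing decord:
# Not part of keye-vl-utils: a quick check of whether the faster decord backend is importable.
try:
    import decord  # noqa: F401
    print("decord available: video loading can use the decord path")
except ImportError:
    print("decord not available: keye-vl-utils will fall back to torchvision for video")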
Chat with 🤗 Transformers
The following code snippet shows how to use the chat model with transformers and keye_vl_utils:
Following Qwen3, we also provide a soft switch mechanism that lets users dynamically control the model's behavior: appending /think or /no_think to the user prompt, or adding no flag at all, switches the model's thinking mode.
from transformers import AutoModel, AutoTokenizer, AutoProcessor
from keye_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model_path = "Kwai-Keye/Keye-VL-8B-Preview"
model = AutoModel.from_pretrained(
model_path,
torch_dtype="auto",
device_map="auto",
trust_remote_code=True,
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = KeyeForConditionalGeneration.from_pretrained(
# "Kwai-Keye/Keye-VL-8B-Preview",
# torch_dtype=torch.bfloat16,
# attn_implementation="flash_attention_2",
# device_map="auto",
# )
# default processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained(model_path, min_pixels=min_pixels, max_pixels=max_pixels, trust_remote_code=True)
# Non-Thinking Mode
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://s1-11508.kwimgs.com/kos/nlav11508/mllm_all/ziran_jiafeimao_11.jpg",
},
{"type": "text", "text": "Describe this image./no_think"},
],
}
]
# Auto-Thinking Mode
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://s1-11508.kwimgs.com/kos/nlav11508/mllm_all/ziran_jiafeimao_11.jpg",
},
{"type": "text", "text": "Describe this image."},
],
}
]
# Thinking mode
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://s1-11508.kwimgs.com/kos/nlav11508/mllm_all/ziran_jiafeimao_11.jpg",
},
{"type": "text", "text": "Describe this image./think"},
],
}
]
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Video Inference
# Messages containing an image list as a video and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": [
"file:///path/to/frame1.jpg",
"file:///path/to/frame2.jpg",
"file:///path/to/frame3.jpg",
"file:///path/to/frame4.jpg",
],
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Messages containing a local video path and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "file:///path/to/video1.mp4",
"max_pixels": 360 * 420,
"fps": 1.0,
},
{"type": "text", "text": "Describe this video."},
],
}
]
# Messages containing a video url and a text query
messages = [
{
"role": "user",
"content": [
{
"type": "video",
"video": "http://s2-11508.kwimgs.com/kos/nlav11508/MLLM/videos_caption/98312843263.mp4",
},
{"type": "text", "text": "Describe this video."},
],
}
]
# In Keye-VL, frame rate information is also input into the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
**video_kwargs,
)
inputs = inputs.to("cuda")
# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
Batch Inference
# Sample messages for batch inference
messages1 = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/image1.jpg"},
{"type": "image", "image": "file:///path/to/image2.jpg"},
{"type": "text", "text": "What are the common elements in these pictures?"},
],
}
]
messages2 = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who are you?"},
]
# Combine messages for batch processing
messages = [messages1, messages2]
# Preparation for batch inference
texts = [
processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=texts,
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to("cuda")
# Batch Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)
More Usage Tips
For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
# You can directly insert a local file path, a URL, or a base64-encoded image into the position where you want in the text.
## Local file path
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "file:///path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Image URL
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "http://path/to/your/image.jpg"},
{"type": "text", "text": "Describe this image."},
],
}
]
## Base64 encoded image
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "data:image;base64,/9j/..."},
{"type": "text", "text": "Describe this image."},
],
}
]
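If you need to build such a base64 data URI from a local file yourself, a minimal sketch using only the Python standard library (the file path below is a placeholder) looks like this:
import base64

# Read a local image and wrap it in a data URI that can be passed as the "image" field.
with open("/path/to/your/image.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"data:image;base64,{encoded}"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]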
Image Resolution for Performance Boost
The model supports a wide range of input resolutions. By default, it processes inputs at their native resolution, but higher resolutions can improve performance at the cost of more computation. Users can set the minimum and maximum number of pixels (e.g., a token-count range of 256-1280) to balance speed and memory usage.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
"Kwai-Keye/Keye-VL-8B-Preview", min_pixels=min_pixels, max_pixels=max_pixels
)
In addition, we provide two methods for fine-grained control over the image size fed to the model:
- Define min_pixels and max_pixels: the image will be resized to maintain its aspect ratio within the min_pixels/max_pixels range.
- Specify exact dimensions: directly set resized_height and resized_width. These values will be rounded to the nearest multiple of 28.
# resized_height and resized_width
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "file:///path/to/your/image.jpg",
"resized_height": 280,
"resized_width": 420,
},
{"type": "text", "text": "Describe this image."},
],
}
]
# min_pixels and max_pixels
messages = [
{
"role": "user",
"content": [
{
"type": "image",
"image": "file:///path/to/your/image.jpg",
"min_pixels": 50176,
"max_pixels": 50176,
},
{"type": "text", "text": "Describe this image."},
],
}
]
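To see how these constraints interact, here is a minimal sketch of the resizing rule described above (round both sides to multiples of 28, then rescale into the [min_pixels, max_pixels] pixel budget while keeping the aspect ratio). It only illustrates the rule and is not the exact resize function used by keye-vl-utils or the processor:
import math

def sketch_resize(height, width, factor=28,
                  min_pixels=256 * 28 * 28, max_pixels=1280 * 28 * 28):
    # Round both sides to the nearest multiple of 28.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    # If the total pixel count falls outside the budget, rescale while keeping the aspect ratio.
    if h * w > max_pixels:
        scale = math.sqrt(h * w / max_pixels)
        h = math.floor(h / scale / factor) * factor
        w = math.floor(w / scale / factor) * factor
    elif h * w < min_pixels:
        scale = math.sqrt(min_pixels / (h * w))
        h = math.ceil(h * scale / factor) * factor
        w = math.ceil(w * scale / factor) * factor
    return h, w

print(sketch_resize(1080, 1920))  # a large frame is scaled down to fit within max_pixels
print(sketch_resize(200, 300))    # a small image is scaled up to reach min_pixels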
👀 Architecture and Training Strategy
The Kwai Keye-VL architecture is built on the Qwen3-8B language model and incorporates a vision encoder initialized from the open-source SigLIP. The model supports native dynamic resolution, preserving each image's original aspect ratio by dividing it into a sequence of 14x14 patches, which a simple MLP layer then maps and merges into visual tokens. The model applies 3D Rotary Position Embedding (RoPE) to process text, image, and video information in a unified way, establishing a one-to-one correspondence between position encoding and absolute time to ensure precise perception of temporal changes in video.
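As a back-of-the-envelope sketch of how resolution maps to visual token count, assuming (as the 28x28-based min_pixels/max_pixels defaults above suggest) that the 14x14 patches are merged in 2x2 groups so each visual token covers roughly a 28x28 pixel area; the released processor may differ in detail:
import math

def estimate_visual_tokens(height, width, patch_size=14, merge_size=2):
    # Rough estimate: 14x14 patches, merged in 2x2 groups into one visual token (an assumption).
    grid_h = math.ceil(height / patch_size)
    grid_w = math.ceil(width / patch_size)
    return (grid_h // merge_size) * (grid_w // merge_size)

print(estimate_visual_tokens(448, 448))  # ~256 tokens, matching min_pixels = 256 * 28 * 28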
🌟 Pre-Train
The pre-training pipeline of Kwai Keye-VL adopts a four-stage progressive strategy: image-text matching, ViT-LLM alignment, multi-task pre-training, and annealing with model merging.
Pre-training data: massive, high-quality, and diverse
- Diversity: covers image-text pairs, videos, and pure text, with tasks including fine-grained captioning, OCR text recognition, question answering, and object grounding.
- High quality: data is filtered with CLIP scores and VLM discriminators, and MinHash deduplication is used to prevent data leakage.
- Self-built datasets: high-quality in-house datasets are purpose-built, especially for detailed captioning and Chinese OCR, to compensate for the shortcomings of open-source data.
Training pipeline: four-stage progressive optimization
Kwai Keye-VL adopts a four-stage progressive training strategy:
- Stage 0 (visual pre-training): continually pre-trains the vision encoder to adapt to the in-house data distribution and to support dynamic resolution.
- Stage 1 (cross-modal alignment): freezes the backbone and trains only the MLP, achieving robust image-text alignment at low cost.
- Stage 2 (multi-task pre-training): unlocks all parameters to comprehensively improve the model's visual understanding.
- Stage 3 (annealing): fine-tunes on high-quality data to further improve fine-grained understanding.
Finally, Kwai Keye-VL explores a homogeneous-heterogeneous fusion technique: the parameters of annealing-stage models trained on different data mixtures are averaged, preserving multi-dimensional capabilities while reducing model bias and improving robustness.
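As an illustrative sketch of the parameter-averaging idea (uniform averaging of same-architecture checkpoints; the exact merging recipe used for Keye-VL is not specified here), with hypothetical checkpoint file names:
import torch

def average_checkpoints(state_dicts):
    # Uniformly average the parameters of several checkpoints that share the same architecture.
    merged = {}
    for key in state_dicts[0]:
        merged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return merged

# Hypothetical usage: annealing-stage checkpoints trained on different data mixtures.
# state_dicts = [torch.load(p, map_location="cpu") for p in ["anneal_mix_a.pt", "anneal_mix_b.pt"]]
# merged = average_checkpoints(state_dicts)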
📈 Experimental Results
- Keye-VL-8B stands out with strong, state-of-the-art perception capabilities that are competitive with top-tier models.
- Keye-VL-8B demonstrates remarkable proficiency in video understanding. On a suite of authoritative public video benchmarks, including Video-MME, Video-MMMU, TempCompass, LongVideoBench, and MMVU, it clearly outperforms other leading models of comparable size.
- On evaluation sets that require complex logical reasoning and mathematical problem solving, such as WeMath, MathVerse, and LogicVista, Kwai Keye-VL-8B shows a strong performance curve, underscoring its advanced ability in logical deduction and solving complex quantitative problems.