EM-VLM4AD: A Lightweight Multi-View Visual Question Answering Model for Autonomous Driving

EM-VLM4AD

Paper

Field	Content
Paper title	Multi-Frame, Lightweight & Efficient Vision-Language Models for Question Answering in Autonomous Driving
Paper link	https://arxiv.org/abs/2403.19838
Source code	akshaygopalkr/EM-VLM4AD (github.com)
Venue	CVPR Workshop 2024

Abstract

Problems with prior methods
- VLMs are too large to support real-time VQA for AD
- Most VLMs perform single-image VQA; few handle multiple images, especially in the AD domain
Main contributions
- Proposes EM-VLM4AD, an Efficient Multi-frame VLM for Autonomous Driving
  - It needs about 10 times less memory and FLOPs than existing AD VLMs
  - It supports VQA over multiple images
- Explores two different lightweight LM backbones for EM-VLM4AD:
  - a finetuned Text-to-Text Transfer Transformer (T5) Base LM
  - an 8-bit quantized T5-Large LM finetuned using low-rank adaptation (LoRA)
- Compares against the DriveLM baseline on four metrics (BLEU-4, CIDEr, ROUGE-L, METEOR)
  - Stronger performance on ROUGE-L and CIDEr
Conclusion and outlook
- In future research, we aspire to evolve our model into a video-language model capable of generating responses from multi-view video inputs, thereby enhancing EM-VLM4AD's ability to handle temporal-related inquiries

Methods

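The overall architecture is shown in the figure below and has two main parts:

Image Patch Encoder (the image embedding network)
T5-Base/T5-Large (the language model)

[Figure: overall architecture of EM-VLM4AD]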

Image Embedding Network

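This section covers the Image Patch Encoder, Gated Pooling Attention, and the Projection Layer, i.e. everything that happens to the images before they reach the LLM (Large Language Model).

The Image Patch Encoder uses the weights of a ViT-B/32 pretrained on ImageNet. It does not run the full ViT, only its input embedding layer, the part that produces the patch embeddings.
To handle multiple views, the encoded embeddings of the individual views must be merged; this is done by Gated Pooling Attention and the Projection Layer.
The result is a single Multi-View Image Embedding.

The overall flow is therefore: six views (Front, Front-Left, Front-Right, Back, Back-Left, Back-Right) are fed in; each image passes through the ViT-B/32 embedding layer to give Individual View Embeddings; Gated Pooling Attention and the Projection Layer then merge and map these into one Multi-View Image Embedding, which is finally fed into T5.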
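To make the shapes in this flow concrete, here is a minimal arithmetic sketch. It assumes a 224×224 input, 32×32 patches, and a hidden size of 768, which is the usual ViT-B/32 configuration; these notes do not state the input resolution or whether a class token is kept, so `patch_embedding_shape` and its `use_cls_token` flag are illustrative names only.

```python
def patch_embedding_shape(height, width, patch=32, hidden=768, use_cls_token=False):
    """Return (sequence_length, hidden_dim) of one view's patch embedding.

    Assumed defaults: 32x32 patches and hidden size 768 (typical ViT-B/32).
    """
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    seq_len = (height // patch) * (width // patch)
    if use_cls_token:
        seq_len += 1  # optional class token; an assumption, not from the notes
    return seq_len, hidden


NUM_VIEWS = 6  # Front, Front-Left, Front-Right, Back, Back-Left, Back-Right

seq_len, hidden = patch_embedding_shape(224, 224)  # 224x224 is an assumed input size
flat_dim = seq_len * hidden  # length of one view's embedding after flattening

print(f"per-view embedding: {seq_len} x {hidden}")
print(f"flattened length per view: {flat_dim}")
print(f"views to merge: {NUM_VIEWS}")
```

Under these assumptions every view contributes an embedding of the same shape, which is what lets a single pooling step merge all six.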

The detailed steps are:

Each input image lies in $\mathbb{R}^{3 \times H \times W}$; it is flattened and sliced into patches with a linear projection and positional embedding.
This yields $V_i \in \mathbb{R}^{S_I \times H_I}$, where $i$ indexes the $i$-th image.
- $S_I$ is the sequence length for the image embedding
- $H_I$ is the hidden dimension of the image embedding
Note: the two steps above, from $\mathbb{R}^{3 \times H \times W}$ to $V_i \in \mathbb{R}^{S_I \times H_I}$, are both performed by the ViT input embedding layer; in practice each image is simply passed through that layer.
This gives 6 image embeddings, one per view, and each is then flattened to one dimension.
Gated pooling attention (from the MIVC paper) is then applied to the flattened embeddings; the paper gives its own rationale for this choice.

Gated Pooling Attention first computes a weight $\alpha_i$ for each $V_i$ and then takes the weighted sum:
$\sum_{i=1}^{N} \alpha_i V_i$
The weights satisfy $\sum_{i=1}^{N} \alpha_i = 1$ and are computed as
$\alpha_i = \frac{\exp \left\{ w^T \left( \tanh (Z V_i^T) \otimes \sigma (G V_i^T) \right) \right\}}{\sum_{j=1}^{N} \exp \left\{ w^T \left( \tanh (Z V_j^T) \otimes \sigma (G V_j^T) \right) \right\}}$
where $w \in \mathbb{R}^{K}, \; Z \in \mathbb{R}^{K \times M}, \; G \in \mathbb{R}^{K \times M}, \; M = S_I H_I$, and $\otimes$ denotes element-wise multiplication.

$K$ is a hyperparameter, set to 128 in the paper.
After Gated Pooling Attention the pooled embedding is $V \in \mathbb{R}^{S_I \times H_I}$. The Projection Layer then projects $V$ to dimension $H_T$ to match the text embedding dimension, so that it can be concatenated with the text embedding into $\mathbb{R}^{(S_T + S_I) \times H_T}$, where $S_T$ is the sequence length of the text embedding.
The final Multi-View Image Embedding lies in $\mathbb{R}^{S_I \times H_T}$.
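The pooling, projection, and concatenation steps can be sketched in plain Python with toy sizes. This is a rough illustration, not the authors' implementation: the random weights, the tiny dimensions, the bias-free projection matrix `P`, and the text-then-image concatenation order are all assumptions, and the helper names are hypothetical.

```python
import math
import random


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_pooling_attention(views, Z, G, w):
    """views: N flattened embeddings of length M. Returns (alphas, pooled)."""
    scores = []
    for v in views:
        t = [math.tanh(x) for x in matvec(Z, v)]   # tanh(Z V_i^T), length K
        g = [sigmoid(x) for x in matvec(G, v)]     # sigma(G V_i^T), length K
        scores.append(sum(wk * tk * gk for wk, tk, gk in zip(w, t, g)))
    top = max(scores)                              # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]             # softmax over the N views
    pooled = [sum(a * v[m] for a, v in zip(alphas, views))
              for m in range(len(views[0]))]       # weighted sum of the views
    return alphas, pooled


def project(pooled, seq_len, hidden_in, P):
    """Reshape to S_I rows of length H_I, then map each row to H_T with P (H_T x H_I)."""
    rows = [pooled[i * hidden_in:(i + 1) * hidden_in] for i in range(seq_len)]
    return [matvec(P, row) for row in rows]


def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
N, S_I, H_I, H_T, K, S_T = 6, 4, 8, 5, 3, 7        # toy sizes (the paper uses K = 128)
M = S_I * H_I
views = rand_matrix(N, M)
Z, G, w = rand_matrix(K, M), rand_matrix(K, M), rand_matrix(1, K)[0]

alphas, pooled = gated_pooling_attention(views, Z, G, w)
image_emb = project(pooled, S_I, H_I, rand_matrix(H_T, H_I))
text_emb = rand_matrix(S_T, H_T)                   # stand-in for the text embedding
fused = text_emb + image_emb                       # concatenate along the sequence axis

print("sum of alphas:", round(sum(alphas), 6))
print("fused shape:", len(fused), "x", len(fused[0]))
```

Because the weights come from a softmax, the pooled embedding is a convex combination of the six views, so its shape equals that of a single view regardless of how many views are supplied.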

Language Model

To cut computation and inference latency, the paper uses LMs with fewer than one billion parameters, in two pretrained T5 variants:

T5-Base, which contains around 223 million parameters
an 8-bit quantized version of T5-Large (≈ 750M parameters)

The Multi-View Image Embedding is concatenated with the Text Embedding and fed into T5, which produces the output.

In their experiments the authors found that fine-tuning the whole model for T5-Base works best, whereas for the quantized T5-Large they use LoRA-Fine-Tuning-Aware Quantization.
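To see why LoRA trains far fewer weights than full fine-tuning, recall that LoRA freezes a weight matrix and learns a low-rank update made of two small factors. The sketch below only counts parameters for a single square projection; the width 1024 is T5-Large's model dimension, while the rank `r = 8` is a hypothetical value, since these notes do not give the rank used in the paper.

```python
def full_params(d_out, d_in):
    """Weights updated when a d_out x d_in matrix is fine-tuned directly."""
    return d_out * d_in


def lora_trainable_params(d_out, d_in, rank):
    """Weights in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


d_model = 1024  # T5-Large model dimension
r = 8           # hypothetical LoRA rank, not taken from the paper

full = full_params(d_model, d_model)
lora = lora_trainable_params(d_model, d_model, r)
print(f"full fine-tuning: {full} weights per matrix")
print(f"LoRA (rank {r}): {lora} weights per matrix ({lora / full:.2%} of full)")
```

The frozen base weights can then stay in 8-bit form while only the small factors are trained in higher precision.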

Training Process

This section covers the training procedure. The dataset setup is:

DriveLM dataset
a 90%/5%/5% split of the traffic scenes

Training proceeds in two stages:

Stage 1: the Image Patch Encoder and the T5 LM are frozen; only Gated Pooling Attention and the Projection Layer are trained. The paper's reason: "This forces the multi-view image embeddings to align with the type of embeddings the LM expects." The LM was built to consume text embeddings but is now asked to consume images, so the GPA and projection layers are trained first to produce embeddings of the kind the LM expects.
Stage 2: only the Image Patch Encoder stays frozen; T5, Gated Pooling Attention, and the Projection Layer are trained together.
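The two-stage schedule boils down to a table of which modules receive gradient updates. Here is a minimal, framework-free sketch; the module names are hypothetical labels for the components described above, and in a PyTorch implementation each flag would typically be applied by setting `requires_grad` on that module's parameters.

```python
# Hypothetical module labels for the four components of the model.
MODULES = (
    "image_patch_encoder",
    "gated_pooling_attention",
    "projection_layer",
    "t5_lm",
)

# Modules that are trained in each stage; everything else stays frozen.
STAGE_TRAINABLE = {
    1: {"gated_pooling_attention", "projection_layer"},
    2: {"gated_pooling_attention", "projection_layer", "t5_lm"},
}


def trainable_flags(stage):
    """Map each module name to True (trained) or False (frozen) for a stage."""
    trainable = STAGE_TRAINABLE[stage]
    return {name: name in trainable for name in MODULES}


for stage in (1, 2):
    flags = trainable_flags(stage)
    frozen = [name for name, on in flags.items() if not on]
    print(f"stage {stage}: frozen = {frozen}")
```

For the quantized T5-Large variant, "training T5" in stage 2 would presumably mean updating the LoRA weights rather than the quantized base weights.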

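[Figure: the training procedure as described in the original paper]

Note: the Image Patch Encoder is never trained at any point in this process; it keeps the weights of the ViT pretrained on ImageNet.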

Experiments

The experiments cover quantitative, computational, and qualitative analysis.

Generated answers are scored with metrics that are common in image captioning tasks:

BLEU-4
ROUGE-L
METEOR
CIDEr
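As an illustration of how one of these metrics works, below is a minimal sentence-level ROUGE-L sketch based on the longest common subsequence (LCS). It assumes whitespace tokenization, lowercasing, and `beta = 1.0` (a plain F1); evaluation toolkits often use different tokenization and a different `beta`, so its scores will not necessarily match the paper's. The example sentences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, else carry the best neighbour.
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure between a candidate and a reference sentence."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


# Made-up answer pair, for illustration only.
score = rouge_l("The car is turning left", "The car turns left")
print(f"ROUGE-L: {score:.3f}")
```

ROUGE-L rewards answers that keep the reference's words in the same order, even when other words are interleaved.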

Quantitative Results

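The results are shown below. T5-Base outperforms the 8-bit quantized T5-Large, which can be attributed to T5-Base being able to train a larger set of parameters; this helps the language model adapt to the input vision-language embeddings.

The train/validation/test split used here was made by the authors themselves and differs from the one used by DriveLM-Agent. Because DriveLM-Agent is currently closed-source, the DriveLM-Agent numbers reported are its results on its own private test set. In other words, EM-VLM4AD is trained and tested on the authors' split of DriveLM and then compared with DriveLM-Agent's results on a different, private test set.

[Table: quantitative results of EM-VLM4AD and DriveLM-Agent (BLEU-4, CIDEr, ROUGE-L, METEOR)]

Multi-frame processing is an important reason EM-VLM4AD performs better: DriveLM-Agent takes only the front-view frame as input, whereas this model merges information from multiple views through its custom multi-view embedding network.

The study emphasizes that learning to perform VQA on the DriveLM dataset can be done without increasing model complexity, so simply adding complexity may not bring the best improvement on this particular task.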

Computational Analysis

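The computational analysis covers three quantities: parameters, FLOPs, and memory (GB).

On an A100, using examples from the DriveLM dataset, the authors measure each module with the fvcore FLOP counter and sum the costs of the Image Encoder and the LM, since those two components account for most of the computation and parameters in these VLMs.

[Table: parameters, FLOPs, and memory (GB) of the compared models]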

EM-VLM4AD with the T5-Large backbone needs the least memory, mainly because its parameters are quantized to 8 bits.
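A back-of-the-envelope estimate shows how a larger model can still take less memory. The sketch below counts weight storage only, using the parameter counts quoted in these notes; it assumes T5-Base is stored in 32-bit floats (not stated here) and ignores the image encoder, activations, and any LoRA weights, so the numbers are indicative rather than the paper's measurements.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Parameter counts as quoted in these notes; fp32 for T5-Base is an assumption.
t5_base_fp32 = weight_memory_gb(223e6, 32)
t5_large_int8 = weight_memory_gb(750e6, 8)

print(f"T5-Base  (32-bit): {t5_base_fp32:.3f} GB")
print(f"T5-Large (8-bit):  {t5_large_int8:.3f} GB")
```

With one byte per weight instead of four, the roughly 3.4 times larger model still ends up with the smaller weight footprint.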

Qualitative Results

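EM-VLM4AD shows that it understands the c-tag format used by DriveLM, which encodes traffic objects as $<c, CAM, x_{pos}, y_{pos}>$. The model also learns to pick out the frame most relevant to each question, which makes it an efficient multi-frame VLM system.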
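For readers unfamiliar with c-tags, the sketch below pulls them out of a question string with a regular expression. The exact token shapes (an identifier like `c1`, a camera name like `CAM_BACK`, and pixel coordinates) and the sample question are assumptions for illustration; only the four-field layout comes from the description above.

```python
import re
from typing import NamedTuple


class CTag(NamedTuple):
    ident: str    # object identifier, e.g. "c1" (assumed form)
    camera: str   # camera name, e.g. "CAM_BACK" (assumed form)
    x: float      # x position in the image
    y: float      # y position in the image


# Assumed pattern for <c, CAM, x_pos, y_pos>, tolerant of optional spaces.
C_TAG = re.compile(
    r"<\s*(c\d+)\s*,\s*(CAM_[A-Z_]+)\s*,"
    r"\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*>"
)


def parse_c_tags(text):
    """Return every c-tag found in the text, in order of appearance."""
    return [CTag(m[1], m[2], float(m[3]), float(m[4])) for m in C_TAG.finditer(text)]


# Hypothetical question in the style described above.
question = "What is the moving status of <c1,CAM_BACK,1088.3,497.5>?"
for tag in parse_c_tags(question):
    print(tag)
```

The camera field is what ties an object to one of the six views, which is why a multi-view model can use it to decide which frame matters for a question.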

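EM-VLM4AD does have one specific weakness: it performs poorly on questions about predicting the ego vehicle's behavior. Such questions usually need information from multiple frames over time to answer accurately, so feeding the network multi-view video to add temporal context may improve the answers to this kind of question.

[Figure: qualitative examples of EM-VLM4AD answers]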
