Routing Experts: Learning to Route Dynamic Experts in Multi-modal Large Language Models (ICLR 2025)

Published: 2025-03-16

Routing Experts: Learning to Route Dynamic Experts in Multi-modal Large Language Models

Abstract

Recently, mixture of experts (MoE) has become a popular paradigm for achieving the trade-off between model capacity and efficiency of multimodal large language models (MLLMs). Different from previous efforts, we are dedicated to exploring the dynamic experts in existing MLLMs and showing that a standard MLLM can also be a mixture of experts. However, achieving this target is still notoriously challenging. Well-trained MLLMs are accustomed to a fixed inference pathway, and a drastic change in their inference manner greatly impedes their performance. To address these issues, we propose a novel dynamic expert routing method for existing MLLMs, termed Routing Experts (RoE), which can achieve example-dependent optimal path routing without obvious structure tweaks. Meanwhile, a new structure sparsity regularization is also introduced to force the well-trained MLLMs to learn more short-cut pathways. In addition, we also address the alignment of the training and inference of MLLMs in terms of network routing. To validate RoE, we apply it to a set of existing MLLMs, including LLaVA-1.5, LLaVA-HR and VILA, and conduct extensive experiments on a bunch of VL benchmarks. The experiment results not only show the effectiveness of our RoE in improving MLLMs' efficiency, but also yield obvious advantages over MoE-LLaVA in both performance and speed, e.g., an average performance gain of 3.3% on 5 benchmarks while being 1.61 times faster. Our code is anonymously released at https://github.com/DoubtedSteam/RoE

Introduction

Recently, the great success of large language models (LLMs) Radford et al. (2018); Zhang et al. (2022); Bai et al. (2023a); Touvron et al. (2023) has attracted an influx of interest in extending them to more modalities, e.g., vision and language (VL) Jiang et al. (2020); Luo et al. (2023b); Tong et al. (2024); Zhou et al. (2019). Despite great progress, multi-modal large language models (MLLMs) Li et al. (2023b); Dai et al. (2023); Rose et al. (2023); Wang et al. (2024); Koh et al. (2024) also suffer from excessive computation due to the introduction of more modality tokens. For instance, LLaVA Liu et al. (2023b) requires 6.15 times more computation than its unimodal inference on ScienceQA Lu et al. (2022). Inspired by the progress of LLMs Radford et al. (2018); Zhang et al. (2022); Touvron et al. (2023), recent efforts Bai et al. (2023a); Ainslie et al. (2023); Jiang et al. (2024); Raposo et al. (2024) are also devoted to exploring new MLLMs with a Mixture-of-Experts (MoE) structure, thereby achieving a good trade-off between model capacity and inference efficiency.

Different from these pioneers Gou et al. (2023); Shen et al. (2023); Lin et al. (2024), we focus on exploring the dynamic experts in MLLMs that already exist and show that a well-trained common MLLM can also be a mixture of experts. The motivation is akin to MoE, that is, LLMs or MLLMs need enough parameter capacity to meet the scaling law Kaplan et al. (2020), but it is evident that the entire model is often redundant for specific tasks, especially the easy ones Wu et al. (2024). For instance, advanced MLLMs like LLaVA-1.5 Liu et al. (2023a) exhibit much stronger generalization capability than previous vision-language (VL) models Li et al. (2019); Kim et al. (2021); Dou et al. (2022); Gao et al. (2023), but are still only on par with the bespoke ones Lu et al. (2019); Kim et al. (2021) of much smaller parameter sizes on benchmarks like VQAv2 Goyal et al. (2017).

However, in terms of methodology, we are keen to explore the dynamic and sparse structures of MLLMs that already exist, rather than building a new sparse model like previous MoE methods Shen et al. (2023); Lin et al. (2024). We observe that the activations of an MLLM's different layers are distinct across examples. As shown in Fig. 1, some layers barely contribute to the transformation and reasoning of a given example. This finding implies that the inherent knowledge of common MLLMs is likely to be distributed as in MoE models Eigen et al. (2013), indicating the feasibility of routing the expert subnetworks in an already existing MLLM.

Figure 1: The visualization of the l1-distances between the input and output features of each layer of LLaVA-7B Liu et al. (2023b). A lower l1-distance indicates that the layer has less impact on the feature update for this example, suggesting that it is less important during inference. For the two examples, the contributions of different layers also differ.
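For readers who want to reproduce this kind of measurement, the following is a minimal sketch (not the authors' script): it registers forward hooks on a LLaMA-style decoder stack, such as the language model inside LLaVA, and records the mean l1-distance between each layer's input and output hidden states for one example. The attribute path `model.model.layers` is an assumption about the model structure.

```python
import torch

def layer_l1_distances(model, input_ids, attention_mask=None):
    """Mean |output - input| of the hidden states for every decoder layer."""
    distances = []

    def make_hook(idx):
        def hook(module, inputs, output):
            hidden_in = inputs[0]                                   # (batch, seq, dim)
            hidden_out = output[0] if isinstance(output, tuple) else output
            distances.append((idx, (hidden_out - hidden_in).abs().mean().item()))
        return hook

    # Assumption: a LLaMA-style stack exposed as model.model.layers (as in LLaVA's LLM).
    handles = [layer.register_forward_hook(make_hook(i))
               for i, layer in enumerate(model.model.layers)]
    with torch.no_grad():
        model(input_ids=input_ids, attention_mask=attention_mask)
    for h in handles:
        h.remove()
    return distances  # small values mark layers that barely update the features
```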

However, achieving this target is still intractable. In particular, we aim to adaptively skip the less important layers of MLLMs for each example, thereby achieving better efficiency, as shown in Fig. 2. Although intuitive, this attempt on MLLMs still suffers from several challenges. The first one is the feature gap that occurs in dynamic inference. Unlike previous dynamic models Lin et al. (2024); Jiang et al. (2024); Sun et al. (2024); Shen et al. (2024); Luo et al. (2024b), which are mostly trained from scratch and well accommodate dynamic inference, this layer-wise skipping makes MLLMs encounter a drastic change in feature space during inference, greatly limiting their performance upper bound. Meanwhile, how to make MLLMs choose a short-cut pathway is also difficult. Since MLLMs are already well trained end-to-end, they usually prefer not to skip under the default training objectives. In addition, existing MLLMs Dai et al. (2023); Liu et al. (2023a); Lin et al. (2023); Liu et al. (2023b); Zhou et al. (2024) often organize multiple examples as a multi-turn conversation for efficient training, which however contradicts the dynamic routing of each example. Overall, these ingredients greatly hinder the achievement of dynamic routing in existing MLLMs.

Figure 2: Illustration of the proposed Routing Experts (RoE). Existing MoE models (a) often build a new sparse structure with multiple FFNs as experts, and each pathway takes the same computation for all examples. Our RoE (b) aims to explore the expert pathways within the model itself via adapter-based skip connections, realizing dynamic computation for different examples. (c) Routing tokens are used to decide layer-wise path selection, i.e., the adapter-based skip connection or the default Transformer layer. It also serves to align the training and testing of MLLMs.

To address these issues, we propose an innovative dynamic routing method for MLLMs, termed Routing Experts (RoE). RoE regards each layer of an MLLM as an independent expert, and its objective is to identify and connect the important ones into an optimal routing path for each example. In practice, RoE uses path routers to decide whether to skip layers. To compensate for the feature gap, we introduce adapter-based connections to replace the less important layers, which are easy to deploy and can well serve feature transformations Wu et al. (2024). To optimize RoE, we also propose a novel sparsity regularization to encourage the learning of sparse and diverse pathway routing. Combined with this objective, simple yet effective routing tokens are further proposed to facilitate the optimization of RoE in multi-turn conversations, addressing the misalignment between training and inference. With these innovative designs, RoE shows that a standard and well-trained MLLM can also be a mixture of experts.
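To make the design concrete, below is a minimal sketch of how such a routed layer could be wired up. It is an illustration under our own assumptions, not the authors' implementation: the module names, the two-way router, the soft-vs-hard switch between training and inference, and the exact form of the sparsity penalty are not taken from the paper.

```python
import torch
import torch.nn as nn

class RoELayer(nn.Module):
    """One Transformer layer treated as an expert, paired with a cheap adapter shortcut."""

    def __init__(self, layer, dim, adapter_dim=64):
        super().__init__()
        self.layer = layer                               # original layer, assumed (B, S, D) -> (B, S, D)
        self.adapter = nn.Sequential(                    # low-cost skip-connection expert
            nn.Linear(dim, adapter_dim), nn.GELU(), nn.Linear(adapter_dim, dim)
        )
        self.router = nn.Linear(dim, 2)                  # scores for [full layer, adapter path]

    def forward(self, hidden, routing_feat):
        # routing_feat: e.g. the hidden state of a routing token, shape (B, D)
        probs = torch.softmax(self.router(routing_feat), dim=-1)    # (B, 2)
        if self.training:
            # soft mixture keeps the path choice differentiable while tuning
            full = self.layer(hidden)
            short = hidden + self.adapter(hidden)
            p = probs.unsqueeze(1)                                   # (B, 1, 2)
            out = p[..., 0:1] * full + p[..., 1:2] * short
        else:
            # hard routing at inference; in practice only the chosen path would be executed
            take_layer = (probs[:, 0] >= probs[:, 1]).view(-1, 1, 1)
            out = torch.where(take_layer,
                              self.layer(hidden),
                              hidden + self.adapter(hidden))
        return out, probs

def sparsity_loss(all_layer_probs):
    """Assumed regularizer: penalize the average probability of taking the full layer,
    pushing routers toward the cheaper adapter shortcuts."""
    return torch.stack([p[:, 0] for p in all_layer_probs]).mean()
```

In this sketch the sparsity term would be added to the task loss with a small weight, so that skipping is encouraged rather than enforced; the paper defines its own regularization and routing details.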

To validate RoE, we apply it to a set of advanced MLLMs, including LLaVA-1.5 Liu et al. (2023a), LLaVA-HR Luo et al. (2024c) and VILA Lin et al. (2023), on 10 competitive VL benchmarks, including VQA2.0 Goyal et al. (2017), GQA Hudson & Manning (2019), TextVQA Singh et al. (2019), POPE Li et al. (2023c), MME Fu et al. (2023), and MM-Vet Yu et al. (2023). The experimental results show that our RoE can greatly speed up the inference of common MLLMs while still maintaining their competitive performance on various benchmarks. For instance, our RoE improves the inference speed of LLaVA-1.5 by 21.3% without performance drops on most benchmarks. Compared with the previous MoE-based method, e.g., MoE-LLaVA Lin et al. (2024), RoE not only has better performance on all benchmarks, but also exhibits faster inference speed, e.g., 6.77 vs. 4.95 examples per second on average over traditional VL benchmarks.

Overall, our contributions are three-fold:

We present a novel attempt at dynamic routing in existing MLLMs, termed Routing Experts (RoE), which aims to transform existing MLLMs into a mixture of experts without obvious structure tweaks.

To achieve effective and efficient RoE for well-trained MLLMs, we introduce adapter-based skip connections to alleviate the feature gap problem and a novel sparsity regularization to help MLLMs learn dynamic and sparse inference. Besides, the routing token design also aligns the training and inference of RoE-MLLMs.

On ten highly-competitive benchmarks, RoE can significantly improve the efficiency of three advanced MLLMs while retaining similar or even better performance.

Summary

Motivation

The core motivation of this paper is to tackle the efficiency bottleneck of existing MLLMs at inference time. Although the mixture-of-experts (MoE) architecture has proven effective at balancing model capacity and inference efficiency, most prior work focuses on designing entirely new sparse models rather than fully exploiting the potential of existing MLLMs. The authors observe that different layers of an existing MLLM contribute very differently to different examples, which suggests that its internal knowledge may be distributed in an MoE-like manner. Exploring how to realize dynamic routing inside existing MLLMs is therefore an important and challenging task.

Core Idea

The paper proposes a dynamic routing method named Routing Experts (RoE), which aims to turn existing MLLMs into a mixture of experts without notable structural changes. The core idea is to adaptively skip unimportant layers and find the optimal inference path for each example, thereby improving inference efficiency.

Method

  1. Path routers: RoE uses path routers to decide whether to skip certain layers.
  2. Adapter connections: adapter-based connections replace the unimportant layers to bridge the feature gap.
  3. Sparsity regularization: a new sparsity regularization is proposed to encourage sparse and diverse path routing.
  4. Routing token design: simple routing tokens align training and inference, especially in multi-turn conversation settings (a hypothetical sketch follows this list).
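As a hypothetical illustration of item 4 (the class name and the exact placement are assumptions, not taken from the paper), a learnable routing token could simply be prepended to each conversation turn, so that the per-layer routers read its hidden state in the same way during multi-turn training and single-example inference:

```python
import torch
import torch.nn as nn

class RoutingTokenPrepender(nn.Module):
    """Prepend a learnable routing token to one turn's token embeddings (assumed design)."""

    def __init__(self, dim):
        super().__init__()
        self.routing_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, turn_embeds):                      # turn_embeds: (B, S, D)
        route = self.routing_token.expand(turn_embeds.size(0), -1, -1)
        return torch.cat([route, turn_embeds], dim=1)    # routers later read position 0
```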

Main Conclusions

Experiments show that RoE not only substantially improves the inference efficiency of existing MLLMs (e.g., a 21.3% speed-up for LLaVA-1.5), but also outperforms the existing MoE-based method MoE-LLaVA on multiple benchmarks while running faster (6.77 vs. 4.95 examples per second). This demonstrates RoE's dual advantage in efficiency and performance.

