DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Published: 2025-03-21

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-𝐾 out of 𝑁 experts, face challenges in ensuring expert specialization, i.e., each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies:
(1) Finely segmenting the experts into 𝑚𝑁 ones and activating 𝑚𝐾 from them, allowing for a more flexible combination of activated experts;
(2) Isolating 𝐾𝑠 experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts.

Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5× expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which sets the upper bound of MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves comparable performance with LLaMA2 7B, with only about 40% of computations. Further, our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show its performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations.

However, conventional MoE architectures such as GShard, which activate the top-K out of N experts, struggle to ensure expert specialization, i.e., that each expert acquires non-overlapping and focused knowledge. To address this, the paper proposes the DeepSeekMoE architecture, aimed at ultimate expert specialization. It comprises two core strategies: (1) fine-grained expert segmentation: the experts are subdivided into mN smaller ones, of which mK are activated, allowing more flexible combinations of experts; (2) shared expert isolation: Ks experts are designated as shared experts that capture common knowledge and reduce redundancy among the routed experts. Starting from a modest 2B-parameter model, the paper shows that DeepSeekMoE 2B matches GShard 2.9B, which has 1.5× the expert parameters and computation. Moreover, DeepSeekMoE 2B nearly reaches the performance of its dense counterpart with the same total parameters, which sets the performance upper bound for MoE models. The paper then scales DeepSeekMoE to 16B parameters and shows that it matches LLaMA2 7B with only about 40% of the computation. Preliminary experiments at 145B parameters further confirm its substantial advantages over the GShard architecture and show performance close to DeepSeek 67B while using only 28.5% (possibly even 18.2%) of the computation.

Introduction

Recent research and practices have empirically demonstrated that, with sufficient training data available, scaling language models with increased parameters and computational budgets can yield remarkably stronger models (Brown et al., 2020; Hoffmann et al., 2022; OpenAI, 2023; Touvron et al., 2023a). It is imperative to acknowledge, however, that the endeavor to scale models to an extremely large scale is also associated with exceedingly high computational costs. Considering the substantial costs, the Mixture-of-Experts (MoE) architecture (Jacobs et al., 1991; Jordan and Jacobs, 1994; Shazeer et al., 2017) has emerged as a popular solution. It can enable parameter scaling, while concurrently keeping computational costs at a modest level. Recent applications of MoE architectures in Transformers (Vaswani et al., 2017) have yielded successful attempts at scaling language models to a substantial size (Du et al., 2022; Fedus et al., 2021; Lepikhin et al., 2021; Zoph, 2022), accompanied with remarkable performance. These achievements underscore the considerable potential and promise of MoE language models.

[Background]

Recent research and practice have shown that, given sufficient training data, scaling language models with more parameters and compute can yield markedly stronger models (Brown et al., 2020; Hoffmann et al., 2022; OpenAI, 2023; Touvron et al., 2023a). However, scaling models to extremely large sizes also comes with extremely high computational costs. Given these costs, the Mixture-of-Experts (MoE) architecture (Jacobs et al., 1991; Jordan and Jacobs, 1994; Shazeer et al., 2017) has become a popular solution: it scales the parameter count while keeping computational costs modest. Recent applications of MoE within Transformers (Vaswani et al., 2017) have successfully scaled language models to substantial sizes (Du et al., 2022; Fedus et al., 2021; Lepikhin et al., 2021; Zoph, 2022) with remarkable performance. These achievements highlight the great potential and promise of MoE language models.

Despite the promising potential of MoE architectures, existing MoE architectures potentially suffer from issues of knowledge hybridity and knowledge redundancy, which limit the expert specialization, i.e., each expert acquires non-overlapping and focused knowledge. Conventional MoE architectures substitute the Feed-Forward Networks (FFNs) in a Transformer with MoE layers. Each MoE layer consists of multiple experts, with each structurally identical to a standard FFN, and each token is assigned to one (Fedus et al., 2021) or two (Lepikhin et al., 2021) experts. This architecture manifests two potential issues:
(1) Knowledge Hybridity: existing MoE practices often employ a limited number of experts (e.g., 8 or 16), and thus tokens assigned to a specific expert will likely cover diverse knowledge. Consequently, the designated expert will tend to assemble vastly different types of knowledge in its parameters, which are hard to utilize simultaneously.
(2) Knowledge Redundancy: tokens assigned to different experts may require common knowledge. As a result, multiple experts may converge in acquiring shared knowledge in their respective parameters, thereby leading to redundancy in expert parameters. These issues collectively hinder the expert specialization in existing MoE practices, preventing them from reaching the theoretical upper-bound performance of MoE models.

[Problem Statement]

Despite the great potential of MoE architectures, existing designs may suffer from knowledge hybridity and knowledge redundancy, which limit expert specialization, i.e., each expert acquiring non-overlapping and focused knowledge. Conventional MoE architectures replace the Feed-Forward Networks (FFNs) in a Transformer with MoE layers. Each MoE layer consists of multiple experts, each structurally identical to a standard FFN, and each token is assigned to one (Fedus et al., 2021) or two (Lepikhin et al., 2021) experts. This design has two potential issues:
(1) Knowledge hybridity: existing MoE practice typically uses a small number of experts (e.g., 8 or 16), so the tokens assigned to a given expert are likely to cover diverse kinds of knowledge. That expert therefore tends to pack very different types of knowledge into its parameters, which are hard to use at the same time.
(2) Knowledge redundancy: tokens assigned to different experts may require common knowledge, so multiple experts end up learning the same shared knowledge in their respective parameters, leading to redundant expert parameters. Together, these issues hinder expert specialization in existing MoE practice and keep it from reaching the theoretical performance upper bound of MoE models.
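To make the conventional baseline concrete, below is a minimal sketch (not taken from the paper) of the GShard-style top-K routing described above, in PyTorch. The dot-product affinity scoring and the names `conventional_topk_gating`, `expert_centroids`, and `top_k` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def conventional_topk_gating(hidden_states, expert_centroids, top_k=2):
    """Sketch of conventional top-K routing (illustrative only).

    hidden_states:    (num_tokens, hidden_dim) token representations
    expert_centroids: (num_experts, hidden_dim) learnable routing vectors
    Returns per-token gate values and the indices of the selected experts.
    """
    # Token-to-expert affinity scores, normalized over all experts.
    scores = F.softmax(hidden_states @ expert_centroids.t(), dim=-1)  # (T, N)
    # Keep only the top-K experts per token; all other gates are zero.
    topk_scores, topk_indices = scores.topk(top_k, dim=-1)            # (T, K)
    return topk_scores, topk_indices

# Usage: with 16 experts and top-2 routing, each token activates 2 of the 16 FFNs.
tokens = torch.randn(4, 512)
centroids = torch.randn(16, 512)
gates, experts = conventional_topk_gating(tokens, centroids, top_k=2)
```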

In response to the aforementioned issues, we introduce DeepSeekMoE, an innovative MoE architecture specifically designed towards ultimate expert specialization. Our architecture involves two principal strategies:
(1) Fine-Grained Expert Segmentation: while maintaining the number of parameters constant, we segment the experts into a finer grain by splitting the FFN intermediate hidden dimension. Correspondingly, keeping a constant computational cost, we also activate more fine-grained experts to enable a more flexible and adaptable combination of activated experts. Fine-grained expert segmentation allows diverse knowledge to be decomposed more finely and be learned more precisely into different experts, where each expert will retain a higher level of specialization. In addition, the increased flexibility in combining activated experts also contributes to a more accurate and targeted knowledge acquisition.
(2) Shared Expert Isolation: we isolate certain experts to serve as shared experts that are always activated, aiming at capturing and consolidating common knowledge across varying contexts. By compressing common knowledge into these shared experts, redundancy among the other routed experts is mitigated. This enhances parameter efficiency and ensures that each routed expert remains specialized by focusing on distinctive aspects. These architectural innovations in DeepSeekMoE offer opportunities to train a parameter-efficient MoE language model where each expert is highly specialized.

Starting from a modest scale with 2B parameters, we validate the advantages of the DeepSeekMoE architecture. We conduct evaluations on 12 zero-shot or few-shot benchmarks spanning diverse tasks. Empirical results indicate that DeepSeekMoE 2B surpasses GShard 2B (Lepikhin et al., 2021) by a substantial margin, and even matches GShard 2.9B, a larger MoE model with 1.5× expert parameters and computation. Remarkably, we find that DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with an equivalent number of parameters, which sets the strict upper bound of MoE language models. In pursuit of deeper insights, we conduct elaborate ablation studies and analysis on the expert specialization for DeepSeekMoE. These studies validate the effectiveness of fine-grained expert segmentation and shared expert isolation, and provide empirical evidence supporting the assertion that DeepSeekMoE can achieve a high level of expert specialization.

[Method]

To address these issues, the paper proposes DeepSeekMoE, an innovative MoE architecture designed for ultimate expert specialization. The architecture involves two core strategies:
(1) Fine-grained expert segmentation: while keeping the total number of parameters constant, the experts are split into finer units by partitioning the FFN intermediate hidden dimension. Correspondingly, with the computational cost held constant, more of these fine-grained experts are activated, enabling a more flexible and adaptable combination of activated experts. Fine-grained segmentation lets diverse knowledge be decomposed more finely and learned more precisely by different experts, so each expert retains a higher degree of specialization. The added flexibility in combining activated experts also contributes to more accurate and targeted knowledge acquisition.
(2) Shared expert isolation: certain experts are isolated as shared experts that are always activated, to capture and consolidate common knowledge across contexts. Compressing common knowledge into these shared experts reduces redundancy among the other routed experts. This improves parameter efficiency and ensures that each routed expert stays specialized by focusing on distinctive aspects. These architectural innovations make it possible to train a parameter-efficient MoE language model with highly specialized experts.

Starting from a modest 2B-parameter scale, the paper validates the advantages of the DeepSeekMoE architecture on 12 zero-shot or few-shot benchmarks spanning diverse tasks. Empirical results show that DeepSeekMoE 2B surpasses GShard 2B (Lepikhin et al., 2021) by a substantial margin and even matches GShard 2.9B, a larger MoE model with 1.5× the expert parameters and computation. Notably, DeepSeekMoE 2B nearly reaches the performance of its dense counterpart, which sets the strict performance upper bound for MoE models. For deeper insight, the paper conducts detailed ablation studies and analyses of expert specialization in DeepSeekMoE. These studies confirm the effectiveness of fine-grained expert segmentation and shared expert isolation, and provide empirical evidence that DeepSeekMoE achieves a high degree of expert specialization.

Leveraging our architecture, we subsequently scale up the model parameters to 16B and train DeepSeekMoE 16B on a large-scale corpus with 2T tokens. Evaluation results reveal that with only about 40% of computations, DeepSeekMoE 16B achieves comparable performance with DeepSeek 7B (DeepSeek-AI, 2024), a dense model trained on the same 2T corpus. We also compare DeepSeekMoE with open source models and the evaluations demonstrate that DeepSeekMoE 16B consistently outperforms models with a similar number of activated parameters by a large margin, and achieves comparable performance with LLaMA2 7B (Touvron et al., 2023b), which has approximately 2.5 times the activated parameters. Figure 1 demonstrates the evaluation results on the Open LLM Leaderboard. Additionally, we conduct supervised fine-tuning (SFT) for alignment, transforming the model into a chat model. Evaluation results show that DeepSeekMoE Chat 16B also achieves comparable performance with DeepSeek Chat 7B and LLaMA2 SFT 7B in the chat setting. Encouraged by these results, we further undertake a preliminary endeavor to scale up DeepSeekMoE to 145B. The experimental results still validate its substantial advantages over the GShard architecture consistently. In addition, it shows performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations.

[Results]

Building on this architecture, the model is then scaled to 16B parameters, and DeepSeekMoE 16B is trained on a large-scale corpus of 2T tokens. Evaluation results show that with only about 40% of the computation, DeepSeekMoE 16B matches DeepSeek 7B (DeepSeek-AI, 2024), a dense model trained on the same corpus. Comparisons with other open-source models show that DeepSeekMoE 16B consistently outperforms models with a similar number of activated parameters by a large margin and matches LLaMA2 7B (Touvron et al., 2023b), which has roughly 2.5× the activated parameters. Figure 1 shows the evaluation results on the Open LLM Leaderboard. The paper also performs supervised fine-tuning (SFT) for alignment, turning the model into a chat model; DeepSeekMoE Chat 16B performs on par with DeepSeek Chat 7B and LLaMA2 SFT 7B in the chat setting. Encouraged by these results, the paper makes a preliminary attempt to scale DeepSeekMoE to 145B parameters. The results again consistently confirm its substantial advantages over the GShard architecture and show performance close to DeepSeek 67B while using only 28.5% (possibly even 18.2%) of the computation.

Our contributions are summarized as follows:

Architectural Innovation. We introduce DeepSeekMoE, an innovative MoE architecture aiming at achieving ultimate expert specialization, which employs two principal strategies of fine-grained expert segmentation and shared expert isolation.
Empirical Validation. We conduct extensive experiments to empirically validate the effectiveness of the DeepSeekMoE architecture. Experimental results validate the high level of expert specialization in DeepSeekMoE 2B, and indicate that DeepSeekMoE 2B can nearly approach the upper bound performance for MoE models.
Scalability. We scale up DeepSeekMoE to train a 16B model and show that with only about 40% of computations, DeepSeekMoE 16B achieves comparable performance with DeepSeek 7B and LLaMA2 7B. We also undertake a preliminary endeavor to scale up DeepSeekMoE to 145B, highlighting its consistent advantages over the GShard architecture and showing a comparable performance with DeepSeek 67B.
Alignment for MoE. We successfully perform supervised fine-tuning on DeepSeekMoE 16B to create an aligned chat model, showcasing the adaptability and versatility of DeepSeekMoE 16B.
Public Release. In the spirit of open research, we release the model checkpoint of DeepSeekMoE 16B to the public. Notably, this model can be deployed on a single GPU with 40GB of memory without the need for quantization.

[Contributions]
Architectural innovation: DeepSeekMoE, an innovative MoE architecture aimed at ultimate expert specialization, built on the two core strategies of fine-grained expert segmentation and shared expert isolation.
Empirical validation: extensive experiments confirm the effectiveness of the DeepSeekMoE architecture; the results show a high degree of expert specialization in DeepSeekMoE 2B, which nearly reaches the performance upper bound of MoE models.
Scalability: DeepSeekMoE is scaled to 16B parameters and matches DeepSeek 7B and LLaMA2 7B with only about 40% of the computation. A preliminary attempt to scale to 145B parameters shows consistent advantages over the GShard architecture and performance comparable to DeepSeek 67B.
MoE alignment: supervised fine-tuning of DeepSeekMoE 16B yields an aligned chat model, demonstrating the adaptability and versatility of DeepSeekMoE 16B.
Public release: in the spirit of open research, the model checkpoint of DeepSeekMoE 16B is released to the public. Notably, it can be deployed on a single 40GB GPU without quantization.

Figure 2 | Illustration of DeepSeekMoE. Subfigure (a) showcases an MoE layer with the conventional top-2 routing strategy. Subfigure (b) illustrates the fine-grained expert segmentation strategy. Subsequently, subfigure (c) demonstrates the integration of the shared expert isolation strategy, constituting the complete DeepSeekMoE architecture. It is noteworthy that across these three architectures, the number of expert parameters and computational costs remain constant.


Summary:

  1. Core problem: existing MoE architectures suffer from knowledge hybridity and knowledge redundancy, which limit expert specialization and keep them from reaching the theoretical performance upper bound of MoE models.
  2. Key idea: achieve higher expert specialization and parameter efficiency through fine-grained expert segmentation and shared expert isolation.
  3. Method highlights:
    • Fine-grained expert segmentation allows more flexible expert combinations, improving the precision of knowledge decomposition and acquisition.
    • Shared expert isolation captures common knowledge and reduces redundancy among routed experts.
  4. Main results:
    • DeepSeekMoE 2B nearly reaches the performance upper bound set by its dense counterpart, validating the architecture's effectiveness.
    • DeepSeekMoE 16B matches LLaMA2 7B with only about 40% of the computation.
    • In a preliminary scale-up to 145B parameters, DeepSeekMoE consistently outperforms the GShard architecture and approaches DeepSeek 67B with only 28.5% (possibly 18.2%) of its computation.

DeepSeekMoE Architecture

On top of the generic MoE architecture outlined in Section 2, we introduce DeepSeekMoE, which is specifically designed to exploit the potential of expert specialization. As illustrated in Figure 2, our architecture incorporates two principal strategies: fine-grained expert segmentation and shared expert isolation. Both of these strategies are designed to elevate the level of expert specialization.

Building on the generic MoE architecture outlined in Section 2, the paper introduces DeepSeekMoE, designed specifically to exploit the potential of expert specialization. As shown in Figure 2, the architecture incorporates two main strategies, fine-grained expert segmentation and shared expert isolation, both aimed at raising the level of expert specialization.

Fine-Grained Expert Segmentation

In scenarios where the number of experts is limited, tokens assigned to a particular expert will be more likely to cover diverse types of knowledge. As a consequence, the designated expert will tend to learn vastly different types of knowledge in its parameters, which are hard to utilize simultaneously. However, if each token can be routed to more experts, diverse knowledge will gain the potential to be decomposed and learned in different experts respectively. In this context, each expert can still retain a high level of expert specialization, contributing to a more focused knowledge distribution across experts.

In pursuit of this goal, while maintaining a consistent number of expert parameters and computational cost, we segment the experts with a finer grain. The finer expert segmentation enables a more flexible and adaptable combination of activated experts. To be specific, on top of a typical MoE architecture shown in Figure 2(a), we segment each expert FFN into 𝑚 smaller experts by reducing the FFN intermediate hidden dimension to 1/𝑚 times its original size. Since each expert becomes smaller, in response, we also increase the number of activated experts to 𝑚 times to keep the same computation cost, as illustrated in Figure 2(b). With the fine-grained expert segmentation, the output of an MoE layer can be expressed as:
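The displayed equation did not survive extraction here. A reconstruction consistent with the definitions that follow is given below; the symbols $\mathbf{u}_t$ (the MoE layer input for token $t$), $\mathbf{e}_i$ (the learnable centroid of expert $i$), and $s_{i,t}$ (the token-to-expert affinity) follow common MoE notation and should be read as assumptions, since the original rendering is not reproduced:

$$
\mathbf{h}_t = \sum_{i=1}^{mN} g_{i,t}\,\mathrm{FFN}_i(\mathbf{u}_t) + \mathbf{u}_t,
\qquad
g_{i,t} =
\begin{cases}
s_{i,t}, & s_{i,t} \in \operatorname{Topk}\big(\{\, s_{j,t} \mid 1 \le j \le mN \,\},\; mK\big),\\
0, & \text{otherwise},
\end{cases}
\qquad
s_{i,t} = \operatorname{Softmax}_i\big(\mathbf{u}_t^{\top}\mathbf{e}_i\big)
$$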

where the total number of expert parameters is equal to 𝑁 times the number of parameters in a standard FFN, and 𝑚𝑁 denotes the total number of fine-grained experts. With the fine-grained expert segmentation strategy, the number of nonzero gates also increases to 𝑚𝐾.

From a combinatorial perspective, the fine-grained expert segmentation strategy substantially enhances the combinatorial flexibility of activated experts. As an illustrative example, we consider the case where N = 16. A typical top-2 routing strategy can yield $\binom{16}{2} = 120$ possible combinations. By contrast, if each expert is split into 4 smaller experts, the fine-grained routing strategy can yield $\binom{64}{8} = 4,426,165,368$ potential combinations. The surge in combinatorial flexibility enhances the potential for achieving more accurate and targeted knowledge acquisition.

When the number of experts is limited, the tokens assigned to a particular expert are more likely to cover diverse types of knowledge. The expert therefore tries to learn very different kinds of knowledge in its parameters, which are hard to use simultaneously. If each token can instead be routed to more experts, different types of knowledge have the potential to be decomposed and learned by different experts. In this setting, each expert can still retain a high degree of specialization, leading to a more focused distribution of knowledge across experts.

To this end, while keeping the total expert parameters and computational cost unchanged, the experts are segmented at a finer granularity. This makes the combination of activated experts more flexible and adaptable. Concretely, on top of the typical MoE architecture in Figure 2(a), each expert FFN is split into m smaller experts by reducing the FFN intermediate hidden dimension to 1/m of its original size. Since each expert becomes smaller, the number of activated experts is correspondingly increased m-fold to keep the computation cost unchanged, as shown in Figure 2(b). With fine-grained expert segmentation, the output of an MoE layer can be expressed as in the formula above.

Here the total number of expert parameters equals N times the parameters of a standard FFN, and mN is the total number of fine-grained experts. With this strategy, the number of nonzero gates also increases to mK.

From a combinatorial perspective, fine-grained expert segmentation greatly increases the combinatorial flexibility of activated experts. For example, with N = 16, a typical top-2 routing strategy yields $\binom{16}{2} = 120$ possible combinations, whereas splitting each expert into 4 smaller experts lets the fine-grained routing strategy yield $\binom{64}{8} = 4,426,165,368$ potential combinations. This surge in flexibility helps achieve more accurate and targeted knowledge acquisition.
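The combinatorial claim above can be checked with a few lines of Python; the numbers follow directly from the binomial coefficients and are not additional results from the paper.

```python
from math import comb

# Conventional routing: choose 2 experts out of 16.
conventional = comb(16, 2)      # 120

# Fine-grained routing: each of the 16 experts is split into 4,
# giving 64 smaller experts, of which 8 are activated.
fine_grained = comb(64, 8)      # 4,426,165,368

print(conventional, fine_grained)
```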

Shared Expert Isolation

With a conventional routing strategy, tokens assigned to different experts may necessitate some common knowledge or information. As a result, multiple experts may converge in acquiring shared knowledge in their respective parameters, thereby resulting in redundancy in expert parameters. However, if there are shared experts dedicated to capturing and consolidating common knowledge across varying contexts, the parameter redundancy among other routed experts will be alleviated. This alleviation of redundancy will contribute to a more parameter-efficient model with more specialized experts.

Towards this objective, in addition to the fine-grained expert segmentation strategy, we further isolate Ks experts to serve as shared experts. Regardless of the router module, each token will be deterministically assigned to these shared experts. In order to maintain a constant computational cost, the number of activated experts among the other routed experts will be decreased by Ks, as depicted in Figure 2(c). With the shared expert isolation strategy integrated, an MoE layer in the complete DeepSeekMoE architecture is formulated as follows:
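As above, the displayed equation is missing from this copy; a reconstruction under the same assumed notation, now with $K_s$ always-activated shared experts and top-$(mK - K_s)$ routing over the remaining experts, is:

$$
\mathbf{h}_t = \sum_{i=1}^{K_s} \mathrm{FFN}_i(\mathbf{u}_t) \;+\; \sum_{i=K_s+1}^{mN} g_{i,t}\,\mathrm{FFN}_i(\mathbf{u}_t) + \mathbf{u}_t,
\qquad
g_{i,t} =
\begin{cases}
s_{i,t}, & s_{i,t} \in \operatorname{Topk}\big(\{\, s_{j,t} \mid K_s + 1 \le j \le mN \,\},\; mK - K_s\big),\\
0, & \text{otherwise}.
\end{cases}
$$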

Finally, in DeepSeekMoE, the number of shared experts is Ks, the total number of routed experts is mN − Ks, and the number of nonzero gates is mK − Ks.

It is worth noting that the prototype of shared expert isolation can be credited to Rajbhandari et al. (2022). The key distinction lies in the fact that they derive this strategy from an engineering perspective, while we approach it from an algorithmic standpoint.

With a conventional routing strategy, tokens assigned to different experts may need some common knowledge or information. As a result, multiple experts may converge toward learning that shared knowledge in their respective parameters, causing redundancy in expert parameters. If dedicated shared experts capture and consolidate common knowledge across contexts, the parameter redundancy among the other routed experts is alleviated, yielding a more parameter-efficient model with more specialized experts.

To this end, on top of the fine-grained expert segmentation strategy, Ks experts are further isolated as shared experts. Regardless of the router module, every token is deterministically assigned to these shared experts. To keep the computational cost constant, the number of activated experts among the remaining routed experts is reduced by Ks, as shown in Figure 2(c). With shared expert isolation integrated, an MoE layer in the complete DeepSeekMoE architecture is formulated as in the equation above.

In the end, in DeepSeekMoE, the number of shared experts is Ks, the total number of routed experts is mN − Ks, and the number of nonzero gates is mK − Ks.

It is worth noting that the prototype of shared expert isolation can be credited to Rajbhandari et al. (2022). The key difference is that they derive this strategy from an engineering perspective, whereas this paper approaches it from an algorithmic standpoint.
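To tie the two strategies together, here is a compact PyTorch sketch of a DeepSeekMoE-style layer: fine-grained experts with a reduced intermediate dimension, a few always-activated shared experts, and top-(mK − Ks) routing over the remaining experts. This is an illustrative reading of the description above, not the released implementation; all class and argument names are assumptions, and the load-balancing losses discussed next are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FineGrainedExpert(nn.Module):
    """One fine-grained expert: an FFN whose intermediate size is d_ff // m."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.gelu(self.w_in(x)))


class DeepSeekMoESketch(nn.Module):
    """Illustrative DeepSeekMoE-style layer: shared + fine-grained routed experts."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=16, m=4, top_k=2, n_shared=1):
        super().__init__()
        d_hidden = d_ff // m                       # each expert is 1/m of a standard FFN
        self.n_routed = m * n_experts - n_shared   # mN - Ks routed experts
        self.n_active = m * top_k - n_shared       # mK - Ks nonzero gates per token
        self.shared = nn.ModuleList(
            FineGrainedExpert(d_model, d_hidden) for _ in range(n_shared))
        self.routed = nn.ModuleList(
            FineGrainedExpert(d_model, d_hidden) for _ in range(self.n_routed))
        self.centroids = nn.Parameter(torch.randn(self.n_routed, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:          # x: (tokens, d_model)
        # Shared experts are always activated, independent of the router.
        out = sum(expert(x) for expert in self.shared)
        # Token-to-expert affinities over the routed experts only.
        scores = F.softmax(x @ self.centroids.t(), dim=-1)        # (T, mN - Ks)
        gates, idx = scores.topk(self.n_active, dim=-1)           # keep mK - Ks gates
        dense_gates = torch.zeros_like(scores).scatter(-1, idx, gates)
        # For clarity every routed expert runs on every token and is masked by its
        # gate; a real implementation dispatches tokens to experts sparsely.
        routed_out = torch.stack([expert(x) for expert in self.routed], dim=1)
        out = out + torch.einsum("te,ted->td", dense_gates, routed_out)
        return out + x                                            # residual connection


# With 16 original experts, m = 4, top-2 routing, and 1 shared expert, a token
# activates 1 shared expert plus (2*4 - 1) = 7 routed experts out of 63.
layer = DeepSeekMoESketch()
y = layer(torch.randn(10, 512))
```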

Load Balance Consideration

Automatically learned routing strategies may encounter the issue of load imbalance, which manifests two notable defects. Firstly, there is a risk of routing collapse (Shazeer et al., 2017), i.e., the model always selects only a few experts, preventing other experts from sufficient training. Secondly, if experts are distributed across multiple devices, load imbalance can exacerbate computation bottlenecks.

Automatically learned routing strategies can run into load imbalance, which shows up as two notable defects:

  1. There is a risk of routing collapse (Shazeer et al., 2017), i.e., the model always selects only a few experts, so the other experts do not receive sufficient training.
  2. If experts are distributed across multiple devices, load imbalance exacerbates computation bottlenecks.

Expert-Level Balance Loss.

In order to mitigate the risk of routing collapse, we also employ an expert-level balance loss. The computation of the balance loss is as follows:
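The displayed formula is missing here; based on the definitions of $N'$, $K'$, and the indicator function given below, a reconstruction (with $T$ the number of tokens in a sequence and $s_{i,t}$ the token-to-expert affinity; treat the exact symbols as assumptions) is:

$$
\mathcal{L}_{\mathrm{ExpBal}} = \alpha_1 \sum_{i=1}^{N'} f_i P_i,
\qquad
f_i = \frac{N'}{K' T} \sum_{t=1}^{T} \mathbb{1}\,(\text{Token } t \text{ selects Expert } i),
\qquad
P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}
$$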

where α1 is a hyper-parameter called the expert-level balance factor, N′ is equal to (mN − Ks) and K′ is equal to (mK − Ks) for brevity, and 1(·) denotes the indicator function.

To reduce the risk of routing collapse, an expert-level balance loss is introduced. Its computation follows the formula above, where α1 is a hyper-parameter called the expert-level balance factor, N′ equals (mN − Ks), K′ equals (mK − Ks), and 1(·) denotes the indicator function.
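As a concrete reading of this loss, the sketch below computes f_i and P_i from the routing scores and top-k selections and returns α1 · Σ f_i P_i. It assumes the notation reconstructed above and is not the paper's code; the function name, variable names, and the default α1 value are illustrative.

```python
import torch

def expert_level_balance_loss(scores, topk_idx, alpha1=1e-3):
    """Expert-level balance loss sketch (alpha1 value is a placeholder).

    scores:   (T, N') softmax token-to-expert affinities over routed experts
    topk_idx: (T, K') indices of the experts selected for each token
    """
    T, n_routed = scores.shape
    k = topk_idx.shape[1]
    # f_i: (N' / (K' * T)) * number of tokens that selected expert i.
    selected = torch.zeros_like(scores).scatter(-1, topk_idx, 1.0)   # (T, N') 0/1
    f = selected.sum(dim=0) * n_routed / (k * T)                     # (N',)
    # P_i: mean affinity assigned to expert i across the T tokens.
    p = scores.mean(dim=0)                                           # (N',)
    return alpha1 * (f * p).sum()

# Usage with the routing sketch above: scores from the softmax, indices from topk.
scores = torch.softmax(torch.randn(32, 63), dim=-1)
topk_idx = scores.topk(7, dim=-1).indices
loss = expert_level_balance_loss(scores, topk_idx)
```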

Device-Level Balance Loss.

In addition to the expert-level balance loss, we introduce a device-level balance loss. When aiming to alleviate computation bottlenecks, it becomes unnecessary to enforce strict balance constraints at the expert level, because excessive constraints on load balance will compromise model performance. Instead, our primary objective is to ensure balanced computation across the devices. If we partition all routed experts into D groups {E_1, E_2, ..., E_D}, and deploy each group on a single device, the device-level balance loss is computed as follows:
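The device-level formula is likewise missing; a reconstruction that aggregates the per-expert quantities $f_i$ and $P_i$ from the expert-level loss over each device group $\mathcal{E}_i$ (again, notation is assumed) is:

$$
\mathcal{L}_{\mathrm{DevBal}} = \alpha_2 \sum_{i=1}^{D} f_i' P_i',
\qquad
f_i' = \frac{1}{|\mathcal{E}_i|} \sum_{j \in \mathcal{E}_i} f_j,
\qquad
P_i' = \sum_{j \in \mathcal{E}_i} P_j
$$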

where α2 is a hyper-parameter called the device-level balance factor. In practice, we set a small expert-level balance factor to mitigate the risk of routing collapse, and meanwhile set a larger device-level balance factor to promote balanced computation across the devices.

In addition to the expert-level balance loss, a device-level balance loss is introduced. When the goal is to alleviate computation bottlenecks, strict balance constraints at the expert level become unnecessary, since over-constraining load balance harms model performance. Instead, the primary objective is to keep computation balanced across devices. If all routed experts are partitioned into D groups {E_1, E_2, ..., E_D} and each group is deployed on a single device, the device-level balance loss is computed as in the formula above, where α2 is a hyper-parameter called the device-level balance factor. In practice, a small expert-level balance factor is set to reduce the risk of routing collapse, while a larger device-level balance factor is set to promote balanced computation across devices.
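Following the same reconstruction, the device-level loss averages f over the experts of each device group, sums P within the group, and scales the result by α2. This is again a sketch under the assumed notation, building on the f and P arrays from the previous snippet; the group assignment, names, and default α2 value are illustrative.

```python
import torch

def device_level_balance_loss(f, p, device_groups, alpha2=0.05):
    """Device-level balance loss sketch (alpha2 value is a placeholder).

    f, p:          (N',) per-expert quantities from the expert-level loss sketch
    device_groups: list of index lists, one per device, partitioning the routed experts
    """
    loss = f.new_zeros(())
    for group in device_groups:
        idx = torch.as_tensor(group, device=f.device)
        f_prime = f[idx].mean()        # average f over the experts on this device
        p_prime = p[idx].sum()         # total routing probability on this device
        loss = loss + f_prime * p_prime
    return alpha2 * loss

# Usage: 63 routed experts split across 7 hypothetical devices.
groups = [list(range(i, 63, 7)) for i in range(7)]
f = torch.rand(63)
p = torch.softmax(torch.randn(63), dim=-1)
loss = device_level_balance_loss(f, p, groups)
```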