Original paper: A Sliding Layer Merging Method for Efficient Depth-Wise Pruning in LLMs
Abstract
Compared with width-wise pruning, depth-wise pruning can significantly accelerate inference in resource-constrained scenarios. However, treating an entire Transformer layer as the minimal pruning unit may degrade model performance, because all of the layer's information is discarded indiscriminately. By analyzing the correlations between the outputs of different layers in a reproducing kernel Hilbert space, this paper reveals a "patch-like" feature relationship among the layers of large language models. Based on this, we propose a sliding layer merging method that, guided by a predefined similarity threshold, dynamically selects and fuses consecutive layers from top to bottom, simplifying the model structure while preserving its performance. Extensive experiments on LLMs with different architectures and parameter scales show that our method outperforms existing pruning techniques in both zero-shot inference performance and retraining recovery quality after pruning. In particular, with 35% pruning on the Vicuna-7B model, our method improves average zero-shot task performance by 1.654% over existing approaches. We further reveal the potential of combining depth-wise and width-wise pruning to improve pruning results.
Contribution:
• analyzed the inter-layer correlations in LLMs within a reproducing kernel Hilbert space, observing an interesting Patch-Like correlation distribution, which provides valuable insights for the design of model compression strategies.
• propose the Sliding Layer Merging method, which dynamically merges layers with strong representational similarity in LLMs. This method can be seamlessly applied to various existing LLM architectures.
Related Works
Pruning Method on LLMs
wide-wise approach
The width-wise approach reduces the network width by pruning coupled structures, such as attention heads and their associated weight connections, while preserving the number of layers
[Voita et al., 2019] and [Michel et al., 2019] introduced pruning and attention-head sharing techniques to reduce redundant attention heads, thereby decreasing both computational complexity and parameter requirements.
[Nova et al., 2023] and [Santacroce et al., 2023] optimized the feed-forward network by reducing the dimension of the FFN hidden layer, thereby reducing the memory footprint and computational complexity.
Depth-wise approach
The depth-wise approach reduces the network depth by completely removing certain layers.
Drawbacks: 1) the correlations between Transformer layers at different depths remain underexplored; 2) arbitrarily removing specific layers may degrade the performance of the pruned model.
Shortened-LLM [Kim et al., 2024] selected Taylor+ and PPL indicators as the importance measure for Transformer layers and directly deleted the unimportant layers.
The layer-skipping strategy [Schuster et al., 2022; Del Corro et al., 2023; Raposo et al., 2024] dynamically selects which layers to skip during execution.
[Song et al., 2024; Tang et al., 2024] reduce the model’s depth by eliminating redundant layers
Motivation
First time I've seen a paper with a section like this.
CKA vector similarity
Centered Kernel Alignment (CKA) is a metric used to compare the internal representations of neural networks.
Its main advantages are its invariance to orthogonal transformations (e.g. changes in neuron arrangement) and its robustness to isotropic scaling achieved through a normalization term [Raghu et al., 2021].
This makes it suitable for studying the underlying relationships between different Transformer layers within large language models.
Step 1: Computation
Map the outputs of the two Transformer layers being compared to representation matrices X ∈ R^(n×p) and Y ∈ R^(n×q), where n is the number of samples and p, q are the dimensions of the two representation spaces. Then compute their Gram matrices (kernel matrices), K = XXᵀ and L = YYᵀ, to measure representational similarity.
Step 2: Centering
Center the Gram matrices to eliminate the potential impact of sample-distribution bias: K' = HKH and L' = HLH, where H = I_n − (1/n)11ᵀ is the centering matrix, I_n is the n×n identity matrix, and 1 is an all-ones vector of length n.
Step 3: Normalization
Compute the normalized alignment between the centered Gram matrices to obtain CKA: CKA(K, L) = ⟨K', L'⟩_F / (‖K'‖_F · ‖L'‖_F), where ⟨·,·⟩_F is the Frobenius inner product (the sum of element-wise products) and ‖·‖_F is the Frobenius norm (the square root of the sum of squares of all elements in the matrix).
The final CKA value is between 0 and 1. The closer the value is to 1, the more similar the two representation matrices are.
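To make the three steps concrete, here is a minimal sketch of linear CKA (our own NumPy illustration; the function name `linear_cka` and the choice of a linear kernel are assumptions, not code from the paper):

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear CKA between two representation matrices X (n x p) and Y (n x q)."""
    n = X.shape[0]
    # Step 1: Gram (kernel) matrices with a linear kernel.
    K = X @ X.T
    L = Y @ Y.T
    # Step 2: center the Gram matrices with H = I_n - (1/n) * 1 1^T.
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ K @ H, H @ L @ H
    # Step 3: normalized Frobenius alignment, a value in [0, 1].
    return float(np.sum(Kc * Lc) / (np.linalg.norm(Kc) * np.linalg.norm(Lc)))
```

For layer comparison, X and Y would be the (flattened) outputs of two Transformer layers on the same n calibration samples.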
Feels similar to a divergence? I seem to remember KL divergence also being used to measure similarity between layers (okay, I misremembered).
Core concepts
CKA similarity: measures the similarity between two feature representations (typically high-dimensional vectors or matrices); mainly used to analyze representation learning in neural networks. CKA computes the similarity between centered kernels (e.g. a linear or RBF kernel), normalized to [0, 1], where 1 means identical and 0 means completely uncorrelated. It is symmetric, i.e. CKA(X, Y) = CKA(Y, X).
KL divergence: measures the difference between two probability distributions P and Q. It is not symmetric, i.e. DKL(P||Q) ≠ DKL(Q||P), and can be read as the "information loss" incurred when data drawn from P is encoded using Q. It cannot be used directly as a distance metric.
Typical use cases
CKA:
- Commonly used to compare features across layers in deep learning, i.e. to analyze whether different layers of a neural network produce similar representations.
- Can be used to assess whether different models have learned similar feature representations.
- Applies to non-probabilistic data, such as the hidden-layer representations of a neural network.
KL divergence:
- Mainly used to compare probability distributions; common in information theory, Bayesian inference, and machine learning (e.g. variational autoencoders, VAE).
- In optimization, e.g. maximum likelihood estimation (MLE) or variational inference, KL divergence measures the deviation between the target distribution and the model distribution.
- Also used in GANs (Generative Adversarial Networks) and other probabilistic models.
Normalization
- CKA: the result is normalized to [0, 1] and easy to interpret.
- KL divergence: has no upper bound in general; its minimum is 0 (when P = Q), and its maximum depends on the specific distributions.
| Property | CKA similarity | KL divergence |
| --- | --- | --- |
| What it compares | Similarity of vectors or matrices | Difference between probability distributions |
| How it is computed | Kernel method based on HSIC | Relative entropy |
| Value range | [0, 1] | [0, ∞) |
| Symmetric | Yes | No |
| Typical use | Comparing neural-network features | Probability-distribution optimization, information theory |
Representation Structure between LLM Transformer Layers
The CKA similarity heatmaps for several large language models show the following:
Consistency across models: the models generally exhibit higher redundancy and stronger inter-layer correlation in the middle layers.
Inter-layer correlation differences: the first two layers and the last layer show noticeably lower CKA correlation with the other layers. This suggests that their representations are relatively independent and only weakly functionally related to the rest, so they may not be suitable for aggressive compression. In subsequent experiments we therefore "protect" these layers, leaving the first two layers and the last layer uncompressed to avoid unnecessary damage to model performance.
Redundancy of intermediate layers: among the intermediate layers, the correlations exhibit a clear block-like structure, showing strong continuity and correlation between the middle layers. This implies high functional redundancy and leaves room for compression.
In short: the strongly correlated middle layers can be compressed, while the lower and final layers are too distinct to compress.
Method
Sliding layer merging algorithm
During the compression process, starting from higher layers, we gradually merge adjacent layers downwards.
Still, I'm a bit curious: since CKA can be computed directly, wouldn't it be enough to simply merge the layers with high CKA? Why the sliding window?
Step 1: Model initialization.
During initialization, we assign the original model M to the target compressed model M*, and set the compression layer range to [L, H], where H is the highest layer and L is the lowest layer.
Setting the compression layer range lets us apply an appropriate protection mechanism so that the compressed model does not lose the functionality of key layers.
In short: rank the layers by the CKA results and protect the weakly correlated ones.
Step 2: Layer merging.
We take the highest layer H as the initial upper bound of the sliding window and set the layer below it, H−1, as the initial lower bound. By merging the multiple layers within the sliding window we obtain a temporary model Mtmp. We adopt a layer merging strategy based on inter-layer differences (as shown in Fig. 2(b)): the differences between the parameters of each adjacent layer and the base layer are added onto the base layer, gradually integrating the redundant information. This strategy not only captures the correlation between layers, but is also flexible enough to adapt to the compression needs of different models.
So how exactly are the layers merged? Is θ here the weights? Can simply adding weights really preserve the information of those layers??? (A sketch follows below.)
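As a concrete reading of the difference-based merge in Fig. 2(b), here is a minimal sketch (our own PyTorch illustration; the helper name `merge_window` and the reading that θ refers to the per-layer weights are assumptions): the base layer keeps its parameters θ_lo, and for every other layer in the window the difference θ_k − θ_lo is added on top, i.e. θ_merged = θ_lo + Σ_k (θ_k − θ_lo).

```python
import copy
import torch

@torch.no_grad()
def merge_window(layers, lo: int, hi: int):
    """Merge decoder layers[lo..hi] (inclusive) into one layer by adding the
    parameter differences of the upper layers onto the base (lowest) layer:
        theta_merged = theta_lo + sum_{k=lo+1..hi} (theta_k - theta_lo)
    Returns the merged layer; the caller replaces layers[lo..hi] with it.
    """
    merged = copy.deepcopy(layers[lo])
    base = dict(layers[lo].named_parameters())
    for k in range(lo + 1, hi + 1):
        upper = dict(layers[k].named_parameters())
        for name, p in merged.named_parameters():
            p.data.add_(upper[name].data - base[name].data)
    return merged
```

So the merged weights are the base layer's weights plus the accumulated differences of the other layers in the window, which is how we read "adding the differences between the parameters of adjacent layers and the base layer".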
Step 3: Iterative update.
To measure the impact of the merging operation on the model's output representation, we compute the cosine similarity between the last hidden states of the original model M and the merged model Mtmp on a few-shot calibration dataset (see Fig. 2(c)).
If the representation similarity between Mtmp and the original model is greater than the preset threshold T, the merged layers are considered to have little impact on model performance; the lower bound of the sliding window is then moved down one layer to expand the merging range.
If the representation similarity between Mtmp and the original model is less than the threshold T, the impact of the merged layers on model performance is considered too large and the sliding window stops expanding. At this point the compressed model M* is updated to Mtmp, the upper bound of the sliding window is set to the current lower bound, and the next round of layer merging begins (see Fig. 2(a)).
Step 4: Termination condition.
The process continues until the lowest layer L has been processed. Ultimately, the pruned model M* output by the algorithm reduces redundant computation and storage by retaining the merged representations of key layers.
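Putting Steps 1-4 together, here is a condensed sketch of the whole loop (our own reading; `merge_window` is from the sketch above, while `hidden_similarity` and the LLaMA-style `model.model.layers` attribute are assumptions). One ambiguity in the description: when the threshold check fails it says M* is updated to Mtmp; the sketch instead commits the last window that passed the check, which we take to be the intent.

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def hidden_similarity(model_a, model_b, calib_batches) -> float:
    """Average cosine similarity of the last hidden states on the calibration set."""
    sims = []
    for batch in calib_batches:
        ha = model_a(**batch, output_hidden_states=True).hidden_states[-1]
        hb = model_b(**batch, output_hidden_states=True).hidden_states[-1]
        sims.append(F.cosine_similarity(ha.flatten(1).float(), hb.flatten(1).float(), dim=-1).mean())
    return torch.stack(sims).mean().item()

def apply_merges(model, groups):
    """Return a copy of `model` whose decoder layers in each (lo, hi) group are
    replaced by one merged layer (uses merge_window from the earlier sketch)."""
    new = copy.deepcopy(model)
    layers = list(new.model.layers)
    for lo, hi in sorted(groups, reverse=True):   # merge top-down so lower indices stay valid
        layers[lo:hi + 1] = [merge_window(new.model.layers, lo, hi)]
    new.model.layers = torch.nn.ModuleList(layers)
    return new

def sliding_layer_merging(model, calib_batches, L: int, H: int, T: float):
    """Step 1: protect layers outside [L, H]; Steps 2-4: slide the window downwards."""
    groups, hi = [], H
    while hi - 1 >= L:
        lo, accepted = hi - 1, None
        while lo >= L:                            # Step 2: try merging window [lo, hi]
            tmp = apply_merges(model, groups + [(lo, hi)])        # M_tmp
            if hidden_similarity(model, tmp, calib_batches) > T:
                accepted = (lo, hi)               # Step 3: acceptable, expand downwards
                lo -= 1
            else:
                break                             # impact too large: stop expanding
        if accepted is not None:
            groups.append(accepted)
        hi = lo                                   # new upper bound = current lower bound
    return apply_merges(model, groups)            # Step 4: final pruned model M*
```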
Performance Recovery with Low-rank Approximation
We use the low-rank approximation technique LoRA [Hu et al., 2021] to fine-tune the pruned model and recover its performance.
Each learnable weight matrix (including both pruned and unpruned linear projections in the LLM) is denoted by W. The update to W, denoted △W, is factorized as △W = PQ, where P ∈ R^(d^-×d) and Q ∈ R^(d×d^+). Here, d^-, d, and d^+ correspond to the input, hidden (low-rank), and output dimensions, respectively.
Forward computation: f(x) = x(W + △W) + b = (xW + b) + (xP)Q, where b is the bias of the dense layer.
By training only the low-rank matrices P and Q, we considerably reduce both the computational complexity and the dependence on large-scale training data. Furthermore, the learned P and Q can be reparameterized back into W, so the final pruned model contains no extra parameters.
Is this like splitting a large matrix apart, training the parts separately, and merging them back at the end?
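For reference, here is a minimal sketch of the factorization and the reparameterization step described above (our own PyTorch illustration with a hypothetical `LoRALinear` wrapper, not the paper's actual training setup): the original W stays frozen, and only the low-rank update PQ is trained and later folded back into W.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen dense layer with a low-rank update: f(x) = x(W + PQ) + b."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze W and b; only P, Q are trained
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features      # d^- and d^+
        self.P = nn.Parameter(torch.randn(d_in, rank) * 0.01)  # P in R^(d^- x d)
        self.Q = nn.Parameter(torch.zeros(rank, d_out))        # Q in R^(d x d^+); zero init => dW = 0 at start

    def forward(self, x):
        return self.base(x) + x @ self.P @ self.Q

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Fold dW = PQ back into W so the final pruned model has no extra parameters."""
        # nn.Linear stores weight as (out_features, in_features), hence the transpose.
        self.base.weight += (self.P @ self.Q).T
        return self.base
```

Only P and Q receive gradients during retraining; after fine-tuning, `merge()` reparameterizes the update into W.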
Experiment
Setup
Model
LLaMA2-{7B, 13B} [Touvron et al., 2023], LLaMA3-8B, and Vicuna-{7B, 13B}-v1.3 [Chiang et al., 2023].
Baseline
For width pruning, we compare with LLM-Pruner [Ma et al., 2023], FLAP [An et al., 2024], and Wanda-sp [An et al., 2024], a structured variant of Wanda [Sun et al., 2024].
For depth pruning, we examine SLEB [Song et al., 2024] and Shortened-LLM [Kim et al., 2024]. We assess all methods under two target pruning levels: 20% and 35%. If the product of the total number of Transformer blocks and the target sparsity is not an integer, we round up to determine the number of blocks to remove.
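For example, a 32-block model such as LLaMA2-7B at the 35% target gives 32 × 0.35 = 11.2, rounded up to 12 blocks removed (our own illustration of the rounding rule).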
So is the merging applied on top of each of these two kinds of pruning separately?
Benchmarks
Following Touvron et al. (2023), we measure zero-shot accuracy on the commonsense reasoning datasets BoolQ, PIQA, HellaSwag [Zellers et al., 2019], WinoGrande [Sakaguchi et al., 2021], ARC-easy [Clark et al., 2018], ARC-challenge [Clark et al., 2018], and OpenBookQA [Mihaylov et al., 2018], using the lm-evaluation-harness package [Gao et al., 2024].
Implementation Details
PyTorch [Paszke et al., 2019]
HuggingFace Transformers library [Wolf et al., 2020].
Following [Ma et al., 2023], we randomly select 10 samples from BookCorpus [Zhu, 2015] to calculate the model similarity during the iterative pruning process. (What exactly is this used for?)
We also use this calibration dataset as a baseline approach to ensure a fair comparison.
In LoRA retraining, we use 50K samples of refined Alpaca [Taori et al., 2023] for instruction tuning.
All experiments covered in this article were performed on an NVIDIA A100 GPU with 80GB memory.
Results
Conclusion
By analyzing the correlations between the outputs of different layers in a reproducing kernel Hilbert space, we reveal a "patch-like" relationship pattern among the layers of large language models. Based on this, we propose a depth-wise pruning method that dynamically selects and merges layer parameters. The method sets a similarity threshold and merges consecutive layers from top to bottom, achieving fast model compression while effectively preserving performance. Experimental results show that our method significantly accelerates inference in resource-constrained environments and outperforms existing pruning techniques on zero-shot tasks. Moreover, it can be seamlessly integrated with width-wise pruning techniques, yielding pruned models with enhanced performance. We hope this study stimulates further research on depth-wise pruning methods and promotes the development of a unified framework combining depth- and width-wise pruning strategies, ultimately facilitating the efficient deployment of LLMs in resource-constrained environments.