CVPR 2023 Paper Roundup: Transformers


Paper1 TrojViT: Trojan Insertion in Vision Transformers

Abstract: Vision Transformers (ViTs) have demonstrated the state-of-the-art performance in various vision-related tasks. The success of ViTs motivates adversaries to perform backdoor attacks on ViTs. Although the vulnerability of traditional CNNs to backdoor attacks is well-known, backdoor attacks on ViTs are seldom-studied. Compared to CNNs capturing pixel-wise local features by convolutions, ViTs extract global context information through patches and attentions. Naively transplanting CNN-specific backdoor attacks to ViTs yields only a low clean data accuracy and a low attack success rate. In this paper, we propose a stealth and practical ViT-specific backdoor attack TrojViT. Rather than an area-wise trigger used by CNN-specific backdoor attacks, TrojViT generates a patch-wise trigger designed to build a Trojan composed of some vulnerable bits on the parameters of a ViT stored in DRAM memory through patch salience ranking and attention-target loss. TrojViT further uses parameter distillation to reduce the bit number of the Trojan. Once the attacker inserts the Trojan into the ViT model by flipping the vulnerable bits, the ViT model still produces normal inference accuracy with benign inputs. But when the attacker embeds a trigger into an input, the ViT model is forced to classify the input to a predefined target class. We show that flipping only few vulnerable bits identified by TrojViT on a ViT model using the well-known RowHammer can transform the model into a backdoored one. We perform extensive experiments of multiple datasets on various ViT models. TrojViT can classify 99.64% of test images to a target class by flipping 345 bits on a ViT for ImageNet.

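Summary: Vision Transformers (ViTs) achieve state-of-the-art results on many vision tasks, which makes them a target for backdoor attacks. Backdoor attacks on CNNs are well studied, but attacks on ViTs are not, and naively porting CNN-specific attacks to ViTs yields both low clean-data accuracy and a low attack success rate, because ViTs extract global context through patches and attention rather than pixel-wise local features. The paper proposes TrojViT, a stealthy and practical ViT-specific backdoor attack. Instead of an area-wise trigger, TrojViT generates a patch-wise trigger and uses patch salience ranking and an attention-target loss to build a Trojan from a few vulnerable bits of the ViT parameters stored in DRAM; parameter distillation then reduces the number of bits. Once those bits are flipped (the authors use RowHammer), the model keeps its normal accuracy on benign inputs but classifies any input carrying the trigger into a predefined target class. On ImageNet, TrojViT sends 99.64% of test images to the target class by flipping 345 bits of a ViT.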

Paper2 X-Pruner: eXplainable Pruning for Vision Transformers

Abstract: Recently vision transformer models have become prominent models for a range of tasks. These models, however, usually suffer from intensive computational costs and heavy memory requirements, making them impractical for deployment on edge platforms. Recent studies have proposed to prune transformers in an unexplainable manner, which overlook the relationship between internal units of the model and the target class, thereby leading to inferior performance. To alleviate this problem, we propose a novel explainable pruning framework dubbed X-Pruner, which is designed by considering the explainability of the pruning criterion. Specifically, to measure each prunable unit’s contribution to predicting each target class, a novel explainability-aware mask is proposed and learned in an end-to-end manner. Then, to preserve the most informative units and learn the layer-wise pruning rate, we adaptively search the layer-wise threshold that differentiates between unpruned and pruned units based on their explainability-aware mask values. To verify and evaluate our method, we apply the X-Pruner on representative transformer models including the DeiT and Swin Transformer. Comprehensive simulation results demonstrate that the proposed X-Pruner outperforms the state-of-the-art black-box methods with significantly reduced computational costs and slight performance degradation.

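Summary: Vision transformers are prominent across many tasks, but their computational cost and memory footprint make deployment on edge platforms impractical. Existing pruning methods work in an unexplainable way: they overlook the relationship between the model's internal units and the target class, which hurts performance. The paper proposes X-Pruner, an explainable pruning framework whose pruning criterion is designed with explainability in mind. An explainability-aware mask, learned end-to-end, measures each prunable unit's contribution to predicting each target class. To keep the most informative units and learn a layer-wise pruning rate, the method adaptively searches for a layer-wise threshold that separates pruned from unpruned units based on their mask values. Applied to DeiT and Swin Transformer, X-Pruner outperforms state-of-the-art black-box methods, with significantly lower computational cost and only slight performance degradation.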

Paper3 ViTs for SITS: Vision Transformers for Satellite Image Time Series

Abstract: In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue, that in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing and present experimental evidence for this claim. Additionally, we enhance the model’s discriminative power by introducing two novel mechanisms for acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin in three publicly available SITS semantic segmentation and classification datasets. All model, training and evaluation codes can be found at https://github.com/michaeltrs/DeepSatModels.

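Summary: The paper introduces the Temporo-Spatial Vision Transformer (TSViT), a fully attentional model based on the Vision Transformer (ViT) for general Satellite Image Time Series (SITS) processing. TSViT splits a SITS record into non-overlapping patches in space and time, tokenizes them, and processes the tokens with a factorized temporo-spatial encoder. The authors argue that, unlike for natural images, a temporal-then-spatial factorization is more intuitive for SITS, and they back the claim with experiments. Two further mechanisms strengthen the model's discriminative power: acquisition-time-specific temporal positional encodings and multiple learnable class tokens. An extensive ablation study evaluates each design choice. TSViT achieves state-of-the-art results on three public SITS semantic segmentation and classification datasets, surpassing previous approaches by a significant margin. Model, training, and evaluation code is available at https://github.com/michaeltrs/DeepSatModels.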

Paper4 NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers

Abstract: The complicated architecture and high training cost of vision transformers urge the exploration of post-training quantization. However, the heavy-tailed distribution of vision transformer activations hinders the effectiveness of previous post-training quantization methods, even with advanced quantizer designs. Instead of tuning the quantizer to better fit the complicated activation distribution, this paper proposes NoisyQuant, a quantizer-agnostic enhancement for the post-training activation quantization performance of vision transformers. We make a surprising theoretical discovery that for a given quantizer, adding a fixed Uniform noisy bias to the values being quantized can significantly reduce the quantization error under provable conditions. Building on the theoretical insight, NoisyQuant achieves the first success on actively altering the heavy-tailed activation distribution with additive noisy bias to fit a given quantizer. Extensive experiments show NoisyQuant largely improves the post-training quantization performance of vision transformer with minimal computation overhead. For instance, on linear uniform 6-bit activation quantization, NoisyQuant improves SOTA top-1 accuracy on ImageNet by up to 1.7%, 1.1% and 0.5% for ViT, DeiT, and Swin Transformer respectively, achieving on-par or even higher performance than previous nonlinear, mixed-precision quantization.

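Summary: The complicated architecture and high training cost of vision transformers motivate post-training quantization. However, the heavy-tailed distribution of their activations limits earlier post-training quantization methods, even with advanced quantizer designs. Rather than tuning the quantizer to fit the activation distribution, the paper proposes NoisyQuant, a quantizer-agnostic enhancement for post-training activation quantization. Its theoretical finding is that, for a given quantizer, adding a fixed uniform noisy bias to the values being quantized can significantly reduce quantization error under provable conditions. Building on this, NoisyQuant is the first method to actively alter the heavy-tailed activation distribution with an additive noisy bias so that it fits a given quantizer. Experiments show large gains with minimal computation overhead: with linear uniform 6-bit activation quantization, NoisyQuant improves state-of-the-art top-1 accuracy on ImageNet by up to 1.7%, 1.1%, and 0.5% for ViT, DeiT, and Swin Transformer respectively, matching or exceeding earlier nonlinear, mixed-precision quantization.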
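The mechanics of quantizing with a fixed noisy bias can be sketched in a few lines. This is a minimal illustration, not the NoisyQuant implementation: the function names, the noise range of half a quantization step, and the choice to subtract the same noise after quantization are all assumptions made for the example, and the sketch makes no claim about when the error actually decreases.

```python
import random


def uniform_quantize(x, step):
    """Round x to the nearest multiple of `step` (linear uniform quantizer, no clipping)."""
    return step * round(x / step)


def noisy_quantize(values, noise, step):
    """Quantize value + noise, then subtract the same fixed noise again.

    Subtracting the noise afterwards is an assumption of this sketch.
    """
    return [uniform_quantize(v + n, step) - n for v, n in zip(values, noise)]


rng = random.Random(0)
step = 0.5
values = [rng.gauss(0.0, 1.0) for _ in range(8)]

# Fixed noisy bias: sampled once from a uniform distribution and reused for
# every input. The range [-step/2, step/2] is a hypothetical choice.
noise = [rng.uniform(-step / 2, step / 2) for _ in values]

plain = [uniform_quantize(v, step) for v in values]
noisy = noisy_quantize(values, noise, step)

for v, p, q in zip(values, plain, noisy):
    print(f"x={v:+.3f}  plain error={abs(p - v):.3f}  noisy error={abs(q - v):.3f}")
```

Because the bias is sampled once and then held fixed, it can be prepared ahead of inference, which is consistent with the abstract's claim of minimal computation overhead.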

Paper5 Dual-Path Adaptation From Image to Video Transformers

Abstract: In this paper, we efficiently transfer the surpassing representation power of the vision foundation models, such as ViT and Swin, for video understanding with only a few trainable parameters. Previous adaptation methods have simultaneously considered spatial and temporal modeling with a unified learnable module but still suffered from fully leveraging the representative capabilities of image transformers. We argue that the popular dual-path (two-stream) architecture in video models can mitigate this problem. We propose a novel DUALPATH adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block. Especially for temporal dynamic modeling, we incorporate consecutive frames into a grid-like frameset to precisely imitate vision transformers’ capability that extrapolates relationships between tokens. In addition, we extensively investigate the multiple baselines from a unified perspective in video understanding and compare them with DUALPATH. Experimental results on four action recognition benchmarks prove that pretrained image transformers with DUALPATH can be effectively generalized beyond the data domain.

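Summary: The paper transfers the strong representation power of vision foundation models such as ViT and Swin to video understanding with only a few trainable parameters. Earlier adaptation methods handle spatial and temporal modeling in a single unified learnable module and fail to fully exploit the representational capability of image transformers. The authors argue that the dual-path (two-stream) architecture popular in video models mitigates this problem. They propose DUALPATH, an adaptation split into a spatial path and a temporal path, with a lightweight bottleneck adapter in each transformer block. For temporal dynamics, consecutive frames are arranged into a grid-like frameset so that the vision transformer can apply its ability to relate tokens across frames. The paper also studies several baselines from a unified perspective and compares them with DUALPATH. Results on four action recognition benchmarks show that pretrained image transformers with DUALPATH generalize effectively beyond their original data domain.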
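The grid-like frameset can be pictured as tiling consecutive frames into one larger image that an image transformer then patchifies as usual. The sketch below assumes a row-major layout and frames stored as nested lists; both are illustrative choices, and `frames_to_grid` is a hypothetical helper, not a function from the DUALPATH code.

```python
def frames_to_grid(frames, rows, cols):
    """Tile rows*cols frames (each an H x W list of lists) into one
    (rows*H) x (cols*W) image, in row-major order (assumed layout)."""
    if len(frames) != rows * cols:
        raise ValueError("need exactly rows * cols frames")
    height = len(frames[0])
    grid = []
    for r in range(rows):
        for y in range(height):
            line = []
            for c in range(cols):
                # Append row y of the frame that sits at grid cell (r, c).
                line.extend(frames[r * cols + c][y])
            grid.append(line)
    return grid


# Four consecutive 2x2 "frames", each filled with its own time index.
frames = [[[t, t], [t, t]] for t in range(4)]
grid = frames_to_grid(frames, rows=2, cols=2)
for line in grid:
    print(line)
```

With this layout, tokens from different time steps sit side by side in a single input, so ordinary self-attention over the grid relates patches across frames.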

Paper6 SViTT: Temporal Learning of Sparse Video-Text Transformers

Abstract: Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed the strong tendency of video-text models towards frame-based spatial representations, while temporal reasoning remains largely unsolved. In this work, we identify several key challenges in temporal learning of video-text transformers: the spatiotemporal trade-off from limited network size; the curse of dimensionality for multi-frame modeling; and the diminishing returns of semantic information by extending clip length. Guided by these findings, we propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity that limits the query-key communications between tokens in self-attention, and node sparsity that discards uninformative visual tokens. Trained with a curriculum which increases model sparsity with the clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks, with a fraction of computational cost. Project page: http://svcl.ucsd.edu/projects/svitt.

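Summary: The paper asks whether video-text transformers learn to model temporal relationships across frames. Despite their large capacity and abundant multimodal training data, recent work shows that these models lean strongly towards frame-based spatial representations, while temporal reasoning remains largely unsolved. The authors identify three key challenges in temporal learning: the spatiotemporal trade-off imposed by limited network size, the curse of dimensionality in multi-frame modeling, and the diminishing returns in semantic information as clips get longer. Guided by these findings, they propose SViTT, a sparse video-text architecture that performs multi-frame reasoning at significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT uses two forms of sparsity: edge sparsity, which limits query-key communication between tokens in self-attention, and node sparsity, which discards uninformative visual tokens. Trained with a curriculum that increases model sparsity with clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks at a fraction of the computational cost. Project page: http://svcl.ucsd.edu/projects/svitt.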

Paper7 An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling

Abstract: Masked visual modeling (MVM) has been recently proven effective for visual pre-training. While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies fail to find a truly effective MVM strategy that can largely benefit the downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where the supervision from MVM training can be backpropagated to the video pixel space. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the factors leading to effective MVM training, resulting in an enhanced model VIOLETv2. Empirically, we show VIOLETv2 pre-trained with MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering, video captioning, to text-to-video retrieval.

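Summary: Masked visual modeling (MVM) has recently proven effective for visual pre-training. Similar reconstructive objectives on video inputs, such as masked frame modeling, have been explored in video-language (VidL) pre-training, but earlier studies did not find an MVM strategy that substantially benefits downstream performance. The paper systematically examines the potential of MVM for VidL learning. The study builds on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), in which supervision from MVM training can be backpropagated to the video pixel space. Eight reconstruction targets are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. Comprehensive experiments reveal which factors make MVM training effective and lead to an enhanced model, VIOLETv2. Pre-trained with the MVM objective, VIOLETv2 achieves notable improvements on 13 VidL benchmarks covering video question answering, video captioning, and text-to-video retrieval.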

Paper8 Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention

Abstract: Humans possess a versatile mechanism for extracting structured representations of our visual world. When looking at an image, we can decompose the scene into entities and their parts as well as obtain the dependencies between them. To mimic such capability, we propose Visual Dependency Transformers (DependencyViT) that can induce visual dependencies without any labels. We achieve that with a novel neural operator called reversed attention that can naturally capture long-range visual dependencies between image patches. Specifically, we formulate it as a dependency graph where a child token in reversed attention is trained to attend to its parent tokens and send information following a normalized probability distribution rather than gathering information in conventional self-attention. With such a design, hierarchies naturally emerge from reversed attention layers, and a dependency tree is progressively induced from leaf nodes to the root node unsupervisedly. DependencyViT offers several appealing benefits. (i) Entities and their parts in an image are represented by different subtrees, enabling part partitioning from dependencies; (ii) Dynamic visual pooling is made possible. The leaf nodes which rarely send messages can be pruned without hindering the model performance, based on which we propose the lightweight DependencyViT-Lite to reduce the computational and memory footprints; (iii) DependencyViT works well on both self- and weakly-supervised pretraining paradigms on ImageNet, and demonstrates its effectiveness on 8 datasets and 5 tasks, such as unsupervised part and saliency segmentation, recognition, and detection.

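Summary: Humans have a versatile mechanism for extracting structured representations of the visual world: looking at an image, we decompose the scene into entities and their parts and recover the dependencies between them. To mimic this ability, the paper proposes Visual Dependency Transformers (DependencyViT), which induce visual dependencies without any labels. The key component is a new neural operator, reversed attention, which naturally captures long-range visual dependencies between image patches. It is formulated as a dependency graph in which a child token attends to its parent tokens and sends information according to a normalized probability distribution, instead of gathering information as in conventional self-attention. With this design, hierarchies emerge naturally from the reversed attention layers, and a dependency tree is induced progressively from the leaf nodes to the root without supervision. DependencyViT offers three benefits: (i) entities and their parts are represented by different subtrees, which enables part partitioning from dependencies; (ii) dynamic visual pooling becomes possible, since leaf nodes that rarely send messages can be pruned without hurting performance, which yields the lightweight DependencyViT-Lite with a smaller computational and memory footprint; (iii) DependencyViT works well under both self-supervised and weakly supervised pretraining on ImageNet and is effective on 8 datasets and 5 tasks, including unsupervised part and saliency segmentation, recognition, and detection.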

Paper9 Learning Imbalanced Data With Vision Transformers

Abstract: The real-world data tends to be heavily imbalanced and severely skew the data-driven deep neural networks, which makes Long-Tailed Recognition (LTR) a massive challenging task. Existing LTR methods seldom train Vision Transformers (ViTs) with Long-Tailed (LT) data, while the off-the-shelf pretrain weight of ViTs always leads to unfair comparisons. In this paper, we systematically investigate the ViTs’ performance in LTR and propose LiVT to train ViTs from scratch only with LT data. With the observation that ViTs suffer more severe LTR problems, we conduct Masked Generative Pretraining (MGP) to learn generalized features. With ample and solid evidence, we show that MGP is more robust than supervised manners. Although Binary Cross Entropy (BCE) loss performs well with ViTs, it struggles on the LTR tasks. We further propose the balanced BCE to ameliorate it with strong theoretical groundings. Specially, we derive the unbiased extension of Sigmoid and compensate extra logit margins for deploying it. Our Bal-BCE contributes to the quick convergence of ViTs in just a few epochs. Extensive experiments demonstrate that with MGP and Bal-BCE, LiVT successfully trains ViTs well without any additional data and outperforms comparable state-of-the-art methods significantly, e.g., our ViT-B achieves 81.0% Top-1 accuracy in iNaturalist 2018 without bells and whistles. Code is available at https://github.com/XuZhengzhuo/LiVT.

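Summary: Real-world data is often heavily imbalanced, which severely skews data-driven deep neural networks and makes Long-Tailed Recognition (LTR) a very challenging task. Existing LTR methods seldom train Vision Transformers (ViTs) on long-tailed (LT) data, and using off-the-shelf pretrained ViT weights leads to unfair comparisons. The paper systematically investigates ViT performance in LTR and proposes LiVT, which trains ViTs from scratch using only LT data. Observing that ViTs suffer more severely from LTR problems, the authors apply Masked Generative Pretraining (MGP) to learn generalized features and show that MGP is more robust than supervised pretraining. Binary Cross Entropy (BCE) loss performs well with ViTs but struggles on LTR tasks, so the authors propose a balanced BCE (Bal-BCE) with theoretical grounding: they derive an unbiased extension of the sigmoid and add compensating logit margins to deploy it. Bal-BCE helps ViTs converge within a few epochs. With MGP and Bal-BCE, LiVT trains ViTs well without any additional data and significantly outperforms comparable state-of-the-art methods; for example, ViT-B reaches 81.0% top-1 accuracy on iNaturalist 2018 without bells and whistles. Code is available at https://github.com/XuZhengzhuo/LiVT.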

Paper10 Mask3D: Pre-Training 2D Vision Transformers by Learning Masked 3D Priors

Abstract: Current popular backbones in computer vision, such as Vision Transformers (ViT) and ResNets are trained to perceive the world from 2D images. However, to more effectively understand 3D structural priors in 2D backbones, we propose Mask3D to leverage existing large-scale RGB-D data in a self-supervised pre-training to embed these 3D priors into 2D learned feature representations. In contrast to traditional 3D contrastive learning paradigms requiring 3D reconstructions or multi-view correspondences, our approach is simple: we formulate a pre-text reconstruction task by masking RGB and depth patches in individual RGB-D frames. We demonstrate the Mask3D is particularly effective in embedding 3D priors into the powerful 2D ViT backbone, enabling improved representation learning for various scene understanding tasks, such as semantic segmentation, instance segmentation and object detection. Experiments show that Mask3D notably outperforms existing self-supervised 3D pre-training approaches on ScanNet, NYUv2, and Cityscapes image understanding tasks, with an improvement of +6.5% mIoU against the state-of-the-art Pri3D on ScanNet image semantic segmentation.

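Summary: Popular computer vision backbones such as Vision Transformers (ViT) and ResNets are trained to perceive the world from 2D images. To give 2D backbones a better grasp of 3D structural priors, the paper proposes Mask3D, which uses existing large-scale RGB-D data in self-supervised pre-training to embed these priors into learned 2D feature representations. Unlike traditional 3D contrastive learning, which needs 3D reconstructions or multi-view correspondences, the approach is simple: it defines a pretext reconstruction task by masking RGB and depth patches in individual RGB-D frames. Mask3D is particularly effective at embedding 3D priors into the 2D ViT backbone and improves representation learning for scene understanding tasks such as semantic segmentation, instance segmentation, and object detection. It notably outperforms existing self-supervised 3D pre-training approaches on ScanNet, NYUv2, and Cityscapes image understanding tasks, with a gain of +6.5% mIoU over the state-of-the-art Pri3D on ScanNet image semantic segmentation.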

Paper11 Vision Transformers Are Good Mask Auto-Labelers

Abstract: We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates their mask pseudo-labels. We show that Vision Transformers are good mask auto-labelers. Our method significantly reduces the gap between auto-labeling and human annotation regarding mask quality. Instance segmentation models trained using the MAL-generated masks can nearly match the performance of their fully-supervised counterparts, retaining up to 97.4% performance of fully supervised models. The best model achieves 44.1% mAP on COCO instance segmentation (test-dev 2017), outperforming state-of-the-art box-supervised methods by significant margins. Qualitative results indicate that masks produced by MAL are, in some cases, even better than human annotations.

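Summary: The paper proposes Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation that needs only box annotations. MAL takes box-cropped images as input and conditionally generates mask pseudo-labels for them. The results show that Vision Transformers are good mask auto-labelers, and the method significantly narrows the gap in mask quality between auto-labeling and human annotation. Instance segmentation models trained on MAL-generated masks nearly match their fully supervised counterparts, retaining up to 97.4% of fully supervised performance. The best model achieves 44.1% mAP on COCO instance segmentation (test-dev 2017), outperforming state-of-the-art box-supervised methods by significant margins. Qualitative results indicate that MAL masks are, in some cases, even better than human annotations.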

Paper12 Content-Aware Token Sharing for Efficient Semantic Segmentation With Vision Transformers

Abstract: This paper introduces Content-aware Token Sharing (CTS), a token reduction approach that improves the computational efficiency of semantic segmentation networks that use Vision Transformers (ViTs). Existing works have proposed token reduction approaches to improve the efficiency of ViT-based image classification networks, but these methods are not directly applicable to semantic segmentation, which we address in this work. We observe that, for semantic segmentation, multiple image patches can share a token if they contain the same semantic class, as they contain redundant information. Our approach leverages this by employing an efficient, class-agnostic policy network that predicts if image patches contain the same semantic class, and lets them share a token if they do. With experiments, we explore the critical design choices of CTS and show its effectiveness on the ADE20K, Pascal Context and Cityscapes datasets, various ViT backbones, and different segmentation decoders. With Content-aware Token Sharing, we are able to reduce the number of processed tokens by up to 44%, without diminishing the segmentation quality.

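Summary: The paper introduces Content-aware Token Sharing (CTS), a token reduction approach that improves the computational efficiency of semantic segmentation networks built on Vision Transformers (ViTs). Existing token reduction methods target ViT-based image classification and are not directly applicable to semantic segmentation, a gap this work addresses. The key observation is that, for semantic segmentation, several image patches can share one token if they contain the same semantic class, because their information is redundant. CTS exploits this with an efficient, class-agnostic policy network that predicts whether image patches contain the same semantic class and lets them share a token if they do. Experiments explore the critical design choices and show effectiveness on ADE20K, Pascal Context, and Cityscapes, across various ViT backbones and segmentation decoders. CTS reduces the number of processed tokens by up to 44% without diminishing segmentation quality.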
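The sharing step can be illustrated with a toy example. The sketch below assumes that patches are grouped into 2x2 blocks, that a policy has already produced one boolean per block, and that a shared token is the mean of the four patch embeddings. These are assumptions made for illustration; the abstract does not specify the block size or how the shared token is formed, and `share_tokens` is a hypothetical helper.

```python
def share_tokens(patches, share):
    """Reduce a grid of patch embeddings to a token list.

    patches: R x C grid of embedding vectors (R and C even).
    share:   (R/2) x (C/2) grid of booleans from a hypothetical policy network;
             True means the four patches in that 2x2 block share one token.
    """
    tokens = []
    for i in range(0, len(patches), 2):
        for j in range(0, len(patches[0]), 2):
            block = [patches[i][j], patches[i][j + 1],
                     patches[i + 1][j], patches[i + 1][j + 1]]
            if share[i // 2][j // 2]:
                # Assumed merge rule: average the four embeddings.
                dim = len(block[0])
                tokens.append([sum(p[d] for p in block) / 4 for d in range(dim)])
            else:
                tokens.extend(block)
    return tokens


# 4x4 grid of 1-dimensional embeddings; two of the four blocks are shared.
patches = [[[float(i * 4 + j)] for j in range(4)] for i in range(4)]
share = [[True, False], [False, True]]
tokens = share_tokens(patches, share)
print(len(tokens), "tokens instead of", 16)
```

In this toy case 16 patches become 10 tokens; the actual saving depends on how many blocks the policy marks as shareable.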

Paper13 Visual Atoms: Pre-Training Vision Transformers With Sinusoidal Waves

Abstract: Formula-driven supervised learning (FDSL) has been shown to be an effective method for pre-training vision transformers, where ExFractalDB-21k was shown to exceed the pre-training effect of ImageNet-21k. These studies also indicate that contours mattered more than textures when pre-training vision transformers. However, the lack of a systematic investigation as to why these contour-oriented synthetic datasets can achieve the same accuracy as real datasets leaves much room for skepticism. In the present work, we develop a novel methodology based on circular harmonics for systematically investigating the design space of contour-oriented synthetic datasets. This allows us to efficiently search the optimal range of FDSL parameters and maximize the variety of synthetic images in the dataset, which we found to be a critical factor. When the resulting new dataset VisualAtom-21k is used for pre-training ViT-Base, the top-1 accuracy reached 83.7% when fine-tuning on ImageNet-1k. This is only 0.5% difference from the top-1 accuracy (84.2%) achieved by the JFT-300M pre-training, even though the scale of images is 1/14. Unlike JFT-300M which is a static dataset, the quality of synthetic datasets will continue to improve, and the current work is a testament to this possibility. FDSL is also free of the common issues associated with real images, e.g. privacy/copyright issues, labeling costs/errors, and ethical biases.

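Summary: Formula-driven supervised learning (FDSL) has been shown to be effective for pre-training vision transformers; ExFractalDB-21k, for instance, exceeded the pre-training effect of ImageNet-21k. Those studies also indicate that contours matter more than textures when pre-training vision transformers. However, no systematic investigation has explained why contour-oriented synthetic datasets can match the accuracy of real datasets, which leaves room for skepticism. The paper develops a methodology based on circular harmonics for systematically exploring the design space of contour-oriented synthetic datasets. This makes it possible to search efficiently for the optimal range of FDSL parameters and to maximize the variety of synthetic images in the dataset, which the authors found to be a critical factor. Pre-training ViT-Base on the resulting dataset, VisualAtom-21k, gives 83.7% top-1 accuracy after fine-tuning on ImageNet-1k. That is only 0.5% below the 84.2% achieved with JFT-300M pre-training, even though the number of images is 1/14 as large. Unlike JFT-300M, which is a static dataset, synthetic datasets can keep improving in quality. FDSL is also free of common issues with real images, such as privacy and copyright concerns, labeling costs and errors, and ethical biases.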

Paper14 Correspondence Transformers With Asymmetric Feature Learning and Matching Flow Super-Resolution

Abstract: This paper solves the problem of learning dense visual correspondences between different object instances of the same category with only sparse annotations. We decompose this pixel-level semantic matching problem into two easier ones: (i) First, local feature descriptors of source and target images need to be mapped into shared semantic spaces to get coarse matching flows. (ii) Second, matching flows in low resolution should be refined to generate accurate point-to-point matching results. We propose asymmetric feature learning and matching flow super-resolution based on vision transformers to solve the above problems. The asymmetric feature learning module exploits a biased cross-attention mechanism to encode token features of source images with their target counterparts. Then matching flow in low resolutions is enhanced by a super-resolution network to get accurate correspondences. Our pipeline is built upon vision transformers and can be trained in an end-to-end manner. Extensive experimental results on several popular benchmarks, such as PF-PASCAL, PF-WILLOW, and SPair-71K, demonstrate that the proposed method can catch subtle semantic differences in pixels efficiently. Code is available on https://github.com/YXSUNMADMAX/ACTR.

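Summary: The paper addresses learning dense visual correspondences between different object instances of the same category using only sparse annotations. It decomposes this pixel-level semantic matching problem into two easier ones: (i) mapping the local feature descriptors of source and target images into shared semantic spaces to obtain coarse matching flows, and (ii) refining the low-resolution matching flows into accurate point-to-point matches. To solve them, the authors propose asymmetric feature learning and matching flow super-resolution, both based on vision transformers. The asymmetric feature learning module uses a biased cross-attention mechanism to encode the token features of source images with their target counterparts. A super-resolution network then enhances the low-resolution matching flow to produce accurate correspondences. The pipeline is built on vision transformers and can be trained end-to-end. Experiments on popular benchmarks such as PF-PASCAL, PF-WILLOW, and SPair-71K show that the method efficiently captures subtle semantic differences between pixels. Code is available at https://github.com/YXSUNMADMAX/ACTR.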

Paper15 Feature Shrinkage Pyramid for Camouflaged Object Detection With Transformers

Abstract: Vision transformers have recently shown strong global context modeling capabilities in camouflaged object detection. However, they suffer from two major limitations: less effective locality modeling and insufficient feature aggregation in decoders, which are not conducive to camouflaged object detection that explores subtle cues from indistinguishable backgrounds. To address these issues, in this paper, we propose a novel transformer-based Feature Shrinkage Pyramid Network (FSPNet), which aims to hierarchically decode locality-enhanced neighboring transformer features through progressive shrinking for camouflaged object detection. Specifically, we propose a non-local token enhancement module (NL-TEM) that employs the non-local mechanism to interact neighboring tokens and explore graph-based high-order relations within tokens to enhance local representations of transformers. Moreover, we design a feature shrinkage decoder (FSD) with adjacent interaction modules (AIM), which progressively aggregates adjacent transformer features through a layer-by-layer shrinkage pyramid to accumulate imperceptible but effective cues as much as possible for object information decoding. Extensive quantitative and qualitative experiments demonstrate that the proposed model significantly outperforms the existing 24 competitors on three challenging COD benchmark datasets under six widely-used evaluation metrics. Our code is publicly available at https://github.com/ZhouHuang23/FSPNet.

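Summary: Vision transformers show strong global context modeling in camouflaged object detection (COD) but have two major limitations: less effective locality modeling and insufficient feature aggregation in decoders. Both hurt COD, which depends on subtle cues in backgrounds that are hard to tell apart from the object. The paper proposes the transformer-based Feature Shrinkage Pyramid Network (FSPNet), which hierarchically decodes locality-enhanced neighboring transformer features through progressive shrinking. A non-local token enhancement module (NL-TEM) uses the non-local mechanism to let neighboring tokens interact and explores graph-based high-order relations within tokens to strengthen the transformer's local representations. A feature shrinkage decoder (FSD) with adjacent interaction modules (AIM) progressively aggregates adjacent transformer features through a layer-by-layer shrinkage pyramid, accumulating as many imperceptible but effective cues as possible for decoding object information. Quantitative and qualitative experiments show that the model significantly outperforms 24 existing competitors on three challenging COD benchmark datasets under six widely used evaluation metrics. Code is publicly available at https://github.com/ZhouHuang23/FSPNet.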

Paper16 Improving Robustness of Vision Transformers by Reducing Sensitivity To Patch Corruptions

Abstract: Despite their success, vision transformers still remain vulnerable to image corruptions, such as noise or blur. Indeed, we find that the vulnerability mainly stems from the unstable self-attention mechanism, which is inherently built upon patch-based inputs and often becomes overly sensitive to the corruptions across patches. For example, when we only occlude a small number of patches with random noise (e.g., 10%), these patch corruptions would lead to severe accuracy drops and greatly distract intermediate attention layers. To address this, we propose a new training method that improves the robustness of transformers from a new perspective – reducing sensitivity to patch corruptions (RSPC). Specifically, we first identify and occlude/corrupt the most vulnerable patches and then explicitly reduce sensitivity to them by aligning the intermediate features between clean and corrupted examples. We highlight that the construction of patch corruptions is learned adversarially to the following feature alignment process, which is particularly effective and essentially different from existing methods. In experiments, our RSPC greatly improves the stability of attention layers and consistently yields better robustness on various benchmarks, including CIFAR-10/100-C, ImageNet-A, ImageNet-C, and ImageNet-P.

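Summary: Despite their success, vision transformers remain vulnerable to image corruptions such as noise or blur. The authors find that this vulnerability stems mainly from the unstable self-attention mechanism, which is built on patch-based inputs and often becomes overly sensitive to corruptions across patches. For example, occluding only a small share of patches (e.g., 10%) with random noise leads to severe accuracy drops and strongly distracts the intermediate attention layers. The paper proposes a training method that improves robustness from a new perspective: reducing sensitivity to patch corruptions (RSPC). The method first identifies and occludes or corrupts the most vulnerable patches, then explicitly reduces sensitivity to them by aligning the intermediate features of clean and corrupted examples. The construction of patch corruptions is learned adversarially against the subsequent feature alignment process, which is particularly effective and essentially different from existing methods. In experiments, RSPC greatly improves the stability of attention layers and consistently yields better robustness on benchmarks including CIFAR-10/100-C, ImageNet-A, ImageNet-C, and ImageNet-P.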
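The motivating experiment, occluding a small fraction of patches with random noise, is easy to reproduce on a toy image. The sketch below picks patches uniformly at random, which matches the abstract's example but not the RSPC method itself, where the most vulnerable patches are selected adversarially. `occlude_patches` and its parameters are hypothetical names.

```python
import random


def occlude_patches(image, patch, ratio, rng):
    """Replace a random `ratio` of patch x patch blocks with uniform noise.

    image: H x W list of lists (H and W divisible by `patch`).
    Returns (occluded copy, sorted indices of the occluded patches).
    Random selection is an assumption of this sketch; RSPC learns which
    patches to corrupt.
    """
    rows, cols = len(image) // patch, len(image[0]) // patch
    n_occluded = round(ratio * rows * cols)
    chosen = rng.sample(range(rows * cols), n_occluded)
    out = [row[:] for row in image]  # leave the input untouched
    for idx in chosen:
        top, left = (idx // cols) * patch, (idx % cols) * patch
        for y in range(top, top + patch):
            for x in range(left, left + patch):
                out[y][x] = rng.random()
    return out, sorted(chosen)


demo_rng = random.Random(0)
clean = [[0.0] * 8 for _ in range(8)]          # 8x8 image, i.e. 16 patches of 2x2
corrupted, occluded = occlude_patches(clean, patch=2, ratio=0.1, rng=demo_rng)
print("occluded patch indices:", occluded)
```

Feeding both `clean` and `corrupted` through a model and comparing intermediate features is the kind of measurement the feature alignment step in the abstract builds on.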

Paper17 RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving

Abstract: Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via range projection, is an effective and popular approach. These projection-based methods usually benefit from fast computations and, when combined with techniques which use other point cloud representations, achieve state-of-the-art results. Today, projection-based methods leverage 2D CNNs but recent advances in computer vision show that vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks. In this work, we question if projection-based methods for 3D semantic segmentation can benefit from these latest improvements on ViTs. We answer positively but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections that are much cheaper to acquire and annotate than point clouds. We reach our best results with pre-trained ViTs on large image datasets. (b) We compensate ViTs’ lack of inductive bias by substituting a tailored convolutional stem for the classical linear embedding layer. (c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem to combine low-level but fine-grained features of the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. The code is available at https://github.com/valeoai/rangevit.

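Summary: Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, for example via range projection, is an effective and popular approach. Projection-based methods are fast and, combined with techniques that use other point cloud representations, achieve state-of-the-art results. They currently rely on 2D CNNs, while vision transformers (ViTs) now lead many image-based benchmarks. The paper asks whether projection-based 3D semantic segmentation can benefit from these advances in ViTs and answers yes, provided three ingredients are combined: (a) ViTs are notoriously hard to train and need a lot of training data, so the method keeps the same backbone architecture as for RGB images and exploits knowledge from long training on large image collections, which are much cheaper to acquire and annotate than point clouds; the best results come from ViTs pre-trained on large image datasets. (b) A tailored convolutional stem replaces the classical linear embedding layer to compensate for the ViT's lack of inductive bias. (c) A convolutional decoder with a skip connection from the convolutional stem refines pixel-wise predictions, combining the stem's low-level but fine-grained features with the ViT encoder's high-level but coarse predictions. With these ingredients, the method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. Code is available at https://github.com/valeoai/rangevit.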

Paper18 Efficient Frequency Domain-Based Transformers for High-Quality Image Deblurring

Abstract: We present an effective and efficient method that explores the properties of Transformers in the frequency domain for high-quality image deblurring. Our method is motivated by the convolution theorem that the correlation or convolution of two signals in the spatial domain is equivalent to an element-wise product of them in the frequency domain. This inspires us to develop an efficient frequency domain-based self-attention solver (FSAS) to estimate the scaled dot-product attention by an element-wise product operation instead of the matrix multiplication in the spatial domain. In addition, we note that simply using the naive feed-forward network (FFN) in Transformers does not generate good deblurred results. To overcome this problem, we propose a simple yet effective discriminative frequency domain-based FFN (DFFN), where we introduce a gated mechanism in the FFN based on the Joint Photographic Experts Group (JPEG) compression algorithm to discriminatively determine which low- and high-frequency information of the features should be preserved for latent clear image restoration. We formulate the proposed FSAS and DFFN into an asymmetrical network based on an encoder and decoder architecture, where the FSAS is only used in the decoder module for better image deblurring. Experimental results show that the proposed method performs favorably against the state-of-the-art approaches.

Summary: The paper presents an effective and efficient method that exploits properties of Transformers in the frequency domain for high-quality image deblurring. It builds on the convolution theorem: correlation or convolution of two signals in the spatial domain is equivalent to their element-wise product in the frequency domain. On that basis the authors develop an efficient frequency domain-based self-attention solver (FSAS) that estimates scaled dot-product attention with an element-wise product instead of a spatial-domain matrix multiplication. Because a naive feed-forward network (FFN) does not yield good deblurred results, they also propose a discriminative frequency domain-based FFN (DFFN), whose gating mechanism, based on the JPEG compression algorithm, decides which low- and high-frequency feature information to preserve for restoring the latent clear image. FSAS and DFFN are assembled into an asymmetric encoder-decoder network in which FSAS is used only in the decoder. Experiments show that the method performs favorably against state-of-the-art approaches.
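
The convolution theorem that motivates FSAS can be checked numerically. The sketch below only demonstrates the theorem on short real-valued 1D signals with a naive DFT; it is not the paper's attention module, and all function names are illustrative.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse includes the 1/n factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def circular_correlation_direct(a, b):
    """c[s] = sum_t a[t] * b[(t + s) mod n], computed in the spatial domain."""
    n = len(a)
    return [sum(a[t] * b[(t + s) % n] for t in range(n)) for s in range(n)]

def circular_correlation_via_dft(a, b):
    """Same result via an element-wise product conj(A[k]) * B[k] in the frequency domain."""
    fa, fb = dft(a), dft(b)
    product = [x.conjugate() * y for x, y in zip(fa, fb)]
    return [v.real for v in dft(product, inverse=True)]

a = [1.0, 2.0, 3.0, 4.0]
b = [0.0, 1.0, 0.5, 2.0]
print(circular_correlation_direct(a, b))
print(circular_correlation_via_dft(a, b))
```

With a fast Fourier transform in place of the naive DFT, the frequency-domain route costs O(n log n) instead of O(n^2), which is the efficiency argument behind replacing the spatial matrix product.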

Paper19 Towards End-to-End Generative Modeling of Long Videos With Memory-Efficient Bidirectional Transformers

Abstract: Autoregressive transformers have shown remarkable success in video generation. However, the transformers are prohibited from directly learning the long-term dependency in videos due to the quadratic complexity of self-attention, and inherently suffering from slow inference time and error propagation due to the autoregressive process. In this paper, we propose Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependency in videos and fast inference. Based on recent advances in bidirectional transformers, our method learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches. The proposed transformer achieves a linear time complexity in both encoding and decoding, by projecting observable context tokens into a fixed number of latent tokens and conditioning them to decode the masked tokens through the cross-attention. Empowered by linear complexity and bidirectional modeling, our method demonstrates significant improvement over the autoregressive Transformers for generating moderately long videos in both quality and speed.

Summary: Autoregressive transformers have been remarkably successful in video generation, but the quadratic complexity of self-attention prevents them from directly learning long-term dependencies in videos, and the autoregressive process makes inference slow and prone to error propagation. The paper proposes the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependencies and fast inference. Building on recent bidirectional transformers, MeBT learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches. It reaches linear time complexity in both encoding and decoding by projecting the observable context tokens onto a fixed number of latent tokens and conditioning on those latents, through cross-attention, to decode the masked tokens. Thanks to linear complexity and bidirectional modeling, the method significantly improves on autoregressive Transformers in both quality and speed when generating moderately long videos.
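
A rough count of query-key pairs shows why routing attention through a fixed number of latent tokens is linear in sequence length. This is a back-of-the-envelope sketch under stated assumptions: it ignores any self-attention among the latents (a constant cost for a fixed latent count), and the token counts and the value 64 for the number of latents are made up for illustration.

```python
def full_self_attention_pairs(num_tokens):
    """Query-key pairs scored by full self-attention over all tokens."""
    return num_tokens * num_tokens

def latent_bottleneck_pairs(num_context, num_masked, num_latents):
    """Pairs scored when latents read the context, then masked tokens read the latents."""
    encode = num_latents * num_context  # latents attend to observed context tokens
    decode = num_masked * num_latents   # masked tokens attend to the latents
    return encode + decode

for n in (1_000, 2_000, 4_000):
    context = n // 4           # assumed share of observed tokens
    masked = n - context
    print(n, full_self_attention_pairs(n), latent_bottleneck_pairs(context, masked, 64))
```

Doubling the number of tokens quadruples the first count but only doubles the second.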

Paper20 MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers

Abstract: In this paper, we propose Mixed and Masked AutoEncoder (MixMAE), a simple but efficient pretraining method that is applicable to various hierarchical Vision Transformers. Existing masked image modeling (MIM) methods for hierarchical Vision Transformers replace a random subset of input tokens with a special [MASK] symbol and aim at reconstructing original image tokens from the corrupted image. However, we find that using the [MASK] symbol greatly slows down the training and causes pretraining-finetuning inconsistency, due to the large masking ratio (e.g., 60% in SimMIM). On the other hand, MAE does not introduce [MASK] tokens at its encoder at all but is not applicable for hierarchical Vision Transformers. To solve the issue and accelerate the pretraining of hierarchical models, we replace the masked tokens of one image with visible tokens of another image, i.e., creating a mixed image. We then conduct dual reconstruction to reconstruct the two original images from the mixed input, which significantly improves efficiency. While MixMAE can be applied to various hierarchical Transformers, this paper explores using Swin Transformer with a large window size and scales up to huge model size (to reach 600M parameters). Empirical results demonstrate that MixMAE can learn high-quality visual representations efficiently. Notably, MixMAE with Swin-B/W14 achieves 85.1% top-1 accuracy on ImageNet-1K by pretraining for 600 epochs. Besides, its transfer performances on the other 6 datasets show that MixMAE has better FLOPs / performance tradeoff than previous popular MIM methods.

Summary: The paper proposes Mixed and Masked AutoEncoder (MixMAE), a simple but efficient pretraining method applicable to various hierarchical Vision Transformers. Existing masked image modeling (MIM) methods for hierarchical Vision Transformers replace a random subset of input tokens with a special [MASK] symbol and reconstruct the original image tokens from the corrupted image. The authors find that the [MASK] symbol greatly slows training and causes pretraining-finetuning inconsistency because of the large masking ratio (e.g., 60% in SimMIM). MAE introduces no [MASK] tokens in its encoder but is not applicable to hierarchical Vision Transformers. MixMAE instead replaces the masked tokens of one image with the visible tokens of another, creating a mixed image, and then performs dual reconstruction of the two original images from the mixed input, which significantly improves efficiency. Although MixMAE applies to various hierarchical Transformers, the paper explores Swin Transformer with a large window size and scales it up to 600M parameters. MixMAE with Swin-B/W14 reaches 85.1% top-1 accuracy on ImageNet-1K after 600 epochs of pretraining, and transfer results on 6 other datasets show a better FLOPs/performance trade-off than previous popular MIM methods.
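
The mixing step can be sketched on plain lists of tokens. This is an illustrative sketch, not the MixMAE implementation: the function name, the uniform random choice of masked positions, and the string tokens are assumptions.

```python
import random

def mix_tokens(tokens_a, tokens_b, mask_ratio=0.5, seed=0):
    """Fill the masked positions of image A with image B's tokens at the same positions.

    Returns the mixed sequence and a flag per position telling which image it came from.
    """
    assert len(tokens_a) == len(tokens_b)
    n = len(tokens_a)
    rng = random.Random(seed)
    masked = set(rng.sample(range(n), int(n * mask_ratio)))  # positions masked in A
    mixed = [tokens_b[i] if i in masked else tokens_a[i] for i in range(n)]
    from_a = [i not in masked for i in range(n)]
    return mixed, from_a

a = ["a%d" % i for i in range(8)]
b = ["b%d" % i for i in range(8)]
mixed, from_a = mix_tokens(a, b)
print(mixed)
print(from_a)
```

In dual reconstruction, the positions flagged `False` are the ones a decoder would have to recover for image A, and the positions flagged `True` are the ones it would have to recover for image B, so no [MASK] placeholder ever enters the encoder.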

Paper21 Transferable Adversarial Attacks on Vision Transformers With Token Gradient Regularization

Abstract: Vision transformers (ViTs) have been successfully deployed in a variety of computer vision tasks, but they are still vulnerable to adversarial samples. Transfer-based attacks use a local model to generate adversarial samples and directly transfer them to attack a target black-box model. The high efficiency of transfer-based attacks makes it a severe security threat to ViT-based applications. Therefore, it is vital to design effective transfer-based attacks to identify the deficiencies of ViTs beforehand in security-sensitive scenarios. Existing efforts generally focus on regularizing the input gradients to stabilize the updated direction of adversarial samples. However, the variance of the back-propagated gradients in intermediate blocks of ViTs may still be large, which may make the generated adversarial samples focus on some model-specific features and get stuck in poor local optima. To overcome the shortcomings of existing approaches, we propose the Token Gradient Regularization (TGR) method. According to the structural characteristics of ViTs, TGR reduces the variance of the back-propagated gradient in each internal block of ViTs in a token-wise manner and utilizes the regularized gradient to generate adversarial samples. Extensive experiments on attacking both ViTs and CNNs confirm the superiority of our approach. Notably, compared to the state-of-the-art transfer-based attacks, our TGR offers a performance improvement of 8.8% on average.

Summary: Vision transformers (ViTs) are widely deployed in computer vision but remain vulnerable to adversarial samples. Transfer-based attacks generate adversarial samples on a local model and transfer them directly to a black-box target model; their efficiency makes them a severe security threat to ViT-based applications, so designing effective transfer-based attacks is vital for exposing the deficiencies of ViTs in advance in security-sensitive scenarios. Existing work generally regularizes the input gradients to stabilize the update direction of adversarial samples. The variance of the back-propagated gradients in the intermediate blocks of ViTs may nevertheless stay large, so the generated samples can focus on model-specific features and get stuck in poor local optima. The proposed Token Gradient Regularization (TGR) exploits the structure of ViTs to reduce, token by token, the variance of the back-propagated gradient in each internal block, and uses the regularized gradient to generate adversarial samples. Extensive experiments attacking both ViTs and CNNs confirm its advantage: compared with state-of-the-art transfer-based attacks, TGR improves performance by 8.8% on average.

Paper22 Making Vision Transformers Efficient From a Token Sparsification View

Abstract: The quadratic computational complexity to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose to prune redundant tokens to achieve efficient ViTs. However, these methods generally suffer from (i) dramatic accuracy drops, (ii) application difficulty in the local vision transformer, and (iii) non-general-purpose networks for downstream tasks. In this work, we propose a novel Semantic Token ViT (STViT), for efficient global and local vision transformers, which can also be revised to serve as backbone for downstream tasks. The semantic tokens represent cluster centers, and they are initialized by pooling image tokens in space and recovered by attention, which can adaptively represent global or local semantic information. Due to the cluster properties, a few semantic tokens can attain the same effect as vast image tokens, for both global and local vision transformers. For instance, only 16 semantic tokens on DeiT-(Tiny,Small,Base) can achieve the same accuracy with more than 100% inference speed improvement and nearly 60% FLOPs reduction; on Swin-(Tiny,Small,Base), we can employ 16 semantic tokens in each window to further speed it up by around 20% with slight accuracy increase. Besides great success in image classification, we also extend our method to video recognition. In addition, we design a STViT-R(ecovery) network to restore the detailed spatial information based on the STViT, making it work for downstream tasks, which is powerless for previous token sparsification methods. Experiments demonstrate that our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for backbone.

Summary: Computational complexity that is quadratic in the number of tokens limits the practical use of Vision Transformers (ViTs). Several works prune redundant tokens to make ViTs efficient, but they generally suffer from (i) dramatic accuracy drops, (ii) difficulty of application to local vision transformers, and (iii) networks that are not general-purpose for downstream tasks. The paper proposes Semantic Token ViT (STViT) for efficient global and local vision transformers, which can also be adapted to serve as a backbone for downstream tasks. Semantic tokens represent cluster centers: they are initialized by pooling image tokens in space and recovered by attention, so they can adaptively represent global or local semantic information. Thanks to this clustering property, a few semantic tokens can match the effect of a large number of image tokens. With only 16 semantic tokens, DeiT-(Tiny,Small,Base) keeps the same accuracy while inference speed improves by more than 100% and FLOPs drop by nearly 60%; on Swin-(Tiny,Small,Base), 16 semantic tokens per window give a further speed-up of around 20% with a slight accuracy increase. The method is also extended to video recognition. In addition, an STViT-R(ecovery) network restores detailed spatial information on top of STViT so that it works for downstream tasks, which previous token sparsification methods could not handle. In object detection and instance segmentation the method achieves results competitive with the original networks while cutting backbone FLOPs by over 30%.

Paper23 TokenHPE: Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers

Abstract: Head pose estimation (HPE) has been widely used in the fields of human machine interaction, self-driving, and attention estimation. However, existing methods cannot deal with extreme head pose randomness and serious occlusions. To address these challenges, we identify three cues from head images, namely, neighborhood similarities, significant facial changes, and critical minority relationships. To leverage the observed findings, we propose a novel critical minority relationship-aware method based on the Transformer architecture in which the facial part relationships can be learned. Specifically, we design several orientation tokens to explicitly encode the basic orientation regions. Meanwhile, a novel token guide multi-loss function is designed to guide the orientation tokens as they learn the desired regional similarities and relationships. We evaluate the proposed method on three challenging benchmark HPE datasets. Experiments show that our method achieves better performance compared with state-of-the-art methods. Our code is publicly available at https://github.com/zc2023/TokenHPE.

Summary: Head pose estimation (HPE) is widely used in human-machine interaction, self-driving, and attention estimation, but existing methods cannot cope with extreme head pose randomness and serious occlusions. To address this, the authors identify three cues in head images: neighborhood similarities, significant facial changes, and critical minority relationships. Based on these findings they propose a critical minority relationship-aware method built on the Transformer architecture, in which relationships between facial parts can be learned. Specifically, several orientation tokens explicitly encode the basic orientation regions, and a token-guided multi-loss function steers the orientation tokens toward the desired regional similarities and relationships. Evaluated on three challenging HPE benchmark datasets, the method outperforms state-of-the-art methods. Code: https://github.com/zc2023/TokenHPE.

Paper24 Recurrent Vision Transformers for Object Detection With Event Cameras

Abstract: We present Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras. Event cameras provide visual information with sub-millisecond latency at a high-dynamic range and with strong robustness against motion blur. These unique properties offer great potential for low-latency object detection and tracking in time-critical scenarios. Prior work in event-based vision has achieved outstanding detection performance but at the cost of substantial inference time, typically beyond 40 milliseconds. By revisiting the high-level design of recurrent vision backbones, we reduce inference time by a factor of 6 while retaining similar performance. To achieve this, we explore a multi-stage design that utilizes three key concepts in each stage: First, a convolutional prior that can be regarded as a conditional positional embedding. Second, local- and dilated global self-attention for spatial feature interaction. Third, recurrent temporal feature aggregation to minimize latency while retaining temporal information. RVTs can be trained from scratch to reach state-of-the-art performance on event-based object detection - achieving an mAP of 47.2% on the Gen1 automotive dataset. At the same time, RVTs offer fast inference (<12 ms on a T4 GPU) and favorable parameter efficiency (5 times fewer than prior art). Our study brings new insights into effective design choices that can be fruitful for research beyond event-based vision.

Summary: The paper presents Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras. Event cameras deliver visual information with sub-millisecond latency, high dynamic range, and strong robustness to motion blur, which offers great potential for low-latency detection and tracking in time-critical scenarios. Prior event-based work achieved outstanding detection performance but at the cost of substantial inference time, typically beyond 40 milliseconds. By revisiting the high-level design of recurrent vision backbones, the authors cut inference time by a factor of 6 while retaining similar performance. Their multi-stage design uses three concepts in each stage: a convolutional prior that can be regarded as a conditional positional embedding; local and dilated global self-attention for spatial feature interaction; and recurrent temporal feature aggregation to minimize latency while retaining temporal information. RVTs can be trained from scratch to state-of-the-art event-based detection performance, reaching an mAP of 47.2% on the Gen1 automotive dataset, with fast inference (<12 ms on a T4 GPU) and favorable parameter efficiency (5 times fewer parameters than prior art). The study offers insights into design choices that may also be fruitful for research beyond event-based vision.

Paper25 RGB No More: Minimally-Decoded JPEG Vision Transformers

Abstract: Most neural networks for computer vision are designed to infer using RGB images. However, these RGB images are commonly encoded in JPEG before saving to disk; decoding them imposes an unavoidable overhead for RGB networks. Instead, our work focuses on training Vision Transformers (ViT) directly from the encoded features of JPEG. This way, we can avoid most of the decoding overhead, accelerating data load. Existing works have studied this aspect but they focus on CNNs. Due to how these encoded features are structured, CNNs require heavy modification to their architecture to accept such data. Here, we show that this is not the case for ViTs. In addition, we tackle data augmentation directly on these encoded features, which to our knowledge, has not been explored in-depth for training in this setting. With these two improvements – ViT and data augmentation – we show that our ViT-Ti model achieves up to 39.2% faster training and 17.9% faster inference with no accuracy loss compared to the RGB counterpart.

Summary: Most neural networks for computer vision are designed to run inference on RGB images, yet those images are usually JPEG-encoded before being saved to disk, and decoding them imposes an unavoidable overhead on RGB networks. This work instead trains Vision Transformers (ViT) directly on the encoded features of JPEG, avoiding most of the decoding overhead and speeding up data loading. Earlier studies of this idea focused on CNNs, which need heavy architectural modification to accept such data because of how the encoded features are structured; the authors show that ViTs do not. They also perform data augmentation directly on the encoded features, which to their knowledge has not been explored in depth for training in this setting. With these two improvements, ViT and data augmentation, their ViT-Ti model trains up to 39.2% faster and runs inference 17.9% faster than its RGB counterpart, with no loss in accuracy.

Paper26 Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers

Abstract: Vision Transformers (ViT) have shown competitive advantages in terms of performance compared to convolutional neural networks (CNNs), though they often come with high computational costs. To this end, previous methods explore different attention patterns by limiting a fixed number of spatially nearby tokens to accelerate the ViT’s multi-head self-attention (MHSA) operations. However, such structured attention patterns limit the token-to-token connections to their spatial relevance, which disregards learned semantic connections from a full attention mask. In this work, we propose an approach to learn instance-dependent attention patterns, by devising a lightweight connectivity predictor module that estimates the connectivity score of each pair of tokens. Intuitively, two tokens have high connectivity scores if the features are considered relevant either spatially or semantically. As each token only attends to a small number of other tokens, the binarized connectivity masks are often very sparse by nature and therefore provide the opportunity to reduce network FLOPs via sparse computations. Equipped with the learned unstructured attention pattern, sparse attention ViT (Sparsifiner) produces a superior Pareto frontier between FLOPs and top-1 accuracy on ImageNet compared to token sparsity. Our method reduces 48% to 69% FLOPs of MHSA while the accuracy drop is within 0.4%. We also show that combining attention and token sparsity reduces ViT FLOPs by over 60%.

Summary: Vision Transformers (ViT) are competitive with convolutional neural networks (CNNs) in performance but often come with high computational costs. To accelerate multi-head self-attention (MHSA), previous methods restrict each token to a fixed number of spatially nearby tokens. Such structured attention patterns limit token-to-token connections to spatial relevance and disregard the semantic connections that a full attention mask would learn. This work learns instance-dependent attention patterns with a lightweight connectivity predictor module that estimates a connectivity score for each pair of tokens; intuitively, two tokens score high if their features are relevant either spatially or semantically. Because each token attends to only a few others, the binarized connectivity masks are naturally very sparse and allow network FLOPs to be reduced through sparse computation. With the learned unstructured attention pattern, the sparse attention ViT (Sparsifiner) achieves a better Pareto frontier between FLOPs and top-1 accuracy on ImageNet than token sparsity. It reduces MHSA FLOPs by 48% to 69% with an accuracy drop within 0.4%, and combining attention sparsity with token sparsity reduces ViT FLOPs by over 60%.
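
How a binarized connectivity mask sparsifies one row of attention can be sketched as follows. This is a hedged illustration only: the abstract does not say how the mask is binarized, so the top-k rule, the function name, and the toy numbers are assumptions.

```python
import math

def sparse_attention_row(scores, connectivity, keep):
    """Attention weights for one query, restricted to its `keep` best-connected keys.

    scores:       raw attention logits of the query against every key
    connectivity: predicted connectivity of the query to every key (assumed given)
    """
    order = sorted(range(len(scores)), key=lambda j: connectivity[j], reverse=True)
    kept = order[:keep]                      # binarize: keep top-k, drop the rest
    m = max(scores[j] for j in kept)         # subtract the max for numerical stability
    exps = {j: math.exp(scores[j] - m) for j in kept}
    total = sum(exps.values())
    return [exps[j] / total if j in exps else 0.0 for j in range(len(scores))]

weights = sparse_attention_row(
    scores=[1.0, 2.0, 0.5, 3.0],
    connectivity=[0.9, 0.1, 0.8, 0.2],
    keep=2,
)
print(weights)
```

Only the kept entries need to be computed and stored, which is where the FLOPs saving of sparse attention comes from.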

Paper27 Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers

Abstract: Position Embeddings (PEs), an arguably indispensable component in Vision Transformers (ViTs), have been shown to improve the performance of ViTs on many vision tasks. However, PEs have a potentially high risk of privacy leakage since the spatial information of the input patches is exposed. This caveat naturally raises a series of interesting questions about the impact of PEs on accuracy, privacy, prediction consistency, etc. To tackle these issues, we propose a Masked Jigsaw Puzzle (MJP) position embedding method. In particular, MJP first shuffles the selected patches via our block-wise random jigsaw puzzle shuffle algorithm, and their corresponding PEs are occluded. Meanwhile, for the non-occluded patches, the PEs remain the original ones but their spatial relation is strengthened via our dense absolute localization regressor. The experimental results reveal that 1) PEs explicitly encode the 2D spatial relationship and lead to severe privacy leakage problems under gradient inversion attack; 2) Training ViTs with the naively shuffled patches can alleviate the problem, but it harms the accuracy; 3) Under a certain shuffle ratio, the proposed MJP not only boosts the performance and robustness on large-scale datasets (i.e., ImageNet-1K and ImageNet-C, -A/O) but also improves the privacy preservation ability under typical gradient attacks by a large margin. The source code and trained models are available at https://github.com/yhlleo/MJP.

Summary: Position Embeddings (PEs) are an arguably indispensable component of Vision Transformers (ViTs) and improve their performance on many vision tasks, but they carry a potentially high risk of privacy leakage because they expose the spatial information of the input patches. This raises questions about the impact of PEs on accuracy, privacy, and prediction consistency. The authors propose the Masked Jigsaw Puzzle (MJP) position embedding method. MJP first shuffles the selected patches with a block-wise random jigsaw puzzle shuffle algorithm and occludes their corresponding PEs; for the non-occluded patches, the PEs stay unchanged but their spatial relation is strengthened by a dense absolute localization regressor. The experiments show that 1) PEs explicitly encode the 2D spatial relationship and cause severe privacy leakage under gradient inversion attacks; 2) training ViTs on naively shuffled patches alleviates the problem but harms accuracy; 3) at a suitable shuffle ratio, MJP not only improves performance and robustness on large-scale datasets (ImageNet-1K and ImageNet-C, -A/O) but also improves privacy preservation under typical gradient attacks by a large margin. Source code and trained models: https://github.com/yhlleo/MJP.

Paper28 IS-GGT: Iterative Scene Graph Generation With Generative Transformers

Abstract: Scene graphs provide a rich, structured representation of a scene by encoding the entities (objects) and their spatial relationships in a graphical format. This representation has proven useful in several tasks, such as question answering, captioning, and even object detection, to name a few. Current approaches take a generation-by-classification approach where the scene graph is generated through labeling of all possible edges between objects in a scene, which adds computational overhead to the approach. This work introduces a generative transformer-based approach to generating scene graphs beyond link prediction. Using two transformer-based components, we first sample a possible scene graph structure from detected objects and their visual features. We then perform predicate classification on the sampled edges to generate the final scene graph. This approach allows us to efficiently generate scene graphs from images with minimal inference overhead. Extensive experiments on the Visual Genome dataset demonstrate the efficiency of the proposed approach. Without bells and whistles, we obtain, on average, 20.7% mean recall (mR@100) across different settings for scene graph generation (SGG), outperforming state-of-the-art SGG approaches while offering competitive performance to unbiased SGG approaches.

Summary: Scene graphs give a rich, structured representation of a scene by encoding its entities (objects) and their spatial relationships as a graph, and have proven useful in tasks such as question answering, captioning, and even object detection. Current approaches follow a generation-by-classification scheme in which the scene graph is produced by labeling every possible edge between the objects in a scene, which adds computational overhead. This work introduces a generative transformer-based approach that goes beyond link prediction. Using two transformer-based components, it first samples a possible scene graph structure from the detected objects and their visual features, then performs predicate classification on the sampled edges to produce the final scene graph. This allows scene graphs to be generated efficiently from images with minimal inference overhead. Extensive experiments on the Visual Genome dataset demonstrate the efficiency of the approach. Without bells and whistles, it obtains on average 20.7% mean recall (mR@100) across different scene graph generation (SGG) settings, outperforming state-of-the-art SGG approaches while remaining competitive with unbiased SGG approaches.

Paper29 Devil Is in the Queries: Advancing Mask Transformers for Real-World Medical Image Segmentation and Out-of-Distribution Localization

Abstract: Real-world medical image segmentation has tremendous long-tailed complexity of objects, among which tail conditions correlate with relatively rare diseases and are clinically significant. A trustworthy medical AI algorithm should demonstrate its effectiveness on tail conditions to avoid clinically dangerous damage in these out-of-distribution (OOD) cases. In this paper, we adopt the concept of object queries in Mask transformers to formulate semantic segmentation as a soft cluster assignment. The queries fit the feature-level cluster centers of inliers during training. Therefore, when performing inference on a medical image in real-world scenarios, the similarity between pixels and the queries detects and localizes OOD regions. We term this OOD localization as MaxQuery. Furthermore, the foregrounds of real-world medical images, whether OOD objects or inliers, are lesions. The difference between them is obviously less than that between the foreground and background, resulting in the object queries may focus redundantly on the background. Thus, we propose a query-distribution (QD) loss to enforce clear boundaries between segmentation targets and other regions at the query level, improving the inlier segmentation and OOD indication. Our proposed framework is tested on two real-world segmentation tasks, i.e., segmentation of pancreatic and liver tumors, outperforming previous leading algorithms by an average of 7.39% on AUROC, 14.69% on AUPR, and 13.79% on FPR95 for OOD localization. On the other hand, our framework improves the performance of inlier segmentation by an average of 5.27% DSC compared with nnUNet.

Summary: Real-world medical image segmentation involves a long-tailed distribution of objects, in which the tail conditions correspond to relatively rare but clinically significant diseases. A trustworthy medical AI algorithm should be effective on these tail conditions to avoid clinically dangerous errors in such out-of-distribution (OOD) cases. The paper adopts the object queries of Mask transformers to formulate semantic segmentation as a soft cluster assignment. During training the queries fit the feature-level cluster centers of the inliers, so at inference time the similarity between pixels and queries can detect and localize OOD regions; the authors call this OOD localization MaxQuery. The foregrounds of real-world medical images, whether OOD objects or inliers, are all lesions, and they differ from each other much less than foreground differs from background, so the object queries may focus redundantly on the background. The authors therefore propose a query-distribution (QD) loss that enforces clear boundaries between segmentation targets and other regions at the query level, improving both inlier segmentation and OOD indication. On two real-world tasks, segmentation of pancreatic and liver tumors, the framework outperforms previous leading algorithms for OOD localization by an average of 7.39% on AUROC, 14.69% on AUPR, and 13.79% on FPR95, and improves inlier segmentation by an average of 5.27% DSC compared with nnUNet.
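
One plausible reading of using pixel-query similarity for OOD localization is that a pixel with no strong response to any inlier query is suspicious. The sketch below encodes only that reading: the negative-maximum score, the threshold, and the function names are assumptions and are not taken from the paper.

```python
def max_query_scores(similarity):
    """similarity[p][q] is the response of pixel p to query q.

    Returns one score per pixel; a higher score means no query claims the pixel.
    (Assumed scoring rule: negative of the maximum response over queries.)
    """
    return [-max(row) for row in similarity]

def flag_ood(similarity, threshold):
    """Mark pixels whose score exceeds an assumed, hand-picked threshold."""
    return [score > threshold for score in max_query_scores(similarity)]

# Three pixels, two inlier queries (made-up responses).
similarity = [
    [0.90, 0.10],  # clearly assigned to query 0
    [0.20, 0.15],  # weak response to every query
    [0.05, 0.80],  # clearly assigned to query 1
]
print(flag_ood(similarity, threshold=-0.5))
```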

Paper30 PSVT: End-to-End Multi-Person 3D Pose and Shape Estimation With Progressive Video Transformers

Abstract: Existing methods of multi-person video 3D human Pose and Shape Estimation (PSE) typically adopt a two-stage strategy, which first detects human instances in each frame and then performs single-person PSE with temporal model. However, the global spatio-temporal context among spatial instances can not be captured. In this paper, we propose a new end-to-end multi-person 3D Pose and Shape estimation framework with progressive Video Transformer, termed PSVT. In PSVT, a spatio-temporal encoder (STE) captures the global feature dependencies among spatial objects. Then, spatio-temporal pose decoder (STPD) and shape decoder (STSD) capture the global dependencies between pose queries and feature tokens, shape queries and feature tokens, respectively. To handle the variances of objects as time proceeds, a novel scheme of progressive decoding is used to update pose and shape queries at each frame. Besides, we propose a novel pose-guided attention (PGA) for shape decoder to better predict shape parameters. The two components strengthen the decoder of PSVT to improve performance. Extensive experiments on the four datasets show that PSVT achieves state-of-the-art results.

Summary: Existing methods for multi-person video 3D human Pose and Shape Estimation (PSE) typically use a two-stage strategy: detect human instances in each frame, then perform single-person PSE with a temporal model. This cannot capture the global spatio-temporal context among spatial instances. The paper proposes PSVT, an end-to-end multi-person 3D pose and shape estimation framework with a progressive Video Transformer. In PSVT, a spatio-temporal encoder (STE) captures the global feature dependencies among spatial objects, and a spatio-temporal pose decoder (STPD) and shape decoder (STSD) capture the global dependencies between pose queries and feature tokens and between shape queries and feature tokens, respectively. To handle the variation of objects over time, a progressive decoding scheme updates the pose and shape queries at each frame. A pose-guided attention (PGA) is also proposed so that the shape decoder predicts shape parameters better. These two components strengthen the decoder and improve performance. Extensive experiments on four datasets show that PSVT achieves state-of-the-art results.

Paper31 A Light Touch Approach to Teaching Transformers Multi-View Geometry

Abstract: Transformers are powerful visual learners, in large part due to their conspicuous lack of manually-specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry, due to the near-infinite possible variations in 3D shapes and viewpoints (requiring flexibility), and the precise nature of projective geometry (obeying rigid laws). To resolve this conundrum, we propose a “light touch” approach, guiding visual Transformers to learn multiple-view geometry but allowing them to break free when needed. We achieve this by using epipolar lines to guide the Transformer’s cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along these lines since they contain geometrically plausible matches. Unlike previous methods, our proposal does not require any camera pose information at test-time. We focus on pose-invariant object instance retrieval, where standard Transformer networks struggle, due to the large differences in viewpoint between query and retrieved images. Experimentally, our method outperforms state-of-the-art approaches at object retrieval, without needing pose information at test-time.

Summary: Transformers are powerful visual learners, largely because they conspicuously lack manually specified priors. That flexibility can be a problem in tasks involving multiple-view geometry, which combine near-infinite variation in 3D shapes and viewpoints (requiring flexibility) with the precise nature of projective geometry (obeying rigid laws). The paper proposes a "light touch" approach that guides visual Transformers to learn multiple-view geometry while letting them break free when needed. It uses epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along them, since they contain the geometrically plausible matches. Unlike previous methods, it requires no camera pose information at test time. The focus is pose-invariant object instance retrieval, where standard Transformer networks struggle because of the large viewpoint differences between query and retrieved images. Experimentally, the method outperforms state-of-the-art approaches at object retrieval without needing pose information at test time.
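
The geometry behind epipolar guidance can be sketched with the standard relation that a point x in one image maps to the line l' = F x in the other, where F is the fundamental matrix. The penalty function below is a made-up illustration of "penalizing attention values outside the epipolar lines"; the paper's actual loss, the pixel radius, and the toy matrix are assumptions.

```python
import math

def epipolar_line(F, x):
    """Line l' = F x in the second image for a homogeneous point x = (u, v, 1)."""
    return [sum(F[i][j] * x[j] for j in range(3)) for i in range(3)]

def point_line_distance(line, p):
    """Pixel distance from homogeneous point p = (u, v, 1) to the line (a, b, c)."""
    a, b, c = line
    return abs(a * p[0] + b * p[1] + c * p[2]) / math.hypot(a, b)

def off_line_attention(attn, distances, radius):
    """Attention mass placed on keys farther than `radius` pixels from the line."""
    return sum(a for a, d in zip(attn, distances) if d > radius)

# Fundamental matrix of a pure sideways camera translation (toy example).
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]
line = epipolar_line(F, (2.0, 3.0, 1.0))   # the horizontal line v = 3
keys = [(5.0, 3.0, 1.0), (5.0, 5.0, 1.0)]  # candidate matches in the second image
distances = [point_line_distance(line, k) for k in keys]
print(distances, off_line_attention([0.7, 0.3], distances, radius=1.0))
```

Because such a term would only be used as a training signal, camera poses (and hence F) are not needed at test time, consistent with the abstract's claim.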

Paper32 Trade-Off Between Robustness and Accuracy of Vision Transformers

Abstract: Although deep neural networks (DNNs) have shown great successes in computer vision tasks, they are vulnerable to perturbations on inputs, and there exists a trade-off between the natural accuracy and robustness to such perturbations, which is mainly caused by the existence of robust non-predictive features and non-robust predictive features. Recent empirical analyses find Vision Transformers (ViTs) are inherently robust to various kinds of perturbations, but the aforementioned trade-off still exists for them. In this work, we propose Trade-off between Robustness and Accuracy of Vision Transformers (TORA-ViTs), which aims to efficiently transfer ViT models pretrained on natural tasks for both accuracy and robustness. TORA-ViTs consist of two major components, including a pair of accuracy and robustness adapters to extract predictive and robust features, respectively, and a gated fusion module to adjust the trade-off. The gated fusion module takes outputs of a pretrained ViT block as queries and outputs of our adapters as keys and values, and tokens from different adapters at different spatial locations are compared with each other to generate attention scores for a balanced mixing of predictive and robust features. Experiments on ImageNet with various robust benchmarks show that our TORA-ViTs can efficiently improve the robustness of naturally pretrained ViTs while maintaining competitive natural accuracy. Our most balanced setting (TORA-ViTs with lambda = 0.5) can maintain 83.7% accuracy on clean ImageNet and reach 54.7% and 38.0% accuracy under FGSM and PGD white-box attacks, respectively. In terms of various ImageNet variants, it can reach 39.2% and 56.3% accuracy on ImageNet-A and ImageNet-R and reach 34.4% mCE on ImageNet-C.

Summary: Deep neural networks (DNNs) have been very successful in computer vision but are vulnerable to input perturbations, and there is a trade-off between natural accuracy and robustness to such perturbations, caused mainly by the existence of robust non-predictive features and non-robust predictive features. Recent empirical analyses find that Vision Transformers (ViTs) are inherently robust to various perturbations, yet the trade-off persists for them. The paper proposes Trade-off between Robustness and Accuracy of Vision Transformers (TORA-ViTs), which aims to efficiently transfer ViT models pretrained on natural tasks so that they are both accurate and robust. TORA-ViTs have two main components: a pair of accuracy and robustness adapters that extract predictive and robust features, respectively, and a gated fusion module that adjusts the trade-off. The gated fusion module uses the outputs of a pretrained ViT block as queries and the adapter outputs as keys and values; tokens from the different adapters at different spatial locations are compared with each other to produce attention scores for a balanced mixing of predictive and robust features. Experiments on ImageNet with various robustness benchmarks show that TORA-ViTs efficiently improve the robustness of naturally pretrained ViTs while keeping natural accuracy competitive. The most balanced setting (λ = 0.5) maintains 83.7% accuracy on clean ImageNet and reaches 54.7% and 38.0% accuracy under FGSM and PGD white-box attacks, respectively. On ImageNet variants it reaches 39.2% and 56.3% accuracy on ImageNet-A and ImageNet-R, and 34.4% mCE on ImageNet-C.

Paper33 Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers

Abstract: Although vision transformers (ViTs) have shown promising results in various computer vision tasks recently, their high computational cost limits their practical applications. Previous approaches that prune redundant tokens have demonstrated a good trade-off between performance and computation costs. Nevertheless, errors caused by pruning strategies can lead to significant information loss. Our quantitative experiments reveal that the impact of pruned tokens on performance should be noticeable. To address this issue, we propose a novel joint Token Pruning & Squeezing module (TPS) for compressing vision transformers with higher efficiency. Firstly, TPS adopts pruning to get the reserved and pruned subsets. Secondly, TPS squeezes the information of pruned tokens into partial reserved tokens via the unidirectional nearest-neighbor matching and similarity-oriented fusing steps. Compared to state-of-the-art methods, our approach outperforms them under all token pruning intensities. Especially while shrinking DeiT-tiny&small computational budgets to 35%, it improves the accuracy by 1%-6% compared with baselines on ImageNet classification. The proposed method can accelerate the throughput of DeiT-small beyond DeiT-tiny, while its accuracy surpasses DeiT-tiny by 4.78%. Experiments on various transformers demonstrate the effectiveness of our method, while analysis experiments prove our higher robustness to the errors of the token pruning policy. Code is available at https://github.com/megvii-research/TPS-CVPR2023.

Summary: Vision transformers (ViTs) have recently shown promising results across computer vision tasks, but their high computational cost limits practical use. Earlier approaches that prune redundant tokens offer a good trade-off between performance and computation, yet errors in the pruning strategy can cause significant information loss, and the authors' quantitative experiments indicate that pruned tokens have a noticeable impact on performance. They propose a joint Token Pruning & Squeezing module (TPS) to compress vision transformers more efficiently. TPS first prunes to obtain a reserved subset and a pruned subset of tokens, then squeezes the information of the pruned tokens into some of the reserved tokens through unidirectional nearest-neighbor matching and similarity-oriented fusing. The approach outperforms state-of-the-art methods at every token pruning intensity. In particular, when the computational budgets of DeiT-tiny and DeiT-small are shrunk to 35%, it improves accuracy by 1%-6% over the baselines on ImageNet classification. It can push the throughput of DeiT-small beyond that of DeiT-tiny while exceeding DeiT-tiny's accuracy by 4.78%. Experiments on various transformers demonstrate its effectiveness, and analysis experiments show higher robustness to errors in the token pruning policy. Code: https://github.com/megvii-research/TPS-CVPR2023.
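
The two TPS steps named in the abstract, pruning and then squeezing pruned tokens into their nearest reserved neighbors, can be sketched on small vectors. This is an illustration under assumptions: the importance scores are taken as given, cosine similarity is used for matching, and the similarity-weighted average is one possible fusing rule rather than the paper's exact formula. Tokens are assumed to be non-zero vectors.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def prune_and_squeeze(tokens, scores, keep):
    """Keep the `keep` highest-scoring tokens and fold the rest into them."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    reserved, pruned = sorted(order[:keep]), sorted(order[keep:])
    sums = {i: list(tokens[i]) for i in reserved}   # each host starts with itself
    weights = {i: 1.0 for i in reserved}
    for p in pruned:
        # Unidirectional matching: each pruned token picks its most similar host.
        host = max(reserved, key=lambda r: cosine(tokens[p], tokens[r]))
        w = max(cosine(tokens[p], tokens[host]), 0.0)  # similarity-based weight (assumed)
        for d in range(len(tokens[p])):
            sums[host][d] += w * tokens[p][d]
        weights[host] += w
    return [[v / weights[i] for v in sums[i]] for i in reserved]

tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
scores = [0.9, 0.8, 0.1, 0.2]  # hypothetical importance scores
print(prune_and_squeeze(tokens, scores, keep=2))
```

The output has only `keep` tokens, but each still carries information from the tokens that were dropped, which is the intended remedy for pruning mistakes.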

Paper34 Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation

摘要原文: We address the task of weakly-supervised few-shot image classification and segmentation, by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. Our model is able to effectively learn to perform classification and segmentation in the absence of pixel-level labels during training, using only image-level labels. To do this it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with “mixed” supervision, where a small number of training images contains ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we propose to improve the pseudo-labels using a pseudo-label enhancer that was trained using the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i demonstrate significant performance gains in a variety of supervision settings, and in particular when little-to-no pixel-level labels are available.

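Summary: This paper tackles weakly-supervised few-shot image classification and segmentation using a Vision Transformer (ViT) pretrained with self-supervision. The method takes token representations from the self-supervised ViT and exploits their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. The model learns both tasks without pixel-level labels during training, using only image-level labels. To do so, it uses attention maps, built from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. The authors also study a practical “mixed” supervision setup in which a small number of training images carry ground-truth pixel-level labels and the rest carry only image-level labels. For this setup they improve the pseudo-labels with a pseudo-label enhancer trained on the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i show significant gains across supervision settings, especially when few or no pixel-level labels are available.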

Paper35 Region-Aware Pretraining for Open-Vocabulary Object Detection With Vision Transformers

摘要原文: We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) – a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole image positional embeddings. This better matches the use of positional embeddings at region-level in the detection finetuning phase. In addition, we replace the common softmax cross entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 APr on LVIS, surpassing the best existing approach by +5.8 points in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models.

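Summary: The paper presents Region-aware Open-vocabulary Vision Transformers (RO-ViT), a contrastive image-text pretraining recipe that bridges the gap between image-level pretraining and open-vocabulary object detection. During pretraining, the authors randomly crop and resize regions of the positional embeddings instead of using whole-image positional embeddings, which better matches how positional embeddings are used at region level during detection finetuning. They also replace the softmax cross-entropy loss commonly used in contrastive learning with focal loss, to learn better from informative but difficult examples. Finally, they use recent advances in novel object proposals to improve open-vocabulary detection finetuning. The full model is evaluated on the LVIS and COCO open-vocabulary detection benchmarks and on zero-shot transfer. RO-ViT reaches a state-of-the-art 32.1 APr on LVIS, 5.8 points above the best existing approach, along with competitive zero-shot transfer detection. Surprisingly, RO-ViT also improves the image-level representation, achieving the state of the art on 9 of 12 metrics on the COCO and Flickr image-text retrieval benchmarks and outperforming competitive approaches that use larger models.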
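To see why focal loss emphasizes difficult examples, the sketch below compares it with plain cross entropy for a single example. It uses the generic focal loss form, -(1 - p)^gamma * log(p), not the exact loss used in RO-ViT; the value gamma = 2.0 and the function names are assumptions.

```python
import math

def cross_entropy(p_true):
    """Cross entropy for one example, given the probability of the true class."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0):
    """Generic focal loss: cross entropy scaled by (1 - p)^gamma.

    gamma=2.0 is an assumed value, not taken from the RO-ViT abstract.
    """
    return ((1.0 - p_true) ** gamma) * cross_entropy(p_true)

for p in (0.9, 0.5, 0.1):  # easy, medium, hard example
    print(p, round(cross_entropy(p), 4), round(focal_loss(p), 4))
```

With gamma = 0 the two losses coincide. As gamma grows, well-classified examples contribute less and less, so the hard ones dominate the gradient.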

Paper36 AShapeFormer: Semantics-Guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers

摘要原文: 3D object detection techniques commonly follow a pipeline that aggregates predicted object central point features to compute candidate points. However, these candidate points contain only positional information, largely ignoring the object-level shape information. This eventually leads to sub-optimal 3D object detection. In this work, we propose AShapeFormer, a semantics-guided object-level shape encoding module for 3D object detection. This is a plug-n-play module that leverages multi-head attention to encode object shape information. We also propose shape tokens and object-scene positional encoding to ensure that the shape information is fully exploited. Moreover, we introduce a semantic guidance sub-module to sample more foreground points and suppress the influence of background points for a better object shape perception. We demonstrate a straightforward enhancement of multiple existing methods with our AShapeFormer. Through extensive experiments on the popular SUN RGB-D and ScanNetV2 dataset, we show that our enhanced models are able to outperform the baselines by a considerable absolute margin of up to 8.1%. Code will be available at https://github.com/ZechuanLi/AShapeFormer

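Summary: 3D object detection techniques commonly follow a pipeline that aggregates predicted object center point features to compute candidate points. These candidate points carry only positional information and largely ignore object-level shape information, which leads to sub-optimal detection. The paper proposes AShapeFormer, a semantics-guided object-level shape encoding module for 3D object detection. It is a plug-and-play module that uses multi-head attention to encode object shape information. The authors also propose shape tokens and object-scene positional encoding so that the shape information is fully exploited, and they introduce a semantic guidance sub-module that samples more foreground points and suppresses the influence of background points for better shape perception. AShapeFormer offers a straightforward enhancement to multiple existing methods. In extensive experiments on the popular SUN RGB-D and ScanNetV2 datasets, the enhanced models outperform the baselines by an absolute margin of up to 8.1%. Code will be available at https://github.com/ZechuanLi/AShapeFormer.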

Paper37 Learning Expressive Prompting With Residuals for Vision Transformers

Abstract: Prompt learning is an efficient approach to adapt transformers by inserting a learnable set of parameters into the input and intermediate representations of a pre-trained model. In this work, we present Expressive Prompts with Residuals (EXPRES) which modifies the prompt learning paradigm specifically for effective adaptation of vision transformers (ViT). Our method constructs downstream representations via learnable “output” tokens, that are akin to the learned class tokens of the ViT. Further for better steering of the downstream representation processed by the frozen transformer, we introduce residual learnable tokens that are added to the output of various computations. We apply EXPRES for image classification, few shot learning, and semantic segmentation, and show our method is capable of achieving state of the art prompt tuning on 3/3 categories of the VTAB benchmark. In addition to strong performance, we observe that our approach is an order of magnitude more prompt efficient than existing visual prompting baselines. We analytically show the computational benefits of our approach over weight space adaptation techniques like finetuning. Lastly we systematically corroborate the architectural design of our method via a series of ablation experiments.

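Summary: The paper presents Expressive Prompts with Residuals (EXPRES), which modifies the prompt learning paradigm specifically for effective adaptation of vision transformers (ViT). The method builds downstream representations through learnable “output” tokens that are akin to the learned class tokens of the ViT. To better steer the downstream representation processed by the frozen transformer, it introduces residual learnable tokens that are added to the outputs of various computations. The authors apply EXPRES to image classification, few-shot learning, and semantic segmentation, and show state-of-the-art prompt tuning on 3/3 categories of the VTAB benchmark. Beyond strong performance, the approach is an order of magnitude more prompt efficient than existing visual prompting baselines. The authors show analytically its computational benefits over weight-space adaptation techniques such as finetuning, and they corroborate the architectural design through a series of ablation experiments.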

Paper38 Supervised Masked Knowledge Distillation for Few-Shot Transformers

Abstract: Vision Transformers (ViTs) emerge to achieve impressive performance on many data-abundant computer vision tasks by capturing long-range dependencies among local features. However, under few-shot learning (FSL) settings on small datasets with only a few labeled data, ViT tends to overfit and suffers from severe performance degradation due to its absence of CNN-alike inductive bias. Previous works in FSL avoid such problem either through the help of self-supervised auxiliary losses, or through the dexterous uses of label information under supervised settings. But the gap between self-supervised and supervised few-shot Transformers is still unfilled. Inspired by recent advances in self-supervised knowledge distillation and masked image modeling (MIM), we propose a novel Supervised Masked Knowledge Distillation model (SMKD) for few-shot Transformers which incorporates label information into self-distillation frameworks. Compared with previous self-supervised methods, we allow intra-class knowledge distillation on both class and patch tokens, and introduce the challenging task of masked patch tokens reconstruction across intra-class images. Experimental results on four few-shot classification benchmark datasets show that our method with simple design outperforms previous methods by a large margin and achieves a new state-of-the-art. Detailed ablation studies confirm the effectiveness of each component of our model. Code for this paper is available here: https://github.com/HL-hanlin/SMKD.

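Summary: Vision Transformers (ViTs) achieve impressive performance on many data-abundant computer vision tasks by capturing long-range dependencies among local features. Under few-shot learning (FSL) settings on small datasets with only a few labeled samples, however, ViTs tend to overfit and suffer severe performance degradation because they lack a CNN-like inductive bias. Previous FSL work avoids this problem either with self-supervised auxiliary losses or with skillful use of label information under supervised settings, but the gap between self-supervised and supervised few-shot Transformers remains unfilled. Inspired by recent advances in self-supervised knowledge distillation and masked image modeling (MIM), the authors propose a Supervised Masked Knowledge Distillation model (SMKD) for few-shot Transformers that incorporates label information into self-distillation frameworks. Compared with previous self-supervised methods, it allows intra-class knowledge distillation on both class and patch tokens, and it introduces the challenging task of reconstructing masked patch tokens across intra-class images. On four few-shot classification benchmark datasets, the simple design outperforms previous methods by a large margin and sets a new state of the art. Detailed ablation studies confirm the effectiveness of each component. Code is available at https://github.com/HL-hanlin/SMKD.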

Paper39 DeepVecFont-v2: Exploiting Transformers To Synthesize Vector Fonts With Higher Quality

摘要原文: Vector font synthesis is a challenging and ongoing problem in the fields of Computer Vision and Computer Graphics. The recently-proposed DeepVecFont achieved state-of-the-art performance by exploiting information of both the image and sequence modalities of vector fonts. However, it has limited capability for handling long sequence data and heavily relies on an image-guided outline refinement post-processing. Thus, vector glyphs synthesized by DeepVecFont still often contain some distortions and artifacts and cannot rival human-designed results. To address the above problems, this paper proposes an enhanced version of DeepVecFont mainly by making the following three novel technical contributions. First, we adopt Transformers instead of RNNs to process sequential data and design a relaxation representation for vector outlines, markedly improving the model’s capability and stability of synthesizing long and complex outlines. Second, we propose to sample auxiliary points in addition to control points to precisely align the generated and target Bezier curves or lines. Finally, to alleviate error accumulation in the sequential generation process, we develop a context-based self-refinement module based on another Transformer-based decoder to remove artifacts in the initially synthesized glyphs. Both qualitative and quantitative results demonstrate that the proposed method effectively resolves those intrinsic problems of the original DeepVecFont and outperforms existing approaches in generating English and Chinese vector fonts with complicated structures and diverse styles.

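Summary: Vector font synthesis is a challenging and ongoing problem in computer vision and computer graphics. The recently proposed DeepVecFont achieved state-of-the-art performance by exploiting both the image and sequence modalities of vector fonts. However, it has limited capability for handling long sequence data and relies heavily on image-guided outline refinement as post-processing, so the glyphs it synthesizes still often contain distortions and artifacts and cannot rival human-designed results. To address these problems, the paper proposes an enhanced version of DeepVecFont with three technical contributions. First, it adopts Transformers instead of RNNs to process sequential data and designs a relaxation representation for vector outlines, markedly improving the model's capability and stability when synthesizing long and complex outlines. Second, it samples auxiliary points in addition to control points to precisely align the generated and target Bezier curves or lines. Third, to alleviate error accumulation in sequential generation, it develops a context-based self-refinement module, built on another Transformer-based decoder, that removes artifacts from the initially synthesized glyphs. Qualitative and quantitative results show that the method resolves these intrinsic problems of the original DeepVecFont and outperforms existing approaches in generating English and Chinese vector fonts with complicated structures and diverse styles.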
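The idea of sampling auxiliary points on a curve, in addition to comparing control points, can be sketched as follows. This is an illustration under assumptions, not the paper's loss: the evenly spaced parameter values, the mean squared distance, and all function names are hypothetical.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Point on a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    return (
        b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0],
        b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1],
    )

def sample_auxiliary_points(ctrl, n):
    """n interior points at evenly spaced parameters (assumed spacing)."""
    return [cubic_bezier(*ctrl, (k + 1) / (n + 1)) for k in range(n)]

def alignment_error(ctrl_a, ctrl_b, n=4):
    """Mean squared distance between auxiliary points of two curves."""
    pts_a = sample_auxiliary_points(ctrl_a, n)
    pts_b = sample_auxiliary_points(ctrl_b, n)
    total = sum((ax - bx) ** 2 + (ay - by) ** 2
                for (ax, ay), (bx, by) in zip(pts_a, pts_b))
    return total / n

target = [(0.0, 0.0), (1.0, 2.0), (2.0, 2.0), (3.0, 0.0)]
generated = [(0.0, 0.0), (1.0, 1.5), (2.0, 2.5), (3.0, 0.0)]
print(alignment_error(generated, target))
```

Comparing points along the curve penalizes shape differences that a comparison of endpoints alone would miss, since the two example curves share both endpoints.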

Paper40 Teaching Matters: Investigating the Role of Supervision in Vision Transformers

摘要原文: Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, their behavior under different learning paradigms is not well explored. We compare ViTs trained through different methods of supervision, and show that they learn a diverse range of behaviors in terms of their attention, representations, and downstream performance. We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads. These are self-attention heads that attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that to the best of our knowledge has not been highlighted in any prior work. Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method. We find that contrastive self-supervised methods learn features that are competitive with explicitly supervised features, and they can even be superior for part-level tasks. We also find that the representations of reconstruction-based models show non-trivial similarity to contrastive self-supervised models.

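Summary: Vision Transformers (ViTs) have become very popular in recent years and have spread into many applications, yet their behavior under different learning paradigms is not well explored. The authors compare ViTs trained with different methods of supervision and show that they learn a diverse range of behaviors in terms of attention, representations, and downstream performance. They also find ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads: self-attention heads that attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that, to the authors' knowledge, no prior work has highlighted. The analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on the training method. Contrastive self-supervised methods learn features that are competitive with explicitly supervised features and can even be superior for part-level tasks. The representations of reconstruction-based models also show non-trivial similarity to those of contrastive self-supervised models.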
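One way to picture an Offset Local Attention Head is to take an attention matrix over a patch grid and check which spatial offset each query attends to most strongly. The heuristic below is an assumption made for illustration, not the paper's analysis procedure; `toy_head` builds a synthetic head and is equally hypothetical.

```python
from collections import Counter

def dominant_offset(attn, width):
    """Most common (row, col) offset from each query to its top-attended key.

    attn: N x N attention weights over a patch grid flattened row by row
    (no class token). Returns the offset and the fraction of queries using it.
    """
    offsets = Counter()
    for q, row in enumerate(attn):
        k = max(range(len(row)), key=row.__getitem__)
        offsets[(k // width - q // width, k % width - q % width)] += 1
    offset, count = offsets.most_common(1)[0]
    return offset, count / len(attn)

def toy_head(height, width):
    """Synthetic head where each patch attends mostly to its right neighbor."""
    n = height * width
    attn = []
    for q in range(n):
        # Patches in the last column have no right neighbor and attend to themselves.
        target = q + 1 if q % width < width - 1 else q
        attn.append([0.9 if k == target else 0.1 / (n - 1) for k in range(n)])
    return attn

print(dominant_offset(toy_head(3, 3), width=3))
```

A head whose dominant offset is non-zero and shared by most queries matches the description of attending to an adjacent token with a fixed directional offset.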

Paper41 Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers

摘要原文: Vision transformers have achieved significant improvements on various vision tasks but their quadratic interactions between tokens significantly reduce computational efficiency. Many pruning methods have been proposed to remove redundant tokens for efficient vision transformers recently. However, existing studies mainly focus on the token importance to preserve local attentive tokens but completely ignore the global token diversity. In this paper, we emphasize the cruciality of diverse global semantics and propose an efficient token decoupling and merging method that can jointly consider the token importance and diversity for token pruning. According to the class token attention, we decouple the attentive and inattentive tokens. In addition to preserve the most discriminative local tokens, we merge similar inattentive tokens and match homogeneous attentive tokens to maximize the token diversity. Despite its simplicity, our method obtains a promising trade-off between model complexity and classification accuracy. On DeiT-S, our method reduces the FLOPs by 35% with only a 0.2% accuracy drop. Notably, benefiting from maintaining the token diversity, our method can even improve the accuracy of DeiT-T by 0.1% after reducing its FLOPs by 40%.

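Summary: Vision transformers have achieved significant improvements on various vision tasks, but the quadratic interactions between tokens greatly reduce computational efficiency. Many pruning methods have recently been proposed to remove redundant tokens. Existing studies, however, focus mainly on token importance in order to preserve local attentive tokens and completely ignore global token diversity. This paper stresses the importance of diverse global semantics and proposes an efficient token decoupling and merging method that jointly considers token importance and diversity for token pruning. Using the class token attention, the method decouples attentive and inattentive tokens. Besides preserving the most discriminative local tokens, it merges similar inattentive tokens and matches homogeneous attentive tokens to maximize token diversity. Despite its simplicity, the method obtains a promising trade-off between model complexity and classification accuracy. On DeiT-S it reduces FLOPs by 35% with only a 0.2% accuracy drop. Notably, because it maintains token diversity, it can even improve the accuracy of DeiT-T by 0.1% after reducing its FLOPs by 40%.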

Paper42 You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?

摘要原文: Vision Transformers (ViTs), which made a splash in the field of computer vision (CV), have shaken the dominance of convolutional neural networks (CNNs). However, in the process of industrializing ViTs, backdoor attacks have brought severe challenges to security. The success of ViTs benefits from the self-attention mechanism. However, compared with CNNs, we find that this mechanism of capturing global information within patches makes ViTs more sensitive to patch-wise triggers. Under such observations, we delicately design a novel backdoor attack framework for ViTs, dubbed BadViT, which utilizes a universal patch-wise trigger to catch the model’s attention from patches beneficial for classification to those with triggers, thereby manipulating the mechanism on which ViTs survive to confuse itself. Furthermore, we propose invisible variants of BadViT to increase the stealth of the attack by limiting the strength of the trigger perturbation. Through a large number of experiments, it is proved that BadViT is an efficient backdoor attack method against ViTs, which is less dependent on the number of poisons, with satisfactory convergence, and is transferable for downstream tasks. Furthermore, the risks inside of ViTs to backdoor attacks are also explored from the perspective of existing advanced defense schemes.

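Summary: Vision Transformers (ViTs) have made a splash in computer vision and shaken the dominance of convolutional neural networks (CNNs). As ViTs are industrialized, however, backdoor attacks pose severe security challenges. The success of ViTs comes from the self-attention mechanism, yet the authors find that, compared with CNNs, this mechanism of capturing global information within patches makes ViTs more sensitive to patch-wise triggers. Based on this observation they design a backdoor attack framework for ViTs, dubbed BadViT, which uses a universal patch-wise trigger to pull the model's attention away from patches that help classification and toward patches containing the trigger, thereby turning the mechanism ViTs rely on against the model itself. They also propose invisible variants of BadViT that increase stealth by limiting the strength of the trigger perturbation. Extensive experiments show that BadViT is an efficient backdoor attack against ViTs: it depends less on the number of poisons, converges satisfactorily, and transfers to downstream tasks. The paper also explores the backdoor risks inside ViTs from the perspective of existing advanced defense schemes.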

Paper43 PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers

摘要原文: Vision Transformers (ViTs) are built on the assumption of treating image patches as “visual tokens” and learn patch-to-patch attention. The patch embedding based tokenizer has a semantic gap with respect to its counterpart, the textual tokenizer. The patch-to-patch attention suffers from the quadratic complexity issue, and also makes it non-trivial to explain learned ViTs. To address these issues in ViT, this paper proposes to learn Patch-to-Cluster attention (PaCa) in ViT. Queries in our PaCa-ViT starts with patches, while keys and values are directly based on clustering (with a predefined small number of clusters). The clusters are learned end-to-end, leading to better tokenizers and inducing joint clustering-for-attention and attention-for-clustering for better and interpretable models. The quadratic complexity is relaxed to linear complexity. The proposed PaCa module is used in designing efficient and interpretable ViT backbones and semantic segmentation head networks. In experiments, the proposed methods are tested on ImageNet-1k image classification, MS-COCO object detection and instance segmentation and MIT-ADE20k semantic segmentation. Compared with the prior art, it obtains better performance in all the three benchmarks than the SWin and the PVTs by significant margins in ImageNet-1k and MIT-ADE20k. It is also significantly more efficient than PVT models in MS-COCO and MIT-ADE20k due to the linear complexity. The learned clusters are semantically meaningful. Code and model checkpoints are available at https://github.com/iVMCL/PaCaViT.

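Summary: Vision Transformers (ViTs) are built on the assumption that image patches can be treated as “visual tokens”, and they learn patch-to-patch attention. The patch-embedding-based tokenizer has a semantic gap relative to its counterpart, the textual tokenizer. Patch-to-patch attention suffers from quadratic complexity and makes it non-trivial to explain learned ViTs. To address these issues, the paper proposes learning Patch-to-Cluster attention (PaCa) in ViT. Queries in PaCa-ViT start from patches, while keys and values are based directly on clustering, with a predefined small number of clusters. The clusters are learned end-to-end, which leads to better tokenizers and induces joint clustering-for-attention and attention-for-clustering, giving better and more interpretable models. The quadratic complexity is relaxed to linear complexity. The PaCa module is used to design efficient and interpretable ViT backbones and semantic segmentation head networks. The methods are tested on ImageNet-1k image classification, MS-COCO object detection and instance segmentation, and MIT-ADE20k semantic segmentation. Compared with the prior art, PaCa-ViT performs better than Swin and the PVTs on all three benchmarks, with significant margins on ImageNet-1k and MIT-ADE20k. Thanks to the linear complexity, it is also significantly more efficient than PVT models on MS-COCO and MIT-ADE20k. The learned clusters are semantically meaningful. Code and model checkpoints are available at https://github.com/iVMCL/PaCaViT.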
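The reduction from quadratic to linear cost can be seen in a stripped-down sketch of patch-to-cluster attention. It makes several simplifying assumptions that are not from the paper: the cluster assignment is a fixed matrix rather than learned end-to-end, there are no query, key, or value projections, and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def patch_to_cluster_attention(patches, assign):
    """Attend from N patch queries to M cluster tokens.

    patches: N x d features. assign: M x N weights, one row per cluster
    (a fixed stand-in for the learned clustering). Returns the N x d outputs
    and the number of attention scores computed.
    """
    d = len(patches[0])
    # Cluster tokens are weighted sums of patches; they serve as keys and values.
    clusters = [[sum(w * p[c] for w, p in zip(row, patches)) for c in range(d)]
                for row in assign]
    outputs = []
    for q in patches:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in clusters]
        weights = softmax(scores)
        outputs.append([sum(w * k[c] for w, k in zip(weights, clusters))
                        for c in range(d)])
    return outputs, len(patches) * len(clusters)

patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assign = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
out, n_scores = patch_to_cluster_attention(patches, assign)
print(n_scores, "scores instead of", len(patches) ** 2)
```

With N patches and a fixed small M, the score matrix has N x M entries instead of N x N, so the cost grows linearly in the number of patches.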

Paper44 Vision Transformers Are Parameter-Efficient Audio-Visual Learners

摘要原文: Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained only on visual data, to generalize to audio-visual data without finetuning any of its original parameters. To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT. To efficiently fuse visual and audio cues, our LAVISH adapter uses a small set of latent tokens, which form an attention bottleneck, thus, eliminating the quadratic cost of standard cross-attention. Compared to the existing modality-specific audio-visual methods, our approach achieves competitive or even better performance on various audio-visual tasks while using fewer tunable parameters and without relying on costly audio pretraining or external audio encoders. Our code is available at https://genjib.github.io/project_page/LAVISH/

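Summary: Vision transformers (ViTs) have achieved impressive results on various computer vision tasks over the last several years. This work studies whether frozen ViTs, pretrained only on visual data, can generalize to audio-visual data without finetuning any of their original parameters. To this end, the authors propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT. To fuse visual and audio cues efficiently, the LAVISH adapter uses a small set of latent tokens that form an attention bottleneck, eliminating the quadratic cost of standard cross-attention. Compared with existing modality-specific audio-visual methods, the approach achieves competitive or even better performance on various audio-visual tasks while using fewer tunable parameters and without relying on costly audio pretraining or external audio encoders. Code is available at https://genjib.github.io/project_page/LAVISH/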

Paper45 Efficient Movie Scene Detection Using State-Space Transformers

摘要原文: The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging as it requires the ability to reason over very long movie segments. This is in contrast to most existing video recognition models, which are typically designed for short-range video analysis. This work proposes a State-Space Transformer model that can efficiently capture dependencies in long movie videos for accurate movie scene detection. Our model, dubbed TranS4mer, is built using a novel S4A building block, which combines the strengths of structured state-space sequence (S4) and self-attention (A) layers. Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies. Afterward, the state-space operation in the S4A block is used to aggregate long-range inter-shot cues. The final TranS4mer model, which can be trained end-to-end, is obtained by stacking the S4A blocks one after the other multiple times. Our proposed TranS4mer outperforms all prior methods in three movie scene detection datasets, including MovieNet, BBC, and OVSD, while also being 2x faster and requiring 3x less GPU memory than standard Transformer models. We will release our code and models.

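Summary: Distinguishing between movie scenes is critical for understanding a movie's storyline, but detecting scenes accurately is challenging because it requires reasoning over very long movie segments. The authors propose TranS4mer, a State-Space Transformer model built from a novel S4A block that combines structured state-space sequence (S4) layers and self-attention (A) layers, so that it can efficiently capture dependencies in long movie videos. Within each S4A block, self-attention first captures short-range intra-shot dependencies, and the state-space operation then aggregates long-range inter-shot cues. TranS4mer outperforms all prior methods on three movie scene detection datasets (MovieNet, BBC, and OVSD) while being 2x faster and requiring 3x less GPU memory than standard Transformer models. The authors will release their code and models.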

Paper46 BAEFormer: Bi-Directional and Early Interaction Transformers for Bird’s Eye View Semantic Segmentation

摘要原文: Bird’s Eye View (BEV) semantic segmentation is a critical task in autonomous driving. However, existing Transformer-based methods confront difficulties in transforming Perspective View (PV) to BEV due to their unidirectional and posterior interaction mechanisms. To address this issue, we propose a novel Bi-directional and Early Interaction Transformers framework named BAEFormer, consisting of (i) an early-interaction PV-BEV pipeline and (ii) a bi-directional cross-attention mechanism. Moreover, we find that the image feature maps’ resolution in the cross-attention module has a limited effect on the final performance. Under this critical observation, we propose to enlarge the size of input images and downsample the multi-view image features for cross-interaction, further improving the accuracy while keeping the amount of computation controllable. Our proposed method for BEV semantic segmentation achieves state-of-the-art performance in real-time inference speed on the nuScenes dataset, i.e., 38.9 mIoU at 45 FPS on a single A100 GPU.

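Summary: Bird's Eye View (BEV) semantic segmentation is a critical task in autonomous driving. Existing Transformer-based methods, however, have difficulty transforming Perspective View (PV) to BEV because of their unidirectional and posterior interaction mechanisms. To address this, the authors propose a Bi-directional and Early Interaction Transformers framework named BAEFormer, consisting of (i) an early-interaction PV-BEV pipeline and (ii) a bi-directional cross-attention mechanism. They also find that the resolution of the image feature maps in the cross-attention module has a limited effect on final performance. Building on this observation, they enlarge the input images and downsample the multi-view image features for cross-interaction, which further improves accuracy while keeping the amount of computation under control. The method achieves state-of-the-art BEV semantic segmentation performance at real-time inference speed on the nuScenes dataset: 38.9 mIoU at 45 FPS on a single A100 GPU.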

Paper47 N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution

摘要原文: While some studies have proven that Swin Transformer (Swin) with window self-attention (WSA) is suitable for single image super-resolution (SR), the plain WSA ignores the broad regions when reconstructing high-resolution images due to a limited receptive field. In addition, many deep learning SR methods suffer from intensive computations. To address these problems, we introduce the N-Gram context to the low-level vision with Transformers for the first time. We define N-Gram as neighboring local windows in Swin, which differs from text analysis that views N-Gram as consecutive characters or words. N-Grams interact with each other by sliding-WSA, expanding the regions seen to restore degraded pixels. Using the N-Gram context, we propose NGswin, an efficient SR network with SCDP bottleneck taking multi-scale outputs of the hierarchical encoder. Experimental results show that NGswin achieves competitive performance while maintaining an efficient structure when compared with previous leading methods. Moreover, we also improve other Swin-based SR methods with the N-Gram context, thereby building an enhanced model: SwinIR-NG. Our improved SwinIR-NG outperforms the current best lightweight SR approaches and establishes state-of-the-art results. Codes are available at https://github.com/rami0205/NGramSwin.

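Summary: Some studies have shown that Swin Transformer (Swin) with window self-attention (WSA) is suitable for single image super-resolution (SR), but plain WSA ignores broad regions when reconstructing high-resolution images because its receptive field is limited. Many deep learning SR methods also suffer from intensive computation. To address these problems, the authors introduce the N-Gram context to low-level vision with Transformers for the first time. They define an N-Gram as neighboring local windows in Swin, unlike text analysis, where an N-Gram is a run of consecutive characters or words. N-Grams interact with each other through sliding-WSA, expanding the regions seen when restoring degraded pixels. Using the N-Gram context, the authors propose NGswin, an efficient SR network with an SCDP bottleneck that takes the multi-scale outputs of the hierarchical encoder. Experiments show that NGswin achieves competitive performance while keeping an efficient structure compared with previous leading methods. The authors also improve other Swin-based SR methods with the N-Gram context, producing an enhanced model, SwinIR-NG, which outperforms the current best lightweight SR approaches and establishes state-of-the-art results. Code is available at https://github.com/rami0205/NGramSwin.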

Paper48 Semi-DETR: Semi-Supervised Object Detection With Detection Transformers

Abstract: We analyze the DETR-based framework on semi-supervised object detection (SSOD) and observe that (1) the one-to-one assignment strategy generates incorrect matching when the pseudo ground-truth bounding box is inaccurate, leading to training inefficiency; (2) DETR-based detectors lack deterministic correspondence between the input query and its prediction output, which hinders the applicability of the consistency-based regularization widely used in current SSOD methods. We present Semi-DETR, the first transformer-based end-to-end semi-supervised object detector, to tackle these problems. Specifically, we propose a Stage-wise Hybrid Matching strategy that combines the one-to-many assignment and one-to-one assignment strategies to improve the training efficiency of the first stage and thus provide high-quality pseudo labels for the training of the second stage. Besides, we introduce a Cross-view Query Consistency method to learn the semantic feature invariance of object queries from different views while avoiding the need to find deterministic query correspondence. Furthermore, we propose a Cost-based Pseudo Label Mining module to dynamically mine more pseudo boxes based on the matching cost of pseudo ground truth bounding boxes for consistency training. Extensive experiments on all SSOD settings of both COCO and Pascal VOC benchmark datasets show that our Semi-DETR method outperforms all state-of-the-art methods by clear margins.

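Summary: The paper analyzes the DETR-based framework for semi-supervised object detection (SSOD) and makes two observations: (1) the one-to-one assignment strategy produces incorrect matches when the pseudo ground-truth bounding box is inaccurate, which makes training inefficient; (2) DETR-based detectors lack a deterministic correspondence between an input query and its prediction output, which hinders the consistency-based regularization widely used in current SSOD methods. To tackle these problems, the authors present Semi-DETR, the first transformer-based end-to-end semi-supervised object detector. They propose a Stage-wise Hybrid Matching strategy that combines one-to-many and one-to-one assignment to improve the training efficiency of the first stage and thereby provide high-quality pseudo labels for training the second stage. They introduce a Cross-view Query Consistency method that learns the semantic feature invariance of object queries from different views without needing a deterministic query correspondence. They also propose a Cost-based Pseudo Label Mining module that dynamically mines more pseudo boxes for consistency training, based on the matching cost of the pseudo ground-truth bounding boxes. Extensive experiments on all SSOD settings of the COCO and Pascal VOC benchmark datasets show that Semi-DETR outperforms all state-of-the-art methods by clear margins.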

Paper49 CompletionFormer: Depth Completion With Convolutions and Vision Transformers

摘要原文: Given sparse depths and the corresponding RGB images, depth completion aims at spatially propagating the sparse measurements throughout the whole image to get a dense depth prediction. Despite the tremendous progress of deep-learning-based depth completion methods, the locality of the convolutional layer or graph model makes it hard for the network to model the long-range relationship between pixels. While recent fully Transformer-based architecture has reported encouraging results with the global receptive field, the performance and efficiency gaps to the well-developed CNN models still exist because of its deteriorative local feature details. This paper proposes a joint convolutional attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure. This hybrid architecture naturally benefits both the local connectivity of convolutions and the global context of the Transformer in one single model. As a result, our CompletionFormer outperforms state-of-the-art CNNs-based methods on the outdoor KITTI Depth Completion benchmark and indoor NYUv2 dataset, achieving significantly higher efficiency (nearly 1/3 FLOPs) compared to pure Transformer-based methods. Especially when the captured depth is highly sparse, the performance gap with other methods gets much larger.

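Summary: Given sparse depths and the corresponding RGB images, depth completion aims to propagate the sparse measurements spatially across the whole image to obtain a dense depth prediction. Despite the great progress of deep-learning-based depth completion methods, the locality of the convolutional layer or graph model makes it hard for the network to model long-range relationships between pixels. Recent fully Transformer-based architectures have reported encouraging results thanks to their global receptive field, but performance and efficiency gaps to well-developed CNN models remain because local feature details deteriorate. The paper proposes a joint convolutional attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and the Vision Transformer into one block, as the basic unit for building a depth completion model with a pyramidal structure. This hybrid architecture benefits from both the local connectivity of convolutions and the global context of the Transformer in a single model. As a result, CompletionFormer outperforms state-of-the-art CNN-based methods on the outdoor KITTI Depth Completion benchmark and the indoor NYUv2 dataset, while being significantly more efficient (nearly 1/3 of the FLOPs) than pure Transformer-based methods. The performance gap with other methods becomes much larger when the captured depth is highly sparse.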

Paper50 SVGformer: Representation Learning for Continuous Vector Graphics Using Transformers

Abstract: Advances in representation learning have led to great success in understanding and generating data in various domains. However, in modeling vector graphics data, the pure data-driven approach often yields unsatisfactory results in downstream tasks as existing deep learning methods often require the quantization of SVG parameters and cannot exploit the geometric properties explicitly. In this paper, we propose a transformer-based representation learning model (SVGformer) that directly operates on continuous input values and manipulates the geometric information of SVG to encode outline details and long-distance dependencies. SVGformer can be used for various downstream tasks: reconstruction, classification, interpolation, retrieval, etc. We have conducted extensive experiments on vector font and icon datasets to show that our model can capture high-quality representation information and outperform the previous state-of-the-art on downstream tasks significantly.

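Summary: Advances in representation learning have brought great success in understanding and generating data in various domains. In modeling vector graphics data, however, a purely data-driven approach often gives unsatisfactory results on downstream tasks, because existing deep learning methods usually require quantizing the SVG parameters and cannot exploit geometric properties explicitly. The paper proposes SVGformer, a transformer-based representation learning model that operates directly on continuous input values and manipulates the geometric information of SVG to encode outline details and long-distance dependencies. SVGformer can be used for various downstream tasks: reconstruction, classification, interpolation, retrieval, and so on. Extensive experiments on vector font and icon datasets show that the model captures high-quality representation information and significantly outperforms the previous state of the art on downstream tasks.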

Paper51 Hint-Aug: Drawing Hints From Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning

Abstract: Despite the growing demand for tuning foundation vision transformers (FViTs) on downstream tasks, fully unleashing FViTs’ potential under data-limited scenarios (e.g., few-shot tuning) remains a challenge due to FViTs’ data-hungry nature. Common data augmentation techniques fall short in this context due to the limited features contained in the few-shot tuning data. To tackle this challenge, we first identify an opportunity for FViTs in few-shot tuning: pretrained FViTs themselves have already learned highly representative features from large-scale pretraining data, which are fully preserved during widely used parameter-efficient tuning. We thus hypothesize that leveraging those learned features to augment the tuning data can boost the effectiveness of few-shot FViT tuning. To this end, we propose a framework called Hint-based Data Augmentation (Hint-Aug), which aims to boost FViT in few-shot tuning by augmenting the over-fitted parts of tuning samples with the learned features of pretrained FViTs. Specifically, Hint-Aug integrates two key enablers: (1) an Attentive Over-fitting Detector (AOD) to detect over-confident patches of foundation ViTs for potentially alleviating their over-fitting on the few-shot tuning data and (2) a Confusion-based Feature Infusion (CFI) module to infuse easy-to-confuse features from the pretrained FViTs with the over-confident patches detected by the above AOD in order to enhance the feature diversity during tuning. Extensive experiments and ablation studies on five datasets and three parameter-efficient tuning techniques consistently validate Hint-Aug’s effectiveness: 0.04% to 32.91% higher accuracy over the state-of-the-art (SOTA) data augmentation method under various low-shot settings. For example, on the Pet dataset, Hint-Aug achieves a 2.22% higher accuracy with 50% less training data over SOTA data augmentation methods.

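Summary: Demand for tuning foundation vision transformers (FViTs) on downstream tasks is growing, but fully unleashing their potential in data-limited scenarios such as few-shot tuning remains a challenge because FViTs are data-hungry. Common data augmentation techniques fall short here because the few-shot tuning data contains limited features. The authors first identify an opportunity: pretrained FViTs have already learned highly representative features from large-scale pretraining data, and these features are fully preserved during widely used parameter-efficient tuning. They therefore hypothesize that using those learned features to augment the tuning data can make few-shot FViT tuning more effective. To this end, they propose Hint-based Data Augmentation (Hint-Aug), a framework that augments the over-fitted parts of tuning samples with the learned features of pretrained FViTs. Hint-Aug integrates two key enablers: (1) an Attentive Over-fitting Detector (AOD) that detects over-confident patches of foundation ViTs, with the aim of alleviating over-fitting on the few-shot tuning data, and (2) a Confusion-based Feature Infusion (CFI) module that infuses easy-to-confuse features from the pretrained FViTs into the over-confident patches detected by the AOD, to enhance feature diversity during tuning. Extensive experiments and ablation studies on five datasets and three parameter-efficient tuning techniques consistently validate its effectiveness: accuracy is 0.04% to 32.91% higher than the state-of-the-art (SOTA) data augmentation method under various low-shot settings. On the Pet dataset, for example, Hint-Aug achieves 2.22% higher accuracy with 50% less training data than SOTA data augmentation methods.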

Paper52 D2Former: Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-Based Transformers

摘要原文: Establishing pixel-level matches between image pairs is vital for a variety of computer vision applications. However, achieving robust image matching remains challenging because CNN extracted descriptors usually lack discriminative ability in texture-less regions and keypoint detectors are only good at identifying keypoints with a specific level of structure. To deal with these issues, a novel image matching method is proposed by Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-based Transformers (D2Former), including a contextual feature descriptor learning (CFDL) module and a hierarchical keypoint detector learning (HKDL) module. The proposed D2Former enjoys several merits. First, the proposed CFDL module can model long-range contexts efficiently and effectively with the aid of designed descriptor agents. Second, the HKDL module can generate keypoint detectors in a hierarchical way, which is helpful for detecting keypoints with diverse levels of structures. Extensive experimental results on four challenging benchmarks show that our proposed method significantly outperforms state-of-the-art image matching methods.

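Summary: Establishing pixel-level matches between image pairs is vital for a variety of computer vision applications. Robust image matching remains challenging, however, because CNN-extracted descriptors usually lack discriminative ability in texture-less regions and keypoint detectors are only good at identifying keypoints with a specific level of structure. To deal with these issues, the paper proposes D2Former, an image matching method that jointly learns hierarchical detectors and contextual descriptors via agent-based Transformers. It includes a contextual feature descriptor learning (CFDL) module and a hierarchical keypoint detector learning (HKDL) module. D2Former has several merits. First, the CFDL module models long-range contexts efficiently and effectively with the aid of designed descriptor agents. Second, the HKDL module generates keypoint detectors hierarchically, which helps detect keypoints with diverse levels of structure. Extensive experiments on four challenging benchmarks show that the method significantly outperforms state-of-the-art image matching methods.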