ICCV 2023 Paper Roundup: Object Detection



Paper1 Towards Generic Image Manipulation Detection with Weakly-Supervised Self-Consistency Learning

Original abstract: As advanced image manipulation techniques emerge, detecting the manipulation becomes increasingly important. Despite the success of recent learning-based approaches for image manipulation detection, they typically require expensive pixel-level annotations to train, while exhibiting degraded performance when testing on images that are differently manipulated compared with training images. To address these limitations, we propose weakly-supervised image manipulation detection, such that only binary image-level labels (authentic or tampered with) are required for training purpose. Such weakly-supervised setting can leverage more training images and has the potential to adapt quickly to new manipulation techniques. To improve the generalization ability, we propose weakly-supervised self-consistency learning (WSCL) to leverage the weakly annotated images. For the second problem, we propose an end-to-end learnable method, which takes advantage of image self-consistency properties. Specifically, two consistency properties are learned: multi-source consistency (MSC) and inter-patch consistency (IPC). MSC exploits different content-agnostic information and enables cross-source learning via an online pseudo label generation and refinement process. IPC performs global pair-wise patch-patch relationship reasoning to discover a complete region of manipulation. Extensive experiments validate that our WSCL, even though is weakly supervised, exhibits competitive performance compared with fully-supervised counterpart under both in-distribution and out-of-distribution evaluations, as well as reasonable manipulation localization ability.

Summary: As image manipulation techniques advance, detecting tampering grows more important. Existing learning-based detectors need costly pixel-level annotations and degrade on manipulations unseen during training. The authors propose weakly-supervised image manipulation detection, which requires only binary image-level labels (authentic or tampered), can exploit more training images, and can adapt quickly to new manipulation types. To improve generalization they introduce weakly-supervised self-consistency learning (WSCL), an end-to-end method that learns two consistency properties: multi-source consistency (MSC), which exploits content-agnostic information and enables cross-source learning via online pseudo-label generation and refinement, and inter-patch consistency (IPC), which reasons over global pairwise patch relationships to recover the complete manipulated region. Despite being weakly supervised, WSCL is competitive with fully-supervised counterparts under both in-distribution and out-of-distribution evaluation and offers reasonable manipulation localization.

Paper2 Periodically Exchange Teacher-Student for Source-Free Object Detection

Original abstract: Source-free object detection (SFOD) aims to adapt the source detector to unlabeled target domain data in the absence of source domain data. Most SFOD methods follow the same self-training paradigm using mean-teacher (MT) framework where the student model is guided by only one single teacher model. However, such paradigm can easily fall into a training instability problem that when the teacher model collapses uncontrollably due to the domain shift, the student model also suffers drastic performance degradation. To address this issue, we propose the Periodically Exchange Teacher-Student (PETS) method, a simple yet novel approach that introduces a multiple-teacher framework consisting of a static teacher, a dynamic teacher, and a student model. During the training phase, we periodically exchange the weights between the static teacher and the student model. Then, we update the dynamic teacher using the moving average of the student model that has already been exchanged by the static teacher. In this way, the dynamic teacher can integrate knowledge from past periods, effectively reducing error accumulation and enabling a more stable training process within the MT-based framework. Further, we develop a consensus mechanism to merge the predictions of two teacher models to provide higher-quality pseudo labels for student model. Extensive experiments on multiple SFOD benchmarks show that the proposed method achieves state-of-the-art performance compared with other related methods, demonstrating the effectiveness and superiority of our method on SFOD task.

Summary: Source-free object detection (SFOD) adapts a source-trained detector to unlabeled target data without access to source data. Most SFOD methods use a mean-teacher self-training scheme with a single teacher, which is unstable: if domain shift causes the teacher to collapse, the student degrades sharply. The Periodically Exchange Teacher-Student (PETS) method instead uses a static teacher, a dynamic teacher, and a student. The static teacher and student periodically swap weights, and the dynamic teacher is updated as a moving average of the (already exchanged) student, letting it integrate knowledge from past periods, reduce error accumulation, and stabilize training. A consensus mechanism merges the two teachers' predictions into higher-quality pseudo labels. PETS achieves state-of-the-art results on multiple SFOD benchmarks.
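
The exchange-and-EMA schedule is simple enough to sketch. Below is a minimal, hypothetical PyTorch skeleton of the idea; the toy "detector", the pseudo-label target, and the exchange period are placeholders, not the paper's actual models or settings.

```python
import copy
import torch
import torch.nn as nn

def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.999):
    """Update the dynamic teacher as an exponential moving average of the student."""
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

def exchange_weights(a: nn.Module, b: nn.Module):
    """Swap the parameters of two models in place (the periodic exchange)."""
    with torch.no_grad():
        for p_a, p_b in zip(a.parameters(), b.parameters()):
            tmp = p_a.detach().clone()
            p_a.copy_(p_b)
            p_b.copy_(tmp)

student = nn.Linear(16, 4)            # stand-in for the detector
static_teacher = copy.deepcopy(student)
dynamic_teacher = copy.deepcopy(student)
optimizer = torch.optim.SGD(student.parameters(), lr=0.01)

EXCHANGE_PERIOD = 100                 # assumed period, not from the paper
for step in range(1, 501):
    x = torch.randn(8, 16)
    # In PETS the pseudo labels come from a consensus of both teachers;
    # here the dynamic teacher alone serves as a placeholder target.
    with torch.no_grad():
        target = dynamic_teacher(x)
    loss = nn.functional.mse_loss(student(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    ema_update(dynamic_teacher, student)
    if step % EXCHANGE_PERIOD == 0:
        exchange_weights(static_teacher, student)
```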

Paper3 SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection

Original abstract: LiDAR-Camera fusion-based 3D detection is a critical task for automatic driving. In recent years, many LiDAR-Camera fusion approaches sprung up and gained promising performances compared with single-modal detectors, but always lack carefully designed and effective supervision for the fusion process. In this paper, we propose a novel training strategy called SupFusion, which provides an auxiliary feature level supervision for effective LiDAR-Camera fusion and significantly boosts detection performance. Our strategy involves a data enhancement method named Polar Sampling, which densifies sparse objects and trains an assistant model to generate high-quality features as the supervision. These features are then used to train the LiDAR-Camera fusion model, where the fusion feature is optimized to simulate the generated high-quality features. Furthermore, we propose a simple yet effective deep fusion module, which contiguously gains superior performance compared with previous fusion methods with SupFusion strategy. In such a manner, our proposal shares the following advantages. Firstly, SupFusion introduces auxiliary feature-level supervision which could boost LiDAR-Camera detection performance without introducing extra inference costs. Secondly, the proposed deep fusion could continuously improve the detector’s abilities. Our proposed SupFusion and deep fusion module is plug-and-play, we make extensive experiments to demonstrate its effectiveness. Specifically, we gain around 2% 3D mAP improvements on KITTI benchmark based on multiple LiDAR-Camera 3D detectors. Our code is available at https://github.com/IranQin/SupFusion.

Summary: LiDAR-camera fusion is critical for 3D detection in autonomous driving, yet existing fusion methods, despite outperforming single-modal detectors, lack carefully designed supervision of the fusion process. SupFusion is a training strategy that adds auxiliary feature-level supervision: a data-enhancement method called Polar Sampling densifies sparse objects and trains an assistant model whose high-quality features supervise the fusion model, whose fused features are optimized to mimic them. A simple yet effective deep fusion module further improves over prior fusion designs under the SupFusion strategy. The approach boosts detection without extra inference cost, is plug-and-play, and yields around 2% 3D mAP gains on KITTI across multiple LiDAR-camera 3D detectors. Code: https://github.com/IranQin/SupFusion.

Paper4 ObjectFusion: Multi-modal 3D Object Detection with Object-Centric Fusion

Original abstract: Abstract not available

Summary: Not available (no abstract was provided).

Paper5 The Devil is in the Crack Orientation: A New Perspective for Crack Detection

Original abstract: Cracks are usually curve-like structures that are the focus of many computer-vision applications (e.g., road safety inspection and surface inspection of industrial facilities). The existing pixel-based crack segmentation methods rely on time-consuming and costly pixel-level annotations. And the object-based crack detection methods exploit the horizontal box to detect the crack without considering crack orientation, resulting in scale variation and intra-class variation. Considering this, we provide a new perspective for crack detection that models the cracks as a series of sub-cracks with the corresponding orientation. However, the vanilla adaptation of the existing oriented object detection methods to the crack detection tasks will result in limited performance, due to the boundary discontinuity issue and the ambiguities in sub-crack orientation. In this paper, we propose a first-of-its-kind oriented sub-crack detector, dubbed as CrackDet, which is derived from a novel piecewise angle definition, to ease the boundary discontinuity problem. And then, we propose a multi-branch angle regression loss for learning sub-crack orientation and variance together. Since there are no related benchmarks, we construct three fully annotated datasets, namely, ORC, ONPP, and OCCSD, which involve various cracks in road pavement and industrial facilities. Experiments show that our approach outperforms state-of-the-art crack detectors.

Summary: Pixel-based crack segmentation needs expensive pixel-level annotation, while object-based crack detectors use horizontal boxes that ignore crack orientation, causing scale and intra-class variation. This paper instead models a crack as a series of oriented sub-cracks. Because directly applying oriented object detectors suffers from boundary discontinuity and ambiguous sub-crack orientation, the authors propose CrackDet, a first-of-its-kind oriented sub-crack detector built on a novel piecewise angle definition that eases the boundary discontinuity problem, plus a multi-branch angle regression loss that learns sub-crack orientation and variance jointly. They also construct three fully annotated benchmarks (ORC, ONPP, OCCSD) covering cracks in road pavement and industrial facilities; experiments show CrackDet outperforms state-of-the-art crack detectors.

Paper6 Augmented Box Replay: Overcoming Foreground Shift for Incremental Object Detection

Original abstract: In incremental learning, replaying stored samples from previous tasks together with current task samples is one of the most efficient approaches to address catastrophic forgetting. However, unlike incremental classification, image replay has not been successfully applied to incremental object detection (IOD). In this paper, we identify the overlooked problem of foreground shift as the main reason for this. Foreground shift only occurs when replaying images of previous tasks and refers to the fact that their background might contain foreground objects of the current task. To overcome this problem, a novel and efficient Augmented Box Replay (ABR) method is developed that only stores and replays foreground objects and thereby circumvents the foreground shift problem. In addition, we propose an innovative Attentive RoI Distillation loss that uses spatial attention from region-of-interest (RoI) features to constrain current model to focus on the most important information from old model. ABR significantly reduces forgetting of previous classes while maintaining high plasticity in current classes. Moreover, it considerably reduces the storage requirements when compared to standard image replay. Comprehensive experiments on Pascal-VOC and COCO datasets support the state-of-the-art performance of our model.

Summary: In incremental learning, replaying stored samples from previous tasks alongside current samples is one of the most effective remedies for catastrophic forgetting, but image replay has not worked for incremental object detection (IOD). The authors identify the overlooked problem of foreground shift as the cause: replayed images' backgrounds may contain foreground objects of the current task. Augmented Box Replay (ABR) stores and replays only foreground objects, sidestepping foreground shift, and an Attentive RoI Distillation loss uses spatial attention on RoI features to keep the current model focused on the old model's most important information. ABR sharply reduces forgetting of previous classes while keeping high plasticity on current classes, and needs far less storage than standard image replay. Experiments on Pascal VOC and COCO show state-of-the-art performance.

Paper7 KECOR: Kernel Coding Rate Maximization for Active 3D Object Detection

Original abstract: Achieving a reliable LiDAR-based object detector in autonomous driving is paramount, but its success hinges on obtaining large amounts of precise 3D annotations. Active learning (AL) seeks to mitigate the annotation burden through algorithms that use fewer labels and can attain performance comparable to fully supervised learning. Although AL has shown promise, current approaches prioritize the selection of unlabeled point clouds with high aleatoric and/or epistemic uncertainty, leading to the selection of more instances for labeling and reduced computational efficiency. In this paper, we resort to a novel kernel coding rate maximization (KECOR) strategy which aims to identify the most informative point clouds to acquire labels through the lens of information theory. Greedy search is applied to seek desired point clouds that can maximize the minimal number of bits required to encode the latent features. To determine the uniqueness and informativeness of the selected samples from the model perspective, we construct a proxy network of the 3D detector head and compute the outer product of Jacobians from all proxy layers to form the empirical neural tangent kernel (NTK) matrix. To accommodate both one-stage (i.e., SECOND) and two-stage detectors (i.e., PV-RCNN), we further incorporate the classification entropy maximization and well trade-off between detection performance and the total number of bounding boxes selected for annotation. Extensive experiments conducted on two 3D benchmarks and a 2D detection dataset evidence the superiority and versatility of the proposed approach. Our results show that approximately 44% box-level annotation costs and 26% computational time are reduced compared to the state-of-the-art AL method, without compromising detection performance.

Summary: Reliable LiDAR-based detection for autonomous driving hinges on large amounts of precise 3D annotation; active learning (AL) reduces this burden, but current approaches favor point clouds with high aleatoric and/or epistemic uncertainty, selecting many instances to label and sacrificing computational efficiency. KECOR (kernel coding rate maximization) instead selects, from an information-theoretic view, the point clouds that maximize the minimal number of bits required to encode their latent features, found by greedy search. To measure the uniqueness and informativeness of selected samples from the model's perspective, a proxy network of the 3D detector head is built and the outer products of Jacobians across proxy layers form an empirical neural tangent kernel (NTK) matrix. Classification entropy maximization is further incorporated to suit both one-stage (SECOND) and two-stage (PV-RCNN) detectors and to trade off detection performance against the number of boxes annotated. Versus the state-of-the-art AL method, about 44% of box-level annotation cost and 26% of compute time are saved with no loss in detection performance, across two 3D benchmarks and a 2D detection dataset.
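
To make the selection criterion concrete, here is a toy NumPy sketch of greedy selection by coding-rate gain. The rate-distortion form of the coding rate and the linear kernel are generic placeholders; KECOR builds its kernel from proxy-network Jacobians (an empirical NTK), which is not reproduced here.

```python
import numpy as np

def coding_rate(K: np.ndarray, eps: float = 0.1) -> float:
    """Bits needed to encode features with kernel matrix K up to distortion eps."""
    n = K.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + K / (n * eps ** 2))
    return 0.5 * logdet

def greedy_select(K_full: np.ndarray, budget: int) -> list:
    """Greedily add the sample whose inclusion most increases the coding rate."""
    selected, remaining = [], list(range(K_full.shape[0]))
    for _ in range(budget):
        base = coding_rate(K_full[np.ix_(selected, selected)]) if selected else 0.0
        gains = [(coding_rate(K_full[np.ix_(selected + [i], selected + [i])]) - base, i)
                 for i in remaining]
        _, best = max(gains)
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
feats = rng.normal(size=(50, 8))   # stand-in for per-sample latent features
K = feats @ feats.T                # linear kernel as a placeholder for the NTK
print(greedy_select(K, budget=5))  # indices of the samples to annotate
```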

Paper8 PNI : Industrial Anomaly Detection using Position and Neighborhood Information

Original abstract: Because anomalous samples cannot be used for training, many anomaly detection and localization methods use pre-trained networks and non-parametric modeling to estimate encoded feature distribution. However, these methods neglect the impact of position and neighborhood information on the distribution of normal features. To overcome this, we propose a new algorithm, PNI, which estimates the normal distribution using conditional probability given neighborhood features, modeled with a multi-layer perceptron network. Moreover, position information is utilized by creating a histogram of representative features at each position. Instead of simply resizing the anomaly map, the proposed method employs an additional refine network trained on synthetic anomaly images to better interpolate and account for the shape and edge of the input image. We conducted experiments on the MVTec AD benchmark dataset and achieved state-of-the-art performance, with 99.56% and 98.98% AUROC scores in anomaly detection and localization, respectively. Code is available at https://github.com/wogur110/PNI_Anomaly_Detection.

Summary: Because anomalous samples cannot be used for training, many anomaly detection and localization methods estimate the distribution of features from pre-trained networks with non-parametric models, but they ignore how position and neighborhood information shape the normal-feature distribution. PNI estimates the normal distribution as a conditional probability given neighborhood features, modeled by a multi-layer perceptron, and exploits position information through a histogram of representative features at each position. Rather than simply resizing the anomaly map, a refinement network trained on synthetic anomaly images interpolates it while respecting the input image's shapes and edges. On MVTec AD, PNI reaches state-of-the-art performance: 99.56% AUROC for anomaly detection and 98.98% for localization. Code: https://github.com/wogur110/PNI_Anomaly_Detection.

Paper9 Gradient-based Sampling for Class Imbalanced Semi-supervised Object Detection

Original abstract: Current semi-supervised object detection (SSOD) algorithms typically assume class balanced datasets (PASCAL VOC etc.) or slightly class imbalanced datasets (MSCOCO, etc). This assumption can be easily violated since real world datasets can be extremely class imbalanced in nature, thus making the performance of semi-supervised object detectors far from satisfactory. Besides, the research for this problem in SSOD is severely under-explored. To bridge this research gap, we comprehensively study the class imbalance problem for SSOD under more challenging scenarios, thus forming the first experimental setting for class imbalanced SSOD (CI-SSOD). Moreover, we propose a simple yet effective gradient-based sampling framework that tackles the class imbalance problem from the perspective of two types of confirmation biases. To tackle confirmation bias towards majority classes, the gradient-based reweighting and gradient-based thresholding modules leverage the gradients from each class to fully balance the influence of the majority and minority classes. To tackle the confirmation bias from incorrect pseudo labels of minority classes, the class-rebalancing sampling module resamples unlabeled data following the guidance of the gradient-based reweighting module. Experiments on three proposed sub-tasks, namely MS-COCO, MS-COCO-Object365 and LVIS, suggest that our method outperforms current class imbalanced object detectors by clear margins, serving as a baseline for future research in CI-SSOD. Code will be available at https://github.com/nightkeepers/CI-SSOD.

Summary: Semi-supervised object detection (SSOD) methods usually assume class-balanced (PASCAL VOC) or mildly imbalanced (MS-COCO) data, but real-world datasets can be extremely imbalanced, leaving SSOD performance unsatisfactory, and the problem is severely under-explored. This work establishes the first experimental setting for class-imbalanced SSOD (CI-SSOD) and proposes a gradient-based sampling framework targeting two confirmation biases: bias toward majority classes, tackled by gradient-based reweighting and gradient-based thresholding modules that use per-class gradients to balance majority and minority influence, and bias from incorrect pseudo labels of minority classes, tackled by a class-rebalancing sampling module that resamples unlabeled data under the guidance of the reweighting module. On three proposed sub-tasks (MS-COCO, MS-COCO-Object365, LVIS) the method clearly outperforms current class-imbalanced detectors, serving as a baseline for future CI-SSOD research. Code: https://github.com/nightkeepers/CI-SSOD.

Paper10 MetaBEV: Solving Sensor Failures for 3D Detection and Map Segmentation

Original abstract: Abstract not available

Summary: Not available (no abstract was provided).

Paper11 Object-aware Gaze Target Detection

Original abstract: Gaze target detection aims to predict the image location where the person is looking and the probability that a gaze is out of the scene. Several works have tackled this task by regressing a gaze heatmap centered on the gaze location, however, they overlooked decoding the relationship between the people and the gazed objects. This paper proposes a Transformer-based architecture that automatically detects objects (including heads) in the scene to build associations between every head and the gazed-head/object, resulting in a comprehensive, explainable gaze analysis composed of: gaze target area, gaze pixel point, the class and the image location of the gazed-object. Upon evaluation of the in-the-wild benchmarks, our method achieves state-of-the-art results on all metrics (up to 2.91% gain in AUC, 50% reduction in gaze distance, and 9% gain in out-of-frame average precision) for gaze target detection and 11-13% improvement in average precision for the classification and the localization of the gazed-objects. The code of the proposed method is publicly available.

Summary: Gaze target detection predicts where in an image a person is looking and the probability that the gaze falls outside the scene. Prior work regresses a gaze heatmap centered on the gaze location but overlooks decoding the relationship between people and the gazed objects. This paper proposes a Transformer-based architecture that detects objects (including heads) and builds associations between every head and the gazed head/object, yielding a comprehensive, explainable analysis: gaze target area, gaze pixel point, and the class and image location of the gazed object. On in-the-wild benchmarks it sets state-of-the-art results on all metrics (up to 2.91% AUC gain, 50% lower gaze distance, 9% higher out-of-frame average precision) and improves gazed-object classification and localization AP by 11-13%. The code is publicly available.

Paper12 Nearest Neighbor Guidance for Out-of-Distribution Detection

Original abstract: Detecting out-of-distribution (OOD) samples are crucial for machine learning models deployed in open-world environments. Classifier-based scores are a standard approach for OOD detection due to their fine-grained detection capability. However, these scores often suffer from overconfidence issues, misclassifying OOD samples distant from the in-distribution region. To address this challenge, we propose a method called Nearest Neighbor Guidance (NNGuide) that guides the classifier-based score to respect the boundary geometry of the data manifold. NNGuide reduces the overconfidence of OOD samples while preserving the fine-grained capability of the classifier-based score. We conduct extensive experiments on ImageNet OOD detection benchmarks under diverse settings, including a scenario where the ID data undergoes natural distribution shift. Our results demonstrate that NNGuide provides a significant performance improvement on the base detection scores, achieving state-of-the-art results on both AUROC, FPR95, and AUPR metrics.

Summary: Detecting out-of-distribution (OOD) samples is crucial for models deployed in open-world environments. Classifier-based scores are standard thanks to their fine-grained detection ability but suffer from overconfidence, misclassifying OOD samples far from the in-distribution region. Nearest Neighbor Guidance (NNGuide) guides the classifier-based score to respect the boundary geometry of the data manifold, reducing overconfidence on OOD samples while preserving fine-grained capability. Extensive ImageNet OOD experiments under diverse settings, including one where the ID data undergoes natural distribution shift, show significant gains over the base detection scores and state-of-the-art AUROC, FPR95, and AUPR.
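
One plausible reading of the guidance idea can be sketched in a few lines: a classifier-based score (energy, here) is multiplied by the mean similarity to the k nearest training features. The energy/top-k cosine combination below is an illustrative assumption, not necessarily the paper's exact formulation.

```python
import numpy as np

def energy_score(logits: np.ndarray) -> np.ndarray:
    """Energy-based ID score (stable log-sum-exp); higher = more in-distribution."""
    m = logits.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).ravel()

def nn_guidance(feats: np.ndarray, bank: np.ndarray, k: int = 10) -> np.ndarray:
    """Mean cosine similarity to the k nearest training features."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = f @ b.T
    return np.sort(sims, axis=1)[:, -k:].mean(axis=1)

rng = np.random.default_rng(0)
bank = rng.normal(size=(1000, 64))    # ID training features (placeholder)
feats = rng.normal(size=(5, 64))      # test features
logits = rng.normal(size=(5, 10))     # classifier logits for the same inputs
score = nn_guidance(feats, bank) * energy_score(logits)
print(score)                          # low guided score -> flag as OOD
```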

Paper13 Open-Vocabulary Object Detection With an Open Corpus

Original abstract: Existing open vocabulary object detection (OVD) works expand the object detector toward open categories by replacing the classifier with the category text embeddings and optimizing the region-text alignment on data of the base categories. However, both the class-agnostic proposal generator and the classifier are biased to the seen classes as demonstrated by the gaps of objectness and accuracy assessment between base and novel classes. In this paper, an open corpus, composed of a set of external object concepts and clustered to several centroids, is introduced to improve the generalization ability in the detector. We propose the generalized objectness assessment (GOAT) in the proposal generator based on the visual-text alignment, where the similarities of visual feature to the cluster centroids are summarized as the objectness. This simple heuristic evaluates objectness with concepts in open corpus and is thus generalized to open categories. We further propose category expanding (CE) with open corpus in two training tasks, which enables the detector to perceive more categories in the feature space and get more reasonable optimization direction. For the classification task, we introduce an open corpus classifier by reconstructing original classifier with similar words in text space. For the image-caption alignment task, the open corpus centroids are incorporated to enlarge the negative samples in the contrastive loss. Extensive experiments demonstrate the effectiveness of GOAT and CE, which greatly improve the performance on novel classes and get new state-of-the-art on the OVD benchmarks.

Summary: Existing open-vocabulary detection (OVD) methods replace the classifier with category text embeddings and optimize region-text alignment on base-category data, but both the class-agnostic proposal generator and the classifier remain biased toward seen classes, as shown by the objectness and accuracy gaps between base and novel classes. This paper introduces an open corpus of external object concepts clustered into centroids. Generalized objectness assessment (GOAT) scores proposals by summarizing the similarities between a proposal's visual feature and the cluster centroids, generalizing objectness to open categories. Category expanding (CE) then applies the open corpus in two training tasks: for classification, an open-corpus classifier reconstructs the original classifier with similar words in text space; for image-caption alignment, the centroids enlarge the negative set in the contrastive loss. Together they let the detector perceive more categories and optimize in a more reasonable direction; GOAT and CE greatly improve novel-class performance and set new state-of-the-art results on OVD benchmarks.
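
The abstract says the similarities between a proposal's visual feature and the open-corpus centroids are "summarized" as its objectness; taking the maximum similarity is one such summary, used in this sketch purely as an assumption.

```python
import numpy as np

def goat_objectness(region_feats: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Objectness = best cosine similarity to any open-corpus concept centroid."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return (r @ c.T).max(axis=1)

rng = np.random.default_rng(0)
centroids = rng.normal(size=(32, 512))   # clustered concept embeddings (placeholder)
proposals = rng.normal(size=(100, 512))  # region visual features (placeholder)
obj = goat_objectness(proposals, centroids)
top10 = np.argsort(obj)[::-1][:10]       # rank proposals by generalized objectness
```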

Paper14 Anchor-Intermediate Detector: Decoupling and Coupling Bounding Boxes for Accurate Object Detection

Original abstract: Anchor-based detectors have been continuously developed for object detection. However, the individual anchor box makes it difficult to predict the boundary’s offset accurately. Instead of taking each bounding box as a closed individual, we consider using multiple boxes together to get prediction boxes. To this end, this paper proposes the Box Decouple-Couple (BDC) strategy in the inference, which no longer discards the overlapping boxes, but decouples the corner points of these boxes. Then, according to each corner’s score, we couple the corner points to select the most accurate corner pairs. To meet the BDC strategy, a simple but novel model is designed named the Anchor-Intermediate Detector (AID), which contains two head networks, i.e., an anchor-based head and an anchor-free Corner-aware head. The corner-aware head is able to score the corners of each bounding box to facilitate the coupling between corner points. Extensive experiments on MS COCO show that the proposed anchor-intermediate detector respectively outperforms their baseline RetinaNet and GFL method by 2.4 and 1.2 AP on the MS COCO test-dev dataset without any bells and whistles.

Summary: Anchor-based detectors struggle to predict boundary offsets accurately from a single anchor box. Instead of treating each bounding box as a closed individual, this paper uses multiple boxes together: the Box Decouple-Couple (BDC) inference strategy keeps overlapping boxes, decouples their corner points, and then couples corners according to each corner's score to select the most accurate corner pairs. To support BDC, the Anchor-Intermediate Detector (AID) adds an anchor-free corner-aware head alongside the anchor-based head; it scores each box's corners to facilitate the coupling. On MS COCO test-dev, AID improves the RetinaNet and GFL baselines by 2.4 and 1.2 AP respectively, without bells and whistles.

Paper15 FemtoDet: An Object Detection Baseline for Energy Versus Performance Tradeoffs

Original abstract: Efficient detectors for edge devices are often optimized for parameters or speed count metrics, which remain in weak correlation with the energy of detectors. However, some vision applications of convolutional neural networks, such as always-on surveillance cameras, are critical for energy constraints. This paper aims to serve as a baseline by designing detectors to reach tradeoffs between energy and performance from two perspectives:

1. We extensively analyze various CNNs to identify low-energy architectures, including selecting activation functions, convolutions operators, and feature fusion structures on necks. These underappreciated details in past work seriously affect the energy consumption of detectors;
2. To break through the dilemmatic energy-performance problem, we propose a balanced detector driven by energy using discovered low-energy components named FemtoDet. In addition to the novel construction, we improve FemtoDet by considering convolutions and training strategy optimizations. Specifically, we develop a new instance boundary enhancement (IBE) module for convolution optimization to overcome the contradiction between the limited capacity of CNNs and detection tasks in diverse spatial representations, and propose a recursive warm-restart (RecWR) for optimizing training strategy to escape the sub-optimization of light-weight detectors by considering the data shift produced in popular augmentations.

As a result, FemtoDet with only 68.77k parameters achieves a competitive score of 46.3 AP50 on PASCAL VOC and 1.11 W & 64.47 FPS on Qualcomm Snapdragon 865 CPU platforms. Extensive experiments on COCO and TJU-DHD datasets indicate that the proposed method achieves competitive results in diverse scenes.

Summary: Efficient detectors for edge devices are usually optimized for parameter count or speed, which correlate only weakly with energy use, yet applications such as always-on surveillance cameras are energy-constrained. This paper builds a baseline for energy-performance tradeoffs from two directions: (1) an extensive analysis of CNNs to identify low-energy architectures, covering the choice of activation functions, convolution operators, and neck feature-fusion structures, details underappreciated in past work that strongly affect energy consumption; (2) FemtoDet, a balanced, energy-driven detector built from the discovered low-energy components, further improved by an instance boundary enhancement (IBE) module that reconciles the limited capacity of small CNNs with the diverse spatial representations detection requires, and a recursive warm-restart (RecWR) training strategy that escapes the sub-optimization of lightweight detectors caused by the data shift of popular augmentations. With only 68.77k parameters, FemtoDet reaches 46.3 AP50 on PASCAL VOC and 1.11 W at 64.47 FPS on a Qualcomm Snapdragon 865 CPU; experiments on COCO and TJU-DHD confirm competitive results across diverse scenes.

Paper16 CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No

Original abstract: Out-of-distribution (OOD) detection refers to training the model on in-distribution (ID) dataset to classify if the input images come from unknown classes. Considerable efforts have been invested in designing various OOD detection methods based on either convolutional neural networks or transformers. However, Zero-shot OOD detection methods driven by CLIP, which require only class names for ID, have received less attention. This paper presents a novel method, namely CLIP saying no (CLIPN), which empowers “no” logic within CLIP. Our key motivation is to equip CLIP with the capability of distinguishing OOD and ID samples via positive-semantic prompts and negation-semantic prompts. To be specific, we design a novel learnable “no” prompt and a “no” text encoder to capture the negation-semantic with images. Subsequently, we introduce two loss functions: the image-text binary-opposite loss and the text semantic-opposite loss, which we use to teach CLIPN to associate images with “no” prompts, thereby enabling it to identify unknown samples. Furthermore, we propose two threshold-free inference algorithms to perform OOD detection via using negation semantics from “no” prompts and text encoder. Experimental results on 9 benchmark datasets (3 ID datasets and 6 OOD datasets) for the OOD detection task demonstrate that CLIPN outperforms 7 well-used algorithms by at least 1.1% and 7.37% on AUROC and FPR95 on zero-shot OOD detection of ImageNet-1K. Our CLIPN can serve as a solid foundation for leveraging CLIP effectively in downstream OOD tasks.

Summary: OOD detection trains on an in-distribution (ID) dataset to decide whether inputs come from unknown classes; CLIP-driven zero-shot OOD detection, which needs only ID class names, has received little attention. CLIP saying no (CLIPN) equips CLIP with "no" logic to separate OOD from ID samples via positive-semantic and negation-semantic prompts: a learnable "no" prompt and a "no" text encoder capture negation semantics, and two losses (image-text binary-opposite and text semantic-opposite) teach CLIPN to associate images with "no" prompts so it can identify unknown samples. Two threshold-free inference algorithms then perform OOD detection from the negation semantics. On nine benchmarks (3 ID, 6 OOD), CLIPN beats seven widely used algorithms by at least 1.1% AUROC and 7.37% FPR95 for zero-shot OOD detection on ImageNet-1K, offering a solid foundation for using CLIP in downstream OOD tasks.
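
A hedged sketch of the "saying no" decision: each class probability from standard prompts is down-weighted by the probability that the matching "no" prompt fits the image, and the remainder acts as an ID score. The sigmoid/softmax forms below are illustrative assumptions; the paper defines its own two threshold-free inference algorithms.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def saying_no_id_score(sim_yes: np.ndarray, sim_no: np.ndarray) -> np.ndarray:
    """ID score from standard-prompt and 'no'-prompt similarities."""
    p_yes = softmax(sim_yes)                        # class probabilities
    p_no = 1.0 / (1.0 + np.exp(sim_yes - sim_no))   # per-class 'no' probability
    return (p_yes * (1.0 - p_no)).sum(axis=1)       # high -> in-distribution

rng = np.random.default_rng(0)
sim_yes = rng.normal(size=(4, 1000))  # image vs. class-name prompts
sim_no = rng.normal(size=(4, 1000))   # image vs. learned 'no' prompts
print(saying_no_id_score(sim_yes, sim_no))
```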

Paper17 MapFormer: Boosting Change Detection by Using Pre-change Information

Original abstract: Change detection in remote sensing imagery is essential for a variety of applications such as urban planning, disaster management, and climate research. However, existing methods for identifying semantically changed areas overlook the availability of semantic information in the form of existing maps describing features of the earth’s surface. In this paper, we leverage this information for change detection in bi-temporal images. We show that the simple integration of the additional information via concatenation of latent representations suffices to significantly outperform state-of-the-art change detection methods. Motivated by this observation, we propose the new task of Conditional Change Detection, where pre-change semantic information is used as input next to bi-temporal images. To fully exploit the extra information, we propose MapFormer, a novel architecture based on a multi-modal feature fusion module that allows for feature processing conditioned on the available semantic information. We further employ a supervised, cross-modal contrastive loss to guide the learning of visual representations. Our approach outperforms existing change detection methods by an absolute 11.7% and 18.4% in terms of binary change IoU on DynamicEarthNet and HRSCD, respectively. Furthermore, we demonstrate the robustness of our approach to the quality of the pre-change semantic information and the absence pre-change imagery. The code is available at https://github.com/mxbh/mapformer.

Summary: Change detection in remote sensing supports urban planning, disaster management, and climate research, yet existing methods ignore the semantic information already available in maps describing the earth's surface. This paper leverages such information for bi-temporal change detection: simply concatenating latent representations of the extra information already outperforms state-of-the-art methods significantly, motivating the new task of Conditional Change Detection, where pre-change semantic information is an input alongside the bi-temporal images. MapFormer, built on a multi-modal feature fusion module, conditions feature processing on the available semantics, and a supervised cross-modal contrastive loss guides visual representation learning. The approach beats existing methods by an absolute 11.7% and 18.4% binary-change IoU on DynamicEarthNet and HRSCD respectively, and is robust to the quality of the pre-change semantics and to missing pre-change imagery. Code: https://github.com/mxbh/mapformer.

Paper18 ALWOD: Active Learning for Weakly-Supervised Object Detection

Original abstract: Object detection (OD), a crucial vision task, remains challenged by the lack of large training datasets with precise object localization labels. In this work, we propose ALWOD, a new framework that addresses this problem by fusing active learning (AL) with weakly and semi-supervised object detection paradigms. Because the performance of AL critically depends on the model initialization, we propose a new auxiliary image generator strategy that utilizes an extremely small labeled set, coupled with a large weakly tagged set of images, as a warm-start for AL. We then propose a new AL acquisition function, another critical factor in AL success, that leverages the student-teacher OD pair disagreement and uncertainty to effectively propose the most informative images to annotate. Finally, to complete the AL loop, we introduce a new labeling task delegated to human annotators, based on selection and correction of model-proposed detections, which is both rapid and effective in labeling the informative images. We demonstrate, across several challenging benchmarks, that ALWOD significantly narrows the gap between the ODs trained on few partially labeled but strategically selected image instances and those that rely on the fully-labeled data. Our code is publicly available on https://github.com/seqam-lab/ALWOD.

Summary: Object detection still lacks large training sets with precise localization labels. ALWOD addresses this by fusing active learning (AL) with weakly and semi-supervised detection: because AL depends critically on model initialization, an auxiliary image generator warm-starts AL from an extremely small labeled set plus a large weakly tagged set; a new acquisition function uses student-teacher detector disagreement and uncertainty to propose the most informative images to annotate; and a new annotation task, selecting and correcting model-proposed detections, lets human annotators label those images quickly and effectively. Across challenging benchmarks, ALWOD markedly narrows the gap between detectors trained on a few strategically selected, partially labeled images and those trained on fully labeled data. Code: https://github.com/seqam-lab/ALWOD.

Paper19 Simple and Effective Out-of-Distribution Detection via Cosine-based Softmax Loss

Original abstract: Deep learning models need to detect out-of-distribution (OOD) data in the inference stage because they are trained to estimate the train distribution and infer the data sampled from the distribution. Many methods have been proposed, but they have some limitations, such as requiring additional data, input processing, or high computational cost. Moreover, most methods have hyperparameters to be set by users, which have a significant impact on the detection rate. We propose a simple and effective OOD detection method by combining the feature norm and the Mahalanobis distance obtained from classification models trained with the cosine-based softmax loss. Our method is practical because it does not use additional data for training, is about three times faster when inferencing than the methods using the input processing, and is easy to apply because it does not have any hyperparameters for OOD detection. We confirm that our method is superior to or at least comparable to state-of-the-art OOD detection methods through the experiments.

Summary: Deep models must detect OOD data at inference because they are trained to estimate the training distribution and infer data sampled from it. Many existing methods need extra data, input preprocessing, or heavy computation, and most expose user-set hyperparameters that strongly affect the detection rate. This paper proposes a simple, effective OOD detector combining the feature norm with the Mahalanobis distance obtained from classifiers trained with a cosine-based softmax loss. It uses no extra training data, infers about three times faster than input-preprocessing methods, and has no OOD-detection hyperparameters; experiments show it matches or beats state-of-the-art methods.
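
A minimal sketch of the two ingredients, feature norm and Mahalanobis distance (class means with a shared covariance, the standard construction). How the paper combines them is not specified here, so the ratio below is an illustrative assumption.

```python
import numpy as np

def fit_gaussian(feats, labels, num_classes):
    """Class means and shared precision matrix from ID training features."""
    means = np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])
    centered = feats - means[labels]
    cov = centered.T @ centered / len(feats)
    return means, np.linalg.pinv(cov)

def ood_score(feats, means, prec):
    """Feature norm over minimum Mahalanobis distance; higher = more ID."""
    norms = np.linalg.norm(feats, axis=1)
    d = feats[:, None, :] - means[None, :, :]
    maha = np.einsum('nkd,de,nke->nk', d, prec, d).min(axis=1)
    return norms / np.sqrt(maha + 1e-8)

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(500, 32))
labels = rng.integers(0, 10, size=500)
means, prec = fit_gaussian(train_feats, labels, num_classes=10)
print(ood_score(rng.normal(size=(3, 32)), means, prec))
```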

Paper20 Anomaly Detection using Score-based Perturbation Resilience

Original abstract: Unsupervised anomaly detection is widely studied for industrial applications since it is difficult to obtain anomalous data. In particular, reconstruction-based anomaly detection can be a feasible solution if there is no option to use external knowledge, such as extra datasets or pre-trained models. However, reconstruction-based methods have limited utility due to poor detection performance. A score-based model, also known as a denoising diffusion model, recently has shown a high sample quality in the generation task. In this paper, we propose a novel unsupervised anomaly detection method leveraging the score-based model. This method promises good performance without external knowledge. The score, a gradient of the log-likelihood, has a property that is available for anomaly detection. The samples on the data manifold can be restored instantly by the score, even if they are randomly perturbed. We call this a score-based perturbation resilience. On the other hand, the samples that deviate from the manifold cannot be restored in the same way. The variation of resilience depending on the sample position can be an indicator to discriminate anomalies. We derive this statement from a geometric perspective. Our method shows superior performance on three benchmark datasets for industrial anomaly detection. Specifically, on MVTec AD, we achieve image-level AUROC of 97.7% and pixel-level AUROC of 97.4% outperforming previous works that do not use external knowledge.

Summary: Unsupervised anomaly detection matters industrially because anomalous data is hard to obtain, and reconstruction-based methods, while usable without external knowledge such as extra datasets or pre-trained models, detect poorly. This paper leverages score-based (denoising diffusion) models: the score, the gradient of the log-likelihood, can instantly restore randomly perturbed samples that lie on the data manifold, a property the authors call score-based perturbation resilience, whereas samples off the manifold cannot be restored the same way. The variation of this resilience with sample position, derived from a geometric perspective, discriminates anomalies. The method outperforms prior work that uses no external knowledge on three industrial benchmarks, reaching 97.7% image-level and 97.4% pixel-level AUROC on MVTec AD.
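
The perturb-then-restore mechanism can be shown with a toy score function: a Gaussian log-density gradient stands in for the learned score network, and the distance between the restored sample and the original serves as the anomaly indicator. All constants are illustrative.

```python
import numpy as np

def score_fn(x, mean, prec):
    """Gradient of a toy Gaussian log-density (stand-in for a score network)."""
    return -(x - mean) @ prec

def perturbation_resilience(x, mean, prec, sigma=0.5, steps=20, lr=0.1, seed=0):
    """Perturb, restore by ascending the score, and measure restoration error."""
    rng = np.random.default_rng(seed)
    x_hat = x + sigma * rng.normal(size=x.shape)
    for _ in range(steps):
        x_hat = x_hat + lr * score_fn(x_hat, mean, prec)
    return np.linalg.norm(x_hat - x, axis=-1)   # low error -> normal sample

mean, prec = np.zeros(8), np.eye(8)
print(perturbation_resilience(np.zeros((1, 8)), mean, prec))       # on-manifold
print(perturbation_resilience(5.0 * np.ones((1, 8)), mean, prec))  # off-manifold
```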

Paper21 Object as Query: Lifting Any 2D Object Detector to 3D Detection

Original abstract: 3D object detection from multi-view images has drawn much attention over the past few years. Existing methods mainly establish 3D representations from multi-view images and adopt a dense detection head for object detection, or employ object queries distributed in 3D space to localize objects. In this paper, we design Multi-View 2D Objects guided 3D Object Detector (MV2D), which can lift any 2D object detector to multi-view 3D object detection. Since 2D detections can provide valuable priors for object existence, MV2D exploits 2D detectors to generate object queries conditioned on the rich image semantics. These dynamically generated queries help MV2D to recall objects in the field of view and show a strong capability of localizing 3D objects. For the generated queries, we design a sparse cross attention module to force them to focus on the features of specific objects, which suppresses interference from noises. The evaluation results on the nuScenes dataset demonstrate the dynamic object queries and sparse feature aggregation can promote 3D detection capability. MV2D also exhibits a state-of-the-art performance among existing methods. We hope MV2D can serve as a new baseline for future research.

Summary: Existing multi-view 3D detectors either build 3D representations from multi-view images with a dense detection head or distribute object queries in 3D space. MV2D (Multi-View 2D Objects guided 3D Object Detector) lifts any 2D detector to multi-view 3D detection: since 2D detections provide valuable priors on object existence, MV2D uses them to generate object queries conditioned on rich image semantics; these dynamically generated queries recall objects in the field of view and localize them well in 3D. A sparse cross-attention module forces each query to focus on its specific object's features, suppressing noise. On nuScenes, the dynamic queries and sparse feature aggregation improve 3D detection capability, and MV2D achieves state-of-the-art performance, offering a new baseline for future research.

Paper22 Revisit PCA-based Technique for Out-of-Distribution Detection

Original abstract: Out-of-distribution (OOD) detection is a desired ability to ensure the reliability and safety of intelligent systems. A scoring function is often designed to measure the degree of any new data being an OOD sample. While most designed scoring functions are based on a single source of information (e.g., the classifier’s output, logits, or feature vector), recent studies demonstrate that fusion of multiple sources may help better detect OOD data. In this study, after detailed analysis of the issue in OOD detection by the conventional principal component analysis (PCA), we propose fusing a simple regularized PCA-based reconstruction error with other source of scoring function to further improve OOD detection performance. In particular, when combined with a strong energy score-based OOD method, the regularized reconstruction error helps achieve new state-of-the-art OOD detection results on multiple standard benchmarks. The code is available at https://github.com/SYSU-MIA-GROUP/pca-based-out-of-distribution-detection.

Summary: OOD detection is essential for reliable, safe intelligent systems and is typically done with a scoring function measuring how likely new data is OOD. Most scores use a single information source (classifier output, logits, or features), but recent studies show fusing multiple sources helps. After a detailed analysis of why conventional PCA struggles for OOD detection, this paper fuses a simple regularized PCA-based reconstruction error with another scoring source; combined with a strong energy-score-based method, the regularized reconstruction error yields new state-of-the-art OOD results on multiple standard benchmarks. Code: https://github.com/SYSU-MIA-GROUP/pca-based-out-of-distribution-detection.
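
A minimal fusion sketch: PCA reconstruction error on penultimate features plus an energy score on logits. The plain (unregularized) reconstruction error and the fusion weight of 0.1 are simplifying assumptions; the paper's regularization is not reproduced here.

```python
import numpy as np

def fit_pca(feats, n_components):
    """Mean and top principal directions of ID training features."""
    mean = feats.mean(axis=0)
    _, _, vt = np.linalg.svd(feats - mean, full_matrices=False)
    return mean, vt[:n_components]

def recon_error(feats, mean, comps):
    """Error of reconstructing features from the principal subspace."""
    z = (feats - mean) @ comps.T
    return np.linalg.norm(feats - (z @ comps + mean), axis=1)

def energy(logits):
    m = logits.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).ravel()

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(1000, 64))
mean, comps = fit_pca(train_feats, n_components=32)
test_feats, test_logits = rng.normal(size=(5, 64)), rng.normal(size=(5, 10))
fused = energy(test_logits) - 0.1 * recon_error(test_feats, mean, comps)
print(fused)   # lower fused score -> more likely OOD
```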

Paper23 Focus the Discrepancy: Intra- and Inter-Correlation Learning for Image Anomaly Detection

Original abstract: Humans recognize anomalies through two aspects: larger patch-wise representation discrepancies and weaker patch-to-normal-patch correlations. However, the previous AD methods didn’t sufficiently combine the two complementary aspects to design AD models. To this end, we find that Transformer can ideally satisfy the two aspects as its great power in the unified modeling of patch-wise representations and patch-to-patch correlations. In this paper, we propose a novel AD framework: FOcus-the-Discrepancy (FOD), which can simultaneously spot the patch-wise, intra- and inter-discrepancies of anomalies. The major characteristic of our method is that we renovate the self attention maps in transformers to Intra-Inter-Correlation (I2Correlation). The I2Correlation contains a two-branch structure to first explicitly establish intra- and inter-image correlations, and then fuses the features of two-branch to spotlight the abnormal patterns. To learn the intra- and inter-correlations adaptively, we propose the RBF-kernel-based target-correlations as learning targets for self-supervised learning. Besides, we introduce an entropy constraint strategy to solve the mode collapse issue in optimization and further amplify the normal abnormal distinguishability. Extensive experiments on three unsupervised real-world AD benchmarks show the superior performance of our approach. Code will be available at https://github.com/xcyao00/FOD.

Summary: Humans recognize anomalies through two complementary cues, larger patch-wise representation discrepancies and weaker patch-to-normal-patch correlations, which prior AD methods have not sufficiently combined. Transformers suit both, since they jointly model patch-wise representations and patch-to-patch correlations. FOcus-the-Discrepancy (FOD) spots patch-wise, intra- and inter-image discrepancies simultaneously by renovating self-attention maps into Intra-Inter-Correlation (I2Correlation): a two-branch structure first explicitly builds intra- and inter-image correlations, then fuses the branches' features to spotlight abnormal patterns. RBF-kernel-based target correlations serve as self-supervised learning targets for these correlations, and an entropy constraint counters mode collapse during optimization and amplifies normal-abnormal distinguishability. FOD shows superior performance on three unsupervised real-world AD benchmarks. Code: https://github.com/xcyao00/FOD.
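
The RBF-kernel target correlation admits a simple sketch: a spatial Gaussian prior over patch positions, row-normalized so each row can serve as a self-supervised target for an attention map. What the kernel is computed over and its bandwidth are assumptions inferred from the abstract.

```python
import numpy as np

def rbf_target_correlation(h: int, w: int, sigma: float = 2.0) -> np.ndarray:
    """Row-normalized spatial RBF kernel over an h x w patch grid."""
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    d2 = ((coords[:, None] - coords[None]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    return K / K.sum(axis=1, keepdims=True)

target = rbf_target_correlation(14, 14)  # 196 x 196 target for a 14 x 14 grid
print(target.shape, target[0].argmax())  # each row peaks at its own position
```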

Paper24 RecursiveDet: End-to-End Region-Based Recursive Object Detection

Original abstract: End-to-end region-based object detectors like Sparse R-CNN usually have multiple cascade bounding box decoding stages, which refine the current predictions according to their previous results. Model parameters within each stage are independent, evolving a huge cost. In this paper, we find the general setting of decoding stages is actually redundant. By simply sharing parameters and making a recursive decoder, the detector already obtains a significant improvement. The recursive decoder can be further enhanced by positional encoding (PE) of the proposal box, which makes it aware of the exact locations and sizes of input bounding boxes, thus becoming adaptive to proposals from different stages during the recursion. Moreover, we also design centerness-based PE to distinguish the RoI feature element and dynamic convolution kernels at different positions within the bounding box. To validate the effectiveness of the proposed method, we conduct intensive ablations and build the full model on three recent mainstream region-based detectors. The RecursiveDet is able to achieve obvious performance boosts with even fewer model parameters and slightly increased computation cost.

Summary: End-to-end region-based detectors such as Sparse R-CNN refine predictions through multiple cascade box-decoding stages whose parameters are independent, which is costly. This paper finds those separate stages largely redundant: simply sharing parameters and decoding recursively already yields a significant improvement. The recursive decoder is further enhanced by positional encoding (PE) of the proposal box, making it aware of each input box's exact location and size and hence adaptive to proposals from different recursion stages, and by a centerness-based PE that distinguishes RoI feature elements and dynamic convolution kernels at different positions inside the box. Built on three recent mainstream region-based detectors with intensive ablations, RecursiveDet delivers clear performance gains with even fewer parameters and only slightly higher computation.

Paper25 Cascade-DETR: Delving into High-Quality Universal Object Detection

Original abstract: Object localization in general environments is a fundamental part of vision systems. While dominating on the COCO benchmark, recent Transformer-based detection methods are not competitive in diverse domains. Moreover, these methods still struggle to very accurately estimate the object bounding boxes in complex environments.

We introduce Cascade-DETR for high-quality universal object detection. We jointly tackle the generalization to diverse domains and localization accuracy by proposing the Cascade Attention layer, which explicitly integrates object-centric information into the detection decoder by limiting the attention to the previous box prediction. To further enhance accuracy, we also revisit the scoring of queries. Instead of relying on classification scores, we predict the expected IoU of the query, leading to substantially more well-calibrated confidences. Lastly, we introduce a universal object detection benchmark, UDB10, that contains 10 datasets from diverse domains. While also advancing the state-of-the-art on COCO, Cascade-DETR substantially improves DETR-based detectors on all datasets in UDB10, even by over 10 mAP in some cases. The improvements under stringent quality requirements are even more pronounced. Our code and pretrained models are at https://github.com/SysCV/cascade-detr.

Summary: Object localization in general environments is fundamental to vision systems, yet recent Transformer-based detectors, while dominant on COCO, are uncompetitive across diverse domains and still struggle to estimate boxes very accurately in complex scenes. Cascade-DETR targets high-quality universal detection: a Cascade Attention layer injects object-centric information into the detection decoder by restricting attention to the previous box prediction, jointly addressing cross-domain generalization and localization accuracy, and query scoring is revisited by predicting each query's expected IoU instead of relying on classification scores, yielding substantially better-calibrated confidences. The paper also introduces UDB10, a universal detection benchmark of 10 datasets from diverse domains. Cascade-DETR advances the state of the art on COCO and substantially improves DETR-based detectors on all UDB10 datasets, by over 10 mAP in some cases, with even larger gains under stringent quality requirements. Code and pretrained models: https://github.com/SysCV/cascade-detr.

Paper26 WDiscOOD: Out-of-Distribution Detection via Whitened Linear Discriminant Analysis

Original abstract: Deep neural networks are susceptible to generating overconfident yet erroneous predictions when presented with data beyond known concepts. This challenge underscores the importance of detecting out-of-distribution (OOD) samples in the open world. In this work, we propose a novel feature-space OOD detection score based on class-specific and class-agnostic information. Specifically, the approach utilizes Whitened Linear Discriminant Analysis to project features into two subspaces - the discriminative and residual subspaces - for which the in-distribution (ID) classes are maximally separated and closely clustered, respectively. The OOD score is then determined by combining the deviation from the input data to the ID pattern in both subspaces. The efficacy of our method, named WDiscOOD, is verified on the large-scale ImageNet-1k benchmark, with six OOD datasets that cover a variety of distribution shifts. WDiscOOD demonstrates superior performance on deep classifiers with diverse backbone architectures, including CNN and vision transformer. Furthermore, we also show that WDiscOOD more effectively detects novel concepts in representation spaces trained with contrastive objectives, including supervised contrastive loss and multi-modality contrastive loss.

Summary: Deep networks produce overconfident yet wrong predictions on data beyond known concepts, so detecting OOD samples in the open world is important. This work proposes a feature-space OOD score combining class-specific and class-agnostic information: Whitened Linear Discriminant Analysis projects features into a discriminative subspace, where ID classes are maximally separated, and a residual subspace, where they are closely clustered, and the OOD score combines the deviation from the ID pattern in both. WDiscOOD is validated on the large-scale ImageNet-1k benchmark with six OOD datasets covering varied distribution shifts; it performs well across backbone architectures including CNNs and vision transformers, and detects novel concepts more effectively in representation spaces trained with contrastive objectives, both supervised contrastive and multi-modality contrastive losses.
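
A simplified sketch of the two-subspace scoring: whiten features with the training covariance, split directions into discriminative (top eigenvectors of the between-class scatter) and residual parts, and combine deviations measured in each. The scatter estimates and the additive combination below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def fit_wdisc(feats, labels, k):
    """Whitener, class means, and discriminative/residual direction split."""
    mean = feats.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(feats - mean, rowvar=False))
    W = vecs @ np.diag(1.0 / np.sqrt(vals + 1e-6)) @ vecs.T   # whitening matrix
    Z = (feats - mean) @ W
    mus = np.stack([Z[labels == c].mean(axis=0) for c in np.unique(labels)])
    _, U = np.linalg.eigh(np.cov(mus, rowvar=False))          # between-class scatter
    return mean, W, mus, U[:, -k:], U[:, :-k]                 # disc / residual dirs

def wdisc_score(x, mean, W, mus, disc, resid):
    """Deviation from class means (discriminative) plus global deviation (residual)."""
    z = (x - mean) @ W
    d_disc = np.linalg.norm((z[:, None] - mus[None]) @ disc, axis=2).min(axis=1)
    d_resid = np.linalg.norm(z @ resid, axis=1)
    return -(d_disc + d_resid)   # higher -> more in-distribution

rng = np.random.default_rng(0)
feats, labels = rng.normal(size=(600, 32)), rng.integers(0, 6, size=600)
params = fit_wdisc(feats, labels, k=5)
print(wdisc_score(rng.normal(size=(3, 32)), *params))
```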

Paper27 Anomaly Detection Under Distribution Shift

Original abstract: Anomaly detection (AD) is a crucial machine learning task that aims to learn patterns from a set of normal training samples to identify abnormal samples in test data. Most existing AD studies assume that the training and test data are drawn from the same data distribution, but the test data can have large distribution shifts arising in many real-world applications due to different natural variations such as new lighting conditions, object poses, or background appearances, rendering existing AD methods ineffective in such cases. In this paper, we consider the problem of anomaly detection under distribution shift and establish performance benchmarks on four widely-used AD and out-of-distribution (OOD) generalization datasets. We demonstrate that simple adaptation of state-of-the-art OOD generalization methods to AD settings fails to work effectively due to the lack of labeled anomaly data. We further introduce a novel robust AD approach to diverse distribution shifts by minimizing the distribution gap between in-distribution and OOD normal samples in both the training and inference stages in an unsupervised way. Our extensive empirical results on the four datasets show that our approach substantially outperforms state-of-the-art AD methods and OOD generalization methods on data with various distribution shifts, while maintaining the detection accuracy on in-distribution data. Code and data are available at https://github.com/mala-lab/ADShift.

Summary: Anomaly detection (AD) learns patterns from normal training samples to flag abnormal test samples, and most studies assume training and test data share one distribution. Real-world test data, however, can shift substantially (new lighting conditions, object poses, background appearances), breaking existing AD methods. This paper studies AD under distribution shift and establishes performance benchmarks on four widely used AD and OOD-generalization datasets. Simply adapting state-of-the-art OOD generalization methods to AD fails for lack of labeled anomaly data, so the authors introduce a robust AD approach that, in an unsupervised way, minimizes the distribution gap between in-distribution and OOD normal samples during both training and inference. Across the four datasets it substantially outperforms state-of-the-art AD and OOD generalization methods on shifted data while preserving in-distribution detection accuracy. Code and data: https://github.com/mala-lab/ADShift.

Paper28 Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection

Original abstract: In this work, we build a modular-designed codebase, formulate strong training recipes, design an error diagnosis toolbox, and discuss current methods for image-based 3D object detection. Specifically, different from other highly mature tasks, e.g., 2D object detection, the community of image-based 3D object detection is still evolving, where methods often adopt different training recipes and tricks resulting in unfair evaluations and comparisons. What is worse, these tricks may overwhelm their proposed designs in performance, even leading to wrong conclusions. To address this issue, we build a module-designed codebase and formulate unified training standards for the community. Furthermore, we also design an error diagnosis toolbox to measure the detailed characterization of detection models. Using these tools, we analyze current methods in-depth under varying settings and provide discussions for some open questions, e.g., discrepancies in conclusions on KITTI-3D and nuScenes datasets, which have led to different dominant methods for these datasets. We hope that this work will facilitate future research in vision-based 3D detection. Our codes will be released at https://github.com/OpenGVLab/3dodi.

Summary: Unlike mature tasks such as 2D detection, the image-based 3D object detection community is still evolving: methods adopt different training recipes and tricks, producing unfair evaluations and comparisons, and those tricks can overwhelm the proposed designs in performance, even leading to wrong conclusions. This work builds a modular codebase, formulates unified training standards and strong recipes, and designs an error-diagnosis toolbox that characterizes detection models in detail. With these tools the authors analyze current methods in depth under varying settings and discuss open questions, such as the discrepant conclusions on KITTI-3D versus nuScenes that have led to different dominant methods on each dataset, aiming to facilitate future research in vision-based 3D detection. Code will be released at https://github.com/OpenGVLab/3dodi.

Paper29 Reconciling Object-Level and Global-Level Objectives for Long-Tail Detection

Original abstract: Large vocabulary object detectors are often faced with the long-tailed label distributions, seriously degrading their ability to detect rarely seen categories. On one hand, the rare objects are prone to be misclassified as frequent categories. On the other hand, due to the limitation on the total number of detections per image, detectors usually rank all the confidence scores globally and filter out the lower-ranking ones. This may result in missed detection during inference, especially for the rare categories that naturally come with lower scores. Existing methods mainly focus on the former problem and design various classification loss to enhance the object-level classification accuracy, but largely overlook the global-level ranking task. In this paper, we propose a novel framework that Reconciles Object-level and Global-level (ROG) objectives to address both problems. As a multi-task learning framework, ROG simultaneously trains the model with two tasks: classifying each object proposal individually and ranking all the confidence scores globally. Specifically, complementary to the object-level classification loss for model discrimination, we design a generalized average precision (GAP) loss to explicitly optimize the global-level score ranking across different objects. For each category, GAP loss generates balanced gradients to rectify the ranking errors. In experiments, we show that GAP loss is highly versatile to be plugged into various advanced methods and brings considerable benefits.

Summary: Long-tailed label distributions degrade large-vocabulary detectors in two ways: rare objects tend to be misclassified as frequent categories, and because detections per image are capped, confidence scores are ranked globally and low-ranked ones filtered out, so rare categories with naturally lower scores get missed at inference. Existing work focuses on the first problem via classification losses while largely overlooking the global ranking task. This paper proposes ROG, a multi-task framework that Reconciles Object-level and Global-level objectives: it classifies each proposal individually and ranks all confidence scores globally. Complementing the object-level classification loss, a generalized average precision (GAP) loss explicitly optimizes global score ranking across objects, generating balanced per-category gradients that rectify ranking errors. Experiments show GAP loss is highly versatile, plugging into various advanced methods with considerable benefits.

Paper30 PolicyCleanse: Backdoor Detection and Mitigation for Competitive Reinforcement Learning

Original abstract: While real-world applications of reinforcement learning (RL) are becoming popular, the security and robustness of RL systems are worthy of more attention and exploration. In particular, recent works have revealed that, in a multi-agent RL environment, backdoor trigger actions can be injected into a victim agent (a.k.a. Trojan agent), which can result in a catastrophic failure as soon as it sees the backdoor trigger action. To ensure the security of RL agents against malicious backdoors, in this work, we propose the problem of Backdoor Detection in multi-agent RL systems, with the objective of detecting Trojan agents as well as the corresponding potential trigger actions, and further trying to mitigate their bad impact. In order to solve this problem, we propose PolicyCleanse that is based on the property that the activated Trojan agent’s accumulated rewards degrade noticeably after several timesteps. Along with PolicyCleanse, we also design a machine unlearning-based approach that can effectively mitigate the detected backdoor. Extensive experiments demonstrate that the proposed methods can accurately detect Trojan agents, and outperform existing backdoor mitigation baseline approaches by at least 3% in winning rate across various types of agents and environments.

Summary: As reinforcement learning (RL) sees real-world use, its security and robustness deserve more attention: in multi-agent RL, backdoor trigger actions can be injected into a victim (Trojan) agent, causing catastrophic failure as soon as the trigger action is observed. This work poses the problem of Backdoor Detection in multi-agent RL: detect Trojan agents and their potential trigger actions, then mitigate the damage. PolicyCleanse exploits the property that an activated Trojan agent's accumulated reward degrades noticeably after several timesteps, and a machine-unlearning-based procedure effectively mitigates detected backdoors. Experiments show accurate Trojan-agent detection and at least a 3% higher winning rate than existing backdoor-mitigation baselines across agent types and environments.

Paper31 DQS3D: Densely-matched Quantization-aware Semi-supervised 3D Detection

Original abstract: In this paper, we study the problem of semi-supervised 3D object detection, which is of great importance considering the high annotation cost for cluttered 3D indoor scenes. We resort to the robust and principled framework of self-teaching, which has triggered notable progress for semi-supervised learning recently. While this paradigm is natural for image-level or pixel-level prediction, adapting it to the detection problem is challenged by the issue of proposal matching. Prior methods are based upon two-stage pipelines, matching heuristically selected proposals generated in the first stage and resulting in spatially sparse training signals. In contrast, we propose the first semi-supervised 3D detection algorithm that works in the single-stage manner and allows spatially dense training signals. A fundamental issue of this new design is the quantization error caused by point-to-voxel discretization, which inevitably leads to misalignment between two transformed views in the voxel domain. To this end, we derive and implement closed-form rules that compensate this misalignment on-the-fly. Our results are significant, e.g., promoting ScanNet mAP@0.5 from 35.2% to 48.5% using 20% annotation. Codes and data are publicly available.

Summary: Semi-supervised 3D object detection matters because annotating cluttered 3D indoor scenes is costly. The authors build on self-teaching, which has driven recent semi-supervised progress, but adapting it from image- or pixel-level prediction to detection is hampered by proposal matching: prior two-stage pipelines match heuristically selected first-stage proposals, yielding spatially sparse training signals. This paper presents the first single-stage semi-supervised 3D detector with spatially dense training signals. The key obstacle of this design is quantization error from point-to-voxel discretization, which inevitably misaligns two transformed views in the voxel domain; the authors derive and implement closed-form rules that compensate this misalignment on the fly. Results are significant, e.g., ScanNet mAP@0.5 rises from 35.2% to 48.5% with 20% annotation. Code and data are public.

Paper32 Cyclic-Bootstrap Labeling for Weakly Supervised Object Detection

Original abstract: Recent progress in weakly supervised object detection is featured by a combination of multiple instance detection networks (MIDN) and ordinal online refinement. However, with only image-level annotation, MIDN inevitably assigns high scores to some unexpected region proposals when generating pseudo labels. These inaccurate high-scoring region proposals will mislead the training of subsequent refinement modules and thus hamper the detection performance. In this work, we explore how to ameliorate the quality of pseudo-labeling in MIDN. Formally, we devise Cyclic-Bootstrap Labeling (CBL), a novel weakly supervised object detection pipeline, which optimizes MIDN with rank information from a reliable teacher network. Specifically, we obtain this teacher network by introducing a weighted exponential moving average strategy to take advantage of various refinement modules. A novel class-specific ranking distillation algorithm is proposed to leverage the output of weighted ensembled teacher network for distilling MIDN with rank information. As a result, MIDN is guided to assign higher scores to accurate proposals, which further benefits final detection. Extensive experiments on the prevalent PASCAL VOC 2007 & 2012 and COCO datasets demonstrate the superior performance of our CBL framework.

Summary: Recent weakly supervised object detection combines multiple instance detection networks (MIDN) with ordinal online refinement, but with only image-level labels MIDN inevitably assigns high scores to unexpected region proposals when generating pseudo labels, misleading subsequent refinement modules and hurting detection. Cyclic-Bootstrap Labeling (CBL) improves MIDN's pseudo-labeling by optimizing it with rank information from a reliable teacher network, obtained via a weighted exponential moving average that exploits the various refinement modules. A class-specific ranking distillation algorithm distills the weighted-ensemble teacher's output into MIDN so it assigns higher scores to accurate proposals, benefiting final detection. Extensive experiments on PASCAL VOC 2007/2012 and COCO demonstrate the superior performance of CBL.


**The papers below are for the keyword "object"**


Paper1 Interactive Class-Agnostic Object Counting

Original abstract: We propose a novel framework for interactive class-agnostic object counting, where a human user can interactively provide feedback to improve the accuracy of a counter. Our framework consists of two main components: a user-friendly visualizer to gather feedback and an efficient mechanism to incorporate it. In each iteration, we produce a density map to show the current prediction result, and we segment it into non-overlapping regions with an easily verifiable number of objects. The user can provide feedback by selecting a region with obvious counting errors and specifying the range for the estimated number of objects within it. To improve the counting result, we develop a novel adaptation loss to force the visual counter to output the predicted count within the user-specified range. For effective and efficient adaptation, we propose a refinement module that can be used with any density-based visual counter, and only the parameters in the refinement module will be updated during adaptation. Our experiments on two challenging class-agnostic object counting benchmarks, FSCD-LVIS and FSC-147, show that our method can reduce the mean absolute error of multiple state-of-the-art visual counters by roughly 30% to 40% with minimal user input. Our project can be found at https://yifehuang97.github.io/ICACountProjectPage/.

Summary: This passage introduces a novel framework for interactive class-agnostic object counting, in which a human user can interactively provide feedback to improve the counter's accuracy. The framework consists of two main components: a user-friendly visualizer to gather feedback and an efficient mechanism to incorporate it. In each iteration, a density map is produced to show the current prediction, and it is segmented into non-overlapping regions each containing an easily verifiable number of objects. The user provides feedback by selecting a region with obvious counting errors and specifying a range for the estimated number of objects within it. To improve the counting result, the authors develop a novel adaptation loss that forces the visual counter to output a predicted count within the user-specified range. For effective and efficient adaptation, they propose a refinement module that can be used with any density-based visual counter; only the parameters of the refinement module are updated during adaptation. Experiments on two challenging class-agnostic counting benchmarks, FSCD-LVIS and FSC-147, show that the method reduces the mean absolute error of multiple state-of-the-art visual counters by roughly 30% to 40% with minimal user input. The project is available at https://yifehuang97.github.io/ICACountProjectPage/.
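The adaptation loss lends itself to a direct encoding: penalize the counter only when the count integrated over the selected region leaves the user-specified interval. Below is a minimal PyTorch sketch under that reading; the exact form of the paper's loss may differ.

```python
# A minimal PyTorch sketch of a range-based adaptation loss in the spirit
# of the paper: penalize the counter only when its predicted count for a
# user-selected region falls outside the user-specified [lo, hi] range.
import torch

def range_adaptation_loss(density: torch.Tensor,
                          region_mask: torch.Tensor,
                          lo: float, hi: float) -> torch.Tensor:
    """density: (H, W) predicted density map; region_mask: (H, W) binary mask
    of the user-selected region; [lo, hi]: user-specified count range."""
    count = (density * region_mask).sum()     # predicted count in the region
    below = torch.clamp(lo - count, min=0.0)  # too few objects
    above = torch.clamp(count - hi, min=0.0)  # too many objects
    return below + above                      # zero inside the range

# Example: the user says the selected region holds between 8 and 12 objects.
density = torch.rand(64, 64, requires_grad=True)
mask = torch.zeros(64, 64); mask[10:30, 10:30] = 1.0
loss = range_adaptation_loss(density, mask, lo=8, hi=12)
loss.backward()  # in the paper, only the refinement module is updated
```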

Paper2 Vox-E: Text-Guided Voxel Editing of 3D Objects

Original abstract: Large scale text-guided diffusion models have garnered significant attention due to their ability to synthesize diverse images that convey complex visual concepts. This generative power has more recently been leveraged to perform text-to-3D synthesis. In this work, we present a technique that harnesses the power of latent diffusion models for editing existing 3D objects. Our method takes oriented 2D images of a 3D object as input and learns a grid-based volumetric representation of it. To guide the volumetric representation to conform to a target text prompt, we follow unconditional text-to-3D methods and optimize a Score Distillation Sampling (SDS) loss. However, we observe that combining this diffusion-guided loss with an image-based regularization loss that encourages the representation not to deviate too strongly from the input object is challenging, as it requires achieving two conflicting goals while viewing only structure-and-appearance coupled 2D projections. Thus, we introduce a novel volumetric regularization loss that operates directly in 3D space, utilizing the explicit nature of our 3D representation to enforce correlation between the global structure of the original and edited object. Furthermore, we present a technique that optimizes cross-attention volumetric grids to refine the spatial extent of the edits. Extensive experiments and comparisons demonstrate the effectiveness of our approach in creating a myriad of edits which cannot be achieved by prior works. Our code and data will be made publicly available.

Summary: This passage explains that large-scale text-guided diffusion models have attracted attention for their ability to synthesize diverse images conveying complex visual concepts, and that this generative power has recently been leveraged for text-to-3D synthesis. The authors present a technique that harnesses latent diffusion models to edit existing 3D objects. The method takes oriented 2D images of a 3D object as input and learns a grid-based volumetric representation of it. To guide this representation to conform to a target text prompt, the authors follow unconditional text-to-3D methods and optimize a Score Distillation Sampling (SDS) loss. However, they observe that combining this diffusion-guided loss with an image-based regularization loss that keeps the representation from deviating too strongly from the input object is challenging, since it requires achieving two conflicting goals while observing only 2D projections in which structure and appearance are coupled. They therefore introduce a novel volumetric regularization loss that operates directly in 3D space, exploiting the explicit nature of the 3D representation to enforce correlation between the global structure of the original and the edited object. They further present a technique that optimizes cross-attention volumetric grids to refine the spatial extent of the edits. Extensive experiments and comparisons demonstrate the effectiveness of the approach in producing a wide variety of edits that prior works cannot achieve. The code and data will be made publicly available.
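As a rough illustration of regularizing directly in 3D rather than through 2D projections, the sketch below ties the edited density grid to the original with a simple volumetric L2 term. This simplification is an assumption made for clarity; Vox-E's actual loss enforces correlation between the global structures rather than a plain L2 distance.

```python
# A minimal PyTorch sketch of a volumetric regularizer applied directly in
# 3D, as the summary describes. The density-weighted L2 form is an
# illustrative assumption, not Vox-E's exact formulation.
import torch

def volumetric_reg(orig_density: torch.Tensor,
                   edit_density: torch.Tensor) -> torch.Tensor:
    """Both tensors are (D, H, W) volumetric density grids."""
    # Penalize structural deviation everywhere in the volume, so the global
    # layout of the object is preserved even where 2D projections conflict.
    return ((edit_density - orig_density) ** 2).mean()

orig = torch.rand(32, 32, 32)
edit = torch.rand(32, 32, 32, requires_grad=True)
total_loss = 0.1 * volumetric_reg(orig, edit)  # weighted against the SDS loss
total_loss.backward()
```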

Paper3 Zero-1-to-3: Zero-shot One Image to 3D Object

Original abstract: We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image. To perform novel view synthesis in this underconstrained setting, we capitalize on the geometric priors that large-scale diffusion models learn about natural images. Our conditional diffusion model uses a synthetic dataset to learn controls of the relative camera viewpoint, which allow new images to be generated of the same object under a specified camera transformation. Even though it is trained on a synthetic dataset, our model retains a strong zero-shot generalization ability to out-of-distribution datasets as well as in-the-wild images, including impressionist paintings. Our viewpoint-conditioned diffusion approach can further be used for the task of 3D reconstruction from a single image. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models by leveraging Internet-scale pre-training.

Summary: This passage introduces Zero-1-to-3, a framework for changing the camera viewpoint of an object given only a single RGB image. To perform novel view synthesis in this underconstrained setting, the authors capitalize on the geometric priors that large-scale diffusion models learn about natural images. Their conditional diffusion model uses a synthetic dataset to learn controls over the relative camera viewpoint, allowing new images of the same object to be generated under a specified camera transformation. Although it is trained on a synthetic dataset, the model retains strong zero-shot generalization to out-of-distribution datasets as well as in-the-wild images, including impressionist paintings. The viewpoint-conditioned diffusion approach can further be used for 3D reconstruction from a single image. Qualitative and quantitative experiments show that, by leveraging Internet-scale pre-training, the method significantly outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models.
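The conditioning signal is the relative camera transformation between the input and target views. A minimal NumPy sketch of computing it from two 4x4 extrinsics follows; the world-to-camera pose convention is an assumption made for illustration.

```python
# A minimal NumPy sketch of the relative camera transformation that
# conditions the viewpoint-conditioned diffusion model. The 4x4 pose
# parameterization and convention are illustrative assumptions.
import numpy as np

def relative_pose(pose_src: np.ndarray, pose_tgt: np.ndarray):
    """pose_*: 4x4 world-to-camera extrinsics. Returns (R_rel, t_rel) such
    that applying the relative transform after the source pose gives the
    target pose."""
    T_rel = pose_tgt @ np.linalg.inv(pose_src)
    return T_rel[:3, :3], T_rel[:3, 3]

# Example: rotate the camera 30 degrees around the object's vertical axis.
theta = np.deg2rad(30.0)
R_y = np.array([[np.cos(theta), 0, np.sin(theta)],
                [0, 1, 0],
                [-np.sin(theta), 0, np.cos(theta)]])
src = np.eye(4)
tgt = np.eye(4); tgt[:3, :3] = R_y
R_rel, t_rel = relative_pose(src, tgt)
# (R_rel, t_rel) is embedded and fed to the denoiser as the view condition.
```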

Paper4 Chop & Learn: Recognizing and Generating Object-State Compositions

Original abstract: Recognizing and generating object-state compositions has been a challenging task, especially when generalizing to unseen compositions. In this paper, we study the task of cutting objects in different styles and the resulting object state changes. We propose a new benchmark suite Chop & Learn, to accommodate the needs of learning objects and different cut styles using multiple viewpoints. We also propose a new task of Compositional Image Generation, which can transfer learned cut styles to different objects, by generating novel object-state images. Moreover, we also use the videos for Compositional Action Recognition, and show valuable uses of this dataset for multiple video tasks. Project website: https://chopnlearn.github.io.

Summary: Recognizing and generating object-state compositions is a challenging task, especially when generalizing to unseen compositions. This paper studies the task of cutting objects in different styles and the resulting object-state changes. The authors propose a new benchmark suite, Chop & Learn, to support learning objects and different cut styles from multiple viewpoints. They also propose a new task, Compositional Image Generation, which transfers learned cut styles to different objects by generating novel object-state images. In addition, they use the videos for Compositional Action Recognition and demonstrate valuable uses of the dataset for multiple video tasks. Project website: https://chopnlearn.github.io.

Paper5 MAAL: Multimodality-Aware Autoencoder-Based Affordance Learning for 3D Articulated Objects

Original abstract: Inferring affordance for 3D articulated objects is a challenging and practical problem. It is a primary problem for applying robots to real-world scenarios. The exploration can be summarized as figuring out where to act and how to act. Correspondingly, the task mainly requires producing actionability scores, action proposals, and success likelihood scores according to the given 3D object information and robotic information. Current works usually directly process multi-modal inputs with early fusion and apply critic networks to produce scores, which leads to insufficient multi-modal learning ability and inefficiently iterative training in multiple stages. This paper proposes a novel Multimodality-Aware Autoencoder-based affordance Learning (MAAL) for the 3D object affordance problem. It is an efficient pipeline, trained in one go, and only requires a few positive samples in training data. More importantly, MAAL contains a MultiModal Energized Encoder (MME) for better multi-modal learning. It comprehensively models all multi-modal inputs from 3D objects and robotic actions. Jointly considering information from multiple modalities, the encoder further learns interactions between robots and objects. MME empowers the better multi-modal learning ability for understanding object affordance. Experimental results and visualizations, based on a large-scale dataset PartNet-Mobility, show the effectiveness of MAAL in learning multi-modal data and solving the 3D articulated object affordance problem.

Summary: This passage discusses inferring affordances for 3D articulated objects, a challenging and practical problem that is central to deploying robots in real-world scenarios. The exploration can be summarized as figuring out where to act and how to act; accordingly, the task mainly requires producing actionability scores, action proposals, and success-likelihood scores from the given 3D object information and robotic information. Existing works usually process multimodal inputs directly with early fusion and apply critic networks to produce scores, which leads to insufficient multimodal learning ability and inefficient, iterative multi-stage training. This paper proposes Multimodality-Aware Autoencoder-based affordance Learning (MAAL) for the 3D object affordance problem. It is an efficient pipeline, trained in one go, and requires only a few positive samples in the training data. More importantly, MAAL contains a MultiModal Energized Encoder (MME) for better multimodal learning, which comprehensively models all multimodal inputs from 3D objects and robotic actions. By jointly considering information from multiple modalities, the encoder further learns the interactions between robots and objects, strengthening its understanding of object affordance. Experimental results and visualizations on the large-scale PartNet-Mobility dataset show the effectiveness of MAAL in learning multimodal data and solving the 3D articulated object affordance problem.
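To make "jointly considering information from multiple modalities" concrete, here is a minimal PyTorch sketch of a joint object-action encoder with a success head. The dimensions, the concatenation-based fusion, and the head design are illustrative assumptions, not the MME's actual architecture.

```python
# A minimal PyTorch sketch of a joint multimodal encoder in the spirit of
# the MME described above: object geometry features and robot action
# features are embedded and fused before scoring.
import torch
import torch.nn as nn

class JointAffordanceEncoder(nn.Module):
    def __init__(self, geom_dim=256, act_dim=6, hidden=128):
        super().__init__()
        self.geom_proj = nn.Linear(geom_dim, hidden)  # e.g., a PointNet feature
        self.act_proj = nn.Linear(act_dim, hidden)    # e.g., a gripper pose
        self.fuse = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.success_head = nn.Linear(hidden, 1)      # success likelihood

    def forward(self, geom_feat, action):
        z = torch.cat([self.geom_proj(geom_feat), self.act_proj(action)], -1)
        h = self.fuse(z)                              # joint object-action code
        return torch.sigmoid(self.success_head(h))

model = JointAffordanceEncoder()
score = model(torch.rand(4, 256), torch.rand(4, 6))  # (4, 1) success scores
```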

Paper6 Fan-Beam Binarization Difference Projection (FB-BDP): A Novel Local Object Descriptor for Fine-Grained Leaf Image Retrieval

Original abstract: Fine-grained leaf image retrieval (FGLIR) aims to search similar leaf images in subspecies level which involves very high interclass visual similarity and accordingly poses great challenges to leaf image description. In this study, we introduce a new concept, named fan-beam binarization difference projection (FB-BDP) to address this challenging issue. It is designed based on the theory of fan-beam projection (FBP) which is a mathematical tool originally used for computed tomographic reconstruction of objects and has the merits of capturing the inner structure information of objects in multiple directions and an excellent ability to suppress image noise. However, few studies have been made to apply FBP to the description of texture patterns. Rather than calculating ray integrals over the whole object area, FB-BDP restricts its ray integrals calculated over local patches to guarantee the locality of the extracted features. By binarizing the intensity-differences between the off-center and center rays, FB-BDP makes its ray integrals insensitive to illumination change and more discriminative in the characterization of texture patterns. In addition, due to inheriting the merits of FBP, the proposed FB-BDP is superior over the existing local image descriptors by its invariance to scaling transformation, robustness to noise, and strong ability to capture direction and structure texture patterns. The results of extensive experiments on FGLIR show its higher retrieval accuracy over the benchmark methods, promising generalization power and strong complementarity to deep features.

Summary: Fine-grained leaf image retrieval (FGLIR) aims to search for similar leaf images at the subspecies level, which involves very high inter-class visual similarity and accordingly poses great challenges for leaf image description. This study introduces a new concept, fan-beam binarization difference projection (FB-BDP), to address this challenging issue. It is designed on the theory of fan-beam projection (FBP), a mathematical tool originally used for computed tomographic reconstruction that has the merits of capturing the inner structural information of objects in multiple directions and an excellent ability to suppress image noise; however, few studies have applied FBP to describing texture patterns. Rather than calculating ray integrals over the whole object area, FB-BDP restricts its ray integrals to local patches, guaranteeing the locality of the extracted features. By binarizing the intensity differences between the off-center and center rays, FB-BDP makes its ray integrals insensitive to illumination change and more discriminative in characterizing texture patterns. In addition, by inheriting the merits of FBP, the proposed FB-BDP is superior to existing local image descriptors in its invariance to scaling transformations, robustness to noise, and strong ability to capture directional and structural texture patterns. Extensive experiments on FGLIR show higher retrieval accuracy than the benchmark methods, promising generalization power, and strong complementarity to deep features.
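To give a rough feel for the descriptor, the NumPy sketch below integrates intensities along a fan of rays inside a local patch and binarizes each off-center ray's integral against the central ray's. The fan geometry, sampling scheme, and parameters are simplifying assumptions; the actual FB-BDP construction is more elaborate.

```python
# A heavily simplified NumPy sketch of the FB-BDP idea: within a local
# patch, integrate intensities along a fan of rays from a source point,
# then binarize each off-center ray's integral against the center ray's.
import numpy as np

def fb_bdp_code(patch: np.ndarray, n_rays: int = 8, n_samples: int = 16):
    """patch: square 2D grayscale patch; returns an (n_rays - 1)-bit code."""
    h, w = patch.shape
    src = np.array([h - 1, (w - 1) / 2.0])               # fan source: bottom center
    angles = np.linspace(np.pi / 4, 3 * np.pi / 4, n_rays)  # fan of directions

    integrals = []
    for a in angles:
        direction = np.array([-np.sin(a), np.cos(a)])    # points into the patch
        ts = np.linspace(0, h - 1, n_samples)
        pts = src[None, :] + ts[:, None] * direction[None, :]
        rr = np.clip(np.round(pts[:, 0]).astype(int), 0, h - 1)
        cc = np.clip(np.round(pts[:, 1]).astype(int), 0, w - 1)
        integrals.append(patch[rr, cc].sum())            # nearest-neighbor ray sum

    center = integrals[len(integrals) // 2]              # the central ray
    bits = [int(v > center) for i, v in enumerate(integrals)
            if i != len(integrals) // 2]                 # binarized differences
    return np.array(bits, dtype=np.uint8)

code = fb_bdp_code(np.random.rand(17, 17))
print(code)  # a 7-bit local pattern; histogramming such codes over patches
             # would yield an image-level descriptor
```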

Paper7 Multi3DRefer: Grounding Text Description to Multiple 3D Objects

Original abstract: We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a strict setting is unnatural as localizing potentially multiple objects is a common need in real-world scenarios and robotic tasks (e.g., visual navigation and object rearrangement). To address this setting we propose Multi3DRefer, generalizing the ScanRefer dataset and task. Our dataset contains 61926 descriptions of 11609 objects, where zero, single or multiple target objects are referenced by each description. We also introduce a new evaluation metric and benchmark methods from prior work to enable further investigation of multi-modal 3D scene understanding. Furthermore, we develop a better baseline leveraging 2D features from CLIP by rendering object proposals online with contrastive learning, which outperforms the state of the art on the ScanRefer benchmark.

Summary: This passage introduces the task of localizing a flexible number of objects in real-world 3D scenes using natural-language descriptions. Existing 3D visual grounding tasks focus on localizing a single, unique object given a text description. However, such a strict setting is unnatural, as localizing potentially multiple objects is a common need in real-world scenarios and robotic tasks (e.g., visual navigation and object rearrangement). To address this setting, the authors propose Multi3DRefer, generalizing the ScanRefer dataset and task. The dataset contains 61,926 descriptions of 11,609 objects, where each description refers to zero, one, or multiple target objects. They also introduce a new evaluation metric and benchmark methods from prior work to enable further investigation of multimodal 3D scene understanding. Furthermore, they develop a stronger baseline that leverages 2D features from CLIP by rendering object proposals online with contrastive learning, which outperforms the state of the art on the ScanRefer benchmark.
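The contrastive matching step of the baseline can be sketched once the proposal and text embeddings are in hand. Below is a minimal PyTorch version that scores rendered-proposal embeddings against a description embedding and trains with a per-proposal binary objective, which naturally covers the zero/single/multiple-target setting; producing the CLIP embeddings and the online rendering are outside this sketch and assumed given.

```python
# A minimal PyTorch sketch of contrastive grounding over rendered object
# proposals. The per-proposal binary objective and temperature are
# illustrative assumptions, not the paper's exact training recipe.
import torch
import torch.nn.functional as F

def contrastive_grounding_loss(prop_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               target_mask: torch.Tensor,
                               tau: float = 0.07) -> torch.Tensor:
    """prop_emb: (N, D) rendered-proposal embeddings; text_emb: (D,)
    description embedding; target_mask: (N,) with 1 for each referred
    proposal (zero, one, or many, matching the flexible-count setting)."""
    p = F.normalize(prop_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = p @ t / tau                       # (N,) similarity scores
    # Per-proposal binary targets support zero/single/multiple references.
    return F.binary_cross_entropy_with_logits(logits, target_mask.float())

props = torch.randn(10, 512)                   # e.g., CLIP image features
text = torch.randn(512)                        # CLIP text feature
targets = torch.zeros(10); targets[[2, 7]] = 1 # two referred objects
loss = contrastive_grounding_loss(props, text, targets)
```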

Paper8 Random Boxes Are Open-world Object Detectors

Original abstract: We show that classifiers trained with random region proposals achieve state-of-the-art Open-world Object Detection (OWOD): they can not only maintain the accuracy of the known objects (w/ training labels), but also considerably improve the recall of unknown ones (w/o training labels). Specifically, we propose RandBox, a Fast R-CNN based architecture trained on random proposals at each training iteration, surpassing existing Faster R-CNN and Transformer based OWOD. Its effectiveness stems from the following two benefits introduced by randomness. First, as the randomization is independent of the distribution of the limited known objects, the random proposals become the instrumental variable that prevents the training from being confounded by the known objects. Second, the unbiased training encourages more proposal explorations by using our proposed matching score that does not penalize the random proposals whose prediction scores do not match the known objects. On two benchmarks: Pascal-VOC/MS-COCO and LVIS, RandBox significantly outperforms the previous state-of-the-art in all metrics. We also detail the ablations on randomization and loss designs. Codes and other details are in Appendix.

Summary: This work shows that classifiers trained with random region proposals achieve state-of-the-art open-world object detection (OWOD): they not only maintain accuracy on known objects (with training labels) but also considerably improve the recall of unknown ones (without training labels). Specifically, the authors propose RandBox, a Fast R-CNN-based architecture trained on random proposals at each training iteration, surpassing existing Faster R-CNN and Transformer-based OWOD methods. Its effectiveness stems from two benefits introduced by randomness. First, because the randomization is independent of the distribution of the limited known objects, the random proposals act as an instrumental variable that prevents training from being confounded by the known objects. Second, the unbiased training encourages more proposal exploration via the proposed matching score, which does not penalize random proposals whose prediction scores do not match the known objects. On two benchmarks, Pascal-VOC/MS-COCO and LVIS, RandBox significantly outperforms the previous state of the art on all metrics. Ablations on the randomization and loss designs are also detailed; code and other details are in the appendix.
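Two ingredients from the summary translate naturally into code: uniformly random proposals drawn anew each iteration, and a matching score that ignores, rather than penalizes, random proposals matching no known object. The PyTorch sketch below is an illustrative reading; the box parameterization and the loss form are assumptions.

```python
# A minimal PyTorch sketch of (1) random proposals drawn independently of
# the known-object distribution and (2) a matching loss that leaves
# unmatched random proposals unpenalized, so they remain free to explore
# (and later surface) unknown objects.
import torch

def random_proposals(n: int, img_w: int, img_h: int) -> torch.Tensor:
    """Sample n random (x1, y1, x2, y2) boxes uniformly inside the image."""
    xy = torch.rand(n, 2) * torch.tensor([img_w, img_h])   # box centers
    wh = torch.rand(n, 2) * torch.tensor([img_w, img_h])   # box sizes
    x1y1 = torch.clamp(xy - wh / 2, min=0)
    x2y2 = torch.min(xy + wh / 2, torch.tensor([img_w, img_h]).float())
    return torch.cat([x1y1, x2y2], dim=1)

def matching_loss(cls_scores: torch.Tensor, matched: torch.Tensor) -> torch.Tensor:
    """cls_scores: (N,) known-class confidence per proposal; matched: (N,)
    bool, True where a proposal was assigned to a known ground-truth box.
    Unmatched proposals contribute no penalty."""
    if matched.any():
        return (1.0 - cls_scores[matched]).mean()
    return cls_scores.sum() * 0.0   # keep the graph differentiable

boxes = random_proposals(100, img_w=640, img_h=480)   # fresh boxes per iteration
scores = torch.rand(100, requires_grad=True)
matched = torch.rand(100) > 0.9
loss = matching_loss(scores, matched)
loss.backward()
```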