Towards Open World Object Detection: Overview (Paper)

Paper: https://arxiv.org/abs/2103.02603
Code: https://github.com/JosephKJ/OWOD

Towards Open World Object Detection


Abstract

Humans have a natural instinct to identify unknown object instances in their environments. The intrinsic curiosity about these unknown instances aids in learning about them when the corresponding knowledge eventually becomes available. This motivates us to propose a novel computer vision problem called ‘Open World Object Detection’, where a model is tasked to: 1) identify objects that have not been introduced to it as ‘unknown’, without explicit supervision to do so, and 2) incrementally learn these identified unknown categories without forgetting previously learned classes, when the corresponding labels are progressively received. We formulate the problem, introduce a strong evaluation protocol and provide a novel solution, which we call ORE: Open World Object Detector, based on contrastive clustering and energy based unknown identification. Our experimental evaluation and ablation studies analyse the efficacy of ORE in achieving Open World objectives. As an interesting by-product, we find that identifying and characterising unknown instances helps to reduce confusion in an incremental object detection setting, where we achieve state-of-the-art performance with no extra methodological effort. We hope that our work will attract further research into this newly identified, yet crucial research direction.¹

1. Introduction

Deep learning has accelerated progress in Object Detection research ², ³, ⁴, ⁵, ⁶, where a model is tasked to identify and localise objects in an image. All existing approaches work under a strong assumption that all the classes that are to be detected would be available at the training phase. Two challenging scenarios arise when we relax this assumption: 1) A test image might contain objects from unknown classes, which should be classified as unknown. 2) As and when information (labels) about such identified unknowns becomes available, the model should be able to incrementally learn the new class. Research in developmental psychology ⁷, ⁸ finds that the ability to identify what one doesn’t know is key to sparking curiosity. Such curiosity fuels the desire to learn new things ⁹, ¹⁰. This motivates us to propose a new problem where a model should be able to identify instances of unknown objects as unknown and subsequently learn to recognise them when training data progressively arrives, in a unified way. We call this problem setting Open World Object Detection.
The number of classes that are annotated in standard vision datasets like Pascal VOC ¹¹ and MS-COCO ¹² is very low (20 and 80 respectively) when compared to the infinite number of classes that are present in the open world. Recognising an unknown as an unknown requires strong generalization. Scheirer et al. ¹³ formalise this as the Open Set classification problem. Since then, various methodologies (using 1-vs-rest SVMs and deep learning models) have been formulated to address this challenging setting. Bendale et al. ¹⁴ extend Open Set to an Open World classification setting by additionally updating the image classifier to recognise the identified new unknown classes. Interestingly, as seen in Fig. 1, Open World object detection is unexplored, owing to the difficulty of the problem setting.


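Fig. 1: Open World Object Detection (F) is a new problem that has not yet been formally defined or addressed. Although related to Open Set and Open World classification, Open World Object Detection poses its own unique challenges; addressing them would make object detectors more practical.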

The advances in Open Set and Open World image classification cannot be trivially adapted to Open Set and Open World object detection, because of a fundamental difference in the problem setting: The object detector is trained to detect unknown objects as background. Instances of many unknown classes would have been already introduced to the object detector along with known objects. As they are not labelled, these unknown instances would be explicitly learned as background, while training the detection model. Dhamija et al. ¹⁵ find that even with this extra training signal, the state-of-the-art object detectors result in false positive detections, where the unknown objects end up being classified as one of the known classes, often with very high probability. Miller et al. ¹⁶ propose to use dropout sampling to get an estimate of the uncertainty of the object detection prediction. This is the only peer-reviewed research work in the open set object detection literature. Our proposed Open World Object Detection goes a step further to incrementally learn the new classes, once they are detected as unknown and an oracle provides labels for the objects of interest among all the unknowns. To the best of our knowledge this has not been tried in the literature.
The Open World Object Detection setting is much more natural than the existing closed-world, static-learning setting. The world is diverse and dynamic in the number, type and configurations of novel classes. It would be naive to assume that all the classes to expect at inference are seen during training. Practical deployments of detection systems in robotics, self-driving cars, plant phenotyping, healthcare and surveillance cannot afford to have complete knowledge on what classes to expect at inference time, while being trained in-house. The most natural and realistic behavior that one can expect from an object detection algorithm deployed in such settings would be to confidently predict an unknown object as unknown, and known objects into the corresponding classes. As and when more information about the identified unknown classes becomes available, the system should be able to incorporate them into its existing knowledge base. This would define a smart object detection system, and ours is an effort towards achieving this goal. The key contributions of our work are:

We introduce a novel problem setting, Open World Object Detection, which models the real world more closely.
We develop a novel methodology, called ORE, based on contrastive clustering, an unknown-aware proposal network and energy based unknown identification to address the challenges of open world detection.
We introduce a comprehensive experimental setting, which helps to measure the open world characteristics of an object detector, and benchmark ORE on it against competitive baseline methods.
As an interesting by-product, the proposed methodology achieves state-of-the-art performance on Incremental Object Detection, even though it is not primarily designed for it.

2. Related Work

Open Set Classification: The open set setting considers knowledge acquired through training set to be incomplete, thus new unknown classes can be encountered during testing. Scheirer et al. ¹⁷ developed open set classifiers in a one-vs-rest setting to balance the performance and the risk of labeling a sample far from the known training examples (termed as open space risk). Follow up works ¹⁸, ¹⁹ extended the open set framework to multi-class classifier setting with probabilistic models to account for the fading away classifier confidences in case of unknown classes.
Bendale and Boult ²⁰ identified unknowns in the feature space of deep networks and used a Weibull distribution to estimate the set risk (called the OpenMax classifier). A generative version of OpenMax was proposed in ²¹ by synthesizing novel class images. Liu et al. ²² considered a long-tailed recognition setting where majority, minority and unknown classes coexist. They developed a metric learning framework to identify unseen classes as unknown. In a similar spirit, several dedicated approaches target the detection of out-of-distribution samples ²³ or novelties ²⁴. Recently, self-supervised learning ²⁵ and unsupervised learning with reconstruction ²⁶ have been explored for open set recognition. However, while these works can recognize unknown instances, they cannot dynamically update themselves in an incremental fashion over multiple training episodes. Further, our energy based unknown detection approach has not been explored before.
Open World Classification: ¹⁴ first proposed the open world setting for image recognition. Instead of a static classifier trained on a fixed set of classes, they proposed a more flexible setting where knowns and unknowns both coexist. The model can recognize both types of objects and adaptively improve itself when new labels for unknown are provided. Their approach extends Nearest Class Mean classifier to operate in an open world setting by re-calibrating the class probabilities to balance open space risk. ²⁷ studies open world face identity learning while ²⁸ proposed to use an exemplar set of seen classes to match them against a new sample, and rejects it in case of a low match with all previously known classes. However, they don’t test on image classification benchmarks and study product classification in e-commerce applications.
Open Set Detection: Dhamija et al. ¹⁵ formally studied the impact of the open set setting on popular object detectors. They noticed that state-of-the-art object detectors often classify unknown classes as seen classes with high confidence. This is despite the fact that the detectors are explicitly trained with a background class ²⁹, ², ³⁰ and/or apply one-vs-rest classifiers to model each class ³¹, ⁵. A dedicated body of work ¹⁶, ³², ³³ focuses on developing measures of (spatial and semantic) uncertainty in object detectors to reject unknown classes. For example, ¹⁶, ³² use Monte Carlo Dropout ³⁴ sampling in an SSD detector to obtain uncertainty estimates. These methods, however, cannot incrementally adapt their knowledge in a dynamic world.

3. Open World Object Detection

Let us formalise the definition of Open World Object Detection in this section. At any time $t$, we consider the set of known object classes as $\mathcal{K}^t = \{1, 2, \ldots, C\} \subset \mathcal{N}^+$, where $\mathcal{N}^+$ denotes the set of positive integers. In order to realistically model the dynamics of the real world, we also assume that there exists a set of unknown classes $\mathcal{U} = \{C + 1, \ldots\}$, which may be encountered during inference. The known object classes $\mathcal{K}^t$ are assumed to be labeled in the dataset $\mathcal{D}^t = \{X^t, Y^t\}$, where $X$ and $Y$ denote the input images and labels respectively. The input image set comprises $M$ training images, $X^t = \{I_1, \ldots, I_M\}$, and the associated object labels for each image form the label set $Y^t = \{Y_1, \ldots, Y_M\}$. Each $Y_i = \{y_1, y_2, \ldots, y_K\}$ encodes a set of $K$ object instances with their class labels and locations, i.e., $y_k = [l_k, x_k, y_k, w_k, h_k]$, where $l_k \in \mathcal{K}^t$ and $x_k, y_k, w_k, h_k$ denote the bounding box center coordinates, width and height respectively.
The Open World Object Detection setting considers an object detection model $\mathcal{M}_C$ that is trained to detect all the previously encountered $C$ object classes. Importantly, the model $\mathcal{M}_C$ is able to identify a test instance belonging to any of the known $C$ classes, and can also recognize a new or unseen class instance by classifying it as an unknown, denoted by a label zero (0). The unknown set of instances $\mathcal{U}^t$ can then be forwarded to a human user who can identify $n$ new classes of interest (among a potentially large number of unknowns) and provide their training examples. The learner incrementally adds $n$ new classes and updates itself to produce an updated model $\mathcal{M}_{C+n}$ without retraining from scratch on the whole dataset. The known class set is also updated: $\mathcal{K}^{t+1} = \mathcal{K}^t \cup \{C + 1, \ldots, C + n\}$. This cycle continues over the life of the object detector, where it adaptively updates itself with new knowledge. The problem setting is illustrated in the top row of Fig. 2.

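Fig. 2: Approach overview.
Top row: At each incremental learning step, the model identifies unknown objects (denoted by ‘?’), which are progressively labelled (blue circles) and added to the existing knowledge base (green circles).
Bottom row: Our open world object detection model identifies potential unknown objects using an energy based classification head and an unknown-aware RPN. Further, we perform contrastive learning in the feature space to form discriminative clusters, and can flexibly add new classes in a continual manner without forgetting the previously learned classes.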

4. ORE: Open World Object Detector

A successful approach for Open World Object Detection should be able to identify unknown instances without explicit supervision and resist forgetting earlier instances when labels of these identified novel instances are presented to the model for a knowledge upgrade (without retraining from scratch). We propose a solution, ORE, which addresses both these challenges in a unified manner.
Neural networks are universal function approximators ³⁵, which learn a mapping between an input and the output through a series of hidden layers. The latent representation learned in these hidden layers directly controls how each function is realised. We hypothesise that learning clear discrimination between classes in the latent space of object detectors could have a two-fold effect. First, it helps the model to identify how the feature representation of an unknown instance is different from the other known instances, which helps identify an unknown instance as a novelty. Second, it facilitates learning feature representations for the new class instances without overlapping with the previous classes in the latent space, which helps towards incrementally learning without forgetting. The key component that helps us realise this is our proposed contrastive clustering in the latent space, which we elaborate in Sec. 4.1.
To optimally cluster the unknowns using contrastive clustering, we need to have supervision on what an unknown instance is. It is infeasible to manually annotate even a small subset of the potentially infinite set of unknown classes. To counter this, we propose an auto-labelling mechanism based on the Region Proposal Network ³ to pseudo-label unknown instances, as explained in Sec. 4.2. The inherent separation of auto-labelled unknown instances in the latent space helps our energy based classification head to differentiate between the known and unknown instances. As elucidated in Sec. 4.3, we find that Helmholtz free energy is higher for unknown instances.
Fig. 2 shows the high-level architectural overview of ORE. We choose Faster R-CNN ³ as the base detector, as Dhamija et al. ¹⁵ have found that it has better open set performance when compared against the one-stage RetinaNet detector ⁵ and the objectness based YOLO detector ⁶. Faster R-CNN ³ is a two-stage object detector. In the first stage, a class-agnostic Region Proposal Network (RPN) proposes potential regions which might contain an object, from the feature maps coming from a shared backbone network. The second stage classifies each proposed region and adjusts its bounding box coordinates. The features that are generated by the residual block in the Region of Interest (RoI) head are contrastively clustered. The RPN and the classification head are adapted to auto-label and identify unknowns respectively. We explain each of these constituent components in the following subsections:

4.1. Contrastive Clustering

Class separation in the latent space would be an ideal characteristic for an Open World methodology to identify unknowns. A natural way to enforce this would be to model it as a contrastive clustering problem, where instances of same class would be forced to remain close-by, while instances of dissimilar class would be pushed far apart.
For each known class $i \in \mathcal{K}^t$, we maintain a prototype vector $p_i$. Let $f_c \in \mathcal{R}^d$ be a feature vector that is generated by an intermediate layer of the object detector, for an object of class $c$. We define the contrastive loss as follows:
$\mathcal{L}_{cont}(f_c) = \sum_{i=0}^{C} \ell(f_c, p_i), \quad \text{where} \tag{1}$
$\ell(f_c, p_i) = \begin{cases} \mathcal{D}(f_c, p_i) & i = c \\ \max\{0, \Delta - \mathcal{D}(f_c, p_i)\} & \text{otherwise} \end{cases}$
where $\mathcal{D}$ is any distance function and $\Delta$ is the margin that defines how close a dissimilar item is allowed to be. Minimizing this loss would ensure the desired class separation in the latent space.
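As a minimal sketch of Eq. 1, assuming Euclidean distance for $\mathcal{D}$ (the text allows any distance function) and plain Python lists for features, the per-feature loss could be computed as follows. The function names and the toy prototypes are illustrative and not taken from the ORE code base.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(f_c, c, prototypes, delta, dist=euclidean):
    """Eq. 1: pull f_c towards its own prototype and push it at least
    `delta` away from every other prototype (hinge on the distance)."""
    loss = 0.0
    for i, p_i in enumerate(prototypes):
        d = dist(f_c, p_i)
        loss += d if i == c else max(0.0, delta - d)
    return loss


# Toy prototypes; index 0 plays the role of the unknown class.
prototypes = [[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]]
loss = contrastive_loss([3.0, 4.0], c=1, prototypes=prototypes, delta=6.0)
```

In this toy case the feature sits exactly on its own prototype, so the only contribution comes from prototype 0, which lies at distance 5 and therefore violates the margin of 6 by 1.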
Mean of feature vectors corresponding to each class is used to create the set of class prototypes: $\mathcal{P} = \{p_0 · · · p_C\}$ . Maintaining each prototype vector is a crucial component of ORE. As the whole network is trained end-to-end, the class prototypes should also gradually evolve, as the constituent features change gradually (as stochastic gradient descent updates weights by a small step in each iteration). We maintain a fixed-length queue $q_i$ , per class for storing the corresponding features. A feature store $\mathcal{F}_{store} = \{q_0 · · · q_C\}$ , stores the class specific features in the corresponding queues. This is a scalable approach for keeping track of how the feature vectors evolve with training, as the number of feature vectors that are stored is bounded by $C \times Q$ , where $Q$ is the maximum size of the queue.
Algorithm 1 provides an overview of how class prototypes are managed while computing the clustering loss. We start computing the loss only after a certain number of burn-in iterations ($I_b$) are completed. This allows the initial feature embeddings to mature enough to encode class information. From then on, we compute the clustering loss using Eqn. 1. After every $I_p$ iterations, a set of new class prototypes $\mathcal{P}_{new}$ is computed (line 8). Then the existing prototypes $\mathcal{P}$ are updated by weighing $\mathcal{P}$ and $\mathcal{P}_{new}$ with a momentum parameter $\eta$. This allows the class prototypes to evolve gradually while keeping track of previous context. The computed clustering loss is added to the standard detection loss and back-propagated to learn the network end-to-end.
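The prototype bookkeeping described above can be sketched as follows. This is a reconstruction from the prose rather than the authors' Algorithm 1: the class name `PrototypeStore`, the method names, and the exact blend $\eta \mathcal{P} + (1 - \eta) \mathcal{P}_{new}$ are assumptions made for illustration.

```python
from collections import deque


class PrototypeStore:
    """Per-class bounded feature queues plus momentum-updated prototypes."""

    def __init__(self, num_known_classes, queue_size, momentum):
        # One fixed-length queue per class; index 0 is reserved for "unknown".
        self.queues = [deque(maxlen=queue_size) for _ in range(num_known_classes + 1)]
        self.momentum = momentum
        self.prototypes = None

    def add(self, feature, cls):
        self.queues[cls].append(list(feature))

    def _class_means(self):
        means = []
        for q in self.queues:
            if q:
                dim = len(q[0])
                means.append([sum(f[k] for f in q) / len(q) for k in range(dim)])
            else:
                means.append(None)  # no features stored for this class yet
        return means

    def step(self, iteration, burn_in, update_every):
        """Return current prototypes, or None while still in burn-in."""
        if iteration < burn_in:
            return None
        if self.prototypes is None:
            self.prototypes = self._class_means()
        elif iteration % update_every == 0:
            eta = self.momentum
            blended = []
            for old, new in zip(self.prototypes, self._class_means()):
                if old is None or new is None:
                    blended.append(new if old is None else old)
                else:
                    blended.append([eta * o + (1 - eta) * n for o, n in zip(old, new)])
            self.prototypes = blended
        return self.prototypes


store = PrototypeStore(num_known_classes=1, queue_size=2, momentum=0.5)
before_burn_in = store.step(iteration=3, burn_in=10, update_every=5)
store.add([0.0, 0.0], cls=1)
store.add([2.0, 2.0], cls=1)
first = store.step(iteration=10, burn_in=10, update_every=5)
store.add([4.0, 4.0], cls=1)  # evicts [0.0, 0.0] because the queue holds 2 items
second = store.step(iteration=15, burn_in=10, update_every=5)
```

Using `deque(maxlen=...)` keeps memory bounded by the number of classes times the queue size, which matches the $C \times Q$ bound stated earlier in this section.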

4.2. Auto-labelling Unknowns with RPN

While computing the clustering loss with Eqn. 1, we contrast the input feature vector $f_c$ against prototype vectors, which include a prototype for unknown objects too ($c \in \{0, 1, \ldots, C\}$, where 0 refers to the unknown class). This would require unknown object instances to be labelled with an unknown ground truth class, which is not practically feasible owing to the arduous task of re-annotating all instances of each image in already annotated large-scale datasets.
As a surrogate, we propose to automatically label some of the objects in the image as potential unknown objects. For this, we rely on the fact that the Region Proposal Network (RPN) is class agnostic. Given an input image, the RPN generates a set of bounding box predictions for foreground and background instances, along with the corresponding objectness scores. We label those proposals that have a high objectness score, but do not overlap with a ground-truth object, as potential unknown objects. Simply put, we select the top-k background region proposals, sorted by their objectness scores, as unknown objects. This seemingly simple heuristic achieves good performance, as demonstrated in Sec. 5.
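The top-k heuristic can be sketched in a few lines. The sketch assumes boxes in corner format `(x1, y1, x2, y2)` and treats a proposal as background when its IoU with every ground-truth box is below a threshold; both the box format and the `iou_thresh` value are assumptions made for illustration, since the text only says the proposals "do not overlap" with ground truth.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def auto_label_unknowns(proposals, scores, gt_boxes, k, iou_thresh=0.5):
    """Pick the k highest-objectness proposals that match no ground-truth box."""
    background = [
        (score, box)
        for box, score in zip(proposals, scores)
        if all(iou(box, gt) < iou_thresh for gt in gt_boxes)
    ]
    background.sort(key=lambda pair: pair[0], reverse=True)
    return [box for _, box in background[:k]]


proposals = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50), (1, 1, 10, 10)]
scores = [0.90, 0.80, 0.30, 0.95]
gt_boxes = [(0, 0, 10, 10)]
unknowns = auto_label_unknowns(proposals, scores, gt_boxes, k=1)
```

Here the two highest-scoring proposals overlap the labelled object and are discarded, so the pseudo-labelled unknown is the next best-scoring background box.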

4.3. Energy Based Unknown Identifier

Given the features ($f \in F$) in the latent space $F$ and their corresponding labels $l \in L$, we seek to learn an energy function $E(F, L)$. Our formulation is based on Energy based models (EBMs) ³⁶ that learn a function $E(\cdot)$ to estimate the compatibility between observed variables $F$ and a possible set of output variables $L$ using a single output scalar, i.e., $E(f): \mathcal{R}^d \to \mathcal{R}$. The intrinsic capability of EBMs to assign low energy values to in-distribution data and vice-versa motivates us to use an energy measure to characterize whether a sample is from an unknown class.
Specifically, we use the Helmholtz free energy formulation where energies for all values in $L$ are combined,
$F(f) = -T \log \int_{l'} \exp\left(-\frac{E(f, l')}{T}\right), \tag{2}$
where $T$ is the temperature parameter. There exists a simple relation between the network outputs after the softmax layer and the Gibbs distribution of class specific energy values ³⁷. This can be formulated as,
$p(l \mid f) = \frac{\exp\left(\frac{g_l(f)}{T}\right)}{\sum_{i=1}^{C} \exp\left(\frac{g_i(f)}{T}\right)} = \frac{\exp\left(-\frac{E(f, l)}{T}\right)}{\exp\left(-\frac{E(f)}{T}\right)}, \tag{3}$
where $p(l \mid f)$ is the probability density for a label $l$, $g_l(f)$ is the $l^{th}$ classification logit of the classification head $g(\cdot)$, and $E(f)$ in the denominator is the free energy of Eqn. 2. Using this correspondence, we define the free energy of our classification models in terms of their logits as follows:
$F(f; g) = -T \log \sum_{i=1}^{C} \exp\left(\frac{g_i(f)}{T}\right). \tag{4}$
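To make the logit-based free energy concrete, here is a small sketch assuming it takes the standard form $-T \log \sum_i \exp(g_i(f)/T)$ and using the usual max-subtraction trick for numerical stability. The function name and the example logits are illustrative only.

```python
import math


def free_energy(logits, T=1.0):
    """Free energy of a logit vector: -T * log(sum_i exp(g_i / T)).
    The maximum is subtracted before exponentiating to avoid overflow."""
    scaled = [g / T for g in logits]
    m = max(scaled)
    return -T * (m + math.log(sum(math.exp(s - m) for s in scaled)))


confident = free_energy([10.0, 0.0, 0.0])  # one class clearly dominates
flat = free_energy([0.0, 0.0, 0.0])        # no class stands out
```

A logit vector with one dominant class yields a lower (more negative) energy than a flat one, which is in line with the observation in this section that known instances tend to have lower energy than unknown ones.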
The above equation provides us a natural way to transform the classification head of the standard Faster R-CNN ³ to an energy function. Due to the clear separation that we enforce in the latent space with the contrastive clustering, we see a clear separation in the energy level of the known class data-points and unknown data-points, as illustrated in Fig. 3. In light of this trend, we model the energy distribution of the known and unknown energy values, $\xi_{kn}(f)$ and $\xi_{unk}(f)$, with a set of shifted Weibull distributions. These distributions were found to fit the energy data of a small held-out validation set (with both known and unknown instances) very well, when compared to Gamma, Exponential and Normal distributions. The learned distributions can be used to label a prediction as unknown if $\xi_{kn}(f) < \xi_{unk}(f)$.
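The decision rule can be illustrated with a hand-written shifted Weibull density. The shape, scale and shift values below are made up for the example; in practice they would be fitted to energies from the held-out validation set, and the helper names are hypothetical.

```python
import math


def shifted_weibull_pdf(x, shape, scale, loc):
    """Density of a Weibull distribution shifted to start at `loc`."""
    z = (x - loc) / scale
    if z <= 0:
        return 0.0
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))


# Hypothetical fitted parameters: known energies sit lower than unknown ones.
KNOWN = dict(shape=2.0, scale=3.0, loc=-12.0)
UNKNOWN = dict(shape=2.0, scale=3.0, loc=-6.0)


def is_unknown(energy):
    """Label as unknown when the known density is below the unknown density."""
    xi_kn = shifted_weibull_pdf(energy, **KNOWN)
    xi_unk = shifted_weibull_pdf(energy, **UNKNOWN)
    return xi_kn < xi_unk


low_energy_is_unknown = is_unknown(-10.0)
high_energy_is_unknown = is_unknown(-3.0)
```

With these toy parameters, an energy of -10 falls where only the known distribution has mass, while an energy of -3 is far more likely under the unknown distribution.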

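Fig. 3: As shown above, the energy values of known and unknown data-points are clearly separated. We fit a Weibull distribution to each of them and use these distributions to identify unseen known and unknown samples, as explained in Sec. 4.3.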

4.4. Alleviating Forgetting

After the identification of unknowns, an important requisite for an open world detector is to be able to learn new classes, when the labeled examples of some of the unknown classes of interest are provided. Importantly, the training data for the previous tasks will not be present at this stage since retraining from scratch is not a feasible solution. Training with only the new class instances will lead to catastrophic forgetting ³⁸, ³⁹ of the previous classes. We note that a number of involved approaches have been developed to alleviate such forgetting, including methods based on parameter regularization ⁴⁰, ⁴¹, ⁴², ⁴³, exemplar replay ⁴⁴, ⁴⁵, ⁴⁶, ⁴⁷, dynamically expanding networks ⁴⁸, ⁴⁹, ⁵⁰ and meta-learning ⁵¹, ⁵².
We build on the recent insights from ⁵³, ⁵⁴, ⁵⁵, which compare the importance of example replay against other more complex solutions. Specifically, Prabhu et al. ⁵³ review the progress made by complex continual learning methodologies and show that a greedy exemplar selection strategy for replay in incremental learning consistently outperforms the state-of-the-art methods by a large margin. Knoblauch et al. ⁵⁴ develop a theoretical justification for the unexpectedly strong performance of replay methods. They prove that an optimal continual learner solves an NP-hard problem and requires infinite memory. Storing a few examples and replaying them has also been found effective in the related few-shot object detection setting by Wang et al. ⁵⁵. These findings motivate us to use a relatively simple methodology for ORE to mitigate forgetting, i.e., we store a balanced set of exemplars and finetune the model on them after each incremental step. At each point, we ensure that a minimum of $N_{ex}$ instances for each class are present in the exemplar set.
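One simple way to build such an exemplar set is a greedy pass over the images, keeping an image whenever it contains a class that is still short of $N_{ex}$ instances. This is only a sketch under that assumption; the text does not spell out the selection procedure, and the function name and toy data are hypothetical.

```python
from collections import Counter


def select_exemplars(images, n_ex):
    """Greedy exemplar selection.

    images: dict mapping image id -> list of class labels of its objects.
    Returns the chosen image ids and the per-class instance counts.
    """
    counts = Counter()
    chosen = []
    for image_id in sorted(images):
        labels = images[image_id]
        # Keep the image if any of its classes still needs more instances.
        if any(counts[c] < n_ex for c in labels):
            chosen.append(image_id)
            counts.update(labels)
    return chosen, counts


images = {"a": ["cat"], "b": ["cat", "dog"], "c": ["cat"], "d": ["dog"]}
chosen, counts = select_exemplars(images, n_ex=2)
```

The guarantee of at least $N_{ex}$ instances per class holds only when the data contains that many instances of the class; a class with fewer instances simply contributes everything it has.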

5. Experiments and Results

We propose a comprehensive evaluation protocol to study the performance of an open world detector to identify unknowns, detect known classes and progressively learn new classes when labels are provided for some unknowns.

5.1. Open World Evaluation Protocol

Data split: We group classes into a set of tasks $\mathcal{T} = \{T_1, \cdots, T_t, \cdots\}$. All the classes of a specific task will be introduced to the system at a point of time $t$. While learning $T_t$, all the classes of $\{T_\tau : \tau \leq t\}$ will be treated as known and $\{T_\tau : \tau > t\}$ would be treated as unknown. For a concrete instantiation of this protocol, we consider classes from Pascal VOC ¹¹ and MS-COCO ¹². We group all VOC classes and data as the first task $T_1$. The remaining 60 classes of MS-COCO ¹² are grouped into three successive tasks with semantic drifts (see Tab. 1). All images which correspond to the above split from the Pascal VOC and MS-COCO train-sets form the training data. For evaluation, we use the Pascal VOC test split and MS-COCO val split. 1k images from the training data of each task are kept aside for validation. Data splits and code can be found at https://github.com/JosephKJ/OWOD.
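The known/unknown partition at a given task can be sketched as below, assuming that tasks up to and including $T_t$ count as known and all later tasks as unknown. The task contents are placeholders, not the actual class groupings of the benchmark.

```python
def split_known_unknown(tasks, t):
    """Split classes at task index t (1-based): tasks T_1..T_t are known,
    tasks after T_t are unknown."""
    known = [cls for task in tasks[:t] for cls in task]
    unknown = [cls for task in tasks[t:] for cls in task]
    return known, unknown


# Placeholder class names, for illustration only.
tasks = [["a1", "a2"], ["b1"], ["c1", "c2"], ["d1"]]
known, unknown = split_known_unknown(tasks, t=2)
```

At the last task every class is known and the unknown list is empty, which is why the open-set metrics are only meaningful for the earlier tasks.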

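Table 1: Task composition in the proposed Open World evaluation protocol. The semantics of each task and the number of images and instances (objects) in each data split are shown.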

Evaluation metrics: Since an unknown object is easily confused with a known object, we use the Wilderness Impact (WI) metric ¹⁵ to explicitly characterise this behaviour.
$$\text{Wilderness Impact (WI)} = \frac{P_{\mathcal{K}}}{P_{\mathcal{K} \cup \mathcal{U}}} - 1, \tag{5}$$
where $P_{\mathcal{K}}$ refers to the precision of the model when evaluated on known classes and $P_{\mathcal{K} \cup \mathcal{U}}$ is the precision when evaluated on known and unknown classes, measured at a recall level $R$ ($0.8$ in all experiments). Ideally, WI should be low, as the precision must not drop when unknown objects are added to the test set. Besides WI, we also use Absolute Open-Set Error (A-OSE) ¹⁶ to report the number of unknown objects that get wrongly classified as one of the known classes. Both WI and A-OSE implicitly measure how effective the model is in handling unknown objects.
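As a rough illustration of the two metrics, the sketch below computes WI from detection counts and A-OSE from matched (ground truth, prediction) label pairs. The function names, the count-based inputs and the string label `"unknown"` are assumptions made for this example; the counts are assumed to have been collected at the operating point with recall $0.8$, and `tp` is assumed to be positive. Writing both precisions with the same true-positive count shows that WI reduces to the number of false positives caused by unknown objects divided by the number of detections on the closed set.

```python
from typing import Iterable, Set, Tuple


def wilderness_impact(tp: int, fp_known: int, fp_unknown: int) -> float:
    """WI = P_K / P_{K ∪ U} - 1, with counts taken at the chosen recall level."""
    p_known = tp / (tp + fp_known)              # precision on known classes only
    p_open = tp / (tp + fp_known + fp_unknown)  # precision once unknowns are added
    return p_known / p_open - 1.0


def absolute_open_set_error(matches: Iterable[Tuple[str, str]], known: Set[str]) -> int:
    """Count unknown ground-truth objects that were predicted as a known class."""
    return sum(1 for gt, pred in matches if gt == "unknown" and pred in known)


# Toy numbers, not taken from the paper.
wi = wilderness_impact(tp=80, fp_known=20, fp_unknown=10)
a_ose = absolute_open_set_error(
    [("unknown", "dog"), ("unknown", "unknown"), ("cat", "cat"), ("unknown", "cat")],
    known={"cat", "dog"},
)
```

With these toy counts, 10 extra false positives on top of 100 closed-set detections give a WI of about 0.1, and two of the three unknown objects are absorbed into known classes.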
To quantify the incremental learning capability of the model in the presence of newly labelled classes, we measure the mean Average Precision (mAP) at an IoU threshold of $0.5$ (consistent with the existing literature ⁵⁶, ⁵⁷).
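For readers less familiar with the IoU criterion behind mAP at $0.5$, here is a minimal sketch of the overlap test. The `(x1, y1, x2, y2)` corner format and the helper names are assumptions for illustration, and whether the comparison at exactly $0.5$ is inclusive depends on the evaluation toolkit.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """A prediction can count as a true positive only if it overlaps enough."""
    return iou(pred_box, gt_box) >= threshold
```

Two unit-offset 2x2 boxes overlap in a 1x1 square, so their IoU is 1/7 and the prediction would not be matched at this threshold.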

5.2. Implementation Details

ORE re-purposes the standard Faster R-CNN ³ object detector with a ResNet-50 ⁵⁸ backbone. To handle a variable number of classes in the classification head, following incremental classification methods ⁵¹, ⁵², ⁴⁴, ⁴⁶, we assume a bound on the maximum number of classes to expect, and modify the loss to take into account only the classes of interest. This is done by setting the classification logits of the unseen classes to a large negative value $v$, which makes their contribution to the softmax negligible ($e^{v} \to 0$).
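The logit-masking trick can be illustrated with a small, framework-free sketch. This is not the code from the OWOD repository; the function name `masked_softmax`, the boolean mask and the value `-1e4` are assumptions chosen for the example.

```python
import math
from typing import List, Sequence


def masked_softmax(logits: Sequence[float], seen: Sequence[bool], v: float = -1e4) -> List[float]:
    """Softmax over a fixed-size head where unseen classes are suppressed."""
    # Overwrite the logits of classes that have not been introduced yet.
    masked = [z if is_seen else v for z, is_seen in zip(logits, seen)]
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(masked)
    exps = [math.exp(z - peak) for z in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Head sized for 4 classes, but only the first two have been introduced.
probs = masked_softmax([2.0, 1.0, 0.5, 3.0], [True, True, False, False])
```

Because the masked logits underflow to zero after exponentiation, the probability mass is shared only among the classes seen so far, even though the last raw logit was the largest.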
The 2048-dim feature vector from the last residual block in the RoI Head is used for contrastive clustering. The contrastive loss (defined in Eqn. 1) is added to the standard Faster R-CNN classification and localization losses, and all three are jointly optimised. While learning a task $T_i$, only the classes that are part of $T_i$ are labelled. While testing $T_i$, all the classes that were previously introduced are labelled along with the classes in $T_i$, and all classes of future tasks are labelled ‘unknown’. For the exemplar replay, we empirically choose $N_{ex} = 50$. We do a sensitivity analysis on the size of the exemplar memory in Sec. 6. Further implementation details are provided in the supplementary material.
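The test-time labelling rule described above amounts to a simple remapping of ground-truth labels. The helper below is a hypothetical illustration with made-up class names, assuming annotations are plain strings.

```python
from typing import Iterable, List, Set


def relabel_for_evaluation(gt_labels: Iterable[str], known: Set[str]) -> List[str]:
    """Keep labels of classes introduced so far; map all others to 'unknown'."""
    return [label if label in known else "unknown" for label in gt_labels]


# Classes introduced up to the current task (toy example).
introduced = {"person", "horse"}
eval_labels = relabel_for_evaluation(["person", "zebra", "horse", "giraffe"], introduced)
```

Here `zebra` and `giraffe` belong to future tasks in the toy setup, so both are evaluated under the single ‘unknown’ label.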

5.3. Open World Object Detection Results

Table 2 shows how ORE compares against Faster R-CNN on the proposed open world evaluation protocol. An ‘Oracle’ detector has access to all known and unknown labels at any point, and serves as a reference. After learning each task, the WI and A-OSE metrics are used to quantify how often unknown instances are confused with any of the known classes. We see that ORE has significantly lower WI and A-OSE scores, owing to its explicit modelling of the unknown. When unknown classes are progressively labelled in Task 2, the performance of the baseline detector on the previously known set of classes (quantified via mAP) deteriorates significantly, from $56.16\%$ to $4.076\%$. The proposed balanced finetuning restores the previous class performance to a respectable level ($51.09\%$) at the cost of increased WI and A-OSE, whereas ORE achieves both goals: it detects known classes and comprehensively reduces the effect of unknowns. A similar trend is seen when Task 3 classes are added. WI and A-OSE scores cannot be measured for Task 4 because no unknown ground-truths remain. We report qualitative results in Fig. 4 and the supplementary section, along with a failure case analysis. We conduct extensive sensitivity analysis in Sec. 6 and the supplementary section.

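Table 2: How ORE performs on Open World Object Detection. Wilderness Impact (WI) and Absolute Open-Set Error (A-OSE) quantify how ORE handles the unknown classes (gray background), whereas mean Average Precision (mAP) measures how well it detects the known classes (white background). ORE consistently outperforms the Faster R-CNN based baseline on all the metrics. Refer to Sec. 5.3 for a more detailed analysis and to Sec. 5.1 for the definition of the evaluation metrics.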

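Table 3: We compare ORE against state-of-the-art incremental object detectors in three different settings. 10, 5 and the last class from the Pascal VOC 2007 ¹¹ dataset are introduced to a detector trained on 10, 15 and 19 classes respectively (shown with a blue background). ORE performs favourably in all the settings with no methodological change. Refer to Sec. 5.4 for more details.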

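Figure 4: Predictions from ORE after being trained on Task 1. ‘elephant’, ‘apple’, ‘banana’, ‘zebra’ and ‘giraffe’ have not been introduced to the model, and hence are successfully classified as ‘unknown’. The approach misclassifies one of the ‘giraffe’ instances as a ‘horse’, which shows a limitation of ORE.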

5.4. Incremental Object Detection Results

We find an interesting consequence of the ability of ORE to distinctly model unknown objects: it performs favourably on the incremental object detection (iOD) task against the state-of-the-art (Tab. 3). This is because ORE reduces the confusion of an unknown object being classified as a known object, which lets the detector incrementally learn the true foreground objects. We use the standard protocol ⁵⁶, ⁵⁷ of the iOD domain to evaluate ORE, where groups of classes (10, 5 and the last class) from Pascal VOC 2007 ¹¹ are incrementally learned by a detector trained on the remaining set of classes. Remarkably, ORE is used as it is, without any change to the methodology introduced in Sec. 4. Ablating contrastive clustering (CC) and energy based unknown identification (EBUI) reduces performance relative to standard ORE.

6. Discussions and Analysis

6.1. Ablating ORE Components

To study the contribution of each of the components in ORE, we design careful ablation experiments (Tab. 4). We consider the setting where Task 1 is introduced to the model. The auto-labelling methodology (referred to as ALU) and energy based unknown identification (EBUI) perform better together (row 5) than either of them used separately (rows 3 and 4). Adding contrastive clustering (CC) to this configuration gives the best performance in handling unknowns (row 7), measured in terms of WI and A-OSE. There is no severe performance drop in known class detection (mAP metric) as a side effect of unknown identification. In row 6, we see that EBUI is a critical component whose absence increases the WI and A-OSE scores. Thus, each component in ORE has a critical role to play in unknown identification.

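Table 4: We carefully ablate each of the constituent components of ORE. CC, ALU and EBUI refer to ‘Contrastive Clustering’, ‘Auto-labelling of Unknowns’ and ‘Energy Based Unknown Identifier’ respectively. Refer to Sec. 6.1 for more details.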

6.2. Sensitivity Analysis on Exemplar Memory Size

Our balanced finetuning strategy requires storing exemplar images with at least $N_{ex}$ instances per class. We vary $N_{ex}$ while learning Task 2 and report the results in Table 5. We find that balanced finetuning is very effective in improving the accuracy of previously known classes, even with as few as $10$ instances per class. However, increasing $N_{ex}$ to large values does not help, and at the same time adversely affects how unknowns are handled (evident from the WI and A-OSE scores). Hence, by validation, we set $N_{ex}$ to $50$ in all our experiments, which is a sweet spot that balances performance on known and unknown classes.

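Table 5: Sensitivity analysis on the exemplar memory size. Increasing $N_{ex}$ by a large amount hurts performance on unknowns, while a small set of images is essential to mitigate forgetting (best row in green).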
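One simple way to build an exemplar memory that holds at least $N_{ex}$ instances per class is a greedy pass over the annotated images. The text does not specify how the exemplars are picked, so the selection rule, the function name `select_exemplars` and the toy annotations below are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def select_exemplars(image_labels: Dict[str, List[str]], n_ex: int) -> Tuple[List[str], Counter]:
    """Greedily keep images until every class has at least n_ex instances."""
    all_classes = {c for labels in image_labels.values() for c in labels}
    counts: Counter = Counter()
    chosen: List[str] = []
    for image_id, labels in image_labels.items():
        # Keep the image only if it helps a class that is still under quota.
        if any(counts[c] < n_ex for c in labels):
            chosen.append(image_id)
            counts.update(labels)
        # Stop early once every class has reached its quota.
        if all(counts[c] >= n_ex for c in all_classes):
            break
    return chosen, counts


# Toy annotations: image id -> one label per object instance.
toy = {"a": ["cat", "dog"], "b": ["cat"], "c": ["dog", "dog"], "d": ["cat"]}
chosen, counts = select_exemplars(toy, n_ex=2)
```

Because images can contain several objects, some classes end up with more than $N_{ex}$ instances, which matches the "at least" wording of the requirement.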

6.3. Comparison with an Open Set Detector

Comparing the mAP of a detector on closed set data (trained and tested on Pascal VOC 2007) with its mAP on open set data (where the test set additionally contains an equal number of unknown images from MS-COCO) helps to measure how the detector handles unknown instances. Ideally, there should be no drop in performance. We compare ORE against the recent open set detector proposed by Miller et al. ¹⁶. We find from Tab. 6 that the drop in performance of ORE is much lower than that of ¹⁶, owing to the effective modelling of the unknown instances.

Table 6: Performance comparison with an open set object detector. ORE considerably reduces the fall in mAP values.

6.4. Clustering Loss and t-SNE ⁵⁹ Visualization

We visualise the quality of the clusters that are formed while training with the contrastive clustering loss (Eqn. 1) for Task 1. We see nicely formed clusters in Fig. 5 (a). The numbers $0$ to $19$ in the legend correspond to the $20$ classes introduced in Task 1, and label $20$ denotes the unknown class. Importantly, we see that the unknown instances also get clustered, which reinforces the quality of the auto-labelled unknowns used in contrastive clustering. In Fig. 5 (b), we plot the contrastive clustering loss against training iterations, where we see a gradual decrease, indicative of good convergence.

Figure 5: (a) Distinct clusters in the latent space. (b) Our contrastive loss, which ensures that such a clustering steadily converges.

7. Conclusion

The vibrant object detection community has pushed the performance benchmarks on standard datasets by a large margin. The closed-set nature of these datasets and evaluation protocols hampers further progress. We introduce Open World Object Detection, where the object detector is able to label an unknown object as unknown and gradually learn the unknown as the model gets exposed to new labels. Our key novelties include an energy-based classifier for unknown detection and a contrastive clustering approach for open world learning. We hope that our work will kindle further research along this important and open direction.

Acknowledgements

We thank TCS for supporting KJJ through its PhD fellowship; MBZUAI for a start-up grant; VR starting grant (2016-05543) and DST, Govt of India, for partly supporting this work through IMPRINT program (IMP/2019/000250). We thank our anonymous reviewers for their valuable feedback.

References

Manoj Acharya, Tyler L. Hayes, and Christopher Kanan. Rodeo: Replay for online object detection. In The British Machine Vision Conference, 2020.
Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
John A Meacham. Wisdom and the context of knowledge: Knowing that one doesn’t know. On the development of developmental psychology, 8:111–134, 1983.
Mario Livio. Why?: What makes us curious. Simon and Schuster, 2017.
Susan Engel. Children’s need to know: Curiosity in schools. Harvard educational review, 81(4):625–645, 2011.
Brian Grazer and Charles Fishman. A curious mind: The secret to a bigger life. Simon and Schuster, 2016.
Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. IJCV, 88(2):303–338, 2010.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
Walter J Scheirer, Anderson de Rezende Rocha, Archana Sapkota, and Terrance E Boult. Toward open set recognition. IEEE transactions on pattern analysis and machine intelligence, 35(7):1757–1772, 2012.
Abhijit Bendale and Terrance Boult. Towards open world recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1893–1902, 2015.
Akshay Dhamija, Manuel Gunther, Jonathan Ventura, and Terrance Boult. The overlooked elephant of object detection: Open set. In The IEEE Winter Conference on Applications of Computer Vision, pages 1021–1030, 2020.
Dimity Miller, Lachlan Nicholson, Feras Dayoub, and Niko Sünderhauf. Dropout sampling for robust object detection in open-set conditions. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1–7. IEEE, 2018.
Walter J Scheirer, Anderson de Rezende Rocha, Archana Sapkota, and Terrance E Boult. Toward open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1757–1772, 2013.
Lalit P Jain, Walter J Scheirer, and Terrance E Boult. Multiclass open set recognition using probability of inclusion. In European Conference on Computer Vision, pages 393–409. Springer, 2014.
Walter J Scheirer, Lalit P Jain, and Terrance E Boult. Probability models for open set recognition. IEEE transactions on pattern analysis and machine intelligence, 36(11):2317–2324, 2014.
Abhijit Bendale and Terrance E Boult. Towards open set deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1563–1572, 2016.
Zongyuan Ge, Sergey Demyanov, Zetao Chen, and Rahil Garnavi. Generative openmax for multi-class open set classification. In British Machine Vision Conference 2017. British Machine Vision Association and Society for Pattern Recognition, 2017.
Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2537–2546, 2019.
Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations, 2018.
Stanislav Pidhorskyi, Ranya Almohsen, and Gianfranco Doretto. Generative probabilistic novelty detection with adversarial autoencoders. In Advances in neural information processing systems, pages 6822–6833, 2018.
Pramuditha Perera, Vlad I. Morariu, Rajiv Jain, Varun Manjunatha, Curtis Wigington, Vicente Ordonez, and Vishal M. Patel. Generative-discriminative feature representations for open-set recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
Ryota Yoshihashi, Wen Shao, Rei Kawakami, Shaodi You, Makoto Iida, and Takeshi Naemura. Classification-reconstruction learning for open-set recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
Federico Pernici, Federico Bartoli, Matteo Bruni, and Alberto Del Bimbo. Memory based online learning of deep representations from video streams. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2324–2334, 2018.
Hu Xu, Bing Liu, Lei Shu, and P Yu. Open-world learning and application to product classification. In The World Wide Web Conference, pages 3413–3419, 2019.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence, 39(6):1137–1149, 2016.
Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
Dimity Miller, Feras Dayoub, Michael Milford, and Niko Sünderhauf. Evaluating merging strategies for sampling-based uncertainty techniques in object detection. In 2019 International Conference on Robotics and Automation (ICRA), pages 2348–2354. IEEE, 2019.
David Hall, Feras Dayoub, John Skinner, Haoyang Zhang, Dimity Miller, Peter Corke, Gustavo Carneiro, Anelia Angelova, and Niko Sünderhauf. Probabilistic object detection: Definition and evaluation. In The IEEE Winter Conference on Applications of Computer Vision, pages 1031–1040, 2020.
Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059, 2016.
Kurt Hornik, Maxwell Stinchcombe, Halbert White, et al. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006.
Weitang Liu, Xiaoyun Wang, John Owens, and Sharon Yixuan Li. Energy-based out-of-distribution detection. Advances in Neural Information Processing Systems, 33, 2020.
Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In Proceedings of the European Conference on Computer Vision (ECCV), pages 139–154, 2018.
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2018.
Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3987–3995. JMLR.org, 2017.
Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. In ICLR, 2019.
Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5533–5542. IEEE, 2017.
David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, pages 6467–6476, 2017.
Francisco M Castro, Manuel J Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-end incremental learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 233–248, 2018.
Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7765–7773, 2018.
Joan Serrà, Dídac Surís, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. arXiv preprint arXiv:1801.01423, 2018.
Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
Jathushan Rajasegaran, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Mubarak Shah. itaml: An incremental task-agnostic meta-learning approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13588–13597, 2020.
Joseph KJ and Vineeth Nallure Balasubramanian. Meta-consolidation for continual learning. Advances in Neural Information Processing Systems, 33, 2020.
Ameya Prabhu, Philip HS Torr, and Puneet K Dokania. Gdumb: A simple approach that questions our progress in continual learning. In European Conference on Computer Vision, pages 524–540. Springer, 2020.
Jeremias Knoblauch, Hisham Husain, and Tom Diethe. Optimal continual learning has perfect memory and is np-hard. arXiv preprint arXiv:2006.05188, 2020.
Xin Wang, Thomas E Huang, Trevor Darrell, Joseph E Gonzalez, and Fisher Yu. Frustratingly simple few-shot object detection. arXiv preprint arXiv:2003.06957, 2020.
Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. Incremental learning of object detectors without catastrophic forgetting. In Proceedings of the IEEE International Conference on Computer Vision, pages 3400–3409, 2017.
Can Peng, Kun Zhao, and Brian C. Lovell. Faster ilod: Incremental learning for object detectors based on faster rcnn. Pattern Recognition Letters, 140:109–115, 2020.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
