YOLOv10: Real-Time End-to-End Object Detection目标检测论文精读(逐段解析)
论文地址:https://arxiv.org/abs/2405.14458
NeurIPS 2024
清华发布
Figure 1: Comparisons with others in terms of latency-accuracy (left) and size-accuracy (right) trade-offs. We measure the end-to-end latency using the official pre-trained models.
【翻译】图1:与其他方法在延迟-准确性(左)和大小-准确性(右)权衡方面的比较。我们使用官方预训练模型测量端到端延迟。
Abstract
Over the past years, YOLOs have emerged as the predominant paradigm in the field of real-time object detection owing to their effective balance between computational cost and detection performance. Researchers have explored the architectural designs, optimization objectives, data augmentation strategies, and others for YOLOs, achieving notable progress. However, the reliance on the non-maximum suppression (NMS) for post-processing hampers the end-to-end deployment of YOLOs and adversely impacts the inference latency. Besides, the design of various components in YOLOs lacks the comprehensive and thorough inspection, resulting in noticeable computational redundancy and limiting the model’s capability. It renders the suboptimal efficiency, along with considerable potential for performance improvements. In this work, we aim to further advance the performance-efficiency boundary of YOLOs from both the post-processing and the model architecture. To this end, we first present the consistent dual assignments for NMS-free training of YOLOs, which brings the competitive performance and low inference latency simultaneously. Moreover, we introduce the holistic efficiency-accuracy driven model design strategy for YOLOs. We comprehensively optimize various components of YOLOs from both the efficiency and accuracy perspectives, which greatly reduces the computational overhead and enhances the capability. The outcome of our effort is a new generation of YOLO series for real-time end-to-end object detection, dubbed YOLOv10. Extensive experiments show that YOLOv10 achieves the state-of-the-art performance and efficiency across various model scales. For example, our YOLOv10-S is 1.8× faster than RT-DETR-R18 under the similar AP on COCO, meanwhile enjoying 2.8× smaller number of parameters and FLOPs. Compared with YOLOv9-C, YOLOv10-B has 46% less latency and 25% fewer parameters for the same performance. Code and models are available at https://github.com/THU-MIG/yolov10.
【翻译】在过去几年中,YOLO已经成为实时目标检测领域的主导范式,因为它们在计算成本和检测性能之间实现了有效的平衡。研究人员探索了YOLO的架构设计、优化目标、数据增强策略等方面,取得了显著进展。然而,对非极大值抑制(NMS)后处理的依赖阻碍了YOLO的端到端部署,并对推理延迟产生不利影响。此外,YOLO中各种组件的设计缺乏全面和彻底的检查,导致明显的计算冗余并限制了模型的能力。这导致了次优的效率,以及性能改进的巨大潜力。在这项工作中,我们旨在从后处理和模型架构两个方面进一步推进YOLO的性能-效率边界。为此,我们首先提出了用于YOLO无NMS训练的一致双分配策略,同时带来了竞争性能和低推理延迟。此外,我们引入了针对YOLO的整体效率-准确性驱动的模型设计策略。我们从效率和准确性两个角度全面优化YOLO的各种组件,大大减少了计算开销并增强了能力。我们努力的成果是用于实时端到端目标检测的新一代YOLO系列,称为YOLOv10。大量实验表明,YOLOv10在各种模型规模上都实现了最先进的性能和效率。例如,我们的YOLOv10-S在COCO上相似AP下比RT-DETR-R18快1.8倍,同时参数数量和FLOPs减少2.8倍。与YOLOv9-C相比,YOLOv10-B在相同性能下延迟减少46%,参数减少25%。代码和模型可在https://github.com/THU-MIG/yolov10获得。
【解析】作者指出传统YOLO存在两个主要问题:一是依赖NMS后处理步骤,这会增加推理时间并阻碍端到端部署;二是模型架构设计不够精细,存在计算冗余。针对这些问题,YOLOv10提出了两个核心创新:一致双分配策略用于消除NMS依赖,以及整体效率-准确性驱动的模型设计策略用于优化架构。这些改进使得YOLOv10在保持高精度的同时显著提升了推理速度,实现了真正的端到端实时检测。
1 Introduction
Real-time object detection has always been a focal point of research in the area of computer vision, which aims to accurately predict the categories and positions of objects in an image under low latency. It is widely adopted in various practical applications, including autonomous driving [ 3 ], robot navigation [ 12 ], and object tracking [ 72 ], etc . In recent years, researchers have concentrated on devising CNN-based object detectors to achieve real-time detection [ 19 , 23 , 48 , 49 , 50 , 57 , 13 ]. Among them, YOLOs have gained increasing popularity due to their adept balance between performance and efficiency [ 2 , 20 , 29 , 20 , 21 , 65 , 60 , 70 , 8 , 71 , 17 , 29 ]. The detection pipeline of YOLOs consists of two parts: the model forward process and the NMS post-processing. However, both of them still have deficiencies, resulting in suboptimal accuracy-latency boundaries.
【翻译】实时目标检测一直是计算机视觉领域研究的焦点,其目标是在低延迟条件下准确预测图像中物体的类别和位置。它被广泛应用于各种实际应用中,包括自动驾驶、机器人导航和目标跟踪等。近年来,研究人员专注于设计基于CNN的目标检测器来实现实时检测。其中,YOLO因其在性能和效率之间的巧妙平衡而越来越受欢迎。YOLO的检测流水线由两部分组成:模型前向过程和NMS后处理。然而,这两部分仍然存在缺陷,导致准确性-延迟边界不够理想。
【解析】指出传统YOLO存在两个主要瓶颈:一是模型本身的前向推理过程可能存在计算冗余,二是必须依赖NMS后处理步骤来去除重复检测框,这个步骤会额外增加推理时间并影响端到端部署的效率。
Specifically, YOLOs usually employ one-to-many label assignment strategy during training, whereby one ground-truth object corresponds to multiple positive samples. Despite yielding superior performance, this approach necessitates NMS to select the best positive prediction during inference. This slows down the inference speed and renders the performance sensitive to the hyperparameters of NMS, thereby preventing YOLOs from achieving optimal end-to-end deployment [ 78 ]. One line to tackle this issue is to adopt the recently introduced end-to-end DETR architectures [ 4 , 81 , 73 , 30 , 36 , 42 , 67 ]. For example, RT-DETR [ 78 ] presents an efficient hybrid encoder and uncertainty-minimal query selection, propelling DETRs into the realm of real-time applications. Nevertheless, when considering only the forward process of model during deployment, the efficiency of the DETRs still has room for improvements compared with YOLOs. Another line is to explore end-to-end detection for CNNbased detectors, which typically leverages one-to-one assignment strategies to suppress the redundant predictions [ 6 , 55 , 66 , 80 , 17 ]. However, they usually introduce additional inference overhead or achieve suboptimal performance for YOLOs.
【翻译】具体来说,YOLO在训练过程中通常采用一对多标签分配策略,即一个真实目标对应多个正样本。尽管这种方法能产生优异的性能,但它需要在推理过程中使用NMS来选择最佳的正预测。这会降低推理速度,并使性能对NMS的超参数敏感,从而阻碍YOLO实现最优的端到端部署。解决这个问题的一个方向是采用最近引入的端到端DETR架构。例如,RT-DETR提出了高效的混合编码器和不确定性最小查询选择,将DETR推向实时应用领域。然而,当仅考虑部署期间模型的前向过程时,DETR的效率与YOLO相比仍有改进空间。另一个方向是探索基于CNN检测器的端到端检测,通常利用一对一分配策略来抑制冗余预测。然而,它们通常会引入额外的推理开销或在YOLO上实现次优性能。
【解析】一对多标签分配策略是指在训练时,对于每个真实物体,会有多个预测框被标记为正样本,这样可以提供更丰富的监督信号,有助于模型学习。但问题在于推理时会产生多个预测同一个物体的情况,必须通过NMS来筛选出最佳预测框。NMS的工作原理是根据置信度和重叠度来去除重复检测,但这个过程不仅增加了计算时间,而且其效果很大程度上依赖于阈值等超参数的设置,这使得模型的性能变得不稳定。为了解决这个问题,研究者们尝试了两种主要方案:一是借鉴DETR的端到端设计思路,但DETR虽然避免了NMS,其Transformer架构在推理速度上相比YOLO仍有劣势;二是在CNN架构上直接采用一对一分配,但这往往会牺牲训练时的监督信号丰富度,导致性能下降。
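【补充示例】为了直观理解NMS为何会拖慢推理并对超参数敏感,下面给出一个极简的贪心NMS实现草图(纯NumPy示意,并非论文或任何框架的官方实现;框格式与阈值均为示例假设):

```python
import numpy as np

def iou(box, boxes):
    """计算一个框与一组框的IoU,框格式为(x1, y1, x2, y2)。"""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """贪心NMS:保留当前得分最高的框,剔除与其重叠过大的框,循环往复。"""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep

# 两个高度重叠的预测指向同一物体,另一个预测指向不同物体。
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))        # -> [0, 2]:重复的框1被抑制
print(nms(boxes, scores, 0.95))  # 阈值放宽后三个框全部保留
```

可以看到,IoU阈值从0.5放宽到0.95后,重复框不再被抑制,这正是正文所说"性能对NMS超参数敏感"的含义;同时这种逐框串行的循环也难以并行化,构成推理延迟的额外来源。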
Furthermore, the model architecture design remains a fundamental challenge for YOLOs, which exhibits an important impact on the accuracy and speed [ 50 , 17 , 71 , 8 ]. To achieve more efficient and effective model architectures, researchers have explored different design strategies. Various primary computational units are presented for the backbone to enhance the feature extraction ability, including DarkNet [ 48 , 49 , 50 ], CSPNet [ 2 ], EfficientRep [ 29 ] and ELAN [ 62 , 64 ], etc . For the neck, PAN [ 37 ], BiC [ 29 ], GD [ 60 ] and RepGFPN [ 71 ], etc ., are explored to enhance the multi-scale feature fusion. Besides, model scaling strategies [ 62 , 61 ] and re-parameterization [ 11 , 29 ] techniques are also investigated. While these efforts have achieved notable advancements, a comprehensive inspection for various components in YOLOs from both the efficiency and accuracy perspectives is still lacking. As a result, there still exists considerable computational redundancy within YOLOs, leading to inefficient parameter utilization and suboptimal efficiency. Besides, the resulting constrained model capability also leads to inferior performance, leaving ample room for accuracy improvements.
【翻译】此外,模型架构设计仍然是YOLO面临的基本挑战,它对准确性和速度都有重要影响。为了实现更高效和有效的模型架构,研究人员探索了不同的设计策略。为主干网络提出了各种主要计算单元来增强特征提取能力,包括DarkNet、CSPNet、EfficientRep和ELAN等。对于颈部网络,探索了PAN、BiC、GD和RepGFPN等来增强多尺度特征融合。此外,还研究了模型缩放策略和重参数化技术。虽然这些努力取得了显著进展,但仍然缺乏从效率和准确性两个角度对YOLO中各种组件的全面检查。因此,YOLO内部仍然存在相当大的计算冗余,导致参数利用效率低下和次优效率。此外,由此产生的受限模型能力也导致性能较差,为准确性改进留下了充足的空间。
【解析】虽然研究者们在各个组件上都做了大量改进工作,比如在主干网络中引入了更先进的特征提取模块,在颈部网络中设计了更好的多尺度特征融合机制,但这些改进往往是局部的、零散的,缺乏整体性的设计思考。这就像在优化一台机器时,虽然每个零件都在单独改进,但没有从整体系统的角度来考虑各部件之间的协调配合。结果就是模型中存在很多不必要的计算开销,参数没有得到充分有效的利用,同时模型的表达能力也受到了限制。这种情况下,模型既没有达到最优的推理速度,也没有发挥出应有的检测精度,存在很大的优化空间。
In this work, we aim to address these issues and further advance the accuracy-speed boundaries of YOLOs. We target both the post-processing and the model architecture throughout the detection pipeline. To this end, we first tackle the problem of redundant predictions in the post-processing by presenting a consistent dual assignments strategy for NMS-free YOLOs with the dual label assignments and consistent matching metric. It allows the model to enjoy rich and harmonious supervision during training while eliminating the need for NMS during inference, leading to competitive performance with high efficiency. Secondly, we propose the holistic efficiency-accuracy driven model design strategy for the model architecture by performing the comprehensive inspection for various components in YOLOs. For efficiency, we propose the lightweight classification head, spatial-channel decoupled downsampling, and rank-guided block design, to reduce the manifested computational redundancy and achieve more efficient architecture. For accuracy, we explore the large-kernel convolution and present the effective partial self-attention module to enhance the model capability, harnessing the potential for performance improvements.
【翻译】在这项工作中,我们旨在解决这些问题并进一步推进YOLO的准确性-速度边界。我们针对整个检测流水线中的后处理和模型架构。为此,我们首先通过提出一致双分配策略来解决后处理中的冗余预测问题,该策略采用双标签分配和一致匹配度量实现无NMS的YOLO。它允许模型在训练期间享受丰富和谐的监督,同时在推理期间消除对NMS的需求,从而以高效率实现竞争性能。其次,我们通过对YOLO中各种组件进行全面检查,为模型架构提出了整体效率-准确性驱动的模型设计策略。在效率方面,我们提出了轻量级分类头、空间-通道解耦下采样和秩引导块设计,以减少明显的计算冗余并实现更高效的架构。在准确性方面,我们探索了大核卷积并提出了有效的部分自注意力模块来增强模型能力,利用性能改进的潜力。
【解析】这段话是YOLOv10的核心创新总结。作者从两个维度来解决传统YOLO的问题:后处理优化和架构设计优化。在后处理方面,传统YOLO需要用NMS来去除重复检测框,这会增加推理时间。YOLOv10提出了"一致双分配策略",简单来说就是在训练时用两套检测头:一个用传统的一对多分配(一个真实物体对应多个预测框,提供丰富监督信号),另一个用一对一分配(一个真实物体只对应一个预测框,避免重复预测)。关键在于让这两个头的学习目标保持一致,这样在推理时就可以只用一对一的头,直接输出结果而不需要NMS。在架构设计方面,作者不是零散地改进某个组件,而是系统性地分析整个网络的每个部分,从效率和精度两个角度同时优化。效率优化包括简化分类头结构、改进下采样方式、根据不同阶段的特点设计不同的基础块;精度优化则引入大核卷积和部分自注意力机制来增强模型的表达能力。
Based on these approaches, we succeed in achieving a new family of real-time end-to-end detectors with different model scales, i.e., YOLOv10-N/S/M/B/L/X. Extensive experiments on standard benchmarks for object detection, i.e., COCO [35], demonstrate that our YOLOv10 can significantly outperform previous state-of-the-art models in terms of computation-accuracy trade-offs across model scales. As shown in Fig. 1, our YOLOv10-S/X are 1.8×/1.3× faster than RT-DETR-R18/R101, respectively, under the similar performance. Compared with YOLOv9-C, YOLOv10-B achieves a 46% reduction in latency with the same performance. Moreover, YOLOv10 exhibits highly efficient parameter utilization. Our YOLOv10-L/X outperforms YOLOv8-L/X by 0.3 AP and 0.5 AP, with 1.8× and 2.3× smaller number of parameters, respectively. YOLOv10-M achieves the similar AP compared with YOLOv9-M/YOLO-MS, with 23%/31% fewer parameters, respectively. We hope that our work can inspire further studies and advancements in the field.
【翻译】基于这些方法,我们成功实现了具有不同模型规模的新一代实时端到端检测器,即YOLOv10-N/S/M/B/L/X。在目标检测标准基准(即COCO)上的大量实验表明,我们的YOLOv10在各种模型规模的计算-准确性权衡方面都能显著优于以前的最先进模型。如图1所示,在相似性能下,我们的YOLOv10-S/X分别比RT-DETR-R18/R101快1.8×/1.3×。与YOLOv9-C相比,YOLOv10-B在相同性能下实现了46%的延迟减少。此外,YOLOv10表现出高效的参数利用率。我们的YOLOv10-L/X分别以1.8×和2.3×更少的参数数量,比YOLOv8-L/X高出0.3 AP和0.5 AP。YOLOv10-M与YOLOv9-M/YOLO-MS相比实现了相似的AP,但参数分别减少了23%/31%。我们希望我们的工作能够激发该领域的进一步研究和进展。
【解析】YOLOv10的实验成果证明了前面提到的创新确实有效。作者提供了从小到大六个不同规模的模型版本,满足不同应用场景的需求。实验结果从三个维度证明了YOLOv10的优势:速度提升、参数效率和精度改进。
2 Related Work
Real-time object detectors. Real-time object detection aims to classify and locate objects under low latency, which is crucial for real-world applications. Over the past years, substantial efforts have been directed towards developing efficient detectors [ 19 , 57 , 48 , 34 , 79 , 75 , 32 , 31 , 41 ]. Particularly, the YOLO series [ 48 , 49 , 50 , 2 , 20 , 29 , 62 , 21 , 65 ] stand out as the mainstream ones. YOLOv1, YOLOv2, and YOLOv3 identify the typical detection architecture consisting of three parts, i.e ., backbone, neck, and head [48, 49, 50]. YOLOv4 [2] and YOLOv5 [20] introduce the CSPNet [63] design to replace DarkNet [ 47 ], coupled with data augmentation strategies, enhanced PAN, and a greater variety of model scales, etc . YOLOv6 [ 29 ] presents BiC and SimCSPSPPF for neck and backbone, respectively, with anchor-aided training and self-distillation strategy. YOLOv7 [ 62 ] introduces E-ELAN for rich gradient flow path and explores several trainable bag-of-freebies methods. YOLOv8 [ 21 ] presents C2f building block for effective feature extraction and fusion. Gold-YOLO [ 60 ] provides the advanced GD mechanism to boost the multi-scale feature fusion capability. YOLOv9 [ 65 ] proposes GELAN to improve the architecture and PGI to augment the training process.
【翻译】实时目标检测器。实时目标检测旨在低延迟条件下对物体进行分类和定位,这对现实世界的应用至关重要。在过去几年中,大量努力致力于开发高效的检测器。特别是,YOLO系列作为主流检测器脱颖而出。YOLOv1、YOLOv2和YOLOv3确定了由三个部分组成的典型检测架构,即主干网络、颈部网络和检测头。YOLOv4和YOLOv5引入CSPNet设计来替代DarkNet,结合数据增强策略、增强的PAN和更多样的模型规模等。YOLOv6为颈部和主干网络分别提出了BiC和SimCSPSPPF,采用锚框辅助训练和自蒸馏策略。YOLOv7引入E-ELAN以实现丰富的梯度流路径,并探索了几种可训练的免费技巧方法。YOLOv8提出C2f构建块用于有效的特征提取和融合。Gold-YOLO提供先进的GD机制来增强多尺度特征融合能力。YOLOv9提出GELAN来改进架构,提出PGI来增强训练过程。
【解析】这段话回顾了YOLO系列的发展历程,从最初的YOLOv1到YOLOv9,每一代都在前代基础上进行了关键性改进。早期的YOLOv1-v3主要确立了目标检测的基本架构框架,将整个检测系统分为三个核心组件:主干网络负责从输入图像中提取基础特征,颈部网络负责融合不同尺度的特征信息,检测头负责最终的分类和位置预测。随着技术发展,YOLOv4和v5开始关注网络架构的优化,用CSPNet替代了原有的DarkNet主干,这种替换不仅提升了特征提取能力,还改善了梯度流动。同时,这一阶段开始重视数据增强和模型规模的多样化,为不同应用场景提供了更多选择。YOLOv6进一步细化了各个组件的设计,BiC和SimCSPSPPF分别针对颈部和主干进行了专门优化,锚框辅助训练和自蒸馏策略的引入则提升了训练效果。YOLOv7的创新在于E-ELAN结构,它通过设计更丰富的梯度传播路径来缓解深度网络训练中的梯度消失问题,同时探索了多种训练技巧的组合使用。YOLOv8的C2f模块在特征提取和融合方面实现了新的突破,而Gold-YOLO和YOLOv9则分别在多尺度特征融合和整体架构设计上做出了进一步改进。
End-to-end object detectors. End-to-end object detection has emerged as a paradigm shift from traditional pipelines, offering streamlined architectures [53]. DETR [4] introduces the transformer architecture and adopts Hungarian loss to achieve one-to-one matching prediction, thereby eliminating hand-crafted components and post-processing. Since then, various DETR variants have been proposed to enhance its performance and efficiency [42, 67, 56, 30, 36, 28, 5, 77, 82]. Deformable-DETR [81] leverages multi-scale deformable attention module to accelerate the convergence speed. DINO [73] integrates contrastive denoising, mix query selection, and look forward twice scheme into DETRs. RT-DETR [78] further designs the efficient hybrid encoder and proposes the uncertainty-minimal query selection to improve both the accuracy and latency. Another line to achieve end-to-end object detection is based on CNN detectors. Learnable NMS [24] and relation networks [26] present another network to remove duplicated predictions for detectors. OneNet [55] and DeFCN [66] propose one-to-one matching strategies to enable end-to-end object detection with fully convolutional networks. $\mathrm{FCOS}_{\mathrm{pss}}$ [80] introduces a positive sample selector to choose the optimal sample for prediction.
【翻译】端到端目标检测器。端到端目标检测已经成为传统流水线的范式转变,提供了简化的架构。DETR引入了transformer架构并采用匈牙利损失来实现一对一匹配预测,从而消除了手工制作的组件和后处理。从那时起,各种DETR变体被提出来增强其性能和效率。Deformable-DETR利用多尺度可变形注意力模块来加速收敛速度。DINO将对比去噪、混合查询选择和前瞻两次方案集成到DETR中。RT-DETR进一步设计了高效的混合编码器并提出了不确定性最小查询选择来改善准确性和延迟。实现端到端目标检测的另一条路线是基于CNN检测器。可学习NMS和关系网络提出了另一个网络来为检测器移除重复预测。OneNet和DeFCN提出了一对一匹配策略,以通过全卷积网络实现端到端目标检测。$\mathrm{FCOS}_{\mathrm{pss}}$引入了正样本选择器来为预测选择最优样本。
【解析】端到端检测的核心思想是将整个检测过程统一在一个网络中完成,输入图像直接输出最终的检测结果,无需复杂的后处理步骤。DETR是这个领域的开创性工作,它借鉴了机器翻译中的思想,将目标检测看作是一个集合预测问题。匈牙利损失是一种特殊的损失函数,它能够找到预测框和真实框之间的最优匹配,确保每个真实目标只对应一个预测框,这样就自然地避免了重复检测的问题。后续的改进工作主要集中在两个方向:一是基于Transformer的DETR系列改进,通过引入可变形注意力、去噪训练、查询选择等技术来提升性能和效率;二是基于CNN的端到端检测方法,这些方法试图在保持CNN高效性的同时实现端到端检测,主要通过改进标签分配策略和引入专门的网络模块来去除重复预测。
3 Methodology
3.1 用于无NMS训练的一致双分配策略
During training, YOLOs [ 21 , 65 , 29 , 70 ] usually leverage TAL [ 15 ] to allocate multiple positive samples for each instance. The adoption of one-to-many assignment yields plentiful supervisory signals, facilitating the optimization and achieving superior performance. However, it necessitates YOLOs to rely on the NMS post-processing, which causes the suboptimal inference efficiency for deployment. While previous works [ 55 , 66 , 80 , 6 ] explore one-to-one matching to suppress the redundant predictions, they usually introduce additional inference overhead or yield suboptimal performance. In this work, we present a NMS-free training strategy for YOLOs with dual label assignments and consistent matching metric, achieving both high efficiency and competitive performance.
【翻译】在训练过程中,YOLO通常利用TAL来为每个实例分配多个正样本。采用一对多分配产生了丰富的监督信号,促进了优化并实现了优异的性能。然而,这需要YOLO依赖NMS后处理,这导致部署时的推理效率不够理想。虽然之前的工作探索了一对一匹配来抑制冗余预测,但它们通常会引入额外的推理开销或产生次优性能。在这项工作中,我们提出了一种用于YOLO的无NMS训练策略,采用双标签分配和一致匹配度量,实现了高效率和竞争性能。
【解析】传统YOLO在训练时使用TAL(Task-Aligned Learning)策略,这是一种标签分配方法。一对多分配是对于图像中的每个真实物体,会有多个预测框被标记为正样本来学习检测这个物体。这样做的好处是提供了更多的学习信号,但问题在于推理时会产生多个预测框都认为自己检测到了同一个物体,这时就需要NMS来决定哪个预测框是最好的,把其他的删掉。NMS虽然能解决重复检测的问题,但它本身需要额外的计算时间,而且这个过程无法并行化,成为了推理速度的瓶颈。之前有研究尝试用一对一分配(一个物体只对应一个预测框)来避免NMS,但这样做要么增加了其他计算开销,要么因为监督信号不够丰富而导致检测精度下降。YOLOv10的创新在于提出了"双标签分配"策略,既保留了一对多分配的训练优势,又实现了一对一分配的推理效率,通过"一致匹配度量"来协调这两种分配方式,让模型在训练时享受丰富的监督信号,在推理时直接输出结果而不需要NMS。
Dual label assignments. Unlike one-to-many assignment, one-to-one matching assigns only one prediction to each ground truth, avoiding the NMS post-processing. However, it leads to weak supervision, which causes suboptimal accuracy and convergence speed [ 82 ]. Fortunately, this deficiency can be compensated for by the one-to-many assignment [ 6 ]. To achieve this, we introduce dual label assignments for YOLOs to combine the best of both strategies. Specifically, as shown in Fig. 2.(a), we incorporate another one-to-one head for YOLOs. It retains the identical structure and adopts the same optimization objectives as the original one-to-many branch but leverages the one-to-one matching to obtain label assignments. During training, two heads are jointly optimized with the model, allowing the backbone and neck to enjoy the rich supervision provided by the oneto-many assignment. During inference, we discard the one-to-many head and utilize the one-to-one head to make predictions. This enables YOLOs for the end-to-end deployment without incurring any additional inference cost. Besides, in the one-to-one matching, we adopt the top one selection, which achieves the same performance as Hungarian matching [4] with less extra training time.
【翻译】双标签分配。与一对多分配不同,一对一匹配只为每个真实目标分配一个预测,避免了NMS后处理。然而,这会导致监督信号较弱,从而造成次优的准确性和收敛速度。幸运的是,这种不足可以通过一对多分配来补偿。为了实现这一点,我们为YOLO引入了双标签分配来结合两种策略的优势。具体来说,如图2.(a)所示,我们为YOLO加入了另一个一对一检测头。它保持相同的结构并采用与原始一对多分支相同的优化目标,但利用一对一匹配来获得标签分配。在训练期间,两个检测头与模型联合优化,允许主干网络和颈部网络享受一对多分配提供的丰富监督。在推理期间,我们丢弃一对多检测头并利用一对一检测头进行预测。这使得YOLO能够进行端到端部署而不产生任何额外的推理成本。此外,在一对一匹配中,我们采用top-1选择,它在较少额外训练时间下实现了与匈牙利匹配相同的性能。
【解析】YOLOv10的解决方案是在网络中同时使用两个检测头:一个采用一对多分配,另一个采用一对一分配。关键在于这两个头共享相同的主干网络和颈部网络,这意味着网络的特征提取部分能够同时从两种分配策略中获益。在训练阶段,一对多头提供丰富的监督信号帮助网络学习更好的特征表示,而一对一头则学习如何直接输出无重复的预测结果。到了推理阶段,只需要使用一对一头即可,既保证了预测质量,又避免了NMS的计算开销。这种设计的精妙之处在于它实现了"训练时享受丰富监督,推理时保持高效简洁"的目标,是一种典型的训练推理分离的优化策略。
Figure 2: (a) Consistent dual assignments for NMS-free training. (b) Frequency of one-to-one assignments in Top-1/5/10 of one-to-many results for YOLOv8-S which employs $\alpha_{o2m}{=}0.5$ and $\beta_{o2m}{=}6$ by default [21]. For consistency, $\alpha_{o2o}{=}0.5$; $\beta_{o2o}{=}6$. For inconsistency, $\alpha_{o2o}{=}0.5$; $\beta_{o2o}{=}2$.
【翻译】图2:(a) 用于无NMS训练的一致双分配。(b) 在YOLOv8-S的一对多结果的Top-1/5/10中一对一分配的频率,该模型默认采用$\alpha_{o2m}{=}0.5$和$\beta_{o2m}{=}6$。对于一致性情况,$\alpha_{o2o}{=}0.5$;$\beta_{o2o}{=}6$。对于不一致性情况,$\alpha_{o2o}{=}0.5$;$\beta_{o2o}{=}2$。
【解析】这个图表展示了双标签分配策略中一个重要的设计考量:两个检测头之间的一致性问题。图2(b)通过统计分析揭示了一个关键现象:当两个检测头使用相同的超参数设置时(一致性情况),一对一分配选择的预测框往往也是一对多分配中排名靠前的候选框。这种一致性非常重要,因为它意味着两个检测头在学习过程中是相互协调的,而不是各自为政。如果两个头的学习目标差异过大(不一致性情况),就可能出现训练时一对多头学到的知识无法有效传递给一对一头的问题,从而影响最终的推理性能。通过调整$\alpha$和$\beta$这两个超参数,可以控制两个检测头之间的一致性程度,确保它们在训练过程中能够协同工作,最终实现既享受丰富监督又保持推理高效的目标。
Consistent matching metric. During assignments, both one-to-one and one-to-many approaches leverage a metric to quantitatively assess the level of concordance between predictions and instances. To achieve prediction aware matching for both branches, we employ a uniform matching metric, i.e .,
【翻译】一致匹配度量。在分配过程中,一对一和一对多方法都利用一个度量来定量评估预测和实例之间的一致性水平。为了实现两个分支的预测感知匹配,我们采用统一的匹配度量,即:
$$m(\alpha,\beta)=s\cdot p^{\alpha}\cdot\mathrm{IoU}(\hat{b},b)^{\beta},$$
where $p$ is the classification score, $\hat{b}$ and $b$ denote the bounding box of prediction and instance, respectively. $s$ represents the spatial prior indicating whether the anchor point of prediction is within the instance [21, 65, 29, 70]. $\alpha$ and $\beta$ are two important hyperparameters that balance the impact of the semantic prediction task and the location regression task. We denote the one-to-many and one-to-one metrics as $m_{o2m}{=}m(\alpha_{o2m},\beta_{o2m})$ and $m_{o2o}{=}m(\alpha_{o2o},\beta_{o2o})$, respectively. These metrics influence the label assignments and supervision information for the two heads.
【翻译】其中$p$是分类得分,$\hat{b}$和$b$分别表示预测和实例的边界框。$s$表示空间先验,指示预测的锚点是否在实例内。$\alpha$和$\beta$是两个重要的超参数,用于平衡语义预测任务和位置回归任务的影响。我们将一对多和一对一度量分别表示为$m_{o2m}{=}m(\alpha_{o2m},\beta_{o2m})$和$m_{o2o}{=}m(\alpha_{o2o},\beta_{o2o})$。这些度量影响两个检测头的标签分配和监督信息。
【解析】这个匹配度量公式是YOLOv10双分配策略的核心细节。在目标检测中,我们需要决定哪些预测框应该负责检测哪些真实物体,这就是标签分配问题。这个公式$m(\alpha,\beta)=s\cdot p^{\alpha}\cdot\mathrm{IoU}(\hat{b},b)^{\beta}$包含了三个关键要素:首先是空间先验$s$,它是一个二进制指示器,确保只有那些锚点落在真实物体内部的预测框才有资格被考虑,这避免了距离过远的预测框参与匹配;其次是分类得分$p$,它反映了模型对该预测框包含目标物体的置信度;最后是IoU项,它衡量预测框与真实框的位置重叠程度。两个指数$\alpha$和$\beta$的作用是调节分类任务和定位任务在匹配过程中的相对重要性。当$\alpha$较大时,模型更倾向于选择分类得分高的预测框;当$\beta$较大时,模型更重视位置精确度。关键创新在于,虽然一对多和一对一分支使用相同的匹配度量公式结构,但通过设置不同的$\alpha$和$\beta$参数值,可以让两个分支在训练时有不同的学习重点,同时保持整体学习目标的一致性。这种设计既保证了训练时的丰富监督信号,又确保了推理时的高效性。
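【补充示例】下面用一小段NumPy代码演示该匹配度量的计算方式(仅为示意草图,`box_iou`等辅助函数与各项数值均为假设,并非论文的官方实现):

```python
import numpy as np

def box_iou(b1, b2):
    """两个(x1, y1, x2, y2)格式矩形框的IoU。"""
    x1, y1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    x2, y2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def match_metric(s, p, pred_box, gt_box, alpha, beta):
    """m(alpha, beta) = s * p^alpha * IoU(b_hat, b)^beta"""
    return s * (p ** alpha) * (box_iou(pred_box, gt_box) ** beta)

gt = (0, 0, 10, 10)
# 每个预测为(空间先验s, 分类得分p, 预测框)
preds = [(1, 0.90, (0, 0, 10, 10)),    # 定位完美、得分较高
         (1, 0.95, (2, 2, 12, 12)),    # 得分更高、但定位较差
         (0, 0.99, (30, 30, 40, 40))]  # 锚点落在物体外:s = 0,直接被排除
scores = [match_metric(s, p, b, gt, alpha=0.5, beta=6.0) for s, p, b in preds]
print(int(np.argmax(scores)))  # -> 0:在beta=6的权重下,定位好的预测胜出
```

这个例子体现了$\beta$较大时度量对定位质量的偏好:第二个预测虽然分类得分更高,但IoU项的六次方惩罚使它落后;第三个预测则被空间先验$s{=}0$直接归零。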
In dual label assignments, the one-to-many branch provides much richer supervisory signals than the one-to-one branch. Intuitively, if we can harmonize the supervision of the one-to-one head with that of the one-to-many head, we can optimize the one-to-one head towards the direction of the one-to-many head's optimization. As a result, the one-to-one head can provide improved quality of samples during inference, leading to better performance. To this end, we first analyze the supervision gap between the two heads. Due to the randomness during training, we initiate our examination in the beginning with two heads initialized with the same values and producing the same predictions, i.e., the one-to-one head and one-to-many head generate the same $p$ and IoU for each prediction-instance pair. We note that the regression targets of the two branches do not conflict, as matched predictions share the same targets and unmatched predictions are ignored. The supervision gap thus lies in the different classification targets. Given an instance, we denote its largest IoU with predictions as $u^{*}$, and the largest one-to-many and one-to-one matching scores as $m_{o2m}^{*}$ and $m_{o2o}^{*}$, respectively. Suppose that the one-to-many branch yields the positive samples $\Omega$ and the one-to-one branch selects the $i$-th prediction with the metric $m_{o2o,i}{=}m_{o2o}^{*}$, we can then derive the classification target $t_{o2m,j}{=}u^{*}\cdot\frac{m_{o2m,j}}{m_{o2m}^{*}}\le u^{*}$ for $j\in\Omega$ and $t_{o2o,i}{=}u^{*}\cdot\frac{m_{o2o,i}}{m_{o2o}^{*}}{=}u^{*}$ for the task-aligned loss as in [21, 65, 29, 70, 15].
The supervision gap between two branches can thus be derived by the 1-Wasserstein distance [46] of different classification objectives, i.e .,
【翻译】在双标签分配中,一对多分支比一对一分支提供了更丰富的监督信号。直观地说,如果我们能够协调一对一检测头与一对多检测头的监督,我们就可以将一对一检测头的优化方向与一对多检测头的优化方向保持一致。因此,一对一检测头可以在推理过程中提供更高质量的样本,从而获得更好的性能。为此,我们首先分析两个检测头之间的监督差距。由于训练过程中的随机性,我们从两个检测头用相同值初始化并产生相同预测的开始进行检查,即一对一检测头和一对多检测头为每个预测-实例对生成相同的$p$和IoU。我们注意到两个分支的回归目标不冲突,因为匹配的预测共享相同的目标,而未匹配的预测被忽略。因此监督差距在于不同的分类目标。给定一个实例,我们将其与预测的最大IoU记为$u^{*}$,将最大的一对多和一对一匹配分数分别记为$m_{o2m}^{*}$和$m_{o2o}^{*}$。假设一对多分支产生正样本集合$\Omega$,一对一分支选择第$i$个预测,其度量满足$m_{o2o,i}{=}m_{o2o}^{*}$,我们可以推导出分类目标$t_{o2m,j}{=}u^{*}\cdot\frac{m_{o2m,j}}{m_{o2m}^{*}}\le u^{*}$(对于$j\in\Omega$)和$t_{o2o,i}{=}u^{*}\cdot\frac{m_{o2o,i}}{m_{o2o}^{*}}{=}u^{*}$,用于任务对齐损失。因此,两个分支之间的监督差距可以通过不同分类目标的1-Wasserstein距离来推导,即:
【解析】这段话分析了双标签分配策略中两个检测头之间存在的监督差异问题。核心思想是要让一对一检测头能够从一对多检测头的丰富监督中受益。作者从数学角度分析了这个问题:首先明确了两个检测头在回归任务上是一致的,真正的差异来自分类任务的监督信号强度。一对多分配会为每个真实物体选择多个正样本进行训练,而一对一分配只选择一个最佳样本。这种差异导致了监督信号的不平衡。为了量化这种差异,作者引入了数学框架:用$u^{*}$表示预测框与真实框的最大IoU,用$m_{o2m}^{*}$和$m_{o2o}^{*}$分别表示两种分配方式下的最大匹配分数。通过这些参数,可以计算出每个分支的分类目标值。一对多分支中的每个正样本$j$的分类目标是$t_{o2m,j}{=}u^{*}\cdot\frac{m_{o2m,j}}{m_{o2m}^{*}}$,这个值通常小于等于$u^{*}$,因为不是所有正样本都是最优的。而一对一分支选择的样本由于是最优的,其分类目标直接等于$u^{*}$。这种差异就是监督差距的来源,可以用1-Wasserstein距离来精确量化,为后续的一致性优化提供了理论基础。
$$A=t_{o2o,i}-\mathbb{I}(i\in\Omega)\,t_{o2m,i}+\sum_{k\in\Omega\backslash\{i\}}t_{o2m,k},$$
We can observe that the gap decreases as $t_{o2m,i}$ increases, i.e., $i$ ranks higher within $\Omega$. It reaches the minimum when $t_{o2m,i}{=}u^{*}$, i.e., $i$ is the best positive sample in $\Omega$, as shown in Fig. 2.(a). To achieve this, we present the consistent matching metric, i.e., $\alpha_{o2o}{=}r\cdot\alpha_{o2m}$ and $\beta_{o2o}{=}r\cdot\beta_{o2m}$, which implies $m_{o2o}{=}m_{o2m}^{r}$. Therefore, the best positive sample for the one-to-many head is also the best for the one-to-one head. Consequently, both heads can be optimized consistently and harmoniously. For simplicity, we take $r{=}1$, by default, i.e., $\alpha_{o2o}{=}\alpha_{o2m}$ and $\beta_{o2o}{=}\beta_{o2m}$. To verify the improved supervision alignment, we count the number of one-to-one matching pairs within the top-1/5/10 of the one-to-many results after training. As shown in Fig. 2.(b), the alignment is improved under the consistent matching metric. For a more comprehensive understanding of the mathematical proof, please refer to the appendix.
【翻译】我们可以观察到,随着$t_{o2m,i}$的增加,即$i$在$\Omega$中排名越高,差距会减小。当$t_{o2m,i}{=}u^{*}$时,即$i$是$\Omega$中最佳正样本时,差距达到最小值,如图2.(a)所示。为了实现这一点,我们提出了一致的匹配度量,即$\alpha_{o2o}{=}r\cdot\alpha_{o2m}$和$\beta_{o2o}{=}r\cdot\beta_{o2m}$,这意味着$m_{o2o}{=}m_{o2m}^{r}$。因此,一对多检测头的最佳正样本也是一对一检测头的最佳正样本,两个检测头可以一致且和谐地进行优化。为简单起见,我们默认取$r{=}1$,即$\alpha_{o2o}{=}\alpha_{o2m}$和$\beta_{o2o}{=}\beta_{o2m}$。为了验证改进的监督对齐,我们统计训练后一对一匹配对在一对多结果的前1/5/10中的数量。如图2.(b)所示,在一致匹配度量下对齐得到了改善。要更全面地理解数学证明,请参考附录。
【解析】当一对多分配中某个预测框$i$的分类目标$t_{o2m,i}$越接近最大IoU值$u^{*}$时,说明这个预测框在一对多分配中的质量越高,此时两个检测头之间的监督差距就越小。当$t_{o2m,i}{=}u^{*}$时,说明预测框$i$是一对多分配中的最优选择,这时监督差距达到最小。为了让两个检测头的学习目标保持一致,作者提出了"一致匹配度量"的概念,即让两个检测头使用相同的超参数设置:$\alpha_{o2o}{=}\alpha_{o2m}$和$\beta_{o2o}{=}\beta_{o2m}$。这样设计的数学依据是,如果匹配度量公式的参数相同,那么在相同的预测条件下,一对多分配中得分最高的预测框也会是一对一分配的首选。这种一致性确保了两个检测头在训练过程中朝着相同的方向优化,避免了相互冲突的学习目标。图2.(b)的实验验证了这一点:使用一致的匹配度量后,一对一分配选择的预测框更频繁地出现在一对多分配的前几名中,证明了两个检测头的学习目标确实得到了很好的协调。
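【补充示例】一致性条件$\alpha_{o2o}{=}r\cdot\alpha_{o2m}$、$\beta_{o2o}{=}r\cdot\beta_{o2m}$蕴含$m_{o2o}{=}m_{o2m}^{r}$,而非负数的幂运算不改变排序,因此两个检测头挑出的最佳正样本必然相同。下面用随机数值做一个简单的数值验证(示意性检查,数据为随机假设,并非论文实验):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
s = rng.integers(0, 2, n)          # 空间先验,取值{0, 1}
p = rng.uniform(0.1, 1.0, n)       # 分类得分
iou = rng.uniform(0.1, 1.0, n)     # 与某个实例的IoU

def metric(alpha, beta):
    return s * p**alpha * iou**beta

a_o2m, b_o2m = 0.5, 6.0            # 一对多头的默认超参数[21]
r = 2.0                            # 任意r > 0都满足一致性条件
m_o2m = metric(a_o2m, b_o2m)
m_o2o = metric(r * a_o2m, r * b_o2m)

# m_o2o逐元素等于m_o2m^r,因此argmax(最佳正样本)一致
assert np.allclose(m_o2o, m_o2m ** r)
assert m_o2m.argmax() == m_o2o.argmax()
print("best sample index:", m_o2m.argmax())
```

正文默认取$r{=}1$(两头超参数完全相同),此时两个度量逐元素相等,一致性是平凡成立的;这个小检查说明即使$r\neq 1$,排序一致性也依然保持。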
Discussion with other counter-parts. Similarly, previous works [ 28 , 5 , 77 , 54 , 6 , 82 , 45 ] explore the different assignments to accelerate the training convergence and improve the performance for different networks. For example, H-DETR [ 28 ], Group-DETR [ 5 ], and MS-DETR [ 77 ] introduce one-to-many matching in conjunction with the original one-to-one matching by hybrid or multiple group label assignments, to improve upon DETR. Differently, to achieve the one-to-many matching, they usually introduce extra queries or repeat ground truths for bipartite matching, or select top several queries from the matching scores, while we adopt the prediction aware assignment that incorporates the spatial prior. Besides, LRANet [ 54 ] employs the dense assignment and sparse assignment branches for training, which all belong to the one-to-many assignment, while we adopt the one-to-many and one-to-one branches. DEYO [ 45 , 43 , 44 ] investigates the step-by-step training with one-to-many matching in the first stage for convolutional encoder and one-to-one matching in the second stage for transformer decoder, while we avoid the transformer decoder for end-to-end inference. Compared with these methods, our approach is more straightforward and effective for YOLOs, achieving superior performance with lower computational cost.
【翻译】与其他对应方法的讨论。类似地,之前的工作探索了不同的分配策略来加速训练收敛并提高不同网络的性能。例如,H-DETR、Group-DETR和MS-DETR通过混合或多组标签分配,将一对多匹配与原始的一对一匹配相结合,以改进DETR。不同的是,为了实现一对多匹配,它们通常引入额外的查询或重复真实目标进行二分匹配,或从匹配分数中选择前几个查询,而我们采用了结合空间先验的预测感知分配。此外,LRANet采用密集分配和稀疏分配分支进行训练,这些都属于一对多分配,而我们采用一对多和一对一分支。DEYO研究了分步训练,在第一阶段对卷积编码器使用一对多匹配,在第二阶段对transformer解码器使用一对一匹配,而我们避免了transformer解码器以实现端到端推理。与这些方法相比,我们的方法对于YOLO更加直接和有效,以更低的计算成本实现了优异的性能。
【解析】标签分配一直是一个重要的研究方向,不同的方法都在尝试解决训练效率和检测精度之间的平衡问题。DETR系列方法(H-DETR、Group-DETR、MS-DETR)主要针对transformer架构的检测器,它们通过增加查询数量或重复真实目标的方式来实现一对多匹配,但这种做法会增加计算复杂度。LRANet虽然也使用了双分支的思想,但其两个分支都是一对多分配的变体,没有真正解决推理时需要NMS的问题。DEYO采用了分阶段训练的策略,但仍然依赖transformer解码器,这在推理时会带来额外的计算开销。相比之下,YOLOv10的方法更加简洁高效:它直接在YOLO架构上实现双分配,不需要复杂的查询机制或分阶段训练,同时通过预测感知分配和空间先验的结合,确保了分配的质量。最重要的是,YOLOv10在推理时完全避免了transformer解码器和NMS后处理,实现了真正的端到端检测,这使得它在保持高精度的同时具有更好的推理效率。
3.2 整体效率-精度驱动的模型设计
In addition to the post-processing, the model architectures of YOLOs also pose great challenges to the efficiency-accuracy trade-offs [ 50 , 8 , 29 ]. Although previous works explore various design strategies, the comprehensive inspection for various components in YOLOs is still lacking. Consequently, the model architecture exhibits non-negligible computational redundancy and constrained capability, which impedes its potential for achieving high efficiency and performance. Here, we aim to holistically perform model designs for YOLOs from both efficiency and accuracy perspectives.
【翻译】除了后处理之外,YOLO的模型架构也对效率-精度权衡提出了巨大挑战。尽管之前的工作探索了各种设计策略,但对YOLO中各种组件的全面检查仍然缺乏。因此,模型架构表现出不可忽视的计算冗余和受限的能力,这阻碍了其实现高效率和高性能的潜力。在这里,我们旨在从效率和精度两个角度全面地为YOLO进行模型设计。
【解析】现有的YOLO架构在追求效率和精度平衡时面临挑战。虽然研究者们提出了很多改进策略,但大多数工作都是针对特定组件的局部优化,缺乏对整个模型架构的系统性分析和设计。这种碎片化的改进方式导致了两个问题:一是模型中存在计算冗余,即某些部分的计算资源没有得到充分利用或者存在不必要的计算开销;二是模型能力受限,即架构设计限制了模型表达复杂模式的能力。YOLOv10的作者认识到这个问题,决定采用更加全面和系统的方法来重新设计YOLO架构,不仅要考虑如何提高计算效率,还要同时保证甚至提升检测精度。
Efficiency driven model design. The components in YOLO consist of the stem, downsampling layers, stages with basic building blocks, and the head. The stem incurs few computational cost and we thus perform efficiency driven model design for other three parts.
【翻译】效率驱动的模型设计。YOLO中的组件包括stem、下采样层、带有基本构建块的阶段和检测头。stem产生的计算成本很少,因此我们对其他三个部分进行效率驱动的模型设计。
【解析】这里作者明确了YOLO模型的基本架构组成,并确定了优化重点。YOLO模型通常包含四个主要部分:stem是模型的起始部分,通常包含几个简单的卷积层,用于初步处理输入图像;下采样层负责逐步减小特征图的空间分辨率;stages是模型的主体部分,包含多个重复的基本构建块,这些块负责提取和处理特征;head是最终的检测头,负责输出分类和定位结果。由于stem部分相对简单,计算开销较小,所以作者将优化重点放在其他三个计算密集的部分。
(1) Lightweight classification head. The classification and regression heads usually share the same architecture in YOLOs. However, they exhibit notable disparities in computational overhead. For example, the FLOPs and parameter count of the classification head (5.95G/1.51M) are 2.5× and 2.4× those of the regression head (2.34G/0.64M) in YOLOv8-S, respectively. However, after analyzing the impact of the classification error and the regression error (see Tab. 6), we find that the regression head undertakes more significance for the performance of YOLOs. Consequently, we can reduce the overhead of the classification head without worrying about hurting the performance greatly. Therefore, we simply adopt a lightweight architecture for the classification head, which consists of two depthwise separable convolutions [25, 9] with a kernel size of 3×3, followed by a 1×1 convolution.
【翻译】轻量级分类头。在YOLO中,分类头和回归头通常共享相同的架构。然而,它们在计算开销方面表现出显著差异。例如,在YOLOv8-S中,分类头的FLOPs和参数量(5.95G/1.51M)分别是回归头(2.34G/0.64M)的2.5×和2.4×。然而,在分析分类误差和回归误差的影响后(见表6),我们发现回归头对YOLO的性能更为重要。因此,我们可以减少分类头的开销而不必担心会大幅损害性能。因此,我们简单地为分类头采用轻量级架构,它由两个核大小为3×3的深度可分离卷积和一个1×1卷积组成。
【解析】回归头负责预测物体的精确位置和大小,这对检测任务至关重要,因为即使分类正确,如果位置不准确,检测结果仍然无用。作者提出使用轻量级的分类头设计:采用深度可分离卷积替代标准卷积。深度可分离卷积将标准卷积分解为深度卷积和逐点卷积两步,大幅减少了参数量和计算量,同时保持了足够的表达能力来完成分类任务。
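下面用一个简单的参数量计算示意深度可分离卷积带来的收益(仅为示意代码,通道数256是假设值,并非论文分类头的真实配置):

```python
# Parameter-count sketch: standard 3x3 conv vs. the depthwise separable
# 3x3 conv used in the lightweight classification head (bias omitted).

def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution."""
    return k * k * c_in * c_out

def dws_conv_params(c_in, c_out, k):
    """Depthwise k x k conv (c_in groups) followed by a 1x1 pointwise conv."""
    return k * k * c_in + c_in * c_out

c = 256
standard = conv_params(c, c, 3)       # 3*3*256*256 = 589,824
separable = dws_conv_params(c, c, 3)  # 3*3*256 + 256*256 = 67,840

print(standard, separable, standard / separable)  # ~8.7x fewer parameters
```

可以看到,在这个假设配置下,深度可分离卷积的参数量不到标准卷积的1/8,这正是分类头轻量化的来源。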
(2) Spatial-channel decoupled downsampling. YOLOs typically employ regular 3×3 standard convolutions with a stride of 2, achieving spatial downsampling (from H×W to H/2×W/2) and channel transformation (from C to 2C) simultaneously. This introduces a non-negligible computational cost of O(9/2·HWC²) and a parameter count of O(18C²). Instead, we propose to decouple the spatial reduction and channel increase operations, enabling more efficient downsampling. Specifically, we first leverage the pointwise convolution to modulate the channel dimension and then utilize the depthwise convolution to perform spatial downsampling. This reduces the computational cost to O(2HWC² + 9/2·HWC) and the parameter count to O(2C² + 18C). Meanwhile, it maximizes information retention during downsampling, leading to competitive performance with latency reduction.
【翻译】空间-通道解耦下采样。YOLO通常使用步长为2的常规3×3标准卷积,同时实现空间下采样(从H×W到H/2×W/2)和通道变换(从C到2C)。这引入了不可忽视的计算成本O(9/2·HWC²)和参数量O(18C²)。相反,我们提出解耦空间缩减和通道增加操作,实现更高效的下采样。具体来说,我们首先利用逐点卷积来调节通道维度,然后利用深度卷积来执行空间下采样。这将计算成本降低到O(2HWC² + 9/2·HWC),参数量降低到O(2C² + 18C)。同时,它在下采样过程中最大化信息保留,在减少延迟的同时实现有竞争力的性能。
【解析】传统的下采样操作同时减少特征图的空间尺寸和增加通道数。这种做法虽然简洁,但在计算效率上并不是最优的。标准的3×3卷积在执行这种双重任务时,计算复杂度达到O(9/2·HWC²),这是因为每个输出通道都需要与所有输入通道进行卷积运算,而3×3的卷积核又带来9倍的计算量。作者提出的解耦策略将这个复杂的操作分解为两个更简单、更高效的步骤。第一步使用1×1的逐点卷积来调整通道数,这一步的计算复杂度是O(2HWC²),虽然看起来很大,但实际上比原来的方法更高效,因为没有空间卷积的开销。第二步使用深度卷积进行空间下采样,计算复杂度是O(9/2·HWC),注意这里关于通道数C是线性的,而不是二次的。总的计算复杂度变成了O(2HWC² + 9/2·HWC),在通道数较大时,这比原来的O(9/2·HWC²)要小得多。更重要的是,这种分解方式能够更好地保留信息,因为逐点卷积专注于通道间的信息融合,而深度卷积专注于空间信息的处理,各司其职,避免了信息在复杂操作中的损失。
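上面的复杂度分析可以用如下示意代码核对(只统计乘加次数,H、W、C为假设值):

```python
# FLOPs sketch for the coupled vs. decoupled downsampling analysis.
# Counts multiply-accumulates only; illustrative sizes, not a real model.

def coupled_downsample_flops(h, w, c):
    """Stride-2 3x3 conv doing C -> 2C and HxW -> H/2 x W/2 at once."""
    return (h // 2) * (w // 2) * (2 * c) * (9 * c)   # = (9/2) H W C^2

def decoupled_downsample_flops(h, w, c):
    """1x1 pointwise conv (C -> 2C) at full resolution, then a stride-2
    3x3 depthwise conv on the 2C channels."""
    pw = h * w * (2 * c) * c                          # = 2 H W C^2
    dw = (h // 2) * (w // 2) * (2 * c) * 9            # = (9/2) H W C
    return pw + dw

h, w, c = 80, 80, 128
print(coupled_downsample_flops(h, w, c), decoupled_downsample_flops(h, w, c))
```

在这个假设尺寸下,解耦方案的乘加次数不到耦合方案的一半,与正文中O(9/2·HWC²)对O(2HWC² + 9/2·HWC)的对比一致。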
(3) Rank-guided block design. YOLOs usually employ the same basic building block for all stages [29, 65], e.g., the bottleneck block in YOLOv8 [21]. To thoroughly examine such homogeneous design for YOLOs, we utilize the intrinsic rank [33, 16] to analyze the redundancy of each stage. Specifically, we calculate the numerical rank of the last convolution in the last basic block in each stage, which counts the number of singular values larger than a threshold. Fig. 3.(a) presents the results of YOLOv8, indicating that deep stages and large models are prone to exhibit more redundancy. This observation suggests that simply applying the same block design for all stages is suboptimal for the best capacity-efficiency trade-off. To tackle this, we propose a rank-guided block design scheme which aims to decrease the complexity of stages that are shown to be redundant using compact architecture design. We first present a compact inverted block (CIB) structure, which adopts the cheap depthwise convolutions for spatial mixing and cost-effective pointwise convolutions for channel mixing, as shown in Fig. 3.(b). It can serve as the efficient basic building block, e.g., embedded in the ELAN structure [64, 21] (Fig. 3.(b)). Then, we advocate a rank-guided block allocation strategy to achieve the best efficiency while maintaining competitive capacity. Specifically, given a model, we sort all its stages based on their intrinsic ranks in ascending order. We further inspect the performance variation of replacing the basic block in the leading stage with CIB. If there is no performance degradation compared with the given model, we proceed with the replacement of the next stage and halt the process otherwise. Consequently, we can implement adaptive compact block designs across stages and model scales, achieving higher efficiency without compromising performance. Due to the page limit, we provide the details of the algorithm in the appendix.
【翻译】秩引导的块设计。YOLO通常对所有阶段采用相同的基本构建块,例如YOLOv8中的瓶颈块。为了彻底检查YOLO的这种同质化设计,我们利用内在秩来分析每个阶段的冗余性。具体来说,我们计算每个阶段中最后一个基本块的最后一个卷积的数值秩,它计算大于阈值的奇异值数量。图3.(a)展示了YOLOv8的结果,表明深层阶段和大型模型更容易表现出更多冗余。这一观察表明,简单地对所有阶段应用相同的块设计对于最佳的容量-效率权衡是次优的。为了解决这个问题,我们提出了一种秩引导的块设计方案,旨在使用紧凑的架构设计来降低被证明是冗余的阶段的复杂性。我们首先提出了一个紧凑倒置块(CIB)结构,它采用廉价的深度卷积进行空间混合和成本效益高的逐点卷积进行通道混合,如图3.(b)所示。它可以作为高效的基本构建块,例如嵌入到ELAN结构中(图3.(b))。然后,我们提倡一种秩引导的块分配策略,以在保持竞争性能力的同时实现最佳效率。具体来说,给定一个模型,我们根据其内在秩按升序对所有阶段进行排序。我们进一步检查用CIB替换领先阶段中基本块的性能变化。如果与给定模型相比没有性能下降,我们继续替换下一个阶段,否则停止该过程。因此,我们可以在不同阶段和模型规模上实现自适应的紧凑块设计,在不损害性能的情况下实现更高的效率。由于页面限制,我们在附录中提供了算法的详细信息。
【解析】传统的YOLO架构所有网络阶段都使用相同的基本构建块,这种"一刀切"的设计方式并不合理。网络的不同阶段承担着不同的功能,浅层主要提取低级特征如边缘和纹理,而深层则负责提取高级语义特征。作者通过数学工具——内在秩分析来量化每个阶段的计算冗余程度。内在秩是通过奇异值分解(SVD)计算得出的,它反映了特征表示的有效维度。当一个卷积层的输出特征中有很多奇异值接近零时,说明这些维度包含的信息很少,存在冗余。实验发现,网络越深的阶段和参数越多的大模型往往冗余程度越高,这是因为深层网络容易过参数化,而大模型本身就有更多可能未被充分利用的参数。基于这个发现,作者设计了紧凑倒置块(CIB),这是一种轻量级的构建块,采用深度可分离卷积的思想:先用深度卷积处理空间信息,再用1×1卷积处理通道信息,大幅减少计算量。更重要的是,作者提出了秩引导的分配策略:首先根据各阶段的冗余程度排序,然后从冗余度最高的阶段开始,逐步用CIB替换原有的重型块,每次替换后都检查性能是否下降,如果性能保持稳定就继续替换下一个阶段,直到出现性能下降为止。这种策略实现了精准的架构优化,既保证了模型性能,又最大化了计算效率。
Figure 3: (a) The intrinsic ranks across stages and models in YOLOv8. The stages in the backbone and neck are numbered in the order of the model forward process. The numerical rank r is normalized to r/C_o for the y-axis and its threshold is set to λ_max/2 by default, where C_o denotes the number of output channels and λ_max is the largest singular value. It can be observed that deep stages and large models exhibit lower intrinsic rank values. (b) The compact inverted block (CIB). (c) The partial self-attention module (PSA).
【翻译】图3:(a) YOLOv8中各阶段和模型的内在秩。骨干网络和颈部的阶段按模型前向过程的顺序编号。数值秩r被归一化为r/C_o作为y轴,其阈值默认设置为λ_max/2,其中C_o表示输出通道数,λ_max是最大奇异值。可以观察到,深层阶段和大型模型表现出较低的内在秩值。(b) 紧凑倒置块(CIB)。(c) 部分自注意力模块(PSA)。
【解析】内在秩是衡量神经网络层特征表示冗余程度的重要指标。当一个网络层的内在秩较低时,说明该层学到的特征之间存在较强的线性相关性,即特征表示中包含了冗余信息。图3(a)展示了YOLOv8不同阶段和不同规模模型的内在秩分析结果。为了便于比较,作者将原始的数值秩r除以输出通道数C_o进行归一化,这样可以消除不同层通道数差异对比较结果的影响。阈值设置为最大奇异值的一半,这是一个常用的判断标准,用于区分重要特征和冗余特征。从分析结果可以看出两个重要现象:首先,网络的深层阶段比浅层阶段具有更低的内在秩,这表明随着网络深度的增加,特征表示变得更加冗余,这为后续的模型压缩和优化提供了理论依据;其次,大型模型比小型模型表现出更低的内在秩,说明大模型虽然参数更多,但其特征表示的有效维度并没有成比例增长,存在显著的参数冗余。基于这些发现,作者设计了两个关键组件:紧凑倒置块(CIB)用于减少计算冗余,部分自注意力模块(PSA)用于在保持效率的同时提升模型的表达能力。
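数值秩的计算过程可以用NumPy做一个简单示意:把卷积权重展平成矩阵后做SVD,统计大于阈值的奇异值个数(权重形状为随机假设值,阈值比例取正文默认的最大奇异值的一半):

```python
import numpy as np

def numerical_rank(weight, ratio=0.5):
    """weight: (C_out, C_in, k, k) array.
    Returns #{sigma_i > ratio * sigma_max} of the flattened weight matrix."""
    c_out = weight.shape[0]
    mat = weight.reshape(c_out, -1)            # flatten to C_out x (C_in*k*k)
    s = np.linalg.svd(mat, compute_uv=False)   # singular values, descending
    return int(np.sum(s > ratio * s[0]))

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64, 3, 3))        # illustrative conv weight
r = numerical_rank(w)
print(r, r / 64)   # normalized by C_out, as on the y-axis of Fig. 3(a)
```

某层的归一化秩r/C_o越低,说明该层的有效维度越少、冗余越多,也就越适合替换为紧凑块。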
Accuracy driven model design. We further explore the large-kernel convolution and self-attention for accuracy driven design, aiming to boost the performance under minimal cost.
【翻译】精度驱动的模型设计。我们进一步探索大核卷积和自注意力机制用于精度驱动的设计,旨在以最小的成本提升性能。
【解析】在完成效率驱动的模型设计后,作者转向精度提升的研究。这里的核心思想是在不显著增加计算成本的前提下,通过引入更强的特征提取能力来提高检测精度。大核卷积能够捕获更大范围的空间上下文信息,这对于检测任务中的物体定位和识别都很重要,特别是对于大尺寸物体或者需要更大感受野的场景。自注意力机制则能够建模特征之间的长距离依赖关系,帮助网络更好地理解图像中不同区域之间的关联性。
(1) Large-kernel convolution. Employing large-kernel depthwise convolution is an effective way to enlarge the receptive field and enhance the model's capability [10, 40, 39]. However, simply leveraging them in all stages may introduce contamination in shallow features used for detecting small objects. Therefore, we only employ large-kernel depthwise convolutions in the deep stages (stage 3 and 4) of the backbone, where the receptive field is crucial for capturing long-range dependencies and the resolution is relatively low, making them less sensitive to the increased computational cost. Specifically, we replace the second 3×3 depthwise convolution in the bottleneck with a 7×7 depthwise convolution.
【翻译】大核卷积。采用大核深度卷积是扩大感受野和增强模型能力的有效方法。然而,在所有阶段简单地利用它们可能会在用于检测小物体的浅层特征中引入污染。因此,我们只在骨干网络的深层阶段(第3和第4阶段)采用大核深度卷积,在这些阶段感受野对于捕获长距离依赖关系至关重要,且分辨率相对较低,使它们对增加的计算成本不太敏感。具体来说,我们将瓶颈块中的第二个3×3深度卷积替换为7×7深度卷积。
【解析】感受野是指网络中某个神经元能够"看到"的输入图像区域的大小。在目标检测任务中,不同尺寸的物体需要不同大小的感受野来有效检测。小物体需要保持精细的空间细节,因此适合较小的感受野;而大物体或者需要理解物体间关系的任务则需要更大的感受野来获取更广泛的上下文信息。大核卷积能够直接增大感受野,但也会带来计算开销的增加。作者的策略很巧妙:他们只在网络的深层使用大核卷积,这是因为深层特征图的分辨率已经比较低,计算开销的增加相对可控,同时深层特征主要负责捕获高级语义信息和长距离依赖关系,正好需要大感受野的帮助。相反,浅层特征图分辨率高,如果使用大核卷积会显著增加计算量,而且浅层特征主要用于检测小物体,过大的感受野反而可能引入无关信息,影响小物体检测的精度。通过将3×3卷积替换为7×7卷积,单层卷积核的覆盖范围从9个像素扩展到49个像素,大幅提升了模型捕获空间上下文的能力。
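感受野的增长可以用标准的感受野递推公式做一个粗略示意(下面的层配置是假设的简化骨干,并非YOLOv8的真实结构,仅用于说明深层换大核的放大效应):

```python
# Receptive-field sketch: swapping one deep-stage conv from 3x3 to 7x7.
# RF recurrence: rf += (k - 1) * jump; jump *= stride.

def receptive_field(layers):
    """layers: list of (kernel, stride); returns RF on the input image."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

backbone_3x3 = [(3, 2), (3, 2), (3, 2), (3, 2), (3, 1)]   # last block uses 3x3
backbone_7x7 = backbone_3x3[:-1] + [(7, 1)]               # swap in a 7x7 conv

print(receptive_field(backbone_3x3), receptive_field(backbone_7x7))  # 63 vs 127
```

由于深层的累积步长(jump)很大,深层一次3×3到7×7的替换就能让输入空间上的感受野翻倍,这解释了为什么只在深层换大核就已经足够。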
(2) Partial self-attention. Self-attention has been proven to be effective for capturing long-range dependencies [70, 71, 72]. However, the global modeling in self-attention incurs quadratic computational complexity with respect to the spatial size, making it less efficient for high-resolution feature maps. To address this, we design a partial self-attention (PSA) module that only applies self-attention to a subset of channels, while leaving the remaining channels unchanged. This reduces the computational overhead while maintaining the model's capability to capture long-range dependencies. Specifically, given an input feature X ∈ ℝ^{H×W×C}, we evenly partition it into two parts along the channel dimension: X₁ ∈ ℝ^{H×W×C/2} and X₂ ∈ ℝ^{H×W×C/2}. We then apply multi-head self-attention (MHSA) [70] to X₁ and keep X₂ unchanged. The outputs are concatenated to form the final output. Additionally, we employ an N×N depthwise convolution before MHSA to incorporate the local spatial information, where N is set to 7 by default. The PSA module is only applied to stage 4 of the backbone, where the resolution is the lowest and the computational overhead is acceptable.
【翻译】部分自注意力。自注意力已被证明在捕获长距离依赖关系方面是有效的。然而,自注意力中的全局建模会产生与空间大小成二次方的计算复杂度,使其对高分辨率特征图效率较低。为了解决这个问题,我们设计了一个部分自注意力(PSA)模块,它只对通道的一个子集应用自注意力,而保持其余通道不变。这减少了计算开销,同时保持了模型捕获长距离依赖关系的能力。具体来说,给定输入特征X ∈ ℝ^{H×W×C},我们沿通道维度将其均匀分为两部分:X₁ ∈ ℝ^{H×W×C/2}和X₂ ∈ ℝ^{H×W×C/2}。然后我们对X₁应用多头自注意力(MHSA),并保持X₂不变。输出被连接以形成最终输出。此外,我们在MHSA之前采用N×N深度卷积来融合局部空间信息,其中N默认设置为7。PSA模块只应用于骨干网络的第4阶段,在那里分辨率最低,计算开销是可接受的。
【解析】自注意力机制是Transformer架构的核心组件,它能够计算序列中每个位置与所有其他位置之间的关系,从而捕获长距离依赖关系。在计算机视觉任务中,自注意力可以帮助模型理解图像中不同区域之间的关联性,比如一个物体的不同部分或者不同物体之间的关系。然而,自注意力的计算复杂度是O(n²),其中n是序列长度(在图像中对应像素数量)。对于高分辨率的特征图,这种二次复杂度会导致计算开销急剧增加,使得直接应用自注意力变得不现实。作者提出的部分自注意力(PSA)是一个巧妙的折中方案。通过将输入特征沿通道维度分成两半,只对其中一半应用自注意力,另一半保持不变,计算复杂度直接减半。这种设计基于一个重要观察:并不是所有的特征通道都需要全局的长距离建模,有些通道可能更适合局部特征提取。通过让一部分通道专注于全局关系建模,另一部分通道保持原有的局部特征,模型既获得了长距离依赖建模的能力,又控制了计算开销。在自注意力之前添加7×7的深度卷积是为了在进行全局建模之前先融合局部空间信息,这样可以让自注意力机制在更丰富的特征表示基础上工作,提高建模效果。选择在第4阶段应用PSA是因为此时特征图分辨率最小,自注意力的计算开销相对可控,同时深层特征更需要全局上下文信息来理解复杂的语义关系。
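PSA的通道划分思想可以用NumPy写一个单头、省略投影的极简示意(仅演示"一半通道做注意力、另一半直通再拼接"的数据流,省略了论文中的1×1卷积、MHSA多头与FFN,非官方实现):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def partial_self_attention(x):
    """x: (H*W, C) feature map flattened over space.
    Attend over half the channels; pass the other half through unchanged."""
    n, c = x.shape
    x1, x2 = x[:, : c // 2], x[:, c // 2 :]        # channel split
    attn = softmax(x1 @ x1.T / np.sqrt(c // 2))    # (N, N) attention on x1 only
    out1 = attn @ x1                               # attended half
    return np.concatenate([out1, x2], axis=1)      # untouched half passes through

x = np.random.default_rng(0).standard_normal((16, 8))   # a 4x4 map, 8 channels
y = partial_self_attention(x)
print(y.shape)   # (16, 8); the x2 half is bit-identical to the input
```

注意N×N的注意力矩阵仍是全局的,PSA省下的是与通道数成正比的那部分计算,同时保留了完整的全局建模路径。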
Table 1: Comparisons with state-of-the-arts. Latency is measured using official pre-trained models. Latency f denotes the latency in the forward process of model without post-processing. † means the results of YOLOv10 with the original one-to-many training using NMS. All results below are measured on T4 GPU with TensorRT, FP16, and batch size 1.
【翻译】表1:与最先进方法的比较。延迟使用官方预训练模型测量。Latency f表示模型前向过程中不包含后处理的延迟。†表示使用NMS的原始一对多训练的YOLOv10结果。以下所有结果都在T4 GPU上使用TensorRT、FP16和批大小为1进行测量。
4 Experiments
4.1 Implementation Details
We select YOLOv8 [21] as our baseline model, due to its commendable latency-accuracy balance and its availability in various model sizes. We employ the consistent dual assignments for NMS-free training and perform holistic efficiency-accuracy driven model design based on it, which brings our YOLOv10 models. YOLOv10 has the same variants as YOLOv8, i.e., N/S/M/L/X. Besides, we derive a new variant YOLOv10-B, by simply increasing the width scale factor of YOLOv10-M. We verify the proposed detector on COCO [35] under the same training-from-scratch setting [21, 65, 62]. Moreover, the latencies of all models are tested on T4 GPU with TensorRT FP16, following [78].
【翻译】我们选择YOLOv8作为基线模型,因为它具有值得称赞的延迟-精度平衡以及多种模型尺寸的可用性。我们采用一致的双重分配进行无NMS训练,并在此基础上执行整体的效率-精度驱动的模型设计,从而产生了我们的YOLOv10模型。YOLOv10具有与YOLOv8相同的变体,即N/S/M/L/X。此外,我们通过简单地增加YOLOv10-M的宽度缩放因子,衍生出一个新的变体YOLOv10-B。我们在相同的从头训练设置下在COCO数据集上验证了所提出的检测器。此外,所有模型的延迟都在T4 GPU上使用TensorRT FP16进行测试。
4.2 Comparison with state-of-the-arts
As shown in Tab. 1, our YOLOv10 achieves the state-of-the-art performance and end-to-end latency across various model scales. We first compare YOLOv10 with our baseline models, i.e., YOLOv8. On the five variants N/S/M/L/X, our YOLOv10 achieves 1.2%/1.4%/0.5%/0.3%/0.5% AP improvements, with 28%/36%/41%/44%/57% fewer parameters, 23%/24%/25%/27%/38% fewer calculations, and 70%/65%/50%/41%/37% lower latencies. Compared with other YOLOs, YOLOv10 also exhibits superior trade-offs between accuracy and computational cost. Specifically, for lightweight and small models, YOLOv10-N/S outperforms YOLOv6-3.0-N/S by 1.5 AP and 2.0 AP, with 51%/61% fewer parameters and 41%/52% less computation, respectively. For medium models, compared with YOLOv9-C/YOLO-MS, YOLOv10-B/M enjoys the 46%/62% latency reduction under the same or better performance, respectively. For large models, compared with Gold-YOLO-L, our YOLOv10-L shows 68% fewer parameters and 32% lower latency, along with a significant improvement of 1.4% AP. Furthermore, compared with RT-DETR, YOLOv10 obtains significant performance and latency improvements. Notably, YOLOv10-S/X achieves 1.8× and 1.3× faster inference speed than RT-DETR-R18/R101, respectively, under the similar performance. These results well demonstrate the superiority of YOLOv10 as the real-time end-to-end detector.
【翻译】如表1所示,我们的YOLOv10在各种模型规模上都实现了最先进的性能和端到端延迟。我们首先将YOLOv10与我们的基线模型(即YOLOv8)进行比较。在N/S/M/L/X五个变体上,我们的YOLOv10实现了1.2%/1.4%/0.5%/0.3%/0.5%的AP改进,参数减少了28%/36%/41%/44%/57%,计算量减少了23%/24%/25%/27%/38%,延迟降低了70%/65%/50%/41%/37%。与其他YOLO相比,YOLOv10也表现出精度和计算成本之间的卓越权衡。具体来说,对于轻量级和小型模型,YOLOv10-N/S分别比YOLOv6-3.0-N/S高出1.5 AP和2.0 AP,参数减少了51%/61%,计算量减少了41%/52%。对于中等模型,与YOLOv9-C/YOLO-MS相比,YOLOv10-B/M在相同或更好的性能下分别享有46%/62%的延迟减少。对于大型模型,与Gold-YOLO-L相比,我们的YOLOv10-L显示出68%更少的参数和32%更低的延迟,同时显著改进了1.4%的AP。此外,与RT-DETR相比,YOLOv10获得了显著的性能和延迟改进。值得注意的是,YOLOv10-S/X在相似性能下分别比RT-DETR-R18/R101实现了1.8×和1.3×更快的推理速度。这些结果很好地证明了YOLOv10作为实时端到端检测器的优越性。
Table 2: Ablation study with YOLOv10-S and YOLOv10-M on COCO.
【翻译】表2:在COCO数据集上使用YOLOv10-S和YOLOv10-M进行的消融研究。
【翻译】表3:双重分配。表4:匹配度量。
【翻译】表5:YOLOv10-S/M的效率分析。
We also compare YOLOv10 with other YOLOs using the original one-to-many training approach. We consider the performance and the latency of the model forward process (Latency f) in this situation, following [62, 21, 60]. As shown in Tab. 1, YOLOv10 also exhibits the state-of-the-art performance and efficiency across different model scales, indicating the effectiveness of our architectural designs.
【翻译】我们还将YOLOv10与使用原始一对多训练方法的其他YOLO进行比较。在这种情况下,我们考虑模型前向过程的性能和延迟(Latency f),遵循[62, 21, 60]的做法。如表1所示,YOLOv10在不同模型规模上也表现出最先进的性能和效率,表明了我们架构设计的有效性。
4.3 Model Analyses
Ablation study. We present the ablation results based on YOLOv10-S and YOLOv10-M in Tab. 2. It can be observed that our NMS-free training with consistent dual assignments significantly reduces the end-to-end latency of YOLOv10-S by 4.63 ms, while maintaining competitive performance of 44.3% AP. Moreover, our efficiency driven model design leads to the reduction of 11.8 M parameters and 20.8 GFLOPs, with a considerable latency reduction of 0.65 ms for YOLOv10-M, well showing its effectiveness. Furthermore, our accuracy driven model design achieves the notable improvements of 1.8 AP and 0.7 AP for YOLOv10-S and YOLOv10-M, along with only 0.18 ms and 0.17 ms latency overhead, respectively, which well demonstrates its superiority.
【翻译】消融研究。我们在表2中展示了基于YOLOv10-S和YOLOv10-M的消融结果。可以观察到,我们采用一致双重分配的无NMS训练显著减少了YOLOv10-S的端到端延迟4.63 ms,同时保持了44.3% AP的竞争性能。此外,我们的效率驱动模型设计使YOLOv10-M减少了11.8 M参数和20.8 GFLOPs,延迟显著减少了0.65 ms,很好地显示了其有效性。进一步地,我们的精度驱动模型设计为YOLOv10-S和YOLOv10-M分别实现了1.8 AP和0.7 AP的显著改进,仅带来0.18 ms和0.17 ms的延迟开销,这很好地证明了其优越性。
Analyses for NMS-free training.
• Dual label assignments. We present dual label assignments for NMS-free YOLOs, which can bring both rich supervision of the one-to-many (o2m) branch during training and high efficiency of the one-to-one (o2o) branch during inference. We verify its benefit based on YOLOv8-S, i.e., #1 in Tab. 2. Specifically, we introduce baselines for training with only the o2m branch and only the o2o branch, respectively. As shown in Tab. 3, our dual label assignments achieve the best AP-latency trade-off. • Consistent matching metric. We introduce the consistent matching metric to make the one-to-one head more harmonious with the one-to-many head. We verify its benefit based on YOLOv8-S, i.e., #1 in Tab. 2, under different α_o2o and β_o2o. As shown in Tab. 4, the proposed consistent matching metric, α_o2o = r·α_o2m and β_o2o = r·β_o2m, can achieve the optimal performance, where α_o2m = 0.5 and β_o2m = 6.0 in the one-to-many head [21]. Such an improvement can be attributed to the reduction of the supervision gap (Eq. (2)), which provides improved supervision alignment between the two branches. Moreover, the proposed consistent matching metric eliminates the need for exhaustive hyper-parameter tuning, which is appealing in practical scenarios.
【翻译】• 双重标签分配。我们为无NMS的YOLO提出了双重标签分配,它可以在训练期间带来一对多(o2m)分支的丰富监督,在推理期间带来一对一(o2o)分支的高效率。我们基于YOLOv8-S(即表2中的#1)验证了其好处。具体来说,我们分别引入了仅使用o2m分支和仅使用o2o分支进行训练的基线。如表3所示,我们的双重标签分配实现了最佳的AP-延迟权衡。• 一致匹配度量。我们引入一致匹配度量,使一对一头与一对多头更加协调。我们基于YOLOv8-S(即表2中的#1)在不同的α_o2o和β_o2o下验证了其好处。如表4所示,所提出的一致匹配度量α_o2o = r·α_o2m和β_o2o = r·β_o2m可以实现最优性能,其中一对多头中α_o2m = 0.5,β_o2m = 6.0。这种改进可以归因于监督差距的减少(公式(2)),它提供了两个分支之间改进的监督对齐。此外,所提出的一致匹配度量消除了对详尽超参数调优的需要,这在实际场景中很有吸引力。
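一致匹配度量带来的"两个分支排序一致"的性质可以用几行代码验证(度量形式m = s·p^α·IoU^β取自论文的定义,其中s为空间先验指示、p为分类分数;下面的候选框分数是虚构数值,仅作演示):

```python
# Consistent matching metric sketch: with alpha_o2o = r*alpha_o2m and
# beta_o2o = r*beta_o2m, the o2o head's top pick agrees with the o2m head.

def match_metric(p, iou, alpha, beta, s=1.0):
    """m(alpha, beta) = s * p**alpha * IoU**beta."""
    return s * (p ** alpha) * (iou ** beta)

a_o2m, b_o2m = 0.5, 6.0                  # one-to-many values from the paper
r = 1.0
a_o2o, b_o2o = r * a_o2m, r * b_o2m      # the consistent o2o setting

preds = [(0.9, 0.6), (0.7, 0.8), (0.5, 0.9)]   # hypothetical (score, IoU) pairs
rank_o2m = sorted(preds, key=lambda q: -match_metric(*q, a_o2m, b_o2m))
rank_o2o = sorted(preds, key=lambda q: -match_metric(*q, a_o2o, b_o2o))
print(rank_o2m == rank_o2o)   # True: both heads rank candidates identically
```

由于幂函数在r > 0时保持单调,任意r下两个度量给出的排序都一致,一对一分支选中的最优样本必然落在一对多分支的高质量样本集合中,这正是正文所说的监督差距缩小。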
Figure 4: The average cosine similarity of each anchor point’s extracted features with all others.
【翻译】图4:每个锚点提取特征与所有其他锚点特征的平均余弦相似度。
• Performance gap compared with one-to-many training. Although achieving superior end-to-end performance under NMS-free training, we observe that there still exists a performance gap compared with the original one-to-many training using NMS, as shown in Tab. 3 and Tab. 1. Besides, we note that the gap diminishes as the model size increases. Therefore, we reasonably conclude that such a gap can be attributed to the limitations in the model capability. Notably, unlike the original one-to-many training using NMS, the NMS-free training necessitates more discriminative features for one-to-one matching. In the case of the YOLOv10-N model, its limited capacity results in extracted features that lack sufficient discriminability, leading to a more notable performance gap of 1.0% AP. In contrast, the YOLOv10-X model, which possesses stronger capability and more discriminative features, shows no performance gap between the two training strategies. In Fig. 4, we visualize the average cosine similarity of each anchor point's extracted features with those of all other anchor points on the COCO val set. We observe that as the model size increases, the feature similarity between anchor points exhibits a downward trend, which benefits the one-to-one matching. Based on this insight, we will explore approaches to further reduce the gap and achieve higher end-to-end performance in future work.
【翻译】• 与一对多训练相比的性能差距。尽管在无NMS训练下实现了优越的端到端性能,我们观察到与使用NMS的原始一对多训练相比仍然存在性能差距,如表3和表1所示。此外,我们注意到随着模型规模的增加,这种差距会缩小。因此,我们合理地得出结论,这种差距可以归因于模型能力的限制。值得注意的是,与使用NMS的原始一对多训练不同,无NMS训练需要更具判别性的特征来进行一对一匹配。在YOLOv10-N模型的情况下,其有限的容量导致提取的特征缺乏足够的判别性,从而导致更显著的1.0% AP性能差距。相比之下,具有更强能力和更具判别性特征的YOLOv10-X模型在两种训练策略之间没有显示性能差距。在图4中,我们可视化了每个锚点提取特征与COCO验证集上所有其他锚点特征的平均余弦相似度。我们观察到随着模型规模的增加,锚点之间的特征相似性呈现下降趋势,这有利于一对一匹配。基于这一洞察,我们将在未来的工作中探索进一步减少差距并实现更高端到端性能的方法。
【解析】注意无NMS其实在N/S小规模模型上,与使用NMS的原始一对多训练相比,是会掉点的,作者说是因为小规模模型的容量导致提取的特征缺乏足够的判别性,事实上确实如此,尤其是面对特征复杂的检测场景,例如裂缝等不规则多尺度目标。
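图4所用的诊断量(每个锚点特征与其余所有锚点特征的平均余弦相似度)计算方式大致如下(特征矩阵为随机示例,非真实模型输出):

```python
import numpy as np

def avg_pairwise_cosine(feats):
    """feats: (N, D) per-anchor features. Mean cosine similarity over all
    ordered pairs, excluding each anchor's similarity with itself."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T                                    # (N, N) cosine similarities
    n = len(f)
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))

feats = np.random.default_rng(0).standard_normal((100, 64))
print(avg_pairwise_cosine(feats))   # lower values ease one-to-one matching
```

平均相似度越低,说明各锚点特征越可区分,一对一匹配越容易选出唯一的正样本,这与正文"大模型特征相似度呈下降趋势"的观察相呼应。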
Analyses for efficiency driven model design. We conduct experiments to gradually incorporate the efficiency driven design elements based on YOLOv10-S/M. Our baseline is the YOLOv10-S/M model without efficiency-accuracy driven model design, i.e., #2/#6 in Tab. 2. As shown in Tab. 5, each design component, including the lightweight classification head, spatial-channel decoupled downsampling, and rank-guided block design, contributes to the reduction of the parameter count, FLOPs, and latency. Importantly, these improvements are achieved while maintaining competitive performance.
【翻译】效率驱动模型设计分析。我们进行实验,基于YOLOv10-S/M逐步整合效率驱动的设计元素。我们的基线是没有效率-精度驱动模型设计的YOLOv10-S/M模型,即表2中的#2/#6。如表5所示,每个设计组件,包括轻量级分类头、空间-通道解耦下采样和秩引导块设计,都有助于减少参数数量、FLOPs和延迟。重要的是,这些改进是在保持竞争性能的同时实现的。
• Lightweight classification head. We analyze the impact of category and localization errors of predictions on the performance, based on the YOLOv10-S of #1 and #2 in Tab. 5, like [7]. Specifically, we match the predictions to the instances by the one-to-one assignment. Then, we substitute the predicted category score with instance labels, resulting in AP^val_{w/o c} with no classification errors. Similarly, we replace the predicted locations with those of instances, yielding AP^val_{w/o r} with no regression errors. As shown in Tab. 6, AP^val_{w/o r} is much higher than AP^val_{w/o c}, revealing that eliminating the regression errors achieves greater improvement. The performance bottleneck thus lies more in the regression task. Therefore, adopting the lightweight classification head can allow higher efficiency without compromising the performance.
【翻译】• 轻量级分类头。我们基于表5中YOLOv10-S的#1和#2,分析预测的类别和定位误差对性能的影响,类似于[7]。具体来说,我们通过一对一分配将预测与实例匹配。然后,我们用实例标签替换预测的类别分数,得到没有分类误差的AP^val_{w/o c}。类似地,我们用实例的位置替换预测的位置,得到没有回归误差的AP^val_{w/o r}。如表6所示,AP^val_{w/o r}远高于AP^val_{w/o c},表明消除回归误差能实现更大的改进,性能瓶颈更多地在于回归任务。因此,采用轻量级分类头可以在不影响性能的情况下实现更高的效率。
• Spatial-channel decoupled downsampling. We decouple the downsampling operations for efficiency, where the channel dimensions are first increased by pointwise convolution (PW) and the resolution is then reduced by depthwise convolution (DW) for maximal information retention. We compare it with the baseline way of spatial reduction by DW followed by channel modulation by PW, based on the YOLOv10-S of #3 in Tab. 5. As shown in Tab. 7, our downsampling strategy achieves a 0.7% AP improvement by enjoying less information loss during downsampling.
【翻译】• 空间-通道解耦下采样。我们为了效率而解耦下采样操作,其中通道维度首先通过逐点卷积(PW)增加,然后通过深度卷积(DW)降低分辨率以实现最大信息保留。我们基于表5中YOLOv10-S的#3,将其与先通过DW进行空间缩减、再通过PW进行通道调制的基线方法进行比较。如表7所示,我们的下采样策略通过在下采样过程中减少信息损失,实现了0.7%的AP改进。
• Compact inverted block (CIB). We introduce CIB as the compact basic building block. We verify its effectiveness based on the YOLOv10-S of #4 in Tab. 5. Specifically, we introduce the inverted residual block [51] (IRB) as the baseline, which achieves the suboptimal 43.7% AP, as shown in Tab. 8. We then append a 3×3 depthwise convolution (DW) after it, denoted as "IRB-DW", which brings a 0.5% AP improvement. Compared with "IRB-DW", our CIB further achieves a 0.3% AP improvement by prepending another DW with minimal overhead, indicating its superiority.
【翻译】• 紧凑倒置块(CIB)。我们引入CIB作为紧凑的基本构建块。我们基于表5中YOLOv10-S的#4验证其有效性。具体来说,我们引入倒置残差块(IRB)作为基线,它实现了次优的43.7% AP,如表8所示。然后我们在其后附加一个3×3深度卷积(DW),记为"IRB-DW",这带来了0.5%的AP改进。与"IRB-DW"相比,我们的CIB通过在前面再添加一个DW,以最小的开销进一步实现了0.3%的AP改进,表明了其优越性。
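按照上述消融的构造方式(在IRB前后各加一个3×3深度卷积得到CIB),可以粗略比较CIB式堆叠与标准瓶颈块(两个3×3全卷积)的参数量(示意代码,通道数128和扩张比2均为假设值,并非论文的真实配置):

```python
# Parameter sketch: standard bottleneck (two full 3x3 convs) vs. a CIB-style
# stack: 3x3 DW -> 1x1 PW expand -> 3x3 DW -> 1x1 PW project -> 3x3 DW.

def bottleneck_params(c, k=3):
    """Two full k x k convolutions at width c (bias omitted)."""
    return 2 * (k * k * c * c)

def cib_params(c, expand=2, k=3):
    dw_pre = k * k * c                          # prepended depthwise conv
    pw_up = c * (expand * c)                    # 1x1 expand
    dw_mid = k * k * (expand * c)               # depthwise conv in the IRB
    pw_down = (expand * c) * c                  # 1x1 project
    dw_post = k * k * c                         # appended depthwise conv
    return dw_pre + pw_up + dw_mid + pw_down + dw_post

c = 128
print(bottleneck_params(c), cib_params(c))   # 294912 vs 70144
```

在这个假设宽度下,CIB式堆叠的参数量只有标准瓶颈块的四分之一左右:空间混合交给廉价的DW,通道混合交给PW,这正是CIB"紧凑"的含义。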
Table 10: Accuracy results for S/M. Table 11: Large-kernel (L.k.) results. Table 12: L.k. usage. Table 13: PSA results.
【翻译】表10:S/M的精度。表11:L.k.结果。表12:L.k.使用情况。表13:PSA结果。
• Rank-guided block design. We introduce the rank-guided block design to adaptively integrate the compact block design for improving the model efficiency. We verify its benefit based on the YOLOv10-S of #3 in Tab. 5. The stages sorted in ascending order based on the intrinsic ranks are Stage 8-4-7-3-5-1-6-2, as in Fig. 3.(a). As shown in Tab. 9, when gradually replacing the bottleneck block in each stage with the efficient CIB, we observe the performance degradation starting from Stage 7. In Stage 8 and 4, with lower intrinsic ranks and more redundancy, we can thus adopt the efficient block design without compromising the performance. These results indicate that rank-guided block design can serve as an effective strategy for higher model efficiency.
【翻译】• 秩引导块设计。我们引入秩引导块设计来自适应地整合紧凑块设计以提高模型效率。我们基于表5中YOLOv10-S的#3验证其好处。基于内在秩按升序排列的阶段是Stage 8-4-7-3-5-1-6-2,如图3.(a)所示。如表9所示,当逐渐用高效的CIB替换每个阶段的瓶颈块时,我们观察到从Stage 7开始出现性能下降。在具有较低内在秩和更多冗余的Stage 8和4中,我们因此可以采用高效的块设计而不影响性能。这些结果表明,秩引导块设计可以作为提高模型效率的有效策略。
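秩引导分配的贪心流程可以抽象为如下示意代码(train_and_eval在此用一个玩具AP表模拟,真实流程需要对每个候选配置重新训练并评估;阶段顺序取自上文YOLOv10-S的8-4-7-3-5-1-6-2):

```python
# Greedy rank-guided allocation sketch: walk stages from the most redundant
# (lowest intrinsic rank) upward, swap in the compact block, and halt at the
# first accuracy drop.

def rank_guided_allocation(stages_by_rank, baseline_ap, train_and_eval, tol=0.0):
    """stages_by_rank: stage ids in ascending intrinsic rank.
    train_and_eval: callable mapping a list of replaced stages to an AP."""
    replaced = []
    for stage in stages_by_rank:
        ap = train_and_eval(replaced + [stage])
        if ap < baseline_ap - tol:       # performance degraded: stop here
            break
        replaced.append(stage)
    return replaced

# Toy simulation mirroring the ablation: degradation starts once Stage 7
# joins the replacement set, so only Stages 8 and 4 are kept.
order = [8, 4, 7, 3, 5, 1, 6, 2]
sim_ap = lambda replaced: 44.3 if set(replaced) <= {8, 4} else 44.0

print(rank_guided_allocation(order, 44.3, sim_ap))   # [8, 4]
```

这个贪心策略的代价是线性次数的重训练(最多每个阶段一次),换来的是对每个模型规模自适应的紧凑块分配。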
Analyses for accuracy driven model design. We present the results of gradually integrating the accuracy driven design elements based on YOLOv10-S/M. Our baseline is the YOLOv10-S/M model after incorporating the efficiency driven design, i.e., #3/#7 in Tab. 2. As shown in Tab. 10, the adoption of the large-kernel convolution and the PSA module leads to considerable performance improvements of 0.4% AP and 1.4% AP for YOLOv10-S, under minimal latency increases of 0.03 ms and 0.15 ms, respectively. Note that the large-kernel convolution is not employed for YOLOv10-M (see Tab. 12).
【翻译】精度驱动模型设计分析。我们展示了基于YOLOv10-S/M逐步整合精度驱动设计元素的结果。我们的基线是整合效率驱动设计后的YOLOv10-S/M模型,即表2中的#3/#7。如表10所示,采用大核卷积和PSA模块为YOLOv10-S带来了0.4% AP和1.4% AP的显著性能改进,延迟仅分别增加0.03 ms和0.15 ms。注意,YOLOv10-M没有采用大核卷积(见表12)。
• Large-kernel convolution. We first investigate the effect of different kernel sizes based on the YOLOv10-S of #2 in Tab. 10. As shown in Tab. 11, the performance improves as the kernel size increases and stagnates around the kernel size of 7×7, indicating the benefit of a large receptive field. Besides, removing the reparameterization branch during training causes a 0.1% AP degradation, showing its effectiveness for optimization. Moreover, we inspect the benefit of the large-kernel convolution across model scales based on YOLOv10-N/S/M. As shown in Tab. 12, it brings no improvements for large models, i.e., YOLOv10-M, due to its inherent extensive receptive field. We thus only adopt large-kernel convolutions for small models, i.e., YOLOv10-N/S.
【翻译】• 大核卷积。我们首先基于表10中YOLOv10-S的#2研究不同核大小的影响。如表11所示,性能随着核大小的增加而改善,并在核大小为7×7左右停滞,表明了大感受野的好处。此外,在训练期间移除重参数化分支会导致0.1%的AP下降,显示了其对优化的有效性。此外,我们基于YOLOv10-N/S/M检查大核卷积在不同模型规模上的好处。如表12所示,由于其固有的广泛感受野,它对大模型(即YOLOv10-M)没有带来改进。因此,我们只对小模型(即YOLOv10-N/S)采用大核卷积。
• Partial self-attention (PSA). We introduce PSA to enhance the performance by incorporating the global modeling ability under minimal cost. We first verify its effectiveness based on the YOLOv10-S of #3 in Tab. 10. Specifically, we introduce the transformer block, i.e., MHSA followed by FFN, as the baseline, denoted as "Trans.". As shown in Tab. 13, compared with it, PSA brings a 0.3% AP improvement with 0.05 ms latency reduction. The performance enhancement may be attributed to the alleviation of the optimization problem [68, 10] in self-attention, by mitigating the redundancy in attention heads. Moreover, we investigate the impact of different N_PSA. As shown in Tab. 13, increasing N_PSA to 2 obtains a 0.2% AP improvement but with 0.1 ms latency overhead. Therefore, we set N_PSA to 1, by default, to enhance the model capability while maintaining high efficiency.
【翻译】• 部分自注意力(PSA)。我们引入PSA,通过在最小成本下整合全局建模能力来增强性能。我们首先基于表10中YOLOv10-S的#3验证其有效性。具体来说,我们引入Transformer块,即MHSA后跟FFN,作为基线,记为"Trans."。如表13所示,与之相比,PSA带来了0.3%的AP改进和0.05 ms的延迟减少。性能增强可能归因于通过减轻注意力头中的冗余来缓解自注意力中的优化问题[68, 10]。此外,我们研究了不同N_PSA的影响。如表13所示,将N_PSA增加到2获得了0.2%的AP改进,但带来了0.1 ms的延迟开销。因此,我们默认将N_PSA设置为1,以在保持高效率的同时增强模型能力。
5 Conclusion
In this paper, we target both the post-processing and the model architecture throughout the detection pipeline of YOLOs. For the post-processing, we propose the consistent dual assignments for NMS-free training, achieving efficient end-to-end detection. For the model architecture, we introduce the holistic efficiency-accuracy driven model design strategy, improving the performance-efficiency trade-offs. These bring our YOLOv10, a new real-time end-to-end object detector. Extensive experiments show that YOLOv10 achieves the state-of-the-art performance and latency compared with other advanced detectors, well demonstrating its superiority.
【翻译】在本文中,我们针对YOLO检测流水线中的后处理和模型架构。对于后处理,我们提出了用于无NMS训练的一致双重分配,实现了高效的端到端检测。对于模型架构,我们引入了整体的效率-精度驱动模型设计策略,改善了性能-效率权衡。这些带来了我们的YOLOv10,一个新的实时端到端目标检测器。广泛的实验表明,与其他先进检测器相比,YOLOv10实现了最先进的性能和延迟,很好地证明了其优越性。