Abstract
We introduce Hyper-YOLO, a new object detection method that integrates hypergraph computation to capture the complex high-order correlations among visual features. Traditional YOLO models, while powerful, have limitations in their neck designs that restrict the integration of cross-level features and the exploitation of high-order feature interrelationships. To address these challenges, we propose the Hypergraph Computation Empowered Semantic Collecting and Scattering (HGC-SCS) framework, which transposes visual feature maps into a semantic space and constructs a hypergraph for high-order message propagation. This enables the model to acquire both semantic and structural information, moving beyond conventional feature-centric learning. Hyper-YOLO incorporates the proposed Mixed Aggregation Network (MANet) in its backbone for enhanced feature extraction and introduces the Hypergraph-Based Cross-Level and Cross-Position Representation Network (HyperC2Net) in its neck. HyperC2Net operates across five scales and breaks free from traditional grid structures, allowing sophisticated high-order interactions across levels and positions. This synergy of components positions Hyper-YOLO as a state-of-the-art architecture at various model scales, as evidenced by its superior performance on the COCO dataset. Specifically, Hyper-YOLO-N achieves a 12% $\mathrm{AP}^{val}$ improvement over the advanced YOLOv8-N and a 9% $\mathrm{AP}^{val}$ improvement over YOLOv9-T. The source code is available at https://github.com/iMoonLab/Hyper-YOLO.
Index Terms—Object detection, hypergraph, hypergraph neural networks, hypergraph computation
I. INTRODUCTION
The YOLO series [1]-[11] is a leading family of object detection methods whose versatility serves a wide range of applications. A YOLO architecture consists of two main components: the backbone [7], [12]-[14] and the neck [10], [15], [16]. The backbone, which has been studied extensively, extracts the basic visual features, while the neck fuses multi-scale features and thereby provides a solid foundation for detecting objects of different sizes. This paper places particular emphasis on the neck, since it is crucial for strengthening the model's ability to detect objects across scales.
Contemporary YOLO models adopt the Path Aggregation Network (PANet) [16] as the neck, which promotes thorough fusion of information across scales through top-down and bottom-up pathways. However, PANet is largely confined to fusing features between adjacent levels and does not adequately address cross-level feature integration. In contrast, the gather-and-distribute neck design exemplified by Gold-YOLO [10] facilitates inter-level information exchange, yet it still falls short of enabling cross-position interactions within feature maps. Moreover, it leaves the potential of feature interrelationships, especially high-order correlations, largely untapped. High-order correlations refer to the complex and often nonlinear relationships among features at different scales, positions, and semantic levels, which are essential for understanding deep context and interactions in visual data. It has been found that the joint representation of low-level visual features and their correlations plays a key role in object detection; combining these basic features with high-level semantic information is essential for accurately recognizing and localizing objects in a given scene. In many computer vision tasks, exploring the high-order correlations behind low-level features for semantic analysis remains a challenging yet crucial topic, and the frequent neglect of such high-order relationship mining can limit the performance of visual tasks.
In practice, hypergraphs [17], [18] are often used to represent complex high-order correlations because they are more expressive than simple graphs. An edge in a simple graph can only connect two vertices, which severely limits its expressive power, whereas a hyperedge in a hypergraph can connect two or more vertices and thus model more complex high-order relationships. Compared with simple graphs, hypergraphs capture richer interactions among multiple entities, which is essential for tasks that require understanding intricate multi-way relationships, such as object detection in computer vision, where cross-level and cross-position correlations among feature maps are crucial.
Unlike most previous work, which focuses on strengthening the feature-extraction backbone, this paper proposes the Hypergraph Computation Empowered Semantic Collecting and Scattering (HGC-SCS) framework. The framework transposes the feature maps extracted by the visual backbone into an abstract semantic space and constructs a sophisticated hypergraph structure to enhance them. The hypergraph serves as a conduit for high-order message propagation among features in the semantic space. In this way, the visual backbone acquires the dual capability of fusing semantic and complex structural information, overcoming the limitations of conventional semantics-only feature learning and lifting performance beyond its traditional range.
Building on the HGC-SCS framework, this paper presents Hyper-YOLO, a novel YOLO method empowered by hypergraph computation. Hyper-YOLO is the first to integrate hypergraph computation into the neck of a visual object detection network. By modeling the complex high-order correlations inherent in the feature maps extracted by the backbone, Hyper-YOLO markedly improves detection performance. For the backbone, Hyper-YOLO adopts the Mixed Aggregation Network (MANet), which blends three distinct basic structures to enrich the information flow and strengthen feature extraction, building on YOLOv8. For the neck, we instantiate the proposed HGC-SCS framework as a multi-scale feature fusion neck, the Hypergraph-Based Cross-Level and Cross-Position Representation Network (HyperC2Net). Unlike conventional neck designs, HyperC2Net fuses features from five distinct scales while breaking the grid structure of visual feature maps to enable cross-level and cross-position high-order message propagation. Together, the improvements to the backbone and neck make Hyper-YOLO a pioneering architecture. Empirical results on the COCO dataset (fig. 1) demonstrate its significant performance advantages, confirming the effectiveness of this sophisticated approach in advancing object detection. Our contributions can be summarized as follows:
1) We propose the Hypergraph Computation Empowered Semantic Collecting and Scattering (HGC-SCS) framework, which enhances visual backbones through high-order information modeling and learning.
2) Using the proposed HGC-SCS framework, we develop HyperC2Net, an object detection neck that enables high-order message passing across semantic levels and positions, significantly improving the neck's ability to extract high-order features.
3) We propose the Mixed Aggregation Network (MANet), which blends three types of modules to enrich the information flow and thereby strengthens the feature-extraction capability of the backbone.
4) We propose Hyper-YOLO, which incorporates hypergraph computation to enhance the model's perception of high-order information and thus improve object detection performance. Specifically, our Hyper-YOLO achieves improvements of 12% over YOLOv8-N and 9% over YOLOv9-T on the COCO dataset.
II. RELATED WORK
A. YOLO Series Object Detectors
The YOLO series is a cornerstone of real-time object detection, evolving from the single-stage detection of YOLOv1 [1] to the performance-optimized YOLOv8 [8]. Each iteration has brought notable advances, from the structural optimizations of YOLOv4 [3] to the E-ELAN backbone of YOLOv7 [7]. YOLOX [9] introduced anchor-free detection, and Gold-YOLO [10] strengthened feature fusion with its gather-and-distribute mechanism. Despite the emergence of detectors such as RT-DETR [19], the YOLO series remains prevalent, in part because it effectively employs CSPNet, ELAN [14], and improved Path Aggregation Networks (PANet) [16] or Feature Pyramid Networks (FPN) [15] for feature integration, together with the advanced prediction heads of YOLOv3 [2] and FCOS [20]. YOLOv9 [21] introduces programmable gradient information and the Generalized Efficient Layer Aggregation Network to reduce information loss during propagation through deep networks. Building on these YOLO methods, this paper proposes Hyper-YOLO, an advanced approach that leverages hypergraph computation to empower the YOLO framework with complex correlation learning. Hyper-YOLO aims to improve the learning and integration of hierarchical features and push the boundaries of object detection performance.
B. Hypergraph Learning Methods
Hypergraphs [17], [18] can capture such complex high-order correlations. By connecting multiple nodes through hyperedges, hypergraphs excel at modeling intricate relationships, as demonstrated by their applications in social network analysis [22], [23], drug-target interaction modeling [24], [25], and brain network analysis [26], [27]. Hypergraph learning has become a powerful tool for capturing complex high-order correlations in data that conventional graph-based techniques may fail to represent adequately. As discussed by Gao et al. [17], the notion of a hyperedge facilitates the modeling of these complex relationships by allowing multiple nodes to interact simultaneously. Hypergraph Neural Networks (HGNN) [28] exploit such relationships and enable direct learning from hypergraph structures via spectral methods. Building on this, the general hypergraph neural network (HGNN+) [18] introduces a spatial approach to high-order message propagation among vertices, further extending the capability of hypergraph learning. Despite these advances, the application of hypergraph learning to computer vision remains relatively underexplored, particularly for modeling and learning high-order correlations. In this paper, we investigate how hypergraph computation can be exploited for object detection, aiming to improve classification and localization accuracy by integrating the nuanced relational information modeled by hypergraphs.
III. HYPERGRAPH COMPUTATION EMPOWERED SEMANTIC COLLECTING AND SCATTERING FRAMEWORK
Unlike representation learning in computer vision, which deals only with visual features, hypergraph computation methods [18], [28] process both features and high-order structures. Most hypergraph computation methods rely on an inherent hypergraph structure, which is not directly available in most computer vision scenarios. Here we outline a general paradigm for hypergraph computation in computer vision, comprising hypergraph construction and hypergraph convolution. Given a feature map $X$ extracted by a neural network, a hypergraph construction function $f: X \to \mathcal{G}$ is used to estimate the potential high-order correlations among feature points in the semantic space. Spectral or spatial hypergraph convolution is then applied to propagate high-order messages among feature points through the hypergraph structure, yielding high-order features denoted $X_{hyper}$. By infusing high-order relational information into $X_{hyper}$, this hypergraph computation strategy remedies the lack of high-order correlations in the original feature map $X$. The resulting hybrid feature map, obtained by fusing $X$ and $X_{hyper}$, culminates in a semantically enriched visual representation that characterizes visual features from both the semantic and the high-order structural perspective.
Building on this, we design a general framework for hypergraph computation in computer vision, named the Hypergraph Computation Empowered Semantic Collecting and Scattering (HGC-SCS) framework. Given feature maps extracted by a convolutional neural network (CNN) [29]-[34] or another backbone, the framework first collects and fuses these features to construct a mixed feature bag $X_{mixed}$ in the semantic space. In the second step, it estimates the potential high-order correlations and constructs a hypergraph structure in the semantic space; to fully exploit this high-order structural information, existing hypergraph computation methods [18], [28] can be applied, producing high-order-aware features $X_{hyper}$ that contain both high-order structural and semantic information. In the final step, the high-order structural information is scattered back to each input feature map. The HGC-SCS framework can be formulated as:
$$
\left\{
\begin{aligned}
X_{mixed} &= X_1 \,\|\, X_2 \,\|\, \cdots && \textit{// Semantic Collecting}\\
X_{hyper} &= \operatorname{HyperComputation}(X_{mixed}) && \textit{// Hypergraph Computation}\\
X_1^{\prime}, X_2^{\prime}, \cdots &= \phi(X_{hyper}, X_1),\, \phi(X_{hyper}, X_2),\, \cdots && \textit{// Semantic Scattering}
\end{aligned}
\right.
$$
where $\{X_1, X_2, \cdots\}$ denotes the basic feature maps produced by the visual backbone. "HyperComputation" denotes the second step, consisting of hypergraph construction and hypergraph convolution, which captures the potential high-order structural information in the semantic space and produces the high-order-aware features $X_{hyper}$. In the last line, $\phi(\cdot)$ denotes the feature fusion function, and $\{X_1^{\prime}, X_2^{\prime}, \cdots\}$ denotes the enhanced visual feature maps. In the following, we introduce an instantiation of the HGC-SCS framework for object detection, named HyperC2Net.
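To make the three phases concrete, the following minimal PyTorch-style sketch traces the data flow of the framework above. It is our illustration rather than the released implementation: the backbone features are assumed to be already flattened into point sets, and `hyper_computation` and `phi` stand for whatever hypergraph-computation routine and fusion function a particular instantiation chooses.

```python
import torch

def hgc_scs(feats, hyper_computation, phi):
    """Minimal sketch of the HGC-SCS data flow above (our illustration, not released code).

    feats:             backbone feature maps, each already flattened to an (N_i, C) point set.
    hyper_computation: any hypergraph construction + convolution routine on (N, C) points.
    phi:               the fusion function used in the Semantic Scattering step.
    """
    # 1) Semantic Collecting: gather all feature points into one mixed bag.
    x_mixed = torch.cat(feats, dim=0)                 # (sum N_i, C)
    # 2) Hypergraph Computation: high-order message passing in the semantic space.
    x_hyper = hyper_computation(x_mixed)              # (sum N_i, C)
    # 3) Semantic Scattering: hand the corresponding high-order slice back to every input.
    outs, start = [], 0
    for x in feats:
        n = x.shape[0]
        outs.append(phi(x_hyper[start:start + n], x))
        start += n
    return outs
```

For instance, phi could be as simple as phi(h, x) = x + h, which adds the scattered high-order features back onto the original points.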
IV. METHOD
In this section, we first introduce the preliminaries of YOLO-related notation and the overall Hyper-YOLO framework. We then elaborate on the two proposed core modules, the basic block of Hyper-YOLO (MANet) and its neck (HyperC2Net). Finally, we analyze the relationship between Hyper-YOLO and other YOLO methods.
A. Preliminaries
YOLO-series methods [1]-[5], [7], [8], [21], [35]-[39] typically consist of two main components: a backbone and a neck. The backbone [40], [13] extracts the basic visual features, while the neck [15], [16], [19] fuses multi-scale features for the final object detection. This paper proposes enhancements for both components. For ease of description, we denote the three-scale outputs of the neck as $\{N_3, N_4, N_5\}$, corresponding to small-, medium-, and large-scale detection, respectively. We further divide the feature extraction of the backbone into five stages, $\{B_1, B_2, B_3, B_4, B_5\}$, representing features at different semantic levels; a larger index indicates a higher-level semantic feature extracted by a deeper layer of the network. More details are provided in section A.
B. Overview of Hyper-YOLO
Our Hyper-YOLO framework retains the overall architecture of typical YOLO methods, comprising a backbone and a neck, as illustrated in fig. S1. Given an image, the backbone of Hyper-YOLO uses the proposed MANet as its core computational block, which enhances the feature discrimination of the conventional C2f module in YOLOv8 [8]. Unlike conventional YOLO architectures, Hyper-YOLO works with a set of five basic feature maps $\{B_1, B_2, B_3, B_4, B_5\}$. In a novel step, the neck of Hyper-YOLO (HyperC2Net) builds on hypergraph computation theory to integrate cross-level and cross-position information from these five feature sets, ultimately producing the final semantic features at three scales, $\{N_3, N_4, N_5\}$. These hierarchically structured semantic features are then used for the final object detection task.
C. Mixed Aggregation Network
For the backbone of Hyper-YOLO, we design the Mixed Aggregation Network (MANet) to strengthen the feature extraction of the base network, as shown in fig. 2. The architecture synergistically blends three typical convolutional variants: a $1\times1$ bypass convolution for channel-wise feature recalibration, a depthwise separable convolution (DSConv) for efficient spatial feature processing, and a C2f module for enhanced hierarchical feature integration. This blend yields a more diverse and richer gradient flow during training, significantly deepening the semantic content of the basic features at the five key stages. Our MANet can be formulated as:
$$
\left\{
\begin{aligned}
X_{mid} &= \operatorname{Conv}_1(X_{in})\\
X_1 &= \operatorname{Conv}_2(X_{mid})\\
X_2 &= \operatorname{DSConv}(\operatorname{Conv}_3(X_{mid}))\\
X_3, X_4 &= \operatorname{Split}(X_{mid})\\
X_5 &= \operatorname{ConvNeck}_1(X_4) + X_4\\
X_6 &= \operatorname{ConvNeck}_2(X_5) + X_5\\
&\;\;\vdots\\
X_{4+n} &= \operatorname{ConvNeck}_n(X_{3+n}) + X_{3+n}
\end{aligned}
\right.
$$
where $X_{mid}$ has $2c$ channels and $X_1, X_2, \ldots, X_{4+n}$ each have $c$ channels. Finally, we fuse and compress the semantic information of the three types of features through concatenation and a $1\times1$ convolution, producing $X_{out}$ with $2c$ channels, as follows:
$$X_{out} = \operatorname{Conv}_{o}(X_1 \,\|\, X_2 \,\|\, \cdots \,\|\, X_{4+n}).$$
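A minimal PyTorch sketch of a MANet block following the equations above is given below. The normalization, activation, and the exact design of the ConvNeck bottleneck are our assumptions (BatchNorm + SiLU, a two-convolution residual bottleneck); only the branch layout mirrors the formulation.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1, g=1):
    """Conv + BN + SiLU helper; the normalization/activation choice is our assumption."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, stride=1, padding=k // 2, groups=g, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(inplace=True),
    )

class ConvNeck(nn.Module):
    """A residual bottleneck standing in for ConvNeck_i (exact design assumed)."""
    def __init__(self, c, k=3):
        super().__init__()
        self.block = nn.Sequential(conv_bn_act(c, c, k), conv_bn_act(c, c, k))

    def forward(self, x):
        return x + self.block(x)

class MANet(nn.Module):
    """Sketch of a Mixed Aggregation Network block following the equations above."""
    def __init__(self, c_in, c_out, n=2, k=3):
        super().__init__()
        c = c_out // 2                                   # every branch outputs c channels
        self.conv1 = conv_bn_act(c_in, 2 * c, 1)         # X_mid with 2c channels
        self.conv2 = conv_bn_act(2 * c, c, 1)            # 1x1 bypass branch -> X_1
        self.conv3 = conv_bn_act(2 * c, c, 1)
        self.dsconv = nn.Sequential(                     # depthwise separable branch -> X_2
            conv_bn_act(c, c, 3, g=c),                   # depthwise 3x3
            conv_bn_act(c, c, 1),                        # pointwise 1x1
        )
        self.necks = nn.ModuleList(ConvNeck(c, k) for _ in range(n))
        self.conv_out = conv_bn_act((4 + n) * c, c_out, 1)

    def forward(self, x):
        x_mid = self.conv1(x)
        x1 = self.conv2(x_mid)
        x2 = self.dsconv(self.conv3(x_mid))
        x3, x4 = x_mid.chunk(2, dim=1)                   # Split(X_mid): two c-channel halves
        feats = [x1, x2, x3, x4]
        for neck in self.necks:                          # X_{4+i} = ConvNeck_i(X_{3+i}) + X_{3+i}
            feats.append(neck(feats[-1]))
        return self.conv_out(torch.cat(feats, dim=1))    # Conv_o(X_1 || ... || X_{4+n})
```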
D. Hypergraph-Based Cross-Level and Cross-Position Representation Network
For the neck of Hyper-YOLO, to comprehensively fuse cross-level and cross-position information from the backbone, we further propose the Hypergraph-Based Cross-Level and Cross-Position Representation Network (HyperC2Net), as shown in fig. 4. HyperC2Net is an instantiation of the proposed HGC-SCS framework and captures the potential high-order correlations in the semantic space.
1) Hypergraph Construction: As shown in fig. S1, our backbone is divided into five discrete stages whose feature maps are denoted $\{B_1, B_2, B_3, B_4, B_5\}$. To elucidate the complex high-order relationships among the basic features with hypergraph computation, we first concatenate the five basic features along the channel dimension to synthesize cross-level visual features. A hypergraph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ is generally defined by a vertex set $\mathcal{V}$ and a hyperedge set $\mathcal{E}$. In our approach, we deconstruct the grid-based visual features into the vertex set $\mathcal{V}$ of the hypergraph. To model neighborhood relationships in the semantic space, we construct an $\epsilon$-ball around each feature point using a distance threshold, which serves as a hyperedge, as shown in fig. 3. An $\epsilon$-ball is a hyperedge containing all feature points within a given distance of the central feature point. The construction of the overall hyperedge set can be defined as $\mathcal{E} = \{ball(v, \epsilon) \mid v \in \mathcal{V}\}$, where $ball(v, \epsilon) = \{u \mid \|x_u - x_v\|_d < \epsilon,\, u \in \mathcal{V}\}$ denotes the neighborhood vertex set of a given vertex $v$ and $\|x - y\|_d$ is a distance function. In computation, the hypergraph $\mathcal{G}$ is typically represented by its incidence matrix $H$.
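As an illustration of the $\epsilon$-ball construction, the sketch below builds the incidence matrix $H$ from a flattened set of feature points; the Euclidean norm is our choice for the distance function $\|\cdot\|_d$.

```python
import torch

def build_epsilon_ball_hypergraph(points, eps):
    """Build the incidence matrix H of the epsilon-ball hypergraph described above.

    points: (N, C) feature points (the flattened cross-level visual features).
    Each vertex v spawns one hyperedge ball(v, eps), so H is an (N, N) binary matrix
    with H[u, e] = 1 iff vertex u lies within distance eps of the centre vertex e.
    The Euclidean norm is our choice for the distance ||.||_d.
    """
    dist = torch.cdist(points, points)    # (N, N) pairwise distances
    return (dist < eps).float()           # column e is the hyperedge centred at vertex e
```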
2) Hypergraph Convolution: To enable high-order message passing over the hypergraph structure, we adopt a typical spatial-domain hypergraph convolution [18] with a residual connection for high-order learning on the vertex features, as follows:
$$
\left\{
\begin{aligned}
x_e &= \frac{1}{|\mathcal{N}_v(e)|}\sum_{v\in\mathcal{N}_v(e)} x_v \Theta\\
x_v^{\prime} &= x_v + \frac{1}{|\mathcal{N}_e(v)|}\sum_{e\in\mathcal{N}_e(v)} x_e
\end{aligned}
\right.
$$
where $\mathcal{N}_v(e)$ and $\mathcal{N}_e(v)$ are the two neighborhood indicator functions defined in [18]: $\mathcal{N}_v(e) = \{v \mid v \in e, v \in \mathcal{V}\}$ and $\mathcal{N}_e(v) = \{e \mid v \in e, e \in \mathcal{E}\}$, and $\Theta$ is a trainable parameter. For computational convenience, the matrix form of this two-stage hypergraph message passing can be defined as:
$$\operatorname{HyperConv}(X, H) = X + D_v^{-1} H D_e^{-1} H^{\top} X \Theta,$$
where $D_v$ and $D_e$ denote the diagonal degree matrices of the vertices and hyperedges, respectively.
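The matrix form translates directly into a small PyTorch module; the sketch below is our reading of it, with $\Theta$ realized as a bias-free linear layer and degree clamping added as an implementation detail to avoid division by zero.

```python
import torch
import torch.nn as nn

class HyperConv(nn.Module):
    """Sketch of the residual hypergraph convolution above:
    HyperConv(X, H) = X + D_v^{-1} H D_e^{-1} H^T X Theta."""
    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Linear(dim, dim, bias=False)     # the trainable Theta

    def forward(self, x, h):
        # x: (N, C) vertex features; h: (N, E) incidence matrix.
        d_v = h.sum(dim=1).clamp(min=1)                  # vertex degrees
        d_e = h.sum(dim=0).clamp(min=1)                  # hyperedge degrees
        edge_msg = (h / d_e).t() @ self.theta(x)         # D_e^{-1} H^T X Theta: vertex -> hyperedge
        vert_msg = (h / d_v.unsqueeze(1)) @ edge_msg     # D_v^{-1} H (...): hyperedge -> vertex
        return x + vert_msg                              # residual connection
```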
3) An Instance of the HGC-SCS Framework: Combining the hypergraph construction and convolution strategies defined above, we introduce a streamlined instance of the HGC-SCS framework, the Hypergraph-Based Cross-Level and Cross-Position Representation Network (HyperC2Net), defined overall as:
$$
\left\{
\begin{aligned}
X_{mixed} &= B_1 \,\|\, B_2 \,\|\, B_3 \,\|\, B_4 \,\|\, B_5\\
X_{hyper} &= \operatorname{HyperConv}(X_{mixed}, H)\\
N_3, N_4, N_5 &= \phi(X_{hyper}, B_3),\, \phi(X_{hyper}, B_4),\, \phi(X_{hyper}, B_5)
\end{aligned}
\right.
$$
where $\|$ denotes matrix concatenation and $\phi$ is the fusion function illustrated in fig. 4 (the semantic scattering module and the bottom-up module). In our HyperC2Net, $X_{mixed}$ inherently contains cross-level information, since it is the fusion of multi-level features from the backbone. Furthermore, by deconstructing the grid features into a set of feature points in the semantic space and constructing hyperedges according to distance, our approach allows high-order message passing among vertices at different positions in the point set. This capability captures cross-position information and enriches the model's understanding of the semantic space.
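Putting the previous two sketches together, the following hedged sketch traces the three lines of the HyperC2Net definition. How the five levels are resampled before channel concatenation and how $\phi$ fuses $X_{hyper}$ with $B_3$, $B_4$, $B_5$ are our simplifications (nearest-neighbor resizing and channel concatenation), not the released architecture; `build_epsilon_ball_hypergraph` and `HyperConv` refer to the earlier sketches.

```python
import torch
import torch.nn.functional as F

def hyperc2net_core(b_feats, hyperconv, eps=8.0):
    """Hedged sketch of the HyperC2Net definition above.

    b_feats:   list of the five backbone maps B1..B5, each of shape (C_i, H_i, W_i).
    hyperconv: a HyperConv module (earlier sketch) whose feature dimension equals
               the total concatenated channel count sum(C_i).
    """
    # Semantic Collecting: bring all levels to B3's resolution and concatenate channels.
    h, w = b_feats[2].shape[-2:]
    resized = [F.interpolate(f.unsqueeze(0), size=(h, w), mode="nearest").squeeze(0)
               for f in b_feats]
    x_mixed = torch.cat(resized, dim=0)                    # (sum C_i, H, W)
    points = x_mixed.flatten(1).t()                        # (H*W, sum C_i): grid -> point set
    # Hypergraph Computation: epsilon-ball construction + one HyperConv layer.
    h_inc = build_epsilon_ball_hypergraph(points, eps)     # from the earlier sketch
    x_hyper = hyperconv(points, h_inc).t().reshape_as(x_mixed)
    # Semantic Scattering: a naive phi(.) that resizes X_hyper to each target scale
    # and concatenates it with the corresponding backbone map B3/B4/B5.
    outs = []
    for b in (b_feats[2], b_feats[3], b_feats[4]):
        scaled = F.interpolate(x_hyper.unsqueeze(0), size=b.shape[-2:],
                               mode="nearest").squeeze(0)
        outs.append(torch.cat([b, scaled], dim=0))
    return outs                                            # N3, N4, N5
```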
E. Comparison and Analysis
Advances in the YOLO series have centered on improvements to the backbone and the neck, with the backbone treated as the key element of each iteration. For example, the original YOLO [1] framework introduced the DarkNet backbone, which was followed by a series of enhancements such as the ELAN (Efficient Layer Aggregation Network) module in YOLOv7 [7] and the C2f module introduced in YOLOv8 [8]. These innovations have greatly improved the visual feature extraction of the backbone.
In contrast, our Hyper-YOLO shifts the focus of innovation to the neck. In terms of neck architecture, state-of-the-art iterations such as YOLOv6 [5], YOLOv7 [7], and YOLOv8 [8] consistently adopt the PANet [16] (Path Aggregation Network) structure, while Gold-YOLO [10] employs an innovative gather-and-distribute neck paradigm. In the following, we compare HyperC2Net of Hyper-YOLO with these two classic neck architectures.
Although PANet effectively fuses multi-scale features through top-down and bottom-up pathways, it remains limited to information fusion between adjacent levels. This neighborhood fusion pattern restricts the breadth of information integration within the network. In contrast, HyperC2Net goes beyond this limitation and enables direct fusion among the five levels of backbone features, yielding a more powerful and diverse information flow and bridging the gap between features at different depths. Notably, although the gather-and-distribute neck of Gold-YOLO can integrate multi-level information, it does not account for cross-position interactions within feature maps. The originality of HyperC2Net lies in using hypergraph computation to capture the potential high-order correlations in the feature maps. Hypergraph convolution in the semantic domain promotes information flow unconstrained by the grid, enabling cross-level and cross-position high-order message propagation. This mechanism breaks the limitations of the traditional grid structure and yields more refined and integrated feature representations.
The feature representations produced by HyperC2Net jointly account for the semantic features provided by the backbone and the potential high-order structural features. This enriched representation is essential for achieving superior object detection performance. The ability of HyperC2Net to exploit these complex high-order relationships gives it a clear advantage over conventional neck architectures such as PANet and even recent innovations such as the gather-and-distribute neck, highlighting the value of high-order feature processing in advancing the frontier of computer vision.
V. EXPERIMENTS
A. Experimental Setup
1) Datasets: The Microsoft COCO dataset [41], a benchmark for object detection, is employed to assess the efficacy of the proposed Hyper-YOLO model. In particular, the Train2017 subset is utilized for training, while the Val2017 subset serves as the validation set. The performance evaluation of Hyper-YOLO is carried out on the Val2017 subset, with the results detailed in table I.
2) Compared Methods: We select advanced YOLO-series methods for comparison, including YOLOv5 [4], YOLOv6-3.0 [5], YOLOv7 [7], YOLOv8 [8], Gold-YOLO [10], and YOLOv9 [21]. Their default reported parameter configurations are adopted in our experiments.
3) Our Hyper-YOLO Methods: Our Hyper-YOLO is developed based on the four scales of YOLOv8 (-N, -S, -M, -L). Accordingly, we modified the hyperparameters (number of convolutional layers, feature dimensions) for each stage of the Hyper-YOLO architecture, as shown in table S2, resulting in Hyper-YOLO-N, Hyper-YOLO-S, Hyper-YOLO-M, and Hyper-YOLO-L. Considering that our Hyper-YOLO introduces high-order learning in the neck, which increases the number of parameters, we further reduced the parameters of Hyper-YOLO-N to form Hyper-YOLO-T. Specifically, in Hyper-YOLO-T's HyperC2Net, the last C2f in the bottom-up stage is replaced with a 1 × 1 convolution. Additionally, we note that the latest YOLOv9 employs programmable gradient information transmission and prunes paths during inference to reduce parameters while maintaining accuracy. Based on YOLOv9, we developed Hyper-YOLOv1.1: we replaced the neck of YOLOv9 with the HyperC2Net from Hyper-YOLO, thereby endowing YOLOv9 with the capability of high-order learning.
4) Other Details: To ensure an equitable comparison, we excluded pre-training and self-distillation strategies for all methods under consideration, as outlined in [5] and [10]. Furthermore, recognizing the potential influence of input image size on the evaluation, we standardized the input resolution across all experiments to 640 × 640 pixels, a common choice in the field. The evaluation is based on the standard COCO Average Precision (AP) metric. Additional implementation specifics are provided in section A and section C.
B. Results and Discussions
The results of object detection on the COCO Val2017 validation set, shown in table I, lead to four main observations. Firstly, the proposed Hyper-YOLO method outperforms other models across all four scales, achieving an $\mathrm{AP}^{val}$ of 41.8% at the -N scale, 48.0% at the -S scale, 52.0% at the -M scale, and 53.8% at the -L scale. Compared to
Method | Input Size | $\mathrm{AP}^{val}$ | $\mathrm{AP}_{50}^{val}$ | #Params. | FLOPs | FPS [bs=1] | FPS [bs=32] | Latency [bs=1]
YOLOv5-N[4] | 640 | 28.0% | 45.7% | 1.9 M | 4.5 G | 763 | 1158 | 1.3 ms |
YOLOv5-S[4] | 640 | 37.4% | 56.8% | 7.2 M | 16.5 G | 455 | 606 | 2.2 ms |
YOLOv5-M[4] | 640 | 45.4% | 64.1% | 21.2 M | 49.0G | 220 | 267 | 4.6 ms |
YOLOv5-L [4] | 640 | 49.0% | 67.3% | 46.5M | 109.1G | 133 | 148 | 7.5 ms |
YOLOv6-3.0-N[5] | 640 | 37.0% | 52.7% | 4.7M | 11.4 G | 864 | 1514 | 1.2 ms |
YOLOv6-3.0-S [5] | 640 | 44.3% | 61.2% | 18.5 M | 45.3 G | 380 | 581 | 2.6 ms |
YOLOv6-3.0-M[5] | 640 | 49.1% | 66.1% | 34.9M | 85.8 G | 198 | 263 | 5.1 ms |
YOLOv6-3.0-L[5] | 640 | 51.8% | 69.2% | 59.6M | 150.7 G | 116 | 146 | 8.6 ms |
Gold-YOLO-N[10] | 640 | 39.6% | 55.7% | 5.6 M | 12.1G | 694 | 1303 | 1.4 ms |
Gold-YOLO-S[10] | 640 | 45.4% | 62.5% | 21.5 M | 46.0 G | 331 | 530 | 3.0 ms |
Gold-YOLO-M[10] | 640 | 49.8% | 67.0% | 41.3M | 87.5 G | 178 | 243 | 5.6 ms |
Gold-YOLO-L[10] | 640 | 51.8% | 68.9% | 75.1 M | 151.7 G | 107 | 139 | 9.3 ms |
YOLOv8-N[8] | 640 | 37.3% | 52.6% | 3.2 M | 8.7 G | 713 | 1094 | 1.4 ms |
YOLOv8-S[8] | 640 | 44.9% | 61.8% | 11.2 M | 28.6G | 395 | 564 | 2.5 ms |
YOLOv8-M[8] | 640 | 50.2% | 67.2% | 25.9 M | 78.9 G | 181 | 206 | 5.5 ms |
YOLOv8-L[8] | 640 | 52.9% | 69.8% | 43.7M | 165.2 G | 115 | 127 | 8.7 ms |
YOLOv9-T[21] | 640 | 38.3% | 53.1% | 2.0 M | 7.7 G | 420 | 796 | 2.4 ms |
YOLOv9-S[21] | 640 | 46.8% | 63.4% | 7.1 M | 26.4 G | 292 | 464 | 3.4 ms |
YOLOv9-M[21] | 640 | 51.4% | 68.1% | 20.0M | 76.3 G | 165 | 199 | 6.1 ms |
YOLOv9-C[21] | 640 | 53.0% | 70.2% | 25.3M | 102.1G | 148 | 170 | 6.6 ms |
Hyper-YOLO-T | 640 | 38.5% | 54.5% | 3.1 M | 9.6 G | 404/692† | 644/1029† | 2.5/1.4† ms |
Hyper-YOLO-N | 640 | 41.8% | 58.3% | 4.0 M | 11.4 G | 364/554† | 460/710† | 2.7/1.8† ms |
Hyper-YOLO-S | 640 | 48.0% | 65.1% | 14.8 M | 39.0 G | 212/301† | 257/343† | 4.7/3.3† ms |
Hyper-YOLO-M | 640 | 52.0% | 69.0% | 33.3 M | 103.3 G | 111/145† | 132/154† | 9.0/6.9† ms |
Hyper-YOLO-L | 640 | 53.8% | 70.9% | 56.3 M | 211.0 G | 73/97† | 83/105† | 13.7/10.3† ms |
Hyper-YOLOv1.1-T | 640 | 40.3% | 55.6% | 2.5 M | 10.8 G | 345 | 530 | 2.9 ms |
Hyper-YOLOv1.1-S | 640 | 48.0% | 64.5% | 7.6 M | 29.9 G | 241 | 330 | 4.1 ms |
Hyper-YOLOv1.1-M | 640 | 51.9% | 69.1% | 21.2 M | 87.4 G | 140 | 162 | 7.1 ms |
Hyper-YOLOv1.1-C | 640 | 53.2% | 70.4% | 29.8 M | 115.5 G | 121 | 136 | 8.3 ms |
Gold-YOLO, Hyper-YOLO shows improvements of 2.2, 2.6, 2.2, and 2.0, respectively. When compared to YOLOv8, the improvements are 4.5, 3.1, 1.8, and 0.9, respectively. Compared to YOLOv9, Hyper-YOLO shows improvements of 3.5, 1.2, 0.6, and 0.8, respectively. These results validate the effectiveness of the Hyper-YOLO method.
Secondly, it is noteworthy that our method not only improves performance over Gold-YOLO but also reduces the number of parameters significantly: by 28% at the -N scale, 31% at the -S scale, 19% at the -M scale, and 25% at the -L scale. The main reason is our HGC-SCS framework, which introduces high-order learning in the semantic space, in contrast to Gold-YOLO's gather-and-distribute mechanism. This allows our method to exploit the diverse information extracted by the backbone, including cross-level and cross-position information, more efficiently with fewer parameters.
Thirdly, considering that Hyper-YOLO shares a similar underlying architecture with YOLOv8, we find that the proposed Hyper-YOLO-T achieves higher object detection performance than YOLOv8-N (37.3 → 38.5 in terms of $\mathrm{AP}^{val}$) with fewer parameters (3.2M → 3.1M). This demonstrates that the proposed HyperC2Net achieves better feature representation learning through high-order learning,
thereby enhancing detection performance. Similarly, we compare Hyper-YOLOv1.1 with YOLOv9, as both use the same backbone architecture, the only difference being that Hyper-YOLOv1.1 employs the hypergraph-based HyperC2Net as the neck. The results show that Hyper-YOLOv1.1 delivers significant performance improvements: Hyper-YOLOv1.1-T outperforms YOLOv9-T by 2.0 $\mathrm{AP}^{val}$, and Hyper-YOLOv1.1-S outperforms YOLOv9-S by 1.2 $\mathrm{AP}^{val}$. This fair comparison using the same architecture at the same scale validates the effectiveness of the proposed high-order learning method for object detection.
Finally, we observe that, compared to YOLOv8, the improvements brought by our Hyper-YOLO become more significant (from 0.9 to 4.5) as the model scale decreases (from -L to -N). This is because a smaller model scale weakens the feature extraction capability and the ability to obtain effective information from visual data. At this point, high-order learning becomes necessary to capture the latent high-order correlations in the semantic space of the feature map, enriching the features ultimately used by the detection head. Furthermore, hypergraph-based high-order message propagation in the semantic space allows direct information flow between different positions and levels, enhancing the feature extraction capability of the base network under limited parameters.
C. Ablation Studies on Backbone
In this and the next subsection, taking the model's scale into account, we select Hyper-YOLO-S to conduct ablation studies on the backbone and neck.
1) On the Basic Block of the Backbone: We conduct ablation experiments on the proposed MANet to verify the effectiveness of the mixed aggregation mechanism in the basic block, as shown in table II. To ensure a fair comparison, we use the same PANet [16] neck as YOLOv8 [8], so that the only difference between the two methods lies in the basic block. The experimental results clearly show that the proposed MANet outperforms the C2f module under the same neck across all metrics. This superior performance is attributed to the mixed aggregation mechanism, which integrates three classic structures, leading to a richer flow of information and thus enhanced performance.
Basic Block | $\mathrm{AP}^{val}$ (%) | $\mathrm{AP}_{50}^{val}$ (%) | $\mathrm{AP}^{s}$ (%) | $\mathrm{AP}^{m}$ (%) | $\mathrm{AP}^{l}$ (%) |
C2f (YOLOv8-S) | 44.9 | 61.7 | 25.9 | 49.7 | 61.0 |
MANet (Ours) | 46.4 | 63.4 | 28.1 | 51.7 | 62.3 |
2) On the Kernel Size of Different Stages: We further conduct ablation experiments on the size of the convolutional kernels, an essential factor in determining the receptive field and the ability of a network to capture spatial hierarchies in data. In our experiments, $k_i$ denotes the kernel size of the MANet used at the $i$-th stage. Since our MANet begins to utilize mixed aggregation from the second stage, the kernel configuration is denoted as $[k_2, k_3, k_4, k_5]$. Experimental results are presented in table III. They indicate that increasing the convolutional kernel size from 3 to 5 can indeed enhance the model's accuracy. However, for small- and medium-scale object detection, the accuracy does not necessarily improve compared to a mixture of different kernel sizes, and the uniform large kernels also incur more parameters. Therefore, balancing performance against the number of parameters, our Hyper-YOLO ultimately adopts the [3, 5, 5, 3] configuration for the convolutional kernel sizes in MANet.
$[k_2, k_3, k_4, k_5]$ | $\mathrm{AP}^{val}$ (%) | $\mathrm{AP}_{50}^{val}$ (%) | $\mathrm{AP}^{s}$ (%) | $\mathrm{AP}^{m}$ (%) | $\mathrm{AP}^{l}$ (%) |
[3, 3, 3, 3] | 46.3 | 63.3 | 27.2 | 51.1 | 62.6 |
[5, 5, 5, 5] | 46.6 | 63.5 | 27.5 | 51.6 | 63.1 |
[3, 5, 5, 3] | 46.4 | 63.4 | 28.1 | 51.7 | 62.3 |
D. Ablation Studies on Neck
1) High-Order vs. Low-Order Learning in the HGC-SCS Framework: The core of the HGC-SCS framework lies in the hypergraph computation in the semantic space, which allows high-order information propagation among feature point sets. We conduct ablation studies to evaluate its effectiveness by simplifying the hypergraph into a graph for low-order learning, as shown in table IV. In this case, the graph is constructed by connecting the central node with its neighbors within an $\epsilon$-ball. The graph convolution used [42] is the classic $\hat{A} = D_v^{-1/2} A D_v^{-1/2} + I$, where $D_v$ is the diagonal degree matrix of the graph adjacency matrix $A$ (a sketch of this baseline follows table IV). Additionally, we include a configuration with no correlation learning at all ("None"). The experimental results in table IV reveal that high-order learning demonstrates superior performance compared to the other two settings. Theoretically, low-order learning can be considered a subset [43] of high-order learning but lacks the capability to model complex correlations. High-order learning, on the other hand, possesses a more robust correlation modeling capability, which corresponds to a higher performance ceiling; as a result, it tends to achieve better performance more easily.
Hypergraph Computation | $\mathrm{AP}^{val}$ (%) | $\mathrm{AP}_{50}^{val}$ (%) | $\mathrm{AP}^{s}$ (%) | $\mathrm{AP}^{m}$ (%) | $\mathrm{AP}^{l}$ (%) |
None | 46.4 | 63.4 | 28.1 | 51.7 | 62.3 |
Low-Order Learning | 47.6 | 64.8 | 29.1 | 53.1 | 63.7 |
High-Order Learning | 48.0 | 65.1 | 29.9 | 53.2 | 64.6 |
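For reference, a short sketch of the low-order baseline described above, following the normalized-adjacency formula quoted in the text (our reading; the trainable weights and the exact $\epsilon$-ball adjacency construction are omitted for brevity):

```python
import torch

def low_order_propagate(x, adj):
    """Low-order baseline of the ablation, following the formula quoted above:
    hat_A = D_v^{-1/2} A D_v^{-1/2} + I, then X' = hat_A X.

    x:   (N, C) vertex features.
    adj: (N, N) binary adjacency linking each centre to its epsilon-ball neighbours.
    """
    deg = adj.sum(dim=1).clamp(min=1)
    d_inv_sqrt = deg.pow(-0.5)
    a_hat = d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0) \
            + torch.eye(adj.shape[0], device=adj.device)
    return a_hat @ x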
2) On the Semantic Collecting Phase: The first phase of the HGC-SCS framework is Semantic Collecting, which determines the total amount of information fed into the semantic space for hypergraph computation. We perform ablation studies on this phase, as shown in table V, using three configurations that select 3, 4, or 5 levels of feature maps as input. The results reveal that a greater number of feature maps brings more abundant semantic-space information. This enhanced richness allows the hypergraph to fully exploit its capability in modeling complex correlations. Consequently, the input configuration with 5 feature maps achieves the best performance. This outcome suggests that the model benefits from a more comprehensive representation of the input when more levels of feature maps are integrated: including more feature maps introduces a broader range of semantic meaning and detail from the visual input, enabling the hypergraph to establish higher-order connections that reflect a more complete understanding of the scene. Therefore, the configuration that incorporates 5 feature maps is preferred for maximizing the potential of hypergraph-based complex correlation modeling.
Semantic Collecting Set | $\mathrm{AP}^{val}$ (%) | $\mathrm{AP}_{50}^{val}$ (%) | $\mathrm{AP}^{s}$ (%) | $\mathrm{AP}^{m}$ (%) | $\mathrm{AP}^{l}$ (%) |
$\{B_3, B_4, B_5\}$ | 47.5 | 64.6 | 28.9 | 52.6 | 63.8 |
$\{B_2, B_3, B_4, B_5\}$ | 47.8 | 65.0 | 28.4 | 53.1 | 64.2 |
$\{B_1, B_2, B_3, B_4, B_5\}$ | 48.0 | 65.1 | 29.9 | 53.2 | 64.6 |
3) On the Hypergraph Construction of the Hypergraph Computation Phase: Further ablation experiments examine the effect of the distance threshold used in hypergraph construction, with the results shown in table VI. Compared to the configuration "None", where hypergraph computation is not introduced, introducing hypergraph computation yields a significant overall performance improvement. The performance of the detection network is relatively stable across threshold values from 7 to 9, with only minor variations, but declines at thresholds of 6 and 10. This decline can be attributed to the number of connected nodes directly affecting the smoothness of features in the semantic space: a higher threshold may lead to a more connected hypergraph in which nodes are more likely to share information, potentially over-smoothing the features, whereas a lower threshold may result in a less connected hypergraph that cannot fully exploit the high-order relationships among features. Therefore, our Hyper-YOLO uses a distance threshold of 8 for construction. The precise value should be determined empirically, balancing the need for a richly connected hypergraph against the risk of over-smoothing or under-connecting the feature representation.
Distance Threshold | $\mathrm{AP}^{val}$ (%) | $\mathrm{AP}_{50}^{val}$ (%) | $\mathrm{AP}^{s}$ (%) | $\mathrm{AP}^{m}$ (%) | $\mathrm{AP}^{l}$ (%) |
None | 46.3 | 63.5 | 26.9 | 51.6 | 62.6 |
6 | 47.6 | 64.6 | 28.6 | 52.7 | 64.2 |
7 | 47.8 | 65.0 | 29.4 | 53.3 | 64.0 |
8 | 48.0 | 65.1 | 29.9 | 53.2 | 64.6 |
9 | 47.8 | 64.9 | 29.2 | 53.4 | 64.5 |
10 | 47.7 | 65.1 | 28.2 | 53.0 | 63.7 |
E. More Ablation Studies
In this subsection, we conduct thorough ablation studies to assess the impact of the backbone and neck enhancements in Hyper-YOLO across four model scales, with detailed results presented in table VII. The baseline performance of YOLOv8 is placed at the top of the table. The middle part introduces our Hyper-YOLO models that incorporate only the backbone enhancement, and the bottom part features the fully augmented Hyper-YOLO models, which benefit from both backbone and neck enhancements. Based on the experimental results in table VII, we make three observations.
Firstly, the adoption of both individual and combined enhancements significantly boosts performance for the -N, -S, and -M models, validating the effectiveness of our proposed modifications. Secondly, the impact of each enhancement appears to be scale-dependent. As we progress from the -N to the -S, -M, and -L models, the incremental performance gains due to the backbone improvement gradually decrease from 2.6 to 1.5, 0.8, and finally 0.1. In contrast, the neck enhancement consistently contributes more substantial improvements across these scales, with respective gains of 1.9, 1.6, 1.0, and 0.8. This suggests that while the benefits of an expanded receptive field and width scaling in the backbone are more pronounced in smaller models, the advanced HyperC2Net neck provides a more uniform enhancement by enriching the semantic content and boosting object detection performance across the board. Thirdly, when focusing on small object detection ($\mathrm{AP}^{s}$), the Hyper-YOLO-L model with both backbone and neck enhancements achieves a notable increase of 1.6, whereas the backbone enhancement alone yields a 0.6 improvement. This underscores the potential of hypergraph modeling, particularly within the neck enhancement, to capture the complex relationships among small objects and significantly improve detection in these challenging scenarios.
F. More Evaluation on Instance Segmentation Task
We extend Hyper-YOLO to the instance segmentation task on the COCO dataset, ensuring a direct comparison with its predecessor YOLOv8 by adopting a consistent network modification: replacing the detection head with a segmentation head. Experimental results are shown in table VIII.
The empirical results clearly illustrate that Hyper-YOLO attains remarkable performance enhancements. For $\mathrm{AP}^{box}$, Hyper-YOLO shows an impressive increase of 4.7 AP for the -N variant, 3.3 AP for the -S variant, 2.2 AP for the -M variant, and 1.4 AP for the -L variant. Similarly, for $\mathrm{AP}^{mask}$, Hyper-YOLO exhibits significant improvements, with gains of 3.3 AP for -N, 2.3 AP for -S, 1.3 AP for -M, and 0.7 AP for -L. These results underscore the effectiveness of the advancements integrated into Hyper-YOLO.
G. Visualization of High-Order Learning in Object Detection
In our paper, we have provided a mathematical rationale explaining how the hypergraph-based neck can transcend the limitations of traditional neck designs, which typically rely on grid-like neighborhood structures for message propagation within feature maps. This design enables advanced high-order message propagation across the semantic spaces of the features. To further substantiate the effectiveness of our hypergraph-based neck, we include visualizations in fig. 5, which compare feature maps before and after applying our HyperConv layer. These images show a consistent reduction in attention to semantically similar backgrounds, such as skies and grounds, while focus on foreground objects is maintained across various scenes. This demonstrates that HyperConv, through hypergraph computation, helps the neck better recognize semantically similar objects within an image, thus supporting the detection head in making more consistent decisions.
VI. CONCLUSION
In this paper, we presented Hyper-YOLO, a groundbreaking object detection model that integrates hypergraph computation with the YOLO architecture to harness the potential of high-order correlations in visual data. By addressing the
Methods | $\mathrm{AP}^{box}$ (%) | $\mathrm{AP}^{mask}$ (%) | Params. (M) | FLOPs (G) |
YOLOv8-N-seg | 36.7 | 30.5 | 3.4 | 12.6 |
YOLOv8-S-seg | 44.6 | 36.8 | 11.8 | 42.6 |
YOLOv8-M-seg | 49.9 | 40.8 | 27.3 | 110.2 |
YOLOv8-L-seg | 52.3 | 42.6 | 46.0 | 220.5 |
HyperYOLO-N-seg | 41.4 | 33.8 | 4.3 | 15.3 |
HyperYOLO-S-seg | 47.9 | 39.1 | 15.5 | 53.0 |
HyperYOLO-M-seg | 52.1 | 42.1 | 34.7 | 134.6 |
HyperYOLO-L-seg | 53.7 | 43.3 | 58.6 | 266.3 |

inherent limitations of traditional YOLO models, particularly the neck design's inability to effectively integrate features across different levels and exploit high-order relationships, we have significantly advanced the state of the art in object detection. Our contributions set a new benchmark for future research and development in object detection frameworks and pave the way for further exploration of integrating hypergraph computation within visual architectures based on our HGC-SCS framework.
Method | $\mathrm{AP}^{val}$ (%) | $\mathrm{AP}_{50}^{val}$ (%) | $\mathrm{AP}^{s}$ (%) | $\mathrm{AP}^{m}$ (%) | $\mathrm{AP}^{l}$ (%) | #Params. | FLOPs | FPS [bs=1] | FPS [bs=32] | Latency [bs=1]
YOLOv8-N | 37.3% | 52.3% | 18.7% | 40.9% | 53.3% | 3.2 M | 8.7 G | 713 | 1094 | 1.4 ms
YOLOv8-S | 44.9% | 61.7% | 25.9% | 49.7% | 61.0% | 11.2 M | 28.6 G | 395 | 564 | 2.5 ms
YOLOv8-M | 50.2% | 67.1% | 32.3% | 55.6% | 66.5% | 25.9 M | 78.9 G | 181 | 206 | 5.5 ms
YOLOv8-L | 52.9% | 69.6% | 35.1% | 57.9% | 69.8% | 43.7 M | 165.2 G | 115 | 127 | 8.7 ms
Backbone Enhancement
HyperYOLO-N | 39.9% | 56.2% | 20.8% | 44.3% | 55.5% | 3.5 M | 9.8 G | 554 | 710 | 1.8 ms
HyperYOLO-S | 46.4% | 63.4% | 28.1% | 51.7% | 62.3% | 12.7 M | 32.6 G | 301 | 343 | 3.3 ms
HyperYOLO-M | 51.0% | 67.9% | 32.7% | 56.8% | 67.9% | 28.2 M | 86.8 G | 145 | 154 | 6.9 ms
HyperYOLO-L | 53.0% | 70.0% | 35.7% | 58.8% | 69.5% | 46.4 M | 177.8 G | 97 | 105 | 10.3 ms
Backbone & Neck Enhancement
HyperYOLO-N | 41.8% | 58.3% | 22.2% | 46.4% | 58.7% | 4.0 M | 11.4 G | 364 | 460 | 2.7 ms
HyperYOLO-S | 48.0% | 65.1% | 29.9% | 53.2% | 64.6% | 14.8 M | 39.0 G | 212 | 257 | 4.7 ms
HyperYOLO-M | 52.0% | 69.0% | 34.6% | 57.9% | 68.7% | 33.3 M | 103.3 G | 111 | 132 | 9.0 ms
HyperYOLO-L | 53.8% | 70.9% | 36.7% | 59.9% | 70.2% | 56.3 M | 211.0 G | 73 | 83 | 13.7 ms
REFERENCES
[1] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779-788.
[2] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," ArXiv Preprint ArXiv:1804.02767, 2018.
[3] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," ArXiv, vol. abs/2004.10934, 2020.
[4] G. Jocher, "Ultralytics YOLOv5," 2020. [Online]. Available: https://github.com/ultralytics/yolov5
[5] C. Li, L. Li, Y. Geng, H. Jiang, M. Cheng, B. Zhang, Z. Ke, X. Xu, and X. Chu, "YOLOv6 v3.0: A full-scale reloading," ArXiv, vol. abs/2301.05586, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:255825915
[6] C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, W. Nie et al., "YOLOv6: A single-stage object detection framework for industrial applications," ArXiv Preprint ArXiv:2209.02976, 2022.
[7] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 7464-7475.
[8] G. Jocher, A. Chaurasia, and J. Qiu, "Ultralytics YOLOv8," 2023. [Online]. Available: https://github.com/ultralytics/ultralytics
[9] Z. Ge, S. Liu, F. Wang, Z. Li, and J. Sun, "YOLOX: Exceeding YOLO series in 2021," ArXiv, vol. abs/2107.08430, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:236088010
[10] C. Wang, W. He, Y. Nie, J. Guo, C. Liu, Y. Wang, and K. Han, "Gold-YOLO: Efficient object detector via gather-and-distribute mechanism," in Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[11] S. Xu, X. Wang, W. Lv, Q. Chang, C. Cui, K. Deng, G. Wang, Q. Dang, S. Wei, Y. Du, and B. Lai, "PP-YOLOE: An evolved version of YOLO," ArXiv, vol. abs/2203.16250, 2022. [Online]. Available: https://api.semanticscholar.org/CorpusID:247793126
[12] M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in International Conference on Machine Learning. PMLR, 2019, pp. 6105-6114.
[13] X. Ding, X. Zhang, N. Ma, J. Han, G. Ding, and J. Sun, "RepVGG: Making VGG-style ConvNets great again," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13733-13742.
[14] C.-Y. Wang, H.-Y. M. Liao, and I.-H. Yeh, "Designing network design strategies through gradient path analysis," ArXiv Preprint ArXiv:2211.04800, 2022.
[15] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117-2125.
[16] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8759-8768.
[17] Y. Gao, Z. Zhang, H. Lin, X. Zhao, S. Du, and C. Zou, "Hypergraph learning: Methods and practices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 5, pp. 2548-2566, 2020.
[18] Y. Gao, Y. Feng, S. Ji, and R. Ji, "HGNN+: General hypergraph neural networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, pp. 3181-3199, 2023.
[19] C. Lyu, W. Zhang, H. Huang, Y. Zhou, Y. Wang, Y. Liu, S. Zhang, and K. Chen, "RTMDet: An empirical study of designing real-time object detectors," 2022.
[20] Z. Tian, C. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9627-9636.
[21] C.-Y. Wang, I.-H. Yeh, and H.-Y. M. Liao, "YOLOv9: Learning what you want to learn using programmable gradient information," ArXiv Preprint ArXiv:2402.13616, 2024.
[22] J.-G. Young, G. Petri, and T. P. Peixoto, "Hypergraph reconstruction from network data," Communications Physics, vol. 4, no. 1, p. 135, 2021.
[23] D. Yang, B. Qu, J. Yang, and P. Cudré-Mauroux, "LBSN2Vec++: Heterogeneous hypergraph embedding for location-based social networks," IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 4, pp. 1843-1855, 2020.
[24] S. Jin, Y. Hong, L. Zeng, Y. Jiang, Y. Lin, L. Wei, Z. Yu, X. Zeng, and X. Liu, "A general hypergraph learning algorithm for drug multi-task predictions in micro-to-macro biomedical networks," PLOS Computational Biology, vol. 19, no. 11, p. e1011597, 2023.
[25] R. Viñas, C. K. Joshi, D. Georgiev, P. Lin, B. Dumitrascu, E. R. Gamazon, and P. Liò, "Hypergraph factorization for multi-tissue gene expression imputation," Nature Machine Intelligence, vol. 5, no. 7, pp. 739-753, 2023.
[26] L. Xiao, J. Wang, P. H. Kassani, Y. Zhang, Y. Bai, J. M. Stephen, T. W. Wilson, V. D. Calhoun, and Y.-P. Wang, "Multi-hypergraph learning-based brain functional connectivity analysis in fMRI data," IEEE Transactions on Medical Imaging, vol. 39, no. 5, pp. 1746-1758, 2019.
[27] C. Zu, Y. Gao, B. Munsell, M. Kim, Z. Peng, Y. Zhu, W. Gao, D. Zhang, D. Shen, and G. Wu, "Identifying high order brain connectome biomarkers via learning on hypergraph," in MICCAI 2016. Springer, 2016, pp. 1-9.
[28] Y. Feng, H. You, Z. Zhang, R. Ji, and Y. Gao, "Hypergraph neural networks," in Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[30] ——, "Identity mappings in deep residual networks," in Computer Vision - ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV. Springer, 2016, pp. 630-645.
[31] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700-4708.
[32] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, "A ConvNet for the 2020s," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 11976-11986.
[33] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818-2826.
[34] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492-1500.
[35] Y. Chen, X. Yuan, R. Wu, J. Wang, Q. Hou, and M.-M. Cheng, "YOLO-MS: Rethinking multi-scale representation learning for real-time object detection," ArXiv Preprint ArXiv:2308.05480, 2023.
[36] X. Xu, Y. Jiang, W. Chen, Y. Huang, Y. Zhang, and X. Sun, "DAMO-YOLO: A report on real-time object detection design," ArXiv Preprint ArXiv:2211.15444, 2022.
[37] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "Scaled-YOLOv4: Scaling cross stage partial network," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13029-13038.
[38] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7263-7271.
[39] L. Huang, W. Li, L. Shen, H. Fu, X. Xiao, and S. Xiao, "YOLOCS: Object detection based on dense channel compression for feature spatial solidification," ArXiv Preprint ArXiv:2305.04170, 2023.
[40] Y. Lee, J.-w. Hwang, S. Lee, Y. Bae, and J. Park, "An energy and GPU-computation efficient backbone network for real-time object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[41] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer International Publishing, 2014, pp. 740-755.
[42] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," in International Conference on Learning Representations, 2017.
[43] Y. Feng, S. Ji, Y.-S. Liu, S. Du, Q. Dai, and Y. Gao, "Hypergraph-based multi-modal representation for open-set 3D object retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.



Yifan Feng received the BE degree in computer science and technology from Xidian University, Xi'an, China, in 2018, and the MS degree from Xiamen University, Xiamen, China, in 2021. He is currently working toward the PhD degree in the School of Software, Tsinghua University, Beijing, China. His research interests include hypergraph neural networks, machine learning, and pattern recognition.

Jiangang Huang received the BE degree in software engineering from Xi'an Jiaotong University, Xi'an, China, in 2022. He is currently working toward the master's degree in the same field at Xi'an Jiaotong University. His research interests include object detection, software engineering, and artificial intelligence.

Shaoyi Du received double Bachelor degrees in computational mathematics and in computer science in 2002, the M.S. degree in applied mathematics in 2005, and the Ph.D. degree in pattern recognition and intelligence systems from Xi'an Jiaotong University, China, in 2009. He is a professor at Xi'an Jiaotong University. His research interests include computer vision, machine learning, and pattern recognition.

Shihui Ying (M'11) received the B.Eng. degree in mechanical engineering and the Ph.D. degree in applied mathematics from Xi'an Jiaotong University, Xi'an, China, in 2001 and 2008, respectively. He is currently a Professor with the Department of Mathematics, School of Science, Shanghai University, Shanghai, China. His current research interests include geometric theory and methods for machine intelligence and medical image analysis.

Jun-Hai Yong received the B.S. and Ph.D. degrees in computer science from Tsinghua University, Beijing, China, in 1996 and 2001, respectively. He held a visiting researcher position with the Department of Computer Science, Hong Kong University of Science and Technology in 2000, and was a Post-Doctoral Fellow with the Department of Computer Science, University of Kentucky, from 2000 to 2002. He is currently a Professor with the School of Software, Tsinghua University. His main research interests include computer-aided design and computer graphics. He has received numerous awards, including the National Excellent Doctoral Dissertation Award, the National Science Fund for Distinguished Young Scholars, the Best Paper Award of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, the Outstanding Service Award as an Associate Editor of the Computers and Graphics journal by Elsevier, and several National Excellent Textbook Awards.

Yipeng Li received the B.S. and M.S. degrees in electronic engineering from the Harbin Institute of Technology, Harbin, China, and the Ph.D. degree in electronic engineering from Tsinghua University, Beijing, China, in 2003, 2005, and 2011, respectively. He is currently an Assistant Researcher with the Department of Automation, Tsinghua University. His current research interests include UAV vision-based autonomous navigation, 3-D reconstruction of natural environments, complex systems theory, and Internet applications analysis.

Guiguang Ding is currently a Distinguished Researcher with the School of Software, Tsinghua University; a Ph.D. Supervisor; an Associate Dean of the School of Software, Tsinghua University; and the Deputy Director of the National Research Center for Information Science and Technology. His research interests mainly focus on visual perception, theory and methods of efficient retrieval and weakly supervised learning, neural network compression for vision tasks in edge-computing and power-limited scenarios, visual computing systems, and platform development. He was a winner of the National Science Fund for Distinguished Young Scholars.

Rongrong Ji is currently a Professor and the Director of the Intelligent Multimedia Technology Laboratory, School of Informatics, Xiamen University, Xiamen, China. His work mainly focuses on innovative technologies for multimedia signal processing, computer vision, and pattern recognition, with over 100 papers published in international journals and conferences. He serves as an Associate/Guest Editor for international journals and magazines such as Neurocomputing, Signal Processing, and Multimedia Systems.

Yue Gao is an associate professor with the School of Software, Tsinghua University. He received the B.S. degree from the Harbin Institute of Technology, Harbin, China, and the M.E. and Ph.D. degrees from Tsinghua University, Beijing, China.
APPENDIX A
MANet configuration | $[n_2, n_3, n_4, n_5]$ (blocks per stage) | $[k_2, k_3, k_4, k_5]$ (kernel size per stage) |
Hyper-YOLO-N | [1, 2, 2, 1] | [3, 5, 5, 3] |
Hyper-YOLO-S | [1, 2, 2, 1] | [3, 5, 5, 3] |
Hyper-YOLO-M | [2, 4, 4, 2] | [3, 5, 5, 3] |
Hyper-YOLO-L | [3, 6, 6, 3] | [3, 5, 5, 3] |
IMPLEMENTATION DETAILS OF HYPER-YOLO
In this section, we detail the implementation of our proposed models: Hyper-YOLO-N, Hyper-YOLO-S, Hyper-YOLO-M, and Hyper-YOLO-L. These models are developed upon PyTorch¹. In line with the configuration established by YOLOv8 [8], our models share analogous architectures and loss functions, with the notable exception of incorporating MANet and HyperC2Net. An efficient decoupled head is integrated for precise object detection. The specific configuration of Hyper-YOLO-S is depicted in fig. S1.
A. Backbone
The backbone of Hyper-YOLO, detailed in table S1, is updated from its predecessor by replacing the C2f module with the MANet module while maintaining the same number of layers as in YOLOv8 [8], structured as [3, 6, 6, 3]. The channel counts for each stage are kept consistent with those in YOLOv8, with the module swap being the only change.
¹https://pytorch.org/
The MANet employs depthwise separable convolutions with an increased channel count, where a 2c input is expanded to a 4c output (with 2c equivalent to Cout).
In addition to these adjustments, the hyperparameters k and n for the four stages are set to [3, 5, 5, 3] and [3, 6, 6, 3] × depth, respectively. The depth multiplier varies across the different scales of the model, being set to 1/3, 1/3, 2/3, and 1 for Hyper-YOLO-N, Hyper-YOLO-S, Hyper-YOLO-M, and Hyper-YOLO-L, respectively. This means that the actual count of n at each stage of the models is [3, 6, 6, 3]
Model | Channel of $B_1$ | $B_2$ | $B_3$ | $B_4$ | $B_5$ | HyperConv $C_{in}$ | HyperConv $C_{out}$ | $\epsilon$ (threshold) |
Hyper-YOLO-N | 16 | 32 | 64 | 128 | 256 | 128 | 128 | 6 |
Hyper-YOLO-S | 32 | 64 | 128 | 256 | 512 | 256 | 256 | 8 |
Hyper-YOLO-M | 48 | 96 | 192 | 384 | 576 | 384 | 384 | 10 |
Hyper-YOLO-L | 64 | 128 | 256 | 512 | 512 | 512 | 512 | 10 |
multiplied by the corresponding depth factor for that scale. These specifications ensure that each scale of the HyperYOLO model is equipped with a backbone that is finely tuned for its size and complexity, enabling efficient feature extraction at multiple scales.
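As a small worked illustration of the depth scaling described above (the rounding scheme is our assumption), the per-stage block counts can be derived as follows, reproducing the values in table S1:

```python
# Illustration of the depth scaling (rounding scheme assumed); values match table S1.
BASE_N = [3, 6, 6, 3]                                  # block counts at full depth
DEPTH = {"N": 1 / 3, "S": 1 / 3, "M": 2 / 3, "L": 1.0}

def stage_blocks(scale):
    return [max(1, round(n * DEPTH[scale])) for n in BASE_N]

# stage_blocks("N") -> [1, 2, 2, 1]; stage_blocks("M") -> [2, 4, 4, 2]; stage_blocks("L") -> [3, 6, 6, 3]
```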
B. Neck
Compared to the neck design in YOLOv8, the Hyper- YOLO model introduces the HyperC2Net (Hypergraph-Based Cross-level and Cross-position Representation Network) as its neck component, detailed in fig. 4. This innovative structure is an embodiment of the proposed HGC-SCS framework, specifically engineered to encapsulate potential high-order correlations existing within the semantic space.
The HyperC2Net is designed to comprehensively fuse cross-level and cross-position information emanating from the backbone network. By leveraging the hypergraph architecture, it effectively captures the complex interdependencies among feature points across different layers and positions. This allows the model to construct a more intricate and enriched representation of the input data, which is particularly useful for identifying and delineating subtle nuances within the images being processed. In the context of the Hyper-YOLO model's varying scales, the neck plays a critical role in maintaining the consistency of high-order correlation representation. Since the spatial distribution of feature points can differ significantly between models such as Hyper-YOLO-N and Hyper-YOLO-L, with the latter typically having a more dispersed distribution, HyperC2Net adjusts its approach by employing different distance thresholds for each model scale, as outlined in table S2, to ensure that the network captures the appropriate level of high-order correlations without succumbing to over-smoothing. This ability to adapt the threshold to the model scale and feature point distribution strikes a balance between the depth of contextual understanding and the need to preserve the sharpness and granularity of the feature space, thereby enhancing the model's overall performance in detecting and classifying objects within varied and complex visual environments.
APPENDIX B
VISUALIZATIONS OF RESULTS
In this section, we further provide visualizations of the Hyper-YOLO on two tasks: object detection and instance segmentation, as shown in fig. S2 and fig. S3, respectively.
Hyperparameter | N | S | M | L |
Epochs | 500 | 500 | 500 | 500 |
Optimizer | SGD | SGD | SGD | SGD |
lr0 | 0.01 | 0.01 | 0.01 | 0.01 |
lrf | 0.02 | 0.01 | 0.1 | 0.1 |
lr decay | linear | linear | linear | linear |
Momentum | 0.937 | 0.937 | 0.937 | 0.937 |
Weight decay | 0.0005 | 0.0005 | 0.0005 | 0.0005 |
Warm-up epochs | 3.0 | 3.0 | 3.0 | 3.0 |
Warm-up momentum | 0.8 | 0.8 | 0.8 | 0.8 |
Warm-up bias learning rate | 0.1 | 0.1 | 0.1 | 0.1 |
Box loss gain | 7.5 | 7.5 | 7.5 | 7.5 |
Class loss gain | 0.5 | 0.5 | 0.5 | 0.5 |
DFL loss gain | 1.5 | 1.5 | 1.5 | 1.5 |
HSV hue augmentation | 0.015 | 0.015 | 0.015 | 0.015 |
HSV saturation augmentation | 0.7 | 0.7 | 0.7 | 0.7 |
HSV value augmentation | 0.4 | 0.4 | 0.4 | 0.4 |
Translation augmentation | 0.1 | 0.1 | 0.1 | 0.1 |
Scale augmentation | 0.5 | 0.6 | 0.9 | 0.9 |
Mosaic augmentation | 1.0 | 1.0 | 1.0 | 1.0 |
Mixup augmentation | 0.0 | 0.0 | 0.1 | 0.1 |
Copy & Paste augmentation | 0.0 | 0.0 | 0.0 | 0.1 |
Close mosaic epochs | 10 | 10 | 20 | 20 |
Hypergraph threshold | 6 | 8 | 10 | 10 |
A. Object Detection

The results depicted in fig. S2 illustrate that our Hyper-YOLO model exhibits superior object recognition capabilities, as demonstrated in figures (b) and (c). Moreover, owing to the hypergraph-based neck in its architecture, Hyper-YOLO possesses a certain degree of class inference ability. This is most evident in figure (a), where Hyper-YOLO infers with high confidence that if one bird is detected, the other two entities are also birds. Additionally, as observed in figure (e), it is common for humans to play with dogs using a frisbee; even though only a glove is visible in the image, our Hyper-YOLO is still able to recognize it as part of a human.
B. Instance Segmentation
Results from fig. S3 indicate that, compared to YOLOv8, Hyper-YOLO achieves significant improvements in both categorization and boundary delineation for segmentation tasks. Despite the ground-truth annotation in figure (a) not being entirely accurate, our Hyper-YOLO still provides precise boundary segmentation. Figures (c), (d), and (e) depict more complex scenes, yet our Hyper-YOLO continues to deliver accurate instance segmentation results, ensuring that not a single cookie is missed.
APPENDIX C
TRAINING DETAILS OF HYPER-YOLO
The training protocol for Hyper-YOLO was carefully designed to foster consistency and robustness across experiments. Each GPU was allocated a uniform batch size of 20 to maintain a consistent computational environment, utilizing a total of 8 NVIDIA GeForce RTX 4090 GPUs. To assess learning efficacy and generalization capacity, all

variants of Hyper-YOLO, including -N, -S, -M, and -L, were trained from scratch. The models underwent 500 epochs of training without relying on pre-training on large-scale datasets such as ImageNet, thereby avoiding potential biases. The training hyperparameters were fine-tuned to suit the specific needs of the different model sizes; table S3 summarizes the key hyperparameters for each model scale.
Core parameters such as the initial learning rate and weight decay were set uniformly across all scales to standardize the learning process. The hypergraph threshold, however, was varied according to the model scale and batch size. This threshold was configured with a batch size of 20 per GPU in mind, implying that if the batch size were to change, the threshold would need to be adjusted accordingly. Generally, a larger batch size on a single GPU necessitates a lower threshold, whereas a larger model scale correlates with a higher threshold.
Most hyperparameters remained consistent across the different model scales; nonetheless, parameters such as the learning rate, scale augmentation, mixup augmentation, copy & paste augmentation, and the hypergraph threshold were tailored to each model scale. Data augmentation hyperparameters were set in accordance with YOLOv5's configuration, with certain modifications for Hyper-YOLO. For instance, the N and S models employed lower levels of data augmentation, with specific adjustments made for the N model's final learning rate (lrf=0.02) and the S model's scale augmentation (scale=0.6). The M and L models, on the other hand, utilized moderate and high levels of data augmentation, respectively, with both scales sharing the same setting for close mosaic epochs (20).
It should be emphasized that the hypergraph threshold is set under the premise of a batch size of 20 per GPU; alterations to the batch size should be accompanied by corresponding adjustments to the threshold, following the trend that a larger single-GPU batch size should lead to a smaller threshold and larger model scales require higher thresholds.
APPENDIX D
DETAILS OF SPEED TEST
The speed benchmarking for our Hyper-YOLO model adopts a two-group approach. The first group comprises models requiring reparameterization, such as YOLOv6-3.0 and Gold-YOLO; the second group includes YOLOv5, YOLOv8, and Hyper-YOLO. Notably, during the conversion to ONNX format, the Hyper-YOLO model encounters issues with the torch.cdist function, leading to large tensor sizes that
cause errors at a batch size of 32. To address this and ensure accurate speed measurements, we replace the torch.cdist function with a custom feature-distance function during testing. In addition, we also test the speed of a variant with only an enhanced backbone.
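One common ONNX-friendly replacement for torch.cdist is the squared-norm expansion sketched below; this is an illustrative stand-in, not necessarily the authors' exact custom distance function.

```python
import torch

def pairwise_dist(x, y, eps=1e-12):
    """ONNX-friendly stand-in for torch.cdist(x, y) via the expansion
    ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b (illustrative, not the authors' exact code).
    x: (N, C), y: (M, C) -> (N, M) Euclidean distances."""
    x2 = (x * x).sum(dim=1, keepdim=True)        # (N, 1)
    y2 = (y * y).sum(dim=1, keepdim=True).t()    # (1, M)
    sq = (x2 + y2 - 2.0 * x @ y.t()).clamp(min=eps)
    return sq.sqrt()
```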
The benchmarking process involves converting the models to ONNX format, followed by conversion to TensorRT engines. The tests are performed twice, with batch sizes of 1 and 32, to assess performance across different operational contexts. Our test environment is controlled, consisting of Python 3.8.16, PyTorch 2.0.1, CUDA 11.7, cuDNN 8.0.5, TensorRT 8.6.1, and ONNX 1.15.0. All tests are carried out with a fixed input size of 640 × 640 pixels.
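For completeness, a hedged sketch of the PyTorch-to-ONNX export step of this pipeline is shown below; the file name, tensor names, and opset version are placeholders, and the TensorRT engine is subsequently built from the exported file with the usual vendor tooling.

```python
import torch

def export_to_onnx(model, path="hyper_yolo.onnx"):
    """Hedged sketch of the PyTorch -> ONNX step at the fixed 640x640 input used here.
    Tensor names and the opset version are placeholders; the TensorRT engine is then
    built from the exported file with vendor tools."""
    model = model.eval()
    dummy = torch.randn(1, 3, 640, 640)
    torch.onnx.export(
        model, dummy, path,
        input_names=["images"], output_names=["preds"],
        opset_version=12,
    )
```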