Do Computer Vision Foundation Models Learn the Low-Level Characteristics of the Human Visual System? (CVPR 2025)

Published: 2025-03-13

Computer vision foundation models, such as DINO or OpenCLIP, are trained in a self-supervised manner on large image datasets. Analogously, substantial evidence suggests that the human visual system (HVS) is influenced by the statistical distribution of colors and patterns in the natural world, characteristics also present in the training data of foundation models. The question we address in this paper is whether foundation models trained on natural images mimic some of the low-level characteristics of the human visual system, such as contrast detection, contrast masking, and contrast constancy. Specifically, we designed a protocol comprising nine test types to evaluate the image encoders of 45 foundation and generative models. Our results indicate that some foundation models (e.g., DINO, DINOv2, and OpenCLIP) share some of the characteristics of human vision, but other models show little resemblance. Foundation models tend to show lower sensitivity to low contrast and rather irregular responses to contrast across frequencies. The foundation models show the best agreement with human data in terms of contrast masking. Our findings suggest that human vision and computer vision may take both similar and different paths when learning to interpret images of the real world. Overall, while differences remain, foundation models trained on vision tasks are beginning to align with low-level human vision, with DINOv2 showing the closest resemblance.

Introduction

Computer vision foundation models, such as DINO [9] or OpenCLIP [22, 32], show an exceptional ability to generalize to different tasks and are becoming cornerstones of many computer vision methods. They owe this performance to self-supervised training on very large image datasets. The human visual system likewise owes much of its capability to years of exposure to the visual world, from infancy through childhood [6]. A question arises: if a neural network and the visual system are both trained by exposure to a large number of images of the world, will they share the same low-level vision characteristics? If they do, we will know that those low-level characteristics arise naturally and likely reflect the statistics of real-world scenes. If they do not, it suggests that human low-level vision characteristics stem from the optical and biological limitations of human vision rather than from natural image statistics. Our analysis is meant to shed some light on how vision, whether biological or computational, may develop from observing samples of the world, and whether the two take the same or different routes to accomplish their respective tasks.

In particular, we are interested in the characteristics that are well understood and measured in human vision science using psychophysical methods: contrast detection [4], contrast masking [25], and contrast constancy [18]. Contrast detection and contrast masking quantify the ability of the visual system to detect small contrast patterns, either on uniform backgrounds (contrast detection) or on backgrounds with patterns (contrast masking). Contrast detection and masking capture the "bottlenecks" of the visual system: the characteristics that prevent us from detecting patterns that are too faint or too small. Similarly, cameras used for computer vision are limited by the MTF of the lens, sensor resolution, and photon and sensor noise, and we can expect that computer vision methods need to deal with similar limitations.
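
As a concrete illustration of the stimuli involved, the sketch below generates a Gabor patch on a uniform background (a contrast detection stimulus) and superimposes it on noise (a masking stimulus). The display parameters (pixels per degree, frequency, contrast, envelope width) are illustrative assumptions rather than the paper's exact settings, and the masker here is broadband Gaussian noise rather than a band-limited one.

```python
import numpy as np

def gabor_patch(size_px, ppd, freq_cpd, contrast, mean_lum=0.5, sigma_deg=1.0):
    """Luminance image of a Gabor patch.

    size_px   : width/height of the square image in pixels
    ppd       : pixels per degree of visual angle (display assumption)
    freq_cpd  : spatial frequency in cycles per degree
    contrast  : Michelson contrast of the sinusoidal carrier
    mean_lum  : mean luminance (0..1 range)
    sigma_deg : std. dev. of the Gaussian envelope, in degrees
    """
    # Coordinates in degrees of visual angle, centered on the patch.
    half = size_px / 2.0
    xs = (np.arange(size_px) - half) / ppd
    xx, yy = np.meshgrid(xs, xs)

    carrier = np.sin(2.0 * np.pi * freq_cpd * xx)           # vertical grating
    envelope = np.exp(-(xx**2 + yy**2) / (2.0 * sigma_deg**2))
    return mean_lum * (1.0 + contrast * carrier * envelope)

# A low-contrast 4 cpd Gabor on a uniform background (contrast detection),
# and the same Gabor superimposed on noise (contrast masking).
rng = np.random.default_rng(0)
test = gabor_patch(size_px=224, ppd=32, freq_cpd=4, contrast=0.02)
noise_mask = 0.1 * rng.standard_normal((224, 224))
masked = np.clip(test + noise_mask, 0.0, 1.0)
```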

Contrast constancy is the term used in vision science to describe the invariance of the visual system to spatial frequency [18] and, partially, to luminance [24, 31]. Georgeson and Sullivan [18] showed that contrast well above the detection threshold (supra-threshold contrast) appears to have the same perceived magnitude regardless of spatial frequency. This is an important characteristic, as it lets us see contrast (and therefore objects) consistently regardless of viewing distance; otherwise, the spatial frequencies reaching the eye would change with viewing distance, and so would the apparent contrast. A partial constancy (invariance) is also observed across luminance [24, 31], though there is a significant deviation from constancy at lower luminance levels, once the visual system has to rely on rod vision. Invariance is also an important feature of many computer vision methods; for example, SIFT features [27] were designed to be invariant to changes in contrast, brightness, scale, and rotation. In our experiments, we used the supra-threshold contrast matching test to assess whether the models exhibit contrast constancy.
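
A supra-threshold contrast matching test can be read out from a black-box encoder along the lines sketched below. Here `encode` (an image-to-feature-vector function) and `make_gabor(freq, contrast)` are hypothetical placeholders, and using the feature-space distance from a blank background as a proxy for perceived contrast is an assumption of this sketch, not necessarily the paper's exact read-out. Contrast constancy would show up as matched contrasts that stay close to the reference contrast across test frequencies.

```python
import numpy as np

def response_magnitude(encode, stimulus, background):
    """Proxy for 'perceived contrast': distance between encoder features of
    the stimulus and of the plain background (an assumption of this sketch)."""
    return np.linalg.norm(encode(stimulus) - encode(background))

def match_contrast(encode, ref_freq, ref_contrast, test_freq, make_gabor,
                   background, contrasts=np.geomspace(1e-3, 1.0, 50)):
    """Find the test-frequency contrast whose model response is closest to the
    response elicited by the reference Gabor (supra-threshold matching)."""
    target = response_magnitude(encode, make_gabor(ref_freq, ref_contrast), background)
    responses = np.array([response_magnitude(encode, make_gabor(test_freq, c), background)
                          for c in contrasts])
    return contrasts[int(np.argmin(np.abs(responses - target)))]
```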

Numerous works on adversarial attacks have demonstrated that the classification performance of deep learning models can be greatly degraded by visually inconsequential changes [20]. At the same time, human vision does not suffer from such adversarial vulnerability [43]. This is one of the most salient arguments put forward to argue that deep architectures differ from human vision. Here, we propose a different methodology to study this question. We consider the deep neural network to be a black box and compare its responses against well-understood and well-measured characteristics of the human visual system. In particular, we want to check whether foundation vision models share the same "bottlenecks" and invariance properties as the visual system. To achieve this, we test foundation models on basic vision stimuli, such as Gabor patches and band-limited noise, and compare the responses of those models with psychophysical data collected from human observers.
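
One simple way to probe the "bottleneck" side with this black-box treatment is to sweep the contrast of a Gabor (with or without a noise masker) and record the lowest contrast at which the encoder's features move away from the background by more than some criterion. The sketch below assumes hypothetical `encode` and `make_stimulus` helpers and a fixed decision criterion; the read-out actually used in the paper may differ.

```python
import numpy as np

def detection_threshold(encode, make_stimulus, background, criterion,
                        contrasts=np.geomspace(1e-3, 0.5, 40)):
    """Estimate the lowest contrast at which the model 'detects' the pattern:
    the first contrast whose feature-space distance from the background
    exceeds a fixed criterion (a simple black-box read-out, assumed here)."""
    bg_feat = encode(background)
    for c in contrasts:
        if np.linalg.norm(encode(make_stimulus(c)) - bg_feat) > criterion:
            return c
    return np.inf  # not detected within the tested contrast range

# Repeating this across Gabor frequencies gives a model contrast sensitivity
# function (1 / threshold vs. frequency) to compare with human data; adding a
# noise masker inside `make_stimulus` turns it into a masking experiment.
```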

In summary, our contributions are as follows:

• We developed a protocol to evaluate the similarity between machine vision models and the human visual system. This protocol covers contrast detection, contrast masking, and contrast constancy, subdivided into nine distinct test types that collectively capture the fundamental low-level characteristics of human vision.

• We tested the image encoders of 45 foundation and generative models. The results reveal similarities between certain foundation models (e.g., DINOv2 and OpenCLIP) and human vision, particularly in the contrast masking test. However, differences persist across other tests.

Figure 6. The quantified similarity between all 45 models and the HVS across the 9 test types. For the Contrast Detection and Contrast Masking tests, Spearman correlation was used as the metric, with higher values (closer to 1) indicating greater similarity to human vision. For the Supra-threshold Contrast Matching test, RMSE was used as the metric, with lower values (closer to 0) indicating better agreement.
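
The two similarity metrics named in the caption could be computed along the following lines; the per-condition inputs (thresholds or matched contrasts) are placeholders, and computing the RMSE in log-contrast units is an assumption of this sketch.

```python
import numpy as np
from scipy.stats import spearmanr

def detection_similarity(model_thresholds, human_thresholds):
    """Spearman rank correlation between model and human detection/masking
    thresholds across conditions (closer to 1 means greater similarity)."""
    rho, _ = spearmanr(model_thresholds, human_thresholds)
    return rho

def matching_similarity(model_matches, human_matches):
    """RMSE between model and human matched contrasts for the supra-threshold
    matching test, computed in log-contrast units (an assumption here);
    lower values mean better agreement."""
    diff = np.log10(np.asarray(model_matches)) - np.log10(np.asarray(human_matches))
    return float(np.sqrt(np.mean(diff ** 2)))
```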