Abstract
Figure 1. Would you say the images in Group A are similar to the Reference Image? Current state-of-the-art image similarity models (e.g., LPIPS [1], CLIP [2]) would answer no. These models would say that only Group B is similar to the reference image, because they equate similarity with a high degree of shared perceptual attribute features (i.e., color, shape, semantic class). As humans, however, we would confidently say yes: images in both groups are similar to the reference. While Group B is similar in perceptual attributes, Group A is similar in a more abstract, relational sense (e.g., “transformation of {subject} through time”, first row). In this paper, we propose to model this missing dimension of visual similarity, which we call relational visual similarity, capturing human-like reasoning over relational structures.
Humans do not just see attribute similarity; we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach’s skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate a dataset of 114k image-caption pairs in which the captions are anonymized: they describe the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a vision-language model to measure the relational similarity between images. This model serves as a first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has many real-world applications, existing image similarity models fail to capture it, revealing a critical gap in visual computing.
The ability to perceive and recognize visual similarity is arguably the most fundamental sense for any visual creature, including humans, to interact with and make sense of the world [3,4]. We process visual attributes to guide decisions: recognizing that a peach is red might signal that it is edible. We also notice similarities across different objects (e.g., shape, color, texture) to categorize, remember, and abstract them: an apple and a peach are both red and round, so they are likely both fruits. Beyond this, we can also see relational similarity: we abstract familiar patterns to understand more complex or unseen phenomena. For example, we can anticipate that the Earth is like a peach, as its layers (crust, mantle, and core) roughly correspond to the peach’s skin, flesh, and pit, even though no one has directly observed them. In cognitive science, attribute similarity and relational similarity are often considered the two central pillars of human similarity perception [5,6]. Attribute similarity underlies everyday activities (e.g., recognition [7], classification [8], memorization [9]), while relational similarity fuels reasoning and creativity (e.g., analogies [10], abstract thought [11]). Some researchers argue that relational similarity is even more central to human cognition, as it drives analogical learning and creativity, the traits that set humans apart from other intelligent species [12-14].
Unfortunately, current state-of-the-art visual similarity frameworks focus almost exclusively on attribute-level similarity. Traditionally, image similarity in computer vision has been framed as the task of comparing two images and deciding whether they are visually similar, typically at the pixel or feature level using handcrafted descriptors [15,16]. In recent years, large-scale hierarchical datasets (e.g., ImageNet [17]) and cross-modal datasets (e.g., LAION-2B [18]) have enabled deep learning models to move beyond low-level visual details. Modern approaches (e.g., [2,19-23]) can recognize different images of the same semantic class, or images that match a rough textual description (for example, “a photo of matchsticks”), even if they differ in shape, color, or other low- to mid-level details (Fig. 1, Group B, first row).
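The attribute-level similarity these models compute typically reduces to cosine similarity between embedding vectors: each image is encoded into a feature vector, L2-normalized, and compared by dot product. The sketch below uses toy vectors in place of real encoder outputs (CLIP, DINO, etc.) to show why an embedding that encodes perceptual attributes scores a visually different but relationally similar image low.

```python
# Sketch of attribute-level similarity scoring, assuming embeddings
# behave like CLIP/DINO features. The vectors here are toy stand-ins,
# not outputs of any real encoder.
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine similarity: dot product of L2-normalized vectors."""
    x = x / np.linalg.norm(x)
    y = y / np.linalg.norm(y)
    return float(x @ y)

ref = np.array([0.9, 0.1, 0.3])       # hypothetical reference-image embedding
group_b = np.array([0.8, 0.2, 0.35])  # shares perceptual attributes with ref
group_a = np.array([0.1, 0.9, 0.2])   # relationally similar, visually different

print(cosine_similarity(ref, group_b))  # high score: attributes match
print(cosine_similarity(ref, group_a))  # low score: relational link is missed
```

Because the score depends only on where the encoder places images in feature space, any relational correspondence the encoder does not represent is invisible to the metric, which is precisely the gap Figure 1 illustrates.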
However, by focusing primarily on surface-level features, these models struggle to capture relational similarity (see Sec. 4.2). For instance, they cannot easily recognize that the burning stages of