Trust in Vision-Language Models: Insights from a Participatory User Workshop

Reading time: 5 minutes

📝 Abstract

With the growing deployment of Vision-Language Models (VLMs), pre-trained on large image-text and video-text datasets, it is critical to equip users with the tools to discern when to trust these systems. However, examining how user trust in VLMs builds and evolves remains an open problem. This problem is exacerbated by the increasing reliance on AI models as judges for experimental validation, to bypass the cost and implications of running participatory design studies directly with users. Following a user-centred approach, this paper presents preliminary results from a workshop with prospective VLM users. Insights from this pilot workshop inform future studies aimed at contextualising trust metrics and strategies for participants’ engagement to fit the case of user-VLM interaction.

📄 Content

Vision-Language Models (VLMs) represent a methodological shift for learning correspondences between image-text and video-text pairs from large-scale data. Unlike traditional Computer Vision approaches, VLMs reduce reliance on curated, task-specific training datasets, enabling zero-shot inference on previously unseen tasks and categories [1]. Thanks to their remarkable ability to interpret image and video content, these models are being rapidly adopted by society at large. While this unprecedented adoption presents exciting opportunities, it also raises significant concerns. VLMs are inherently difficult to audit, often closed-source, and operate as black-box systems accessible only through indirect observation, steering Artificial Intelligence (AI) research closer to an “ersatz natural science” [2].

These challenges are heightened when VLMs are used in safety-critical environments such as robotic systems, where errors in decision-making can lead to catastrophic consequences [3], as in the case of autonomous transportation [4], inspection and maintenance [5], disaster response [6], and assistive healthcare [7]. Human oversight in evaluating VLM performance in real-world settings is crucial to ensure users can determine how and when to trust these systems.

The EU Ethics Guidelines for Trustworthy AI [8] and AI Act [9] provide a regulatory framework grounded on ethical principles, varying levels of risk, and technical requirements. However, defining Trustworthy AI (TAI) within specific real-world contexts remains contested [10]. The first gap is epistemological, requiring a distinction between trustworthiness, an inherent property of a system’s actual capabilities, and trust, which reflects the user’s perception of trustworthiness [11]. Moreover, the term trust, borrowed from interpersonal contexts, has been directly applied to AI, yet further research and regulation are needed to adapt this notion to human-AI interactions.

Adding to the complexity, the concept of trust overlaps with other key dimensions of Human-AI interaction, such as explainability [12], transparency [13], fairness [14], and accountability [15].

Problem scope. Trust dynamics are hard to define, as they are inherently subjective and dependent on the application context. For example, the level of trust placed in a VLM may differ when it describes a video for entertainment versus one used in a medical diagnosis. Perceived trust also evolves over time, shaped by prior experience, familiarity with the technology, and repeated interactions [16]. Given the broad and multi-faceted nature of the field of TAI, our focus is on examining trust dynamics during users’ interaction with Vision-Language Models. As highlighted in a recent coverage study of work on user-VLM trust [17], the current literature lacks comprehensive studies that directly involve users, especially for applications in Computer Vision and multi-modal AI models. In this paper, we draw on insights from a workshop with users to ground the notion of VLM trust in a concrete use case and gather preliminary requirements for future studies.

The Vision-Language Model learning paradigm typically involves three stages: a pre-training phase, where the model is optimised on large-scale, off-the-shelf data, either labelled or unlabelled; an optional fine-tuning phase, where the model is adapted to domain-specific data; and an inference phase, where model performance is evaluated on downstream tasks. VLMs have gained particular attention for their ability to skip the fine-tuning step, enabling zero-shot inference on unseen tasks or categories without additional training [1]. Zero-shot capabilities in VLMs arise from their design: visual and textual features are extracted via encoder modules in a general-purpose manner, i.e., independently of any specific downstream task. Visual features are typically extracted from images and video frames using either Convolutional Neural Networks (CNNs) or Transformers, while textual features are almost invariably extracted with Transformers.
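The zero-shot inference step can be sketched as follows: the image embedding and the text embeddings for a set of candidate label prompts come from the (frozen) encoders, and prediction is just a nearest-neighbour lookup in the shared embedding space. This is a minimal NumPy sketch; the toy embedding values and all function names are illustrative, not taken from any specific model.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_embedding, text_embeddings, labels):
    """Return the label whose prompt embedding is closest to the image embedding."""
    img = l2_normalize(image_embedding)
    txt = l2_normalize(text_embeddings)
    sims = txt @ img  # cosine similarity of each prompt to the image, shape (num_labels,)
    return labels[int(np.argmax(sims))]

# Toy 2-D embeddings standing in for real encoder outputs (hypothetical values).
labels = ["a photo of a cat", "a photo of a dog"]
text_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
image_emb = np.array([0.9, 0.1])  # closer to the "cat" prompt
print(zero_shot_classify(image_emb, text_embs, labels))  # → a photo of a cat
```

Because no weights are updated at this step, swapping in a new label set requires only encoding new text prompts, which is what makes the inference "zero-shot".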

Correlations between visual and textual features are learned through various pre-training objectives. These include i) contrastive objectives, which optimise embeddings for positioning similar features closer and dissimilar features farther in the vector space; ii) generative objectives, where correlations are learned by generating data within a single modality (e.g., image-to-image generation [18]) or across modalities -e.g., text-to-image generation [19]; iii) alignment objectives, which focus on directly matching corresponding elements in visual and textual inputs (e.g., matching local image regions to words [20]). In the case of videos, models are often trained on iv) temporal objectives, like re-ordering input frames [21].
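As a concrete sketch of objective (i), the following NumPy snippet implements a CLIP-style symmetric contrastive loss over a batch of paired embeddings. This is a minimal illustration under the usual formulation (matched pairs on the diagonal of the similarity matrix); the function names, temperature value, and toy batch are our assumptions, not details from the paper.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss for a batch of matched pairs.

    Row i of image_embs is assumed to describe the same sample as row i of
    text_embs, so matched pairs lie on the diagonal of the similarity matrix.
    """
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) scaled cosine similarities
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2.0
```

With perfectly aligned embeddings the diagonal dominates and the loss approaches zero; shuffling the text rows relative to the images drives it up, which is exactly the signal that pulls similar features together and pushes dissimilar ones apart.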

Earlier VLMs used separate branches for each modality during pre-training as in the original CLIP model [22]. However, recent architectures have shifted to unified designs with a

This content is AI-processed based on ArXiv data.
