Do Images Clarify? A Study on the Effect of Images on Clarifying Questions in Conversational Search
Conversational search systems increasingly employ clarifying questions to refine user queries and improve the search experience. Previous studies have demonstrated the usefulness of text-based clarifying questions in enhancing both retrieval performance and user experience. While images have been shown to improve retrieval performance in various contexts, their impact on user performance when incorporated into clarifying questions remains largely unexplored. We conduct a user study with 73 participants to investigate the role of images in conversational search, specifically examining their effects on two search-related tasks: (i) answering clarifying questions and (ii) query reformulation. We compare the effects of multimodal and text-only clarifying questions on both tasks, within a conversational search context, from several perspectives. Our findings reveal that while participants showed a strong preference for multimodal questions when answering clarifying questions, preferences were more balanced in the query reformulation task. The impact of images varied with both task type and user expertise. In answering clarifying questions, images helped maintain engagement across expertise levels, while in query reformulation they led to more precise queries and improved retrieval performance. Interestingly, for answering clarifying questions, text-only setups yielded better user performance, as participants provided more comprehensive textual answers in the absence of images. These results offer valuable insights for designing effective multimodal conversational search systems, highlighting that the benefits of visual augmentation are task-dependent and should be applied strategically based on the specific search context and user characteristics.
💡 Research Summary
This paper investigates how visual augmentation of clarifying questions influences user behavior and retrieval performance in conversational search (CS). While prior work has shown that text-only clarifying questions can improve both system effectiveness and user experience, the potential benefits of adding images to these questions have not been systematically examined. To fill this gap, the authors conducted a controlled within-subject user study with 73 participants (31 experts, 42 novices) across 22 topics drawn from the ClariQ dataset. Each participant performed two core tasks, (i) answering a clarifying question (CQA) and (ii) reformulating the original query (QR), under two conditions: (a) text-only clarifying questions and (b) multimodal clarifying questions that included a relevant image.
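As a rough illustration of such a 2 (task) × 2 (condition) within-subject design, the sketch below counterbalances condition order across participants. The paper does not specify its exact assignment scheme, so the function and topic-sampling logic here are assumptions for illustration only.

```python
import itertools
import random

# Hypothetical counterbalanced within-subject assignment; names and
# logic are illustrative assumptions, not the authors' procedure.
CONDITIONS = ("text_only", "multimodal")
TASKS = ("CQA", "QR")

def assign(participants, topics, seed=0):
    """Alternate condition order across participants so each performs
    both tasks under both conditions, on randomly sampled topics."""
    rng = random.Random(seed)
    orders = itertools.cycle(itertools.permutations(CONDITIONS))
    return {
        p: [(task, cond, rng.choice(topics))
            for task in TASKS
            for cond in order]
        for p, order in zip(participants, orders)
    }

schedule = assign([f"P{i}" for i in range(1, 74)],
                  [f"topic_{t}" for t in range(22)])
print(schedule["P1"])  # one participant's (task, condition, topic) plan
```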
The study measured subjective satisfaction and engagement via Likert‑scale questionnaires, and objective performance via expert‑rated answer correctness for CQA and standard IR metrics (nDCG@10, MAP) for QR. Three hypotheses guided the work: (H1) multimodal questions increase engagement and satisfaction; (H2) user expertise moderates reliance on images; (H3) images are more useful for queries with inherent visual attributes.
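For reference, the retrieval metrics used for QR are standard rank-based measures. The following minimal Python sketch (illustrative, not the authors' code) computes nDCG@10 and average precision from binary relevance judgments of a single ranked list; MAP is then the mean of AP over all queries in a condition.

```python
import numpy as np

def dcg_at_k(rels, k=10):
    """Discounted cumulative gain over the top-k ranked results."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rels.size + 2))
    return float(np.sum(rels / discounts))

def ndcg_at_k(rels, k=10):
    """nDCG@k: DCG of the ranking normalized by the ideal DCG."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

def average_precision(rels):
    """AP: mean of precision@i over the ranks i of relevant documents."""
    rels = np.asarray(rels) > 0
    if not rels.any():
        return 0.0
    precisions = np.cumsum(rels) / np.arange(1, rels.size + 1)
    return float(precisions[rels].mean())

# Relevance of the top-ranked documents for one reformulated query
# (hypothetical judgments; 1 = relevant, 0 = not relevant).
rels = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
print(ndcg_at_k(rels, k=10), average_precision(rels))
# MAP is the mean of average_precision over all queries in a condition.
```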
Key findings reveal a nuanced, task-dependent impact of images. In the CQA task, participants strongly preferred multimodal questions (78% rated them more intuitive), yet their answer quality was higher in the text-only condition (average correctness 0.71 vs. 0.64). This suggests that while images boost perceived ease, they may constrain users from providing richer textual explanations, especially when the image does not fully capture the required nuance. In the QR task, the opposite pattern emerged: multimodal prompts led to longer, more specific reformulated queries and significantly better retrieval outcomes (nDCG@10: 0.42 vs. 0.35; MAP: 0.38 vs. 0.31). The visual cue acted as an anchor, helping users articulate visual attributes such as style, color, or spatial layout that are hard to describe with text alone.
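Differences like these are typically checked for significance with a paired test over per-topic scores. The sketch below shows one common way to do this, using made-up numbers rather than the study's data.

```python
import numpy as np
from scipy import stats

# Hypothetical per-topic nDCG@10 scores under each condition,
# paired by topic (made-up numbers, not the study's data).
ndcg_text = np.array([0.31, 0.40, 0.28, 0.35, 0.44, 0.30, 0.37, 0.33])
ndcg_multi = np.array([0.39, 0.47, 0.35, 0.41, 0.50, 0.38, 0.42, 0.40])

# Paired t-test across topics; a Wilcoxon signed-rank test is a common
# non-parametric alternative when normality is doubtful.
t_stat, p_value = stats.ttest_rel(ndcg_multi, ndcg_text)
print(f"mean diff = {np.mean(ndcg_multi - ndcg_text):.3f}, p = {p_value:.4f}")
```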
Expertise further moderated these effects. Novice users benefited more from images in both tasks, showing higher engagement scores and larger gains in QR performance. Experts, however, performed equally well—or even slightly better—without images, indicating that they can infer the needed information from text alone and may find images redundant or distracting. Domain analysis showed that topics with strong visual components (e.g., wedding dresses, architectural roofs) narrowed the performance gap in CQA, whereas abstract topics (e.g., political results) showed little impact from images.
The authors discuss these results in light of cognitive load theory and multimodal learning principles. Images reduce cognitive effort by providing immediate visual context, but they must be tightly aligned with the question intent to avoid “visual tunnel vision” that limits textual elaboration. Consequently, system designers should adopt a dynamic, context‑aware strategy: employ images when the task is query reformulation or when the information need is visually grounded, and rely on text‑only prompts for direct answer extraction, especially for expert users or abstract queries.
Design implications include: (1) developing image‑selection algorithms that prioritize relevance to the clarifying question; (2) ensuring that textual explanations complement rather than duplicate visual content; (3) personalizing the multimodal experience based on user expertise and domain. The paper concludes by calling for future work on real‑time user modeling to decide when to present images, and on exploring how image quality, diversity, and presentation format further affect both user experience and retrieval effectiveness.
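To make the first implication concrete, image selection can be framed as ranking candidate images by semantic similarity to the clarifying question in a shared text-image embedding space. The sketch below is a hypothetical outline rather than the authors' implementation; the `select_image` function, the embeddings, and the similarity threshold are all illustrative assumptions.

```python
import numpy as np

def select_image(question_vec, image_vecs, min_sim=0.25):
    """Rank candidate images by cosine similarity to the clarifying
    question embedding; return (index, score) of the best match, or
    None to fall back to a text-only question when nothing is relevant
    enough. The 0.25 threshold is an arbitrary placeholder."""
    q = question_vec / np.linalg.norm(question_vec)
    imgs = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sims = imgs @ q  # cosine similarity per candidate image
    best = int(np.argmax(sims))
    return (best, float(sims[best])) if sims[best] >= min_sim else None

# The embeddings are assumed to come from a shared text-image space
# (e.g., a CLIP-style encoder). Random vectors stand in here, and since
# they are nearly orthogonal this demo will usually fall back to None.
rng = np.random.default_rng(0)
print(select_image(rng.normal(size=512), rng.normal(size=(8, 512))))
```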