Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding

Reading time: 5 minutes

📝 Original Info

  • Title: Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding
  • ArXiv ID: 2511.19005
  • Date: 2025-11-24
  • Authors: …

📝 Abstract

Spoken Language Understanding (SLU) consists of two subtasks: intent detection (ID) and slot filling (SF). Given its broad range of real-world applications, enhancing SLU for practical deployment is increasingly critical. Profile-based SLU addresses ambiguous user utterances by incorporating context awareness (CA), user profiles (UP), and knowledge graphs (KG) to support disambiguation, thereby advancing SLU research toward real-world applicability. However, existing SLU datasets still fall short in representing real-world scenarios. Specifically, (1) CA is represented with one-hot vectors, which is overly idealized, and (2) models typically focus solely on predicting intents and slot labels, neglecting the reasoning process that could enhance performance and interpretability. To overcome these limitations, we introduce VRSLU, a novel SLU dataset that integrates both Visual images and explicit Reasoning. For over-idealized CA, we use GPT-4o and FLUX.1-dev to generate images reflecting users' environments and statuses, followed by human verification to ensure quality. For reasoning, GPT-4o is employed to generate explanations for predicted labels, which are then refined by human annotators to ensure accuracy and coherence. Additionally, we propose an instructional template, LR-Instruct, which first predicts labels and then generates corresponding reasoning. This two-step approach helps mitigate the influence of reasoning bias on label prediction. Experimental results confirm the effectiveness of incorporating visual information and highlight the promise of explicit reasoning in advancing SLU.


📄 Full Content

Spoken Language Understanding (SLU) is a fundamental component of task-oriented dialogue (TOD) systems (Ni et al. 2023). It generally comprises two sub-tasks: intent detection (ID) and slot filling (SF) (Tur and De Mori 2011). ID aims to identify the user's underlying intent, while SF focuses on extracting entities associated with that intent. Together, the predicted intent and slot labels form the basis for generating appropriate system responses and executing subsequent actions within TOD systems. Given its wide applicability in domains such as smart speakers, virtual assistants, and smart home systems, SLU has attracted increasing research attention in recent years (Qin et al. 2021b;Weld et al. 2022;Muhammad et al. 2025).
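To make the two sub-tasks concrete, the toy example below shows how a single utterance might be annotated for ID and SF. The intent name, slot schema, and BIO tags are illustrative assumptions, not labels from VRSLU or any specific dataset.

```python
# Toy SLU annotation: one intent label per utterance, one BIO slot tag per token.
# The label names below are hypothetical and only illustrate the two sub-tasks.
utterance = "Book a flight to Paris tomorrow"
tokens = utterance.split()

sample = {
    "utterance": utterance,
    "intent": "BookFlight",                 # intent detection (ID): utterance-level label
    "slots": ["O", "O", "O", "O",           # slot filling (SF): token-level BIO tags
              "B-destination", "B-date"],
}

assert len(sample["slots"]) == len(tokens)  # SF tags must align one-to-one with the tokens
print(sample["intent"], list(zip(tokens, sample["slots"])))
```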

Previous research on SLU can be broadly categorized into two stages. The first stage focuses on traditional SLU under idealized conditions, where user utterances are clear, explicit, and unambiguous. In such settings, models are able to accurately identify intent and slot information based solely on the utterance. Existing methods have already achieved promising results under these conditions (Qin et al. 2021a;Chen et al. 2022;Wu et al. 2024a). However, this scenario does not reflect the complexity of real-world applications, where user utterances are often vague or ambiguous. For example, given the utterance “Play Martial Universe by Tiancan Tudou on the smart screen”, it is difficult for the model to determine whether the user is requesting music, a video, or an audiobook. To address such challenges, Xu et al. (2022) demonstrated that traditional SLU methods struggle with ambiguous inputs and proposed Profile-based SLU (ProSLU). This task requires the model to integrate additional sources of information to accurately infer user intent from ambiguous utterances. These sources include: Context Awareness (CA), which reflects the user’s current environment and state; User Profile (UP), which captures personal preferences and attributes; and Knowledge Graphs (KG), which provide knowledge about entities in the utterance. Although ProSLU marks an important step toward handling utterance ambiguity in real-world scenarios, significant discrepancies still remain between current research efforts and practical deployment:
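A hedged sketch of what a profile-based input might look like, using the ambiguous utterance from the paragraph above. The field names and the concrete CA/UP/KG values are assumptions for illustration; ProSLU's actual schema may differ.

```python
# Illustrative ProSLU-style sample: an ambiguous utterance plus the three
# supporting information sources used for disambiguation.
proslu_sample = {
    "utterance": "Play Martial Universe by Tiancan Tudou on the smart screen",
    # Context Awareness (CA): the user's current environment and state,
    # represented in ProSLU as one-hot style categorical scores.
    "context_awareness": {"Home": 1.0, "Walking": 1.0},
    # User Profile (UP): personal preferences and attributes.
    "user_profile": {"preferred_media": "video"},
    # Knowledge Graph (KG): facts about entities mentioned in the utterance.
    "knowledge_graph": [
        ("Martial Universe", "author", "Tiancan Tudou"),
        ("Martial Universe", "adapted_as", "TV series"),
    ],
}
```

Given such a sample, the model resolves the ambiguity (music vs. video vs. audiobook) by combining the utterance with the CA, UP, and KG evidence rather than relying on the utterance alone.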

• Overly idealized representation of CA: In ProSLU, CA is represented using one-hot vectors (e.g., "Home: 1.0, … Walking: 1.0, …"), as illustrated in Figure 1(a). However, such representations are overly simplified and fail to capture the complexity of real-world environments. In practice, CA is more realistically derived from visual inputs, where the model must infer the user's surroundings and state from images, as shown in Figure 1(b), rather than relying on predefined categorical encodings (a minimal contrast is sketched after this list).

• Lack of explicit reasoning: Traditional SLU and ProSLU systems produce only intent and slot predictions without providing the underlying reasoning process. This limitation overlooks the potential benefits of reasoning for improving both accuracy and interpretability. Explicit reasoning not only facilitates a deeper understanding of user intent but also enhances the transparency and trustworthiness of SLU systems, which is crucial for real-world deployment.
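A minimal sketch of the representational shift argued for in the first point: CA as predefined one-hot categories versus CA as a visual scene the model must interpret itself. The file path and field names are hypothetical.

```python
# ProSLU-style CA: predefined categorical encoding (overly idealized).
ca_one_hot = {"Home": 1.0, "Office": 0.0, "Walking": 1.0, "Driving": 0.0}

# VRSLU-style CA: a scene image; the user's surroundings and state must be
# inferred by the model rather than being given as categories.
ca_visual = {
    "scene_image": "scenes/user_0421.png",  # hypothetical path to the generated image
    "inferred_state": None,                 # filled in by the MLLM at inference time
}
```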

To address the two challenges discussed above and advance SLU research toward real-world applicability, we construct a new SLU dataset that incorporates both Visual scenes and Reasoning, VRSLU. To mitigate the issue of overly idealized CA representation, we first empirically demonstrate that CA significantly enhances the model's understanding of user utterances, underscoring its importance. We further argue for the use of images instead of one-hot vectors. Then, we use GPT-4o to convert the one-hot encoded CA vectors into descriptive paragraphs, and generate corresponding images using FLUX.1-dev. All generated images undergo manual verification to ensure semantic alignment and visual quality. To address the lack of explicit reasoning, we input the utterance, CA, UP, KG, and the corresponding intent and slot labels into GPT-4o, prompting it to generate explanatory reasoning for each label, as illustrated in Figure 1(c). All reasoning outputs are manually reviewed and refined to ensure both accuracy and clarity. Furthermore, we propose LR-Instruct, an instructional method that first predicts labels and subsequently generates reasoning. This two-step approach effectively mitigates the impact of reasoning deviations on label prediction in multimodal large language models (MLLMs). Experimental results demonstrate the effectiveness of our approach across multiple MLLMs. Finally, we show that both visual images and explicit reasoning play critical roles in accurately understanding user requests.
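The LR-Instruct idea can be sketched as a two-step prompting loop: query the MLLM once for labels only, then query it again for reasoning conditioned on those labels, so that errors or biases in the generated reasoning cannot contaminate label prediction. The prompt wording and the `query_mllm` helper below are assumptions for illustration; the paper's exact template will differ.

```python
def query_mllm(prompt: str, image_path: str) -> str:
    """Placeholder for a call to a multimodal LLM (API or local model)."""
    raise NotImplementedError


def lr_instruct(utterance: str, scene_image: str, user_profile: str, kg_facts: str) -> dict:
    # Step 1: label prediction only; no reasoning is requested at this stage.
    label_prompt = (
        "Given the user's utterance, the scene image, the user profile, and the "
        "knowledge-graph facts, output the intent label and the slot labels.\n"
        f"Utterance: {utterance}\nUser profile: {user_profile}\nKG facts: {kg_facts}"
    )
    labels = query_mllm(label_prompt, scene_image)

    # Step 2: reasoning is generated after, and conditioned on, the predicted labels.
    reasoning_prompt = (
        f"You predicted the following labels:\n{labels}\n"
        "Explain step by step why these labels fit the utterance and the scene."
    )
    reasoning = query_mllm(reasoning_prompt, scene_image)

    return {"labels": labels, "reasoning": reasoning}
```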

The contributions of this work are as follows:

  1. To extend SLU to more practical scenarios, we construct a novel dataset, VRSLU, which is the first to include both visual scene images and explicit reasoning processes.

  2. We propose LR-Instruct, an instructional template that first predicts labels and then generates the corresponding reasoning, mitigating the influence of reasoning bias on label prediction in MLLMs.

Reference

This content is AI-processed based on open access ArXiv data.
