Conversational agents often encounter ambiguous user requests, requiring effective clarification to complete tasks successfully. While recent real-world applications favor multi-agent architectures to manage complex conversational scenarios efficiently, ambiguity resolution remains a critical and underexplored challenge, particularly because it is difficult to determine which agent should initiate a clarification and how agents should coordinate their actions when faced with uncertain or incomplete user input. The fundamental questions of when to interrupt a user and how to formulate the optimal clarification query in a multi-agent setting remain open. In this paper, we propose MAC (Multi-Agent Clarification), an interactive multi-agent framework specifically optimized to resolve user ambiguities by strategically managing clarification dialogues. We first introduce a novel taxonomy categorizing user ambiguities to systematically guide clarification strategies. We then present MAC, which autonomously coordinates multiple agents to interact synergistically with users. Empirical evaluations on MultiWOZ 2.4 demonstrate that enabling clarification at both levels increases the task success rate by 7.8% (54.5 → 62.3) and reduces the average number of dialogue turns (6.53 → 4.86) by eliciting all required user information up front and minimizing repetition. Our findings highlight the importance of active user interaction and role-aware clarification for more reliable human-agent communication.
Figure 1: An example from MultiWOZ 2.4. User: “Hey! I need to book a restaurant for this Friday at 8pm.” Manager agent, Think: “The user's query is missing the group size, which is necessary for any booking. Cuisine details are domain-specific, so I will leave that to the expert.” Clarify: “How many people will be dining?” User: “There will be 7 of us.” The manager agent first identifies and resolves only high-level, domain-agnostic ambiguity (group size), explicitly leaving domain-specific clarifications (such as cuisine and restaurant selection) to the domain expert, which may require domain knowledge from the database. This dialogue illustrates how each agent's role is confined to its designated scope: the manager collects general requirements, while the expert gathers and confirms specialized details before completing the reservation.

Effective user clarification is fundamental to conversational agents, significantly impacting their ability to fulfill user requests accurately and efficiently (Aliannejadi et al., 2019). In natural interactions, users often express ambiguous queries, intentionally or unintentionally omitting details that seem inferable or contextually obvious. Such ambiguity can cause agents to make incorrect assumptions, provide incomplete responses, or even fail to accomplish tasks, issues that are especially critical in high-stakes domains such as healthcare, finance, and customer support. Proactively resolving ambiguities through targeted user interactions, by asking clear and relevant clarification questions, can substantially enhance the accuracy of task execution, user satisfaction, and the overall effectiveness of conversational systems (Deng et al., 2023; Acikgoz et al., 2025e).
In single-agent systems, the challenge of ambiguity resolution has been previously studied with different strategies (Dongre et al., 2024), from asking targeted questions (Li et al., 2023; Zhang and Choi, 2025) to inferring user preferences from past interactions (Andukuri et al., 2024). However, the landscape of conversational AI is rapidly evolving towards more complex, multi-agent architectures, especially in industrial settings where a single agent cannot efficiently manage the large number of APIs and multitasking demands (Sun et al., 2025). As a result, manager-expert routing systems are becoming the standard for handling real-world tasks (Guo et al., 2024; Tran et al., 2025). This paradigm, often featuring a “manager” or “advisory” agent that routes requests to specialized “expert” agents, introduces new layers of complexity for user interaction (Ong et al., 2025). In such a setup, determining the optimal moment and method for clarification becomes a significant challenge. For instance, should the high-level advisory agent, which first receives the user’s request, interrupt for clarification, or should this be delegated to a domain-specific expert agent, potentially increasing latency and conversational turns? Moreover, deciding how much domain-specific knowledge the manager should possess introduces another design challenge, establishing an essential knowledge boundary. Therefore, proposing an approach that effectively manages ambiguity resolution while remaining independent of this specific design choice is crucial for creating flexible and robust multi-agent systems.
To explore these open questions, we introduce MAC (Multi-Agent Clarification), the first framework that focuses on resolving user ambiguities within multi-agent conversational systems and aims to uncover how, when, and by whom user clarifications should be initiated in these settings (see Figure 1). The framework strategically determines not only the moment to seek clarification but also which agent, the supervisor or the domain-specific expert, is best positioned to ask. Our experiments on MultiWOZ 2.4 (Ye et al., 2022) reveal that the placement and timing of clarification matter more than previously recognized. First, we show that enabling clarification at both levels delivers higher task success, a 7.8% absolute gain over the no-clarification baseline, while also reducing the average number of conversational turns. Second, the optimal coordinated setup, in which the supervisor manages high-level ambiguities and the expert agent resolves domain-specific ones, delivers the highest performance, outperforming previous state-of-the-art TOD approaches on MultiWOZ by 11.50%. Effective clarification is therefore not merely about “asking more questions”, but about delegating the right questions to the right agents at the right time.
The main contributions of this work are:
- A taxonomy of user ambiguities in task-oriented dialogue that systematically guides when and how clarification questions should be asked.
- MAC, a multi-agent clarification framework that coordinates a supervisor and domain-specific experts, assigning each agent role-appropriate clarification responsibilities.
- An empirical study on MultiWOZ 2.4 showing that enabling clarification at both the supervisor and expert levels yields the highest task success while reducing the average number of dialogue turns.
Asking Clarification Questions Asking clarification questions has been studied in conversational AI research, with distinct focuses on when to ask and what to ask (Kuhn et al., 2022). Some approaches use uncertainty estimation or information-theoretic models to decide when to initiate clarification (Zhang and Choi, 2023). More advanced frameworks attempt to address the when and what to ask problems jointly (Andukuri et al., 2024; Zhang et al., 2024), but they are often limited to a small number of conversational turns, which is insufficient for complex, real-world tasks. Closest to our approach, ReSpAct (Dongre et al., 2024) enables clarification with rule-based prompting, but operates in a vanilla single-agent setting. This fails to address the challenges of production systems, such as smart home platforms, which operate as complex multi-agent systems (Guo et al., 2024). Our work fills this critical gap by proposing a novel framework for coordinating clarification strategies across multiple specialized agents to ensure a coherent and efficient user experience close to real-world settings.

Large Language Models for Task-Oriented Dialogue Recent progress in LLMs has led to their adoption in multi-domain TOD systems (Hudeček and Dusek, 2023). Existing approaches typically rely on either prompting-based methods (Hu et al., 2022; Chung et al., 2023; Xu et al., 2024) or specialized fine-tuning (Hosseini-Asl et al., 2020; Yang et al., 2021; Zhong et al., 2023; Sun et al., 2023; Bang et al., 2023; Li et al., 2024). Fine-tuned models are often tailored to narrow tasks such as state tracking or offline benchmarks, and as a result they struggle to generalize to complex, real-world, multi-turn conversations (Acikgoz et al., 2025a). More recently, AutoTOD (Xu et al., 2024) demonstrated the use of GPT-4 with domain-specific, hand-crafted prompts and pre-defined APIs, but this approach depends heavily on lengthy instructions and lacks adaptability. Some recent studies have begun to explore multi-agent architectures for TOD (Gupta et al., 2024), but they do not address user clarification, an essential skill for handling ambiguous or incomplete user requests in practical settings. In contrast, our work introduces the first multi-agent system explicitly focused on asking user clarification questions in multi-turn TOD, establishing an optimal framework for more reliable and user-centric dialogue agents.
MultiWOZ 2.4 In our research, we utilize MultiWOZ 2.4 (Ye et al., 2022), a comprehensive multi-domain dialogue benchmark designed for task-oriented dialogue (TOD) systems. It contains multi-turn conversations between a user and a system simulating a dialogue assistant. The user is given a goal (e.g., book a hotel and a restaurant in the same area), while the system must fulfill the request using a consistent belief state and database API. Each dialogue is annotated with dialogue acts, belief states, and system actions at each turn, enabling full end-to-end modeling and evaluation. It consists of ~8,500 training dialogues and a test set of 1,000 dialogues. The test conversations within MultiWOZ 2.4 simulate customer service interactions across five distinct domains: restaurant, hotel, train, attraction, and taxi. To create more realistic multi-turn scenarios, we enhance this dataset by incorporating a user simulator that plays the role of the user, making the tasks both more challenging and more authentic.
Task During evaluation, each dialogue involves up to five domains: restaurant, hotel, train, attraction, and taxi. The agent must understand the user’s multi-intent goals, track the evolving belief state, issue database queries, and generate appropriate system responses. Crucially, user queries often underspecify constraints (e.g., “Book a restaurant for dinner”), requiring the agent to proactively request clarification (e.g., number of people, cuisine, or time), which makes this a well-suited environment to test our multi-agent approach.
User Simulator Our experimental setup involves a user simulator, which we implemented to interact with the agent in a multi-turn conversational flow. The simulator is tasked with pursuing predefined user goals, while the agent’s objective is to help the user achieve these goals by interacting with an external database. We chose MultiWOZ 2.4 over other benchmarks because it can handle five distinct domains simultaneously within a single conversational context. The available actions for each domain are detailed in Appendix Table 6.
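To make the evaluation loop concrete, the sketch below shows one possible simulator-agent rollout; `UserSimulator` and `Agent` are hypothetical interfaces with illustrative method names, not the released implementation, and the 20-turn cap mirrors the conversation-length limit used by the domain experts.

```python
# Minimal sketch of the simulator-agent rollout loop; `UserSimulator` and `Agent`
# are hypothetical interfaces, not the paper's released code.

MAX_TURNS = 20  # conversation-length cap, matching the limit used for the domain experts

def rollout(user_sim, agent, goal):
    """Run one multi-turn dialogue until the goal is judged complete or the turn cap is hit."""
    history = []
    user_msg = user_sim.first_utterance(goal)       # simulator opens with (part of) its goal
    for _ in range(MAX_TURNS):
        history.append({"role": "user", "content": user_msg})
        agent_msg = agent.respond(history)          # may be a clarification question or an answer
        history.append({"role": "assistant", "content": agent_msg})
        if user_sim.goal_completed(history, goal):  # e.g., checked by an LLM judge
            break
        user_msg = user_sim.reply(history, goal)    # simulator answers clarifications, adds constraints
    return history
```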
Ambiguity in user requests is a central challenge for conversational agents, yet there is limited empirical guidance on where and how clarification should be initiated in multi-agent dialogue systems. To address this, we systematically investigate different agent-level strategies for user clarification in a hierarchical multi-agent architecture comprising a manager/router and multiple domain-specific experts (see Figure 2).
We adopt a centralized multi-agent setting as our base architecture, consisting of a single supervisor agent and multiple specialized domain expert agents. The supervisor agent is responsible for orchestrating the overall dialogue flow by routing each user request to the most relevant domain expert, following the router-based approach of Ong et al. (2025). Each expert agent is specialized for one of the five domains in MultiWOZ 2.4 and is tasked with executing domain-specific actions to fulfill user goals. In MAC, we further enhance this framework by integrating user clarification mechanisms to resolve ambiguities. This involves assigning specific clarification-handling capabilities to both the supervisor and the domain expert roles, as in Table 1, enabling them to manage different forms of uncertainty and improve final task outcomes.
Supervisor Agent In the MAC framework, the Supervisor agent is responsible for two tasks: (i) orchestrating agent collaboration by routing user queries to the appropriate domain expert, and (ii) handling top-level clarification of user requests when the ambiguity can be resolved with general commonsense reasoning, independent of domain-specific knowledge (Table 1, top). Formally, for each incoming user query u, the Supervisor evaluates an ambiguity function is_ambiguous(u) ∈ {0, 1}: if is_ambiguous(u) = 1, the agent issues a clarification prompt to the user using the standardized question format; otherwise, it selects the appropriate domain expert via the domain label. This prompt-based control flow is illustrated in Figure 7, where the Supervisor’s output is parsed and dispatched to downstream agents. Notably, the Supervisor operates without access to domain-specific databases or APIs, ensuring that only non-domain-specific ambiguities (e.g., group size or intent) are addressed at this stage. After resolving high-level ambiguities, the Supervisor delegates the (potentially clarified) user request to the corresponding domain expert, enabling more efficient and role-aware collaboration across the agent hierarchy.

Domain Expert Agents Each Domain Expert agent is responsible for executing user goals within a specific task domain. We instantiate five expert agents, corresponding to the five domains in MultiWOZ 2.4: restaurant, hotel, train, taxi, and attraction. Once a user query is routed to a domain expert, the agent analyzes the input, potentially enriched by prior supervisor-level clarification, and determines whether the information is sufficient to proceed with accurate API calls or reliable response generation. To guide this behavior, we prompt each expert individually with domain-specific instructions coupled with standardized protocols for user clarification (see Figure 8), following a predefined expert-specific clarification taxonomy (Table 1, bottom). Similar to the Supervisor, the agent computes an ambiguity function is_ambiguous(u) ∈ {0, 1}; if the result is 1, the agent triggers a clarification request in the standardized question format. If the input is deemed sufficient, the agent executes the necessary domain-specific operations and responds using the structured utterance format. Otherwise, the agent is allowed to ask multiple clarification questions until the total conversation length exceeds 20 turns. These prompt-structured outputs allow the framework to dynamically interleave reasoning, clarification, and execution in multi-turn interactions. Domain Experts have access to the API schemas and databases of their domain, enabling them to ground their responses in task-specific constraints and complete user requests accurately.
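For concreteness, the per-turn control flow described above (and formalized in Algorithm 1 in the appendix) can be sketched as follows; `Supervisor`, the `experts` mapping, and `TurnOutcome` are hypothetical wrappers around the prompted LLM calls, while `is_ambiguous`, `ask_clarification`, and `select_domain` mirror the functions named in the text.

```python
# Illustrative sketch of MAC's per-turn control flow (not the released implementation).
from dataclasses import dataclass

@dataclass
class TurnOutcome:
    kind: str   # "clarify" or "respond"
    text: str

def mac_step(user_query, supervisor, experts):
    """One MAC turn: supervisor-level clarification, routing, then expert-level handling."""
    # (i) The supervisor handles domain-agnostic ambiguity (e.g., missing group size, unclear intent).
    if supervisor.is_ambiguous(user_query):
        return TurnOutcome("clarify", supervisor.ask_clarification(user_query))

    # (ii) Otherwise it routes the (possibly already clarified) query to one of the five experts.
    domain = supervisor.select_domain(user_query)   # restaurant / hotel / train / attraction / taxi
    expert = experts[domain]

    # (iii) The expert checks for domain-specific ambiguity (e.g., underspecified API slots).
    if expert.is_ambiguous(user_query):
        return TurnOutcome("clarify", expert.ask_clarification(user_query))

    # (iv) With sufficient information, the expert grounds its answer in its database/APIs.
    return TurnOutcome("respond", expert.execute(user_query))
```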
To elucidate the core principles of MAC, we conduct a systematic analysis of different strategies across the experimental design choices detailed in Figure 2.
In our MAC framework, we use gpt-4o-2024-08-06 as the base configuration for both the supervisor and each expert, unless otherwise specified; we also conduct comprehensive ablation studies on the effect of model choice for both nodes in Section 5.4. We evaluate on the MultiWOZ 2.4 test split, which contains 1,000 test samples from five domains: restaurant, hotel, train, attraction, and taxi. The evaluation is performed in online sessions with a user simulator based on gpt-4o-2024-08-06, as defined in Section 3. To account for LLM randomness, we run each experiment five times and report the Success Rate as Avg@5 with standard deviations, and also as Max@5, the maximum score achieved across the five runs. Further details about the evaluation metrics are given in Section A.
MAC is the first LLM-based multi-agent framework specifically designed for user clarification. To evaluate its effectiveness, we compare it against three variants of the same multi-agent architecture: (i) without any user clarification (see Figures 5 and 6 for baseline supervisor and expert prompts without clarification), (ii) with user clarification handled only by the Supervisor, and (iii) with user clarification enabled only for the domain experts (see Figure 2). In setting (i), neither the Supervisor nor the domain experts are instructed to ask clarification questions. In setting (ii), only the Supervisor is prompted to perform both routing and user clarification, while the domain experts are limited to responding after routing. In setting (iii), only the domain experts are prompted to ask clarification questions, and the Supervisor is responsible solely for routing. In contrast, MAC enables user clarification at both the Supervisor and domain expert levels, allowing every agent node to interact with the user as needed. This setup allows for a fair and systematic evaluation of the individual and combined effects of clarification skills across different nodes.

Table 2 (columns: Method, Clarification, Success (Max@5 ↑), Success (Avg@5 ↑), Avg. Turns). We report Success Max@5: the highest single-run task success rate out of five runs; Success Avg@5: the mean and standard deviation of success rates over five runs; and Avg. Turns: the average number of dialogue turns per conversation (lower is better). Each row corresponds to a specific agent configuration: clarification enabled for the expert, the supervisor, both, or neither. Results demonstrate that enabling clarification for both supervisor and expert agents leads to the highest task success and the most efficient dialogues.
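The four settings above differ only in which agent roles are permitted to ask clarification questions; a minimal way to express this, assuming hypothetical flag names, is:

```python
# Hypothetical toggles for the four settings compared in Table 2 (illustrative only).
from dataclasses import dataclass

@dataclass(frozen=True)
class ClarificationConfig:
    supervisor_clarifies: bool
    experts_clarify: bool

VARIANTS = {
    "no_clarification": ClarificationConfig(False, False),  # setting (i)
    "supervisor_only":  ClarificationConfig(True,  False),  # setting (ii)
    "expert_only":      ClarificationConfig(False, True),   # setting (iii)
    "mac":              ClarificationConfig(True,  True),   # MAC: both levels may clarify
}
```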
We compare MAC against three variants: (i) MAC without any clarification capability, (ii) MAC where only domain-specific experts perform clarifications (MAC_expert), and (iii) MAC where only the supervisor at the top node initiates clarification questions (MAC_supervisor).

MAC's performance compared to existing TOD systems (Success Rate, %). Results for baseline models are sourced from AutoTOD (Xu et al., 2024), following the same evaluation protocol to ensure a fair and consistent comparison.
(Yang et al., 2021): 26.80
GALAXY (Zhong et al., 2023): 28.80
MARS (Sun et al., 2023): 27.50
TOATOD (Bang et al., 2023): 26.90
FNCTOD (Li et al., 2024): 44.40
AutoTOD (Xu et al., 2024): 46.90
MAC: 58.40

These findings highlight the benefit of proactive conversational strategies and the MAC framework's superior performance, effectively balancing accuracy with conversational efficiency.
Takeaway 1: Asking clarification questions consistently increases task success and decreases the number of turns needed to solve the task in multi-agent settings.
In Table 2, we compare MAC’s performance with its variants, showing that combining clarification capabilities across the supervisor and experts results in the most effective setup.

To further contextualize MAC’s performance, it is crucial to benchmark against other leading task-oriented dialogue (TOD) systems. Following the evaluation framework of AutoTOD (Xu et al., 2024), we present this comparison above; the results underscore the advantage of the multi-agent architecture in multi-domain scenarios such as MultiWOZ 2.4 and the critical role of proactive clarification when handling uncertainties.¹
Takeaway 2: MAC demonstrates superior performance over previous TOD systems, attributed to its multi-agent architecture and effective user clarification capabilities.
How does the choice of LLM affect MAC? Since multi-agent setups are typically constructed by prompting multiple LLMs, it is valuable to evaluate the performance of diverse LLMs within our MAC framework. To this end, we experimented with proprietary API-based and open-source models: GPT-4o-2024-11-20 (Hurst et al., 2024), gpt-4o-mini, and Qwen3-235B-A22B (Yang et al., 2025). As shown in Table 4, enabling coordinated user clarification for both supervisor and expert agents in the MAC framework consistently improves task success rates, regardless of model type. For instance, gpt-4o and gpt-4o-mini achieve absolute improvements of +4.68 and +4.70 points, respectively, when equipped with clarification. Notably, the open-source Qwen3-235B-A22B model exhibits an even larger gain of +7.18 points, narrowing the gap with its proprietary counterparts. The larger accuracy delta for open-source LLMs suggests that well-designed supervision and agent coordination can unlock their potential, making them competitive candidates alongside proprietary models for agentic systems in practice.

¹ Some of the earlier TOD systems compared above were developed prior to the integration of LLMs and follow fundamentally different pipelines (Acikgoz et al., 2025c), making direct comparisons not fully fair. Nevertheless, this comparison aims to contextualize MAC’s performance and illustrate the overall progress of TOD systems over time.
Takeaway 3: Enabling user clarification for both supervisor and expert agents in MAC consistently improves performance regardless of model type, even with open-source models.
Takeaway 4: Proactive interaction and effective agent coordination yield the highest accuracy gains for open-source LLMs, making them strong alternatives to proprietary models in agentic systems.

The multi-agent approach benefits more from user clarification, achieving the highest performance.
Figure 3 presents a comparative analysis of single-agent and multi-agent systems using GPT-4o-2024-11-20, examining their performance with and without user clarification. Our findings demonstrate that clarification enhances success rates in both setups; however, the improvement is notably more pronounced in the multi-agent configuration. Specifically, the Multi-Agent Clarification (MAC) system outperforms Single-Agent Clarification (SAC) by 6% (52.4% → 58.4%), highlighting the advantage of separating responsibilities between a Supervisor, responsible for general, high-level ambiguities, and domain-specific Experts who handle specialized clarifications. Moreover, our multi-agent setup consistently achieves higher success rates than the single-agent approach even in scenarios without clarification, emphasizing the inherent benefit of distributing workload among coordinated agents rather than overloading a single agent with multiple roles. This superior performance, particularly under clarification conditions, underscores the advantages of modularity in resolving ambiguities effectively and efficiently scaling to complex, multi-domain interactions. In contrast, single-agent systems face diminishing returns and decreased interpretability as complexity grows due to the necessity of internalizing diverse expertise. Additionally, the modular structure of MAC facilitates incremental updates and streamlined integration of new domains, enhancing its robustness and practical applicability in real-world scenarios, as demonstrated by systems such as the model context protocol (MCP) (Hou et al., 2025).
Takeaway 5: The multi-agent setup outperforms the single-agent setup both with and without clarification, suggesting that it offers improved scalability and efficiency, as each agent operates with a more focused and concise context.
To better understand the roles of different clarification strategies and their correspondence to the ambiguity types defined in the taxonomy, we cluster the supervisor’s taxonomy into high-level Ambiguity and Vagueness Handling (encompassing Domain Ambiguity, Intent Ambiguity, Vague Goal Specification, and Contextual Disambiguation) and the expert’s taxonomy into Slot/Parameter Uncertainty (covering Parameter Underspecification and Value Ambiguity/Vagueness). As shown in Table 5, based on GPT-4o-2024-11-20 outputs, ablating Ambiguity and Vagueness Handling from the supervisor yields a substantial drop in task success rate (-6.20), highlighting the importance of proactively resolving common-sense ambiguities before delegating to domain-specific experts. In contrast, removing Slot/Parameter-Blocking Clarification from the experts causes a smaller but still notable accuracy decline (-2.18), confirming that careful handling of underspecified or vague user slots is also essential, especially for API calls requiring precise parameters. Both forms of clarification are therefore necessary, but high-level disambiguation by the supervisor is particularly crucial for robust multi-agent dialogue.
Takeaway 6: High-level ambiguity and vagueness handling by the supervisor is essential for robust performance, with its removal causing the largest drop in task success among all clarification skills.
Overall, these results support our preference for the multi-agent architecture, where MAC delivers higher accuracy, improved scalability, and enhanced maintainability, underscoring the critical role of modular design in developing robust and generalizable conversational agents.
Conclusions We introduce MAC, the first multi-agent LLM framework specifically designed for interactive user clarification in conversational agents, and, to the best of our knowledge, the first comprehensive user clarification taxonomy for this setting. Our results demonstrate that effective user clarification is essential for maximizing task success and conversational efficiency while minimizing unnecessary user interactions. The proposed taxonomy enhances accuracy in both single-agent and multi-agent settings, with the multi-agent approach yielding superior results due to effective task sharing with domain experts. Notably, MAC is model-agnostic and significantly boosts performance across both open-source and proprietary LLMs, with particularly pronounced gains for open-source models, helping close the performance gap through optimal design and supervision. Our ablation studies highlight that high-level ambiguity and vagueness handling by the supervisor is especially critical for robust performance in real-world scenarios. Overall, MAC establishes a foundation for future research and deployment of multi-agent, user-centric conversational systems, offering clear benefits for practical and industrial applications.
Limitations Although MAC is an effective and promising framework, it has several limitations. First, due to the complexity of multi-turn conversations, successful goal-oriented dialogue and planning require large, capable models; as a result, our experiments primarily rely on large-parameter, API-based models. Teaching these capabilities to smaller LLMs and deploying them efficiently remains an open challenge (Belcak et al., 2025). Second, we use LLM-as-a-Judge (LaaJ) evaluations, which have known limitations. In rare cases, the evaluator LLM may hallucinate and assign positive scores to incorrect dialogue trajectories; although we did not observe such failures, outlier cases may still exist. Nevertheless, LaaJ has been widely adopted in prior TOD research and has proven effective for evaluating complex, long-horizon conversational rollouts that are difficult to assess with deterministic metrics alone (Xu et al., 2024; Acikgoz et al., 2025b). Combining LaaJ with complementary deterministic, rule-based metrics could improve robustness, but designing such hybrid evaluations remains an open challenge. Finally, our error analysis reveals that in rare cases (approximately 1%), the agent and the user simulator can become stuck in a loop (Barres et al., 2025), where the agent repeatedly asks clarification questions and the simulator returns the same response. Although infrequent, this behavior highlights the need for more realistic user simulators and points to human-agent co-evolution as an interesting direction.
Future Work Handling user interactions in conversational agents is non-trivial, as it requires managing multiple tasks such as providing accurate responses or invoking the appropriate API from thousands of available tools (Su et al., 2025). In addition to these requirements, MAC addresses this challenge at scale through its carefully designed multi-agent setup, which is specifically tailored for user clarification. As a future direction, agents could learn the optimal timing for seeking user clarification by monitoring environmental signals during interactions, leveraging recent reinforcement learning techniques (Lambert et al., 2024; Guo et al., 2025) to continuously self-update and become increasingly successful over time, a promising path towards self-improving agents (Zhang et al., 2025; Acikgoz et al., 2025d). Moreover, while our evaluation focuses on task success and the number of conversational turns as proxies for efficiency, quantifying overall user satisfaction remains an open question (Terry et al., 2023), where factors such as dialogue naturalness and other user-centric elements may play a pivotal role.
We evaluate the performance of our MAC framework using dialogue-level metrics that capture both the effectiveness and efficiency of task completion. Our primary metric is Success Rate, which measures whether the agent fully satisfies all user-specified constraints and successfully completes the task. For each dialogue, we use an LLM-based judge, GPT-4o-2024-11-20 (Hurst et al., 2024), to assess whether the agent’s final response fulfills every requirement defined by the user’s goal, including both requested attributes (such as hotel name or train arrival time) and booking constraints (such as the number of people or destination), following Xu et al. (2024). Formally, a dialogue d is considered successful if all constraints in the user’s goal G are met by the end of the interaction:

Success(d) = I( ∀ c ∈ G : c is satisfied at the end of d ),

where I(·) denotes the indicator function. This score is computed for every dialogue and averaged across the evaluation set. To account for the stochastic nature of both model inference and LLM-based judging, we conduct five independent runs for each experimental configuration.
We report two aggregate Success Rate metrics: Success Max@5, the highest single-run success rate out of five runs, reflecting the best-case performance; and Success Avg@5, the mean and standard deviation of success rates over the five runs, providing a robust measure of typical performance and variance. In addition, we report the Average Number of Turns per conversation as an efficiency metric. This measures the average length of the dialogue required to complete the task, with lower values indicating more concise and effective interactions. This metric is particularly important for assessing the practical impact of clarification strategies on user burden and overall system efficiency.
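As a concrete reference, the aggregation of these metrics over five runs can be computed as sketched below, assuming the per-dialogue success flags come from the LLM judge described above (illustrative code, not the paper's implementation).

```python
# Sketch of the reported metrics. `runs` holds five lists of judged dialogues, where each
# dialogue is a (success: bool, num_turns: int) pair. Illustrative, not the paper's code.
from statistics import mean, pstdev

def aggregate(runs):
    # Per-run success rate: fraction of dialogues whose goal constraints were all satisfied.
    success_rates = [mean(1.0 if ok else 0.0 for ok, _ in run) for run in runs]
    avg_turns = mean(turns for run in runs for _, turns in run)
    return {
        "Success Max@5": max(success_rates),                             # best single run
        "Success Avg@5": (mean(success_rates), pstdev(success_rates)),   # mean and std over runs
        "Avg. Turns": avg_turns,                                         # lower is better
    }
```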
During our experiments, we use the OpenAI API² to evaluate the GPT models GPT-4o-2024-11-20 (Hurst et al., 2024) and gpt-4o-mini. For open-source model evaluations, we use the TogetherAI API³ with Qwen3-235B-A22B (Yang et al., 2025). To ensure reproducibility, we use default generation settings for all models without tuning any inference hyperparameters. In terms of runtime, a single model inference takes approximately 15 minutes, while evaluation requires around 4-5 minutes. Since LLM-based judge metrics are non-deterministic, the results may vary slightly across runs. To account for this variability, we perform all evaluations five times and report scores with standard deviations.

² https://openai.com/api/
³ https://api.together.xyz/
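For reference, a minimal client setup with default generation settings is sketched below; treating TogetherAI as an OpenAI-compatible endpoint and the exact model identifiers are assumptions, not details taken from the paper.

```python
# Illustrative client setup with default generation settings (no tuned hyperparameters).
# Treating TogetherAI as an OpenAI-compatible endpoint and the model IDs are assumptions.
import os
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
together_client = OpenAI(
    api_key=os.environ.get("TOGETHER_API_KEY"),
    base_url="https://api.together.xyz/v1",
)

def chat(client, model, messages):
    # No temperature/top_p overrides: provider defaults are kept for reproducibility.
    return client.chat.completions.create(model=model, messages=messages).choices[0].message.content

# Example (model identifiers as referenced in the text; the Together ID is illustrative):
# chat(openai_client, "gpt-4o-2024-11-20", [{"role": "user", "content": "Book a hotel."}])
# chat(together_client, "Qwen/Qwen3-235B-A22B", [{"role": "user", "content": "Book a hotel."}])
```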
Given the following User Goal and Ground Truth Conversation, update the conversation to introduce ambiguity or underspecification in the user’s turns, such that the agent must ask for clarification at least once and at most three times. For every agent clarification, enclose the clarification in ‘’ and ‘’ tokens. After each agent clarification, update the following user turn(s) to resolve the ambiguity. Do not change the overall goal or successful task completion. Only modify the conversation for clarification needs.
-Carefully read the User Goal and the Ground Truth Conversation.
-Rewrite the conversation so that some user turns are ambiguous or missing key information, requiring the agent to clarify at least once and at most three times (with the ‘’ tokens).
-Keep the rest of the conversation as natural as possible and ensure the final output still accomplishes the user goal.
Output Format:
-Output the updated conversation only as a valid JSON array in the format:
```json
[
  {"from": "user", "value": "..."},
  {"from": "agent", "value": "..."},
  {"from": "user", "value": "..."},
  {"from": "agent", "value": "..."},
  ...
]
```
-No extra text, no comments, no explanations, no markdown, just the JSON array.
-The output must be valid JSON.
-If the format is not exactly correct, data loading will fail.
New Conversation with User Clarification
{Your JSON Data Here}

Supervisor Agent Prompt

### Task
You are an expert routing agent in a multi-domain conversational AI system for the MultiWOZ dataset. Your specific task is to analyze a user’s query and determine which of the following five domain experts is best suited to handle it:
1. restaurant (for queries related to finding, booking, or getting information about restaurants)
2. hotel (for queries related to finding, booking, or getting information about hotels or other accommodations)
3. attraction (for queries related to finding or getting information about tourist attractions, landmarks, or points of interest)
4. train (for queries related to finding, booking, or getting information about train travel)
5. taxi (for queries related to booking or getting information about taxi services)
Read the user’s query provided below (### User Query). Your goal is to identify the single, most dominant domain relevant to the query.
You MUST output ONLY the exact lowercase label corresponding to the selected domain, enclosed in and . For example, if the query is about a hotel, your output must be hotel. Do NOT include any other words, phrases, explanations, or punctuation outside of these tags. Your entire response should be just one of these five labels, wrapped in the domain tags as shown below:
-restaurant -hotel -attraction -train -taxi
If the query seems to touch on multiple domains, select the one that appears to be the primary focus or the one that needs to be addressed first. If no domain is clearly identifiable from the list, you must still choose the closest possible one or a default agreed upon (though for this specific instruction, you must pick one of the five and wrap it in the tags).
Selected Domain Label:

Hotel Domain Expert Prompt

### Role Description
You are an advanced AI assistant specializing in conversational dialogues focused on the hotel domain.
You can act both as a system (providing hotel information and booking services) and a user (interacting with the hotel database) to assist users in completing hotel-related tasks.
### Task Information
-Each time, you must determine whether to call an API by reasoning through “Thought:”.
-If you decide that an API call is necessary, include “Thought:” for reasoning, followed by “API Name:”, “API Input:”, “API Result:”.
-If you determine that an API call is not necessary, include a “Thought:” for reasoning, followed by a response to the user as “Response:”.
-If the user asks for some attributes of a hotel (e.g., address, phone number, price range, parking, internet), then an API call is necessary.
-You are not allowed to use APIs not mentioned below. If you decide that the mentioned APIs are not sufficient for the user’s request, you should inform the user that you can only assist with hotel queries and bookings.
-If you decide that more than one API call is needed (e.g., query first, then book), you should call one API first and wait for the API result. After obtaining that result, you may think and call the next API or think and make a response.
-The user can sometimes not care about the value of an API input slot and may mention it explicitly in the conversation (e.g., “I don’t care about the price range”). In such cases, predict “dontcare” as a slot value for that particular slot.
-If you decide that there is an API input slot that the user has never mentioned and is required for the API, please put “any” as the slot value as a placeholder.
-You can put only one value in each API input slot per query.
-Predict “dontcare” as a slot value ONLY if the user has explicitly mentioned it in the conversation.

Your entire output must be the single, lowercase domain label wrapped in the ‘’ tag. You MUST output ONLY the exact lowercase label corresponding to the selected domain, enclosed in and .
For example, if the query is about a hotel, your output must be hotel.Your entire response should be just one of these five labels, wrapped in the domain tags as shown below:
-restaurant -hotel -attraction -train -taxi
If the query seems to touch on multiple domains, select the one that appears to be the primary focus or the one that needs to be addressed first. If no domain is clearly identifiable from the list, you must still choose the closest possible one or a default agreed upon (though for this specific instruction, you must pick one of the five and wrap it in the tags).
### User Query
user_query

### Output:

Hotel Expert Prompt with User Clarification

### Role Description
You are an advanced AI assistant specializing in conversational dialogues focused on the hotel domain.
You can act both as a system (providing hotel information and booking services) and a user (interacting with the hotel database) to assist users in completing hotel-related tasks.
### Task Information
-Each time, you must determine whether to call an API by reasoning through “Thought:”.
-If you decide that an API call is necessary, include “Thought:” for reasoning, followed by “API Name:”, “API Input:”, “API Result:”.
-If you determine that an API call is not necessary, include a “Thought:” for reasoning, followed by a response to the user as “Response:”.
-If the user asks for some attributes of a hotel (e.g., address, phone number, price range, parking, internet), then an API call is necessary.
-You are not allowed to use APIs not mentioned below. If you decide that the mentioned APIs are not sufficient for the user’s request, you should inform the user that you can only assist with hotel queries and bookings.
-If you decide that more than one API call is needed (e.g., query first, then book), you should call one API first and wait for the API result. After obtaining that result, you may think and call the next API or think and make a response.
-The user can sometimes not care about the value of an API input slot and may mention it explicitly in the conversation (e.g., “I don’t care about the price range”). In such cases, predict “dontcare” as a slot value for that particular slot.
-If you decide that there is an API input slot that the user has never mentioned and is required for the API, please put “any” as the slot value as a placeholder.
-You can put only one value in each API input slot per query.
### Clarification Taxonomy (When to Ask)
Before calling an API, determine if you have all the necessary information. If not, ask a clarifying question using this taxonomy:
-Parameter Underspecification: Key details for a search or booking are missing (e.g., location, cuisine, number of people, time).
-Value Ambiguity/Vagueness: A user’s term is subjective and needs clarification (e.g., “a nice place,” “somewhere soon”).
-Constraint Conflict: The user provides contradictory information (e.g., “a cheap but expensive restaurant”).
-Entity Disambiguation/Not Found: A specific restaurant name is ambiguous or cannot be found.
-Confirmation of Inferred Information: You have inferred a detail from context and need to confirm it before proceeding.
-It is also important not to burden the user with repetitive or similar clarification questions in your overall conversation; please be mindful of this during your conversation.
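Because the expert's behavior is driven entirely by this structured output protocol (Thought / API Name / API Input / API Result / Response), a controller has to parse these fields; the sketch below assumes each label appears at the start of a line, which is an assumption about the serialization rather than the paper's actual parser.

```python
# Illustrative parser for the expert's structured output; field labels follow the prompt above,
# but the exact serialization is an assumption rather than the paper's actual parser.
import re

FIELDS = ("Thought", "API Name", "API Input", "API Result", "Response")
LABEL_RE = re.compile(r"^(%s):\s*" % "|".join(re.escape(f) for f in FIELDS))

def parse_expert_output(text):
    """Split a model turn into labeled fields, e.g. {'Thought': ..., 'Response': ...}."""
    parts, current = {}, None
    for line in text.splitlines():
        m = LABEL_RE.match(line)
        if m:
            current = m.group(1)
            parts[current] = line[m.end():].strip()
        elif current is not None:
            parts[current] += "\n" + line   # continuation of the previous field
    return parts
```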
Algorithm 1 MAC: Multi-Agent Clarification Workflow
Require: User query u_t at dialogue turn t; supervisor agent A_S; domain experts A_E = {A_d1, ..., A_dn}
Ensure: Either CLARIFY(q) or RESPOND(r)
1:  function MAC(u_t)
2:    if A_S.is_ambiguous(u_t) then
3:      q ← A_S.ask_clarification(u_t)
4:      return CLARIFY(q)
5:    end if
6:    d ← A_S.select_domain(u_t)
7:    A_d ← the expert for domain d        ▷ Route to best-fit expert
8:    if A_d.is_ambiguous(u_t) then
9:      q_d ← A_d.ask_clarification(u_t)
10:     return CLARIFY(q_d)                ▷ Expert requests a targeted follow-up
11:   else
12:     r ← A_d.execute_domain_response(u_t)
13:     return RESPOND(r)
14:   end if
15: end function
-Response: [Your response question to the user for clarification]

### Output Instructions
1. For user clarification, you should provide your reasoning as “Thought: [your reasoning]” and your user clarification question response as “Response: [your high-level clarification question]”.
2. For domain selection, you should provide your response between and tags.