M3: A Multimodal Tool-Use Benchmark
📝 Abstract
We present M3-Bench, the first benchmark for evaluating multimodal tool use under the Model Context Protocol. The benchmark targets realistic, multi-hop and multi-threaded workflows that require visual grounding and textual reasoning, cross-tool dependencies, and persistence of intermediate resources across steps. We introduce a similarity-driven alignment that serializes each tool call, embeds signatures with a sentence encoder, and performs similarity-bucketed Hungarian matching to obtain auditable one-to-one correspondences. On top of this alignment, we report interpretable metrics that decouple semantic fidelity from workflow consistency. The benchmark spans 27 servers with 232 tools, and provides standardized trajectories curated through an Executor & Judge pipeline with human verification; an auxiliary four-model large language model (LLM) judge ensemble reports end-task Task Completion and information grounding. Evaluations of representative state-of-the-art Multimodal LLMs (MLLMs) reveal persistent gaps in multimodal MCP tool use, particularly in argument fidelity and structure consistency, underscoring the need for methods that jointly reason over images, text, and tool graphs.
📄 Content
We introduce M3-Bench, the Multi-Modal, Multiplex, Matching-aware MCP Benchmark, as a principled evaluation suite for multimodal tool use under the Model Context Protocol (MCP). Multimodal Large Language Models (MLLMs) have recently shown that, once they are allowed to perform function calling (tool use), they can query external services and reason over information that is not contained in their parameters [3,7,8,20,30,42,43,48,49,53]. MCP specifies how models communicate with heterogeneous tools through standardized servers, which makes execution more reproducible across systems [29]. However, existing MCP benchmarks are mostly text-only and focus on linear API planning or database queries [12,13,24,25,35,36,41,50]. A systematic evaluation of multimodal MCP workflows, where images and text jointly condition tool calls and results, is still missing [2,25].

Table 1. Comparisons to existing tool-using benchmarks (compact single-column).

| Benchmark | Domains | Tools | MCP ecosystem | Info. grounding | Fuzzy task | Multi-Hop & Threaded | Multimodality | Similarity Metric |
|---|---|---|---|---|---|---|---|---|
| [41] | 40+ | 300+ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| MCP-Universe [25] | 6 | 113 | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| MCP-Bench [45] | 28 | 250 | | | | | | |
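To make the object of evaluation concrete, a tool call under an MCP-style protocol can be viewed as a tool name plus structured JSON arguments. The sketch below shows one minimal, hypothetical way to serialize such a call into a canonical string (the tool name and argument keys here are illustrative, not from the benchmark):

```python
import json

def serialize_call(tool_name: str, arguments: dict) -> str:
    """Flatten a tool call into a canonical string.

    Sorting keys makes the serialization deterministic, so two calls with the
    same semantics serialize identically. This is an illustrative scheme, not
    the paper's exact one.
    """
    args = json.dumps(arguments, sort_keys=True, ensure_ascii=False)
    return f"{tool_name}({args})"

# Hypothetical slide-deck tool call:
print(serialize_call("create_slide", {"title": "Hazards", "deck_id": "d1"}))
# → create_slide({"deck_id": "d1", "title": "Hazards"})
```

A deterministic serialization like this is what makes downstream embedding-based comparison of calls reproducible across runs.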
The core challenge in real-world MCP trajectories is visual grounding: multimodal tool invocation hinges on correctly interpreting the image before any tool can be parameterized. In Figure 2a, the agent receives a photograph and must first resolve the landmark/city from visual cues; only then can it condition subsequent MCP calls. Likewise, the agent must fully recognize the stocked, well-displayed products on a shelf in a photograph before proceeding to the next MCP tool calls. Second, real-world MCP trajectories are not single-shot calls: they are multi-hop, they contain causally dependent operations, and they frequently execute several tool calls in parallel within one step [22,27,52,54]. The task in Figure 2b illustrates this setting. The agent receives an image of a hazardous construction scene and a reporting-style instruction. Within one step, actions that do not depend on each other, such as adding an image and updating bullets, can be executed concurrently. Across steps, operations that do depend on earlier results, such as annotating before inserting or creating the deck before saving, must follow the ground-truth order. These properties make simple string matching or linear-sequence scoring inadequate [27,54], and they are precisely what our benchmark aims to stress. We define two important concepts in what follows: Multi-Hop refers to workflows with more than one causally dependent step, where later actions consume artifacts produced earlier [14,51]; Multi-Threaded refers to order-independent tool calls executed within a single step under shared state, allowing safe parallelism while preserving cross-step causality (mathematical definitions in 1 & 2).
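The multi-hop/multi-threaded distinction can be sketched as a validity check over a trajectory modeled as a list of steps, each step a list of parallel tool calls. The `produces`/`consumes` fields and tool names below are hypothetical; the check simply enforces that every consumed artifact was produced in an earlier step (cross-step causality), which also rules out within-step dependencies (safe parallelism):

```python
def produced_before(steps, step_idx):
    """Collect artifact ids produced strictly before step `step_idx`."""
    out = set()
    for step in steps[:step_idx]:
        for call in step:
            out |= set(call.get("produces", []))
    return out

def is_valid_trajectory(steps):
    """Every call may only consume artifacts from strictly earlier steps.

    Calls inside one step therefore cannot depend on each other, so they are
    safe to run in parallel, while cross-step causal order is preserved.
    """
    for i, step in enumerate(steps):
        available = produced_before(steps, i)
        for call in step:
            if not set(call.get("consumes", [])).issubset(available):
                return False
    return True

# Figure-2b-style example (illustrative tool names): create the deck first,
# then add an image and update bullets in parallel, then save.
steps = [
    [{"tool": "create_deck", "produces": ["deck"]}],
    [{"tool": "add_image", "consumes": ["deck"], "produces": ["img"]},
     {"tool": "update_bullets", "consumes": ["deck"]}],
    [{"tool": "save_deck", "consumes": ["deck", "img"]}],
]
print(is_valid_trajectory(steps))  # True
```

Saving before creating, or putting a producer and its consumer in the same step, fails the check, which is exactly the ground-truth ordering constraint described above.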
Table 1 contrasts M3-Bench with prior tool-use evaluations along 8 axes. Domains and Tools enumerate functional breadth and unique tool count. MCP ecosystem marks whether the benchmark connects the agent directly to a set of production-grade, live MCP servers, and Information grounding indicates that answers must be supported by evidence returned from tools. Fuzzy task description denotes underspecified, natural instructions without a clear trajectory. Critically, Multi-Hop & Threaded captures cross-step causal dependencies together with safe within-step parallelism; Multimodality requires joint image-text conditioning; and Similarity Metric denotes large language model (LLM)-free, similarity-aware alignment of predicted to reference calls (see Section 3). As discussed above, Multimodality and Multi-Hop/Threaded causality are central to realistic MCP workflows, and a similarity-based scorer is necessary to credit semantically correct calls. Most existing benchmarks are text-only and/or lack explicit multi-threaded causality, and none pair these with a similarity-aware alignment; in contrast, M3-Bench satisfies all three while retaining the key advantages of prior benchmarks.
We align predicted and reference tool calls with a similarity-bucketed Hungarian alignment: each call is serialized, embedded with a fixed sentence encoder, and matched one-to-one within tool-name buckets under weak/strong cosine thresholds. This yields deterministic, auditable correspondences without relying on an LLM judge for call-level scoring. On top of this alignment, we report a compact, recall-aware metric suite that decouples semantic fidelity from workflow consistency (Section 3). A small four-model judge ensemble evaluates the overall quality of each trajectory, retaining some of the advantages of LLM-as-judge evaluation. In addition, to standardize references, our experiments provide a best trajectory obtained via an Executor & Judge loop.
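The alignment step above can be sketched end to end. This is a self-contained stand-in, not the paper's implementation: a character-trigram embedding replaces the sentence encoder, a brute-force optimal assignment replaces the Hungarian solver, and the 0.5 "weak" threshold is illustrative:

```python
from collections import Counter
from itertools import permutations
import math

def embed(text, n=3):
    """Toy character-trigram embedding (stand-in for a sentence encoder)."""
    grams = Counter(text[i:i + n] for i in range(max(1, len(text) - n + 1)))
    norm = math.sqrt(sum(v * v for v in grams.values()))
    return {g: v / norm for g, v in grams.items()}

def cosine(a, b):
    return sum(v * b.get(g, 0.0) for g, v in a.items())

def best_assignment(sim):
    """Optimal one-to-one assignment by exhaustive search; a Hungarian
    solver would replace this for large buckets."""
    rows, cols = len(sim), len(sim[0])
    best_total, best_pairs = float("-inf"), []
    for perm in permutations(range(cols), min(rows, cols)):
        pairs = list(enumerate(perm))
        total = sum(sim[i][j] for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_pairs

def align(pred, ref, weak=0.5):
    """Match predicted to reference calls within tool-name buckets."""
    matches = []
    for tool in {c["tool"] for c in pred} & {c["tool"] for c in ref}:
        P = [c for c in pred if c["tool"] == tool]
        R = [c for c in ref if c["tool"] == tool]
        sim = [[cosine(embed(p["args"]), embed(r["args"])) for r in R] for p in P]
        for i, j in best_assignment(sim):
            if sim[i][j] >= weak:  # drop implausible pairs for auditability
                matches.append((P[i]["args"], R[j]["args"], round(sim[i][j], 3)))
    return matches

# Hypothetical calls: predictions arrive in a different order than references.
pred = [{"tool": "search", "args": "city=Paris"},
        {"tool": "search", "args": "city=Lyon"}]
ref  = [{"tool": "search", "args": "city=Lyon"},
        {"tool": "search", "args": "city=Paris"}]
print(align(pred, ref))  # each call pairs with its semantic twin at sim 1.0
```

Because matching is restricted to same-tool buckets and thresholded, every kept correspondence can be inspected and justified, which is what makes the resulting metrics auditable.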
Our contributions are threefold:
1. We present, to our knowledge, the first benchmark explicitly targeting multimodal MCP workflows.
2. Our repository provides an end-to-end pipeline for generating MCP best trajectories, with an optional lightweight human verification pass to enhance stability.
3. We introduce a structure-aware metric suite that aligns tool calls via similarity-bucketed Hungarian matching.
This content is AI-processed based on ArXiv data.