Multimodal Tool-Use Benchmark M³-Bench

Reading time: 6 minutes

📝 Original Info

  • Title: Multimodal Tool-Use Benchmark M³-Bench
  • ArXiv ID: 2511.17729
  • Date: 2025-11-21
  • Authors: Yang Zhou, Mingyu Zhao, Zhenting Wang, Difei Gu, Bangwei Guo, Ruosong Ye, Ligong Han, Can Jin, Dimitris N. Metaxas

📝 Abstract

We present M³-Bench, the first benchmark for evaluating multimodal tool use under the Model Context Protocol. The benchmark targets realistic, multi-hop and multi-threaded workflows that require visual grounding and textual reasoning, cross-tool dependencies, and persistence of intermediate resources across steps. We introduce a similarity-driven alignment that serializes each tool call, embeds signatures with a sentence encoder, and performs similarity-bucketed Hungarian matching to obtain auditable one-to-one correspondences. On top of this alignment, we report interpretable metrics that decouple semantic fidelity from workflow consistency. The benchmark spans 27 servers with 232 tools, and provides standardized trajectories curated through an Executor & Judge pipeline with human verification; an auxiliary four-model large language model (LLM) judge ensemble reports end-task Task Completion and information grounding. Evaluations of representative state-of-the-art Multimodal LLMs (MLLMs) reveal persistent gaps in multimodal MCP tool use, particularly in argument fidelity and structure consistency, underscoring the need for methods that jointly reason over images, text, and tool graphs.

💡 Deep Analysis

Deep dive into M³-Bench, a multimodal tool-use benchmark.


📄 Full Content

We introduce M³-Bench, the Multi-Modal, Multiplex, Matching-aware MCP Benchmark, as a principled evaluation suite for multimodal tool use under the Model Context Protocol (MCP). Multimodal Large Language Models (MLLMs) have recently shown that, once they are allowed to perform function calling (tool use), they can query external services and reason over information that is not contained in their parameters [3,7,8,20,30,42,43,48,49,53]. MCP specifies how models communicate with heterogeneous tools through standardized servers, which makes execution more reproducible across systems [29]. However, existing MCP benchmarks are mostly text-only and focus on linear API planning or database queries [12,13,24,25,35,36,41,50]. A systematic evaluation of multimodal MCP workflows, where images and text jointly condition tool calls and results, is still missing [2,25].

(* Equal contribution.)

Table 1. Comparisons to existing tool-using benchmarks (compact single-column). Rows spilled by extraction; recoverable entries: [41]: 40+ domains, 300+ tools; MCP-Universe [25]: 6 domains, 113 tools; MCP-Bench [45]: 28 domains, 250 tools.

The core challenge in real-world MCP trajectories is visual grounding: multimodal tool invocation hinges on correctly interpreting the image before any tool can be parameterized. In Figure 2a, the agent receives a photograph and must first resolve the landmark/city from visual cues; only then can it condition subsequent MCP calls. Likewise, the agent must fully recognize the fully stocked, well-displayed products on a shelf in a photograph before proceeding to the next MCP tool call. Second, real-world MCP trajectories are not single-shot calls: they are multi-hop, they contain causally dependent operations, and they frequently execute several tool calls in parallel within one step [22,27,52,54]. The task in Figure 2b illustrates this setting. The agent receives an image of a hazardous construction scene and a reporting-style instruction. Within one step, actions that do not depend on each other, such as adding an image and updating bullets, can be executed concurrently. Across steps, operations that depend on earlier results, such as annotating before inserting or creating the deck before saving, must follow the ground-truth order. These properties make simple string matching or linear-sequence scoring inadequate [27,54], and they are precisely what our benchmark aims to stress. We define two important concepts in what follows: Multi-Hop refers to workflows with more than one causally dependent step, where later actions consume artifacts produced earlier [14,51]; Multi-Threaded refers to order-independent tool calls executed within a single step under shared state, allowing safe parallelism while preserving cross-step causality (mathematical definitions in 1 & 2).
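These two definitions can be made concrete by modeling a trajectory as an ordered list of steps, where each tool call declares the artifacts it reads and writes. The following is a minimal sketch of a validity check under that assumption; the `reads`/`writes` schema is a hypothetical illustration, not the paper's actual trajectory format.

```python
# Minimal validity check for multi-hop / multi-threaded trajectories.
# Each call declares the artifacts it reads and writes; this schema is a
# hypothetical illustration, not the benchmark's trajectory format.

def check_trajectory(steps):
    """steps: list of steps; each step is a list of calls, where a call is
    a dict {"reads": set_of_artifacts, "writes": set_of_artifacts}.

    Multi-hop causality: every artifact a call reads must have been written
    in an *earlier* step. Multi-threaded safety: calls within one step must
    not read a sibling's output (reads resolve only against earlier steps,
    so this is checked automatically) and must not write the same artifact.
    """
    produced = set()  # artifacts available from earlier steps
    for step in steps:
        step_writes = set()
        for call in step:
            # cross-step dependency: reads must resolve to earlier outputs
            if not call["reads"] <= produced:
                return False
            # within-step independence: no conflicting writes in one step
            if call["writes"] & step_writes:
                return False
            step_writes |= call["writes"]
        produced |= step_writes
    return True
```

Under this model, the Figure 2b workflow is valid because the concurrent within-step actions touch disjoint artifacts, while annotating-before-inserting is enforced by the cross-step read check.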

Table 1 contrasts M³-Bench with prior tool-use evaluations along 8 axes. Domains and Tools enumerate functional breadth and unique tool count. MCP ecosystem marks whether the benchmark connects the agent directly to a set of production-grade, live MCP servers, and Information grounding indicates that answers must be supported by evidence returned from tools. Fuzzy task description denotes underspecified, natural instructions without a clear trajectory. Critically, Multi-Hop & Threaded captures cross-step causal dependencies together with safe within-step parallelism; Multimodality requires joint image-text conditioning; and Similarity Metric denotes large language model (LLM)-free, similarity-aware alignment of predicted to reference calls (see Section 3). As discussed above, Multimodality and Multi-Hop/Threaded causality are central to realistic MCP workflows, and a similarity-based scorer is necessary to credit semantically correct calls. Most existing benchmarks are text-only and/or lack explicit multi-threaded causality, and none pair these with a similarity-aware alignment; in contrast, M³-Bench satisfies all three while retaining the key advantages of prior benchmarks.

We align predicted and reference tool calls with a similarity-bucketed Hungarian alignment: each call is serialized, embedded with a fixed sentence encoder, and one-to-one matched within tool-name buckets under weak/strong cosine thresholds. This gives deterministic, auditable correspondences without relying on an LLM judge for call-level scoring. On top of this alignment, we report a compact, recall-aware suite that separates semantic fidelity from workflow consistency (Section 3). A small four-model judge ensemble evaluates the overall quality of the trajectory, retaining some of the advantages of LLM judging. In addition, to standardize references, our experiments also provide a best trajectory obtained via an Executor-Judge loop.
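The alignment described above can be sketched as follows. The serialization format, threshold values, and encoder are illustrative assumptions here, not the authors' exact configuration; `encode` stands in for any sentence encoder that returns unit-normalized vectors.

```python
# Sketch of similarity-bucketed Hungarian alignment between predicted and
# reference tool calls. Serialization, thresholds, and the encoder interface
# are assumptions for illustration, not the paper's exact configuration.
import json
from collections import defaultdict

import numpy as np
from scipy.optimize import linear_sum_assignment


def serialize(call: dict) -> str:
    """Flatten a tool call (name + arguments) into a single string."""
    return call["tool"] + " " + json.dumps(call.get("args", {}), sort_keys=True)


def align_calls(pred, ref, encode, strong=0.85, weak=0.60):
    """One-to-one match predicted to reference calls within tool-name buckets.

    `encode` maps a list of strings to unit-normalized embedding vectors
    (e.g. a sentence encoder). Pairs below the weak cosine threshold stay
    unmatched; accepted matches are labeled strong or weak by their score.
    """
    buckets = defaultdict(lambda: ([], []))
    for i, c in enumerate(pred):
        buckets[c["tool"]][0].append(i)
    for j, c in enumerate(ref):
        buckets[c["tool"]][1].append(j)

    matches = []
    for tool, (pi, ri) in buckets.items():
        if not pi or not ri:
            continue  # calls with no same-name counterpart stay unmatched
        pe = encode([serialize(pred[i]) for i in pi])
        re_ = encode([serialize(ref[j]) for j in ri])
        sim = pe @ re_.T  # cosine similarity (vectors are unit-norm)
        rows, cols = linear_sum_assignment(-sim)  # maximize total similarity
        for r, c in zip(rows, cols):
            s = float(sim[r, c])
            if s >= weak:
                label = "strong" if s >= strong else "weak"
                matches.append((pi[r], ri[c], s, label))
    return matches
```

Bucketing by tool name keeps the Hungarian step small and prevents semantically close calls to different tools from being matched, which is what makes the resulting correspondences auditable.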

  1. We present, to our knowledge, the first benchmark explicitly targeting multimodal MCP workflows.
  2. Our repository provides an end-to-end pipeline for generating MCP best trajectories, with an optional lightweight human verification pass to enhance stability.
  3. We introduce a structure-aware metric suite that aligns tool calls via

…(Full text truncated)…


Reference

This content is AI-processed based on ArXiv data.
