AMIGO: Agentic Multi-Image Grounding Oracle Benchmark

Min Wang, Ata Mahjoubfar
Target Corporation
{Min.Wang, Ata.Mahjoubfar}@target.com

Work in Progress

Abstract

Agentic vision-language models increasingly act through extended interactions, but most evaluations still focus on single-image, single-turn correctness. We introduce AMIGO (Agentic Multi-Image Grounding Oracle Benchmark), a long-horizon benchmark for hidden-target identification over galleries of visually similar images. In AMIGO, the oracle privately selects a target image, and the model must recover it by asking a sequence of attribute-focused Yes/No/Unsure questions under a strict protocol that penalizes invalid actions with Skip. This setting stresses (i) question selection under uncertainty, (ii) consistent constraint tracking across turns, and (iii) fine-grained discrimination as evidence accumulates. AMIGO also supports controlled oracle imperfections to probe robustness and verification behavior under inconsistent feedback. We instantiate AMIGO with the Guess My Preferred Dress task and report metrics covering both outcomes and interaction quality, including identification success, evidence verification, efficiency, protocol compliance, noise tolerance, and trajectory-level diagnostics.

Note: This work is still in progress; we will update it with more results and analysis soon. The dataset is provided as an ancillary file, anc/data.json.

1 Introduction

Vision-language models (VLMs) [1, 2, 3, 4, 5] have advanced rapidly in instruction following and grounded reasoning, enabling strong performance on captioning, visual question answering, and multimodal dialogue. As VLMs become more agentic (planning and acting through extended exchanges), evaluation protocols increasingly need to measure not only correctness, but also interaction policies: how models gather information, maintain state, and adapt under uncertainty. Yet many benchmarks [6, 7, 8, 9, 10] remain static and short-horizon, typically scoring one-shot answers given one image (or a small fixed set), with limited visibility into strategy, memory, and belief revision over time.

We study a complementary setting: interactive hidden-target identification over image collections. A user (or oracle) privately selects a target image from a gallery of visually similar candidates, and the model must discover the target by asking constrained questions about observable attributes. This formulation surfaces agentic challenges that are often invisible in single-turn evaluation: choosing questions that efficiently reduce ambiguity, preserving and applying constraints without drift, and performing cross-image comparisons that depend on the evolving candidate set rather than any single image in isolation.

To operationalize this setting, we present AMIGO, an interactive benchmark centered on a simple but diagnostic loop. The model asks exactly one binary question per turn, receives Yes/No/Unsure feedback, and iteratively narrows the feasible candidate set. If the model violates protocol rules, the oracle returns Skip, revealing no information. This makes non-compliance measurable and separates wasted interaction from evidence-driven progress. AMIGO additionally allows occasional oracle inconsistencies to test whether models detect contradictions and seek verification instead of committing prematurely.
We instantiate AMIGO with the Guess My Preferred Dress task, where each gallery contains 6 to 40+ highly similar dress images. Dresses offer rich fine-grained variation (e.g., neckline construction, closures, seam placement, drape, and embellishments), making the task sensitive to careful observation and long-horizon constraint tracking.

Contributions. (1) We introduce AMIGO, an interactive multi-image benchmark for hidden-target identification that foregrounds long-horizon planning and fine-grained cross-image grounding. (2) We define a constrained Yes/No/Unsure protocol with explicit invalid-action feedback (Skip) to diagnose sustained instruction following and common agentic failure modes. (3) We incorporate controlled oracle imperfections to test robustness and verification behaviors under inconsistent feedback. (4) We provide an evaluation suite that jointly measures outcomes and interaction quality, including verified identification, efficiency, protocol compliance, and trajectory-level diagnostics.

Broader impact. AMIGO provides a controlled testbed for agentic VLM behaviors that matter in practice: planning under uncertainty, maintaining consistent state over extended interactions, and responding conservatively to ambiguity or contradictions. The resulting traces (candidate sets as state, questions as actions, and oracle responses, including Skip, as observations) can also support training and analysis of multimodal policies via offline learning. At the same time, hidden-target identification highlights calibration considerations: systems should avoid overconfident early commitments and remain protocol-compliant throughout long dialogues.

2 Related Work

2.1 Multi-image and multi-turn multimodal evaluation

A growing set of benchmarks and datasets extends evaluation beyond single-image, single-turn QA [8, 7] to multi-image and/or multi-turn settings. Multi-image benchmarks (e.g., MMIU [11], MuirBench [12], mPLUG-Owl3 [13]) probe cross-image grounding and long visual context handling. Multi-image instruction-tuning datasets (e.g., Mantis [14]) and multi-image multi-turn resources (e.g., MMDU [15], MMCR [16]) further study supervision over interleaved images and dialogue. Separately, multi-turn multimodal conversation benchmarks (e.g., MultiVerse [17], ConvBench [18], MMMT-IF [19], MMCoQA [20]) evaluate contextual coherence and instruction adherence over longer dialogues. These efforts primarily evaluate responding given a provided context, rather than interactive hidden-target identification, where a model must actively uncover a user-privately-selected target by asking constrained questions and tracking constraints over a long horizon.

2.2 Multimodal reasoning benchmarks

General multimodal reasoning benchmarks assess whether models can integrate vision and language for multi-step inference beyond shallow cue matching. For example, EMMA [21] targets "organic" multimodal reasoning across domains, and MLLM-CompBench [22] emphasizes comparative reasoning under multimodal inputs. MMMU-Pro [23] further strengthens this line by making evaluation more resistant to text-only shortcuts and by requiring tighter visual grounding in challenging multimodal questions.
These benchmarks are valuable for measuring reasoning capability on difficult perception-and-inference tasks, but they still primarily assess whether a model can solve a presented problem instance. In contrast, AMIGO evaluates whether an agentic VLM can acquire the missing information needed to solve the task: selecting informative questions over multiple turns, maintaining and revising a belief state over a candidate pool, and enforcing consistency under a constrained interaction protocol to identify a hidden target image.

2.3 Interactive clarification with hidden targets

AMIGO is closely related to interactive clarification settings where an agent must uncover user intent through dialogue. InfoQuest [24] evaluates text-only agents that ask clarifying questions when critical context is hidden. ClariMM [25] extends this idea to multimodal clarification for underspecified user queries. More broadly, multimodal agent frameworks such as M3Searcher [26] and expert-guided benchmarks such as MIRAGE [27] follow a "seek information then decide" paradigm, often with retrieval or tool-mediated evidence acquisition.

AMIGO differs in three ways. First, it studies hidden-target identification over a closed set of visually similar images, where the model must narrow the candidate pool via attribute queries answered with Yes/No/Unsure/Skip feedback (which may be noisy). Second, it evaluates the interaction policy itself via trajectory-level signals such as compliance, redundancy, elimination dynamics, and contradiction detection/verification. Third, the constrained protocol yields reusable trajectories with a well-defined action space (attribute questions), observations (oracle responses), and automatically scored intermediate signals, making the data naturally suited for diagnosing and training long-horizon agentic multimodal policies.

3 Benchmark

We introduce AMIGO, a benchmark for evaluating agentic VLMs on hidden-target, multi-image, multi-turn identification. AMIGO comprises a primary task, Guess My Preferred Dress, where the model must identify a user-privately-selected target image from a gallery of highly similar candidates by asking discriminative attribute questions under a strict protocol.

3.1 Guess My Preferred Dress

Task. A user uploads one or more batches of dress images that together form a single gallery; one image is privately selected by the user as the hidden target. After the signal End of uploading, the model asks questions to narrow the candidate set and finally outputs the 1-indexed position of the target in the gallery.

Protocol. The interaction follows these rules:

1. Each turn, the model must ask exactly one question about an observable attribute of the target dress.
2. The question must be answerable with Yes, No, or Unsure.
3. If the model violates any rule, the oracle responds with Skip.
4. The model must not ask about: sleeve length, garment length, color, prints/patterns, age group, size, shoes, necklace, hat, bag, background, or the human model.
5. The model must not enumerate attribute values across turns for the same attribute type (e.g., repeatedly cycling through neckline types such as V-neckline, square neck, and crewneck in different turns).
6. The model must not reference specific indices or inspect images one by one (e.g., "Is it image #1?").
7. The model must not guess until its constraints narrow the feasible candidate set to exactly one image.
8. If constraints appear inconsistent (e.g., due to uncertainty or noise), the model should continue asking more questions rather than stopping early.
9. When ready, the model must output: My guess of your favorite dress: #.
10. The model must not ask any questions before the user sends End of uploading.

Rationale. The protocol deliberately excludes highly salient cues (e.g., color, patterns, sleeve length, and garment length) to encourage reliance on subtler construction details (e.g., neckline structure, closures, waist shaping, seam placement, drape, and embellishments). This design emphasizes long-horizon constraint tracking and rewards question sequences that are informative, non-redundant, and robust to uncertainty.
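To make the interaction contract concrete, the following is a minimal sketch of one episode loop under the rules above. The Episode container and the model, oracle, and violation_detector interfaces are our own illustrative names, not the benchmark's actual API.

```python
# Minimal sketch of one AMIGO episode loop (illustrative interfaces only).
from dataclasses import dataclass, field

@dataclass
class Episode:
    gallery: list                     # candidate images, 1-indexed positions
    target_idx: int                   # hidden target index (oracle-side only)
    history: list = field(default_factory=list)

GUESS_PREFIX = "My guess of your favorite dress:"

def run_episode(model, oracle, violation_detector, ep, max_turns=20):
    """One binary question per turn until the model commits to a guess."""
    for _ in range(max_turns):
        utterance = model.next_question(ep.gallery, ep.history)
        if utterance.startswith(GUESS_PREFIX):               # terminal guess
            guess = int(utterance.rsplit("#", 1)[-1].strip(" ."))
            return guess == ep.target_idx
        if violation_detector.is_invalid(utterance, ep.history):
            answer = "Skip"           # rule violation: reveals no information
        else:                         # valid question: oracle inspects target
            answer = oracle.answer(utterance, ep.gallery[ep.target_idx - 1])
        ep.history.append((utterance, answer))   # Yes / No / Unsure / Skip
    return False                      # budget exhausted without a guess
```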
Example interactions. Figure 1 shows two full episodes. In Figure 1a, the model systematically asks about fine-grained details and narrows the pool over 10 turns, but makes an incorrect final guess, illustrating how visually similar candidates can still be confusable. In Figure 1b, the model identifies the target in 9 turns by asking progressively more discriminative questions.

Figure 1: Two example multi-turn interactions for Guess My Preferred Dress. The VLM asks constrained Yes/No questions about fine-grained attributes; the user replies with Yes, No, or Unsure.

(a) Unsuccessful attempt (gallery of 8 candidates; hidden target: #2). After 10 turns the model guesses dress #4, illustrating how visually similar candidates can mislead even systematic questioning.

T1 Model: Does your favorite dress have a separate belt with a buckle? User: No
T2 Model: Does your favorite dress have buttons down the front? User: No
T3 Model: Does your favorite dress have a smocked (elastic, gathered) bodice? User: No
T4 Model: Does your favorite dress have an off-the-shoulder neckline? User: Yes
T5 Model: Does your favorite dress have a fabric tie at the waist (a sash you knot, no buckle)? User: Yes
T6 Model: Does your favorite dress have an asymmetrical cascading ruffle panel on the skirt? User: No
T7 Model: Does your favorite dress have more than one distinct tier/ruffle layer in the skirt? User: Yes
T8 Model: Does it have a visible elasticated waistband seam rather than a smooth fitted waist? User: Unsure
T9 Model: Is your favorite dress a two-piece set (separate top and skirt) rather than one single dress? User: No
T10 Model: Does your favorite dress have a peplum-like flounce at the waist (a short ruffled layer over the skirt)? User: Yes
Model: My guess of your favorite dress: #4. (✗ Incorrect; ground truth: #2)
Summary: 10 turns · 0 Skips · 4× No · 4× Yes · 1× Unsure

(b) Successful attempt (gallery of 6 candidates; hidden target: #5). After 9 turns the model correctly guesses dress #5, showing that effective questioning can resolve visually similar candidates.

T1 Model: Does your favorite dress have pockets that you can comfortably put your hands into? User: Unsure
T2 Model: Does your favorite dress have a waist tie/belt that wraps around and can be knotted or bowed? User: Yes
T3 Model: Does your favorite dress have a wrap-style front where one side overlaps the other across the torso? User: Yes
T4 Model: Does your favorite dress have a ruffled hem or ruffle detailing near the bottom edge? User: Yes
T5 Model: Does your favorite dress have multiple horizontal 'tier' seams (stacked layers) in the skirt portion? User: Yes
T6 Model: Does the skirt have an overlapping 'wrap' opening in front (so the front edge is formed by one panel laying over another, rather than being a continuous closed skirt)? User: No
T7 Model: Does your favorite dress have a V-shaped neckline? User: Yes
T8 Model: Does your favorite dress's fabric have a noticeable sheen (it looks somewhat shiny rather than matte)? User: Unsure
T9 Model: Does your favorite dress have noticeably puffed shoulders (gathering/volume right at the shoulder seam)? User: No
Model: My guess of your favorite dress: #5. (✓ Correct; target: #5)
Summary: 9 turns · 0 Skips · 2× No · 5× Yes · 2× Unsure

3.2 Dataset collection

Source and curation. We collect dress images from Target's online fashion catalog and deduplicate them to ensure each product is unique. Attribute metadata extracted from the historical product database is often missing, defined inconsistently across brands, and sometimes contains erroneous or irrelevant values. We therefore prioritize visually verifiable attributes and build a semi-automatic pipeline to produce standardized labels at scale.

Semi-automatic attribute labeling. Our pipeline has three stages: (1) Attribute discovery and normalization: we aggregate attribute values from catalog annotations and use multiple LLMs to normalize synonyms and naming variants, compile a comprehensive attribute-value list, and expand it with as many visually grounded alternatives as possible. (2) Binary question template construction: for each normalized attribute value, we generate multiple binary (Yes/No) question templates (e.g., "Does the dress have a wrap-style front?", "Does the dress have a wrap-front design?", "Is the dress featured with wrap-style front?"). (3) VLM-based labeling: we apply multiple VLMs to answer these templates for each image, yielding per-image attribute labels.

Ensembled VLM labeling and quality control. Given budget constraints, we adopt open-source VLMs for automated labeling. To improve labeling reliability, we ensemble multiple VLMs (Qwen3-VL-235B-FP8 [28], Intern-S1 [29], GLM-4.5V [30]) and combine their predictions through weighted majority voting. In our experiments, Qwen3-VL-235B-Instruct-FP8 achieved the best performance among the models evaluated prior to the release of Qwen3.5-397B-A17B-FP8, and is therefore assigned a larger voting weight.
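As an illustration of the ensembling step, here is a minimal weighted-majority-vote sketch over per-question VLM answers. The specific weight values are hypothetical; the paper only states that Qwen3-VL-235B-Instruct-FP8 receives a larger weight.

```python
# Weighted majority vote over per-question VLM answers (weights illustrative).
from collections import Counter

def weighted_majority(votes, weights):
    """votes: one 'Yes'/'No'/'Unsure' answer per VLM in the ensemble."""
    tally = Counter()
    for answer, weight in zip(votes, weights):
        tally[answer] += weight
    return tally.most_common(1)[0][0]

# e.g., three ensemble members, with the first given a larger (assumed) weight:
label = weighted_majority(["Yes", "No", "Yes"], weights=[2.0, 1.0, 1.0])
assert label == "Yes"
```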
We further enhance robustness by (i) using paraphrased question templates to reduce sensitivity to prompt phrasing and (ii) applying multi-resolution image augmentation to mitigate failures in recognizing small visual details. Finally, we manually audit a subset of the labels to assess quality, identify systematic errors, and refine the templates and normalization rules accordingly.

Figure 2: The semi-automatic attribute labeling pipeline: attribute discovery and normalization, binary question template construction, and ensembled VLM-based labeling with quality control.

Attribute-based similarity. Let Attr(X) denote the set of attribute values assigned to an image X. We define an asymmetric similarity score:

$$\mathrm{Sim}(A, B) = \frac{|\mathrm{Attr}(A) \cap \mathrm{Attr}(B)|}{|\mathrm{Attr}(A)|}, \tag{1}$$

and analogously Sim(B, A). We rank candidates B by Sim(A, B) to retrieve candidate images that are most visually similar to the reference image A in terms of attribute overlap (i.e., candidates that best cover A's attributes).
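Equation (1) translates directly into code; a small sketch with illustrative helper names:

```python
# Asymmetric attribute-overlap similarity, a direct reading of Eq. (1).
def sim(attr_a: set, attr_b: set) -> float:
    """Fraction of A's attribute values that B also carries."""
    return len(attr_a & attr_b) / len(attr_a) if attr_a else 0.0

def rank_candidates(attr_a: set, pool: dict) -> list:
    """pool: image id -> attribute set; ids sorted by how well they cover A."""
    return sorted(pool, key=lambda b: sim(attr_a, pool[b]), reverse=True)
```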
Episode generation and difficulty control. For each target image A, we construct a distractor pool by iterating over its attribute values. For each attribute value, we retrieve a fixed number of images that share that value with A and satisfy Sim(A, B) ≥ τ, where τ is a similarity threshold. We then merge the retrieved sets across all attribute values to obtain the final candidate distractors for A. We keep only targets for which this merged pool contains more than five candidates. To form an episode, we combine the target with a subset of its candidate distractors to create a gallery. We control difficulty via (i) the threshold τ (higher τ yields more visually similar distractors) and (ii) the gallery size (larger galleries expand the search space).

Figure 3: Image gallery generation pipeline: for a given target image, distractors are retrieved by attribute-based similarity and merged into a gallery, with difficulty controlled via the threshold τ and the gallery size.
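A sketch of the episode-generation procedure described above, reusing sim from the previous snippet. The retrieve callback, the per_value retrieval count, and the data structures are assumptions for illustration (the paper specifies only "a fixed number" per attribute value).

```python
# Sketch of distractor retrieval and episode assembly (illustrative structures).
import random

def build_distractor_pool(target, attrs, retrieve, tau, per_value=5):
    """attrs: image id -> attribute set; retrieve(value) -> ids sharing value.
    per_value is an assumed stand-in for the paper's unspecified fixed count."""
    pool = set()
    for value in attrs[target]:
        shared = [b for b in retrieve(value)
                  if b != target and sim(attrs[target], attrs[b]) >= tau]
        pool.update(shared[:per_value])
    return pool if len(pool) > 5 else None   # keep targets with >5 candidates

def make_episode(target, pool, gallery_size, rng=random):
    """Combine the target with sampled distractors into a shuffled gallery."""
    distractors = rng.sample(sorted(pool), gallery_size - 1)
    gallery = distractors + [target]
    rng.shuffle(gallery)
    return gallery, gallery.index(target) + 1   # 1-indexed target position
```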
3.3 Sample image galleries

Figure 4 shows four representative galleries. Each gallery is curated to be similar along salient dimensions (e.g., silhouette and overall style) while differing in subtle construction details. The hidden target is marked with a red outline.

Figure 4: Four sample dress galleries from AMIGO. Each gallery contains one target image and visually similar distractors. The target is highlighted with a red outline.

Scale. We collect 4,880 unique dress images for episode generation. We use five similarity thresholds, τ ∈ {0.3, 0.4, 0.5, 0.6, 0.8}, to control difficulty. Higher thresholds yield smaller but more confusable galleries; lower thresholds typically yield larger galleries with more diverse distractors (Figure 5). We obtain 587 episodes at τ = 0.8 because it is difficult to find enough highly similar distractors under such a strict threshold. For evaluation at the other thresholds, we randomly sample 1,000 episodes per threshold.

Figure 5: Distribution of gallery sizes across similarity thresholds τ. Lower thresholds tend to yield larger galleries with more diverse distractors; higher thresholds produce smaller but more visually confusable candidate pools.

4 Benchmark framework

Benchmark components. Our framework comprises four modules (Figure 6): (i) a benchmark model (the evaluated VLM) that generates questions and produces a final guess, (ii) a question-violation detector (LLM-based) that enforces the protocol, flags invalid questions with Skip, and provides standardized feedback on rule adherence, (iii) a user/oracle simulator (a VLM agent) that answers valid questions with Yes/No/Unsure, and (iv) a verification module that maintains and audits the feasible candidate set implied by the dialogue.

Figure 6: Overview of the AMIGO benchmark framework, illustrating the interaction among the benchmark model, user/oracle simulator, question-violation detector, and verification module.

Given the uploaded gallery and dialogue history, the benchmark model outputs either the next question or a terminal guess. The violation detector checks each question against the benchmark constraints. It operates in a few-shot manner and performs two checks: (1) whether the question concerns a prohibited attribute (e.g., color, sleeve length) or references an image index; and (2) whether the attribute in the question has already been asked about in previous turns. In either case, the violation detector flags the question as invalid and returns Skip (and no oracle information is revealed). We record the number (and rate) of Skip responses per episode as the primary signal of protocol compliance. Implementing the violation detector as a separate module ensures that all models receive consistent, standardized feedback on rule adherence regardless of their internal architecture or reasoning style. The user/oracle simulator answers valid questions with Yes/No/Unsure.
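A rough sketch of the detector's two checks follows. In the benchmark both checks are judged by a few-shot prompted LLM; the keyword matching below is only a rule-based stand-in for illustration, with our own illustrative names and attribute categories.

```python
# Rule-based stand-in for the LLM-based violation detector (illustration only).
PROHIBITED = ("sleeve length", "garment length", "color", "print", "pattern",
              "age group", "size", "shoe", "necklace", "hat", "bag",
              "background", "image #", "first image")

COARSE_TYPES = ("neckline", "skirt", "hemline", "waist", "bodice", "closure")

def attribute_type(question: str) -> str:
    """Keyword stand-in for the few-shot LLM's attribute-type judgment."""
    q = question.lower()
    return next((t for t in COARSE_TYPES if t in q), "other")

def is_invalid(question: str, asked_types: set) -> bool:
    q = question.lower()
    # Check 1: prohibited attribute or explicit image-index reference.
    if any(term in q for term in PROHIBITED):
        return True
    # Check 2: re-enumerating an attribute type already probed in earlier turns.
    t = attribute_type(q)
    return t != "other" and t in asked_types
```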
Verification module (state tracking and consistency checking). After each valid question-answer pair, the verification module converts the interaction into an explicit constraint and applies it to the gallery to update the feasible candidate set, the subset of images consistent with all constraints observed so far. This module serves three purposes. (1) Evidence verification: it determines whether the dialogue has accumulated sufficient evidence to uniquely identify the target (i.e., the feasible set has size one). (2) Trajectory auditing: it logs candidate-set reduction dynamics (e.g., elimination rates, stalls) and supports trajectory-level diagnostics. (3) Consistency checking under noise: it detects contradictions, signaled by an empty feasible set or conflicting constraints, enabling analysis of whether the benchmark model responds conservatively (e.g., by asking follow-up verification questions) rather than guessing prematurely. Importantly, the feasible set is updated only on non-Skip turns, when candidate elimination can occur, cleanly separating protocol violations from evidence-based candidate elimination.

Evaluation and verification. We include an independent evaluation module that logs the full interaction trace and scores both outcome and process. We check whether the model's final guess matches the hidden target (non-verified accuracy). In addition, to rule out random-but-correct guesses, we compute verified accuracy using the feasible candidate set maintained by the verification module: an episode counts as verified-successful only if the feasible set is reduced to exactly one image (the target) before the model guesses. This design allows us to measure not only whether the model ultimately identifies the target, but also whether it does so through a coherent, evidence-driven trajectory that adheres to the protocol and effectively narrows the search space.
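A minimal sketch of this feasible-set bookkeeping, assuming boolean per-image attribute labels; the names and label representation are illustrative:

```python
# Feasible-set bookkeeping (labels: image id -> {attribute: True/False}).
def update_feasible(feasible, labels, attribute, answer):
    """Apply one valid turn as a constraint; Unsure/Skip eliminate nothing."""
    if answer == "Yes":
        return {i for i in feasible if labels[i].get(attribute) is True}
    if answer == "No":
        return {i for i in feasible if labels[i].get(attribute) is False}
    return feasible

def verified_success(feasible, guess, target):
    """Correct guess with a feasible set that uniquely identifies the target."""
    return guess == target and feasible == {target}

def random_guess_correct(feasible, guess, target):
    """Correct guess made while more than one candidate remained feasible."""
    return guess == target and len(feasible) > 1

def contradiction(feasible):
    return len(feasible) == 0     # empty set: constraints are inconsistent
```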
5 Evaluation Metrics

We evaluate models along four complementary axes. Unless noted otherwise, all constraint application and candidate-set updates are computed by the verification module using only valid (non-Skip) question-answer pairs.

• Identification accuracy: Overall (Non-Verified), Verified, Random-Guess. Overall (non-verified) accuracy measures whether the model's final guess matches the hidden target. Verified accuracy excludes random-but-correct guesses: an episode is counted as verified-successful only if (i) the final guess is correct and (ii) the verification module's feasible candidate set has size exactly one and contains the hidden target immediately before the guess. This metric captures whether the model's trajectory effectively narrows down to the correct answer rather than succeeding by chance. Random-guess accuracy counts episodes where the final guess is correct but the feasible candidate set has size greater than one, indicating that the model guessed correctly without sufficient evidence.

• Interaction efficiency. We measure interaction cost as the total number of turns (including Skip) before the final guess. We report efficiency on verified-successful episodes, random-guess-correct episodes, incorrect episodes, and all episodes, so that fewer turns correspond to faster evidence-driven narrowing rather than premature guessing.

• Protocol compliance. We quantify instruction following via (i) the average rate of Skip responses and (ii) the average rate of questions generated before the End of uploading signal (premature outputs). Lower average Skip rates and zero premature outputs indicate stronger adherence; repeated violations reflect failures to recover to valid questioning.

• Robustness to noisy feedback. We test robustness under imperfect oracle answers by injecting controlled noise (e.g., flipping one Yes ↔ No response or perturbing an Unsure response; see the sketch after this list). We then report non-verified and verified accuracy under noise. The verified metric highlights whether the model can recover an evidence-consistent trajectory (e.g., by re-checking critical attributes) rather than succeeding via chance.
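A sketch of the noise-injection step for the robustness axis, assuming a list of recorded oracle answers; flipping exactly one response is one of the perturbations described above, and the function name is our own:

```python
# Inject one controlled flip into the recorded oracle answers (illustrative).
import random

FLIP = {"Yes": "No", "No": "Yes"}

def inject_noise(answers, rng=random):
    """Flip exactly one Yes<->No response; an Unsure perturbation is analogous
    (e.g., replacing one Unsure with a definite answer)."""
    noisy = list(answers)
    flippable = [i for i, a in enumerate(noisy) if a in FLIP]
    if flippable:
        i = rng.choice(flippable)
        noisy[i] = FLIP[noisy[i]]
    return noisy
```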
6 Experimental Results

We evaluate several open-source VLMs on Guess My Preferred Dress across multiple difficulty settings, including Qwen3-VL-235B-Instruct-FP8 [28], Qwen3.5-397B-A17B-FP8 [2], and Step3-VL-10B [5]. We report average verified and overall (non-verified) accuracy, interaction efficiency (average number of turns/rounds before the final guess and after the End of uploading signal), and protocol compliance (average Skip rate and premature-output rate).

Evaluation setup. In our framework, we employ Qwen3-VL-235B-Instruct-FP8 [28] as a few-shot violation detector. For the oracle answering module, we use an agentic Qwen-Agent pipeline with an image zoom-in tool to support inspection of fine-grained details; the VLM backbone is Qwen3-VL-235B-Instruct-FP8 [28].

Figure 7: Average verified, random-guess-correct, and overall (non-verified) accuracy across similarity thresholds for Qwen3-VL-235B-Instruct-FP8, Qwen3.5-397B-A17B-FP8, and Step3-VL-10B.

Accuracy. Figure 7 compares three accuracy metrics across similarity thresholds for the three evaluated models. Qwen3.5-397B-A17B-FP8 attains the highest verified accuracy at thresholds 0.3 through 0.6, peaking at threshold 0.4, whereas Step3-VL-10B performs best at threshold 0.8. Remarkably, despite being the smallest model with only 10B parameters, Step3-VL-10B attains the highest verified, random-guess, and overall (non-verified) accuracy at the highest similarity threshold, where the average gallery size is smaller but the candidates are most visually similar and therefore hardest to distinguish.

By contrast, Qwen3-VL-235B-Instruct-FP8 generally attains the highest random-guess accuracy at thresholds 0.3 through 0.6, but this advantage does not translate into the highest verified accuracy. This gap indicates that higher final-answer accuracy alone does not necessarily reflect better-grounded or verifiable reasoning. For overall (non-verified) accuracy, Qwen3-VL-235B-Instruct-FP8 leads at lower thresholds, likely benefiting in part from its relatively high random-guess accuracy, while Qwen3.5-397B-A17B-FP8 peaks at 0.6 and Step3-VL-10B performs best at 0.8.

Overall, these patterns suggest that Step3-VL-10B is more affected by gallery size, whereas Qwen3.5-397B-A17B-FP8 is more sensitive to visually confusable candidates. Additionally, the consistently higher random-guess accuracy of Qwen3-VL-235B-Instruct-FP8 may indicate a greater tendency to guess early, while Qwen3.5-397B-A17B-FP8 appears more conservative, preferring to ask more questions rather than guess at random.

Figure 8: Average verified accuracy with 95% confidence intervals across similarity thresholds for Qwen3-VL-235B-Instruct-FP8, Qwen3.5-397B-A17B-FP8, and Step3-VL-10B.

Figure 8 shows verified accuracy with 95% confidence intervals across similarity thresholds for the three models. Notably, Qwen3.5-397B-A17B-FP8 is strongest at lower and intermediate thresholds, whereas Step3-VL-10B becomes strongest at the highest threshold.

Efficiency. Figure 9 illustrates interaction length across similarity thresholds for the three evaluated models in four outcome categories: verified correct, random-guess correct, incorrect, and all episodes. In the verified-correct category, Step3-VL-10B, despite being the smallest model, generally shows the longest interaction length across thresholds. In contrast, Qwen3-VL-235B-Instruct-FP8 consistently has the shortest interaction length, suggesting that it may be more prone to guessing early without sufficient evidence, especially when the gallery contains more than 20 candidates (at thresholds 0.3 and 0.4). Qwen3.5-397B-A17B-FP8, by comparison, maintains a relatively constant interaction length across thresholds, consistent with a more conservative strategy of asking additional questions regardless of gallery size or task difficulty.

Figure 9: Average number of dialogue rounds before guessing across similarity thresholds for four outcome categories (verified correct, random-guess correct, incorrect, all) for the three models.

In the incorrect category, Qwen3.5-397B-A17B-FP8 generally exhibits the longest interaction length, suggesting that it continues querying even when discriminative cues are difficult to identify. By contrast, Qwen3-VL-235B-Instruct-FP8 and Step3-VL-10B appear more likely to stop earlier when they struggle, particularly at lower similarity thresholds where gallery sizes are larger.

Figure 10: Average Skip rates across similarity thresholds for the three models.

Protocol compliance. Figure 10 shows the average Skip rates across similarity thresholds for the three evaluated models. In general, Qwen3.5-397B-A17B-FP8 exhibits the highest Skip rates across thresholds in all four outcome categories. This suggests that, while it attains stronger verified accuracy than Qwen3-VL-235B-Instruct-FP8, it also struggles more with protocol adherence, possibly because its more persistent questioning strategy results in more frequent rule violations.

Figure 11 shows the premature-output rates across similarity thresholds for the three models. Unlike Skip rates, premature-output rates decrease steadily as the similarity threshold rises and the average gallery size falls. Step3-VL-10B remains the worst across thresholds, while both Qwen3-VL-235B-Instruct-FP8 and Qwen3.5-397B-A17B-FP8 approach zero at the highest threshold.

Figure 11: Premature-output rates across similarity thresholds for the three models.

These results suggest that premature questioning is driven more by larger gallery size than by visual confusability, and that Step3-VL-10B is more sensitive to gallery size than the other two models.
7 Failure Cases

7.1 Qwen3-VL-235B-Instruct-FP8 failure cases

Figure 12 shows a failure case for Qwen3-VL-235B-Instruct-FP8 at τ = 0.8 with 6 candidate images. Over 9 rounds, the model asks about tiered skirts, fabric sheen, wrap fronts, hem shape, and ruffled hems, but also violates constraints by asking about sleeve length, garment length, and color; these invalid turns are returned as Skip. Despite some narrowing from the valid turns, the final guess is incorrect. This episode illustrates two common failure modes: (i) poor protocol compliance that wastes interaction budget and (ii) insufficiently discriminative questions among highly similar candidates.

Prompt (shared by Figures 12 through 15)

Task: I will upload several batches of dress options. Please guess which one is my favorite dress and provide its index in the image gallery (starting index: 1).
Constraints:
1. Ask questions to gather attribute information about the target dress.
2. Ask exactly one Yes/No question per turn about one distinguishable feature. I will respond with Yes, No, Unsure (if not confident), or Skip (if you violate the rules).
3. If you receive Skip, your latest question was invalid and provides no information.
4. Do not ask about sleeve length, garment length, color, pattern/prints, age group, size, shoes, necklace, hat, bag, background, or the human model.
5. Do not enumerate attribute values across turns for the same attribute type (e.g., square neck, V-neck, crew neck).
6. Do not inspect images one by one or reference indices (e.g., "is your favorite dress the first image?").
7. Do not guess until you have narrowed the target down to one image.
8. If your confidence is low or you find contradictions, keep asking verification questions.
9. When ready, output: My guess of your favorite dress: #.
10. Only start generating questions after you receive End of uploading.

Turn 1 Model: Does the dress have a tiered skirt? User: Yes
Turn 2 Model: Is the dress made of a shiny or satin-like fabric? User: Yes
Turn 3 Model: Does the dress have long sleeves? User: Skip
Turn 4 Model: Is the dress sleeveless? User: Skip
Turn 5 Model: Does the dress have a wrap-style front? User: Yes
Turn 6 Model: Is the dress floor-length? User: Skip
Turn 7 Model: Does the dress have a high-low hemline? User: No
Turn 8 Model: Is the dress in a solid color? User: Skip
Turn 9 Model: Does the dress have a ruffled hem? User: Yes
Model: My guess: dress #2 (✗ Incorrect)

Figure 12: Failure case of Qwen3-VL-235B-Instruct-FP8 on Guess My Preferred Dress. Out of 9 turns, 4 are Skip (protocol violations), and the final guess is incorrect.

Figure 13 shows a second failure case for Qwen3-VL-235B-Instruct-FP8 at τ = 0.3 with 35 candidates across seven upload batches. The model begins generating questions during the upload phase, including echoing End of uploading and asking questions before the actual signal, violating the upload protocol. After the upload completes, it asks only one valid question and immediately guesses, effectively a random guess. This illustrates how large galleries can amplify (i) premature, protocol-violating behavior and (ii) early stopping with insufficient evidence.

Prompt identical to Figure 12.

User: Prompt + first batch of dress options.
Model: End of uploading
User: Here is the next batch of dress options.
Model: End of uploading
User: Here is the next batch of dress options.
Model: Does the dress have a ruffled hem?
User: Here is the next batch of dress options.
Model: Does the dress have a lace overlay?
User: Here is the next batch of dress options.
Model: Does the dress have a high neckline?
User: Here is the next batch of dress options.
Model: Does the dress have a V-neckline?
User: End of uploading
Turn 1 Model: Does the dress have a front tie detail? User: Yes
Model: My guess: dress #1 (✗ Incorrect)

Figure 13: Failure case at τ = 0.3 with 35 candidates. The model violates the upload protocol by generating outputs before the true End of uploading signal, then guesses after only one answered question.

7.2 Qwen3.5-397B-A17B-FP8 failure cases

Figure 14 shows a failure case for Qwen3.5-397B-A17B-FP8 at τ = 0.6 with 7 candidate images. The model uses all 20 allowed turns without ever committing to a final guess. Ten of the twenty turns result in Skip: Questions 2–4 enumerate neckline types (off-the-shoulder, square, sweetheart) after already asking about the V-neckline in Question 1, violating Rule 5; Question 9 asks about sleeve length (Rule 4); Questions 11–12 ask about print types (Rule 4); Question 16 re-enumerates skirt styles after asking about tiered skirts in Question 6 (Rule 5); and Questions 18–20 ask about garment length and color (Rule 4). The remaining valid turns establish that the target has a smocked bodice, a side slit, and a fitted waist, but the model fails to synthesize these constraints to narrow the candidate set to a unique target, exhausting the budget without a guess. This episode illustrates (i) repeated violation of the same rule types throughout the interaction and (ii) insufficient constraint integration despite accumulating valid discriminative evidence.

Figure 15 shows a failure case for Qwen3.5-397B-A17B-FP8 at τ = 0.5 with 14 candidate images. The model again exhausts all 20 allowed turns without committing to a final guess, this time at a lower similarity threshold. Twelve of the 20 turns result in Skip: Turn 3 asks about floral print (Rule 4); Turns 9–10 enumerate strapless and sweetheart necklines after the V-neckline was confirmed in Turn 4 (Rule 5); and Turns 12–20 cycle through nine compound combinations of already-confirmed attributes (wrap style, V-neckline, slit, ruched detail, and high-low hemline), re-enumerating established facts rather than posing new discriminative questions (Rule 5). The valid turns collectively establish five features (wrap style, V-neckline, slit, ruched detail, high-low hemline), yet the model fails to synthesize this evidence to narrow the 14-candidate pool to a unique target. This episode illustrates (i) repeated Rule 4 and Rule 5 violations identical in kind to those in Figure 14, and (ii) a distinctive re-enumeration loop in which the model exhausts its remaining budget cycling through compound combinations of already-confirmed attribute pairs and triples rather than probing genuinely new features.
Prompt identical to Figure 12.

Turn 1 Model: Does your favorite dress feature a V-neckline? User: No
Turn 2 Model: Does your favorite dress have an off-the-shoulder neckline? User: Skip (Rule 5: enumerating neckline types)
Turn 3 Model: Does your favorite dress have a square neckline? User: Skip (Rule 5: enumerating neckline types)
Turn 4 Model: Does your favorite dress have a sweetheart neckline? User: Skip (Rule 5: enumerating neckline types)
Turn 5 Model: Does your favorite dress have a wrap-style bodice? User: No
Turn 6 Model: Does your favorite dress have a tiered skirt? User: No
Turn 7 Model: Does your favorite dress have a smocked bodice? User: Yes
Turn 8 Model: Does your favorite dress have flutter sleeves? User: No
Turn 9 Model: Does your favorite dress have long sleeves? User: Skip (Rule 4: sleeve length)
Turn 10 Model: Does your favorite dress have a side slit? User: Yes
Turn 11 Model: Does your favorite dress have a floral print? User: Skip (Rule 4: pattern/prints)
Turn 12 Model: Does your favorite dress have a leaf print? User: Skip (Rule 4: pattern/prints)
Turn 13 Model: Does your favorite dress have a high-low hemline? User: No
Turn 14 Model: Does your favorite dress have a straight hemline? User: No
Turn 15 Model: Does your favorite dress have a ruffled hemline? User: No
Turn 16 Model: Does your favorite dress have a flared skirt? User: Skip (Rule 5: enumerating skirt types)
Turn 17 Model: Does your favorite dress have a fitted waist? User: Yes
Turn 18 Model: Does your favorite dress have a midi length? User: Skip (Rule 4: garment length)
Turn 19 Model: Does your favorite dress have a maxi length? User: Skip (Rule 4: garment length)
Turn 20 Model: Does your favorite dress have a solid color? User: Skip (Rule 4: color)
Model: No guess made; 20-question budget exhausted (✗ Incorrect)

Figure 14: Failure case of Qwen3.5-397B-A17B-FP8 on Guess My Preferred Dress at τ = 0.6 with 7 candidates. Out of 20 turns, 10 are Skip (protocol violations, annotated with the violated rule), and the model exhausts its entire question budget without making a final guess. Repeated Rule 4 and Rule 5 violations waste turns, while valid evidence about the smocked bodice, side slit, and fitted waist is not synthesized into a unique identification.

Prompt identical to Figure 12.

Turn 1 Model: Is your favorite dress made of velvet fabric? User: No
Turn 2 Model: Is your favorite dress a wrap-style dress? User: Yes
Turn 3 Model: Does your favorite dress have a floral print? User: Skip (Rule 4: pattern/prints)
Turn 4 Model: Does your favorite dress have a V-neckline? User: Yes
Turn 5 Model: Does your favorite dress have a tiered skirt? User: No
Turn 6 Model: Does your favorite dress have a slit? User: Yes
Turn 7 Model: Does your favorite dress have a belt or tie at the waist? User: No
Turn 8 Model: Does your favorite dress have a ruched detail? User: Yes
Turn 9 Model: Does your favorite dress have a strapless neckline? User: Skip (Rule 5: enumerating neckline types)
Turn 10 Model: Does your favorite dress have a sweetheart neckline? User: Skip (Rule 5: enumerating neckline types)
Turn 11 Model: Does your favorite dress have a high-low hemline? User: Yes
Turn 12 Model: Does your favorite dress have a wrap-style bodice with a V-neckline, a ruched detail, and a high-low hemline? User: Skip (Rule 5: re-enumerating confirmed attributes)
Turn 13 Model: Does your favorite dress have a wrap-style bodice with a V-neckline and a high-low hemline? User: Skip (Rule 5: re-enumerating confirmed attributes)
Turn 14 Model: Does your favorite dress have a V-neckline and a high-low hemline? User: Skip (Rule 5: re-enumerating confirmed attributes)
Turn 15 Model: Does your favorite dress have a V-neckline and a slit? User: Skip (Rule 5: re-enumerating confirmed attributes)
Turn 16 Model: Does your favorite dress have a V-neckline and ruched detailing? User: Skip (Rule 5: re-enumerating confirmed attributes)
Turn 17 Model: Does your favorite dress have a wrap-style bodice and a high-low hemline? User: Skip (Rule 5: re-enumerating confirmed attributes)
Turn 18 Model: Does your favorite dress have a wrap-style bodice and a slit? User: Skip (Rule 5: re-enumerating confirmed attributes)
Turn 19 Model: Does your favorite dress have a V-neckline and a wrap-style bodice? User: Skip (Rule 5: re-enumerating confirmed attributes)
Turn 20 Model: Does your favorite dress have a V-neckline and a ruched detail? User: Skip (Rule 5: re-enumerating confirmed attributes)
Model: No guess made; 20-question budget exhausted (✗ Incorrect)

Figure 15: Failure case of Qwen3.5-397B-A17B-FP8 on Guess My Preferred Dress at τ = 0.5 with 14 candidates. Out of 20 turns, 12 are Skip (protocol violations, annotated with the violated rule), and the model exhausts its entire question budget without making a final guess. After valid turns establish wrap style (Turn 2), V-neckline (Turn 4), slit (Turn 6), ruched detail (Turn 8), and high-low hemline (Turn 11), the model enters a nine-turn re-enumeration loop cycling through compound combinations of these already-confirmed attributes (Turns 12–20) rather than probing new discriminative features.

8 Reinforcement Learning Trajectories Collection

AMIGO can also serve as a trajectory collection framework for multimodal reinforcement learning. Each step provides a well-defined state (gallery, dialogue context, accumulated constraints), an action (a constrained question), and an observation (Yes/No/Unsure/Skip feedback, optionally noisy), with a naturally long-horizon objective. This structure supports dense rewards without additional annotation, e.g., penalties for invalid questions (Skip), progress rewards based on reduction in feasible-set size or estimated information gain, and terminal rewards for correct identification only when the target is uniquely determined. These trajectories can support offline RL, preference optimization over alternative questions, and robustness training under imperfect feedback.
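To illustrate the trajectory structure, here is a hedged sketch of a per-step record and one possible reward shaping consistent with the description above. Field names and reward magnitudes are our own choices, not prescribed by AMIGO.

```python
# Illustrative per-step trajectory record and one possible reward shaping.
from dataclasses import dataclass

@dataclass
class Step:
    state: dict        # gallery, dialogue context, accumulated constraints
    action: str        # the constrained question asked this turn
    observation: str   # "Yes" / "No" / "Unsure" / "Skip" (optionally noisy)
    reward: float

def shape_reward(answer, feasible_before, feasible_after,
                 done=False, correct=False, unique=False):
    """Dense reward: penalize Skip, credit feasible-set reduction, and grant
    terminal credit only when the target was uniquely determined."""
    r = -1.0 if answer == "Skip" else 0.0
    r += 0.1 * (len(feasible_before) - len(feasible_after))  # progress signal
    if done:
        r += 1.0 if (correct and unique) else -1.0
    return r
```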
References

[1] Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal AI agents. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14203–14214, 2025.

[2] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026.

[3] V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, et al. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2025.

[4] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.
[5] Ailin Huang, Chengyuan Yao, Chunrui Han, Fanqi Wan, Hangyu Guo, Haoran Lv, Hongyu Zhou, Jia Wang, Jian Zhou, Jianjian Sun, et al. Step3-VL-10B technical report, 2026.

[6] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913, 2017.

[7] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022.

[8] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.

[9] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024.

[10] Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, et al. MME-RealWorld: Could your multimodal LLM challenge high-resolution real-world scenarios that are difficult for humans? arXiv preprint arXiv:2408.13257, 2024.

[11] Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, et al. MMIU: Multimodal multi-image understanding for evaluating large vision-language models. arXiv preprint arXiv:2408.02718, 2024.

[12] Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, et al. MuirBench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411, 2024.

[13] Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. mPLUG-Owl3: Towards long image-sequence understanding in multi-modal large language models. arXiv preprint arXiv:2408.04840, 2024.
[14] Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, and Wenhu Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024.

[15] Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, et al. MMDU: A multi-turn multi-image dialog understanding benchmark and instruction-tuning dataset for LVLMs. Advances in Neural Information Processing Systems, 37:8698–8733, 2024.

[16] Dawei Yan, Yang Li, Qing-Guo Chen, Weihua Luo, Peng Wang, Haokui Zhang, and Chunhua Shen. MMCR: Advancing visual language model in multimodal multi-turn contextual reasoning. arXiv preprint arXiv:2503.18533, 2025.

[17] Young-Jun Lee, Byung-Kwan Lee, Jianshu Zhang, Yechan Hwang, Byungsoo Ko, Han-Gyu Kim, Dongyu Yao, Xuankun Rong, Eojin Joo, Seung-Ho Han, et al. MultiVerse: A multi-turn conversation benchmark for evaluating large vision and language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 708–719, 2025.

[18] Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, et al. ConvBench: A multi-turn conversation evaluation benchmark with hierarchical ablation capability for large vision-language models. Advances in Neural Information Processing Systems, 37:100734–100782, 2024.

[19] Elliot L. Epstein, Kaisheng Yao, Jing Li, Xinyi Bai, and Hamid Palangi. MMMT-IF: A challenging multimodal multi-turn instruction following benchmark. arXiv preprint arXiv:2409.18216, 2024.

[20] Yongqi Li, Wenjie Li, and Liqiang Nie. MMCoQA: Conversational question answering over text, tables, and images. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4220–4231, 2022.

[21] Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, and Yu Cheng. Can MLLMs reason in multimodality? EMMA: An enhanced multimodal reasoning benchmark. arXiv preprint arXiv:2501.05444, 2025.

[22] Jihyung Kil, Zheda Mai, Justin Lee, Arpita Chowdhury, Zihe Wang, Kerrie Cheng, Lemeng Wang, Ye Liu, and Wei-Lun Harry Chao. MLLM-CompBench: A comparative reasoning benchmark for multimodal LLMs. Advances in Neural Information Processing Systems, 37:28798–28827, 2024.

[23] Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, et al. MMMU-Pro: A more robust multi-discipline multimodal understanding benchmark. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15134–15186, 2025.

[24] Bryan L. M. de Oliveira, Luana G. B. Martins, Bruno Brandão, and Luckeciano C. Melo. InfoQuest: Evaluating multi-turn dialogue agents for open-ended conversations with hidden context. arXiv preprint arXiv:2502.12257, 2025.

[25] Kimia Ramezan, Alireza Amiri Bavandpour, Yifei Yuan, Clemencia Siro, and Mohammad Aliannejadi. Multi-turn multi-modal question clarification for enhanced conversational understanding. arXiv preprint arXiv:2502.11442, 2025.

[26] Xiaohan Yu, Chao Feng, Lang Mei, and Chong Chen. M3Searcher: Modular multimodal information seeking agency with retrieval-oriented reasoning. arXiv preprint arXiv:2601.09278, 2026.
[27] Vardhan Dongre, Chi Gui, Shubham Garg, Hooshang Nayyeri, Gokhan Tur, Dilek Hakkani-Tür, and Vikram S. Adve. MIRAGE: A benchmark for multimodal information-seeking and reasoning in agricultural expert-guided conversations. arXiv preprint arXiv:2506.20100, 2025.

[28] Qwen Team. Qwen3 technical report, 2025.

[29] Lei Bai, Zhongrui Cai, Yuhang Cao, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, et al. Intern-S1: A scientific multimodal foundation model. arXiv preprint arXiv:2508.15763, 2025.

[30] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025.