INST-IT: Boosting Instance Understanding via Explicit Visual Prompt Instruction Tuning

Notice: This research summary and analysis were automatically generated using AI technology. For complete accuracy, please refer to the original ArXiv source.

Large Multimodal Models (LMMs) have made significant breakthroughs with the advancement of instruction tuning. However, while existing models can understand images and videos at a holistic level, they still struggle with instance-level understanding, which requires more fine-grained comprehension and alignment. Instance-level understanding is crucial for LMMs, as it focuses on the specific elements we are most interested in. Excitingly, existing works find that state-of-the-art LMMs exhibit strong instance understanding capabilities when provided with explicit visual cues. Motivated by this, we propose Inst-IT, a solution to enhance LMMs in instance understanding via explicit visual prompt Instruction Tuning. Inst-IT consists of a benchmark to diagnose multimodal instance-level understanding, a large-scale instruction-tuning dataset, and a continuous instruction-tuning training paradigm that effectively enhances the spatial-temporal instance understanding capabilities of existing LMMs. Experimental results show that, enhanced by Inst-IT, our models not only achieve outstanding performance on Inst-IT Bench and other instance understanding benchmarks, but also demonstrate significant improvements across various generic image and video understanding benchmarks. This highlights that our method not only boosts instance-level understanding but also strengthens the overall capabilities of generic image and video comprehension.


💡 Research Summary

The paper introduces Inst‑IT, a framework designed to improve the ability of large multimodal models (LMMs) to understand individual instances within images and videos. While current LMMs excel at holistic perception, they often fail to capture the fine‑grained attributes, relationships, and temporal dynamics of the specific objects that users care about. Inst‑IT addresses this gap through three main components. First, it provides Inst‑IT Bench, a human‑verified benchmark that evaluates instance‑level comprehension across both static and dynamic visual content. Second, it builds the Inst‑IT Dataset using an automated pipeline that augments each visual input with explicit visual prompts: Set‑of‑Marks (SoM) identifiers that overlay numerical IDs on target instances. These prompts guide GPT‑4o to generate multi‑level annotations: per‑frame instance captions, whole‑image captions, temporal difference descriptions, video‑level summaries, and extensive open‑ended QA pairs focused on instance attributes and interactions. The resulting corpus contains 51 k images and 21 k videos, amounting to over 200 k image captions, 135 k temporal descriptions, 21 k video captions, and 335 k QA pairs. Third, the authors propose a continuous instruction‑tuning recipe that mixes the Inst‑IT data with generic multimodal instruction data, feeding the model inputs that include the visual prompts. This teaches the model to reason about “which instance,” not just “where.” Experiments on LLaVA‑2‑based models show substantial gains: on Inst‑IT Bench the tuned model outperforms the prior state‑of‑the‑art by 7–12 points of absolute accuracy; it also improves on other instance‑understanding benchmarks such as RefCOCOg and ViP‑Bench. Importantly, generic image and video understanding benchmarks (AI2D, ChartQA, EgoSchema, NExT‑QA) see 4–13 % relative improvements, demonstrating that instance‑centric tuning also strengthens overall multimodal reasoning.
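To make the multi-level annotation pipeline more concrete, here is a minimal sketch of the kind of per-video record such a pipeline might assemble. All field names (`frame_captions`, `temporal_diffs`, etc.) and the `build_annotation` helper are illustrative assumptions, not the dataset's actual schema; the point is simply how per-frame instance captions keyed by SoM IDs can be combined with frame-to-frame difference tracking.

```python
# Hypothetical sketch of one Inst-IT-style video annotation record.
# Field names are illustrative; the actual dataset schema may differ.

def build_annotation(video_id, frames):
    """frames: list of dicts, each with an 'instances' map {som_id: caption}
    and a whole-frame 'caption'. Returns a multi-level annotation record."""
    record = {
        "video_id": video_id,
        "frame_captions": [],   # per-frame: instance-level + whole-frame captions
        "temporal_diffs": [],   # which instances changed between adjacent frames
        "video_caption": None,  # would be filled by a video-level summarizer
        "qa_pairs": [],         # open-ended instance QA would be appended here
    }
    for i, frame in enumerate(frames):
        record["frame_captions"].append({
            "frame_index": i,
            "instances": frame["instances"],
            "whole_frame": frame["caption"],
        })
        if i > 0:
            prev = frames[i - 1]["instances"]
            cur = frame["instances"]
            # An instance "changed" if its caption differs from the previous frame.
            changed = [k for k in cur if prev.get(k) != cur[k]]
            record["temporal_diffs"].append(
                {"between": (i - 1, i), "changed_instances": changed}
            )
    return record
```

For example, two frames in which SoM instance 1 goes from "a dog sitting" to "a dog running" would yield one temporal-difference entry flagging instance 1 as changed.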
Ablation studies confirm that the visual prompts reduce hallucination and that performance scales with dataset size, plateauing after roughly 50 k annotated samples. In sum, Inst‑IT shows that explicit visual prompting combined with large‑scale instance‑level instruction tuning can effectively bridge the fine‑grained understanding gap in current LMMs, opening avenues for more precise visual‑language applications such as robotics, video analytics, and interactive AI assistants.
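The continuous instruction-tuning recipe described above mixes instance-level data with generic multimodal instruction data. A minimal sketch of such a mixing step is below; the `instance_ratio` knob and the `mix_instruction_data` helper are assumptions for illustration, since the paper's actual mixing weights and sampler are not reproduced here.

```python
import random

def mix_instruction_data(instance_data, generic_data, instance_ratio=0.5, seed=0):
    """Combine instance-level (Inst-IT-style) samples with generic
    instruction-tuning samples at a target ratio, then shuffle.

    instance_ratio is a hypothetical knob: the fraction of the mixed
    set drawn from instance-level data, capped by availability.
    """
    rng = random.Random(seed)
    n_total = len(instance_data) + len(generic_data)
    n_instance = int(n_total * instance_ratio)
    mixed = (
        rng.sample(instance_data, min(n_instance, len(instance_data)))
        + rng.sample(generic_data, min(n_total - n_instance, len(generic_data)))
    )
    rng.shuffle(mixed)  # interleave so batches see both data types
    return mixed
```

With a 0.5 ratio and equally sized pools, the mixed set simply interleaves all samples from both sources; skewing the ratio would subsample one side.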

