MCIF: Multimodal Crosslingual Instruction-Following Benchmark from Scientific Talks
Recent advances in large language models have laid the foundation for multimodal LLMs (MLLMs), which unify text, speech, and vision within a single framework. As these models are rapidly evolving toward general-purpose instruction following across diverse and complex tasks, a key frontier is evaluating their crosslingual and multimodal capabilities over both short- and long-form inputs. However, existing benchmarks fall short in evaluating these dimensions jointly: they are often limited to English, mostly focus on a single modality at a time, rely on short-form inputs, or lack human annotations, hindering comprehensive assessment of model performance across languages, modalities, and task complexity. To address these gaps, we introduce MCIF (Multimodal Crosslingual Instruction Following), the first crosslingual human-annotated benchmark based on scientific talks on NLP and beyond. MCIF evaluates instruction following in crosslingual, multimodal settings over different input lengths and spans four macro-tasks: recognition, translation, question answering, and summarization. It covers three core modalities (speech, vision, and text) and four diverse languages (English, German, Italian, and Chinese), fully aligned across all dimensions. This parallel design enables a systematic evaluation of MLLMs’ abilities to interpret instructions across languages and effectively integrate multimodal contextual information. Our benchmarking and analysis of 23 models highlight universal challenges across modalities and tasks, indicating substantial room for improvement in future MLLM development. MCIF is released under a CC-BY 4.0 license to promote open research.
💡 Research Summary
The paper introduces MCIF (Multimodal Crosslingual Instruction‑Following Benchmark), a novel evaluation suite designed to assess the instruction‑following capabilities of modern multimodal large language models (MLLMs) across languages, modalities, and input lengths. Existing benchmarks typically focus on a single modality, are limited to English, or only handle short inputs, which hampers comprehensive evaluation of emerging MLLMs that are expected to process speech, vision, and text jointly while following natural language instructions in multiple languages. MCIF fills this gap by constructing a human‑annotated dataset derived from scientific presentation videos (primarily from ACL 2023), covering three core modalities—text (transcripts and abstracts), speech (audio recordings), and video (slides and presenter gestures)—and four typologically diverse languages: English, German, Italian, and Chinese.
Data collection involved selecting 21 primary talks (≈2 hours) and an additional 79 talks (≈8 hours), for a total of about 10 hours of content. Professional linguists produced high‑quality English transcripts, and the original paper abstracts were used as summaries. These were then translated by experts into the three target languages, ensuring full alignment across modalities and languages. For each talk, at least ten question‑answer (QA) pairs were crafted by 16 domain experts, following a structured distribution: generic questions, transcript‑based questions that require fine‑grained information from the full talk, and abstract‑based questions that can be answered from the abstract alone. Each QA pair is annotated with the modality (audio, video, both, or none) that contains the answer, enabling systematic evaluation of multimodal reasoning and of how models handle unanswerable cases.
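The annotation scheme described above can be pictured as a small record per QA pair. The sketch below is illustrative only: the field and type names are assumptions for exposition, not the schema of the released dataset.

```python
from dataclasses import dataclass
from enum import Enum

class AnswerModality(Enum):
    """Modality that contains the answer, per the annotation scheme."""
    AUDIO = "audio"
    VIDEO = "video"
    BOTH = "both"
    NONE = "none"   # question is unanswerable from the given inputs

@dataclass
class QAPair:
    """One annotated question-answer pair for a talk (hypothetical schema)."""
    talk_id: str
    question: str
    answer: str
    question_type: str              # "generic" | "transcript" | "abstract"
    answer_modality: AnswerModality

# Example: a transcript-based question whose answer appears only on the slides,
# so a speech-only model should not be able to answer it.
qa = QAPair(
    talk_id="talk_001",
    question="Which dataset is shown in the results table?",
    answer="(hypothetical answer text)",
    question_type="transcript",
    answer_modality=AnswerModality.VIDEO,
)
```

Keeping the answer-bearing modality explicit is what lets the benchmark test, per question, whether a model actually used the right input stream rather than guessing from text alone.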
The benchmark defines 13 distinct tasks grouped into four macro‑tasks: recognition (ASR, audio‑video recognition), translation (machine translation, speech‑to‑text translation, audio‑video translation), question answering (text QA, spoken QA, video QA, audio‑video QA), and summarization (text summarization, speech summarization, video summarization, audio‑video summarization). Each task is characterized by input modality, output modality, source and target language, and context length (short segments of roughly 16 seconds versus the long, full‑talk context). Prompts are provided in two styles: a fixed prompt that is identical across samples, and a “mix” prompt that varies naturally, allowing assessment of model robustness to prompt phrasing.
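Because every task is a point in a grid of modality, language pair, and context length, the evaluation configurations can be enumerated mechanically. Below is a minimal sketch of that idea; the class and field names are assumptions for illustration, not the benchmark's actual identifiers, and the enumeration covers only one macro-task as an example.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class TaskConfig:
    """One evaluation configuration (hypothetical naming)."""
    macro_task: str       # "recognition" | "translation" | "qa" | "summarization"
    input_modality: str   # "text" | "speech" | "video" | "audio-video"
    source_lang: str
    target_lang: str
    context: str          # "short" (~16 s segments) or "long" (full talk)

# Example: enumerate translation configurations from English audio inputs
# into the three target languages, at both context lengths.
translation_configs = [
    TaskConfig("translation", modality, "en", tgt, ctx)
    for modality, tgt, ctx in product(
        ["speech", "audio-video"],   # input modalities carrying audio
        ["de", "it", "zh"],          # target languages
        ["short", "long"],           # context lengths
    )
]
print(len(translation_configs))  # 2 modalities x 3 targets x 2 lengths = 12
```

This kind of fully crossed design is what the summary means by "fully aligned across all dimensions": the same talk appears under every modality, language, and length condition, so score differences can be attributed to the condition rather than to the content.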
To benchmark current technology, the authors evaluated 23 state‑of‑the‑art systems, including 7 large language models (LLMs), 5 speech‑LLMs, 5 vision‑LLMs, and 6 multimodal LLMs (MLLMs). The evaluation reveals several systematic weaknesses: (1) Long‑form contexts, especially for summarization, remain challenging; models produce low ROUGE scores, indicating insufficient handling of long‑range dependencies. (2) Multimodal integration is still imperfect: while MLLMs outperform single‑modality baselines on audio‑video tasks, error rates are high when both speech and visual cues must be jointly interpreted. (3) Crosslingual settings (e.g., audio in German with an English instruction) cause a sharp drop in translation and summarization quality, exposing cascading errors in multilingual‑multimodal pipelines. (4) Performance differences between fixed and mixed prompts are modest, suggesting current models are relatively insensitive to prompt variation.
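The summarization results above are reported in ROUGE, which measures n-gram overlap between a model's summary and the human reference. As a reminder of what that metric captures, here is a toy ROUGE-1 F1 implementation with plain whitespace tokenization; real evaluations use dedicated ROUGE packages with proper tokenization and typically report ROUGE-1, ROUGE-2, and ROUGE-L.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Toy ROUGE-1 F1: unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1(
    "the model summarizes the talk",
    "the model fails to summarize the talk",
)
```

Because ROUGE rewards surface overlap, low scores on long-form summarization can reflect either genuinely missing content or heavy paraphrasing; the paper's per-modality breakdown helps separate the two.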
From these findings, the authors draw two main research directions. First, better mechanisms for compressing and summarizing long contexts are needed, such as hierarchical attention, external memory modules, or dedicated pre‑training objectives for long‑form summarization. Second, more sophisticated multitask learning that simultaneously aligns modalities and languages could mitigate the observed error propagation, particularly in speech‑video joint recognition. The benchmark also underscores the value of high‑quality human annotations; the CC‑BY 4.0‑licensed dataset and accompanying Apache‑2.0 evaluation code are publicly released to foster reproducibility and further development.
In summary, MCIF is the first fully parallel, human‑annotated benchmark that jointly evaluates cross‑lingual, multimodal instruction following over both short‑ and long‑form inputs. Its comprehensive design, extensive task coverage, and rigorous human annotation expose substantial gaps in current MLLM capabilities and provide a clear roadmap for future research in multimodal, multilingual, and instruction‑following AI systems.