IMTBench: A Multi-Scenario Cross-Modal Collaborative Evaluation Benchmark for In-Image Machine Translation
Notice: This research summary and analysis were generated automatically using AI. For authoritative details, please refer to the original arXiv paper.

End-to-end In-Image Machine Translation (IIMT) aims to convert text embedded within an image into a target language while preserving the original visual context, layout, and rendering style. However, existing IIMT benchmarks are largely synthetic and thus fail to reflect real-world complexity, while current evaluation protocols focus on single-modality metrics and overlook cross-modal faithfulness between rendered text and model outputs. To address these shortcomings, we present the In-Image Machine Translation Benchmark (IMTBench), a new benchmark of 2,500 image translation samples covering four practical scenarios and nine languages. IMTBench supports multi-aspect evaluation, including translation quality, background preservation, overall image quality, and a cross-modal alignment score that measures consistency between the translated text produced by the model and the text rendered in the translated image. We benchmark strong commercial cascade systems as well as closed- and open-source unified multi-modal models, and observe large performance gaps across scenarios and languages, especially on natural scenes and resource-limited languages, highlighting substantial headroom for end-to-end image text translation. We hope IMTBench establishes a standardized benchmark to accelerate progress in this emerging task.


💡 Research Summary

The paper introduces IMTBench, a comprehensive benchmark designed to evaluate end‑to‑end In‑Image Machine Translation (IIMT) systems in realistic, multi‑scenario settings. Existing IIMT datasets are largely synthetic, limited to simple layouts (single‑line, horizontal, monochrome text), and evaluate only single‑modality metrics such as BLEU or FID. Consequently, they fail to capture the true difficulty of translating text embedded in real‑world images while preserving visual style, layout, and rendering fidelity. Moreover, they do not measure cross‑modal faithfulness – whether the translated text rendered in the output image matches the model’s textual translation.

IMTBench addresses these gaps by providing 2,500 high‑quality image‑translation pairs covering four practical domains: documents, web pages, natural scenes, and presentation slides. Each domain includes complex layouts, multi‑line text, varied fonts, colors, and non‑horizontal orientations. The benchmark spans nine languages (English, German, French, Spanish, Chinese, Japanese, Korean, Arabic, Russian), enabling multilingual evaluation and analysis of low‑resource language performance.

The authors propose a multi‑aspect evaluation suite:

  1. Translation Quality – measured with COMET, a state‑of‑the‑art neural metric that captures semantic adequacy beyond surface n‑gram overlap.
  2. Background Preservation – quantified by Mask‑LPIPS, which computes perceptual similarity on non‑text regions to assess how well the background remains unchanged after text replacement.
  3. Overall Image Quality – assessed with Perceptual Quality (PQ) metrics, reflecting color fidelity, noise, and artifact levels.
  4. Cross‑Modal Alignment – a novel Alignment Score that extracts the rendered text from the translated image (via OCR) and compares it to the model’s generated translation, directly measuring consistency between the textual and visual outputs.
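The paper does not spell out the exact formula behind the Alignment Score, but the description (OCR the output image, compare against the model's own translation) can be sketched with a simple normalized edit-based similarity. The function name and normalization choices below are illustrative assumptions, not the benchmark's actual implementation:

```python
from difflib import SequenceMatcher

def alignment_score(ocr_text: str, model_translation: str) -> float:
    """Similarity between text OCR'd from the translated image and the
    model's own textual translation (1.0 = perfectly consistent).

    Whitespace and case are normalized so that rendering artifacts
    (line breaks, capitalization) do not dominate the comparison.
    """
    a = " ".join(ocr_text.lower().split())
    b = " ".join(model_translation.lower().split())
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()
```

A model that translates "Hello" correctly as "Hallo" in text but renders "Halo" in the image would score below 1.0, exposing exactly the text/image inconsistency this metric targets.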

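Mask-LPIPS restricts a perceptual distance to non-text pixels. Real LPIPS requires a pretrained network; the sketch below substitutes plain masked mean squared error purely to illustrate the masking logic, and its flat-list image representation is a simplifying assumption:

```python
def masked_distance(img_a, img_b, text_mask):
    """Mean squared distance over non-text pixels only.

    img_a, img_b: equal-length flat lists of pixel intensities in [0, 1].
    text_mask:    flat list of booleans, True where text was detected.
    Mask-LPIPS proper feeds the masked regions through a pretrained
    network; MSE here is a stand-in for that perceptual distance.
    """
    background = [(a - b) ** 2
                  for a, b, is_text in zip(img_a, img_b, text_mask)
                  if not is_text]
    return sum(background) / len(background) if background else 0.0
```

Masking out text regions is what lets the metric reward faithful background reconstruction without penalizing the (intentionally changed) translated text.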
To construct the dataset, the authors employ three parallel pipelines:

  1. Documents and web pages – starting from multilingual parallel corpora, the source text is translated with a lightweight MT system and the translated content rendered with the SynthDog engine, preserving typographic structure; web pages are generated from HTML templates and filtered automatically with Qwen3‑VL to remove rendering errors.
  2. Natural scenes – text regions are first detected with OCR, translated using multimodal translation models that incorporate visual context, and then replaced using advanced image‑editing models such as GPT‑Image and SeedEdit; human annotators verify each sample for translation correctness and visual realism.
  3. Presentation slides – PPT content is translated and the slides re‑rendered while keeping layout and design intact.
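The natural-scene construction steps above can be sketched as a small orchestration function. The callables are placeholders for the OCR engine, the context-aware translator, and the editing model (e.g. GPT-Image or SeedEdit); their signatures are assumptions for illustration, not the authors' actual tooling:

```python
from typing import Callable

Box = tuple[int, int, int, int]  # x, y, width, height of a text region

def build_scene_sample(
    image: bytes,
    detect_text: Callable[[bytes], list[tuple[str, Box]]],
    translate: Callable[[str, bytes], str],
    edit_image: Callable[[bytes, Box, str], bytes],
) -> tuple[bytes, list[tuple[str, str]]]:
    """OCR -> context-aware MT -> in-place text replacement.

    Returns the edited image plus (source, target) text pairs, which a
    human annotator would then verify for correctness and realism.
    """
    pairs = []
    for source_text, box in detect_text(image):
        # The full image is passed to the translator so visual context
        # can disambiguate short scene text.
        target_text = translate(source_text, image)
        image = edit_image(image, box, target_text)
        pairs.append((source_text, target_text))
    return image, pairs
```

Keeping the three stages as injected callables mirrors the paper's modular design, where each component (OCR, MT, editor) can be swapped independently.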

The benchmark is used to evaluate three categories of systems: (i) commercial cascaded pipelines (OCR → MT → rendering), (ii) proprietary unified multimodal models (UMMs) that jointly process image and text, and (iii) open‑source UMMs spanning diffusion‑based, autoregressive, and hybrid architectures. Results show that while UMMs outperform cascaded systems in background preservation and overall image quality, they still struggle with complex layouts and low‑resource language pairs. Common failure modes include missing translations, semantic errors, and inaccurate glyph rendering. Alignment Scores reveal that many models generate correct translations textually but render them inconsistently, indicating a gap in style‑aware text synthesis.

The analysis highlights several research directions: improving cross‑modal consistency through joint loss functions, enhancing low‑resource language support via multilingual pre‑training and language‑specific rendering modules, developing layout‑conditioned style control mechanisms for fonts, colors, and perspective, and making diffusion‑based image editing more computationally efficient for real‑time applications.

In summary, IMTBench establishes the first realistic, multilingual, multi‑scenario benchmark for IIMT, integrating translation, visual fidelity, and cross‑modal alignment metrics. It provides a standardized platform for diagnosing strengths and weaknesses of current models and sets clear targets for future advancements in end‑to‑end image‑text translation.
