Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models


Robust safety of vision-language large models (VLLMs) under joint multilingual and multimodal inputs remains underexplored. Existing benchmarks are typically multilingual but text-only, or multimodal but monolingual. Recent multilingual multimodal red-teaming efforts render harmful prompts into images, yet rely heavily on typography-style visuals and lack semantically grounded image-text pairs, limiting coverage of realistic cross-modal interactions. We introduce Lingua-SafetyBench, a benchmark of 100,440 harmful image-text pairs across 10 languages, explicitly partitioned into image-dominant and text-dominant subsets to disentangle risk sources. Evaluating 11 open-source VLLMs reveals a consistent asymmetry: image-dominant risks yield a higher Attack Success Rate (ASR) in high-resource languages (HRLs), while text-dominant risks are more severe in non-high-resource languages. A controlled study on the Qwen series shows that scaling and version upgrades reduce ASR overall but disproportionately benefit HRLs, widening the gap between HRLs and non-HRLs under text-dominant risks. This underscores the necessity of language- and modality-aware safety alignment beyond mere scaling. To facilitate reproducibility and future research, we will publicly release our benchmark, model checkpoints, and source code at https://github.com/zsxr15/Lingua-SafetyBench. Warning: this paper contains examples with unsafe content.


💡 Research Summary

Lingua‑SafetyBench addresses a critical gap in the evaluation of vision‑language large models (VLLMs): the lack of a benchmark that simultaneously tests multilingual and multimodal safety. Existing resources are either text‑only multilingual or multimodal monolingual, and recent multilingual multimodal red‑team efforts rely heavily on typographic images, which do not reflect realistic cross‑modal interactions. To fill this void, the authors construct a large‑scale dataset of 100,440 harmful image‑text pairs covering ten languages (English, Chinese, Arabic, French, German, Japanese, Norwegian, Finnish, Russian, Spanish). Crucially, the benchmark is explicitly split into two risk categories: Image‑Dominant (the harmful intent resides in the visual content while the accompanying text is benign) and Text‑Dominant (the unsafe instruction is in the text, with a neutral image). This partition enables precise attribution of risk sources and controlled cross‑lingual comparisons.

The dataset creation follows a three‑stage pipeline. First, an English base is built covering eight harmful scenarios. Image‑Dominant samples combine curated unsafe images from MM‑SafetyBench and VL‑Guard with low‑risk queries generated by GPT‑5, as well as synthetic images created via diffusion models guided by extracted visual keywords. Text‑Dominant samples pair unsafe textual queries (sourced from XSAFETY and VL‑Guard) with safe background images generated similarly. Human reviewers verify that each pair conforms to its risk definition. Second, a risk‑aligned translation strategy expands the benchmark to the ten target languages. For Text‑Dominant items only the text is translated; for Image‑Dominant items both the textual query and any visual text (e.g., slogans) are translated, and typographic or mixed images are regenerated in the target script. Third, native‑speaker validation ensures linguistic fidelity and visual correctness.
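The image/text-dominant partition and the risk-aligned translation rule described above can be captured in a small data model. The sketch below is illustrative only: the class and field names (`BenchItem`, `has_visual_text`, `needs_image_regeneration`) are assumptions, not the authors' actual schema, but the decision logic mirrors the pipeline as summarized (only image-dominant items carrying visual text, e.g. slogans, require the image to be re-rendered in the target script).

```python
from dataclasses import dataclass
from enum import Enum

class RiskType(Enum):
    IMAGE_DOMINANT = "image_dominant"  # harmful intent in the image, benign text
    TEXT_DOMINANT = "text_dominant"    # unsafe instruction in the text, neutral image

@dataclass
class BenchItem:
    query: str              # textual query shown to the model
    image_path: str         # path to the paired image
    risk_type: RiskType
    language: str           # e.g. "en", "zh", "fi"
    has_visual_text: bool = False  # image contains rendered text (slogans etc.)

def needs_image_regeneration(item: BenchItem) -> bool:
    """Risk-aligned translation rule (as summarized above): text-dominant
    items only have their text translated; image-dominant items whose image
    carries visual text must also be regenerated in the target script."""
    return item.risk_type is RiskType.IMAGE_DOMINANT and item.has_visual_text
```

Keeping the regeneration decision explicit per item is what lets the benchmark hold the visual risk semantics constant across all ten languages.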

Evaluation uses two automated safety judges—GPT‑5.1 and Qwen‑Guard—to compute Attack Success Rate (ASR). Eleven open‑source VLLMs are tested: Gemma‑3‑12B, InternVL3.5‑8B, LLaMA‑3.2‑V‑11B, MiniCPM‑V‑4.5, Qwen2‑VL (2 B/7 B), Qwen2.5‑VL (3 B/7 B‑Instruct), Qwen3‑VL (2 B/4 B/8 B), among others. The authors also assess three prompt‑based safety enhancements (DPP, Self‑Exam, XSAFETY). All models are evaluated with greedy decoding (temperature = 0, max tokens = 256) for reproducibility.
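ASR is simply the fraction of harmful prompts for which the safety judge flags the model's response as unsafe, broken down per language and per risk subset. A minimal sketch of that aggregation is below; the record fields (`language`, `risk_type`, `unsafe`) are hypothetical names, and the judge call itself (GPT‑5.1 or Qwen‑Guard) is abstracted away as a boolean verdict already attached to each record.

```python
from collections import defaultdict

def attack_success_rate(records):
    """Compute ASR per (language, risk_type).

    Each record is a dict with keys 'language', 'risk_type', and
    'unsafe' (True iff the judge flagged the response as harmful,
    i.e. the attack succeeded).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        key = (r["language"], r["risk_type"])
        totals[key] += 1
        hits[key] += int(r["unsafe"])
    return {k: hits[k] / totals[k] for k in totals}

# Deterministic generation settings as reported in the summary
# (greedy decoding for reproducibility).
GENERATION_CONFIG = {"temperature": 0, "max_tokens": 256}
```

With two judges, the same aggregation is run once per judge and the per-cell ASRs can be compared for agreement.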

Results reveal a consistent asymmetry across the two risk subsets. In Image‑Dominant scenarios, high‑resource languages (HRLs: English and Chinese) exhibit the highest ASR (≈55 %), indicating that visual attacks are more effective when the model has abundant language data. Conversely, in Text‑Dominant scenarios, non‑HRLs (e.g., Finnish, Japanese, Arabic) suffer higher ASR (≈45 % or more), showing that textual safety degrades sharply for languages with limited training resources. Language‑script analysis further shows that non‑Latin scripts incur higher risks than Latin‑based languages.

A focused study on the Qwen family demonstrates that scaling (more parameters) and newer versions reduce overall ASR, but the improvement is disproportionately larger for HRLs. For example, moving from Qwen2‑VL (2 B) to Qwen3‑VL (8 B) drops HRL ASR from 55 % to 25 %, while Non‑HRL ASR only falls from 45 % to 33 %. This widening gap underscores that model scaling alone does not guarantee equitable safety across languages, especially under Text‑Dominant attacks.
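The widening-gap claim follows directly from the quoted figures; a quick arithmetic check (numbers in ASR percentage points, taken from the comparison above):

```python
# Qwen2-VL (2B) -> Qwen3-VL (8B), ASR in percentage points
hrl_before, hrl_after = 55, 25          # high-resource languages
nonhrl_before, nonhrl_after = 45, 33    # non-high-resource languages

hrl_gain = hrl_before - hrl_after        # 30-point reduction
nonhrl_gain = nonhrl_before - nonhrl_after  # 12-point reduction

# Scaling helps HRLs far more, and non-HRLs end up with the
# higher residual ASR despite starting lower.
residual_gap = nonhrl_after - hrl_after  # 8 points
```

So although both groups improve in absolute terms, the ordering flips: after scaling, non-HRLs are the less safe group.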

Prompt‑based defenses provide modest gains (≈5 % absolute reduction) but remain insufficient, particularly for non‑HRLs and mixed‑modality inputs. The authors argue that current safety alignment techniques are not robust to the combined challenges of multilingualism and multimodality.

In summary, Lingua‑SafetyBench is the first benchmark that unifies multilingual breadth and multimodal depth with explicit risk attribution. It reveals that VLLMs are vulnerable to both visual and textual attacks, with the severity depending on language resource levels. Scaling improves safety overall but exacerbates disparities, highlighting the need for language‑ and modality‑aware alignment strategies. The authors will release the dataset, model checkpoints, and code, facilitating reproducibility and encouraging future work on balanced, cross‑modal safety alignment for VLLMs.

