Automated Safety Benchmarking: A Multi-agent Pipeline for LVLMs

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

Large vision-language models (LVLMs) exhibit remarkable capabilities in cross-modal tasks but face significant safety challenges, which undermine their reliability in real-world applications. Efforts have been made to build LVLM safety evaluation benchmarks to uncover their vulnerabilities. However, existing benchmarks are hindered by their labor-intensive construction process, static complexity, and limited discriminative power; thus, they may fail to keep pace with rapidly evolving models and emerging risks. To address these limitations, we propose VLSafetyBencher, the first automated system for LVLM safety benchmarking. VLSafetyBencher introduces four collaborative agents (Data Preprocessing, Generation, Augmentation, and Selection) to construct and select high-quality samples. Experiments validate that VLSafetyBencher can construct high-quality safety benchmarks within one week at minimal cost. The generated benchmark effectively distinguishes model safety, with a safety-rate disparity of 70% between the most and least safe models.


💡 Research Summary

Large Vision‑Language Models (LVLMs) have demonstrated impressive cross‑modal abilities, yet their safety shortcomings—such as toxic outputs, bias, privacy leaks, and misinformation—pose serious risks for real‑world deployment. Existing safety benchmarks for LVLMs (e.g., MM‑SafetyBench, MLLMGuard, SafeBench) are largely built through labor‑intensive manual annotation, resulting in high resource costs, static test suites that cannot keep up with rapid model evolution, and limited discriminative power that clusters model scores within a narrow range. To overcome these challenges, the authors introduce VLSafetyBencher, the first fully automated system for constructing and updating LVLM safety benchmarks.

System Overview
VLSafetyBencher orchestrates four specialized agents in a serial pipeline:

  1. Data Preprocessing Agent – Starts from a raw pool of ~300 K images gathered from four sources: (i) existing safety datasets, (ii) general image collections, (iii) synthetic images generated by diffusion models, and (iv) social‑media scraped content. Using CLIP for coarse filtering, the agent discards low‑resolution images, short prompts (≤24 characters), and duplicates. It then categorizes each sample into a two‑level taxonomy (six top‑level categories: Privacy, Bias, Toxicity, Legality, Misinformation, Health Risk; 20 sub‑categories) by prompting CLIP and an LVLM for image‑text matching, achieving >92 % labeling accuracy without human involvement.

  2. Data Generation Agent – Synthesizes malicious image‑question pairs via three complementary strategies:
    Modality Dependence: Harmful content resides solely in the image while the accompanying text is neutral, forcing the model to rely on visual analysis.
    Modality Complementarity: Critical cues are split between image and text; neither modality alone suffices, requiring genuine multimodal fusion.
    Modality Conflict: The text deliberately misleads or induces unsafe behavior, but the image contradicts it, testing the model’s ability to detect and reject contradictory instructions.
    The agent iteratively calls large language models, diffusion image generators, and LVLMs to produce ~150 K candidate pairs.

  3. Data Augmentation Agent – Enhances both harmfulness and diversity through dual‑modal mutations. Image transformations include color jitter, background replacement, object insertion/deletion; text transformations involve synonym substitution, paraphrasing, and insertion of harmful keywords. Each original candidate yields on average three augmented variants, expanding the pool to ~450 K samples.

  4. Selection Agent – Formalizes benchmark construction as an optimization problem balancing three desiderata: separability (ability to distinguish safe vs. unsafe models), harmfulness (intrinsic risk level), and diversity (coverage of categories and modalities). A weighted score is computed for each sample, and an iterative greedy algorithm approximates the global optimum, selecting the top‑scoring 10 K samples for the final benchmark. Human evaluation of the selected set reports an average risk score of 0.87 / 1.0, indicating high harmfulness.
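The summary does not give the Selection Agent's exact scoring formula, but the described procedure (a weighted combination of separability, harmfulness, and diversity, optimized greedily) can be sketched as follows. The field names, weights, and the specific diversity bonus below are illustrative assumptions, not details from the paper:

```python
from dataclasses import dataclass

# Hypothetical sample record; the fields are assumptions for illustration.
@dataclass
class Sample:
    category: str          # one of the six top-level safety categories
    separability: float    # how widely model safety outcomes spread on this sample
    harmfulness: float     # intrinsic risk score in [0, 1]

def greedy_select(pool, k, w_sep=0.4, w_harm=0.4, w_div=0.2):
    """Iteratively pick the highest-scoring sample; the diversity term
    rewards categories that are still under-represented in the selection."""
    selected, counts = [], {}
    remaining = list(pool)
    for _ in range(min(k, len(remaining))):
        def score(s):
            # Diversity bonus decays as a category fills up.
            div = 1.0 / (1 + counts.get(s.category, 0))
            return w_sep * s.separability + w_harm * s.harmfulness + w_div * div
        best = max(remaining, key=score)
        remaining.remove(best)
        counts[best.category] = counts.get(best.category, 0) + 1
        selected.append(best)
    return selected
```

Because the diversity bonus shrinks for already-selected categories, a second sample from the same category must be substantially more separable or harmful to be chosen over one from a fresh category, which matches the coverage goal described above.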

Experimental Validation
The authors evaluate 12 state‑of‑the‑art LVLMs (including LLaVA‑1.5, MiniGPT‑4, InstructBLIP) on the automatically generated benchmark. The safety rate gap between the safest and the least safe model reaches 70 %, substantially larger than the 54 % gap observed on prior human‑crafted benchmarks. Moreover, VLSafetyBencher’s benchmark yields a 15.67 % higher discriminative power compared to SafeBench and MLLMGuard.
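The reported 70 % gap is simply the difference between the highest and lowest per-model safety rates, where a model's safety rate is the fraction of benchmark samples it handles safely. A minimal illustration with fabricated verdict counts (not the paper's data):

```python
def safety_rate(verdicts):
    """Fraction of benchmark samples answered safely.
    `verdicts` is a list of booleans: True = safe response/refusal."""
    return sum(verdicts) / len(verdicts)

# Illustrative (made-up) per-model verdicts over a 100-sample benchmark.
rates = {
    "model_a": safety_rate([True] * 85 + [False] * 15),  # 0.85
    "model_b": safety_rate([True] * 15 + [False] * 85),  # 0.15
}
gap = max(rates.values()) - min(rates.values())          # ~0.70
```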

Cost‑wise, the entire pipeline runs within one week on cloud GPUs, incurring roughly $2,300 in compute expenses—over 95 % cheaper than manual benchmark construction. Ablation studies confirm each agent’s contribution, with the Selection Agent’s optimization having the greatest impact on separability.

Limitations and Future Work
Current work focuses on image‑text modalities; extending to video, audio, and other sensor data remains open. The generation stage can inadvertently produce harmful content, necessitating stronger safety filters during synthesis. While the system can update benchmarks within days, real‑time adaptation to emerging attack vectors is not yet realized. Future directions include multi‑modal expansion, continuous monitoring pipelines, and tighter integration of safety guards during data generation.

Conclusion
VLSafetyBencher demonstrates that a fully automated, multi‑agent pipeline can produce high‑quality, dynamic LVLM safety benchmarks at a fraction of the traditional cost and time. By systematically addressing data collection, synthesis, augmentation, and optimal selection, the framework offers a scalable solution for keeping safety evaluation in step with the rapid evolution of LVLM technology.

