DeepSight: An All-in-One LM Safety Toolkit

DeepSight: An All-in-One LM Safety Toolkit
Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the [Original Paper Viewer] below or the Original ArXiv Source.

As the development of Large Models (LMs) progresses rapidly, their safety is also a priority. In current Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) safety workflow, evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot figure out internal root causes. Meanwhile, safety diagnosis often drifts from concrete risk scenarios and remains at the explainable level. In this way, safety alignment lack dedicated explanations of changes in internal mechanisms, potentially degrading general capabilities. To systematically address these issues, we propose an open-source project, namely DeepSight, to practice a new safety evaluation-diagnosis integrated paradigm. DeepSight is low-cost, reproducible, efficient, and highly scalable large-scale model safety evaluation project consisting of a evaluation toolkit DeepSafe and a diagnosis toolkit DeepScan. By unifying task and data protocols, we build a connection between the two stages and transform safety evaluation from black-box to white-box insight. Besides, DeepSight is the first open source toolkit that support the frontier AI risk evaluation and joint safety evaluation and diagnosis.


💡 Research Summary

DeepSight is an open‑source, end‑to‑end toolkit that unifies safety evaluation and internal diagnostics for large language models (LLMs) and multimodal large language models (MLLMs). The authors identify a critical gap in current AI safety practice: evaluation tools (e.g., OpenAI Eval, HELM) focus on external behavior, while diagnostic research (probing neurons, information‑flow analysis) remains isolated from standardized benchmarks. Consequently, developers can see what went wrong but not why it happened inside the model, and alignment work may degrade overall capabilities.

To close this gap, DeepSight introduces two tightly coupled engines—DeepSafe and DeepScan—built around a shared, declarative configuration language (YAML/JSON) and a central Registry that tracks models, datasets, evaluators, and summarizers. This design makes the entire workflow “Configure → Execute → Summarize” reproducible, low‑cost, and highly scalable.

DeepSafe is the evaluation component. It provides a modular, configuration‑driven pipeline that automatically loads a model (local Hugging‑Face weights, vLLM‑accelerated inference, or remote API), selects a dataset, runs inference, and applies one of three judgment mechanisms: (1) native benchmark scripts, (2) rule‑based keyword/regex matching, or (3) a specialized LLM‑as‑Judge called ProGuard (fine‑tuned on 87 k safety pairs). DeepSafe currently bundles over 20 safety benchmarks—including SALAD‑Bench, HarmBench, and a suite of frontier‑AI risk datasets—covering text, image, and multimodal tasks. The Summarizer aggregates raw judgments into statistical scores and produces both Markdown reports and machine‑readable JSON files.

DeepScan is the diagnostic counterpart. It operates without modifying model weights, extracting intermediate activations and token embeddings to answer questions such as: How are safety concepts encoded? Do safe and harmful representations occupy distinct regions of latent space? Where do safety objectives conflict? DeepScan implements several diagnostic protocols: X‑Boundary (measures separation between safe, harmful, and boundary samples), TELLME (evaluates answer similarity vs. difference), MI‑Peaks (identifies information‑theoretic peaks), and SPIN (examines prompt‑response interactions). Each protocol outputs quantitative metrics (e.g., separation score, boundary ratio) and visualizations (t‑SNE plots, diagrams). Like DeepSafe, DeepScan is driven by a single configuration file and can be invoked from the command line or programmatically.

The crucial innovation is the integration of the two engines. Because DeepSafe and DeepScan share the same data and model registries, the output of an evaluation run can be fed directly into a diagnostic run. For example, if a model scores poorly on a manipulation benchmark, DeepScan can automatically locate the layers or neurons where unsafe representations are entangled with benign ones, providing actionable insight for alignment engineers. This transforms safety testing from a disconnected black‑box check into a white‑box debugging loop.

The authors conduct extensive experiments on roughly 30 state‑of‑the‑art LLMs/MLLMs, spanning open‑source and closed‑source families (Llama, Qwen, InternLM, etc.). They evaluate three dimensions: (1) content safety across text‑only, image‑only, and multimodal prompts; (2) frontier‑AI risks such as manipulation, deception, and self‑modification; and (3) joint evaluation‑diagnosis pipelines. Key findings include:

  • Adding visual modalities dramatically expands the attack surface, causing safety alignment to drop for all model tiers; closed‑source models retain a noticeable advantage in cross‑modal scenarios.
  • Reasoning‑enabled multimodal models outperform non‑reasoning counterparts on image‑text splitting attacks, revealing a complex trade‑off between raw reasoning ability and safety.
  • No single model dominates every frontier‑AI risk category; even top‑ranked models can catastrophically fail on specific risks (e.g., Kimi‑K2‑Thinking ranks last on manipulation).
  • Diagnostic analysis shows that both insufficient and excessive separation between safe and harmful latent clusters harms robustness; an optimal geometric structure appears necessary for stable safety behavior.

The paper also discusses limitations. DeepSight currently focuses on transformer‑based language models; extensions to video, audio, or other non‑text modalities are not yet supported. ProGuard, while powerful, may require fine‑tuning for new benchmarks. Cloud‑API costs cannot be fully eliminated, and interpreting diagnostic metrics still benefits from domain‑expert input.

In conclusion, DeepSight offers a novel, unified framework that bridges the gap between external safety evaluation and internal model understanding. By providing a reproducible, extensible, and low‑cost platform, it enables researchers and practitioners to not only detect unsafe behavior but also trace its root causes within model representations, facilitating more informed alignment and remediation. Future work could expand modality coverage, automate corrective feedback loops, and incorporate additional frontier‑AI risk categories, further strengthening the safety pipeline for next‑generation AI systems.


Comments & Academic Discussion

Loading comments...

Leave a Comment