Breaking Bad Molecules: Are MLLMs Ready for Structure-Level Molecular Detoxification?


Toxicity remains a leading cause of early-stage drug development failure. Despite advances in molecular design and property prediction, the task of molecular toxicity repair (generating structurally valid molecular alternatives with reduced toxicity) has not yet been systematically defined or benchmarked. To fill this gap, we introduce ToxiMol, the first benchmark task for general-purpose Multimodal Large Language Models (MLLMs) focused on molecular toxicity repair. We construct a standardized dataset covering 11 primary tasks and 660 representative toxic molecules spanning diverse mechanisms and granularities. We design a prompt annotation pipeline with mechanism-aware and task-adaptive capabilities, informed by expert toxicological knowledge. In parallel, we propose an automated evaluation framework, ToxiEval, which integrates toxicity endpoint prediction, synthetic accessibility, drug-likeness, and structural similarity into a high-throughput evaluation chain for repair success. We systematically assess 43 mainstream general-purpose MLLMs and conduct multiple ablation studies to analyze key issues, including evaluation metrics, candidate diversity, and failure attribution. Experimental results show that although current MLLMs still face significant challenges on this task, they begin to demonstrate promising capabilities in toxicity understanding, semantic constraint adherence, and structure-aware editing.


💡 Research Summary

The paper introduces a novel benchmark, ToxiMol, to evaluate the ability of general‑purpose multimodal large language models (MLLMs) to perform “molecular toxicity repair”—the generation of structurally valid, low‑toxicity alternatives for toxic compounds. While toxicity prediction and ADMET optimization have been extensively studied, the explicit task of repairing a toxic molecule by removing or modifying the offending substructures has not been systematically defined or benchmarked.

To fill this gap, the authors construct a dataset of 660 toxic molecules drawn from 11 well‑known toxicity endpoints (e.g., AMES, hERG, Tox21, ClinTox). For each endpoint, 60 molecules are sampled in a balanced manner using ECFP4 fingerprints, Tanimoto similarity, and Butina clustering to ensure structural diversity. Each instance includes a SMILES string, a 2‑D RDKit image, and a natural‑language description of the repair objective.
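The diversity-sampling step above can be illustrated with a minimal, dependency-free sketch. The paper uses RDKit's ECFP4 fingerprints and Butina clustering; here, as an assumption for illustration, fingerprints are represented as plain Python sets of on-bit indices and the greedy Butina-style procedure is hand-rolled rather than calling `rdkit.ML.Cluster.Butina`:

```python
from itertools import combinations

def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity between two bit-sets of on-bit indices."""
    if not a and not b:
        return 1.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def butina_cluster(fps, cutoff=0.6):
    """Greedy Butina-style clustering: the molecule with the largest
    neighborhood (similarity >= cutoff) becomes a centroid, its
    neighbors are assigned to it and removed, then repeat."""
    n = len(fps)
    neighbors = {i: {i} for i in range(n)}
    for i, j in combinations(range(n), 2):
        if tanimoto(fps[i], fps[j]) >= cutoff:
            neighbors[i].add(j)
            neighbors[j].add(i)
    unassigned = set(range(n))
    clusters = []
    for i in sorted(neighbors, key=lambda k: len(neighbors[k]), reverse=True):
        if i not in unassigned:
            continue
        members = neighbors[i] & unassigned
        clusters.append(sorted(members))
        unassigned -= members
    return clusters

# Toy "fingerprints": two structurally similar pairs
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8}]
print(butina_cluster(fps, cutoff=0.4))  # [[0, 1], [2, 3]]
```

Sampling one or a few molecules per cluster then yields the structurally balanced 60-molecule subsets per endpoint.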

A mechanism‑aware prompt annotation pipeline is proposed. It starts from a base template that defines the model’s role and constraints, injects task‑level and sub‑task‑level instructions that encode the specific toxicity mechanism, and finally assembles a multimodal prompt containing the SMILES and image of the target molecule. This pipeline enables the model to understand both the chemical context and the semantic repair goal.
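The three-stage assembly can be sketched as simple template injection. The template wording and task keys below are hypothetical stand-ins, not the paper's actual prompts:

```python
# Hypothetical base template and task-level instructions, for illustration only.
BASE_TEMPLATE = (
    "You are a medicinal chemist. Propose a minimally modified, non-toxic "
    "analog of the molecule below.\n"
    "{task_instruction}\n"
    "SMILES: {smiles}\n"
    "(A 2-D RDKit rendering of the molecule is attached as an image.)"
)

TASK_INSTRUCTIONS = {
    "AMES": "Remove or mask mutagenic substructures such as nitroaromatics.",
    "hERG": "Reduce hERG channel affinity, e.g. by lowering amine basicity.",
}

def build_prompt(task: str, smiles: str) -> str:
    """Inject the mechanism-aware task instruction into the base role template."""
    return BASE_TEMPLATE.format(
        task_instruction=TASK_INSTRUCTIONS[task], smiles=smiles
    )

prompt = build_prompt("AMES", "c1ccc(cc1)[N+](=O)[O-]")
```

The assembled text plus the rendered image form the final multimodal prompt passed to each MLLM.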

For evaluation, the authors design ToxiEval, an automated multi‑criteria framework that combines four orthogonal metrics: (1) toxicity endpoint prediction (all relevant labels must switch from toxic to non‑toxic), (2) synthetic accessibility score (SA ≤ 4.5), (3) drug‑likeness (QED ≥ 0.6), and (4) structural similarity (Tanimoto ≥ 0.7 to the original). A candidate is deemed successful only if it satisfies all four thresholds.
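The conjunctive success criterion can be sketched directly from the four thresholds. The scorer callables below are stand-ins (in practice these would be the endpoint classifier, an SA scorer, RDKit's QED, and ECFP4 Tanimoto similarity):

```python
from typing import Callable

def toxieval_success(
    candidate: str,
    original: str,
    predict_toxic: Callable[[str], bool],    # stand-in for endpoint classifier
    sa_score: Callable[[str], float],        # stand-in for SA scorer
    qed_score: Callable[[str], float],       # stand-in for RDKit QED
    similarity: Callable[[str, str], float], # stand-in for ECFP4 Tanimoto
) -> bool:
    """A repair counts as successful only if ALL four criteria pass."""
    return (
        not predict_toxic(candidate)              # label flips to non-toxic
        and sa_score(candidate) <= 4.5            # synthetically accessible
        and qed_score(candidate) >= 0.6           # drug-like
        and similarity(candidate, original) >= 0.7  # structurally close
    )

# Toy stand-in scorers for illustration
ok = toxieval_success(
    "CCO", "CCN",
    predict_toxic=lambda s: False,
    sa_score=lambda s: 2.0,
    qed_score=lambda s: 0.8,
    similarity=lambda a, b: 0.9,
)
print(ok)  # True
```

Because the criteria are combined by logical AND rather than a weighted score, a candidate that excels on three axes but misses one threshold still counts as a failure.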

The benchmark is used to assess 43 mainstream MLLMs, including closed‑source models such as GPT‑4V, Claude‑3, and open‑source models like LLaVA‑1.5 and InternVL‑3‑78B. Each model is asked to generate three repair candidates per instance. Overall success rates are low, ranging from 2 % on the hardest tasks (e.g., hERG_Karim, DILI) to about 8 % on relatively easier tasks (e.g., AMES, ClinTox). Ablation studies reveal that (i) increasing the candidate set size improves success modestly but reduces average similarity, (ii) mechanism‑specific prompts boost performance by roughly 12 percentage points compared with generic prompts, and (iii) the weighting of evaluation criteria strongly influences reported success.
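Under the three-candidates-per-instance protocol, the natural aggregation (assumed here; the paper may weight candidates differently) is best-of-k: an instance is solved if any of its candidates passes ToxiEval, which is why enlarging the candidate set raises success rates:

```python
def instance_success(candidate_results: list[bool]) -> bool:
    """An instance is solved if any of its k candidates passes ToxiEval."""
    return any(candidate_results)

def success_rate(per_instance: list[list[bool]]) -> float:
    """Fraction of benchmark instances with at least one successful repair."""
    solved = sum(instance_success(r) for r in per_instance)
    return solved / len(per_instance)

# 3 candidates per instance, as in the benchmark protocol
results = [
    [False, True, False],
    [False, False, False],
    [True, True, False],
    [False, False, False],
]
print(success_rate(results))  # 0.5
```

This also explains the observed trade-off: more candidates raise the chance that one passes, but the passing candidate is not necessarily the most similar to the original molecule.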

Error analysis shows two dominant failure modes: (a) generation of syntactically invalid SMILES (≈30 % of failures) and (b) misclassification by the toxicity predictor, leading to false‑negative toxicity assessments (≈45 %).
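The first failure mode, syntactically invalid SMILES, is normally caught with `rdkit.Chem.MolFromSmiles` returning `None`. As a dependency-free illustration, a crude checker can at least flag unbalanced parentheses, brackets, and unmatched ring-closure digits (it catches only a subset of invalid strings and is not a real parser):

```python
def crude_smiles_check(smiles: str) -> bool:
    """Flag common syntactic failures: unbalanced parentheses/brackets and
    unmatched ring-closure digits. NOT a full SMILES parser; real
    validation should use rdkit.Chem.MolFromSmiles."""
    depth = 0
    in_bracket = False
    open_rings = set()
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch == "[":
            if in_bracket:
                return False
            in_bracket = True
        elif ch == "]":
            if not in_bracket:
                return False
            in_bracket = False
        elif ch.isdigit() and not in_bracket:
            # ring-closure digits must come in matched pairs
            if ch in open_rings:
                open_rings.remove(ch)
            else:
                open_rings.add(ch)
    return depth == 0 and not in_bracket and not open_rings

print(crude_smiles_check("c1ccccc1O"))  # True: ring closure matched
print(crude_smiles_check("C1CC(C"))     # False: open ring and parenthesis
```

The second failure mode, predictor misclassification, is harder to screen for automatically, since it reflects the limits of the toxicity endpoint models used inside ToxiEval itself.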

The authors conclude that current MLLMs possess nascent capabilities in chemical image‑text understanding and constrained structure editing, yet they fall short of reliably performing toxicity repair at scale. They suggest future directions such as toxicity‑focused pre‑training, integration of reaction‑level synthesis models, and human‑in‑the‑loop feedback to improve both chemical validity and safety awareness.

