Protect$^*$: Steerable Retrosynthesis through Neuro-Symbolic State Encoding

Reading time: 5 minutes
...

📝 Original Info

  • Title: Protect$^*$: Steerable Retrosynthesis through Neuro-Symbolic State Encoding
  • ArXiv ID: 2602.13419
  • Date: 2026-02-13
  • Authors: Author information is not provided in the paper. (Please check the original paper for author names and affiliations.)

📝 Abstract

Large Language Models (LLMs) have shown remarkable potential in scientific domains like retrosynthesis, yet they often lack the fine-grained control necessary to navigate complex problem spaces without error. A critical challenge is directing an LLM to avoid specific, chemically sensitive sites on a molecule, a task where unconstrained generation can lead to invalid or undesirable synthetic pathways. In this work, we introduce Protect$^*$, a neuro-symbolic framework that grounds the generative capabilities of LLMs in rigorous chemical logic. Our approach combines automated rule-based reasoning (using a comprehensive database of 55+ SMARTS patterns and 40+ characterized protecting groups) with the generative intuition of neural models. The system operates via a hybrid architecture: an "automatic mode" where symbolic logic deterministically identifies and guards reactive sites, and a "human-in-the-loop mode" that integrates expert strategic constraints. Through "active state tracking," we inject hard symbolic constraints into the neural inference process via a dedicated protection state linked to canonical atom maps. We demonstrate this neuro-symbolic approach through case studies on complex natural products, including the discovery of a novel synthetic pathway for Erythromycin B, showing that grounding neural generation in symbolic logic enables reliable, expert-level autonomy.

💡 Deep Analysis

📄 Full Content

The integration of Large Language Models (LLMs) into scientific discovery has unlocked new capabilities, particularly in domains requiring complex reasoning like chemical synthesis. Frameworks such as DeepRetro [5] have advanced the state-of-the-art by combining the generative power of LLMs with structured search algorithms. However, a significant limitation persists: the difficulty of imposing fine-grained, expert-driven constraints on the generative process. Existing systems often lack mechanisms to identify which molecular sites require protection or to suggest appropriate protecting groups, leading to the generation of strategically flawed synthetic pathways. Furthermore, most current retrosynthesis models focus primarily on bond disconnection prediction but fail to account for chemoselectivity and regioselectivity issues: specifically, whether a given site will react first, and whether the intended molecule will form as the major product. This oversight often results in pathways that are chemically plausible but practically infeasible.

To address this limitation, we introduce Protect$^*$, a neuro-symbolic framework that bridges neural intuition and symbolic validity. Unlike purely neural approaches that must learn chemical rules from data distributions, our system leverages a hybrid architecture where explicit constraints guide neural generation. We employ a rigorous rule-based engine grounded in over 55 SMARTS patterns to automatically infer reactive sites, and a logic-based scoring system to suggest optimal protecting groups from a library of 40+ candidates. Crucially, these constraints are not merely “suggestions” but are enforced through a persistent Protection State. Through our active state tracking, these symbolic constraints are injected into the neural inference context, effectively creating a “guardrail” that steers the LLM away from invalid pathways without requiring expensive model fine-tuning.
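As an illustration of how a persistent Protection State might be kept and rendered into the inference context, the sketch below stores guarded sites keyed by atom-map number and turns them into constraint text. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `ProtectedSite`, `ProtectionState`, and `to_prompt`, the wording of the constraint text, and the example sites are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class ProtectedSite:
    """One guarded site (hypothetical structure, not from the paper)."""
    atom_map: int          # atom-map number of the guarded atom
    functional_group: str  # e.g. "secondary alcohol"
    protecting_group: str  # e.g. "TBS"


@dataclass
class ProtectionState:
    """Persistent record of guarded sites, keyed by atom-map number."""
    sites: Dict[int, ProtectedSite] = field(default_factory=dict)

    def protect(self, site: ProtectedSite) -> None:
        self.sites[site.atom_map] = site

    def deprotect(self, atom_map: int) -> None:
        # Removing an absent site is a no-op.
        self.sites.pop(atom_map, None)

    def to_prompt(self) -> str:
        """Render the state as constraint text to place in the LLM context."""
        if not self.sites:
            return "No protected sites."
        lines = ["Protected sites (do NOT disconnect or modify these atoms):"]
        for num in sorted(self.sites):
            s = self.sites[num]
            lines.append(
                f"- atom map {num}: {s.functional_group}, "
                f"protected as {s.protecting_group}"
            )
        return "\n".join(lines)


# Illustrative sites only; the atom-map numbers are made up.
state = ProtectionState()
state.protect(ProtectedSite(7, "secondary alcohol", "TBS"))
state.protect(ProtectedSite(3, "primary amine", "Boc"))
prompt_block = state.to_prompt()
print(prompt_block)
```

Keying the state by atom-map number rather than by position in the SMILES string means the record stays valid when the string is rewritten or reordered, which is the property the text attributes to canonical atom maps.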

The method presented in this paper extends the DeepRetro [5] system, a modular, hybrid framework for retrosynthetic analysis. DeepRetro combines an LLM with Monte Carlo Tree Search (MCTS) to generate retrosynthesis pathways. The framework is designed to be model-agnostic, integrating various LLMs (e.g., Anthropic’s Claude series [1]) as its primary reasoning engine to propose novel disconnections. This approach allows the LLM to creatively explore chemical space while the MCTS algorithm systematically builds and expands the synthesis tree, grounding the generative power of the LLM within a structured search process.
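The division of labor between a proposer and a tree search can be sketched as follows. This is a generic MCTS skeleton, not DeepRetro's code: the selection rule (standard UCB1), the exploration constant, the `Node` structure, and the `stub_propose` function standing in for an LLM call are all assumptions made for illustration. Each child is simplified to one candidate disconnection written as a dot-separated precursor SMILES.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)
class Node:
    smiles: str
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0


def ucb1(child: Node, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCB1 score; unvisited children are tried first."""
    if child.visits == 0:
        return math.inf
    exploit = child.total_reward / child.visits
    explore = c * math.sqrt(math.log(max(parent_visits, 1)) / child.visits)
    return exploit + explore


def select(node: Node) -> Node:
    """Descend from the given node to a leaf, following the best UCB1 score."""
    while node.children:
        parent = node
        node = max(parent.children, key=lambda ch: ucb1(ch, parent.visits))
    return node


def expand(node: Node, propose: Callable[[str], List[str]]) -> None:
    """Attach one child per candidate returned by the proposer."""
    for precursors in propose(node.smiles):
        node.children.append(Node(precursors, parent=node))


def backpropagate(node: Optional[Node], reward: float) -> None:
    """Push a reward from a node up to the root."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent


def stub_propose(smiles: str) -> List[str]:
    # Hypothetical stand-in for an LLM call: two disconnections of ethyl acetate.
    if smiles == "CCOC(C)=O":
        return ["CCO.CC(=O)O", "CCO.CC(=O)Cl"]
    return []


root = Node("CCOC(C)=O")
expand(root, stub_propose)
backpropagate(root.children[0], 1.0)  # pretend the first candidate scored well
backpropagate(root.children[1], 0.0)
leaf = select(root)
print(leaf.smiles)
```

In a real system the reward would come from the validation and scoring steps rather than hand-set values, and a retrosynthesis tree usually needs AND/OR structure because every precursor of a step must itself be solved; the flat tree here omits that for brevity.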

Recognizing that computational metrics alone cannot capture chemical feasibility or elegance, DeepRetro employs a multi-faceted evaluation pipeline. Each LLM-proposed step is rigorously validated using stability, validity and hallucination checks before being added to the search tree. While quantitative metrics like Pathway Success Rate and Top-k accuracy are used for benchmarking, the framework’s philosophy emphasizes their limitations; such metrics can penalize novel or more elegant pathways not present in the ground-truth data. Consequently, DeepRetro places a strong emphasis on qualitative Case Study Analysis, where human experts assess the novelty and practical value of generated pathways. This human-in-the-loop validation is critical for navigating complex syntheses and serves as the motivation for developing more direct methods of human guidance.

A fundamental challenge in directing LLMs for retrosynthesis arises from the nature of the SMILES representation. As a linear string, it lacks an inherent mechanism to selectively mark specific atomic sites as non-reactive. This mirrors challenges in other domains; for instance, directly using LLMs as prompt encoders for diffusion models can degrade performance due to a misalignment between the model’s generalist training and the task’s need for discriminative features [3]. Similarly, in our context, simply instructing a model via a prompt to “not react at a specific functional group” is often insufficient, as the LLM can struggle to ground such a spatial-chemical concept onto the string representation, leading to hallucinations and strategically flawed suggestions. To overcome this, Protect$^*$ introduces a formal mechanism that first automatically identifies protection sites via stable atom mapping, suggests appropriate protecting groups, and then enforces these constraints through prompt engineering and state tracking (Figure 1).
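To make concrete why atom maps give a handle that a bare SMILES string lacks, the following sketch checks an LLM-proposed step against a set of guarded atom-map numbers. It is a deliberately crude, regex-based heuristic written for illustration, not the paper's enforcement mechanism: it only compares the bracket-atom token (element, hydrogen count, charge) of each guarded atom before and after, so it can catch some violations but cannot prove a step is safe. The function names and the example molecules are assumptions.

```python
import re
from typing import Dict, Iterable, List

# Matches mapped bracket atoms such as [OH:7] or [C@@H:5].
_ATOM_RE = re.compile(r"\[([^\]:]+):(\d+)\]")


def mapped_atoms(smiles: str) -> Dict[int, str]:
    """Map atom-map number -> bracket-atom body, with chirality marks stripped."""
    return {int(num): body.replace("@", "") for body, num in _ATOM_RE.findall(smiles)}


def violations(product: str, precursors: str, protected: Iterable[int]) -> List[str]:
    """List guarded atoms that vanish or change between product and precursors."""
    before = mapped_atoms(product)
    after = mapped_atoms(precursors)
    problems = []
    for num in sorted(protected):
        if num not in before:
            problems.append(f"atom map {num} is not in the product")
        elif num not in after:
            problems.append(f"atom map {num} disappeared from the precursors")
        elif before[num] != after[num]:
            problems.append(
                f"atom map {num} changed: [{before[num]}] -> [{after[num]}]"
            )
    return problems


# N-(2-hydroxyethyl)acetamide with the hydroxyl oxygen (map 7) guarded.
product = "[CH3:1][C:2](=[O:3])[NH:4][CH2:5][CH2:6][OH:7]"
guarded = {7}

# Amide disconnection: the guarded hydroxyl is untouched.
ok_step = "[CH3:1][C:2](=[O:3])[Cl:8].[NH2:4][CH2:5][CH2:6][OH:7]"
# Carbonyl reduction: the step is carried out at the guarded oxygen.
bad_step = "[CH3:1][C:2](=[O:3])[NH:4][CH2:5][CH:6]=[O:7]"

ok_report = violations(product, ok_step, guarded)
bad_report = violations(product, bad_step, guarded)
print(ok_report)
print(bad_report)
```

A production check would compare the bonding environment of each guarded atom with a cheminformatics toolkit instead of comparing string tokens; the point here is only that the map number lets the check refer to the same atom on both sides of the step.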

The first stage of Protect$^*$ employs RDKit substructure matching on canonically atom-mapped SMILES to automatically identify functional groups that may require protection during synthesis. We maintain a comprehensive database of 55+ SMARTS patterns organized into 10 categories: alcohols (primary, secondary, tertiary, allylic, benzylic, etc.), phenols, diols and polyols, amines, heterocyclic N-H groups (indole, pyrrole, imidazole), carbonyls, carboxylic acids and derivatives, thiols, terminal

Reference

This content is AI-processed based on open access ArXiv data.
