Beyond SMILES: Evaluating Agentic Systems for Drug Discovery

Reading time: 4 minutes

📝 Original Info

  • Title: Beyond SMILES: Evaluating Agentic Systems for Drug Discovery
  • ArXiv ID: 2602.10163
  • Date: 2026-02-10
  • Authors: Not specified in the original source.

📝 Abstract

Agentic systems for drug discovery have demonstrated autonomous synthesis planning, literature mining, and molecular design. We ask how well they generalize. Evaluating six frameworks against 15 task classes drawn from peptide therapeutics, in vivo pharmacology, and resource-constrained settings, we find five capability gaps: no support for protein language models or peptide-specific prediction, no bridges between in vivo and in silico data, reliance on LLM inference with no pathway to ML training or reinforcement learning, assumptions tied to large-pharma resources, and single-objective optimization that ignores safety-efficacy-stability trade-offs. A paired knowledge-probing experiment suggests the bottleneck is architectural rather than epistemic: four frontier LLMs reason about peptides at levels comparable to small molecules, yet no framework exposes this capability. We propose design requirements and a capability matrix for next-generation frameworks that function as computational partners under realistic constraints.

📄 Full Content

1 Introduction

Recent agentic AI systems have made tangible progress. Coscientist autonomously plans chemical syntheses [1], ChemCrow orchestrates 18 chemistry tools [2], and ChatInvent completed a deployment at AstraZeneca for molecular design and synthesis planning [3]. PharmAgents integrates knowledge graphs for target identification [4], TxGemma provides therapeutics-focused language understanding [5], while MADD [6] and DiscoVerse [7] promise multi-agent collaboration. The dominant narrative positions agentic systems as the next major advance, moving beyond static models to systems that autonomously navigate literature, design experiments, and propose hypotheses [8,9].

The architectural pattern is consistent: a large language model orchestrates tool calls, synthesizes results, and generates explanations. ChemCrow routes requests to RDKit, PubChem, and reaction prediction APIs. ChatInvent generates molecular designs informed by literature. Coscientist interfaces with laboratory automation. This LLM-centric design works for text-based reasoning tasks: literature review, synthesis enumeration, protocol documentation, and safety analysis. However, these demonstrations are narrowly scoped to specific contexts. Most systems are optimized for small-molecule workflows, high-throughput in vitro assays, and organizations with large datasets and extensive compute. When those assumptions break, performance degrades in ways the demos do not reveal.
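The orchestration pattern described above can be sketched as a simple dispatch loop. The tool functions and the rule-based `plan` stub below are hypothetical placeholders standing in for an LLM planner and real services (PubMed, RDKit, reaction APIs), not any framework's actual interface:

```python
# Minimal sketch of the LLM-orchestrator pattern: a planner emits
# (tool, argument) calls, a dispatcher executes them, and the results
# are collected for synthesis. All tool names here are illustrative.

def lookup_literature(query: str) -> str:
    return f"3 abstracts matching '{query}'"   # stand-in for a literature API

def predict_property(smiles: str) -> float:
    return 0.5 * len(smiles) % 7               # stand-in for an RDKit/ML call

TOOLS = {"literature": lookup_literature, "property": predict_property}

def plan(task: str) -> list[tuple[str, str]]:
    """Stand-in for the LLM planner: map a task to tool invocations."""
    if "aspirin" in task:
        return [("literature", "aspirin analogs"),
                ("property", "CC(=O)OC1=CC=CC=C1C(=O)O")]
    return [("literature", task)]

def run_agent(task: str) -> list:
    results = []
    for tool, arg in plan(task):
        results.append(TOOLS[tool](arg))       # dispatch each planned call
    return results

print(run_agent("optimize aspirin potency"))
```

The point of the sketch is structural: every capability the agent has must be registered in the tool table, which is exactly why gaps in that table (peptide models, in vivo data bridges) become hard architectural limits rather than prompt-level fixes.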

An important distinction underlies the analysis that follows. Individual tools addressing aspects of each gap are emerging: peptide-aware generative models, multi-objective optimizers, and omics analysis platforms exist as standalone capabilities. The capability gaps we identify are not the absence of individual tools but the absence of agentic workflow integration that chains these capabilities into end-to-end pipelines supporting iterative design-test cycles, proprietary data, and human-in-the-loop decision-making. To test whether this small-molecule bias extends to the foundation models themselves, we probe four frontier LLMs on matched small-molecule and peptide questions as a diagnostic: all four models demonstrate competent peptide reasoning, isolating the bottleneck to agent architecture rather than model capability (§2.5).

Figure 1: The Agent Reality Gap in Drug Discovery. Left panel shows computational workflows where current agents excel: small molecule representations (SMILES strings), databases, literature mining, and virtual screening. Right panel depicts the messy reality of drug discovery: multimodal biological data from animal studies, wet lab iteration, and multi-objective trade-offs. The gap between these contexts represents the architectural limitations addressed in this paper.

However, these systems reveal systematic capability gaps outside their design context: small-molecule discovery at well-resourced pharmaceutical companies. Peptide therapeutics require protein language models like ESM-2 [10] or ProtBERT [11], not molecular fingerprints. Peptides (5 to 50 amino acids) have complex conformational dynamics, aggregation propensities, and protease vulnerabilities absent in small molecules. No current agent supports protein language model fine-tuning, conformational sampling, or aggregation prediction.
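As a toy illustration of why peptides call for sequence-level featurization, the sketch below computes an amino-acid composition vector. This is a deliberately crude stand-in for a protein language model embedding: a real ESM-2 embedding is contextual and order-sensitive, while composition is not, which is part of the point.

```python
# Toy sequence featurization: amino-acid composition of a peptide.
# A crude stand-in for a learned protein-language-model embedding;
# real embeddings capture residue context and conformational signal
# that a bag-of-residues vector like this cannot.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # 20 canonical residues

def composition(seq: str) -> list[float]:
    """Fraction of each canonical residue in the sequence."""
    seq = seq.upper()
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

vec = composition("GIGAVLKVLTTGLPALIS")  # melittin fragment, illustrative
assert abs(sum(vec) - 1.0) < 1e-9        # fractions sum to 1 for canonical sequences
```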

In vivo efficacy studies generate longitudinal, multi-modal data: behavioral scores over weeks, tissue histology, RNA sequencing, and clinical notes. In neurological injury models, efficacy manifests through staged recovery endpoints: early motor improvements, subsequent reduction in neuroinflammation, and longer-term neurogenesis. No agent integrates these temporal data streams for outcome prediction. The result is a gap between in vitro promise and in vivo reality, where most development cost and risk actually sit.
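Part of why no agent integrates these streams is the sheer heterogeneity of the per-animal record. A minimal sketch of such a record follows; all field names and the trend metric are hypothetical, chosen only to show behavioral time series, histology, and RNA-seq coexisting in one structure:

```python
from dataclasses import dataclass, field

@dataclass
class AnimalRecord:
    """Hypothetical per-animal record from a longitudinal in vivo study."""
    animal_id: str
    behavior_scores: dict[int, float] = field(default_factory=dict)  # day -> motor score
    histology: dict[str, float] = field(default_factory=dict)        # stain -> quantified signal
    rnaseq_counts: dict[str, int] = field(default_factory=dict)      # gene -> read count
    notes: list[str] = field(default_factory=list)                   # free-text clinical notes

    def recovery_slope(self) -> float:
        """Crude trend across staged endpoints: score change per day."""
        days = sorted(self.behavior_scores)
        if len(days) < 2:
            return 0.0
        return (self.behavior_scores[days[-1]]
                - self.behavior_scores[days[0]]) / (days[-1] - days[0])

r = AnimalRecord("m01", behavior_scores={7: 2.0, 14: 3.5, 28: 5.0})
print(r.recovery_slope())   # net improvement divided by elapsed days
```

Even this toy record mixes time-indexed floats, keyed counts, and free text; an agent that only consumes SMILES strings and assay tables has no slot for any of it.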

Small biotechs face different constraints than AstraZeneca: 50 to 500 proprietary sequences versus millions, single GPU versus clusters, one person handling design, modeling, and analysis.

Transfer learning and few-shot adaptation are prerequisites at that scale, yet current agents assume abundant data and compute, along with long, interactive cycles that do not match small-team workflows.
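At a 50-500 sequence budget, even evaluation must be frugal. The sketch below runs leave-one-out validation over a 1-nearest-neighbor activity predictor, the kind of minimal baseline such a dataset forces; the sequences, labels, and Hamming-style distance are illustrative only:

```python
# Leave-one-out evaluation of a 1-nearest-neighbor activity predictor.
# With only dozens to hundreds of proprietary sequences, holding out a
# fixed test split is wasteful; LOO uses every point. Data is toy data.

def hamming(a: str, b: str) -> int:
    """Position-wise mismatches, plus a length penalty."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

data = [("GIGAVLK", 1), ("GIGAVLR", 1), ("AAAAAAA", 0),
        ("AAAAAAG", 0), ("GIGAALK", 1)]   # (sequence, active?) toy labels

def loo_accuracy(data):
    hits = 0
    for i, (seq, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]                          # hold one out
        pred = min(rest, key=lambda p: hamming(seq, p[0]))[1]   # neighbor's label
        hits += pred == label
    return hits / len(data)

print(loo_accuracy(data))
```

A framework aimed at small teams would need this evaluate-with-what-you-have loop, plus transfer from pretrained embeddings, as a first-class workflow rather than assuming millions of labeled examples.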

Real drug discovery navigates multi-objective trade-offs under uncertainty. A peptide with tenfold higher bioactivity may have narrower safety margins or reduced stability. Current agents optimize single metrics or weighted sums, ignoring Pareto frontiers and uncertainty quantification.

Practitioners end up doing this reasoning manually, which slows iteration and increases decision risk.
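The Pareto reasoning practitioners do by hand can be made concrete with a minimal non-dominated filter over (bioactivity, safety, stability) triples. All scores below are illustrative and treated as higher-is-better:

```python
# Minimal Pareto-front filter over candidates scored on three objectives
# (higher is better on every axis). Scores are illustrative only.

def dominates(a, b):
    """a dominates b if a is >= on every objective and > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates: dict) -> set:
    return {name for name, s in candidates.items()
            if not any(dominates(other, s) for other in candidates.values())}

candidates = {                    # (bioactivity, safety margin, stability)
    "pep_A": (10.0, 0.2, 0.9),    # potent but narrow safety margin
    "pep_B": (1.0, 0.9, 0.8),     # safe but weak
    "pep_C": (5.0, 0.6, 0.7),     # balanced
    "pep_D": (4.0, 0.5, 0.6),     # worse than pep_C on every axis
}
print(sorted(pareto_front(candidates)))  # pep_D is dominated and drops out
```

Note that a weighted sum would collapse pep_A, pep_B, and pep_C into a single "best" pick, hiding exactly the trade-off a medicinal chemist needs to see; the front keeps all three and discards only the strictly dominated pep_D.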

This paper presents a systematic gap analysis drawing on over a dozen computational projects spanning peptide design, reinforcement learning optimization, in vivo efficacy modeling, behavioral phenotyping via computer vision, RNA-seq analysis, and multi-objective navigation, led by the author at a small biotech while serving as both drug designer and computational practitioner.


This content is AI-processed based on open access ArXiv data.
