scBench: Evaluating AI Agents on Single-Cell RNA-seq Analysis
As single-cell RNA sequencing datasets grow in adoption, scale, and complexity, data analysis remains a bottleneck for many research groups. Although frontier AI agents have improved dramatically at software engineering and general data analysis, it remains unclear whether they can extract biological insight from messy, real-world single-cell datasets. We introduce scBench, a benchmark of 394 verifiable problems derived from practical scRNA-seq workflows spanning six sequencing platforms and seven task categories. Each problem provides a snapshot of experimental data immediately prior to an analysis step and a deterministic grader that evaluates recovery of a key biological result. Evaluations of eight frontier models show accuracy ranging from 29% to 53%, with strong model-task and model-platform interactions. Platform choice affects accuracy as much as model choice, with drops of more than 40 percentage points on less-documented technologies. scBench complements SpatialBench to cover the two dominant single-cell modalities, serving both as a measurement tool and a diagnostic lens for developing agents that can analyze real scRNA-seq datasets faithfully and reproducibly.
💡 Research Summary
The paper introduces scBench, a comprehensive benchmark designed to evaluate the capability of large‑language‑model (LLM) based AI agents in performing real‑world single‑cell RNA‑sequencing (scRNA‑seq) analyses. scBench comprises 394 verifiable problems extracted from routine scRNA‑seq workflows across six sequencing platforms (Chromium, BD Rhapsody, CSGenetics, Illumina, MissionBio, ParseBio) and seven analytical tasks (quality control, normalization, highly variable gene selection, dimensionality reduction, clustering, cell‑type annotation, differential expression). Each problem presents a data snapshot (typically an AnnData .h5ad file), a natural‑language prompt describing the desired analysis step, and a deterministic grader that checks a structured JSON output against ground‑truth values within predefined tolerances. The graders are deliberately engineered to prevent shortcuts such as reading pre‑computed labels or relying on prior biological knowledge; tolerances are calibrated by running multiple valid pipelines to capture acceptable variation.
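The grading scheme described above can be sketched as a small Python function. This is a hypothetical illustration of the idea, not the benchmark's actual grader: the key names (`n_cells`, `cluster`) and the per-key tolerance dictionary are assumptions for the example; the paper only specifies that a structured JSON output is checked against ground-truth values within calibrated tolerances.

```python
import json

def grade(answer_path, truth, tolerances):
    """Hypothetical sketch of a deterministic grader.

    Loads the agent's structured JSON output and checks that every
    ground-truth key is present, with numeric values matching within a
    per-key tolerance and non-numeric values matching exactly.
    """
    with open(answer_path) as f:
        answer = json.load(f)
    results = {}
    for key, expected in truth.items():
        got = answer.get(key)
        if got is None:
            results[key] = False  # missing key fails outright
        elif isinstance(expected, (int, float)):
            # numeric check within the calibrated tolerance for this key
            results[key] = abs(got - expected) <= tolerances.get(key, 0.0)
        else:
            results[key] = got == expected  # exact match for labels etc.
    return all(results.values()), results
```

Because the check is a pure function of the output file and fixed reference values, repeated grading of the same answer is fully reproducible, which is the property the paper emphasizes.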
Eight frontier models—Claude Opus 4.6/4.5, Claude Sonnet 4.5, OpenAI GPT‑4.5/4.1, xAI Grok‑4.1/4, and Google Gemini 2.5 Pro—were evaluated under a unified mini‑SWE‑agent harness, with three runs per problem. Overall accuracy ranged from 29% (Gemini 2.5 Pro) to 52.8% (Claude Opus 4.6). Task‑wise performance displayed a clear difficulty gradient: normalization was easiest (≈70% mean accuracy), followed by QC (≈55%). Differential expression proved hardest (≈27% mean accuracy), reflecting its reliance on statistical test selection, marker‑gene identification, and nuanced biological judgment. Cell‑type annotation and clustering fell in the middle (≈35–38% mean accuracy).
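The normalization task that models handle best corresponds to a well-standardized computation. As a minimal sketch, the code below implements library-size normalization followed by a log1p transform in plain NumPy; for dense count matrices this is the same arithmetic that Scanpy's `sc.pp.normalize_total` plus `sc.pp.log1p` perform, though the function name here is our own.

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Library-size normalization + log1p, the standard scRNA-seq
    normalization step.

    counts: (cells, genes) raw count matrix (dense ndarray).
    Each cell's counts are scaled so they sum to `target_sum`,
    then log1p-transformed.
    """
    per_cell = counts.sum(axis=1, keepdims=True).astype(float)
    per_cell[per_cell == 0] = 1.0  # avoid division by zero for empty cells
    scaled = counts / per_cell * target_sum
    return np.log1p(scaled)
```

The ubiquity of this exact recipe in public scRNA-seq tutorials plausibly explains why normalization sits at the easy end of the benchmark's difficulty gradient.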
Platform effects were as pronounced as model effects. The mean accuracy across models was 59 % on CSGenetics but dropped to 26 % on MissionBio, a 32.7‑percentage‑point swing that exceeds the 23.6‑point spread between the best and worst models. This suggests that training data are heavily biased toward widely documented platforms (Chromium, Illumina) while less‑documented technologies suffer from poor model generalization. Some models (e.g., Grok‑4) performed relatively better on MissionBio than on more common platforms, indicating that model‑specific training or architecture can mitigate platform bias to a limited extent.
Cost‑accuracy and latency‑accuracy trade‑offs were also examined. GPT‑4.5 achieved near‑top accuracy (45 %) at lower monetary cost and latency, whereas Claude Opus 4.6 led in raw accuracy but incurred higher cost and longer response times. Pareto‑optimal models thus include both GPT‑4.5 (efficient) and Claude Opus 4.6 (most accurate).
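Identifying the Pareto-optimal models mentioned above is a simple dominance check: a model is on the front unless some other model is at least as accurate and at least as cheap, and strictly better on one axis. The sketch below illustrates this; the cost figures are illustrative placeholders, not values from the paper.

```python
def pareto_front(models):
    """Return the names of models not dominated on (accuracy up, cost down).

    models: dict mapping name -> (accuracy, cost). A model is dominated
    if another model is >= on accuracy, <= on cost, and strictly better
    on at least one of the two.
    """
    front = []
    for name, (acc, cost) in models.items():
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for other, (a, c) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front
```

With accuracies as reported and hypothetical costs, both the most accurate model and a cheaper near-top model survive the dominance check, matching the paper's conclusion that the front contains more than one model.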
The authors compare scBench to their earlier SpatialBench (spatial transcriptomics) benchmark. While scRNA‑seq tasks yielded higher absolute accuracies (top model 52.8 % vs. 38.4 % on SpatialBench), the relative ranking of models was preserved, and similar platform‑dependent swings were observed. This underscores that the abundance of public scRNA‑seq datasets and the dominance of Scanpy‑based tools make the domain somewhat more tractable for current LLMs.
In discussion, the paper emphasizes that current agents, despite impressive coding abilities, lack the contextual, platform‑aware, and scientific reasoning needed for high‑stakes biological inference. Deterministic grading, while enabling reproducible evaluation, inevitably discretizes scientific judgment and does not capture error propagation across multi‑step pipelines. The authors advocate for future work that incorporates multi‑step, long‑horizon scenarios, self‑calibration heuristics, and richer platform‑specific tooling. They envision scBench as an evolving diagnostic suite that will guide both model training and harness engineering toward trustworthy, reproducible AI‑driven scRNA‑seq analysis.