AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations
High-quality scientific illustrations are crucial for communicating complex scientific and technical concepts, yet their manual creation remains a well-recognized bottleneck in both academia and industry. We present FigureBench, the first large-scale benchmark for generating scientific illustrations from long-form scientific texts. It contains 3,300 high-quality scientific text-figure pairs covering diverse text-to-illustration tasks drawn from scientific papers, surveys, blogs, and textbooks. We also propose AutoFigure, the first agentic framework that automatically generates high-quality scientific illustrations from long-form scientific text. Before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is structurally sound and aesthetically refined, yielding an illustration that combines structural completeness with aesthetic appeal. Leveraging the high-quality data in FigureBench, we conduct extensive experiments comparing AutoFigure against various baseline methods. The results demonstrate that AutoFigure consistently surpasses all baselines, producing publication-ready scientific illustrations. The code, dataset, and Hugging Face space are available at https://github.com/ResearAI/AutoFigure.
💡 Research Summary
The paper tackles the long‑standing bottleneck of manually creating scientific illustrations, which are essential for conveying complex concepts in research papers, surveys, blogs, and textbooks. Existing datasets such as Paper2Fig100k, ACL‑Fig, and SciCap+ focus on short captions or existing metadata and do not address the challenge of generating a complete, original illustration from a long scientific document (often >10 k tokens). To fill this gap, the authors introduce two major contributions: (1) FigureBench, the first large‑scale benchmark for long‑context scientific illustration, and (2) AutoFigure, an agentic framework that automatically produces publication‑ready figures from such texts.
FigureBench comprises 3,300 high‑quality text‑figure pairs collected from four sources: research papers, surveys, technical blogs, and textbooks. The test set (300 pairs) was built by first using GPT‑5 to select the most representative illustration from 400 randomly sampled papers, then filtering for conceptual figures (excluding data‑driven charts) and requiring explicit textual descriptions of all visual elements. Two independent annotators validated each pair, achieving a Cohen’s κ of 0.91, resulting in 200 paper‑derived pairs. An additional 100 pairs were manually curated from surveys, blogs, and textbooks to increase diversity. A vision‑language model fine‑tuned on these 300 examples was used as an automated filter to expand the development set to 3,000 pairs. Detailed statistics (average token count, text density, number of colors, components, shapes) illustrate the substantial variability and difficulty of the task.
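The inter‑annotator agreement reported above (Cohen's κ of 0.91) follows the standard kappa computation; a minimal sketch is shown below. The toy labels are illustrative only, not the benchmark's actual annotations.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items both annotators labeled identically.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected chance agreement, from each annotator's label marginals.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (po - pe) / (1 - pe)

# Toy example: two annotators deciding whether a text-figure pair is valid.
ann1 = ["keep", "keep", "drop", "keep", "drop", "keep"]
ann2 = ["keep", "keep", "drop", "drop", "drop", "keep"]
print(round(cohens_kappa(ann1, ann2), 3))  # 0.667
```

A κ of 0.91 over the 300 test pairs indicates near-perfect agreement, which is why disagreements could be resolved by simple adjudication rather than a third annotation round.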
AutoFigure follows a “Reasoned Rendering” paradigm that splits the generation pipeline into two distinct stages. In Stage I (Semantic Parsing & Layout Planning), a large language model (LLM) parses the long text into a structured semantic graph, extracts key concepts, procedural steps, and relationships, and proposes multiple layout blueprints. A vision‑language model acting as a judge evaluates these candidates on structural coherence, visual balance, and relevance, selecting the optimal layout. In Stage II (Aesthetic Rendering & Text Refinement), the chosen blueprint is fed to a high‑resolution diffusion model. To overcome the common problem of blurry or illegible text in diffusion outputs, AutoFigure employs an “erase‑and‑correct” strategy: text regions are masked after the initial rendering, then an OCR‑guided module re‑generates crisp, correctly aligned text. Style prompts (color palette, font, icon style) are dynamically adjusted to ensure visual consistency.
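Stage I's candidate-selection step, i.e., scoring several layout blueprints on structural coherence, visual balance, and relevance and keeping the best one, can be sketched as follows. The dataclass fields, scoring weights, and candidate names are assumptions for illustration; in AutoFigure the per-criterion scores come from a vision-language model acting as judge, not a fixed formula.

```python
from dataclasses import dataclass

@dataclass
class LayoutCandidate:
    name: str
    coherence: float   # structural coherence score in [0, 1]
    balance: float     # visual balance score in [0, 1]
    relevance: float   # relevance-to-text score in [0, 1]

def judge_score(c, weights=(0.4, 0.3, 0.3)):
    """Aggregate the three judging criteria (weights are illustrative)."""
    wc, wb, wr = weights
    return wc * c.coherence + wb * c.balance + wr * c.relevance

def select_layout(candidates):
    """Keep the blueprint the judge rates highest, as in Stage I."""
    return max(candidates, key=judge_score)

candidates = [
    LayoutCandidate("left-to-right pipeline", 0.9, 0.7, 0.8),
    LayoutCandidate("radial concept map",     0.6, 0.9, 0.7),
    LayoutCandidate("two-column comparison",  0.8, 0.8, 0.9),
]
print(select_layout(candidates).name)  # two-column comparison
```

Only the winning blueprint is passed to the Stage II diffusion renderer, so the judge acts as a gate between planning and rendering rather than a post-hoc filter.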
The authors evaluate AutoFigure against several baselines: standard text‑to‑image diffusion models, PosterAgent, PPT‑Agent, and code‑generation pipelines that focus on geometric correctness. Evaluation combines automated VLM‑as‑judge metrics (structural accuracy, aesthetic quality, layout balance) with human expert judgments. AutoFigure consistently outperforms the baselines, scoring 12–18% higher on automated metrics, and 66.7% of its figures are judged "publication‑ready," compared with roughly 38% for the strongest baseline. Ablation studies show that removing any component (layout planning, text‑blur correction, or VLM judging) significantly degrades performance, confirming the importance of the full reasoning‑refinement loop.
Limitations noted include dependence on a predefined library of domain‑specific icons and terminology, occasional layout overload for extremely complex flowcharts, and the fact that AutoFigure is currently inference‑only (no end‑to‑end trainable component). Future work will explore domain adapters for automatic icon synthesis, human‑in‑the‑loop feedback, and learnable layout policies to further close the gap between AI‑generated and human‑crafted illustrations.
In summary, the paper delivers a comprehensive benchmark (FigureBench) and a novel, agentic system (AutoFigure) that together advance the state of the art in automatically turning long scientific texts into high‑quality, aesthetically refined figures. This work paves the way for fully autonomous “AI scientists” capable of not only generating textual research content but also producing the visual artifacts necessary for effective scientific communication.