ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems

Notice: This research summary and analysis were automatically generated using AI technology. For full accuracy, please refer to the original arXiv source.

Much previous AI research has focused on developing monolithic models to maximize their intelligence, with the primary goal of enhancing performance on specific tasks. In contrast, this work studies the use of LLM-based agents to autonomously design collaborative AI systems. To explore this problem, we first introduce ComfyBench to evaluate agents' ability to design collaborative AI systems in ComfyUI. ComfyBench is a comprehensive benchmark comprising 200 diverse tasks covering various instruction-following generation challenges, along with detailed annotations for 3,205 nodes and 20 workflows. Based on ComfyBench, we further develop ComfyAgent, a novel framework that empowers LLM-based agents to autonomously design collaborative AI systems by generating workflows. ComfyAgent is built on two core concepts. First, it represents workflows with code, which can be reversibly converted into workflows and executed as collaborative systems by the interpreter. Second, it constructs a multi-agent system that cooperates to learn from existing workflows and generate new workflows for a given task. While experimental results demonstrate that ComfyAgent achieves a resolve rate comparable to o1-preview and significantly surpasses other agents on ComfyBench, ComfyAgent has resolved only 15% of creative tasks. LLM-based agents still have a long way to go in autonomously designing collaborative AI systems. Progress with ComfyBench is paving the way for more intelligent and autonomous collaborative AI systems.


💡 Research Summary

The paper tackles the emerging challenge of automatically designing collaborative AI systems—complex pipelines that combine multiple models and tools—using large language model (LLM) based agents. To evaluate such agents, the authors introduce ComfyBench, a benchmark built on the open‑source visual generation platform ComfyUI. ComfyBench comprises 200 task instructions spanning three difficulty levels (vanilla, complex, creative), detailed documentation for 3,205 nodes, and 20 “curriculum” workflows that serve as teaching examples. Each task requires an agent to generate a ComfyUI workflow (a directed acyclic graph of nodes) that, when executed, produces the desired image or video output. Evaluation is performed in two stages: (1) Pass Rate, checking whether the generated workflow is syntactically and semantically valid; (2) Resolve Rate, checking whether the visual output satisfies the task description. The latter is automated using GPT‑4o as a visual‑language model (VLM) that judges image/video compliance.
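The two-stage evaluation can be sketched as a simple scoring loop. This is a minimal illustration, not the paper's harness: `generate_workflow`, `workflow_is_valid`, and `execute_and_judge` are hypothetical stand-ins for the agent under test, the syntactic/semantic validity check, and the GPT-4o-based visual judgment, respectively.

```python
def evaluate(tasks, generate_workflow, workflow_is_valid, execute_and_judge):
    """Two-stage benchmark scoring: Pass Rate, then Resolve Rate.

    A task counts toward Pass Rate if its generated workflow is valid,
    and toward Resolve Rate only if the executed output also satisfies
    the task description. Both rates are normalized by the total task
    count, so Resolve Rate <= Pass Rate by construction.
    """
    passed = resolved = 0
    for task in tasks:
        workflow = generate_workflow(task)
        if not workflow_is_valid(workflow):
            continue  # stage 1 failure: counts toward neither rate
        passed += 1
        if execute_and_judge(workflow, task):  # stage 2: VLM judges the output
            resolved += 1
    total = len(tasks)
    return passed / total, resolved / total


# Toy run with stub callables standing in for the real components.
tasks = ["t1", "t2", "t3", "t4"]
valid = {"t1": True, "t2": True, "t3": False, "t4": True}
judged_ok = {"t1": True, "t2": False, "t4": True}
pass_rate, resolve_rate = evaluate(
    tasks,
    generate_workflow=lambda t: t,
    workflow_is_valid=lambda w: valid[w],
    execute_and_judge=lambda w, t: judged_ok.get(w, False),
)
```

In this toy run, three of four workflows are valid and two outputs satisfy their tasks, giving a Pass Rate of 0.75 and a Resolve Rate of 0.5.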

Building on this benchmark, the authors propose ComfyAgent, a novel framework that enables LLM agents to design collaborative systems autonomously. The key innovations are: (1) Code‑based workflow representation—workflows are translated into a Python‑like domain‑specific language (DSL) that can be round‑tripped back into JSON for execution. This makes the structural dependencies of the pipeline more accessible to LLM reasoning. (2) Multi‑agent architecture consisting of four specialized agents:

  • PlanAgent creates a global plan from the task instruction.
  • RetrievalAgent searches the node documentation and existing workflow code to extract relevant concepts.
  • CombineAgent merges retrieved code fragments into a draft workflow.
  • AdaptAgent fine‑tunes the draft's parameters to meet the specific task.

The agents cooperate iteratively, producing a final DSL script that is converted to a ComfyUI workflow and executed.
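The code-based representation idea can be illustrated with a tiny builder that records node-creation calls and round-trips them through a ComfyUI-style JSON dict (node id → `class_type` + `inputs`, with `[node_id, output_index]` references). This is a sketch under assumptions: the `Workflow` class and its methods are invented for illustration and are not the paper's actual DSL, though `CheckpointLoaderSimple` and `CLIPTextEncode` are real ComfyUI node types.

```python
import json


class Workflow:
    """Minimal code-as-workflow sketch: Python calls build a DAG that
    serializes to a ComfyUI-style JSON mapping and parses back."""

    def __init__(self):
        self.nodes = {}
        self._next_id = 0

    def add(self, class_type, **inputs):
        """Register a node; the returned id is used to wire later nodes."""
        node_id = str(self._next_id)
        self._next_id += 1
        self.nodes[node_id] = {"class_type": class_type, "inputs": inputs}
        return node_id

    def to_json(self):
        return json.dumps(self.nodes, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        """Reverse direction: reconstruct the builder state from JSON."""
        wf = cls()
        wf.nodes = json.loads(text)
        wf._next_id = max((int(i) for i in wf.nodes), default=-1) + 1
        return wf


# Build a toy text-to-image fragment in code, then round-trip it.
wf = Workflow()
ckpt = wf.add("CheckpointLoaderSimple", ckpt_name="sd15.safetensors")
prompt = wf.add("CLIPTextEncode", clip=[ckpt, 1], text="a red fox")
restored = Workflow.from_json(wf.to_json())
```

The round trip is lossless (`restored.to_json() == wf.to_json()`), which is the "reversibly converted" property the paper relies on: agents can read and edit the code form while the interpreter executes the JSON form.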

Experiments compare ComfyAgent against several baseline agents (GPT‑4o, Claude, and LLaMA‑based agents) using both Pass Rate and Resolve Rate. Results show that ComfyAgent achieves a Resolve Rate comparable to the state‑of‑the‑art “o1‑preview” model and significantly outperforms other baselines across all difficulty levels. However, its Resolve Rate on the creative subset (40 tasks requiring novel combinations or new reasoning) is only 15%, indicating that current LLMs still struggle with genuine creativity and complex logical composition.

The authors analyze failure modes: mis‑interpreting node dependencies, incorrect parameter settings, and occasional hallucinations in the VLM‑based evaluation. They argue that representing workflows as code reduces the ambiguity inherent in raw JSON, and that a multi‑agent system provides richer, staged reasoning than a single‑prompt approach. Nonetheless, limitations remain: (i) the reliance on VLMs for automatic resolve evaluation introduces subjectivity; (ii) long‑range memory and retrieval capabilities of LLMs are still constrained; (iii) the benchmark focuses on visual generation, leaving other modalities unexplored.

In the discussion, the paper outlines future directions: integrating reinforcement learning with human feedback to refine failed designs, expanding the toolchain to include web browsing, code execution, or 3D modeling for richer multimodal pipelines, and improving memory‑augmented retrieval to handle larger node libraries. Moreover, developing more objective visual quality metrics and human‑in‑the‑loop evaluations would strengthen benchmark reliability.

In summary, ComfyBench provides the first systematic testbed for assessing LLM agents’ ability to construct and run collaborative AI pipelines, and ComfyAgent demonstrates that code‑centric, multi‑agent designs can substantially close the gap to expert performance. Yet, the low success on creative tasks underscores that autonomous design of truly novel AI systems remains an open research frontier.

