JudgeFlow: Agentic Workflow Optimization via Block Judge
Optimizing LLM-based agentic workflows is a key challenge in scaling AI capabilities. Current methods rely on coarse, end-to-end evaluation signals and lack fine-grained guidance on where to refine, often resulting in inefficient or low-impact modifications. To address these limitations, we propose JudgeFlow, an Evaluation-Judge-Optimization-Update pipeline. We incorporate reusable, configurable logic blocks into agentic workflows to capture fundamental forms of logic. On top of this abstraction, we design a dedicated Judge module that inspects execution traces, particularly failed runs, and assigns rank-based responsibility scores to problematic blocks. These fine-grained diagnostic signals are then leveraged by an LLM-based optimizer, which focuses modifications on the most problematic block in the workflow. Our approach improves sample efficiency, enhances interpretability through block-level diagnostics, and provides a scalable foundation for automating increasingly complex agentic workflows. We evaluate JudgeFlow on mathematical reasoning and code generation benchmarks, where it achieves superior performance and efficiency compared to existing methods.
💡 Research Summary
JudgeFlow addresses the challenge of optimizing large‑language‑model (LLM)‑driven agentic workflows by introducing a four‑stage pipeline: Evaluation, Judge, Optimization, and Update. Rather than treating a workflow as an opaque monolith, the authors decompose it into reusable, configurable logic blocks that capture three fundamental control structures—sequential execution, loops, and conditionals. Each block orchestrates one or more operators (e.g., generate, self‑refine, test), preserving the expressive power of code‑based workflows while providing a higher‑level abstraction that is both interpretable and amenable to systematic analysis.
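The block abstraction might be sketched roughly as follows. This is an illustrative reconstruction, not the paper's actual API: the class names (`Sequential`, `Loop`, `Conditional`), the state-dictionary interface, and the operator signature are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable

# An operator (e.g., generate, self-refine, test) transforms the workflow state.
Operator = Callable[[dict], dict]

@dataclass
class Block:
    name: str
    trace: list = field(default_factory=list)  # intermediate states, kept for the Judge

    def run(self, state: dict) -> dict:
        raise NotImplementedError

@dataclass
class Sequential(Block):
    """Runs its operators one after another."""
    operators: list = field(default_factory=list)

    def run(self, state: dict) -> dict:
        for op in self.operators:
            state = op(state)
            self.trace.append(dict(state))
        return state

@dataclass
class Loop(Block):
    """Repeats a body operator until a stop condition or an iteration limit."""
    body: Operator = None
    stop: Callable[[dict], bool] = lambda s: True
    max_iters: int = 3  # a block-level hyperparameter the optimizer may tune

    def run(self, state: dict) -> dict:
        for _ in range(self.max_iters):
            state = self.body(state)
            self.trace.append(dict(state))
            if self.stop(state):
                break
        return state

@dataclass
class Conditional(Block):
    """Dispatches to one of two operators based on a predicate over the state."""
    predicate: Callable[[dict], bool] = lambda s: True
    if_true: Operator = None
    if_false: Operator = None

    def run(self, state: dict) -> dict:
        branch = self.if_true if self.predicate(state) else self.if_false
        state = branch(state)
        self.trace.append(dict(state))
        return state
```

Because every block records its intermediate states in `trace`, a failed run leaves behind exactly the per-block context the Judge stage needs.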
During the Evaluation stage, the workflow runs on each query from a benchmark dataset, and its output is scored against the ground-truth answer. When a score falls below a predefined success threshold, the execution trace—including the intermediate states of every block—is packaged into a "failure quadruple." The Judge module then prompts a large language model with carefully crafted instructions to assign a rank-based responsibility score to every block in the failing instance. The block receiving rank 1 is deemed the most responsible for the error; its identifier and the associated failure context are logged for later use. This rank-based diagnostic provides fine-grained feedback that is absent in prior approaches, which typically rely solely on end-to-end metrics.
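The hand-off from Evaluation to Judge could look roughly like the sketch below. The quadruple's field names and the judge's output format are assumptions for illustration; `rank_fn` stands in for the actual LLM call and its prompt, which the paper does not fully specify here.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FailureQuadruple:
    """Assumed contents of a 'failure quadruple': the query, the workflow's
    wrong answer, the reference answer, and per-block execution traces."""
    query: str
    prediction: str
    ground_truth: str
    block_traces: dict  # block name -> list of intermediate states

def judge(failure: FailureQuadruple,
          rank_fn: Callable[[FailureQuadruple], list]) -> list:
    """Return block names ordered by responsibility (rank 1 first).

    `rank_fn` is a placeholder for the LLM that reads the failure context
    and produces the ranking; any callable with that shape works here.
    """
    ranking = rank_fn(failure)
    # Sanity check: the judge must rank exactly the blocks that executed.
    assert set(ranking) == set(failure.block_traces), "judge must rank every block"
    return ranking
```

In practice the rank-1 block identifier and the failure context would be logged together, forming the diagnostic record the optimizer consumes next.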
The Optimization stage leverages an LLM‑based optimizer that receives the block‑level responsibility information. Instead of exploring the entire space of possible workflow modifications, the optimizer focuses exclusively on the most problematic block, adjusting its internal operators, prompt templates, or block‑level hyperparameters (e.g., loop iteration limits). By constraining the search to a single block per failure, JudgeFlow dramatically improves sample efficiency and reduces the computational cost of each optimization iteration.
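The key constraint of the Optimization stage—edit only the top-ranked block—can be made concrete in a few lines. The workflow representation (a name-to-configuration map) and `propose_fn` (standing in for the LLM optimizer) are hypothetical simplifications:

```python
from typing import Callable

def optimize_step(workflow: dict, ranking: list,
                  propose_fn: Callable[[str, dict], dict]) -> dict:
    """Apply the LLM optimizer's proposal only to the most responsible block.

    `workflow` maps block names to their configurations (operators, prompt
    templates, hyperparameters such as loop limits); `ranking` is the Judge's
    responsibility ordering, rank 1 first. All other blocks are left untouched,
    which is what keeps the per-iteration search space small.
    """
    target = ranking[0]
    revised = dict(workflow)  # shallow copy: only one entry changes
    revised[target] = propose_fn(target, workflow[target])
    return revised
```

Restricting each iteration to a single block is the source of the sample-efficiency claim: the optimizer never has to explore joint modifications across the whole workflow.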
After the optimizer proposes a revised workflow, the Update stage re‑evaluates it on the dataset. New failures trigger another round of judging, creating an iterative feedback loop that progressively refines the workflow.
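Putting the four stages together, the outer loop is a simple fixed point iteration. Everything below is a placeholder skeleton under assumed interfaces; `evaluate`, `judge`, and `optimize` stand in for the paper's actual components:

```python
def judgeflow_loop(workflow, dataset, evaluate, judge, optimize,
                   max_rounds=5, threshold=1.0):
    """Iterate Evaluation -> Judge -> Optimization -> Update until no query
    fails or the round budget is exhausted.

    evaluate(workflow, query) -> score in [0, 1]; scores below `threshold`
    count as failures. judge(workflow, query) -> responsibility ranking.
    optimize(workflow, ranking) -> revised workflow.
    """
    for _ in range(max_rounds):
        failures = [q for q in dataset if evaluate(workflow, q) < threshold]
        if not failures:
            break  # converged: every query meets the success threshold
        ranking = judge(workflow, failures[0])
        workflow = optimize(workflow, ranking)  # Update: adopt revised workflow
    return workflow
```

Whether the real system judges one failure per round or aggregates rankings across all failures is not stated in the summary; the sketch picks the simplest variant.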
Empirical evaluation on two challenging benchmarks—MATH (mathematical reasoning) and HumanEval (code generation)—demonstrates that JudgeFlow outperforms several strong baselines, including Prompt‑Gradients (a gradient‑based prompt optimizer), Monte‑Carlo Tree Search‑based workflow search, and full‑model fine‑tuning. With the same budget of LLM calls, JudgeFlow achieves 3–5 percentage‑point higher accuracy, especially on tasks that involve complex conditional branching and iterative reasoning. The authors attribute this gain to the precise error localization afforded by block‑level responsibility scores.
The paper also discusses limitations. Currently only three block types are supported, which may not capture more sophisticated control flow such as recursion or dynamic memory management. Moreover, because the Judge itself is an LLM, its own biases can propagate into the responsibility rankings. Future work is suggested to expand the block taxonomy, incorporate ensemble judging for more robust attribution, and develop meta‑optimizers that can reason about interactions between multiple blocks.
In summary, JudgeFlow introduces a novel abstraction and diagnostic mechanism that transforms the optimization of agentic workflows from a coarse, global search into a targeted, interpretable process. By coupling block‑level responsibility attribution with focused LLM‑driven refinement, it offers a scalable foundation for automating increasingly complex AI agents while maintaining transparency and efficiency.