Workflow-R1: Group Sub-sequence Policy Optimization for Multi-turn Workflow Construction
The rapid evolution of agentic workflows has demonstrated strong performance of LLM-based agents in addressing complex reasoning tasks. However, existing workflow optimization methods typically formulate workflow synthesis as a static, one-shot code-centric generation problem. This paradigm imposes excessive constraints on the model’s coding capabilities and restricts the flexibility required for dynamic problem-solving. In this paper, we present Workflow-R1, a framework that reformulates workflow construction as a multi-turn, natural language-based sequential decision-making process. To resolve the optimization granularity mismatch inherent in such multi-turn interactions, we introduce Group Sub-sequence Policy Optimization (GSsPO). While explicitly tailored to align with the interleaved Think-Action dynamics of agentic reasoning, GSsPO fundamentally functions as a structure-aware RL algorithm generalizable to a broad class of multi-turn agentic sequential decision-making tasks. By recalibrating the optimization unit to the composite sub-sequence, specifically the atomic Think-Action cycle, it aligns gradient updates with the semantic boundaries of these interactions, ensuring robust learning in complex multi-turn reasoning tasks. Through extensive experiments on multiple QA benchmarks, Workflow-R1 outperforms competitive baselines, validating GSsPO as a generalized solution for sequential reasoning and establishing Workflow-R1 as a promising new paradigm for automated workflow optimization.
💡 Research Summary
The paper addresses a fundamental limitation of current automated workflow synthesis for large language model (LLM) agents, which it calls the “Static Execution Trap.” Existing methods generate a complete, code‑based workflow in a single pass before any operator is executed, thereby decoupling planning from execution and preventing the agent from adapting to intermediate observations. To overcome this, the authors propose Workflow‑R1, a framework that reconceptualizes workflow construction as a multi‑turn, natural‑language interaction between the agent and its environment. In each turn the agent produces a Think‑Action pair wrapped in <think> and <tool> tags; the execution engine returns results in <info> tags, which the agent then observes before the next thinking step. This closed‑loop design enables dynamic adjustment of the workflow topology based on real‑time feedback, eliminating the rigidity of static code generation.
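The closed-loop protocol described above can be sketched as a simple control loop. This is a minimal illustrative reading, not the paper's implementation: `llm_generate` and `execute_tool` are hypothetical stand-ins for the policy model and the workflow execution engine, and the exact tag grammar is assumed from the description.

```python
import re

def run_episode(question, llm_generate, execute_tool, max_turns=8):
    """Alternate Think-Action turns until the agent emits <answer>...</answer>.

    Each turn the model appends a <think>/<tool> pair; the engine's result is
    fed back inside <info> tags, so the next thinking step can observe it.
    """
    transcript = question
    for _ in range(max_turns):
        turn = llm_generate(transcript)   # e.g. "<think>...</think><tool>...</tool>"
        transcript += turn
        answer = re.search(r"<answer>(.*?)</answer>", turn, re.S)
        if answer:                        # terminal turn: final answer produced
            return answer.group(1).strip()
        tool_call = re.search(r"<tool>(.*?)</tool>", turn, re.S)
        if tool_call:                     # environment feedback closes the loop
            result = execute_tool(tool_call.group(1).strip())
            transcript += f"<info>{result}</info>"
    return None
```

The key design point is that planning and execution interleave: the transcript grows turn by turn, so the workflow topology can change in response to each `<info>` observation rather than being fixed up front.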
The central technical contribution is Group Sub‑sequence Policy Optimization (GSsPO), a reinforcement‑learning (RL) algorithm that aligns the optimization granularity with the atomic decision unit of the interaction—the Think‑Action sub‑sequence. Prior RL approaches for LLMs operate either at the token level (e.g., GRPO) or at the full‑sequence level (e.g., GSPO). Token‑level updates ignore the semantic coherence of a decision, while sequence‑level updates treat an entire multi‑turn interaction as a monolithic unit, obscuring which specific steps contributed to success or failure. GSsPO parses each generated response into a set of contiguous sub‑sequences, each corresponding to one turn's Think‑Action pair. For each sub‑sequence s, it computes an importance‑sampling ratio r_s(θ) as the geometric mean of token‑level probability ratios, and assigns the same advantage Â_s (derived from group‑wise reward normalization) to all tokens within s. The loss aggregates over all sub‑sequences, normalizing by the number of sub‑sequences |S_i| to neutralize length bias. This design ensures that gradient updates respect the logical boundaries of the agent's reasoning steps, providing a middle ground between overly fine‑grained and overly coarse‑grained optimization.
Reward design combines two components: (1) Format Reward (R_Format), a rule‑based penalty that checks the correct ordering and presence of the required tags (<think>, <tool>, <info>, <answer>), encouraging strict adherence to the interface protocol; and (2) Outcome Reward (R_Outcome), the Exact Match (EM) score between the final answer and the ground truth, directly measuring task success. The total reward is the sum of these terms, guiding the model to produce both syntactically valid workflows and correct answers.
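A minimal sketch of this two-part reward is given below. The tag grammar, the 0/-1 format penalty values, and the EM normalization are assumptions chosen for illustration; the paper's exact rules and magnitudes may differ.

```python
import re
import string

def _normalize(text):
    """Standard EM-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(c for c in text if c not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def format_reward(response):
    """Rule-based check that tags are present and well-ordered:
    zero or more <think><tool><info> cycles, then a final <think><answer>."""
    pattern = (r"(?:<think>.*?</think>\s*<tool>.*?</tool>\s*<info>.*?</info>\s*)*"
               r"<think>.*?</think>\s*<answer>.*?</answer>")
    return 0.0 if re.fullmatch(pattern, response.strip(), re.S) else -1.0

def outcome_reward(response, ground_truth):
    """Exact Match between the extracted <answer> span and the ground truth."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    if not m:
        return 0.0
    return float(_normalize(m.group(1)) == _normalize(ground_truth))

def total_reward(response, ground_truth):
    # R = R_Format + R_Outcome, as described above.
    return format_reward(response) + outcome_reward(response, ground_truth)
```

Keeping the format term rule-based (no learned judge) makes the reward cheap to compute and hard to game, while the EM term ties the training signal directly to task success.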
Experiments were conducted with two strong LLM backbones—Qwen2.5‑32B‑Instruct and DeepSeek‑V3.2—across seven QA benchmarks (NQ, TriviaQA, PopQA, HpQA, Wiki, Musique, Bamb). Workflow‑R1 was compared against several strong baselines: static workflow generators (AFlow, MaAS, SC‑MedPrompt), token‑level GRPO, and sequence‑level GSPO. Across all datasets, Workflow‑R1 achieved statistically significant improvements, typically 3–5 absolute percentage points higher than the best baselines, with especially large gains (up to 7 points) on datasets requiring multi‑step reasoning (PopQA, Musique). These results demonstrate that aligning the RL update granularity with the Think‑Action sub‑sequence yields more efficient learning and better final performance.
The authors acknowledge several limitations. The current sub‑sequence parser assumes a simple linear turn structure (<think> → <tool>), which may not capture more complex workflow patterns such as parallel branches, conditional loops, or nested tool calls. The evaluation is limited to QA tasks; applying the framework to other domains (e.g., code generation pipelines, data‑processing workflows) remains future work. Finally, RL training is computationally expensive; improving sample efficiency through on‑policy methods, curriculum learning, or human‑in‑the‑loop feedback could further enhance practicality.
In summary, Workflow‑R1 introduces a paradigm shift from static code synthesis to dynamic, language‑driven workflow construction, and proposes GSsPO as a generalizable structure‑aware RL algorithm. By treating each Think‑Action pair as the fundamental optimization unit, the method bridges the granularity mismatch that has hampered prior approaches, achieving state‑of‑the‑art results on diverse multi‑turn reasoning benchmarks and opening new avenues for flexible, adaptive LLM‑driven automation.