Learning to Compose for Cross-domain Agentic Workflow Generation
Automatically generating agentic workflows – executable operator graphs or code that orchestrate reasoning, verification, and repair – has become a practical way to solve complex tasks beyond what single-pass LLM generation can reliably handle. Yet what constitutes a good workflow depends heavily on the task distribution and the available operators. Under domain shift, current systems typically rely on iterative refinement to discover a feasible workflow in a large workflow space, incurring high iteration costs and yielding unstable, domain-specific behavior. In response, we internalize a decompose-recompose-decide mechanism into an open-source LLM for cross-domain workflow generation. To decompose, we learn a compact set of reusable workflow capabilities across diverse domains. To recompose, we map each input task to a sparse composition over these bases, generating a task-specific workflow in a single pass. To decide, we attribute the success or failure of generated workflows to the counterfactual contributions of the learned capabilities, capturing which capabilities actually drive success through their marginal effects. Across stringent multi-domain, cross-domain, and unseen-domain evaluations, our one-pass generator surpasses SOTA refinement baselines that consume 20 iterations, while substantially reducing generation latency and cost.
💡 Research Summary
The paper tackles the problem of generating agentic workflows—executable graphs or code that orchestrate reasoning, verification, and repair—across multiple domains without resorting to costly iterative refinement. Existing approaches either rely on manually crafted workflow templates or treat workflow generation as an external optimization loop that repeatedly samples, evaluates, and mutates candidate workflows. While effective, these methods incur high inference latency and computational cost, and their performance becomes unstable when the task distribution shifts to a new domain.
To overcome these limitations, the authors propose CapFlow, a framework that internalizes a “decompose‑recompose‑decide” mechanism directly into an open‑source large language model (LLM). The core idea is to learn a small set of workflow capability bases—latent, reusable factors that capture recurring patterns such as multifaceted analysis, verification/repair loops, and aggregation—across diverse domains. Each base is implemented as a low‑rank adapter (ΔBₖ = cₖ Uₖ Vₖᵀ) injected into selected linear layers (e.g., attention projections) of a frozen backbone LLM.
During inference, a task‑conditioned capability composer maps an input query q to a sparse composition over these bases. The composer predicts a binary mask and a set of scaling coefficients, effectively selecting only a few bases (typically 3‑4) that are most relevant to the current task. This sparse selection is trained jointly with three objectives: (1) supervised imitation of successful workflows, (2) a preference‑based loss that contrasts successful and failed workflow instances, and (3) an L₁ regularizer encouraging sparsity.
The decide stage introduces a counterfactual attribution mechanism. For each base, the model measures the marginal change in overall success rate when the base is removed or its scale is set to zero. These marginal effects are used to update the per‑base scaling parameters, ensuring that only bases that truly contribute to success are reinforced. This attribution also provides interpretability, revealing which capabilities drive performance in each domain.
The authors curate a multi‑domain dataset comprising over 180 unique workflows (both successful and failed) spanning coding, mathematics, and general reasoning tasks. Experiments evaluate three scenarios: (1) multi‑domain (training and testing on the same domains), (2) cross‑domain (testing on a domain unseen during training), and (3) unseen‑domain (completely novel task types). CapFlow, operating in a single pass, consistently outperforms strong baselines that require up to 20 refinement iterations (e.g., AFlow, MASS, ScoreFlow). Across all settings, CapFlow achieves higher success rates (≈5–10 percentage points), reduces average inference latency from ~8 seconds to ~1.2 seconds, and cuts computational cost by over 80%.
Ablation studies confirm the importance of each component: replacing the capability bases with full‑model fine‑tuning degrades performance; using dense (non‑sparse) composition inflates cost without accuracy gains; and omitting counterfactual attribution leads to biased base scaling and poorer generalization. Visualization of the learned bases shows that some specialize in verification/repair while others capture domain‑agnostic analysis, supporting the hypothesis that workflow generation can be reduced to recombining a handful of latent factors.
Limitations include the need to manually set the number of bases (K) and their rank (r), and the reliance on sufficient success/failure examples for stable counterfactual estimation. The authors suggest future work on automatic basis expansion, more sample‑efficient causal attribution, and deployment in real‑world multi‑agent systems.
In summary, CapFlow demonstrates that by learning reusable workflow capability bases and a sparse, task‑conditioned composition mechanism, an LLM can generate high‑quality, executable agentic workflows in a single pass, achieving substantial gains in efficiency, cost, and cross‑domain robustness compared to traditional iterative refinement pipelines.