World of Workflows: A Benchmark for Bringing World Models to Enterprise Systems

Notice: This research summary and analysis were generated automatically using AI technology. For authoritative details, please refer to the original arXiv source.

Frontier large language models (LLMs) excel as autonomous agents in many domains, yet they remain untested in complex enterprise systems where hidden workflows create cascading effects across interconnected databases. Existing enterprise benchmarks evaluate surface-level agentic task completion, much like general consumer benchmarks, and ignore the true challenges of enterprises: limited observability, large database state, and hidden workflows with cascading side effects. We introduce World of Workflows (WoW), a realistic ServiceNow-based environment incorporating 4,000+ business rules and 55 active workflows embedded in the system, alongside WoW-bench, a benchmark of 234 tasks evaluating constrained agentic task completion and enterprise dynamics modeling capabilities. We reveal two major takeaways: (1) Frontier LLMs suffer from dynamics blindness, consistently failing to predict the invisible, cascading side effects of their actions, which leads to silent constraint violations, and (2) reliability in opaque systems requires grounded world modeling, where agents must mentally simulate hidden state transitions to bridge the observability gap when high-fidelity feedback is unavailable. For reliable and useful enterprise agents, WoW motivates a new paradigm of explicitly learning system dynamics. We release our GitHub repository for setting up and evaluating WoW.


💡 Research Summary

The paper introduces World of Workflows (WoW), a high‑fidelity ServiceNow‑based enterprise environment that embeds over 4,000 business rules and 55 active workflows, thereby reproducing the hidden, cascading dynamics typical of real‑world corporate IT systems. On top of this environment the authors build WoW‑bench, a benchmark suite comprising 234 tasks organized into four categories: (1) autonomous task completion, (2) data‑level constraint understanding, (3) dynamics prediction, and (4) tool prediction. Each task presents a natural‑language user query and explicit constraints (e.g., "a user may not hold assets whose clearance level exceeds the user's clearance"), and requires the agent to interact with the system via MCP (Model Context Protocol) tools.

The authors formalize WoW as a Partially Observable Markov Decision Process (POMDP). The state space S is the entire relational database, which is intractably large; the observation space O can be either (i) standard tool‑response only, which returns success/failure messages and fetched records, or (ii) an “oracle” observation that augments the tool response with a table‑audit log describing every column‑wise change caused by the action, including those triggered by hidden workflows. This dual‑observation design enables an ablation study of how much state visibility influences agent performance.
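The two observation modes can be sketched as a thin wrapper over a tool call. This is a minimal illustration under assumed types; `ToolResponse`, `AuditEntry`, and `observe` are hypothetical names, not the paper's API.

```python
from dataclasses import dataclass, field

@dataclass
class ToolResponse:
    success: bool
    records: list[dict] = field(default_factory=list)

@dataclass
class AuditEntry:
    table: str          # table whose row changed
    record_id: str
    column: str
    old: object
    new: object
    triggered_by: str   # the business rule or workflow responsible

def observe(response: ToolResponse,
            audit_log: list[AuditEntry],
            oracle: bool) -> tuple[ToolResponse, list[AuditEntry]]:
    """Standard mode returns only the tool response; oracle mode also
    exposes every column-wise change, including those caused by
    hidden workflows."""
    return (response, audit_log) if oracle else (response, [])
```

The ablation then amounts to running the same agent with `oracle=False` versus `oracle=True` and comparing task success.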

Experiments evaluate several frontier large language models (GPT‑4, Claude‑3, Llama‑2‑70B, etc.) acting as zero‑shot agents. The results reveal two dominant failure modes. First, “dynamics blindness”: agents consistently miss the invisible side‑effects of workflow‑driven updates, leading to silent constraint violations. For example, assigning an asset to a user may trigger a workflow that decrements the user’s clearance level; the agent, seeing only the tool’s success message, proceeds unaware that a subsequent constraint (“clearance must be ≥ asset level”) is now broken. Second, the gap between the two observation modes is stark: providing audit logs boosts task success rates by up to sevenfold, demonstrating that explicit visibility into state transitions dramatically improves reliability. However, audit logs are costly, latency‑prone, and often require privileged access, making them unrealistic for many production settings.
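The clearance example above can be made concrete with a toy simulation. Everything here is a hypothetical sketch (the `assign_asset` and `violates_constraint` helpers and the dict-based database are invented for illustration); the point is only that the caller's view never reveals the side effect.

```python
def assign_asset(db: dict, user: str, asset: str) -> dict:
    """Tool call: assign an asset to a user."""
    db["assignments"].append((user, asset))
    # Hidden business rule fires server-side: assigning a restricted
    # asset decrements the user's clearance, invisibly to the caller.
    if db["assets"][asset]["restricted"]:
        db["users"][user]["clearance"] -= 1
    return {"success": True}  # the agent only ever sees this

def violates_constraint(db: dict, user: str) -> bool:
    """True if the user now holds an asset above their clearance."""
    lvl = db["users"][user]["clearance"]
    return any(db["assets"][a]["level"] > lvl
               for u, a in db["assignments"] if u == user)
```

Starting from clearance 3 and a level-3 restricted asset, the assignment succeeds, the hidden rule drops the clearance to 2, and the constraint is silently violated even though the tool reported success.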

The paper also positions WoW‑bench against prior enterprise benchmarks (WorkArena++, CRMArena‑Pro, SCUBA, etc.). Existing suites either focus on UI navigation, lack realistic workflow orchestration, or do not evaluate constraint satisfaction and world‑modeling. In contrast, WoW‑bench explicitly tests an agent’s ability to infer hidden dynamics, respect multi‑hop constraints, and predict downstream effects—all essential for trustworthy enterprise automation.

Limitations acknowledged include the focus on API‑based tool calls (ignoring complex UI interactions) and the reliance on an oracle audit log that may not be deployable in practice. The authors suggest future research directions: (1) developing efficient state‑inference mechanisms that operate under limited observability, (2) learning meta‑models of workflows so agents can predict hidden transitions without explicit logs, and (3) designing low‑overhead, privacy‑preserving alternatives to full audit logs (e.g., event‑stream summaries).

In conclusion, WoW and WoW‑bench expose a critical gap in current LLM‑driven enterprise agents: the inability to model and anticipate hidden workflow dynamics. The benchmark provides a realistic testbed for advancing world‑model learning, constraint‑aware planning, and robust tool use, thereby charting a path toward reliable, production‑grade AI assistants for complex enterprise environments.

