DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task

Notice: This research summary and analysis were automatically generated using AI. For full accuracy, please refer to the original arXiv source.

Multi-agent frameworks promise to simplify LLM-driven software development, yet there is no principled way to evaluate their developer experience in a controlled setting. We introduce DDL2PropBank, a novel benchmark task that maps relational database schemas to PropBank rolesets, requiring autonomous retrieval of candidate frames and fine-grained linguistic reasoning over table names, columns, and relations. Using the Agent-as-a-Tool pattern, we implement identical agent logic across 10 frameworks and evaluate along two dimensions: (i) code complexity via static analysis, and (ii) AI-assistability – the extent to which LLMs can autonomously generate correct, framework-specific code. Our results reveal a three-tier complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead. For AI-assistability, structural alignment scores reliably proxy runtime success for frameworks with single canonical patterns, but overestimate correctness for multi-pattern frameworks. Agno emerges as the strongest overall performer, combining lowest complexity with highest structural alignment and 83% pass@1.


💡 Research Summary

The paper introduces DDL2PropBank, a novel benchmark designed to evaluate the developer experience of multi‑agent frameworks (MAFs) in a controlled, reproducible setting. The task requires mapping relational database schemas, expressed in Data Definition Language (DDL), to PropBank rolesets. For each table, the system must identify appropriate PropBank frames, assign columns to semantic arguments (ARG0, ARG1, …), and produce a confidence score reflecting the semantic fit. This demands fine‑grained linguistic reasoning over table names, column names, and foreign‑key relationships, making the benchmark sufficiently complex to stress‑test LLM‑assisted code generation while remaining novel with respect to typical LLM training data.
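To make the task concrete, the sketch below shows what a single table-to-roleset mapping might look like. The table, the `RolesetMapping` container, and the specific argument bindings are illustrative assumptions, not taken from the paper; `order.01` is a real PropBank roleset used here only as a plausible example.

```python
from dataclasses import dataclass

@dataclass
class RolesetMapping:
    """Hypothetical container for one table's mapping (not the paper's schema)."""
    table: str
    roleset: str        # PropBank frame sense, e.g. "order.01"
    arg_bindings: dict  # column name -> PropBank argument label (ARG0, ARG1, ...)
    confidence: float   # semantic-fit score in [0, 1]

# An example DDL table the system would have to analyze.
ddl = """
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    product_id INTEGER REFERENCES products(id),
    ordered_at TIMESTAMP
);
"""

# Illustrative output: the bindings and score are invented for this sketch.
mapping = RolesetMapping(
    table="orders",
    roleset="order.01",
    arg_bindings={
        "customer_id": "ARG0",  # the orderer
        "product_id": "ARG1",   # the thing ordered
    },
    confidence=0.87,
)
```

The foreign keys (`customer_id`, `product_id`) are what make the argument assignment non-trivial: the system must reason that the referenced entities, not the raw column types, fill the semantic roles.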

To isolate framework‑specific effects, the authors implement the same “Agent‑as‑a‑Tool” architecture across ten popular MAFs, including vendor‑maintained SDKs (Claude SDK, OpenAI Agents, Google ADK) and community projects (Pydantic AI, Agno, DSPy, LangChain, Smolagents, Microsoft Agent Framework, AgentScope). The architecture consists of three agent types: an Orchestrator that drives the overall workflow, a Coordinator that tracks progress and ensures idempotent execution, and parallel Table Mapper agents that process individual tables. All agents interact with shared Model Context Protocol (MCP) servers that expose PropBank queries and filesystem operations, guaranteeing that any observed differences stem from framework design rather than functional variation.
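The three-role architecture above can be sketched framework-agnostically in plain Python. The class names, the stubbed mapper logic, and the idempotency check are assumptions for illustration; real implementations would route the mapper's work through an LLM and the shared MCP servers.

```python
class TableMapperAgent:
    """Maps one table to a PropBank roleset (LLM call stubbed out)."""

    def run(self, table: str) -> dict:
        # A real framework would prompt an LLM and query the PropBank
        # MCP server here; this stub just fabricates a plausible sense.
        return {"table": table, "roleset": f"{table.rstrip('s')}.01"}

class Coordinator:
    """Tracks progress so that re-running the workflow is idempotent."""

    def __init__(self):
        self.done: dict[str, dict] = {}

    def process(self, table: str, mapper: TableMapperAgent) -> dict:
        if table not in self.done:  # skip tables already mapped
            self.done[table] = mapper.run(table)
        return self.done[table]

class Orchestrator:
    """Drives the overall workflow: one mapper invocation per table."""

    def __init__(self):
        self.coordinator = Coordinator()
        self.mapper = TableMapperAgent()

    def map_schema(self, tables: list[str]) -> dict:
        return {t: self.coordinator.process(t, self.mapper) for t in tables}

results = Orchestrator().map_schema(["orders", "customers"])
```

In the Agent-as-a-Tool pattern, each framework would expose `TableMapperAgent` to the orchestrating agent as a callable tool; what varies across the ten frameworks is precisely how much boilerplate that registration requires.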

The evaluation proceeds along two orthogonal dimensions. First, code complexity is measured via static analysis of the reference implementations, using logical lines of code (LLOC) and cyclomatic complexity (CCN) as proxies for developer cognitive load. Results reveal a three‑tier spectrum: Pydantic AI and Agno achieve the lowest LLOC (52–54) and CCN (3–4), indicating minimal boilerplate; Smolagents, LangChain, OpenAI Agents, Google ADK, and Claude SDK occupy a middle tier; DSPy and AgentScope exhibit the highest complexity (LLOC up to 88, CCN up to 20) due to lack of native MCP support and extensive framework‑specific imports. The authors note a 1.7× variation in LLOC and a 3.3× variation in import count across frameworks, underscoring the impact of API design, unified tool registration, and native MCP integration on developer effort.
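The two static metrics can be approximated with the standard-library `ast` module. The paper does not specify its measurement tool, so the decision-point set and the statement-counting heuristic below are simplifying assumptions, not the authors' exact method.

```python
import ast

# Nodes treated as decision points; this set is our own approximation
# of what common complexity tools count.
DECISION_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)

def approx_ccn(source: str) -> int:
    """Cyclomatic complexity ~= 1 + number of decision points."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(n, DECISION_NODES) for n in ast.walk(tree))

def approx_lloc(source: str) -> int:
    """Logical lines of code ~= number of executable statements."""
    tree = ast.parse(source)
    return sum(isinstance(n, ast.stmt) for n in ast.walk(tree))

snippet = "if ready:\n    run()\nelse:\n    wait()\n"
```

Applied to each framework's reference implementation, metrics like these yield the LLOC and CCN figures behind the three-tier spectrum reported above.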

Second, AI‑assistability is assessed by tasking GitHub Copilot (representative of modern AI coding assistants) with generating the full DDL2PropBank implementation for each framework from identical starter code and public documentation alone. Two metrics are collected: (i) structural alignment, i.e., how closely the generated code matches the human‑written reference in terms of idiomatic API usage and control flow; and (ii) functional correctness, measured as pass@1 – the proportion of generated implementations that execute end‑to‑end and produce correct schema‑to‑PropBank mappings on a held‑out test database. For frameworks with a single canonical pattern (Agno, Claude SDK, OpenAI Agents), structural alignment reliably predicts runtime success, yielding 83% pass@1. In contrast, for multi‑pattern frameworks such as LangChain, alignment overestimates correctness: generated code can match one valid idiom structurally yet still fail at runtime, so high alignment scores do not guarantee true assistability when the framework admits many implementation styles.
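The pass@1 figure is the simple success rate, but it is often reported via the general unbiased pass@k estimator (Chen et al., 2021): pass@k = 1 − C(n−c, k)/C(n, k) for n samples of which c pass. The paper's exact sampling protocol is not given here, so the function below is a sketch of the standard estimator, not the authors' evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c correct), passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: success guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k = 1 the estimator reduces to the raw success rate c / n,
# e.g. 5 correct out of 6 generations gives ~0.833.
score = pass_at_k(6, 5, 1)
```

Note that 5/6 ≈ 0.833 is used here only to show how an 83% pass@1 could arise arithmetically; the paper's actual sample counts are not stated in this summary.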

Overall, Agno emerges as the strongest performer, combining the lowest code complexity, highest structural alignment, and 83% pass@1, making it especially well‑suited for AI‑augmented development. Pydantic AI follows closely in complexity but lags slightly in alignment. The study’s contributions are threefold: (1) the first benchmark (DDL2PropBank) for systematic MAF evaluation; (2) a dual‑dimensional methodology that jointly measures static code burden and AI‑assistability; and (3) an open‑source PropBank MCP server and associated tooling for semantic annotation of database schemas. The findings suggest that framework designers should prioritize native MCP support, minimalistic and consistent APIs, and high‑quality documentation to lower developer effort and maximize the benefits of LLM‑based coding assistants.

