A Comprehensive Benchmark for Enterprise Data Pipeline Automation

📝 Abstract

Real-world enterprise data intelligence workflows encompass data engineering, which turns raw sources into analysis-ready tables, and data analysis, which converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under changing requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM judge guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at da-comp.github.io.
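The "execution-based, multi-metric evaluation" of DE tasks can be pictured as running the agent-built pipeline and comparing its output tables against reference tables. The sketch below is an illustrative stand-in, not DAComp's actual harness: it scores exact row-set matches per table with SQLite and averages them, and every function and table name here is invented for the example.

```python
import sqlite3

def table_rows(conn, table):
    """Fetch all rows of a table as a set of tuples (order-insensitive).
    Table name is interpolated directly; fine for a sketch, not for untrusted input."""
    return set(conn.execute(f"SELECT * FROM {table}").fetchall())

def execution_match(agent_db, ref_db, tables):
    """Per-table exact-match scores plus an aggregate, in the spirit of
    execution-based, multi-metric evaluation (illustrative only)."""
    scores = {}
    with sqlite3.connect(agent_db) as a, sqlite3.connect(ref_db) as r:
        for t in tables:
            try:
                scores[t] = float(table_rows(a, t) == table_rows(r, t))
            except sqlite3.OperationalError:  # table missing or unreadable
                scores[t] = 0.0
    scores["aggregate"] = sum(scores.values()) / len(tables)
    return scores
```

A real harness would add more metrics (schema checks, partial row credit, pipeline-stage diagnostics); exact row-set equality is just the simplest execution-grounded signal.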

📄 Content

DAComp: Benchmarking Data Agents across the Full Data Intelligence Lifecycle

Fangyu Lei1,2,3∗§, Jinxiang Meng1,2∗, Yiming Huang5, Junjie Zhao3, Yitong Zhang6, Jianwen Luo1,2, Xin Zou3, Ruiyi Yang3, Wenbo Shi3, Yan Gao3, Shizhu He1,2, Zuo Wang3, Qian Liu4, Yang Wang3, Ke Wang3,†, Jun Zhao1,2, Kang Liu1,2,†

1Institute of Automation, CAS  2University of Chinese Academy of Sciences  3ByteDance Seed  4TikTok  5UC San Diego  6NUS
∗Equal Contribution  §Work done at ByteDance Seed  †Corresponding authors

Date: December 5, 2025 (arXiv:2512.04324v1 [cs.CL], submitted 3 Dec 2025)
Correspondence: Kang Liu at kliu@nlpr.ia.ac.cn, Ke Wang at wangke@bytedance.com
Project Page: da-comp.github.io

1 Introduction

Data intelligence, the process of transforming raw and fragmented data into actionable insights, has become a cornerstone of modern enterprises. The remarkable reasoning and code-generation capabilities of Large Language Models (LLMs) [1, 8, 23] have opened new avenues for automating data intelligence tasks. LLM-based agents have demonstrated considerable promise across a wide range of applications, including text-to-SQL [18, 20, 39], software engineering [3, 14], and general computer control [31, 33, 42]. However, the advancement of these agents into enterprise data intelligence remains constrained by the absence of benchmarks that

Table #12
  • name: fct_opportunity_health
  • description: Scores each opportunity’s health to identify ‘zombie’ deals based on sales engagement.
  • source_tables: [int_opportunities_with_age, int_activities_per_deal]
  • columns:
    • name: opportunity_health_score
    • description: Prioritized rules for health score [0–100]:
      1. High Engagement (>= 5 activities in 30d) => 90
      2. Stale Deal (no activity for > 30d) => 10
      3. Stuck Deal (in stage > 90d) => 25
      4. default => 60

Table #11
  • name: int_activity_summary … (+n more tables)

[Figure 1 (image): full-lifecycle workflow in four stages: ❶ DE: Architecture — write the design document (design specifications); ❷ DE: Implementation — write SQLs to build the DE-DAG; ❸ DE: Evolution — write SQLs to fix/update the DE-DAG (e.g., “Fix 3 SQLs, Add 4 SQLs”); ❹ DA: Insight Generation — write Python/SQL for open-ended analysis over semantic-layer data. Raw data sources flow through a codebase of staging (stg_opportunity.sql), intermediate (int_opportunity_pipeline.sql, int_opportunity_velocity.sql), and marts (fct_sales_velocity.sql, fct_funnel_conversion.sql, revenue_forecast.sql) SQL models into an analytical report with visualization. Example request: “Please help me to identify ‘zombie’ opportunities in Salesforce via engagement analysis.” Example analysis question: “Sales Velocity & Funnel Conversion?” Key insight: “Our Q3 analysis shows that 40% of the sales pipeline value is comprised of ‘unhealthy’ opportunities that have seen no sales activity in over 30 days. Stale deals are inflating the forecast, which masks a critical slowdown in sales velocity.”]

Figure 1: DAComp aims to evaluate LLMs on full-lifecycle data intelligence workflows, encompassing repository-level data engineering (DE) and open-ended data analysis (DA).

faithfully reflect real-world complexity. This gap between existing benchmarks and real enterprise practice calls for a benchmark that evaluates agents along two distinct axes: Hard (engineering realism) and Soft (analytical openness). The Hard axis reflects t
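The prioritized rules in the fct_opportunity_health specification above map naturally onto a single SQL CASE expression evaluated in priority order. The following is a minimal sketch, not the benchmark's reference solution: the input column names (activities_30d, days_since_activity, days_in_stage) are assumptions, since the intermediate table schemas are not shown.

```python
import sqlite3

# Prioritized rules from the fct_opportunity_health spec:
#   1. High Engagement (>= 5 activities in 30d) => 90
#   2. Stale Deal (no activity for > 30d)       => 10
#   3. Stuck Deal (in stage > 90d)              => 25
#   4. default                                  => 60
# CASE evaluates branches top-down, which encodes the rule priority.
# Input column names below are assumptions; the paper does not list them.
HEALTH_SCORE_SQL = """
SELECT opportunity_id,
       CASE
         WHEN activities_30d >= 5      THEN 90
         WHEN days_since_activity > 30 THEN 10
         WHEN days_in_stage > 90       THEN 25
         ELSE 60
       END AS opportunity_health_score
FROM int_opportunities_with_age
JOIN int_activities_per_deal USING (opportunity_id)
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE int_opportunities_with_age "
             "(opportunity_id, days_in_stage, days_since_activity)")
conn.execute("CREATE TABLE int_activities_per_deal "
             "(opportunity_id, activities_30d)")
conn.executemany("INSERT INTO int_opportunities_with_age VALUES (?,?,?)",
                 [(1, 10, 2), (2, 120, 5), (3, 40, 45)])
conn.executemany("INSERT INTO int_activities_per_deal VALUES (?,?)",
                 [(1, 6), (2, 1), (3, 0)])
# Opp 1 is highly engaged (90); opp 3 is stale (10); opp 2 is stuck (25).
scores = dict(conn.execute(HEALTH_SCORE_SQL).fetchall())
```

Note that the stale-deal rule outranks the stuck-deal rule: a deal that is both stale and stuck scores 10, not 25, because the earlier WHEN branch wins.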
