RAG-Driven Data Quality Governance for Enterprise ERP Systems

Reading time: 5 minutes

📝 Original Info

  • Title: RAG-Driven Data Quality Governance for Enterprise ERP Systems
  • ArXiv ID: 2511.16700
  • Date: 2025-11-24
  • Authors: Sedat Bin Vedat, Enes Kutay Yarkan, Meftun Akarsu, Recep Kaan Karaman, Arda Sar, Çağrı Çelikbilek, Savaş Saygılı

📝 Abstract

Enterprise ERP systems managing hundreds of thousands of employee records face critical data quality challenges when human resources departments perform decentralized manual entry across multiple languages. We present an end-to-end pipeline combining automated data cleaning with LLM-driven SQL query generation, deployed on a production system managing 240,000 employee records over six months. The system operates in two integrated stages: a multi-stage cleaning pipeline that performs translation normalization, spelling correction, and entity deduplication during periodic synchronization from Microsoft SQL Server to PostgreSQL; and a retrieval-augmented generation framework powered by GPT-4o that translates natural-language questions in Turkish, Russian, and English into validated SQL queries. The query engine employs LangChain orchestration, FAISS vector similarity search, and few-shot learning with 500+ validated examples. Our evaluation demonstrates 92.5% query validity, 95.1% schema compliance, and 90.7% semantic accuracy on 2,847 production queries. The system reduces query turnaround time from 2.3 days to under 5 seconds while maintaining 99.2% uptime, with GPT-4o achieving 46% lower latency and 68% cost reduction versus GPT-3.5. This modular architecture provides a reproducible framework for AI-native enterprise data governance, demonstrating real-world viability at enterprise scale with 4.3/5.0 user satisfaction.
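The retrieval step described in the abstract (similar validated examples are looked up and placed in the prompt as few-shot demonstrations) can be illustrated with a minimal sketch. This is not the authors' implementation: a toy bag-of-words cosine similarity stands in for the dense embeddings and FAISS index the paper reports, and the example pairs, the `employees` table, and its column names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical validated (question, SQL) pairs. The paper reports 500+ such
# examples; their content and the schema used here are assumptions.
EXAMPLES = [
    ("How many civil engineers work on the GPP project?",
     "SELECT COUNT(*) FROM employees "
     "WHERE title = 'Civil Engineer' AND project = 'GPP';"),
    ("List all employees located in Moscow",
     "SELECT full_name FROM employees WHERE location = 'Moscow';"),
    ("What is the average salary per department?",
     "SELECT department, AVG(salary) FROM employees GROUP BY department;"),
]


def embed(text):
    """Toy bag-of-words vector (stand-in for a dense embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, k=2):
    """Return the k stored examples most similar to the question."""
    q = embed(question)
    ranked = sorted(EXAMPLES, key=lambda ex: cosine(q, embed(ex[0])),
                    reverse=True)
    return ranked[:k]


def build_prompt(question, schema, k=2):
    """Assemble a few-shot prompt: schema, retrieved examples, new question."""
    shots = "\n\n".join(f"Q: {q}\nSQL: {s}" for q, s in retrieve(question, k))
    return f"Schema:\n{schema}\n\n{shots}\n\nQ: {question}\nSQL:"
```

In a production setting the brute-force `sorted` call would be replaced by a vector index lookup, and the returned prompt would be sent to the language model. The generated SQL would still need validation against the schema before execution.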

📄 Full Content

RAG-Driven Data Quality Governance for Enterprise ERP Systems

Sedat Bin Vedat∗, Enes Kutay Yarkan∗, Meftun Akarsu∗, Recep Kaan Karaman∗, Arda Sar∗, Çağrı Çelikbilek, Savaş Saygılı
∗Hagia Labs
{sedat, larusa, meftun, kaan, arda}@hagiaproject.com

Abstract: Enterprise ERP systems managing hundreds of thousands of employee records face critical data quality challenges when human resources departments perform decentralized manual entry across multiple languages. We present an end-to-end pipeline combining automated data cleaning with LLM-driven SQL query generation, deployed on a production system managing 240,000 employee records over six months. The system operates in two integrated stages: a multi-stage cleaning pipeline that performs translation normalization, spelling correction, and entity deduplication during periodic synchronization from Microsoft SQL Server to PostgreSQL; and a retrieval-augmented generation framework powered by GPT-4o that translates natural-language questions in Turkish, Russian, and English into validated SQL queries. The query engine employs LangChain orchestration, FAISS vector similarity search, and few-shot learning with 500+ validated examples. Our evaluation demonstrates 92.5% query validity, 95.1% schema compliance, and 90.7% semantic accuracy on 2,847 production queries. The system reduces query turnaround time from 2.3 days to under 5 seconds while maintaining 99.2% uptime, with GPT-4o achieving 46% lower latency and 68% cost reduction versus GPT-3.5. This modular architecture provides a reproducible framework for AI-native enterprise data governance, demonstrating real-world viability at enterprise scale with 4.3/5.0 user satisfaction.

Index Terms: Data quality automation, ERP systems, natural language to SQL, large language models, retrieval augmented generation, few-shot learning, multilingual data processing

I. Introduction

When an HR analyst at a multinational construction company needs to answer "How many civil engineers are working on the GPP project in Moscow?", the seemingly simple question becomes a multi-day ordeal. The analyst must contact the IT department, explain the request, wait while IT staff navigate inconsistent data where "Moscow" appears as "Moskva," "Moscow," and "Moskva" in Cyrillic script, manually reconcile project codes stored as "GPP," "Gpp," and "gpp project," and filter between payroll employees and contractors using undocumented business rules. Two days later, the answer arrives, potentially outdated.

This scenario, repeated thousands of times annually in organizations managing 240,000+ employee records, reveals a critical enterprise challenge: data quality degradation and accessibility barriers prevent organizations from leveraging their own information. The problem has two interconnected roots: (1) decentralized manual data entry by HR departments across multiple languages creates severe inconsistencies, and (2) SQL expertise requirements create bottlenecks that delay routine analytics by days.

In the studied environment, more than 240,000 employee records were distributed across multiple HR-managed tables within Microsoft SQL Server (MSSQL), later migrated to PostgreSQL for higher flexibility and scalability. The lack of strict schema discipline and the presence of user-defined fields resulted in extensive anomalies, including miscategorized contractor ("non-payroll") data, misplaced foreign keys, and conflicting entries in project and location fields.

To address these issues, we designed and implemented a fully automated data-cleaning and intelligent query-generation pipeline, built upon Large Language Models (LLMs) and retrieval-augmented few-shot learning. The system performs continuous data cleaning and translation across multilingual fields, followed by automatic SQL query generation from natural-language inputs.
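To make the kind of inconsistency described above concrete, the following is a minimal sketch of rule-based canonicalization for the location and project-code variants quoted in the introduction. It is an illustration under stated assumptions, not the paper's cleaning pipeline, which the abstract describes as performing translation normalization, spelling correction, and entity deduplication. The alias table, the function names, and the Cyrillic spelling "Москва" used as a lookup key are assumptions introduced here.

```python
import re
import unicodedata

# Hypothetical alias table built from the variants quoted in the text;
# a real deployment would maintain a far larger mapping.
LOCATION_ALIASES = {
    "moscow": "Moscow",
    "moskva": "Moscow",
    "москва": "Moscow",  # Cyrillic spelling, assumed for illustration
}

# Matches a trailing " project" suffix on an already case-folded value.
PROJECT_SUFFIX = re.compile(r"\s+project$")


def fold(value):
    """Unicode-normalize, collapse whitespace, trim, and case-fold a field."""
    text = unicodedata.normalize("NFKC", value)
    return re.sub(r"\s+", " ", text).strip().casefold()


def normalize_location(value):
    """Map known spelling variants to one canonical name; pass others through."""
    return LOCATION_ALIASES.get(fold(value), value.strip())


def normalize_project_code(value):
    """Reduce variants such as 'Gpp' or 'gpp project' to the code 'GPP'."""
    return PROJECT_SUFFIX.sub("", fold(value)).upper()


if __name__ == "__main__":
    for raw in ["Moskva", " Москва ", "Moscow"]:
        print(repr(raw), "->", normalize_location(raw))
    for raw in ["GPP", "Gpp", "gpp project"]:
        print(repr(raw), "->", normalize_project_code(raw))
```

A lookup table like this only covers variants that are already known. Python's `casefold` is also not locale-aware, so Turkish dotted and dotless I would need separate handling. Both limits suggest why a multilingual pipeline may need translation and spelling-correction stages in addition to simple rules.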
Our contributions are threefold:

1) Multilingual Data Quality Pipeline: We introduce an automated AI-driven cleaning system that resolves language-mixed inconsistencies across Turkish, Russian, and English text fields, achieving 97.8% accuracy on 240,000 real-world HR records. This addresses a challenge that existing tools like Deequ and HoloClean cannot handle due to their focus on numerical anomalies rather than multilingual semantic deduplication.

2) Schema-Constrained RAG for Enterprise SQL: We develop a retrieval-augmented few-shot framework that achieves 92.5% query validity on real enterprise schemas, exceeding commercial systems (68-78% accuracy) and prior academic work (52% on enterprise data [1]) through explicit business logic encoding and dynamic example retrieval, avoiding the 10,000+ training examples required for fine-tuning approaches.

3) Production-Grade Deployment Evidence: We provide six-month deployment metrics including 2,847 real queries, 99.2% uptime, 99.1% reduction in query turnaround time, and detailed cost analysis ($0.042/query), addressing the deployment gap where most NL2SQL research reports

…(Full text truncated)…

Reference

This content is AI-processed based on ArXiv data.
