Reading time: 27 minutes

๐Ÿ“ Original Info

  • Title:
  • ArXiv ID: 2512.18265
  • Date:
  • Authors: Unknown

๐Ÿ“ Abstract

Manufacturing planners face complex operational challenges that require seamless collaboration between human expertise and intelligent systems to achieve optimal performance in modern production environments. Traditional approaches to analyzing simulation-based manufacturing data often create barriers between human decision-makers and critical operational insights, limiting effective partnership in manufacturing planning. Our framework establishes a collaborative intelligence system integrating Knowledge Graphs and Large Language Model-based agents to bridge this gap, empowering manufacturing professionals through natural language interfaces for complex operational analysis. The system transforms simulation data into semantically rich representations, enabling planners to interact naturally with operational insights without specialized expertise. A collaborative LLM agent works alongside human decision-makers, employing iterative reasoning that mirrors human analytical thinking while generating precise queries for knowledge extraction and providing transparent validation. This partnership approach to manufacturing bottleneck identification, validated through operational scenarios, demonstrates enhanced performance while maintaining human oversight and decision authority. For operational inquiries, the system achieves near-perfect accuracy through natural language interaction. For investigative scenarios requiring collaborative analysis, we demonstrate the framework's effectiveness in supporting human experts to uncover interconnected operational issues that enhance understanding and decision-making. This work advances collaborative manufacturing by creating intuitive methods for actionable insights, reducing cognitive load while amplifying human analytical capabilities in evolving manufacturing ecosystems.

Full Content

Modern manufacturing systems represent complex ecosystems where human expertise and advanced technology must work in seamless partnership to achieve optimal performance. As Industry 4.0 and smart manufacturing initiatives transform production environments, the challenge of effective human-machine collaboration has become paramount (Zhong et al. 2017a; Fragapane et al. 2022). Human decision-makers in manufacturing planning bring irreplaceable domain knowledge, contextual understanding and adaptive problem-solving capabilities, while AI systems offer computational power, pattern recognition and data processing at scale. However, realizing the full potential of this partnership requires bridging the gap between human cognitive processes and complex technological systems, particularly in the analysis of intricate operational data.

Manufacturing facilities, including warehouses, distribution centers and production floors, are characterized by sophisticated interactions between personnel, automated equipment, processes and physical layouts (de Koster, Le-Duc, and Roodbergen 2007; Lee et al. 2017; Gu, Goetschalckx, and Mcginnis 2007). The complexity of these human-machine collaborative environments demands decision support systems that can interpret vast amounts of operational data while remaining accessible and interpretable to human planners. Discrete Event Simulation (DES) has emerged as a fundamental tool for modeling these systems (Banks 2005; Law, Kelton, and Kelton 2000), enabling stakeholders to evaluate performance, test design alternatives and understand system dynamics. However, a critical challenge persists: how can human decision-makers effectively collaborate with AI systems to extract actionable insights from the voluminous and highly granular output data that DES generates?

Traditional approaches to simulation analysis often create barriers between human expertise and technological capabilities. Conventional methods, frequently reliant on manual inspection of aggregate statistics (Law, Kelton, and Kelton 2000) or development of custom scripts tailored to specific simulation outputs, are not only time-intensive and error-prone but also limit the ability of human planners to engage in meaningful collaboration with AI systems. These approaches fail to leverage the complementary strengths of human intuition and machine computation, often requiring specialized technical expertise that creates silos between operational planners and analytical tools. As manufacturing environments become increasingly complex and dynamic, there is an urgent need for human-centric AI systems that can facilitate effective collaboration between human decision-makers and advanced analytical capabilities.

The emergence of Artificial Intelligence (AI) in manufacturing and logistics (Ivanov, Dolgui, and Sokolov 2019; Zhong et al. 2017a; Drissi Elbouzidi et al. 2023) presents unprecedented opportunities to create collaborative intelligence systems that augment rather than replace human expertise. However, realizing these opportunities requires addressing fundamental challenges in human-AI interaction: How can AI systems interpret user intentions while managing procedural uncertainties? How can humans and machines engage in bidirectional learning? How can collaborative partnerships be designed to achieve mutually beneficial goals in evolving manufacturing ecosystems?

To address these human-centric manufacturing challenges, our work proposes a novel framework that integrates Knowledge Graphs (KGs) (Hogan et al. 2021) and Large Language Models (LLMs) (Zhao et al. 2023; Pan et al. 2023) to enable effective human-AI collaboration in manufacturing planning through natural language interfaces for complex operational analysis. Our approach recognizes that the core challenge is not merely technical analysis, but rather creating systems that facilitate meaningful partnership between human planners and AI capabilities.

The foundation of our human-centric approach lies in transforming complex data generated by DES into semantically rich Knowledge Graphs that both humans and AI systems can effectively utilize. By representing simulation output as graphs, we enable the intricate dependencies and flows within manufacturing systems to be explicitly captured and collaboratively explored by human planners and AI agents. While KGs are increasingly applied to analyze real-world industrial and supply chain data for enhanced visibility and risk management (Noy et al. 2019; Kosasih et al. 2024), their application to creating human-AI collaborative interfaces for simulation analysis remains relatively unexplored.

Building upon this structured representation, our framework employs LLM-based agents to create intuitive natural language interfaces that allow human planners to engage in collaborative analysis without requiring specialized technical expertise. This approach democratizes access to complex analytical capabilities, enabling operations analysts, industrial engineers and manufacturing planners to engage in meaningful dialogue with AI systems using natural language queries. The LLM agent serves not as a replacement for human expertise, but as an intelligent partner that can translate human intentions into precise analytical operations while maintaining transparency and interpretability.

Our collaborative framework features an iterative reasoning mechanism (Luo et al. 2023) that enables bidirectional learning between human planners and AI systems. Given complex queries from human users, the AI agent autonomously generates sequences of sub-questions, each informed by both accumulated evidence and potential human feedback. This creates opportunities for human experts to guide the analytical process, validate intermediate findings and contribute domain knowledge that enhances the quality of insights. The agent's ability to perform self-reflection (Huang et al. 2022; Madaan et al. 2023) and correct its analytical pathway creates a transparent collaborative process where humans can understand, validate and improve AI reasoning.

This synthesis moves beyond traditional automation paradigms to create collaborative intelligence that leverages the unique strengths of both human expertise and AI capabilities. Human planners bring contextual understanding, strategic thinking and adaptive problem-solving, while AI systems contribute computational power, pattern recognition and systematic analysis. The result is a manufacturing planning assistant that facilitates reciprocal partnership, where humans and machines co-develop solutions, engage in mutual learning and achieve shared goals in dynamic manufacturing environments.

Modern manufacturing systems require sophisticated analytical approaches to manage complex interactions between personnel, equipment and processes (Zhong et al. 2017b). DES has become a fundamental tool for modeling these systems, enabling stakeholders to evaluate performance and test design alternatives (Banks 2005; Law, Kelton, and Kelton 2000). In the Industry 4.0 context, DES serves as a core component of Digital Twins, providing real-time operational insights for manufacturing decision-making (Rasheed, San, and Kvamsdal 2020; Leng et al. 2019). Despite its analytical power, extracting actionable insights from voluminous DES output data remains challenging. Traditional approaches rely on manual inspection of aggregate statistics or custom analytical scripts, which are time-intensive and often fail to uncover complex system behaviors (Negahban and Smith 2014). This has created demand for AI-driven approaches that can unlock deeper insights from simulation data while remaining accessible to manufacturing professionals.

Knowledge Graphs have gained prominence as powerful tools for representing and reasoning over complex industrial data (Noy et al. 2019). In manufacturing contexts, KGs enable structured representation of operational relationships, supporting applications in supply chain visibility, risk management and inventory optimization (Saidi et al. 2025). Prior work has also applied KGs to warehouse robot operations (Kattepur and P 2019) and production logistics (Zhao et al. 2022), demonstrating their potential for operational analysis. However, the application of KGs to structure and analyze DES output data remains relatively unexplored. While KGs excel at representing complex relational data, their specific use for simulation-based manufacturing planning represents a significant opportunity for enhancing analytical capabilities.

The integration of LLMs with Knowledge Graphs has emerged as a powerful approach for creating accessible AI systems (Pan et al. 2023;Zhao et al. 2023). This integration addresses key challenges: KGs ground LLMs with structured facts, reducing hallucinations and improving reliability (Agrawal et al. 2023), while LLMs make KG information accessible through natural language interfaces, eliminating the need for specialized query languages. Several integration patterns have emerged. KG-enhanced LLMs leverage structured knowledge during inference, often through Retrieval-Augmented Generation approaches. LLM-augmented KGs use language models for construction and enrichment. Synergized approaches feature deeper integration with LLM-based agents that reason over and interact with KGs for complex, multi-step tasks (Jiang et al. 2024;Luo et al. 2023).

Recent developments in manufacturing include enhanced querying capabilities for accessing operational knowledge through natural language (Hočevar and Kenda 2024), AI-driven logistics optimization (Ieva et al. 2025) and domain-specific question answering systems (Li et al. 2024). Frameworks like SparqLLM (Arazzi et al. 2025) have improved KG querying reliability for industrial applications. Despite these advances, the application of KG-LLM systems to analyzing DES output data remains largely unexplored. Critical research questions remain: how effectively can LLM-based agents transform natural language queries about simulated performance into precise KG queries, iteratively refine analytical pathways based on retrieved evidence, and synthesize disparate information to diagnose operational issues with explainable reasoning? Our work addresses this gap by proposing a novel framework that integrates KGs and LLM-based agents for iterative analysis of DES data, specifically targeting bottleneck identification and root cause analysis in manufacturing operations. This approach bridges traditional simulation modeling with modern AI capabilities, providing an intuitive interface for complex operational analysis.

We implement a comprehensive two-stage framework that transforms raw DES output data into actionable warehouse planning insights through Knowledge Graph construction and Large Language Model-based reasoning. This framework addresses the critical challenge of analyzing voluminous simulation data by creating a semantically rich representation that enables natural language querying and sophisticated diagnostic analysis. The methodology bridges traditional simulation modeling with modern AI capabilities, providing warehouse planners with an intuitive interface for complex operational analysis.

Our evaluation foundation consists of a detailed DES model representing a warehouse facility engaged in unloading, internal transport and storage operations. The simulation replicates real-world logistics environments with high fidelity, capturing the complex interactions between personnel, equipment and operational processes that characterize modern warehouse operations.

The simulated warehouse environment incorporates five distinct resource categories operating in coordinated fashion. External suppliers arrive as trucks carrying package shipments, with each supplier transporting 30-35 packages, sampled from a uniform distribution. These suppliers operate at 20 km/hr movement speed, with the facility supporting a maximum of three simultaneous unloading operations to reflect real-world dock capacity constraints.

Warehouse personnel consist of twelve employees organized into specialized teams of four workers each, with exclusive assignment to individual suppliers during unloading operations. Workers operate at 2 km/hr movement speed and handle single packages, reflecting standard warehouse safety and efficiency protocols. This team-based structure ensures systematic coordination while maintaining operational flexibility.

The automated transportation infrastructure includes twenty Automated Guided Vehicles (AGVs) operating at 3.5 km/hr with dynamic dispatch capabilities. AGVs follow First-In-First-Out scheduling protocols and traverse an average distance of 140 meters per transport operation. This reflects the sophisticated material handling automation increasingly common in modern warehouse facilities.

Five forklift units provide vertical storage capabilities, operating at 5 km/hr with block-specific assignments. Each forklift follows FIFO package handling protocols and incorporates stochastic storage operations ranging from 60-90 seconds to represent real-world variability in storage placement activities. The storage infrastructure consists of five blocks, each containing fifteen bays with three shelves, providing total capacity for 225 packages across the facility.

The operational process follows a structured four-stage flow architecture as shown in Figure 1. Initially, supplier trucks proceed to parking areas upon arrival, awaiting dock assignment based on availability before moving to designated unloading positions. Worker teams then transfer packages from suppliers to predetermined waiting points, with handling times determined by distance calculations and walking speed parameters. Subsequently, AGVs transport packages to appropriate block-specific pickup points, with travel times calculated based on distance and vehicle speed specifications. Finally, block-dedicated forklifts collect packages and store them in available bays, incorporating both calculated travel time and stochastic storage duration components.
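To make the timing model concrete, the sketch below computes stage durations from the stated speeds and distances. The worker and forklift travel distances, along with the function and variable names, are illustrative assumptions rather than values taken from the simulation.

```python
# Minimal sketch of how stage durations could be derived from the stated
# speed and distance parameters; distances other than the 140 m AGV leg
# are placeholders for illustration.
import random

KMH_TO_MS = 1000 / 3600  # km/h -> m/s

def travel_time_s(distance_m: float, speed_kmh: float) -> float:
    """Time to cover distance_m at speed_kmh, in seconds."""
    return distance_m / (speed_kmh * KMH_TO_MS)

# Worker carries one package over an assumed 60 m walk at 2 km/h (~108 s).
worker_handling = travel_time_s(60, 2.0)
# AGV transport over the average 140 m leg at 3.5 km/h (~144 s).
agv_transport = travel_time_s(140, 3.5)
# Forklift storage: travel at 5 km/h plus a stochastic 60-90 s placement.
forklift_storage = travel_time_s(40, 5.0) + random.uniform(60, 90)

print(round(worker_handling), round(agv_transport), round(forklift_storage))
```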

The simulation captures comprehensive operational data including process-specific timestamps for arrival, initiation and completion events across all stages. Equipment utilization metrics and waiting times are recorded for each resource type, while package-level tracking maintains unique identifiers throughout the entire process flow. Resource state transitions and queue statistics provide additional analytical depth for performance assessment and bottleneck identification.

We developed a specialized ontology tailored specifically for warehouse simulation data representation, capturing both static resource configurations and dynamic package flow relationships. The schema design prioritizes semantic richness while maintaining query efficiency for complex analytical operations. More information about the node and relationship schemas is provided in the appendix.

The automated Knowledge Graph construction pipeline transforms DES output logs through a systematic five-step process. Initially, simulation event logs undergo parsing and temporal data extraction, ensuring accurate timestamp representation across all operational events. Resource nodes are then created with their associated properties, establishing the foundational entity structure. Package flow relationships are established with timestamp annotations, creating the dynamic process flow representation. Data integrity and relationship consistency undergo comprehensive validation to ensure analytical reliability. Finally, graph structure optimization occurs for efficient Cypher querying, including appropriate indexing and relationship organization for performance enhancement.
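As a rough illustration of the loading steps (parsing aside), the following sketch writes one unloading event into Neo4j with the official Python driver. The SUPPLIER and WORKER labels and the SUPPLIER_TO_WORKER relationship mirror the condensed queries shown later in Table 2; the property names, connection details and event layout are assumptions, since the full schema is described in the appendix.

```python
# Sketch of steps 2-3 of the pipeline: creating resource nodes and
# timestamped package-flow relationships from parsed DES events.
# Assumes a local Neo4j instance and the neo4j Python driver (v5 API).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def load_unloading_event(tx, event: dict):
    # MERGE keeps resource nodes unique; the relationship carries the
    # package id and the stage timestamps extracted from the simulation log.
    tx.run(
        """
        MERGE (s:SUPPLIER {name: $supplier})
        MERGE (w:WORKER {id: $worker_id})
        MERGE (s)-[r:SUPPLIER_TO_WORKER {package_id: $package_id}]->(w)
        SET r.start_ts = $start_ts, r.end_ts = $end_ts
        """,
        **event,
    )

# Illustrative parsed event; field values are made up for the example.
events = [
    {"supplier": "CamelCargo", "worker_id": "W_03",
     "package_id": "PKG_0117", "start_ts": 1200.0, "end_ts": 1385.0},
]
with driver.session() as session:
    for ev in events:
        session.execute_write(load_unloading_event, ev)
driver.close()
```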

Stage 2: Dual-Path Query Processing System

Our framework implements a sophisticated dual-path architecture initiated by automated query classification to address both direct operational inquiries and complex investigative diagnostic scenarios as shown in Figure 2. This architectural approach recognizes the fundamental difference between information retrieval tasks and analytical reasoning requirements in warehouse planning contexts.

The query classification module employs natural language processing to automatically categorize incoming queries into two distinct pathways. Operational queries represent direct information retrieval scenarios requiring structured data access and straightforward analytical operations. These queries typically seek specific performance metrics, resource utilization statistics or process timing information. Investigative queries encompass complex diagnostic scenarios requiring iterative reasoning, hypothesis testing and systematic bottleneck identification through multi-stage analytical processes.
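A minimal sketch of how such a router could be prompted is shown below; the prompt wording, the model name and the use of the OpenAI chat-completions client are assumptions rather than the paper's implementation.

```python
# Illustrative LLM-based router for the dual-path architecture.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ROUTER_PROMPT = (
    "Classify the warehouse planning question as 'operational' (direct metric "
    "or fact lookup) or 'investigative' (diagnosis of a bottleneck or root "
    "cause requiring multi-step analysis). Answer with one word."
)

def classify_query(question: str, model: str = "gpt-4o") -> str:
    # The paper reports GPT-5 via LangGraph; any capable chat model works for the sketch.
    resp = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[{"role": "system", "content": ROUTER_PROMPT},
                  {"role": "user", "content": question}],
    )
    label = resp.choices[0].message.content.strip().lower()
    return "investigative" if "investigat" in label else "operational"

print(classify_query("Why did CamelCargo's discharge take longer than others?"))
```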

For operational queries, the framework implements a structured QA Chain comprising four specialized processing modules that work in coordinated fashion to ensure accurate and comprehensive responses.

The Step Generation Module decomposes complex natural language queries into structured analytical steps, recognizing that warehouse operational questions often contain multiple information requirements. This module breaks down multi-faceted questions into targeted sub-queries, with each step designed for single, focused Cypher query execution. This decomposition strategy significantly improves query success rates by reducing complexity and enabling systematic error detection and correction.
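As a hypothetical illustration, a two-part operational question might be decomposed into steps such as the following; the field names are illustrative.

```python
# Example shape of a step-generation output for a multi-part question.
question = ("Which supplier had the longest total discharge time, "
            "and what was the average AGV waiting time for its packages?")

steps = [
    {"step": 1,
     "goal": "Rank suppliers by total discharge time and return the slowest one."},
    {"step": 2,
     "goal": "For that supplier's packages, compute the average AGV waiting time.",
     "depends_on": [1]},
]
# Each step is later turned into a single, focused Cypher query, so a failure
# can be detected and corrected at the step level rather than for the whole question.
```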

The Cypher Generation Module translates structured analytical steps into precise graph database queries, leveraging the native graph structure for expressive relationship traversal operations. This approach avoids complex SQL-style joins through graph-native query patterns that more naturally represent warehouse operational flows. The module incorporates domain-specific query templates optimized for warehouse data patterns while maintaining flexibility for novel query structures.
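The snippet below illustrates the kind of graph-native Cypher such a step could compile to, using the labels seen in the condensed queries of Table 2; the timestamp properties and the span-based proxy for discharge time are assumptions for this sketch.

```python
# Illustrative graph-native query for a ranking step: one relationship traversal
# plus aggregation replaces what would be a multi-table SQL join.
RANK_SUPPLIERS_CYPHER = """
MATCH (s:SUPPLIER)-[stw:SUPPLIER_TO_WORKER]->(:WORKER)
WITH s.name AS supplier,
     max(stw.end_ts) - min(stw.start_ts) AS unload_span_s
RETURN supplier, unload_span_s
ORDER BY unload_span_s DESC
LIMIT 1
"""
```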

The Query Execution and Self Reflection Module implements robust error-handling loops for query validation and automatic correction. This module provides automatic syntax correction and retry mechanisms, ensuring query execution reliability through self-correction capabilities. The error handling includes both syntactic correction for malformed queries and semantic validation for logically inconsistent operations, significantly improving overall system reliability.
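A condensed sketch of such an execute-and-reflect loop is shown below; `generate_cypher` stands in for the module's LLM call, and the specific retry policy is an assumption.

```python
# Execute-reflect-retry loop: syntactic failures and empty results are fed
# back to the (assumed) Cypher-generation LLM call as correction hints.
from neo4j.exceptions import Neo4jError

def run_with_reflection(session, step_goal: str, max_attempts: int = 3):
    feedback = ""
    for _ in range(max_attempts):
        cypher = generate_cypher(step_goal, feedback)       # placeholder LLM call
        try:
            records = session.run(cypher).data()
        except Neo4jError as err:                           # syntactic failure
            feedback = f"Query failed with: {err.message}. Fix the syntax."
            continue
        if not records:                                      # semantic red flag
            feedback = "Query returned no rows; reconsider labels and relationships."
            continue
        return records
    raise RuntimeError(f"Step could not be answered after {max_attempts} attempts")
```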

The Answer Synthesis Module processes query results beyond simple data presentation, interpreting patterns and synthesizing coherent responses that provide meaningful insights for warehouse planning decisions. This module aggregates multiple query results for comprehensive answers and contextualizes numerical findings within operational frameworks that facilitate practical decision-making.

For complex investigative scenarios, the framework activates an Iterative Reasoning Chain that enables systematic diagnostic analysis through evidence-based hypothesis testing and refinement.

The Reasoning Module decomposes main diagnostic problems into sequential sub-questions, generating context-aware inquiries based on accumulated evidence from previous analytical steps. This module dynamically refines the analytical pathway through iterative questioning, ensuring that investigation depth and breadth adapt to the specific characteristics of each diagnostic scenario. The sub-question generation incorporates domain knowledge about warehouse operations to ensure investigative relevance and efficiency.

Evidence Collection and Integration operates through the complete QA Chain infrastructure for each generated sub-question, enabling focused evidence gathering through targeted Cypher queries. This approach builds comprehensive evidence bases through iterative information accumulation while maintaining analytical coherence across investigation stages. The integration process includes temporal correlation analysis and causal relationship identification to support robust diagnostic conclusions.

Self-Reflection and Validation mechanisms implement continuous validation of analytical findings throughout the investigation process. These mechanisms correct reasoning errors through self-reflection procedures and ensure diagnostic accuracy through multi-stage verification processes. The validation includes both logical consistency checking and factual accuracy verification against the knowledge graph data.

Sufficiency Assessment determines when adequate evidence has been gathered to support reliable diagnostic conclusions. This assessment triggers final synthesis when investigation completeness criteria are met while preventing over-analysis that might introduce unnecessary complexity or confusion. The sufficiency criteria incorporate both breadth of coverage across relevant operational dimensions and depth of analysis for identified bottleneck areas.
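Putting the four mechanisms together, the reasoning chain can be read as the loop sketched below; every helper stands in for an LLM-backed module, and the round limit is an illustrative safeguard rather than a parameter from the paper.

```python
# Skeleton of the iterative reasoning chain: sub-question generation, evidence
# collection via the QA chain, self-reflection, and a sufficiency check.
def investigate(main_question: str, max_rounds: int = 8) -> str:
    evidence = []
    for _ in range(max_rounds):
        sub_q = next_sub_question(main_question, evidence)    # Reasoning Module
        answer = qa_chain(sub_q)                              # Evidence Collection
        if not is_consistent(answer, evidence):               # Self-Reflection / Validation
            evidence.append({"q": sub_q, "a": answer, "flag": "needs recheck"})
            continue
        evidence.append({"q": sub_q, "a": answer})
        if evidence_sufficient(main_question, evidence):      # Sufficiency Assessment
            break
    return synthesize_diagnosis(main_question, evidence)
```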

The technical implementation leverages state-of-the-art technologies and frameworks to ensure robust, scalable and reliable operation across diverse warehouse planning scenarios.

LLM Integration utilizes OpenAI's GPT-5 (OpenAI 2025) through the LangGraph framework, with each processing module implemented through independent Large Language Model calls. This modular architecture enables specialized function optimization while maintaining system coherence. Temperature settings ranging from 0.0 to 0.3 ensure investigative accuracy for diagnostic scenarios while allowing appropriate creativity for complex reasoning tasks. Robust error handling and retry mechanisms provide system reliability even under challenging operational conditions.
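A minimal sketch of how the dual-path routing could be wired as a LangGraph state graph is given below; the state fields, node bodies and routing logic are placeholders, not the paper's code.

```python
# Skeletal LangGraph wiring for the dual-path architecture; node bodies are stubs.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class PlannerState(TypedDict, total=False):
    question: str
    route: str
    answer: str

def classify(state: PlannerState) -> PlannerState:
    # In the real system this is an LLM call; here the routing decision is stubbed.
    return {"route": "investigative" if "why" in state["question"].lower() else "operational"}

def qa_chain(state: PlannerState) -> PlannerState:
    return {"answer": f"[operational answer to] {state['question']}"}

def reasoning_chain(state: PlannerState) -> PlannerState:
    return {"answer": f"[iterative diagnosis for] {state['question']}"}

builder = StateGraph(PlannerState)
builder.add_node("classify", classify)
builder.add_node("qa_chain", qa_chain)
builder.add_node("reasoning_chain", reasoning_chain)
builder.set_entry_point("classify")
builder.add_conditional_edges("classify", lambda s: s["route"],
                              {"operational": "qa_chain", "investigative": "reasoning_chain"})
builder.add_edge("qa_chain", END)
builder.add_edge("reasoning_chain", END)
graph = builder.compile()

print(graph.invoke({"question": "Why did CamelCargo's discharge take longer than others?"}))
```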

Graph Database Integration through Neo4j (Neo4j 2025) provides efficient Cypher query execution capabilities optimized for warehouse operational patterns. The integration supports scalable graph traversal operations that maintain performance even with large-scale simulation datasets. Optimized relationship-based data access patterns ensure rapid response times for both simple operational queries and complex investigative analyses.
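For completeness, a small sketch of an index plus a parameterized read with the Neo4j Python driver follows; the URI, credentials, index name and property names are placeholders.

```python
# Indexing and a parameterized lookup with the Neo4j Python driver (v5 API assumed).
from neo4j import GraphDatabase

with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password")) as driver:
    # Property index to speed up supplier-centric lookups (name is illustrative).
    driver.execute_query(
        "CREATE INDEX supplier_name IF NOT EXISTS FOR (s:SUPPLIER) ON (s.name)"
    )
    records, _, _ = driver.execute_query(
        "MATCH (s:SUPPLIER {name: $name})-[r:SUPPLIER_TO_WORKER]->() "
        "RETURN count(r) AS packages",
        name="CamelCargo",
    )
    print(records[0]["packages"])
```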

Adaptive Reasoning Parameters enable dynamic question generation based on intermediate findings, ensuring that investigation pathways remain relevant and efficient as evidence accumulates. Context-aware query refinement adjusts analytical approaches based on evolving understanding of operational scenarios. Evidence-driven investigation pathway adjustment ensures that diagnostic processes adapt to the specific characteristics and requirements of each bottleneck scenario.

The integration of structured simulation data with advanced AI reasoning capabilities creates a powerful framework for transforming complex operational data into actionable warehouse planning insights that support both tactical and strategic decision-making processes.

This study is based on the data generated by an in-house DES model that includes operations of a warehouse facility engaged in the unloading, internal transport and storage of incoming packages. The simulation is designed to replicate real-world warehouse logistics, capturing the interactions between key resources such as suppliers (trucks), workers, automated guided vehicles (AGVs), forklifts and storage infrastructure. The details of the simulation, including equipment-resource specifications, process flow, operational assumptions and data captured, can be found in the appendix.

Based on data generated from the DES operating scenario, two types of questions were formulated to evaluate the analytical capabilities of our framework, as follows:

Operational Questions: A set of 25 distinct operational questions (see Appendix) was created to assess proficiency in retrieving specific factual information and performing analyses using the simulation output. These questions were designed to cover various aspects of the simulated operation, with an approximately uniform distribution across key entities and stages such as supplier interactions, worker activities, AGV and forklift utilization and package flow.

                 Supplier      Worker        AGV           Forklift      Package       Overall
                 P@1   P@2     P@1   P@2     P@1   P@2     P@1   P@2     P@1   P@2     P@1   P@2
Baseline         0.40  0.50    0.33  0.33    0.67  0.83    0.90  1.00    0.65  0.90    0.59  0.71
Baseline + SR    0.70  0.80    0.42  0.50    0.75  1.00    0.95  1.00    0.70  0.80    0.70  0.82
Our framework    1.00  1.00    0.75  1.00    0.92  1.00    1.00  1.00    0.95  1.00    0.92  1.00

Table 1: Performance on Operational QA by Method and Stage (Pass@k Scores). Baseline: single-pass Cypher query generation followed by answer synthesis. SR: Self-Reflection. Our Framework (Guided Iterative Steps): question decomposition for structured step generation; each step involves Cypher query generation, answer generation and self-reflection.

Investigative Questions: To specifically evaluate the capabilities in identifying operational bottlenecks, three distinct investigative scenarios were simulated. Each scenario introduced a specific type of inefficiency into the baseline model, mirroring potential real-world disruptions:

• Scenario 1: Delay in Stage Transfer: For a particular supplier, a specific process inefficiency was simulated, primarily introducing significant delays at a specific stage. This initial bottleneck led to significantly prolonged overall discharge times for their packages and created a downstream imbalance in equipment utilization.

• Scenario 2: Degraded Forklift Performance: One specific forklift was modeled to operate with reduced efficiency throughout its designated shift, leading to localized congestion and delays in tasks reliant on that particular forklift.

• Scenario 3: Supplier-Specific Processing Delay: For a particular supplier, targeted inefficiencies were simulated, introducing increased handling and suboptimal task allocation within the unloading and package processing stages, leading to significantly prolonged processing times.

For each of these three systematically perturbed scenarios, a unique investigative question was formulated. The objective of each such question was to task the framework with identifying the primary operational bottleneck or pinpointing the most significant performance degradation resulting from the deliberately introduced inefficiency.

For operational queries, performance was measured using the pass@k metric (Chen et al. 2021) to assess answer accuracy across several attempts. This was benchmarked against two baselines: (i) single-pass Cypher generation with answer synthesis and (ii) an enhanced version adding post-answer self-reflection. For investigative scenarios, our iterative reasoning chain, which refines each step based on accumulated evidence, was evaluated by a human expert.
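For reference, the standard unbiased pass@k estimator from the cited work (Chen et al. 2021) can be computed as below; the example values are illustrative and not results from Table 1.

```python
# Unbiased pass@k estimator (Chen et al. 2021): with n generated attempts,
# c of them correct, the probability that at least one of k sampled attempts is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 independent attempts per question, 1 correct -> pass@1 = 0.5, pass@2 = 1.0
print(pass_at_k(2, 1, 1), pass_at_k(2, 1, 2))
```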

The experimental results for operational question answering presented in Table 1 highlight the significant advantages of our proposed approach. While incorporating a self-reflection (SR) mechanism into the direct question-answering pipeline (Baseline + SR) offers a substantial improvement over the simple, single-pass baseline, our proposed method consistently outperforms both baselines, particularly in achieving comprehensive correctness as indicated by the maximum Pass@2 scores across all operational stages.

The results directly validate our hypothesis, contrasting the brittleness of the baseline's single-pass approach with the robustness of our guided, iterative framework. The baseline's low average Pass@1 (P@1) score of 0.59 confirms that its complex, monolithic queries are highly prone to failure, dropping as low as 0.33 for the "Worker" category. The improved baseline with self-reflection (SR) shows only a minor improvement in this category, reaching 0.42 P@1. In sharp contrast, our framework achieves an average P@1 score of 0.92. This demonstrates that our method, which decomposes the query and embeds self-reflection at each step, is far more likely to produce a correct solution on its very first attempt.

The Pass@2 (P@2) scores further underscore this robustness. Our framework achieves a perfect average P@2 of 1.00, meaning that when generating two independent solution paths, at least one of them was correct 100% of the time. The baseline, however, only scores 0.71 P@2. This is a critical finding: even when the baseline is given two independent chances to generate a monolithic query, it still fails to produce a single correct answer nearly 30% of the time.

The combination of a 0.92 P@1 and a 1.00 P@2 strongly suggests that our agent’s step-wise, reflective process consistently generates high-quality, correct solutions. In the rare 8% of cases where its first independent attempt fails, its second independent attempt succeeds, proving the fundamental stability and reliability of our decompositional approach. The ability to decompose, validate and refine sub-questions at each stage enhances the agent’s capability in both direct question answering and complex diagnostic scenarios.

We present three case studies to evaluate the agent's effectiveness in handling investigative QA. Due to space constraints, only one is discussed in detail; the others can be found in the Appendix. The CamelCargo investigation (Scenario 1) represents a complex diagnostic scenario involving significant discharge delays that require systematic analysis across multiple operational stages. The investigation revealed (see Table 2) that CamelCargo's total discharge time of 6,870 seconds significantly exceeded the global average discharge time of 4,955 seconds, representing a 39% performance degradation (1.39 times the global average). Expert analysis through the framework identified that the primary contributing factor was extreme inefficiencies in the worker stage, where packages experienced significantly prolonged waiting times before worker assignment. Specific instances revealed packages waiting 3,654 seconds compared to the global average of 1,375 seconds, creating a critical bottleneck that cascaded through the entire discharge process.

Table 2: Scenario 1: Why did CamelCargo's discharge take longer than others?

Sub-question: What are the package waiting times at each stage of the discharge process for supplier CamelCargo and how do they compare to the global average waiting times at each stage?

Step: For each package from CamelCargo, show its waiting time at each process stage alongside the global average for that stage.

Findings: Package wait times (vs. global average): Wait for Worker: significantly higher (e.g., one package waited 3,654 s vs. the 1,375 s average); Wait for AGV, Forklift and Storage: at or below average.

Cypher (condensed): CALL { MATCH (s:SUPPLIER)-[stw:SUPPLIER_TO_WORKER]->(w:WORKER) … WITH DISTINCT stw.package_id AS package_id, s, stw, wta, atf, fts … gl_avrg_wait_to_storage_sec ORDER BY package_id }

Agent Summary: Supplier CamelCargo's discharge process was the longest, taking 6,870 seconds (1.39 times the global average), primarily due to extreme "Wait to Worker" times. Several packages waited over double the average time for a worker (e.g., one package waited 3,654 seconds vs. the 1,375-second average). This initial bottleneck created an imbalance downstream, with AGV utilization being higher than average (0.54 vs. 0.42) and forklift utilization being much lower (0.24 vs. 0.40).

Note: The LLM Agent formulates its own sequence of questions based on its reasoning framework and evolving evidence from the KG. Human expert iterative questions for this investigation: (1) What is the total discharge time for CamelCargo compared to the global average? (2) What are the equipment utilization rates for AGVs and forklifts and do they show an imbalance? (3) To find the bottleneck, what are the package-level waiting times at each distinct process stage? (4) Does any specific stage show a significant deviation from the global average and is this the primary cause of the delay and the equipment utilization imbalance? The LLM generations have been condensed to fit inside the table; Cypher queries are highly condensed representations for brevity.

Further investigation of the CamelCargo scenario demonstrates sophisticated diagnostic capabilities of the workflow through systematic sub-question generation and evidence-based analysis (see Table 2).

Initial Investigation Phase: The investigation began with a comparative analysis: "What is the total unload time for supplier CamelCargo in the Discharge Flow and how does it compare to the global average unload time for all suppliers?" This revealed a significant issue, with CamelCargo's unload time of 6,870 seconds exceeding the global average of 4,955 seconds.

Progressive Diagnostic Refinement: The framework demonstrated sophisticated reasoning progression through strategically generated sub-questions that systematically explored potential bottleneck sources. The investigation progressed through multiple phases:

• Supplier Waiting, Evidence Integration and Causal Analysis: The framework demonstrated advanced evidence integration capabilities by identifying that while waiting times for AGV, forklift and storage operations remained at or below global averages, the "Wait to Worker" stage showed extreme variability and delays. This created a cascading effect where the initial bottleneck led to resource imbalances downstream, explaining the higher AGV utilization and lower forklift utilization patterns. The investigation conclusively identified that CamelCargo's performance issues stemmed from worker assignment inefficiencies rather than equipment-related bottlenecks, with specific packages experiencing wait times more than double the global average, fundamentally disrupting the entire discharge flow efficiency.

Similar performance was observed for the other two investigative scenarios, as shown in the Appendix.

In addition to this, a comprehensive human evaluation of our framework across seven critical quality dimensions demonstrates exceptional performance, with scores ranging from 7.96 to 9.0 on a 10-point scale (see Table 3). These results represent human expert assessment of over 12 investigation analysis logs produced by our intelligent assistant based on the 3 aforementioned scenarios, providing robust validation of diagnostic quality across diverse warehouse bottleneck scenarios. For more details, refer to the appendix. This multi-dimensional assessment approach, inspired by recent advances in dialectical evaluation frameworks for LLM reasoning chains (Anghel et al. 2025), provides a holistic view of system performance that extends beyond traditional accuracy metrics to encompass the nuanced quality requirements essential for practical warehouse diagnostic applications.

Our framework transforms warehouse planning by providing intuitive access to complex simulation insights across multiple operational horizons. The high pass@k scores (Table 1) enable planners to obtain precise, real-time visibility into supplier interactions, resource utilization and package flows, supporting both daily operations and tactical adjustments without requiring specialized technical expertise.

Most significantly, the investigative capabilities move beyond surface-level reporting to deliver meaningful diagnostic insights. By systematically querying simulation-derived Knowledge Graphs through LLM-driven reasoning, the framework isolates root causes of performance issues and reveals subtle bottlenecks often missed by traditional analytics. This creates a more powerful and interpretable warehouse digital twin that enables targeted, data-driven interventions such as process redesign and resource reallocation, ultimately supporting more adaptive and informed warehouse planning.

Several limitations require consideration for operational deployment. The initial Knowledge Graph schema design demands substantial domain expertise and engineering investment, though this can be mitigated through standardized schema templates and automated generation tools for common warehouse configurations.

Despite robust self-correction capabilities, the absolute reliability of Cypher query generation and synthesized explanations requires ongoing validation, particularly for novel operational scenarios. This necessitates continuous monitoring procedures and confidence scoring mechanisms for critical diagnostic conclusions.

The framework’s generalizability, validated primarily in warehouse unloading contexts, requires demonstration across broader warehouse processes and different simulation frameworks. Future work must systematically evaluate performance across picking, packing, inventory management and other operational contexts.

Extracting actionable insights from voluminous DES data is a significant challenge for timely warehouse decision-making. To address this, we propose a novel framework integrating KGs with a reasoning-capable LLM agent, providing a more intuitive and powerful way to interact with simulation data. The architecture combines a QA chain with step-wise guidance and an iterative reasoning chain equipped with sub-questioning, Cypher query generation and self-reflection. This enables both high-accuracy operational queries and deeper, evidence-driven investigations into system inefficiencies. Experimental evaluations demonstrate the framework's proficiency in accurately answering operational questions and, more significantly, its robust capability in performing iterative, evidence-driven investigations to identify operational bottlenecks in simulated scenarios, surpassing traditional baseline methods.

In the future, we plan to integrate advanced reasoning LLM architectures to enhance the agent's diagnostic depth and efficiency. To validate robustness, we will expand the framework's application beyond unloading to a wider array of warehouse operations (e.g., slotting, picking, loading, inventory management), integrating both simulated and on-field data. This expansion necessitates developing rigorous benchmarking methodologies to formally quantify performance. Ultimately, this work is a stepping stone toward developing a fully autonomous industrial planning and scheduling agent.

The significance of these human-evaluated results cannot be overstated, as they represent validation by domain experts who can assess not only technical correctness but also practical utility and interpretability for real-world warehouse planning applications. These results align with findings from dialectical evaluation research (Anghel et al. 2025), which emphasizes that multi-dimensional assessment frameworks provide more comprehensive and actionable insights than traditional single-metric evaluations. The consistently high human evaluation scores across all dimensions validate our framework's advancement over baseline approaches and demonstrate readiness for practical deployment in warehouse planning applications where diagnostic reliability and explanation quality are critical success factors. The strong performance across both technical dimensions (Query Quality, Factuality) and communicative dimensions (Clarity, Coherence) indicates successful integration of sophisticated AI capabilities with practical usability requirements, supporting our framework's potential for widespread adoption in industrial warehouse management contexts where human decision-makers must rely on and trust the diagnostic insights provided by the system.

Note: The LLM Agent formulates its own sequence of questions. Potential human expert iterative questions for a forklift delay investigation might include: (1) KPIs per forklift (tasks, durations, idle times)? (2) Which forklifts show consistently longer task/idle times? (3) Operational time breakdown for suspect forklifts? (4) Queue/wait times for packages per forklift? (5) Are delays linked to specific zones/tasks/times for suspect forklifts? (6) How does a suspect forklift's throughput compare? (7) Any error codes/issues reported? Cypher queries are highly condensed conceptual representations.

Agent Summary: The differences in forklift waiting times during the discharge flow reveal variations in efficiency and utilization among the forklifts. Forklift FL 00 has the highest average waiting time of 363.78 seconds, which is significantly higher than the global average waiting time of 221.33 seconds. This suggests potential inefficiencies or bottlenecks in its operation. In contrast, forklifts FL 01, FL 02 and FL 03 have average waiting times of 205.91 seconds, 165.41 seconds and 182.89 seconds respectively, all of which are lower than the global average, indicating more efficient operations. Forklift FL 04 has a slightly higher waiting time of 225.91 seconds compared to the global average, suggesting a minor inefficiency. Forklift FL 00, with the highest waiting time, has the lowest utilization at 30.26%. Conversely, FL 02, which has the lowest waiting time, shows the highest utilization at 42.45%. Forklifts FL 01, FL 03 and FL 04 have utilizations of 36.31%, 40.16% and 35.05% respectively, reflecting their moderate efficiency levels. These differences in waiting times and utilization show the varying levels of operational efficiency among the forklifts during the discharge flow.

Table 5: Scenario 2: What do the differences in forklift waiting times reveal about the discharge flow?

Note (AuroraFarms investigation, Scenario 3): Human expert iterative questions for this investigation: (1) What is the total discharge time for AuroraFarms and how does it compare to the global average? (2) What are the average package waiting times and worker operation times for AuroraFarms compared to the global average? (3) What are the equipment utilization rates (AGV, forklift) for AuroraFarms and how do they compare to the global averages? (4) What is the average number of active AGVs and forklifts servicing AuroraFarms compared to the global average? (5) Are there specific bottlenecks in resource deployment or management (e.g., low utilization despite high equipment availability) causing the delay? The LLM generations have been condensed to fit inside the table. Cypher queries are highly condensed conceptual representations for brevity.

