Assessing the Business Process Modeling Competences of Large Language Models
The creation of Business Process Model and Notation (BPMN) models is a complex and time-consuming task requiring both domain knowledge and proficiency in modeling conventions. Recent advances in large language models (LLMs) have significantly expanded the possibilities for generating BPMN models directly from natural language, building upon earlier text-to-process methods with enhanced capabilities for handling complex descriptions. However, systematic evaluations of LLM-generated process models are lacking: current efforts either rely on LLM-as-a-judge approaches or do not consider established dimensions of model quality. To address this gap, we introduce BEF4LLM, a novel LLM evaluation framework comprising four perspectives: syntactic quality, pragmatic quality, semantic quality, and validity. Using BEF4LLM, we conduct a comprehensive analysis of open-source LLMs and benchmark their performance against human modeling experts. Results indicate that LLMs excel in syntactic and pragmatic quality, while humans outperform them in semantic quality; however, the score differences are relatively modest, highlighting LLMs’ competitive potential despite challenges in validity and semantics. These insights reveal the current strengths and limitations of using LLMs for BPMN modeling and can guide future model development and fine-tuning. Addressing these areas is essential for advancing the practical deployment of LLMs in business process modeling.
💡 Research Summary
The paper addresses the emerging need to evaluate how well large language models (LLMs) can generate Business Process Model and Notation (BPMN) diagrams from natural‑language descriptions. While recent advances have shown that LLMs can produce structured artifacts such as BPMN, systematic, multi‑dimensional assessments of their output quality have been lacking. To fill this gap, the authors introduce BEF4LLM, a comprehensive evaluation framework that extends the established SIQ (syntactic, pragmatic, semantic) model with an additional “validity” dimension, resulting in 39 concrete metrics spanning four quality perspectives: syntactic correctness, pragmatic usability, semantic fidelity, and technical validity of the generated BPMN XML.
The study benchmarks 17 open‑source LLMs of varying sizes (0.5 B to 235 B parameters) and context windows (8 K to 128 K tokens) on a curated dataset of 105 text‑to‑BPMN pairs covering diverse business domains. Each model receives the same prompt template, with temperature fixed at 0.2 to ensure reproducibility. Human BPMN experts also model the same textual scenarios, providing a baseline for comparison. All outputs are evaluated automatically using the BEF4LLM metrics, and statistical analyses examine how quality scores correlate with model size (parameter count).
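The size-versus-quality analysis can be sketched as a rank correlation between parameter counts and per-model quality scores. The sketch below uses a plain Spearman implementation and entirely hypothetical numbers (the summary does not report the raw per-model scores); it is not the paper's actual analysis code.

```python
# Illustrative sketch of a size-vs-quality rank correlation.
# All data values are hypothetical, not taken from the study.

def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation (assumes no ties in x or y)."""
    def ranks(values: list[float]) -> list[float]:
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical model sizes (billions of parameters) and syntactic scores.
params_b = [0.5, 1.8, 7.0, 14.0, 32.0, 70.0, 235.0]
syntactic_score = [80.0, 84.0, 88.0, 89.0, 91.0, 92.0, 93.0]

rho = spearman_rho(params_b, syntactic_score)
print(f"Spearman rho = {rho:.2f}")
```

A high rho on one dimension but not others would mirror the paper's finding that scale helps syntax more than semantics or validity.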
Key findings reveal that LLMs perform comparably to humans on syntactic (average score ~92 vs 95) and pragmatic dimensions (88 vs 90), indicating that they can reliably adhere to BPMN syntax rules, produce well‑structured diagrams, and keep complexity within acceptable bounds. However, gaps emerge in semantic fidelity (79 vs 85) and especially in validity (71 vs 88). Only about 22 % of the LLM‑generated BPMN files pass strict XML parsing without errors; the remainder suffer from missing tags, malformed connections, or inconsistent flow logic. Moreover, model size shows only a weak correlation with quality: larger models marginally improve syntactic scores but do not substantially close the semantic or validity gaps, suggesting that sheer parameter count is insufficient for deep process understanding.
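The validity failures described above (malformed XML, dangling connections) can be detected mechanically. The sketch below is not the BEF4LLM tooling, but a minimal check of the same kind: it rejects BPMN output that fails strict XML parsing and flags sequence flows whose `sourceRef`/`targetRef` point at elements that do not exist.

```python
# Minimal BPMN validity check: strict parsing + dangling-reference detection.
# Illustrative only; BEF4LLM's actual validity metrics are more extensive.
import xml.etree.ElementTree as ET

BPMN_NS = "{http://www.omg.org/spec/BPMN/20100524/MODEL}"

def check_bpmn_validity(xml_text: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Missing or mismatched tags fail here ("strict XML parsing").
        return [f"XML parse error: {exc}"]

    # Collect the ids of all declared elements (tasks, events, gateways, ...).
    ids = {el.get("id") for el in root.iter() if el.get("id")}

    # Every sequence flow must connect two declared elements.
    problems = []
    for flow in root.iter(f"{BPMN_NS}sequenceFlow"):
        for attr in ("sourceRef", "targetRef"):
            ref = flow.get(attr)
            if ref not in ids:
                problems.append(
                    f"sequenceFlow {flow.get('id')}: dangling {attr}={ref!r}"
                )
    return problems
```

Running such a gate over generated files yields exactly the kind of pass rate reported above, and the collected problem messages support the subsequent error analysis.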
Error analysis highlights systematic issues: LLMs often over‑generalize activity labels, select inappropriate gateway types for conditional branches, and sometimes omit essential sequence flows, leading to logical inconsistencies that would affect downstream execution or analysis. These shortcomings are critical because BPMN models are used for communication, automation, and compliance; inaccuracies in meaning or validity can propagate costly errors in real‑world business settings.
The authors conclude that BEF4LLM provides a robust, reproducible methodology for assessing LLM‑driven BPMN generation and that current open‑source LLMs, while promising in surface‑level quality, still require targeted improvements to reach expert‑level performance. Future work is suggested in three main areas: (1) domain‑specific fine‑tuning to enhance semantic alignment, (2) advanced prompting or chain‑of‑thought techniques to guide more accurate structural decisions, and (3) post‑generation validation pipelines that automatically detect and correct XML or logical errors. By addressing these avenues, LLMs could become reliable assistants for non‑expert users, accelerating BPMN modeling and expanding access to process‑oriented digital transformation.