DeepQuali: Initial results of a study on the use of large language models for assessing the quality of user stories
Generative artificial intelligence (GAI), specifically large language models (LLMs), is increasingly used in software engineering, mainly for coding tasks. However, requirements engineering - particularly requirements validation - has seen limited application of GAI. The current focus of using GAI for requirements is on eliciting, transforming, and classifying requirements, not on quality assessment. We propose and evaluate the LLM-based (GPT-4o) approach “DeepQuali” for assessing and improving requirements quality in agile software development. We applied it to projects in two small companies, where we compared LLM-based quality assessments with expert judgments. Experts also participated in walkthroughs of the solution, provided feedback, and rated their acceptance of the approach. Experts largely agreed with the LLM’s quality assessments, especially regarding overall ratings and explanations. However, the experts did not always agree with one another on detailed ratings, suggesting that expertise and experience may influence judgments. Experts recognized the usefulness of the approach but criticized the lack of integration into their workflow. LLMs show potential in supporting software engineers with the quality assessment and improvement of requirements. The explicit use of quality models and explanatory feedback increases acceptance.
💡 Research Summary
The paper introduces DeepQuali, a novel approach that leverages the GPT‑4o large language model (LLM) to automatically assess the quality of user stories in agile software development. While generative AI has become commonplace for coding‑related tasks, its use in requirements validation—particularly the evaluation of user‑story quality—remains limited. DeepQuali addresses this gap by explicitly grounding its assessment in established quality models such as INVEST, the Definition of Ready (DoR), and ISO/IEC 29148, and by providing structured, explainable output (numeric scores, textual explanations, and improvement suggestions) in a JSON schema.
The authors conducted an empirical study with two German small‑medium enterprises (SMEs): one developing an online health‑course portal and the other a vehicle‑validity data‑management system. From each project’s backlog, five user stories were selected to represent low, medium, and high complexity/quality. Four domain experts (two per company) performed a detailed labeling survey, rating each story on a four‑point scale across six INVEST criteria and a summative “Ready‑to‑Implement” (RTI) criterion. Each criterion was operationalized through multiple statements, forcing experts to choose a clear stance (strongly disagree to strongly agree) and thereby avoiding neutral bias.
DeepQuali was then applied to the same stories. The prompt design consisted of a system prompt that defined the task and context, and a user prompt that specified the exact quality criteria to be evaluated. Model parameters were tuned during development; the final configuration used temperature 0 (to maximise determinism), a fixed seed, and token limits. For each story, the LLM generated a set of scores (1–4) together with concise explanations and concrete improvement recommendations.
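The setup described above can be sketched in code. The prompt wording, seed value, token limit, and criterion names below are illustrative assumptions — the paper describes the structure (system prompt for task and context, user prompt with explicit criteria, temperature 0, fixed seed, JSON output with scores 1–4, explanations, and suggestions) but not the exact contents:

```python
import json

# Hypothetical prompt text -- the study's actual wording is not published here.
SYSTEM_PROMPT = (
    "You are a requirements-quality assessor. Rate the given user story "
    "against each criterion on a scale of 1 (poor) to 4 (excellent) and "
    "return JSON only."
)

# Six INVEST criteria plus the summative Ready-to-Implement (RTI) criterion.
CRITERIA = ["independent", "negotiable", "valuable",
            "estimable", "small", "testable", "ready_to_implement"]

def build_request(user_story: str) -> dict:
    """Assemble a chat-completion request in the spirit of DeepQuali's setup."""
    user_prompt = (
        f"User story:\n{user_story}\n\n"
        "For each criterion (" + ", ".join(CRITERIA) + ") return an object "
        'with "score" (1-4), "explanation", and "suggestion".'
    )
    return {
        "model": "gpt-4o",                          # LLM used in the study
        "temperature": 0,                           # maximise determinism
        "seed": 42,                                 # fixed seed (value illustrative)
        "max_tokens": 1024,                         # token limit (value illustrative)
        "response_format": {"type": "json_object"}, # structured JSON output
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
    }

def parse_assessment(raw: str) -> dict:
    """Validate the model's JSON reply: every criterion must carry a 1-4 score."""
    data = json.loads(raw)
    for criterion in CRITERIA:
        entry = data[criterion]
        if not 1 <= entry["score"] <= 4:
            raise ValueError(f"score out of range for {criterion}")
    return data
```

Keeping the criteria explicit in the user prompt mirrors the paper's point that grounding the assessment in a named quality model, rather than asking generically for "quality", is what makes the output auditable.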
Results show that DeepQuali’s overall quality scores and explanations align closely with expert judgments, achieving roughly 78 % agreement on average. The RTI aggregate score exhibited the strongest correlation, while individual criteria displayed more variance, reflecting known differences among the human experts themselves. In some cases the LLM produced overly optimistic assessments, highlighting the need for careful prompt engineering and possibly post‑processing checks.
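One plausible way an agreement figure like the reported ~78 % could be computed is simple percent agreement over matched ratings. The sketch below uses made-up ratings, not the study's data, and assumes exact-match agreement (the paper does not specify whether near-misses were counted):

```python
def percent_agreement(expert: list[int], llm: list[int]) -> float:
    """Fraction of criteria on which expert and LLM give the same 1-4 rating."""
    if len(expert) != len(llm):
        raise ValueError("rating vectors must have equal length")
    matches = sum(e == m for e, m in zip(expert, llm))
    return matches / len(expert)

# Illustrative ratings for one story's seven criteria (6 INVEST + RTI).
expert_scores = [4, 3, 3, 2, 4, 3, 3]
llm_scores    = [4, 3, 2, 2, 4, 3, 3]
print(f"{percent_agreement(expert_scores, llm_scores):.0%}")  # → 86%
```

On real data, such a per-story figure would be averaged across stories and experts; the variance the authors observed between individual criteria would show up as spread in these per-criterion matches.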
User acceptance was evaluated through workshops and an online survey. Six of the eight participants reported that the feedback was useful for their daily work, and five indicated that automated quality assessment could save time and reduce rework. However, the majority expressed frustration that DeepQuali was not yet integrated into their existing development toolchain, suggesting that a plug‑in or API‑based integration would be essential for practical adoption.
The authors discuss several threats to validity: the small sample size (10 stories, 4 experts) limits external generalisability; anonymisation and removal of sensitive metadata may have altered story content; and reliance on a single LLM version (GPT‑4o) means results could change with future model updates or different prompting strategies. Moreover, the study focuses on predefined quality criteria, so extending the approach to organization‑specific models would require additional customization.
In conclusion, DeepQuali demonstrates that large language models can reliably evaluate user‑story quality when guided by explicit quality frameworks and can produce human‑readable explanations that increase stakeholder trust. The study validates both the technical accuracy (RQ1) and perceived usefulness (RQ2) of the approach, while also revealing integration challenges that affect overall acceptance (RQ3). Future work should explore larger, more diverse datasets, tighter integration with agile tooling (e.g., JIRA, Azure DevOps), and adaptive prompting techniques that allow the system to learn and align with a company’s bespoke quality standards.